Project 3

CPSC 115L: Introduction to Computing

Fall 2010

Project 3: Frequency Analysis

Due Tuesday, December 14, noon

As before, you are encouraged to work with a partner, although you may work on your own if you prefer. If submitted as a pair, both you and your partner will receive the same grade. For this project, you must e-mail source files and hand in hard copies.

The Problem

You are given a set of text files and your task is to write a program that can identify which of the files are written in English.

One of the differences among languages that use the same alphabet is that the letters of the alphabet occur with different frequencies. For example, in English the most frequent letter is 'e', followed by 't', whereas in Italian the vowels 'a', 'e', 'i', and 'o' are the most frequent letters. (See the Wikipedia page on frequency analysis for an illustration of this point.)

Given the expected frequency distribution for English letters, one way to identify a text as English would be to count the letters in the text and compare their frequencies with the known frequencies. The more closely the observed letter frequencies match the known English frequencies, the more likely it is that the text is written in English.

The Chi square statistic is used to analyze whether two frequency distributions are the same or different. Suppose you have two arrays containing the relative frequencies of the letters 'a' through 'z'. For example, the value for 'e' would be something like 0.111, which means that 'e' has a relative frequency of 11.1%, i.e., it accounts for 11.1% of the letters. The first array, expected, contains the known or expected frequencies for English text. The second array, observed, contains the relative frequencies that were observed in a given text. To compute the Chi square statistic for these two distributions, you would perform the following calculation:

Χ² = Σ_i (observed_i - expected_i)²/expected_i

where i ranges over the letters from 'a' to 'z'. The smaller the value of Χ², the more likely the observed text is English. You can see that in the very unlikely event that the observed frequencies are exactly the same as the expected frequencies -- i.e., observed_i = expected_i for every i -- Χ² would be 0.
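
To make the formula concrete, here is a minimal sketch of the calculation on a made-up three-letter alphabet. The class name ChiSquareSketch and the sample values are assumptions chosen for illustration; the chiSquare() method you write for this project works on 26-element arrays and should follow the skeleton you are given.

```java
class ChiSquareSketch {

    // Sums (observed_i - expected_i)^2 / expected_i over all positions.
    // Assumes both arrays have the same length and no expected value is 0.
    static double chiSquare(double[] expected, double[] observed) {
        double sum = 0.0;
        for (int i = 0; i < expected.length; i++) {
            double diff = observed[i] - expected[i];
            sum += diff * diff / expected[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical relative frequencies for a three-letter alphabet.
        double[] expected = {0.5, 0.25, 0.25};
        double[] observed = {0.25, 0.5, 0.25};
        System.out.println(chiSquare(expected, observed)); // prints 0.375
        System.out.println(chiSquare(expected, expected)); // prints 0.0
    }
}
```

The first call adds 0.0625/0.5 + 0.0625/0.25 + 0 = 0.375. The second call compares a distribution with itself, so every term is 0.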

Given a text file, your program should analyze the text using the Χ² statistic and determine whether or not it is English. Here's a table of values that you can use to help you in your decisions:

Chi Square	Decision
Less than 0.01	Almost Certainly English
Between 0.01 and 0.15	Very Likely English
Between 0.15 and 0.25	Probably English
Greater than 0.25	Probably NOT English
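
One way to turn this table into code is a chain of if..else tests. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes that a value falling exactly on a boundary (0.01, 0.15, or 0.25) belongs to the higher band, which the table does not specify.

```java
class DecisionSketch {

    // Maps a chi-square value to a label from the table.
    // Assumption: a value exactly on a boundary falls into the higher band.
    static String decide(double chiSquare) {
        if (chiSquare < 0.01) {
            return "Almost Certainly English";
        } else if (chiSquare < 0.15) {
            return "Very Likely English";
        } else if (chiSquare < 0.25) {
            return "Probably English";
        } else {
            return "Probably NOT English";
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(0.005)); // prints Almost Certainly English
        System.out.println(decide(0.2));   // prints Probably English
        System.out.println(decide(0.5));   // prints Probably NOT English
    }
}
```

Because each test is reached only when the earlier ones have failed, every branch needs just an upper bound.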

Here's the output from a couple of runs of my command-line solution to this problem:

$ java FrequencyAnalyzer                       
Usage: java FrequencyAnalyzer filename

$ java FrequencyAnalyzer tomsawyer.eng.txt 
The Chi square value for this text is 4.325632418223285E-4.
This text is almost CERTAINLY written in English

$ java FrequencyAnalyzer divine.ital.txt   
The Chi square value for this text is 0.5096821878488532.
This text is PROBABLY NOT written in English

Command Line Arguments

Note in these sample runs that a command line argument is used to specify the name of the file containing the text. If the filename is omitted, the program prints a line showing the expected usage: Usage: java FrequencyAnalyzer filename. Command line arguments in Java are passed to the main() method via its String array parameter. That's what the parameter in the declaration of the main() method is used for:

public static void main(String args[])

In this case, the file name is passed as args[0] and can then be used by the program. If there were additional command line arguments, they would be passed as args[1], args[2], and so on. The args.length value can be used to determine how many command line arguments were passed to the program.
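
As an illustration of checking args.length before touching args[0], here is a small sketch. The class name ArgsSketch, the helper describe(), and the "Would analyze" message are hypothetical; only the usage line is taken from the sample run above.

```java
class ArgsSketch {

    // Returns the usage line when no file name is given,
    // otherwise a message naming the file that would be analyzed.
    static String describe(String[] args) {
        if (args.length < 1) {
            return "Usage: java FrequencyAnalyzer filename";
        }
        return "Would analyze: " + args[0];
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

Testing args.length first matters because reading args[0] from an empty array throws an ArrayIndexOutOfBoundsException.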

User Interface

This is not primarily an interactive program, so there is no user interface. The command line will be used to pass the program the name of a text file to analyze.

Program Design

Problem Decomposition. Your program will consist of two classes: TextAnalyzer, an abstract superclass that contains useful methods, and the FrequencyAnalyzer subclass, which contains the main() method as well as the implementations of the abstract methods inherited from TextAnalyzer. The following UML diagrams summarize these two classes. Methods and instance variables in blue font are already completed. Methods in italic are abstract and must be implemented in the subclass.

TextAnalyzer

# englishFrequency[]: double final

+ readFile(filename: String): String
# applyChiSquare(text: String): double
# chiSquare(exp[]: double, obs[]: double): double
# calcFreqs(text:String, freqs[]: double)
+ analyze(text:String): String

FrequencyAnalyzer
extends TextAnalyzer

# calcFreqs(text:String, freqs[]: double)
+ analyze(text:String): String
+ main(args:String[])

Partially coded versions of these files may be downloaded here:

TextAnalyzer.java FrequencyAnalyzer.java
The englishFrequency variable is an array containing the known English letter frequencies, as computed from the text Tom Sawyer. You will use this in calculating the Χ² statistic.
The readFile() method is passed the name of a file and returns a String containing the text in that file.
The applyChiSquare() method takes a text, calculates its letter frequencies, and applies the Χ² statistic. It returns the Χ² value for that text.
The chiSquare() method computes the Χ² value for a pair of arrays that contain the expected and observed frequencies.
The calcFreqs() method is given a text and a 26-element array and calculates the letter frequencies for that text, storing them in the array.
The analyze() method takes a text, performs the frequency analysis on it, and returns a string containing the report you saw above:
```
The Chi square value for this text is 0.5096821878488532.
This text is PROBABLY NOT written in English
```

Note that the FrequencyAnalyzer class can use all of the public and protected methods it inherits from TextAnalyzer.
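
The counting step described for calcFreqs() can be sketched as follows. This is one possible approach, not the required one: the class name FreqSketch is hypothetical, and the sketch assumes that upper and lower case are counted together and that any character outside a-z (digits, punctuation, accented letters) is skipped.

```java
class FreqSketch {

    // Fills freqs[0..25] with the relative frequencies of 'a'..'z' in text.
    // Assumptions: case is ignored, and characters outside a-z are skipped.
    static void calcFreqs(String text, double[] freqs) {
        int[] counts = new int[26];
        int total = 0;
        for (int i = 0; i < text.length(); i++) {
            char ch = Character.toLowerCase(text.charAt(i));
            if (ch >= 'a' && ch <= 'z') {
                counts[ch - 'a']++;
                total++;
            }
        }
        for (int i = 0; i < 26; i++) {
            // Guard against division by zero when the text has no letters.
            freqs[i] = (total == 0) ? 0.0 : (double) counts[i] / total;
        }
    }

    public static void main(String[] args) {
        double[] freqs = new double[26];
        calcFreqs("Abba, c!", freqs);
        System.out.println("a: " + freqs[0]); // 2 of 5 letters
        System.out.println("b: " + freqs[1]); // 2 of 5 letters
        System.out.println("c: " + freqs[2]); // 1 of 5 letters
    }
}
```

The expression ch - 'a' converts a letter to an array index from 0 to 25, which is why a 26-element array is enough.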

Testing Your Program

Here is a set of plaintext files that you can use to test your program:

Divine Comedy Excerpt by Dante (Italian)
Voyage by Cartier Excerpt (French)
Mindanao Excerpt by Aguilar (Spanish)
Tom Sawyer Excerpt by Twain (English)
Tale of Two Cities Excerpt by Dickens (English)

You may also want to test your program on additional text files. We will also post some encrypted files, where you won't be able to tell by reading the files what language they are in. They will be encrypted in such a way that their relative letter frequencies are not changed.

Optional Features

The specifications laid out here are minimum specifications. Feel free to give your program whatever optional features you like. Enhancements might include giving your program the ability to recognize texts written in another language.

Evaluation Criteria. Your program will be evaluated on the following basis:

Correctness -- Does it work correctly? Does it meet specifications?
Design -- Is it well-designed and well-structured? Does it make appropriate use of methods and parameters? Does it make appropriate use of control structures, such as if..else?
Readability -- Is it well-documented and neatly formatted? Is it easily readable? You should follow the documentation and coding conventions found in the files that you are given for this assignment.

General Hints and Suggestions.

Stepwise refinement. Write the program in stages. Write and test each of the methods separately (unit testing). For example, design tests to make sure your chiSquare() method is correct before applying it to some text.
Document your program as you build it. Don't leave it for the end. A good criterion to use for deciding how much documentation to provide is that a non-programmer should be able to read your code and understand what it's doing (even if they don't understand Java).
Test each stage of your program thoroughly.
Get started early. Don't wait till the night before it is due.

Email Source Files

Email a copy of your TextAnalyzer.java and FrequencyAnalyzer.java to your instructor: ralph dot morelli at trincoll dot edu or takunari dot miyazaki at trincoll dot edu.

Hand In Hardcopy

Hand in a hard copy of your source programs in class on the due date. NOTE: The readability of your hard copy counts towards your grade, so make sure you print it in a legible form. For example, to avoid having lines of code or comments wrap around and mess up the indentation, you should print in landscape mode.

Plagiarism and academic dishonesty

Please remember our course policy on plagiarism and academic dishonesty: You are encouraged to consult with one another when you work on homework assignments, but in the end everyone must hand in their own work. In particular, discussion of homework assignments should be limited to brainstorming and verbally going through strategies, but it must not involve one student sharing written solutions with another student. In the end everyone must write up solutions independently. If you have discussed the assignment with classmates or used any outside source, you must clearly indicate so on your solutions and provide all references. Turning in another person's work under your name is plagiarism and qualifies as academic dishonesty. Academic dishonesty is a serious intellectual violation, and the consequences can be severe. For more details, read the Student Handbook 2010–2011, pp. 21–29.
