| CPSC 115L: Introduction to Computing | Fall 2010 |
One of the differences among languages that use the same alphabet is that the letters of the alphabet will occur with different frequency. For example, in English the most frequent letter is 'e', followed by 't'. Whereas in Italian the vowels 'a', 'e', 'i', and 'o' are the most frequent letters. (See the Wikipedia page on frequency analysis for an illustration of this point.)
Given the expected frequency distribution for English letters, one way to identify a text as English would be count the letters in the text and compare their frequencies with the known frequencies. The closer the observed letter frequencies match the known English frequencies, the more likely the text is written in English.
The Chi square statistic is used to analyze whether two frequency distributions are the same or different. Suppose you have two arrays containing the relative freqencies of the letters 'a' through 'z'. For example, the value for 'e' would be something like 0.111, which means that 'e' has a relative frequency of 11.1%---i.e., it occurs 11.1% of the time. The first array, expected, contains the known or expected frequencies for English text. The second array, observed contains the relative frequencies that were observed in a given text. To compute the Chi square statistic for these two distributions, you would perform the following calculation:
where i here ranges over the frequencies from 'a' to 'z'. The smaller the value of Χ2, the more likely the observed text is English. You can see that in the very unlikely occurrence the observed frequencies are exactly the same as the expected frequencies -- i.e., observedi = expectedi -- then Χ2 would be 0.Χ2 = Σi (observedi - expectedi)2/expectedi
Given a text file, your program should analyze the text using the Χ2 statistic and determine whether or not it is English. Here's a table of values that you can use to help you in your decisions:
| Chi Square | Decision |
|---|---|
| Less than 0.01 | Almost Certainly English |
| Between 0.01 and 0.15 | Very Likely English |
| Between 0.15 and 0.25 | Probably English |
| Greater than 0.25 | Probably NOT English |
Here's the output from a couple of runs of my command-line solution to this problem:
$ java FrequencyAnalyzer Usage: java FrequencyAnalyzer filename $ java FrequencyAnalyzer tomsawyer.eng.txt The Chi square value for this text is 4.325632418223285E-4. This text is almost CERTAINLY written in English $ java FrequencyAnalyzer divine.ital.txt The Chi square value for this text is 0.5096821878488532. This text is PROBABLY NOT written in English
In this case, the file name is passed as args[0] and can then be used by the program. If there were additional command line arguments, the would be passed as args[1], args[2], and so on. The args.length value can be used to determine how many command line arguments were passed to the program.public static void main(String args[])
Problem Decomposition. Your program will consist of two classes, the TextAnalyzer, an abstract superclass that contains useful methods, the FrequencyAnalyzer subclass, which contains the main() program, as well as the implementations of the abstract methods inherited from TextAnalyzer. The following UML diagrams summarize these two classes. Methods and instance variables in blue font are already completed. Methods in italic are abstract and must be implemented in the subclass.
|
|
TextAnalyzer.java FrequencyAnalyzer.java
The Chi square value for this text is 0.5096821878488532. This text is PROBABLY NOT written in English
Divine Comedy Excerpt by Dante (Italian)
Voyage by Cartier Excerpt (French)
Mindanao Excerpt by Aguilar (Spanish)
Tom Sawyer Excerpt by Twain (English)
Tale of Two Cities Excerpt by Dickens (English)
You may also want to test your program on additional text files. We will also post some encrypted files, where you won't be able to tell by reading the files what language they are in. They will be encrypted in such a way that their relative letter frequencies are not changed.
Evaluation Criteria. Your program will be evaluated on the following basis:
General Hints and Suggestions.
Stepwise refinement. Write the program in stages. Write and test each of the methods separately (unit testing). For example, design tests to make sure you chiSquare method is correct before applying it to some text.
Document your program as you build it. Don't leave it for the end. A good criterion to use for deciding how much documentation to provide is that a non-programmer should be able to read your code and understand what its doing (even if they don't understand Java).
Test each stage of your program thoroughly.
Get started early. Don't wait till the night before it is due.
Email a copy of your TextAnalyzer.java and FrequencyAnalyzer.java to your instructor: ralph dot morelli at trincoll dot edu or takunari dot miyazaki at trincoll dot edu.
|
|
CPSC 115L home page |
|
|