CPSC 115L: Introduction to Computing Fall 2010

Project 3: Frequency Analysis

Due Tuesday, December 14, noon

As before, you are encouraged to work with a partner, although you may work on your own if you prefer. If submitted as a pair, both you and your partner will receive the same grade. For this project, you must e-mail source files and hand in hard copies.

The Problem

You are given a set of text files and your task is to write a program that can identify which of the files are written in English.

One of the differences among languages that use the same alphabet is that the letters of the alphabet occur with different frequencies. For example, in English the most frequent letter is 'e', followed by 't', whereas in Italian the vowels 'a', 'e', 'i', and 'o' are the most frequent letters. (See the Wikipedia page on frequency analysis for an illustration of this point.)

Given the expected frequency distribution for English letters, one way to identify a text as English would be to count the letters in the text and compare their frequencies with the known frequencies. The more closely the observed letter frequencies match the known English frequencies, the more likely it is that the text is written in English.
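The counting step described above might be sketched as follows. This is only an illustration, not the required design: the class name LetterFrequencies and the method relativeFrequencies are hypothetical, and the sketch assumes that upper- and lowercase letters are counted together and that everything other than 'a' through 'z' is ignored.

```java
class LetterFrequencies {
    // Returns the relative frequency of each letter 'a'..'z' in the text.
    // Assumption: case is ignored, and any character outside 'a'..'z'
    // (digits, punctuation, accented letters) is skipped entirely.
    static double[] relativeFrequencies(String text) {
        int[] counts = new int[26];
        int total = 0;
        for (int i = 0; i < text.length(); i++) {
            char ch = Character.toLowerCase(text.charAt(i));
            if (ch >= 'a' && ch <= 'z') {
                counts[ch - 'a']++;   // 'a' maps to index 0, 'z' to index 25
                total++;
            }
        }
        double[] freqs = new double[26];
        if (total == 0) {
            return freqs;             // no letters at all: every frequency is 0
        }
        for (int i = 0; i < 26; i++) {
            freqs[i] = (double) counts[i] / total;
        }
        return freqs;
    }

    public static void main(String[] args) {
        // "Hello, World!" contains 10 letters, 3 of which are 'l'.
        double[] f = relativeFrequencies("Hello, World!");
        System.out.println("Relative frequency of 'l': " + f['l' - 'a']);
    }
}
```

Dividing by the total number of letters, rather than the total number of characters, keeps the frequencies comparable across texts with different amounts of punctuation and whitespace.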

The Chi square statistic is used to analyze whether two frequency distributions are the same or different. Suppose you have two arrays containing the relative frequencies of the letters 'a' through 'z'. For example, the value for 'e' would be something like 0.111, which means that 'e' has a relative frequency of 11.1%, i.e., it occurs 11.1% of the time. The first array, expected, contains the known or expected frequencies for English text. The second array, observed, contains the relative frequencies that were observed in a given text. To compute the Chi square statistic for these two distributions, you would perform the following calculation:

$$\chi^2 = \sum_i \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i}$$

where i here ranges over the letters from 'a' to 'z'. The smaller the value of $\chi^2$, the more likely the observed text is English. You can see that in the very unlikely occurrence that the observed frequencies are exactly the same as the expected frequencies -- i.e., $\text{observed}_i = \text{expected}_i$ for every i -- then $\chi^2$ would be 0.
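As a rough sketch, the calculation can be written as a single loop over the two arrays. The class name ChiSquare and the method chiSquare are hypothetical, and the sketch assumes that both arrays have the same length and that no expected frequency is zero (otherwise the division would fail to give a meaningful value).

```java
class ChiSquare {
    // Sums (observed[i] - expected[i])^2 / expected[i] over all entries.
    // Assumption: the arrays have equal length and expected[i] is never 0.
    static double chiSquare(double[] observed, double[] expected) {
        double sum = 0.0;
        for (int i = 0; i < expected.length; i++) {
            double diff = observed[i] - expected[i];
            sum += diff * diff / expected[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // A toy example with two categories instead of 26 letters:
        // (0.5 - 0.25)^2 / 0.25 + (0.5 - 0.75)^2 / 0.75 = 0.25 + 0.0833...
        double[] observed = {0.5, 0.5};
        double[] expected = {0.25, 0.75};
        System.out.println(chiSquare(observed, expected));
    }
}
```

For the project itself, both arrays would have 26 entries, one per letter.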

Given a text file, your program should analyze the text using the Chi square statistic and determine whether or not it is English. Here's a table of values that you can use to help you in your decisions:

Chi Square               Decision
Less than 0.01           Almost Certainly English
Between 0.01 and 0.15    Very Likely English
Between 0.15 and 0.25    Probably English
Greater than 0.25        Probably NOT English
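One possible way to turn the table into code is a chain of if/else tests. The class name EnglishDecision and the method decide are hypothetical. The table does not say which row a value exactly on a boundary (0.01, 0.15, or 0.25) belongs to, so this sketch assumes such a value falls into the later row.

```java
class EnglishDecision {
    // Maps a Chi square value to one of the four decisions in the table.
    // Assumption: a value exactly on a boundary goes to the later row.
    static String decide(double chiSquare) {
        if (chiSquare < 0.01) {
            return "Almost Certainly English";
        } else if (chiSquare < 0.15) {
            return "Very Likely English";
        } else if (chiSquare < 0.25) {
            return "Probably English";
        } else {
            return "Probably NOT English";
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(4.325632418223285E-4)); // Almost Certainly English
        System.out.println(decide(0.5096821878488532));   // Probably NOT English
    }
}
```

Because each test is reached only when the earlier ones have failed, each branch needs to check only its upper bound.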

Here's the output from a couple of runs of my command-line solution to this problem:

$ java FrequencyAnalyzer                       
Usage: java FrequencyAnalyzer filename

$ java FrequencyAnalyzer tomsawyer.eng.txt 
The Chi square value for this text is 4.325632418223285E-4.
This text is almost CERTAINLY written in English

$ java FrequencyAnalyzer divine.ital.txt   
The Chi square value for this text is 0.5096821878488532.
This text is PROBABLY NOT written in English

Command Line Arguments

Note in these sample runs that a command line argument is used to specify the name of the file containing the text. If the filename is omitted, the program prints a line showing the expected usage: Usage: java FrequencyAnalyzer filename. Command line arguments in Java are passed to the main() method via its String array parameter. That's what the parameter in the declaration of the main() method is used for:
public static void main(String args[])
In this case, the file name is passed as args[0] and can then be used by the program. If there were additional command line arguments, they would be passed as args[1], args[2], and so on. The args.length value can be used to determine how many command line arguments were passed to the program.
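A minimal sketch of this check is shown below. The class name ArgsDemo and the helper fileNameFrom are hypothetical; only the usage message is taken from the sample runs above, and your own program may organize this differently.

```java
class ArgsDemo {
    // Returns the first command line argument, or null if none was given.
    static String fileNameFrom(String[] args) {
        if (args.length < 1) {
            return null;
        }
        return args[0];
    }

    public static void main(String[] args) {
        String fileName = fileNameFrom(args);
        if (fileName == null) {
            // No file name on the command line: show the expected usage and stop.
            System.out.println("Usage: java FrequencyAnalyzer filename");
            return;
        }
        System.out.println("File to analyze: " + fileName);
    }
}
```

Testing args.length before reading args[0] matters: accessing args[0] when no argument was passed throws an ArrayIndexOutOfBoundsException.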

User Interface

This is not primarily an interactive program, so there is no user interface. The command line will be used to pass the program the name of a text file to analyze.
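Once the file name is known, the text has to be read from that file. One way this might be done is sketched below. The class name TextFileReader and the method readText are hypothetical, and the choice of the ISO-8859-1 encoding is an assumption: it decodes any byte sequence without error and leaves the plain letters 'a' through 'z' unchanged, which is all a letter count needs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class TextFileReader {
    // Reads the whole file into one String.
    // Assumption: the file is small enough to fit in memory and may be
    // decoded as ISO-8859-1 (the letters a-z are the same as in ASCII).
    static String readText(String fileName) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("Usage: java TextFileReader filename");
            return;
        }
        try {
            String text = readText(args[0]);
            System.out.println("Read " + text.length() + " characters.");
        } catch (IOException e) {
            // For example, the file does not exist or cannot be opened.
            System.out.println("Could not read " + args[0] + ": " + e.getMessage());
        }
    }
}
```

Reading line by line with a BufferedReader would work equally well; the whole-file approach is used here only because it is short.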

Program Design

Testing Your Program

Here is a set of plaintext files that you can use to test your program:

Divine Comedy Excerpt by Dante (Italian)
Voyage by Cartier Excerpt (French)
Mindanao Excerpt by Aguilar (Spanish)
Tom Sawyer Excerpt by Twain (English)
Tale of Two Cities Excerpt by Dickens (English)

You may also want to test your program on additional text files. We will also post some encrypted files, where you won't be able to tell by reading the files what language they are in. They will be encrypted in such a way that their relative letter frequencies are not changed.

Optional Features

The specifications laid out here are minimum specifications. Feel free to give your program whatever optional features you like. Enhancements might include giving your program the ability to recognize texts written in another language.

Evaluation Criteria. Your program will be evaluated on the following basis:

General Hints and Suggestions.

Email Source Files

Email a copy of your TextAnalyzer.java and FrequencyAnalyzer.java to your instructor: ralph dot morelli at trincoll dot edu or takunari dot miyazaki at trincoll dot edu.

Hand In Hardcopy

Hand in a hard copy of your source programs in class on the due date. NOTE: The readability of your hard copy counts towards your grade, so make sure you print it in a legible form. For example, to avoid having lines of code or comments wrap around and mess up the indentation, you should print in landscape mode.

Plagiarism and academic dishonesty

Please remember our course policy on plagiarism and academic dishonesty: You are encouraged to consult with one another when you work on homework assignments, but in the end everyone must hand in their own work. In particular, discussion of homework assignments should be limited to brainstorming and verbally going through strategies, but it must not involve one student sharing written solutions with another student. In the end everyone must write up solutions independently. If you have discussed an assignment with classmates or used any outside source, you must clearly indicate so on your solutions and provide all references. Turning in another person's work under your name is plagiarism and qualifies as academic dishonesty. Academic dishonesty is a serious intellectual violation, and the consequences can be severe. For more details, read the Student Handbook 2010–2011, pp. 21–29.

