Download the file taleshort.txt, which contains a few paragraphs from the well known Dicken's novel A Tale of Two Cities (or the entire book, Tale of Two Cities). The problem you are asked to solve is to write a Java program that will count the number of occurrences of each word. Display the list of words and their respective counts in non-increasing order---that is, from the most to the least frequent words.
import java.io.*;
public static void main(String args[]) {
...
try {
File f = new File(args[0]);
InputStreamReader iStream = new InputStreamReader( new FileInputStream(f));
int length = (int)f.length();
char input[] = new char[length];
iStream.read(input);
inString = new String(input);
} catch (FileNotFoundException e) {
System.err.println("Error: File " + args[0] + " not found");
e.printStackTrace();
} catch (IOException e) {
System.err.println("Error: I/O exception");
e.printStackTrace();
}
}
This code assumes that the name of the file is stored in the
first command-line argument (args[0]). You would use the
following command-line to run the main program:
java CountFrequencies textfile.txtOnce you have the file's text stored in a string, you can use a java.util.StringTokenizer to break the string into individual words.
new StringTokenizer(inString,"`~?!*:;,.()- '\"");
You have several choices for hash tables in the Java library, but I suggest you use the java.util.HashMap class. As described in the documentation, a HashMap is a hash table based implementation of the Map interface. It gives you methods such as get(), put(), and contains(), which are the methods you need for this task. Look at the documentation to see the signatures of these methods.
The HashMap class uses generic types. So you will want to make use of these in your program. Suppose you decide to create a HashMap that uses a String as the key (the word itself), and an Integer as the value (its count). Here is how you would construct a HashMap for this case:
HashMap<String,Integer> ht = new HashMap<String,Integer>();
You should then be able to use the HashMap methods without casting operators.
A good outcome for this task would be simply to print a string representation of the hash table after you are done processing all the words in the file.
Once the words and their corresponding frequencies are stored in the hash table, we want to display the words from most to least freqeuent. This is easier said than done. The Java Collections framework has a number of classes that will be helpful for this task. For example, the java.util.Collections class contains the following static sort method:
However as you see, you have to convert your data into a List in order to use it. Fortunately, the Hashmap.values() method can be used to convert the Hashmap into a Collection, which can then be converted into a List:
Listlist = new ArrayList (ht.values());
As you see here, we are using the values() method on the Hashmap to get a Collection, which we are then passing to the ArrayList constructor to create a List. We can then sort the list using the Collections.sort() method.
You may want to read the documentation on these methods to get further information.
A Problem: If you store the word frequencies as Integer objects, then it will not suffice merely to sort the Integers into non-decreasing order. You have to sort the integers and their associated words together. Unless you can find a way to sort the hash table and preserve the association between the words and their frequencies, it will be necessary to create your own class to associate the words and their frequencies. I suggest the following design:
| WordFreq |
| - word: String
- count: int |
| + WordFreq(word:String)
+ increment(n: int) + getCount(): int + getWord(): String + toString(): String + compareTo(obj: Object): int <Comparable interface> |
Once you define this class, you can store instances of it as the values in your hash table. Note that this class must implement the Comparable interface. This means you must write an implementation of the compareTo(Object obj) method. In order to use this method to sort in non-increasing order, it must meet the following conditions:
If this object is less than obj, return a postive value. If this object is greater than obj, return a negative value. If this and obj are equal, return 0.
There were a total of 182 distinct words, listed here in non-increasing order: [[the,31], [of,25], [was,13], [it,12], [a,11], [in,8], [to,7], [and,7], [were,7], [had,6], [for,4], [with,4], [that,4], [period,4], [we,4], [its,3], [on,3], [only,2], [king,2], [as,2], [than,2], [times,2], [season,2], [this,2], [messages,2], [face,2], [age,2], [large,2], [us,2], [jaw,2], [out,2], [lane,2], [queen,2], [all,2], [epoch,2], [any,2], [five,2], [there,2], [direct,2], [before,2], [received,2], [life,2], [cock,2], [at,2], [year,2], [throne,2], [going,2], [theirs,1], [england,1], [some,1], [france,1], [race,1], [blessed,1], [southcott,1], [noisiest,1], [after,1], [being,1], [so,1], [our,1], [favoured,1], [mere,1], [spiritual,1], [other,1], [light,1], [plain,1], [first,1], [preserves,1], [from,1], [order,1], [by,1], [lords,1], [announcing,1], [yet,1], [wisdom,1], [incredulity,1], [brood,1], [book,1], [up,1], [deficient,1], [state,1], [have,1], [events,1], [originality,1], [which,1], [attained,1], [swallowing,1], [strange,1], [seven,1], [heralded,1], [countries,1], [i,1], [laid,1], [degree,1], [comparison,1], [worst,1], [birthday,1], [important,1], [earthly,1], [more,1], [her,1], [guards,1], [general,1], [nothing,1], [insisted,1], [english,1], [far,1], [settled,1], [sublime,1], [congress,1], [belief,1], [darkness,1], [chickens,1], [fair,1], [people,1], [spirits,1], [like,1], [ever,1], [america:,1], [superlative,1], [appearance,1], [hundred,1], [relate,1], [been,1], [authorities,1], [rapping,1], [come,1], [or,1], [england;,1], [arrangements,1], [crown,1], [years,1], [one,1], [past,1], [made,1], [clearer,1], [thousand,1], [both,1], [westminster,1], [seventy,1], [hope,1], [loaves,1], [recently,1], [lord,1], [communications,1], [last,1], [fishes,1], [supernaturally,1], [present,1], [short,1], [good,1], [private,1], [twentieth,1], [winter,1], [ghost,1], [dozen,1], [round,1], [conceded,1], [heaven,1], [mrs,1], [everything,1], [whom,1], [through,1], [london,1], [way,1], [despair,1], [human,1], [proved,1], [rapped,1], [subjects,1], [best,1], [revelations,1], [very,1], [evil,1], [foolishness,1], [british,1], [even,1], [recalled,1], [things,1], [lately,1], [crystal,1], [prophetic,1], [spring,1]]