Lab 11: Using a Hash Table

Objectives

The objectives of this lab are:

The Problem

Download the file taleshort.txt, which contains a few paragraphs from the well known Dicken's novel A Tale of Two Cities (or the entire book, Tale of Two Cities). The problem you are asked to solve is to write a Java program that will count the number of occurrences of each word. Display the list of words and their respective counts in non-increasing order---that is, from the most to the least frequent words.

Strategy/Approach

Your overall strategy is to write as little Java code as necessary to solve this problem. If there is a library class in the Java API that solves the various subproblems involved in this problem, then use that class and its methods. This document will give you ideas on where to look in the Java API.

Task 1: Reading a Text File

For this lab we will read an input text file whose name will be passed to the program as a command-line argument. Here's a code segment that will read the text file into a string instance variable named inString.
import java.io.*;

public static void main(String args[]) {
...
  try {
    File f = new File(args[0]);
    InputStreamReader iStream = new InputStreamReader( new FileInputStream(f));
    int length = (int)f.length();
    char input[] = new char[length];
    iStream.read(input);
    inString = new String(input);
  } catch (FileNotFoundException e) {
    System.err.println("Error: File " + args[0] + " not found");
    e.printStackTrace();
  } catch (IOException e) {
    System.err.println("Error: I/O exception");
    e.printStackTrace();
  }
}
This code assumes that the name of the file is stored in the first command-line argument (args[0]). You would use the following command-line to run the main program:
java CountFrequencies textfile.txt
Once you have the file's text stored in a string, you can use a java.util.StringTokenizer to break the string into individual words. A good outcome for this phase of the project would to read the file and simply print out a list of its words. For this problem, we want to just pull the words out of the file, not the punctuation marks. So you want to specify the constructor so that the punctuation marks are used as delimiters:
new StringTokenizer(inString,"`~?!*:;,.()- '\"");

Task 2: Counting Word Frequencies

A hash table would be a good data structure to use for this task. One strategy would be use the word itself as the hash key and its count as the associated value. As you process the words in the file, you would look up each word in the hash table (fast access). If the word is already contained in the hash table, you increment its value (by getting, incrementing, and putting). If not, you make an entry for the word and store it in the hash table with a count of 1.

You have several choices for hash tables in the Java library, but I suggest you use the java.util.HashMap class. As described in the documentation, a HashMap is a hash table based implementation of the Map interface. It gives you methods such as get(), put(), and contains(), which are the methods you need for this task. Look at the documentation to see the signatures of these methods.

The HashMap class uses generic types. So you will want to make use of these in your program. Suppose you decide to create a HashMap that uses a String as the key (the word itself), and an Integer as the value (its count). Here is how you would construct a HashMap for this case:

HashMap<String,Integer> ht = new HashMap<String,Integer>();

You should then be able to use the HashMap methods without casting operators.

A good outcome for this task would be simply to print a string representation of the hash table after you are done processing all the words in the file.

Task 3: Sorting the Words

Once the words and their corresponding frequencies are stored in the hash table, we want to display the words from most to least freqeuent. This is easier said than done. The Java Collections framework has a number of classes that will be helpful for this task. For example, the java.util.Collections class contains the following static sort method:

However as you see, you have to convert your data into a List in order to use it. Fortunately, the Hashmap.values() method can be used to convert the Hashmap into a Collection, which can then be converted into a List:

   List list = new ArrayList(ht.values());

As you see here, we are using the values() method on the Hashmap to get a Collection, which we are then passing to the ArrayList constructor to create a List. We can then sort the list using the Collections.sort() method.

You may want to read the documentation on these methods to get further information.

A Problem: If you store the word frequencies as Integer objects, then it will not suffice merely to sort the Integers into non-decreasing order. You have to sort the integers and their associated words together. Unless you can find a way to sort the hash table and preserve the association between the words and their frequencies, it will be necessary to create your own class to associate the words and their frequencies. I suggest the following design:

WordFreq
- word: String
- count: int
+ WordFreq(word:String)
+ increment(n: int)
+ getCount(): int
+ getWord(): String
+ toString(): String
+ compareTo(obj: Object): int <Comparable interface>

Once you define this class, you can store instances of it as the values in your hash table. Note that this class must implement the Comparable interface. This means you must write an implementation of the compareTo(Object obj) method. In order to use this method to sort in non-increasing order, it must meet the following conditions:

If this object is less than obj, return a postive value.
If this object is greater than obj, return a negative value.
If this and obj are equal, return 0.

Output Requirements

No need to be fussy about formatting your output. So you can use the toString() method to generate output. For example, here is the output that my solution generates:
There were a total of 182 distinct words, listed here in non-increasing order: 
[[the,31], [of,25], [was,13], [it,12], [a,11], [in,8], [to,7],
[and,7], [were,7], [had,6], [for,4], [with,4], [that,4], [period,4],
[we,4], [its,3], [on,3], [only,2], [king,2], [as,2], [than,2],
[times,2], [season,2], [this,2], [messages,2], [face,2], [age,2],
[large,2], [us,2], [jaw,2], [out,2], [lane,2], [queen,2], [all,2],
[epoch,2], [any,2], [five,2], [there,2], [direct,2], [before,2],
[received,2], [life,2], [cock,2], [at,2], [year,2], [throne,2],
[going,2], [theirs,1], [england,1], [some,1], [france,1], [race,1],
[blessed,1], [southcott,1], [noisiest,1], [after,1], [being,1],
[so,1], [our,1], [favoured,1], [mere,1], [spiritual,1], [other,1],
[light,1], [plain,1], [first,1], [preserves,1], [from,1], [order,1],
[by,1], [lords,1], [announcing,1], [yet,1], [wisdom,1],
[incredulity,1], [brood,1], [book,1], [up,1], [deficient,1],
[state,1], [have,1], [events,1], [originality,1], [which,1],
[attained,1], [swallowing,1], [strange,1], [seven,1], [heralded,1],
[countries,1], [i,1], [laid,1], [degree,1], [comparison,1], [worst,1],
[birthday,1], [important,1], [earthly,1], [more,1], [her,1],
[guards,1], [general,1], [nothing,1], [insisted,1], [english,1],
[far,1], [settled,1], [sublime,1], [congress,1], [belief,1],
[darkness,1], [chickens,1], [fair,1], [people,1], [spirits,1],
[like,1], [ever,1], [america:,1], [superlative,1], [appearance,1],
[hundred,1], [relate,1], [been,1], [authorities,1], [rapping,1],
[come,1], [or,1], [england;,1], [arrangements,1], [crown,1],
[years,1], [one,1], [past,1], [made,1], [clearer,1], [thousand,1],
[both,1], [westminster,1], [seventy,1], [hope,1], [loaves,1],
[recently,1], [lord,1], [communications,1], [last,1], [fishes,1],
[supernaturally,1], [present,1], [short,1], [good,1], [private,1],
[twentieth,1], [winter,1], [ghost,1], [dozen,1], [round,1],
[conceded,1], [heaven,1], [mrs,1], [everything,1], [whom,1],
[through,1], [london,1], [way,1], [despair,1], [human,1], [proved,1],
[rapped,1], [subjects,1], [best,1], [revelations,1], [very,1],
[evil,1], [foolishness,1], [british,1], [even,1], [recalled,1],
[things,1], [lately,1], [crystal,1], [prophetic,1], [spring,1]]

You're done. Great work!