CPSC 115L: Laboratory 8

CPSC 115L: Introduction to Computing

Fall 2010

Laboratory 8: Analyzing crosswords

October 27, 28

As usual, you are expected to work with an assigned partner as a pair. Both you and your partner will receive the same grade. Both of you should always save your laboratory work on your own accounts.

Objectives

The main objectives of this laboratory are

to learn how to work with files and lists, and
to continue learning how to work with strings.

1. Warm-up exercise

The Moby lexicon project is an extensive public-domain collection of lexical resources (such as words, phrases, synonyms, etc.) started by Grady Ward in 1996. It is now part of Project Gutenberg, an ambitious effort to digitize and archive virtually all historically important books and documents. In this laboratory, we will play with the official list of 113,809 crosswords (i.e., words considered to be valid in crossword puzzles and other word games).

1.1. First, create a directory named lab8 inside your cpsc115 directory and download the file crosswords.txt (for this, you must be on this Laboratory 8 webpage and right-click the link provided). To work with a file in Python, we must first create a file object. The built-in function open takes a file name and returns a file object; for example,

>>> f = open('crosswords.txt')
>>> print f
<open file 'crosswords.txt', mode 'r' at 0xb7cdf770>

The mode 'r' indicates that this file is now open for reading (as opposed to 'w' for writing). As shown above, print f does not print the content of the file. There are actually several ways to read the content. For example, the method readline reads a single line of text from a file object and returns the result in a string:

>>> f.readline()
'aa\r\n'

This shows that the first word is aa, ending with the two special characters \r\n for a carriage return and a newline, respectively. If you type f.readline() again, you will get the second word:

>>> f.readline()
'aah\r\n'

To extract only a word without special characters, use the string method strip:

>>> line = f.readline()
>>> word = line.strip()
>>> print word
aahed
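As a small illustration of what strip does, the sketch below uses made-up sample strings rather than lines read from crosswords.txt. It assumes nothing beyond the built-in string methods; print is written with parentheses and a single argument so that the lines run under either Python 2 or Python 3.

```python
# Made-up sample line, shaped like what readline returns for this file.
line = 'aahed\r\n'

word = line.strip()        # drops the trailing '\r\n'
print(len(line))           # 7: five letters plus two special characters
print(len(word))           # 5: only the letters remain

# strip only touches the two ends of a string, not the middle.
spaced = '  two words \r\n'
print(spaced.strip())      # two words
```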

To print all the words, you can use a for statement:

f = open('crosswords.txt')
for line in f:
  word = line.strip()
  print word

You can also form a list of all the words (and then print the list):

f = open('crosswords.txt')
words = []                # avoid the name list, which is a built-in type
for line in f:
  word = line.strip()
  words.append(word)
f.close()
print words
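The list-building loop above can also be wrapped in a function so that later scripts can reuse it. The following is only a sketch: the name load_words is hypothetical (it is not part of the handout), and it uses the with statement, which closes the file automatically and is available in Python 2.6 and later.

```python
def load_words(filename):
    """Return a list of the words in filename, one word per line."""
    words = []
    # The with statement closes the file when the indented block ends.
    with open(filename) as f:
        for line in f:
            word = line.strip()
            if word:               # skip blank lines, if there are any
                words.append(word)
    return words
```

With crosswords.txt in the current directory, a call such as load_words('crosswords.txt') would return the list of words, and len of that list gives the number of words read.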

1.2. Write a Python script named long_words.py that opens the file crosswords.txt and prints all the words with more than 20 letters. Run your script and save the snapshot of your test run in a text file named long_words.out.
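One possible way to organize the filtering step is sketched below on a small made-up list instead of the real file. The function name longer_than and its parameter n are hypothetical choices for this illustration, not names required by the exercise.

```python
def longer_than(words, n):
    """Return the words in the list that have more than n letters."""
    result = []
    for word in words:
        if len(word) > n:
            result.append(word)
    return result

# Made-up sample; the real script would use the words from crosswords.txt.
sample = ['aa', 'aah', 'aahed', 'microminiaturizations']
print(longer_than(sample, 20))   # ['microminiaturizations']
```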

1.3. Recall that a palindrome is a word that is spelled the same forward and backward. Write a Python script named palindromes.py that opens the file crosswords.txt, counts the number of all the palindromes, and then prints the shortest and longest palindromes. Your script should output in the following format:

In the official list of 113,809 crosswords, there are 91 palindromes.  The
shortest palindrome is 'aa', and the longest palindrome is 'deified'.

For this, it will be helpful to use a function named is_palindrome that takes a word w as a parameter and returns True if w is a palindrome and False otherwise. (Most of you already implemented this function last week.) Run your script and save the snapshot of your test run in a text file named palindromes.out.
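The sketch below shows one way is_palindrome and the shortest/longest bookkeeping could fit together. It is an illustration under stated assumptions, not the required solution: the helper summarize_palindromes is a hypothetical name, the slice w[::-1] is just one of several ways to reverse a string, and ties are resolved in favor of the first word found. The sample list is made up.

```python
def is_palindrome(w):
    """Return True if w reads the same forward and backward."""
    return w == w[::-1]

def summarize_palindromes(words):
    """Return (count, shortest, longest) for the palindromes in words.

    Ties keep the first word found; shortest and longest are None
    if the list contains no palindromes.
    """
    count = 0
    shortest = None
    longest = None
    for w in words:
        if is_palindrome(w):
            count += 1
            if shortest is None or len(w) < len(shortest):
                shortest = w
            if longest is None or len(w) > len(longest):
                longest = w
    return count, shortest, longest

# Made-up sample; the real script would use the words from crosswords.txt.
sample = ['aa', 'aah', 'civic', 'deified', 'level', 'zebra']
print(summarize_palindromes(sample))   # (4, 'aa', 'deified')
```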

1.4. In 1939, Ernest Vincent Wright published a 50,000-word novel titled Gadsby that does not contain the letter ‘e’. Since ‘e’ is the most common letter in English, that was not easy to do. Write a Python script named no_e_words.py that opens the file crosswords.txt, counts the number of all the words that do not have ‘e’, and then prints the shortest and longest such words. Your script should output in the following format:

In the official list of 113,809 crosswords, there are 37641 words that do 
not have 'e'.  The shortest such word is 'aa', and the longest such word is 
'microminiaturizations'.

For this, it will be helpful to define a function named has_no_e that takes a word w as a parameter and returns True if w does not contain 'e' and False otherwise. Run your script and save the snapshot of your test run in a text file named no_e_words.out. When completed, show your work to the instructor or TA.
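A minimal sketch of has_no_e, tried on a made-up list, might look like the following. It assumes all words are in lowercase, as in the samples shown earlier, so an uppercase 'E' would not be detected.

```python
def has_no_e(w):
    """Return True if the letter 'e' does not appear in w."""
    return 'e' not in w

# Made-up sample; the real script would use the words from crosswords.txt.
sample = ['aa', 'aah', 'aahed', 'zebra', 'microminiaturizations']
count = 0
for w in sample:
    if has_no_e(w):
        count += 1
print(count)   # 3
```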

2. Frequency analysis

In English, certain letters are used more frequently than others. For example, more words begin with the letter ‘s’ than with any other letter. It is also well known that ‘e’ is the most frequently-used letter in English. Historically, such knowledge has played a very important role in cryptanalysis (i.e., the study of breaking ciphers). The following two exercises concern frequency analysis of letters used in the official list of 113,809 crosswords.

Note. For the following two exercises, you must first form a list of all the words from the given file crosswords.txt and then traverse the list multiple times (once for each letter of the alphabet). To form such a list, see the example given in 1.1.

2.1. In this official list of 113,809 crosswords, how many words begin with the letter ‘a’? How many words begin with the letter ‘b’? Is it true that ‘s’ is the most frequently-used first letter? Write a Python script that counts, for each letter in the alphabet, the number of all the words that begin with the letter and then finds the most frequently-used first letter. Your script should output in the following format:

In the official list of 113,809 crosswords,
6557 words begin with 'a',
6848 words begin with 'b',
  .
  .
  .
398 words begin with 'z',
and 's' is the most frequently-used first letter.

Your script should have a loop to iterate through the 26 letters of the alphabet (that is, you should not use 26 separate loops to cover the alphabet). To begin, first write down on a piece of paper your I/O specification and algorithm. Then implement your algorithm in a Python script named frequent_first_letters.py. Run your script and save the snapshot of your test run in a text file named frequent_first_letters.out.
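The single loop over the alphabet can be sketched as follows, using a made-up list in place of the real one. The function names count_first_letters and most_frequent are hypothetical, and string.ascii_lowercase (the string 'abc...z' from the standard string module) is one convenient way to obtain the 26 letters; other approaches work equally well.

```python
import string

def count_first_letters(words):
    """Return a list of (letter, count) pairs for 'a' through 'z'."""
    counts = []
    for letter in string.ascii_lowercase:     # one loop over the 26 letters
        n = 0
        for w in words:                       # one pass over the list per letter
            if w.startswith(letter):
                n += 1
        counts.append((letter, n))
    return counts

def most_frequent(counts):
    """Return the letter with the largest count (the first one wins a tie)."""
    best_letter, best_n = counts[0]
    for letter, n in counts:
        if n > best_n:
            best_letter, best_n = letter, n
    return best_letter

# Made-up sample; the real script would use the words from crosswords.txt.
sample = ['aa', 'aah', 'bat', 'sat', 'set', 'sit', 'zebra']
counts = count_first_letters(sample)
print(counts[0])              # ('a', 2)
print(most_frequent(counts))  # s
```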

2.2. In this official list of 113,809 crosswords, how many words use the letter ‘a’? How many words use the letter ‘b’? Is it true that ‘e’ is the most frequently-used letter? Write a Python script that counts, for each letter in the alphabet, the number of all the words that use the letter and then finds the most frequently-used letter. Your script should output in the following format:

In the official list of 113,809 crosswords,
56613 words use 'a',
16305 words use 'b',
  .
  .
  .
3451 words use 'z',
and 'e' is the most frequently-used letter.

As before, your script should have a loop to iterate through the 26 letters of the alphabet. To begin, first write down your I/O specification and algorithm. Then implement your algorithm in a Python script named frequent_letters.py. Run your script and save the snapshot of your test run in a text file named frequent_letters.out. When completed, show your work to the instructor or TA.
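For this exercise, the main difference is that a word is counted once for a letter no matter how many times the letter appears in it. A sketch under that reading, again on a made-up list and with the hypothetical helper name count_words_using, could be:

```python
import string

def count_words_using(words, letter):
    """Return how many words in the list contain letter at least once."""
    n = 0
    for w in words:
        if letter in w:       # a word counts once, even if letter repeats
            n += 1
    return n

# Made-up sample; the real script would use the words from crosswords.txt.
sample = ['aa', 'bee', 'eel', 'zebra', 'tee']
best_letter = None
best_n = -1
for letter in string.ascii_lowercase:     # one loop over the 26 letters
    n = count_words_using(sample, letter)
    if n > best_n:                        # the first letter wins a tie
        best_letter, best_n = letter, n

print(count_words_using(sample, 'e'))   # 4
print(best_letter)                      # e
```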

What to hand in

Upon completion of your laboratory, submit the following on paper. Be sure to put a file header at the top of each file.

Your I/O specifications and algorithms for 2.1 and 2.2.
Printouts of the Python scripts long_words.py, palindromes.py, no_e_words.py, frequent_first_letters.py and frequent_letters.py.
Printouts of the output files long_words.out, palindromes.out, no_e_words.out, frequent_first_letters.out and frequent_letters.out.
