CPSC 115L: Introduction to Computing
Fall 2010
Laboratory 8: Analyzing crosswords
October 27, 28
As usual, you are expected to work with an assigned partner as a pair. Both
you and your partner will receive the same grade. Both of you should always
save your laboratory work on your own accounts.
Objectives
The main objectives of this laboratory are
- to learn how to work with files and lists, and
- to continue learning how to work with strings.
1. Warm-up exercise
The Moby lexicon project is
an extensive public-domain collection of lexical resources (such as words,
phrases, synonyms, etc.) started by Grady Ward in 1996. It is now part of
Project Gutenberg, an ambitious
effort to digitize and archive virtually all historically important books
and documents. In this laboratory, we will play with the official list of
113,809 crosswords (i.e., words considered to be valid in crossword puzzles
and other word games).
1.1.
First, create a directory named lab8 inside your cpsc115
directory and download the file
crosswords.txt (for this, you must be on
this Laboratory 8 webpage and right-click the link provided). To work with a
file in Python, we must first create a file object. The built-in
function open takes a file name and returns a file object; for example,
>>> f = open('crosswords.txt')
>>> print f
<open file 'crosswords.txt', mode 'r' at 0xb7cdf770>
The mode 'r' indicates that this file is now open for reading (as
opposed to 'w' for writing). As shown above, print f does
not print the content of the file. There are actually several ways to read
the content. For example, the method readline reads a single line of
text from a file object and returns the result in a string:
>>> f.readline()
'aa\r\n'
This shows that the first word is aa, ending with the two special characters
\r\n for a carriage return and a newline, respectively. If you type
f.readline() again, you will get the second word:
>>> f.readline()
'aah\r\n'
To extract only a word without special characters, use the string method
strip:
>>> line = f.readline()
>>> word = line.strip()
>>> print word
aahed
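The following sketch (an illustration that uses a hand-typed string instead of the file) shows why strip matters when you count letters: the two special characters are part of the string until they are stripped. It is written so that it runs under both Python 2 and Python 3.

```python
# A line typed by hand in the form that readline returns it.
line = 'aahed\r\n'

# strip() removes whitespace, including \r and \n, from both ends.
word = line.strip()

print(len(line))   # 7: five letters plus \r and \n
print(len(word))   # 5: the letters only
```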
To print all the words, you can use a for statement:
f = open('crosswords.txt')
for line in f:
    word = line.strip()
    print word
You can also form a list of all the words (and then print the list):
f = open('crosswords.txt')
words = []    # avoid the name list, which is a built-in type
for line in f:
    word = line.strip()
    words.append(word)
print words
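If you need the word list in several scripts, one option is to wrap the loop in a function. The sketch below is only a suggestion: the name read_words is hypothetical, and the function accepts anything that yields lines, so it can be tried on a short hand-typed list as well as on a file object.

```python
def read_words(lines):
    """Return a list of the stripped words, one per line in lines."""
    words = []
    for line in lines:
        words.append(line.strip())
    return words

# Hand-typed sample standing in for the first three lines of the file.
sample_lines = ['aa\r\n', 'aah\r\n', 'aahed\r\n']
sample_words = read_words(sample_lines)
print(sample_words)
```

With the real file, the call would be read_words(open('crosswords.txt')).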
1.2.
Write a Python script named long_words.py that opens the file
crosswords.txt and prints all the words with more than 20 letters.
Run your script and save the snapshot of your test run in a text file named
long_words.out.
1.3.
Recall that a palindrome is a word that is spelled the same forward and
backward. Write a Python script named palindromes.py that opens the
file crosswords.txt, counts the number of all the palindromes, and
then prints the shortest and longest palindromes. Your script should output
in the following format:
In the official list of 113,809 crosswords, there are 91 palindromes. The
shortest palindrome is 'aa', and the longest palindrome is 'deified'.
For this, it will be helpful to use a function named is_palindrome
that takes a word w as a parameter and returns True if
w is a palindrome and False otherwise. (Most of you
already implemented this function last week.) Run your script and save the
snapshot of your test run in a text file named palindromes.out.
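As a reminder of the idea, here is one possible sketch of is_palindrome, together with the bookkeeping for the count, the shortest, and the longest palindrome. It is tried on a short hand-picked list, not on crosswords.txt, and the variable names are only suggestions. The slice w[::-1] produces w reversed; if you wrote the function differently last week, your version works just as well.

```python
def is_palindrome(w):
    """Return True if w reads the same forward and backward."""
    return w == w[::-1]

# Hand-picked sample, not the real word list.
sample = ['aa', 'aah', 'civic', 'deified', 'level', 'noon']

count = 0
shortest = None   # no palindrome seen yet
longest = None
for w in sample:
    if is_palindrome(w):
        count = count + 1
        if shortest is None or len(w) < len(shortest):
            shortest = w
        if longest is None or len(w) > len(longest):
            longest = w

print(count)      # 5
print(shortest)   # aa
print(longest)    # deified
```

Because the comparisons are strict, a tie in length goes to the palindrome that appears first in the list.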
1.4.
In 1939, Ernest Vincent Wright published a 50,000-word novel titled
Gadsby that does not contain the letter ‘e’. Since
‘e’ is the most common letter in English, that was not easy to
do. Write a Python script named no_e_words.py that opens the file
crosswords.txt, counts the number of all the words that do not have
‘e’, and then prints the shortest and longest such words. Your
script should output in the following format:
In the official list of 113,809 crosswords, there are 37641 words that do
not have 'e'. The shortest such word is 'aa', and the longest such word is
'microminiaturizations'.
For this, it will be helpful to define a function named has_no_e
that takes a word w as a parameter and returns True if
w does not have 'e' and False otherwise. Run your script and
save the snapshot of your test run in a text file named
no_e_words.out. When completed, show your work to the instructor or
TA.
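A minimal sketch of has_no_e, assuming every word in the file is written in lowercase (so only 'e', not 'E', needs checking). The in operator tests whether one string occurs inside another, and not in is its negation.

```python
def has_no_e(w):
    """Return True if the word w does not contain the letter 'e'."""
    return 'e' not in w

print(has_no_e('aa'))      # True
print(has_no_e('aahed'))   # False
```

Counting such words and tracking the shortest and longest ones can follow the same pattern as in the palindromes script.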
2. Frequency analysis
In English, certain letters are used more frequently than others. For
example, there are more words that begin with the letter ‘s’ than
with any other letter. It is also well-known that ‘e’ is the most
frequently-used letter in English. Historically, such knowledge has played
very important rôles in cryptanalysis (i.e., the study of breaking
ciphers). The following two exercises concern frequency analysis of letters
used in the official list of 113,809 crosswords.
Note.
For the following two exercises, you must first form a list of all the
words from the given file crosswords.txt and then traverse each word
in the list multiple times. To form such a list, see the example given above.
2.1.
In this official list of 113,809 crosswords, how many words begin with the
letter ‘a’? How many words begin with the letter ‘b’? Is
it true that ‘s’ is the most frequently-used first letter? Write
a Python script that counts, for each letter in the alphabet, the number of
all the words that begin with the letter and then finds the most
frequently-used first letter. Your script should output in the following
format:
In the official list of 113,809 crosswords,
6557 words begin with 'a',
6848 words begin with 'b',
.
.
.
398 words begin with 'z',
and 's' is the most frequently-used first letter.
Your script should have a loop to iterate through the 26 letters of the
alphabet (that is, you should not use 26 separate loops to cover the alphabet).
To begin, first write down on a piece of paper your I/O specification and
algorithm. Then implement your algorithm in a Python script named
frequent_first_letters.py. Run your script and save the snapshot of
your test run in a text file named frequent_first_letters.out.
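One way to loop over the alphabet is to use string.ascii_lowercase from the standard string module, which holds the 26 lowercase letters in order. The sketch below tries the idea on a short hand-typed list (not the real word list); the variable names are only suggestions, and it assumes every word is lowercase and non-empty.

```python
import string

# Hand-typed sample, not the real word list.
sample = ['apple', 'bat', 'sand', 'sun', 'zoo']

best_letter = ''
best_count = 0
for letter in string.ascii_lowercase:
    # Count the sample words whose first letter is this letter.
    count = 0
    for w in sample:
        if w[0] == letter:
            count = count + 1
    # Remember the letter with the largest count seen so far.
    if count > best_count:
        best_letter = letter
        best_count = count

print(best_letter)   # s
print(best_count)    # 2
```

A single outer loop handles all 26 letters, so no letter needs its own copy of the counting code.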
2.2.
In this official list of 113,809 crosswords, how many words use the letter
‘a’? How many words use the letter ‘b’? Is it true
that ‘e’ is the most frequently-used letter? Write a Python
script that counts, for each letter in the alphabet, the number of
all the words that use the letter and then finds the most frequently-used
letter. Your script should output in the following format:
In the official list of 113,809 crosswords,
56613 words use 'a',
16305 words use 'b',
.
.
.
3451 words use 'z',
and 'e' is the most frequently-used letter.
As before, your script should have a loop to iterate through the 26 letters of
the alphabet. To begin, first write down your I/O specification and
algorithm. Then implement your algorithm in a Python script named
frequent_letters.py. Run your script and save the snapshot of your
test run in a text file named frequent_letters.out. When completed,
show your work to the instructor or TA.
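Note that this exercise counts words, not occurrences: a word such as 'banana' contains 'a' three times but counts only once toward the total for 'a'. The sketch below shows one way to do that with the in operator on a hand-typed list; the function name count_words_using is hypothetical.

```python
def count_words_using(letter, words):
    """Return how many words in words contain letter at least once."""
    count = 0
    for w in words:
        if letter in w:
            count = count + 1
    return count

# Hand-typed sample, not the real word list.
sample = ['banana', 'bob', 'cab', 'dog']
print(count_words_using('a', sample))   # 2: banana, cab
print(count_words_using('b', sample))   # 3: banana, bob, cab
```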
What to hand in
Upon completion of your laboratory, submit the following on paper. Be sure to
put a file header at the top of each file.
- Your I/O specifications and algorithms for 2.1 and 2.2
- Printouts of the Python scripts long_words.py,
palindromes.py, no_e_words.py,
frequent_first_letters.py and frequent_letters.py.
- Printouts of the output files long_words.out,
palindromes.out, no_e_words.out,
frequent_first_letters.out and frequent_letters.out.