There are two parts to this homework: a written component worth 8 points, and a programming component worth 17 points. Submission instructions are available here.
Note: parts in red are revisions/clarifications (last revised 4/7/05).
As described in the homework submission instructions, you may submit this as a hardcopy, or as a file along with your programming problems in one of four formats (Word, PDF, HTML, or plaintext). Make sure to put at least your name and the section number on top of the homework whether it's submitted in written or electronic form, and if it's submitted electronically, make sure you name your file correctly.
Note that problems assigned from Schneider/Gersting or Lewis/Loftus are the exercise problems at the end of each chapter, not the practice problems or self-review questions. (The practice/self-review problems are optional, and solutions for them are provided in the book. For obvious reasons, the solutions for the exercises are not. ;-))
As described below, you will submit this part of the assignment as five files: three .java files, corresponding to the source for each problem, a README file, and a typescript showing that each function works, similar to the one pictured below. Make sure to put comments in your code - you may lose points if you don't comment your code. Also, make sure you're familiar with code in class and in the book -- we've covered most of these topics already, and the assignment becomes reasonably straightforward if you're up to speed with the course material.
The Problem
We are interested in doing simple statistical calculations on ASCII files: in particular, we're interested in the frequency with which words appear in a body of text. Such a program has applications in data compression and natural-language processing. For this assignment, you will write an interactive program that reads a text file, counts all the words in the file, and stores the word counts and frequencies in an array. Once you have this word frequency information, your program will provide answers to numerous questions. For example: how many unique words were in the file? What are the words with the top N (say 10) frequencies? etc.
We will walk you through some of the design, asking you to code certain classes and methods.
Please pay particular attention to the specifications provided. For example, if we ask for a method that returns the total number of words as an int, then your method should do only that. It should not print anything unless specified.
Ready? Here we go!
At a high level, here's the plan...we are going to build three classes: Word, WordArray, and WordBank. WordBank is our driver program, and it therefore will be the only class with a main method (in fact, it will be the only method in that class). WordBank will open the text file that we wish to count the words for and read each word. Every time it reads a word, it will tell the WordArray to add the word. The WordArray will determine if this is a new word or one it has already seen, and it will act accordingly. To do this, WordArray will have an array of Words - a class that holds a word and its frequency of occurrence.
Word. Here is the UML diagram for Word:
It is a rather straightforward class that will store a word (a String) and the frequency (int) as private member variables. The constructor initializes the word to an initial String value, passed as a parameter. The class has the following methods:
incrementFrequency - takes no parameters, increments the word frequency by 1, and returns nothing.
getWord - takes no parameters but returns the actual word as a String.
getFrequency - takes no parameters but returns the value of frequency as an int.
toString - returns a String which is the concatenation of a word and its frequency, with space or prose in between (e.g., "foo occurs 10 times").
You should implement WordArray; it has the following UML diagram:
(1 point) The wordList is your array of words. (Use the array syntax we discussed in class; do not use an ArrayList object!) Your list should have the capacity to store 1000 words initially. count is an int that will keep track of how many (unique) words are in your wordList (and will tell you where to put the next word). You should initialize these appropriately in your default constructor (which will take no parameters). Finally, you should write the following methods.
find - should implement linear search over the words in your wordList. It takes a word (String) to search for and returns the index of the word if it is in the list. If it's not in the list, return -1. Note: You should ignore case when checking to see if a word is in your list, i.e., "The", "the" and "tHe" are all the same word.
add - should take a word (as a String) and add it to your wordList array. If the word is already in the array, you should not add it again; instead, increase the frequency of the word by one. Also, be sure to check that there is space in the array for any new items before adding a word. If not, call the local method growList (see below) to increase the size of the array for you.
growList - should double the size of your array. Recall that arrays (as we covered) cannot be resized. So, to grow the list, you should create a new array that is twice the size of the current list, copy all the elements from wordList to this temporary array, and then replace the wordList reference with this new list.
sort - should sort the list by word frequency in ascending order. You should implement the bubble sort algorithm to perform the sort, leaving you with a sorted array. You should implement this using the Java covered thus far in class. Do not use Comparable classes or built-in Java sorting mechanisms. If in doubt, consult an instructor or TA.
getUniqueWordCount - return the number of unique words in your list.
getFrequency - takes a String and returns either the frequency of the word, if found, or -1 if not found.
getTopWords - takes an int (i.e., n) and, assuming someone has sorted the list first, returns a String containing the n most frequent words with their frequencies. Hint: you can put "\n" characters into the String to allow WordBank to nicely print more than one word's results (i.e., on multiple lines) on the screen.
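The methods above fit together roughly as in the following sketch. It is an illustration under stated assumptions, not a model answer: the condensed Word class is included only so the sketch compiles on its own and assumes a starting frequency of 1; growList is assumed to be private; and because sort is ascending, getTopWords is assumed to walk the array from the end. WordArrayDemo is a hypothetical class used only to exercise the code.

```java
// Condensed Word so this sketch compiles on its own.
// ASSUMPTION: a new Word starts with frequency 1.
class Word {
    private String word;
    private int frequency = 1;
    public Word(String word) { this.word = word; }
    public void incrementFrequency() { frequency++; }
    public String getWord() { return word; }
    public int getFrequency() { return frequency; }
    public String toString() { return word + " occurs " + frequency + " times"; }
}

class WordArray {
    private Word[] wordList = new Word[1000]; // initial capacity from the spec
    private int count = 0;                    // number of unique words stored

    // Linear search, ignoring case; returns the index or -1.
    public int find(String word) {
        for (int i = 0; i < count; i++) {
            if (wordList[i].getWord().equalsIgnoreCase(word)) {
                return i;
            }
        }
        return -1;
    }

    // Adds a new word, or bumps the frequency of one already present.
    public void add(String word) {
        int index = find(word);
        if (index >= 0) {
            wordList[index].incrementFrequency();
            return;
        }
        if (count == wordList.length) {
            growList(); // no room left, so double the array first
        }
        wordList[count] = new Word(word);
        count++;
    }

    // Doubles capacity by copying into a new array and swapping the reference.
    private void growList() {
        Word[] bigger = new Word[wordList.length * 2];
        for (int i = 0; i < count; i++) {
            bigger[i] = wordList[i];
        }
        wordList = bigger;
    }

    // Bubble sort by frequency, ascending.
    public void sort() {
        for (int pass = 0; pass < count - 1; pass++) {
            for (int i = 0; i < count - 1 - pass; i++) {
                if (wordList[i].getFrequency() > wordList[i + 1].getFrequency()) {
                    Word tmp = wordList[i];
                    wordList[i] = wordList[i + 1];
                    wordList[i + 1] = tmp;
                }
            }
        }
    }

    public int getUniqueWordCount() { return count; }

    // Frequency of the word, or -1 if it is not in the list.
    public int getFrequency(String word) {
        int index = find(word);
        return index >= 0 ? wordList[index].getFrequency() : -1;
    }

    // Assumes sort() has been called: the most frequent words sit at the end.
    public String getTopWords(int n) {
        String result = "";
        for (int i = count - 1; i >= 0 && i >= count - n; i--) {
            result += wordList[i] + "\n";
        }
        return result;
    }
}

// Hypothetical demo class, not part of the assignment.
class WordArrayDemo {
    public static void main(String[] args) {
        WordArray words = new WordArray();
        for (String w : "the cat and The dog and THE bird".split(" ")) {
            words.add(w);
        }
        words.sort();
        System.out.println(words.getUniqueWordCount()); // prints 5
        System.out.print(words.getTopWords(2));
    }
}
```

Run as written, the demo reports 5 unique words and lists "the" (3 occurrences) ahead of "and" (2). The word is stored with whatever capitalization it had the first time it was seen, since find ignores case but add stores the String it was given.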
The WordBank driver class should first take a filename as a command-line parameter, create a Scanner for the file, and read all the words from the file one word at a time, adding each to your local WordArray. (3 points)
Scanner breaks tokens (via the next() method call) based on whitespace, and will not ignore punctuation in the file. However, the Scanner for your file can be customized to consider non-word characters (where word characters are a-z, A-Z, 0-9 and, for this pattern, the underscore) as delimiters in addition to whitespace, thereby effectively ignoring punctuation. To do this, use the following code to create your file scanner:
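// Requires java.util.Scanner and java.io.File; the constructor can throw FileNotFoundException.
Scanner fileScan = new Scanner(new File(filename));
// Treat any run of whitespace or non-word characters as a single delimiter.
fileScan.useDelimiter("[\\s\\W]+");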
Here, filename is the name of the file you want to open. (This method for changing the delimiter is non-optimal, as it breaks hyphenated words, possessives, etc. Can you think of a better way to do it? One point of extra credit if you can correctly handle words with hyphens and apostrophes.)
After reading all the words, sort them (so that getTopWords will work). Next, use WordArray's getUniqueWordCount to print out the total number of unique words, create a new Scanner for user input, and repeatedly prompt the user until they hit enter at an empty prompt to quit the program.
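To see what the delimiter setting does to hyphens and apostrophes, it can help to try it on a short in-memory String before pointing it at a file. The sketch below is only an illustration: DelimiterDemo and its tokens helper are hypothetical names, and the second pattern is just one possible alternative, shown under the assumption that apostrophes and hyphens should always count as part of a word.

```java
import java.util.Scanner;

// Hypothetical demo class, not part of the assignment.
class DelimiterDemo {
    // Collects the tokens a Scanner produces for the given text and delimiter pattern,
    // wrapping each token in brackets so the boundaries are visible.
    static String tokens(String text, String delimiter) {
        Scanner scan = new Scanner(text);
        scan.useDelimiter(delimiter);
        String result = "";
        while (scan.hasNext()) {
            result += "[" + scan.next() + "]";
        }
        scan.close();
        return result;
    }

    public static void main(String[] args) {
        String text = "Well-known fact: it's Ann's cat!";
        // Delimiter from the assignment: hyphens and apostrophes split words.
        System.out.println(tokens(text, "[\\s\\W]+"));
        // ASSUMPTION: one possible alternative that keeps ' and - inside words.
        System.out.println(tokens(text, "[^a-zA-Z0-9'-]+"));
    }
}
```

The first line printed is [Well][known][fact][it][s][Ann][s][cat] and the second is [Well-known][fact][it's][Ann's][cat]. The alternative is not a complete answer: it would also keep a stray leading or trailing hyphen or quotation mark attached to a word, so some further cleanup would likely be needed.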
Sample session:
$ java WordBank test.txt
test.txt has 865 unique words.
Enter a word to get its frequency, a number to
list the top N words, or a blank line to quit.
> a
Word a occurs 50 times.
> 10
the with frequency 191
I with frequency 97
to with frequency 74
and with frequency 71
of with frequency 64
a with frequency 50
my with frequency 43
in with frequency 41
was with frequency 40
her with frequency 35
> (user just hits Enter)
$
Handle each request by calling the appropriate method of your WordArray and printing out the returned result. If the user enters a word that doesn't exist or enters a value of N less than 1, print an error. Apart from these two scenarios, you can assume "valid" input. Hint: java.lang's Character class has a few utility methods that help you determine if a character is a number or a letter; you can grab the first character of the String containing a line of user input, use these methods to decide what kind of input the user has made, and then process it accordingly.
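The hint about Character can be sketched as follows. This is a rough illustration, not the required design: QueryKind, classify, and the returned labels are hypothetical names chosen for the example, while Character.isDigit and Character.isLetter are the java.lang utility methods the hint refers to.

```java
// Hypothetical helper class, not part of the assignment.
class QueryKind {
    // Classifies one line of user input by looking at its first character.
    // Returns "quit", "number", "word", or "unknown" (labels are made up for this sketch).
    static String classify(String line) {
        if (line.length() == 0) {
            return "quit";      // blank line: the user just hit Enter
        }
        char first = line.charAt(0);
        if (Character.isDigit(first)) {
            return "number";    // treat the line as N for the top-N listing
        }
        if (Character.isLetter(first)) {
            return "word";      // treat the line as a word to look up
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(classify("a"));   // prints "word"
        System.out.println(classify("10"));  // prints "number"
        System.out.println(classify(""));    // prints "quit"
    }
}
```

In a driver, a "number" line could then be converted with Integer.parseInt and checked against the N less than 1 rule, while a "word" line would go to getFrequency. A line such as "0" is classified as a number here, so the range check still has to happen after the conversion.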