CS3134 Homework #5
Due on November 25, 2003 at 11:00am
There are two parts to this homework: a written component worth
15 points, and a programming assignment
worth 10 points. See the homework submission instructions on how to
hand it in and for important notes on programming style and
structure.
Note: parts in red are
revisions/clarifications.
Written questions
- (10 points) You're given the following list of numbers to
work with.
35, 21, 48, 1, 93, 87, 55, 100, 34, 97
- (3 points) Insert the numbers in a heap. Use
remove-largest conventions. Draw the resulting tree as well
as its array representation (use any appropriate array size
you like).
- (2 points) Remove the root node from the heap in (a),
and show the resulting tree after any bubble-down
operations.
- (3 points) Insert the numbers into a 13-element
array-backed hash table using double hashing. The first
hash function is key % 13, and the second
hash function is 5 - (key % 5).
Draw the resulting table.
- (2 points) State the number of initial collisions
(i.e., collisions under the first hash function alone), and compute the
average "found" probe length (i.e., count the number of probes needed
to find each of the numbers in that list in the hash table from
(c), and average these over the set of 10 numbers).
- (5 points) You are to work out the steps to generate a
Huffman code for the string "MISSISSIPPI SAILING"
(without quotes).
- (3 points) Using the algorithm outlined in the book and
in class, create a Huffman tree for this String. Make sure
to show the steps involved in the Huffman tree's creation
(e.g., starting with singleton trees in a priority queue).
- (2 points) Given the Huffman tree in (a), encode the
aforementioned string. If we would ordinarily use 5 bits
per character, what's the savings in total # of bits?
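For the double-hashing question, the probe sequence can be sketched as follows. This is only an illustration, assuming the usual scheme in which the second hash function gives a constant step size and probes wrap around the table; the names DoubleHashSketch, probeIndex, and insert are hypothetical, and the keys are made up rather than taken from the list above.

```java
import java.util.Arrays;

class DoubleHashSketch {
    static final int SIZE = 13;    // table size from the question
    static final int EMPTY = -1;   // assumption: all keys are non-negative

    static int hash1(int key) { return key % SIZE; }
    static int hash2(int key) { return 5 - (key % 5); }  // step size in 1..5, never 0

    // Slot examined on the i-th probe (i = 0 is the initial slot).
    static int probeIndex(int key, int i) {
        return (hash1(key) + i * hash2(key)) % SIZE;
    }

    // Inserts key and returns the number of probes used (1 means no collision).
    static int insert(int[] table, int key) {
        for (int i = 0; i < SIZE; i++) {
            int idx = probeIndex(key, i);
            if (table[idx] == EMPTY) {
                table[idx] = key;
                return i + 1;
            }
        }
        throw new IllegalStateException("table is full");
    }

    public static void main(String[] args) {
        int[] table = new int[SIZE];
        Arrays.fill(table, EMPTY);
        int[] keys = {14, 27, 40};  // illustrative keys that all hash to slot 1
        for (int k : keys) {
            System.out.println(k + " placed after " + insert(table, k) + " probe(s)");
        }
        // prints: 14 placed after 1 probe(s)
        //         27 placed after 2 probe(s)
        //         40 placed after 2 probe(s)
    }
}
```

Summing the probe counts returned for every key and dividing by the number of keys gives the average "found" probe length, since a successful search retraces the same probes as the insertion.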
Programming problem
The goal of this programming exercise is to develop a tool that
helps you query the frequency of individual words in a body of
text. Your program will read a corpus of words (Machiavelli's
Prince is a simple example, and you can download it here; other texts are available at Project Gutenberg), and
then prompt the user to enter one or more words, for which it will
produce the frequency. Treat words as case-preserving but
case-insensitive, like the previous homework.
You will implement a hash table to store this information. Use
a "separate chaining" style of hash table, that is, a hash table
that consists of linked lists pointing to one or more entries
that hash to that location. You can either use the book's code,
which has its own ordered Linked List implementation as part of
the hash table, or (I recommend) you can implement your own,
using Java's LinkedList data structure, much like we
did for HW#4. You will be storing a tuple of information in
this hash table: the word will be the key, and a
frequency count will be kept along with the key.
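As a rough illustration of the separate-chaining layout described above, the sketch below keeps an array of java.util.LinkedList buckets and creates each list only when its bucket first receives an entry. It stores plain strings rather than word/frequency pairs, and String.hashCode stands in for the assignment's hash function; the names ChainingSketch, add, and contains are hypothetical.

```java
import java.util.LinkedList;

class ChainingSketch {
    private final LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    ChainingSketch(int numBuckets) {
        // Only the array is allocated here; each list is created on first use.
        buckets = new LinkedList[numBuckets];
    }

    // Assumption: String.hashCode is a stand-in for a hand-written hash function.
    private int indexFor(String key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    // Returns true if the key was newly added, false if it was already present.
    boolean add(String key) {
        int i = indexFor(key);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<>();
        }
        if (buckets[i].contains(key)) {
            return false;
        }
        buckets[i].add(key);
        return true;
    }

    boolean contains(String key) {
        LinkedList<String> chain = buckets[indexFor(key)];
        return chain != null && chain.contains(key);
    }

    public static void main(String[] args) {
        ChainingSketch table = new ChainingSketch(8);
        System.out.println(table.add("prince"));      // prints: true
        System.out.println(table.add("prince"));      // prints: false
        System.out.println(table.contains("ruler"));  // prints: false
    }
}
```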
- (1 point) Develop a Word class that contains a word
and a frequency and methods to manipulate its content. The word
and the frequency fields themselves should be marked
private. Note that the initial frequency of a new
Word will be 1.
- (7 points) Build a WordHashtable class that
contains the code necessary to:
- Construct the hash table. This should take one
parameter -- the size of the array of LinkedLists
(i.e., the number of buckets). Note that you should
not actually create the linked lists here, just the array
that may potentially contain them.
- (4 points) insert words. This method takes one
parameter (the word, as String) and hashes it. It
then either inserts it if it's not currently there (first
creating a LinkedList), or increments the frequency
of the Word if it's found. It then (potentially)
updates two object-level variables: uniqueWords is
incremented when a word is first inserted, and
numCollisions is incremented when the bucket already
contains words but none of them matches the word being
inserted (i.e., only when a new Word is added to a
non-empty bucket, not when an existing frequency is
updated).
- (1 point) getFrequency of words. This takes a
String, looks up the word, and returns the frequency (an
int) if found, or -1 if it's not found.
- (1 point) getNumCollisions and
getNumUniqueWords return the number of collisions
and unique words, respectively, as generated by calls to
insert above.
- (1 point) hashFunc computes the hash of a
String. Use the "third" String hash technique in the book,
i.e., on page 565. However, note that this hash function
can only handle lowercase characters and no punctuation.
Therefore, modify the hash function so that any non-alphabet
character encountered as you walk through the String
is ignored, and uppercase characters are converted
to lowercase before they're folded into
the hashVal. Do this within the existing
for loop (you may find methods in the
Character class useful for this purpose). Make
sure you don't actually remove the punctuation from the word
when you store it; ignore it only for the purposes of computing
the hash.
- (2 points) Develop a main method to read the corpus
line-by-line, tokenize it into individual words separated by
whitespace, and feed each of the individual words into an
instance of the WordHashtable (of which you should create an
instance that holds at least 10,000 buckets). This main method
should then print out the number of words, collisions, and
unique words found. Finally, it should repeatedly prompt the
user to enter a word, and either print out its frequency or "Not
found". To quit, the user should just hit [Enter] (i.e., enter an
empty string).
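The read-and-tokenize step of the main method could look roughly like the sketch below. It assumes whitespace is the only delimiter and reads from an in-memory string so that it runs without a corpus file; the names TokenizeSketch and tokenize are hypothetical, and a real solution would pass each token to the hash table's insert method instead of just counting.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class TokenizeSketch {
    // Splits one line into whitespace-separated tokens, keeping punctuation intact.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.trim().split("\\s+")) {
            if (!t.isEmpty()) {   // a blank line yields one empty token; skip it
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a StringReader stands in for a FileReader on the corpus.
        String sample = "The Prince\n\n  by Nicolo   Machiavelli\n";
        BufferedReader in = new BufferedReader(new StringReader(sample));
        int words = 0;
        String line;
        while ((line = in.readLine()) != null) {
            words += tokenize(line).size();
        }
        System.out.println(words + " words");  // prints: 5 words
    }
}
```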
An execution on Machiavelli's "The Prince" resulted in the
following, with input italicized.
$ java WordHashtable
51385 words, 3760 collisions, 7939 unique words
> power
Word power has frequency 28
> ruler
Not found
> prince
Word prince has frequency 142
>
Note that I did not create a WordHashtableApp class, but
rather put the main method in the WordHashtable class
itself. Do the same thing. The main method will be
written in exactly the same fashion -- the only
difference is that the command line execution string changes,
since there's no explicit "App".
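The hashFunc modification described earlier can be sketched as below. This assumes the book's "third" technique is a Horner-style base-27 hash that takes the modulus at every step (check page 565 for the exact form) and that "alphabet character" means the letters a-z; the class name HashFuncSketch and the explicit numBuckets parameter are hypothetical.

```java
class HashFuncSketch {
    // Hashes only the letters of key, case-insensitively; key itself is not modified.
    static int hashFunc(String key, int numBuckets) {
        int hashVal = 0;
        for (int j = 0; j < key.length(); j++) {
            char c = Character.toLowerCase(key.charAt(j));
            if (c < 'a' || c > 'z') {
                continue;                    // skip punctuation, digits, etc.
            }
            int letter = c - 'a' + 1;        // 'a' -> 1 ... 'z' -> 26
            // Assumption: base 27 with a modulus at each step, to avoid overflow.
            hashVal = (hashVal * 27 + letter) % numBuckets;
        }
        return hashVal;
    }

    public static void main(String[] args) {
        System.out.println(hashFunc("ab", 10000));  // prints: 29
        System.out.println(hashFunc("Don't", 10000) == hashFunc("dont", 10000));  // prints: true
    }
}
```

Because skipped characters never touch hashVal, "Don't" and "dont" land in the same bucket, while the stored word keeps its original punctuation and case.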
- (6 points) Extra credit: In addition to being able to
look up words in the WordHashtable, I'd like to know the
top 10 words occurring in the corpus of text. This can be
accomplished by using a heap. The strategy is that we create an
array-backed heap that contains the same instances of
Word that are inserted in the hash table -- that is,
both the heap and the hash table contain references to Word
objects -- just organized in a different fashion. (If you do
this, make sure to clearly indicate you've done so in your
README.)
- Modify your Word class so that it maintains its
index position in the heap (heapIndex), with
default value -1. The rationale for this will become clear
in part (b).
- Create a WordHeap class, consisting of:
- A constructor that takes the number of objects this
heap may contain;
- An insert method that takes a Word
object, inserts it at the end, and updates that
Word's heapIndex. Interestingly, you
don't need to worry about bubbling up here (why?).
- An increasedFreq method that takes a
Word object and bubbles it up as necessary,
based on its recently-changed frequency. Make sure the
Word actually belongs in the heap before doing
this. This method won't actually change the frequency;
it'll just make sure the heap property is maintained
when an external entity changes it. Make sure
you update the heapIndexes here as well. You
must implement this method in O(log n) time.
- A remove method that takes the
maximum-frequency Word object (i.e., the one at
the root), removes it, "fixes" the heap (by swapping the
last to the root, and bubbling down), and returns the
removed Word. The Word that's removed
should have its heap index changed to -1.
- An isEmpty method that returns a boolean.
- Modify the WordHashtable class such that:
- In the hash table's constructor, it creates an
object-level WordHeap the same size as the hash
table;
- Whenever it inserts a word for the first
time, that same Word is inserted into the
heap;
- Whenever a word's frequency is incremented, the
heap's increasedFreq method is called;
- A method called getTop10 is implemented
that takes no parameters and returns an array of 10
Words -- the result of 10 removes from
the heap -- in order, with the highest-frequency word
first. (Yes, this does assume there are 10 to remove;
if there aren't, make those cells null.)
- Modify the main method such that it calls
getTop10 and prints the results after it
prints the hash table's statistics.
Note that this is a non-trivial extra credit assignment, and
it's essentially binary: you'll either get full credit or no
credit, unlike the rest of the homework. If you have
the time, I encourage you to do it, as it will provide
invaluable practice in working with heaps, but make sure to
see me with questions first!
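To illustrate why each object tracks its own position in the heap, here is a minimal sketch of an index-tracking max-heap. It is a generic stand-in rather than the required WordHeap: the names Item, IndexedMaxHeap, and increased are hypothetical, insert bubbles up as a general-purpose heap would, and removal with bubble-down is omitted.

```java
class Item {
    final String name;
    int priority;
    int heapIndex = -1;   // -1 means "not currently in a heap"

    Item(String name, int priority) {
        this.name = name;
        this.priority = priority;
    }
}

class IndexedMaxHeap {
    private final Item[] heap;
    private int size = 0;

    IndexedMaxHeap(int capacity) {
        heap = new Item[capacity];
    }

    // Appends the item at the end, records its index, and restores heap order.
    void insert(Item item) {
        heap[size] = item;
        item.heapIndex = size;
        size++;
        bubbleUp(item.heapIndex);
    }

    // Call after item.priority has been raised by outside code.
    // The stored index avoids an O(n) search, so this runs in O(log n).
    void increased(Item item) {
        int i = item.heapIndex;
        if (i < 0 || i >= size || heap[i] != item) {
            return;   // item does not belong to this heap
        }
        bubbleUp(i);
    }

    Item peek() {
        return size == 0 ? null : heap[0];
    }

    private void bubbleUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (heap[parent].priority >= heap[i].priority) {
                break;
            }
            swap(i, parent);
            i = parent;
        }
    }

    // Swaps two slots and keeps both items' heapIndex fields in sync.
    private void swap(int a, int b) {
        Item tmp = heap[a];
        heap[a] = heap[b];
        heap[b] = tmp;
        heap[a].heapIndex = a;
        heap[b].heapIndex = b;
    }

    public static void main(String[] args) {
        IndexedMaxHeap h = new IndexedMaxHeap(4);
        Item a = new Item("a", 1), b = new Item("b", 1), c = new Item("c", 1);
        h.insert(a);
        h.insert(b);
        h.insert(c);
        c.priority = 5;      // changed externally, as the hash table would do
        h.increased(c);
        System.out.println(h.peek().name + " at index " + c.heapIndex);  // prints: c at index 0
    }
}
```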