CS3134: HW#5

CS3134 Homework #5
Due on November 25, 2003 at 11:00am

There are two parts to this homework: a written component worth 15 points, and a programming assignment worth 10 points. See the homework submission instructions on how to hand it in and for important notes on programming style and structure.

Note: parts in red are revisions/clarifications.

Written questions

(10 points) You're given the following list of numbers to work with.
35, 21, 48, 1, 93, 87, 55, 100, 34, 97
1. (3 points) Insert the numbers into a heap. Use remove-largest conventions (i.e., a max-heap). Draw the resulting tree as well as its array representation (use any appropriate array size you like).
2. (2 points) Remove the root node from the heap from part 1, and show the resulting tree after any bubble-down operations.
3. (3 points) Insert the numbers into a 13-element array-backed hash table using double hashing. The first hash function is key % 13, and the second hash function is 5 - (key % 5). Draw the resulting table.
4. (2 points) State the number of initial collisions (i.e., collisions under the first hash function), and compute the average "found" probe length: for each of the 10 numbers in the list, count the steps needed to find it in the hash table from part 3, then average these counts.
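As a rough illustration of the double-hashing rule used in part 3, the sketch below inserts keys with the two hash functions given above and counts the probes needed to find a key again. The class name, method names, and the -1 "empty" sentinel are assumptions made for this example, and the keys in main are deliberately not the ones from the problem.

```java
import java.util.Arrays;

class DoubleHashDemo {
    static final int EMPTY = -1;   // assumed sentinel; keys must be non-negative

    // Inserts key with double hashing and returns the slot it landed in.
    // Assumes the table still has at least one free slot.
    static int insert(int[] table, int key) {
        int size = table.length;
        int index = key % size;          // first hash: the "home" slot
        int step = 5 - (key % 5);        // second hash: step size, always 1..5
        while (table[index] != EMPTY) {  // occupied, so keep stepping
            index = (index + step) % size;
        }
        table[index] = key;
        return index;
    }

    // Counts the probes needed to find a key that is known to be in the table.
    static int probesToFind(int[] table, int key) {
        int size = table.length;
        int index = key % size;
        int step = 5 - (key % 5);
        int probes = 1;
        while (table[index] != key) {
            index = (index + step) % size;
            probes++;
        }
        return probes;
    }

    public static void main(String[] args) {
        int[] table = new int[13];
        Arrays.fill(table, EMPTY);
        for (int key : new int[] {14, 27, 40}) {   // illustrative keys only
            System.out.println(key + " -> slot " + insert(table, key));
        }
        System.out.println("probes to find 40: " + probesToFind(table, 40));
    }
}
```

Because the table size (13) is prime and the step is never 0, every probe sequence eventually visits every slot, which is why the loops terminate as long as the stated assumptions hold.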
(5 points) You are to work out the steps to generate a Huffman code for the string "MISSISSIPPI SAILING" (without quotes).
1. (3 points) Using the algorithm outlined in the book and in class, create a Huffman tree for this string. Make sure to show the steps involved in the Huffman tree's creation (e.g., starting with singleton trees in a priority queue).
2. (2 points) Given the Huffman tree from part 1, encode the aforementioned string. If we would ordinarily use 5 bits per character, how many bits in total does the Huffman encoding save?
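The general shape of Huffman tree construction can be sketched as follows. This is only an illustration under assumptions: it uses java.util.PriorityQueue rather than the book's priority queue, the Node type and method names are made up for the example, and ties between equal weights may be broken differently than in class, so the tree shape (and individual codes) can differ from a hand-drawn answer even though the total encoded length is the same.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanSketch {
    // Hypothetical node type: leaves carry a symbol, internal nodes only a weight.
    static class Node {
        final char symbol;
        final int weight;
        final Node left, right;
        Node(char symbol, int weight) { this(symbol, weight, null, null); }
        Node(char symbol, int weight, Node left, Node right) {
            this.symbol = symbol; this.weight = weight;
            this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Builds a Huffman tree for a non-empty string.
    static Node buildTree(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        // Start with one singleton tree per distinct character, lightest first.
        PriorityQueue<Node> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            pq.add(new Node(e.getKey(), e.getValue()));
        }
        // Repeatedly merge the two lightest trees until one tree remains.
        while (pq.size() > 1) {
            Node a = pq.poll();
            Node b = pq.poll();
            pq.add(new Node('\0', a.weight + b.weight, a, b));
        }
        return pq.poll();
    }

    // Walks the tree, assigning "0" to left edges and "1" to right edges.
    static void collectCodes(Node node, String prefix, Map<Character, String> codes) {
        if (node.isLeaf()) {
            // A one-symbol text would otherwise get an empty code.
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        collectCodes(node.left, prefix + "0", codes);
        collectCodes(node.right, prefix + "1", codes);
    }

    // Encodes the text as a string of '0' and '1' characters.
    static String encode(String text) {
        Map<Character, String> codes = new HashMap<>();
        collectCodes(buildTree(text), "", codes);
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) bits.append(codes.get(c));
        return bits.toString();
    }
}
```

Calling encode(text).length() gives the number of bits used, which can be compared against a fixed-width encoding of the same text.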

Programming problem

The goal of this programming exercise is to develop a tool that helps you query the frequency of individual words in a body of text. Your program will read a corpus of words (Machiavelli's The Prince is a simple example, and you can download it here; other texts are available at Project Gutenberg) and prompt the user to enter one or more words, reporting the frequency of each. Treat words as case-preserving but case-insensitive, as in the previous homework.

You will implement a hash table to store this information. Use a "separate chaining" style of hash table, that is, a hash table that consists of linked lists pointing to one or more entries that hash to that location. You can either use the book's code, which has its own ordered Linked List implementation as part of the hash table, or (I recommend) you can implement your own, using Java's LinkedList data structure, much like we did for HW#4. You will be storing a tuple of information in this hash table: the word will be the key, and a frequency count will be kept along with the key.
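To make the separate-chaining layout concrete, here is a minimal sketch of an array of lazily created LinkedLists holding (key, count) pairs. The class, field, and method names are assumptions for illustration, and the bucket computation uses String.hashCode purely as a stand-in, not the hash function this assignment requires.

```java
import java.util.LinkedList;
import java.util.Locale;

class ChainedCounts {
    // Hypothetical entry type: a key stored as typed, plus its running count.
    static class Entry {
        final String key;
        int count = 1;
        Entry(String key) { this.key = key; }
    }

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedCounts(int numBuckets) {
        // Only the array is allocated here; every bucket starts out null.
        buckets = (LinkedList<Entry>[]) new LinkedList[numBuckets];
    }

    // Stand-in hash (an assumption): lowercases so lookups ignore case.
    private int bucketOf(String key) {
        return Math.floorMod(key.toLowerCase(Locale.ROOT).hashCode(), buckets.length);
    }

    void insert(String key) {
        int b = bucketOf(key);
        if (buckets[b] == null) buckets[b] = new LinkedList<>();  // create on first use
        for (Entry e : buckets[b]) {
            if (e.key.equalsIgnoreCase(key)) {  // already present: bump the count
                e.count++;
                return;
            }
        }
        buckets[b].add(new Entry(key));         // new key: append to the chain
    }

    int getFrequency(String key) {
        LinkedList<Entry> chain = buckets[bucketOf(key)];
        if (chain != null) {
            for (Entry e : chain) {
                if (e.key.equalsIgnoreCase(key)) return e.count;
            }
        }
        return -1;                              // not found
    }
}
```

The stored key keeps its original capitalization while comparisons ignore case, which is one way to read "case-preserving but case-insensitive".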

(1 point) Develop a Word class that contains a word and a frequency and methods to manipulate its content. The word and the frequency fields themselves should be marked private. Note that the initial frequency of a new Word will be 1.
(7 points) Build a WordHashtable class that contains the code necessary to:
1. Construct the hash table. This should take one parameter -- the size of the array of LinkedLists (i.e., the number of buckets). Note that you should not actually create the linked lists here, just the array that may potentially contain them.
2. (4 points) insert words. This method takes one parameter (the word, as a String) and hashes it. If the word is not already in the table, it is inserted (creating the bucket's LinkedList first if necessary); if it is found, the frequency of the existing Word is incremented. The method then (potentially) updates two object-level variables: uniqueWords is incremented when a word is first inserted, and numCollisions is incremented when the bucket already contains words but none of them matches the word being inserted.
3. (1 point) getFrequency of words. This takes a String, looks up the word, and returns the frequency (an int) if found, or -1 if it's not found.
4. (1 point) getNumCollisions and getNumUniqueWords return the number of collisions and unique words, respectively, as generated by calls to insert above.
5. (1 point) hashFunc computes the hash of a String. Use the "third" String hash technique in the book, i.e., the one on page 565. Note, however, that this hash function can only handle lowercase characters and no punctuation. Therefore, modify it so that any non-alphabetic character you encounter as you walk through the String is ignored, and any uppercase character is converted to lowercase before it contributes to hashVal. Do this within the existing for loop (you may find methods in the Character class useful for this purpose). Make sure you don't actually remove the punctuation from the word when you store it; ignore it only for the purposes of computing the hash.
(2 points) Develop a main method that reads the corpus line-by-line, tokenizes it into individual words separated by whitespace, and feeds each word into an instance of WordHashtable (created with at least 10,000 buckets). This main method should then print out the number of words, collisions, and unique words found. Finally, it should repeatedly prompt the user to enter a word, and print out either its frequency or "Not found". To quit, the user should just hit [Enter] (i.e., enter an empty string).
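The hashFunc modification described above might look roughly like the sketch below. The book's page is not reproduced here, so this assumes the "third" technique is the base-27 Horner's-rule hash that applies the modulo at every step; check it against your copy of the book. The class name and the arraySize parameter are also assumptions made so the example stands alone.

```java
class HashFuncSketch {
    // Assumed form of the book's hash: hashVal = (hashVal * 27 + letter) % arraySize,
    // with a=1 ... z=26. Non-letters are skipped and uppercase is folded to lowercase.
    static int hashFunc(String key, int arraySize) {
        int hashVal = 0;
        for (int j = 0; j < key.length(); j++) {
            char ch = Character.toLowerCase(key.charAt(j));
            if (ch < 'a' || ch > 'z') {
                continue;                 // ignore punctuation, digits, etc.
            }
            int letter = ch - 'a' + 1;    // map a..z to 1..26
            // Taking the modulo at each step keeps hashVal below arraySize,
            // so the multiplication cannot overflow for realistic table sizes.
            hashVal = (hashVal * 27 + letter) % arraySize;
        }
        return hashVal;
    }

    public static void main(String[] args) {
        // These two spellings hash to the same bucket; the stored word is untouched.
        System.out.println(hashFunc("Power", 10000) == hashFunc("power!", 10000));
    }
}
```

Note that the key itself is never modified; the filtering only affects which characters contribute to the hash value.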

An execution on Machiavelli's "The Prince" resulted in the following (user input follows each > prompt).
    $ java WordHashtable
    51385 words, 3760 collisions, 7939 unique words
    > power
    Word power has frequency 28
    > ruler
    Not found
    > prince
    Word prince has frequency 142
    >
Note that I did not create a WordHashtableApp class, but rather put the main method in the WordHashtable class itself. Do the same thing. The main method will be written in exactly the same fashion -- the only difference is that the command line execution string changes, since there's no explicit "App".
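For the line-by-line reading and whitespace tokenizing that the main method performs, one possible approach is sketched below. The class and method names are assumptions; the sketch collects tokens into a list only so it can be run on its own, whereas the assignment feeds each word into the hash table instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenizeSketch {
    // Reads every line from the reader and splits it on runs of whitespace.
    static List<String> readWords(BufferedReader in) {
        List<String> words = new ArrayList<>();
        try {
            String line;
            while ((line = in.readLine()) != null) {
                for (String token : line.trim().split("\\s+")) {
                    // A blank line splits into a single empty token; skip it.
                    if (!token.isEmpty()) {
                        words.add(token);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the corpus file in this example.
        BufferedReader in = new BufferedReader(
            new StringReader("It is  necessary\n\nfor a prince,"));
        System.out.println(readWords(in));
    }
}
```

Punctuation stays attached to the tokens here (for example "prince,"), which matches the instruction to keep punctuation in the stored word.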
(6 points) Extra credit: In addition to being able to look up words in the WordHashtable, I'd like to know the top 10 words occurring in the corpus of text. This can be accomplished by using a heap. The strategy is to create an array-backed heap that contains the same instances of Word that are inserted in the hash table -- that is, both the heap and the hash table contain references to the same Word objects, just organized in a different fashion. (If you do this, make sure to clearly indicate you've done so in your README.)
1. Modify your Word class so that it maintains its index position in the heap (heapIndex), with default value -1. The rationale for this will become clear in part 2.
2. Create a WordHeap class, consisting of:
  1. A constructor that takes the number of objects this heap may contain;
  2. An insert method that takes a Word object, inserts it at the end, and updates that Word's heapIndex. Interestingly, you don't need to worry about bubbling up here (why?).
  3. An increasedFreq method that takes a Word object and bubbles it up as necessary, based on its recently-changed frequency. Make sure the Word actually belongs in the heap before doing this. This method won't actually change the frequency; it'll just make sure the heap property is maintained when an external entity changes it. Make sure you update the heapIndexes here as well. You must implement this method in O(log n) time.
  4. A remove method that removes the maximum-frequency Word object (i.e., the one at the root), "fixes" the heap (by moving the last element to the root and bubbling it down), and returns the removed Word. The removed Word should have its heapIndex changed to -1.
  5. An isEmpty method that returns a boolean.
3. Modify the WordHashtable class such that:
  1. In the hash table's constructor, it creates an object-level WordHeap the same size as the hash table;
  2. Whenever it inserts a word for the first time, that same Word is inserted into the heap;
  3. Whenever a word's frequency is incremented, the heap's increasedFreq method is called;
  4. A method called getTop10 is implemented that takes no parameters and returns an array of 10 Words -- the result of 10 removes from the heap -- in order, with the highest-frequency word first. (Yes, this does assume there are 10 to remove; if there aren't, make those cells null.)
4. Modify the main method such that it calls getTop10 and prints the results after it prints the hash table's statistics.
Note that this is a non-trivial extra credit assignment, and it's essentially binary: you'll either get full credit or no credit, unlike the rest of the homework. If you have the time, I encourage you to do it, as it will provide invaluable practice in working with heaps, but make sure to see me with questions first!
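The idea of a heap whose elements remember their own positions can be sketched as below. This is an illustration under assumptions, not a reference solution: the Item type is a hypothetical stand-in for Word, all names are made up for the example, and insert simply trusts the assignment's observation that a newly inserted element needs no bubbling up.

```java
class IndexedHeapSketch {
    // Hypothetical stand-in for Word: a frequency plus its position in the heap.
    static class Item {
        final String name;
        int freq = 1;
        int heapIndex = -1;   // -1 means "not currently in the heap"
        Item(String name) { this.name = name; }
    }

    private final Item[] heap;
    private int size = 0;

    IndexedHeapSketch(int capacity) { heap = new Item[capacity]; }

    boolean isEmpty() { return size == 0; }

    // Appends at the end and records the position; assumes spare capacity.
    void insert(Item item) {
        heap[size] = item;
        item.heapIndex = size++;
    }

    // Restores the heap property after item.freq was raised by the caller.
    // Finding the item is O(1) thanks to heapIndex; the loop is O(log n).
    void increasedFreq(Item item) {
        int i = item.heapIndex;
        if (i < 0 || i >= size || heap[i] != item) return;  // not in this heap
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (heap[parent].freq >= heap[i].freq) break;
            swap(i, parent);
            i = parent;
        }
    }

    // Removes and returns the maximum-frequency item; assumes a non-empty heap.
    Item remove() {
        Item top = heap[0];
        Item last = heap[--size];
        heap[size] = null;
        top.heapIndex = -1;
        if (size > 0) {
            heap[0] = last;           // move the last element to the root
            last.heapIndex = 0;
            int i = 0;
            while (true) {            // bubble down toward the larger child
                int left = 2 * i + 1, right = left + 1, largest = i;
                if (left < size && heap[left].freq > heap[largest].freq) largest = left;
                if (right < size && heap[right].freq > heap[largest].freq) largest = right;
                if (largest == i) break;
                swap(i, largest);
                i = largest;
            }
        }
        return top;
    }

    // Swaps two slots and keeps both items' heapIndex fields in sync.
    private void swap(int a, int b) {
        Item tmp = heap[a];
        heap[a] = heap[b];
        heap[b] = tmp;
        heap[a].heapIndex = a;
        heap[b].heapIndex = b;
    }
}
```

Keeping heapIndex up to date inside swap is the design choice that matters: without it, locating an element before bubbling it up would require a linear scan of the array.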