CS3134 Homework #5
Due on Tuesday, November 30, 2004 at 11:00am

There are two parts to this homework: a written component worth 12 points, and a programming assignment worth 13 points. See the homework submission instructions on how to hand it in and for important notes on programming style and structure.

Written questions

(9 points) You're given the following list of numbers to work with.
48, 21, 45, 1, 93, 87, 55, 100, 34, 97
1. (1 point) Insert the numbers into a binary search tree.
2. (1 point) Using the book's convention, draw the resulting tree when 48 is deleted.
3. (1 point) Using the inorder predecessor instead of the inorder successor, draw the resulting tree when 48 is deleted.
4. (1 point) Insert the numbers into an 11-element array-backed hash table using linear probing. Use the hash function key % 11. Draw the resulting table.
5. (3 points) Insert the numbers into a 13-element array-backed hash table using double hashing. The first hash function is key % 13, and the second hash function is 5 - (key % 5). Draw the resulting table.
6. (2 points) For the table in part 5, state the number of initial collisions (i.e., collisions on the first hash function), and compute the average "found" probe length (i.e., count the number of probes needed to find each of the 10 numbers in the double-hashing table, and average those counts).
(3 points) You are to work out the steps to generate a Huffman code for the string "SUNNING IN MISSISSIPPI" (without quotes).
1. (2 points) Using the algorithm outlined in the book and in class, create a Huffman tree for this string. Make sure to show the steps involved in the Huffman tree's creation (e.g., starting with singleton trees in a priority queue).
2. (1 point) Given the Huffman tree from part 1, encode the aforementioned string. If we would ordinarily use 7 bits per character, how many bits are saved in total?
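
As a reminder of how the double-hashing probe rule from part 5 of the first question works, here is a minimal sketch. It uses the two hash functions from that question, but the class name, the EMPTY marker, and the three sample keys are illustrative assumptions chosen for this example; they are not the homework's number list, and the book's hash table code may be organized differently.

```java
import java.util.Arrays;

class DoubleHashSketch {
    static final int EMPTY = -1; // assumption: all keys are non-negative

    // Inserts key into the table and returns the slot it landed in.
    static int insert(int[] table, int key) {
        int size = table.length;
        int slot = key % size;      // first hash: the home position
        int step = 5 - (key % 5);   // second hash: a step size in 1..5
        for (int probes = 0; probes < size; probes++) {
            if (table[slot] == EMPTY) {
                table[slot] = key;
                return slot;
            }
            slot = (slot + step) % size; // collision: jump ahead by the step
        }
        throw new IllegalStateException("table is full");
    }

    public static void main(String[] args) {
        int[] table = new int[13];
        Arrays.fill(table, EMPTY);
        // All three sample keys share the home position 1.
        for (int key : new int[] {14, 27, 40}) {
            System.out.println(key + " -> slot " + insert(table, key));
        }
    }
}
```

Because the step size depends on the key, keys that collide at the same home position usually follow different probe sequences, which is what distinguishes double hashing from linear probing.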

Programming problem

The goal of this programming exercise is to develop a search tool for email using a tree-based dictionary structure. There are two major parts to this assignment: modifying the Tree data structure to support this data, and writing an application that parses an email folder file and inserts the relevant data into the tree, thereby allowing lookup. The keys are the words in each email body, and the value associated with each key is the email header. In other words, you'll build an email header object from the header fields, and then insert it under every word of that email's body. Since a word may be contained in multiple emails, we'll use a linked list to store the email headers associated with each key word.

The mail format that we're going to read is the UNIX mbox format - this is the format your mail is stored in if you use CUBMail or Pine. If you look on your CUNIX account, there may be a mail directory containing all your mail, with one file per mail folder. Your INBOX isn't stored there, though -- that may be stored in a special file called mbox in your home directory, or may be located in a more esoteric location. You're welcome to test your code against your own mail, but I've also provided a sample mailbox here -- it's a collection of the public emails I've gotten from Dean Zvi Galil, of SEAS, from the last year or so.

The mbox format can be described as follows (look at the sample mailbox I've provided). For a full technical description, try typing "man 5 mbox" on CUNIX.

Each new message starts with the string "From " -- that is, the word From, with a capital F, followed by a space. That's the beginning of the headers of a new email.
That line is not the From field of the email. The From field, along with the others, is stored in the header lines starting with the strings "From:", "To:", "Date:", and "Subject:". Apart from those four, you can safely ignore the rest of the headers for the purposes of this assignment (except for the "From " line above, which acts as the separator between emails). Note: Any of these headers may be missing from an email, as they're optional, so make sure your program doesn't get confused if it can't find one.
The body of the email is separated from the headers by exactly one blank line. We will treat words case-insensitively, so that "Hello" is the same as "hello", but for simplicity we will treat words with attached punctuation as distinct, so "Hello" is different from "Hello!".
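
The three rules above can be sketched as a small line classifier. This is only an illustration under stated assumptions: the class and method names are hypothetical, the sample text is made up, and it takes the mailbox as a String rather than reading from a BufferedReader as your program will. It collects just the Subject of each message and shows one way to cope with a missing header.

```java
import java.util.ArrayList;
import java.util.List;

class MboxLineSketch {
    // Returns the text after a header name such as "Subject:", trimmed,
    // or null if the line does not start with that header name.
    static String headerValue(String line, String name) {
        if (line.startsWith(name)) {
            return line.substring(name.length()).trim();
        }
        return null;
    }

    // Collects the Subject of every message; a message without one yields "".
    static List<String> subjects(String mboxText) {
        List<String> result = new ArrayList<String>();
        boolean inHeaders = false;
        for (String line : mboxText.split("\n")) {
            if (line.startsWith("From ")) {               // separator: new message
                inHeaders = true;
                result.add("");
            } else if (inHeaders && line.length() == 0) { // blank line: body starts
                inHeaders = false;
            } else if (inHeaders) {
                String value = headerValue(line, "Subject:");
                if (value != null) {
                    result.set(result.size() - 1, value);
                }
            }
            // Body lines are ignored in this sketch.
        }
        return result;
    }

    public static void main(String[] args) {
        String sample =
            "From dean@example.edu Mon Nov  1 10:00:00 2004\n"
            + "From: dean@example.edu\n"
            + "Subject:    Snow day\n"
            + "\n"
            + "Campus is closed.\n"
            + "From nobody@example.edu Tue Nov  2 09:00:00 2004\n"
            + "To: all@example.edu\n"
            + "\n"
            + "No subject here.\n";
        System.out.println(subjects(sample));
    }
}
```

Note that "From " (with a space) and "From:" (with a colon) never match the same line, so the separator test and the header test cannot be confused with each other.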

So, what are you going to do with this?

(2 points) Write an EmailHeader class that will store four Strings: the From address, the To address, the Subject, and the Date. Note that you can store the date as a String without any ill effects for this assignment. Also write a constructor that takes those four parameters and stores them, and a toString method that returns a String representation of the email header (recommendation: use a tab character, i.e., "\t", between each field so that they appear nicely in columns when the EmailHeaders are later printed out).
(4 points) Modify the Tree code as supplied with the book (downloadable from here) as follows.
- (1 point) In the Node class, change the key to be of type String and the associated value to be of type LinkedList. By doing so, we will be able to look up any word in the tree and get a list of matching EmailHeader objects (i.e., those that would be stored in the LinkedList). Like the previous homework assignment, you can use Java's LinkedList class.
- (2 points) In the Tree class, modify the insert method so that it takes a String key and an EmailHeader value. It should then search for the key in the tree. If it's not found, it should add it, create a new LinkedList associated with it, and add the EmailHeader as the first item in that LinkedList. If it is found, it should first search the LinkedList to see if this EmailHeader has been added (to take care of the situation where an email has two instances of the same word), and if not, add it to the LinkedList.
- (1 point) In the Tree class, modify the find method so that it takes a String key and returns the associated LinkedList, or null if it's not found.
- For simplicity's sake, you can remove all the other methods from the Tree class - we won't be using them.
(7 points) Write an EmailSearcher app class that reads a mail file, inserts its words into the tree, and lets us search the tree for them. You will only need one main method.
- (1 point) Make sure the main method takes one argument -- the filename of the mailbox to read -- and opens up a BufferedReader for it. Also make sure to instantiate a new Tree to store our information.
- (4 points) Parse the mailbox. To do this, keep reading one line at a time as a String and decide what kind of line it is.
  - If it's the start of a new email, i.e., the String starts with "From ", start up another loop that reads the headers (i.e., until the blank line that starts the body). This loop should look for the four aforementioned headers and store them in a new instance of EmailHeader.
  - Otherwise, read each line of body text, split it into individual words, and insert the (word, email header) pair into the tree for every word encountered on every line. You should use the last EmailHeader object built from the headers. This sounds wasteful, but in fact it's not - each list stores only a reference to that one EmailHeader object, which is reused over and over until you get to the next email.
- (2 points) Finally, prompt the user, repeatedly, to enter a word. Once the user enters that word, search the tree for that word. Print out the resulting email headers, or "not found" if the word is not found. Input should end when the user enters a blank line.
(2 points extra credit) Handle punctuation by stripping it from each word before inserting the word into the tree, so that the limitation described at the beginning of this section is no longer an issue. Make sure to indicate in your README if you have done so.
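
The insert and find behavior described in the Tree steps above can be sketched roughly as follows. This is not the book's Tree code: the class, the Node layout, and the Header stand-in (which keeps only a subject) are all assumptions made for illustration, it uses generics where the supplied code may need casts, and lower-casing the key is just one way to get case-insensitive matching.

```java
import java.util.LinkedList;
import java.util.Locale;

class WordIndexSketch {
    // Hypothetical stand-in for the assignment's EmailHeader class.
    static class Header {
        final String subject;
        Header(String subject) { this.subject = subject; }
        public String toString() { return subject; }
    }

    static class Node {
        final String key;
        final LinkedList<Header> headers = new LinkedList<Header>();
        Node left, right;
        Node(String key) { this.key = key; }
    }

    private Node root;

    // Adds header under word, creating the node if the word is new.
    void insert(String word, Header header) {
        String key = word.toLowerCase(Locale.ROOT);
        if (root == null) {
            root = new Node(key);
            root.headers.add(header);
            return;
        }
        Node current = root;
        while (true) {
            int cmp = key.compareTo(current.key);
            if (cmp == 0) {
                // contains() uses equals(); Header does not override it, so
                // this skips the same object when a word repeats in one email.
                if (!current.headers.contains(header)) {
                    current.headers.add(header);
                }
                return;
            }
            Node next = (cmp < 0) ? current.left : current.right;
            if (next == null) {
                Node created = new Node(key);
                created.headers.add(header);
                if (cmp < 0) { current.left = created; } else { current.right = created; }
                return;
            }
            current = next;
        }
    }

    // Returns the list stored under word, or null if the word is absent.
    LinkedList<Header> find(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        Node current = root;
        while (current != null) {
            int cmp = key.compareTo(current.key);
            if (cmp == 0) { return current.headers; }
            current = (cmp < 0) ? current.left : current.right;
        }
        return null;
    }

    public static void main(String[] args) {
        WordIndexSketch index = new WordIndexSketch();
        Header first = new Header("Snow day");
        Header second = new Header("Finals schedule");
        index.insert("Snow", first);
        index.insert("snow", first);   // same email, same word: not added twice
        index.insert("snow", second);
        System.out.println(index.find("SNOW"));
        System.out.println(index.find("rain"));
    }
}
```

Because every word of one email is inserted with the same Header object, the duplicate check only needs reference equality, which is what the default equals() provides.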

Tips:

Use the LinkedList Javadoc to your advantage. There are lots of methods to make your life easier (such as the contains() method). Also, remember the rules about casting from HW4.
Avoid using mailboxes with attachments -- it'll add lots of garbage to your tree. It should work, but it may cause your program to eat more memory than you desired. We will grade you on simple mailboxes, like the one provided.
There are several useful methods in the String class. (Once again, use the Javadocs to your advantage!)
- trim() removes whitespace on either side of the String. In particular, you may find this useful when processing email headers, as the actual From, etc. may be separated from the header with an indeterminate amount of whitespace.
- split() lets you split a String into multiple pieces. In particular, s.split(" ") splits the String into multiple pieces separated by a space. This is the "modern" equivalent of StringTokenizer, although you're welcome to use that as well if you prefer.
- And, of course, there are the case-insensitive comparison methods, equalsIgnoreCase() and compareToIgnoreCase().
As a simple test, search for the word "snow" in the email folder I provided - there's exactly one email from Zvi on that note. You can also open the file in Pine or CUBMail and verify that your searcher is producing sensible results.
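
To make the String tips above concrete, here is a small sketch of how the methods behave. The class and the words helper are hypothetical names for this example. It also shows one wrinkle worth knowing: s.split(" ") produces empty strings when words are separated by more than one space, so trimming first and splitting on the regular expression "\\s+" (one or more whitespace characters) is one possible way around that.

```java
class StringTipsSketch {
    // Splits a line into words, tolerating leading, trailing, and repeated
    // whitespace. Returns an empty array for a blank line.
    static String[] words(String line) {
        String trimmed = line.trim();
        if (trimmed.length() == 0) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // trim() drops the padding around a header value.
        String line = "Subject:     Snow day   ";
        String value = line.substring("Subject:".length()).trim();
        System.out.println("[" + value + "]");

        // A plain split(" ") keeps an empty piece between two adjacent spaces.
        System.out.println("a  b".split(" ").length);

        // The helper above skips those empty pieces.
        System.out.println(words("  Hello   snowy world ").length);

        // Case-insensitive comparisons.
        System.out.println("Hello".equalsIgnoreCase("hello"));
        System.out.println("Hello".compareToIgnoreCase("hello"));
    }
}
```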