CS3134 Homework #5
Due on
Tuesday, November 30, 2004 at 11:00am
There are two parts to this homework: a written component worth
12 points, and a programming assignment
worth 13 points. See the homework submission instructions on how to
hand it in and for important notes on programming style and
structure.
Written questions
- (9 points) You're given the following list of numbers to
work with.
48, 21, 45, 1, 93, 87, 55, 100, 34, 97
- (1 point) Insert the numbers into a binary search tree.
- (1 point) Using the book's convention, draw the resulting tree when
48 is deleted.
- (1 point) Using the inorder predecessor instead of the
inorder successor, draw the resulting tree when 48 is deleted.
- (1 point) Insert the numbers into an 11-element-array-backed hash
table using linear probing. Use the hash function key % 11.
Draw the resulting table.
- (3 points) Insert the numbers into a 13-element
array-backed hash table using double hashing. The first
hash function is key % 13, and the second
hash function is 5 - (key % 5).
Draw the resulting table.
- (2 points) From (e), state the number of initial collisions
(i.e., collisions under the first hash function), and compute the
average "found" probe length (i.e., count the number of probes needed
to find each of the numbers in the double hashing table, and average these counts over the 10 numbers).
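To make the probing rules in (d) through (f) concrete, here is a minimal sketch of an open-addressing table that counts probes. The class name ProbingTable, its methods, and the convention that a second-hash constant of 0 means linear probing are all assumptions made for illustration; the sketch also assumes non-negative keys and is not the book's code.

```java
/** Illustrative open-addressing table of non-negative int keys (hypothetical names). */
class ProbingTable {
    private final Integer[] slots;          // null marks an empty slot
    private final int secondHashConstant;   // 0 means linear probing (step of 1)

    ProbingTable(int capacity, int secondHashConstant) {
        this.slots = new Integer[capacity];
        this.secondHashConstant = secondHashConstant;
    }

    /** Step size: 1 for linear probing, otherwise constant - (key % constant). */
    private int step(int key) {
        return secondHashConstant == 0
                ? 1
                : secondHashConstant - (key % secondHashConstant);
    }

    /** Inserts the key and returns the number of slots examined (1 = no collision). */
    int insert(int key) {
        int index = key % slots.length;     // first hash function
        int step = step(key);
        int probes = 1;
        while (slots[index] != null) {
            if (probes == slots.length) {
                throw new IllegalStateException("no free slot found");
            }
            index = (index + step) % slots.length;
            probes++;
        }
        slots[index] = key;
        return probes;
    }

    /** Returns the number of probes needed to find the key, or -1 if it is absent. */
    int probesToFind(int key) {
        int index = key % slots.length;
        int step = step(key);
        for (int probes = 1; probes <= slots.length; probes++) {
            if (slots[index] == null) return -1;
            if (slots[index] == key) return probes;
            index = (index + step) % slots.length;
        }
        return -1;
    }
}
```

Summing probesToFind over every inserted key and dividing by the number of keys gives the average "found" probe length; an insert that returns more than 1 had an initial collision.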
- (3 points) You are to work out the steps to generate a
Huffman code for the string "SUNNING IN MISSISSIPPI"
(without quotes).
- (2 points) Using the algorithm outlined in the book and
in class, create a Huffman tree for this string. Make sure
to show the steps involved in the Huffman tree's creation
(e.g., starting with singleton trees in a priority queue).
- (1 point) Given the Huffman tree in (a), encode the
aforementioned string. If we would ordinarily use 7 bits
per character, what are the savings in total number of bits?
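The priority-queue procedure mentioned in (a) can be sketched as follows. This is only an illustration under stated assumptions: the class HuffmanSketch and its methods are hypothetical names, ties between equal weights are broken however java.util.PriorityQueue happens to break them (which may differ from the book's convention, so the tree shape can differ even though the total bit count will not), and generics are used for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Illustrative Huffman construction (hypothetical names, arbitrary tie-breaking). */
class HuffmanSketch {
    static class Node {
        final int weight;
        final Character symbol;   // null for internal nodes
        final Node left, right;
        Node(int weight, Character symbol, Node left, Node right) {
            this.weight = weight; this.symbol = symbol;
            this.left = left; this.right = right;
        }
    }

    /** Builds the tree: start with singleton trees, repeatedly merge the two lightest. */
    static Node buildTree(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toCharArray()) freq.merge(c, 1, Integer::sum);

        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getValue(), e.getKey(), null, null));
        }
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(a.weight + b.weight, null, a, b));
        }
        return queue.poll();      // null when the text is empty
    }

    /** Walks the tree, appending 0 for a left branch and 1 for a right branch. */
    static void collectCodes(Node node, String prefix, Map<Character, String> codes) {
        if (node == null) return;
        if (node.symbol != null) {
            // A one-symbol text still needs one bit per character.
            codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        collectCodes(node.left, prefix + "0", codes);
        collectCodes(node.right, prefix + "1", codes);
    }

    /** Total number of bits needed to encode the text with its own Huffman code. */
    static int encodedBits(String text) {
        Map<Character, String> codes = new HashMap<>();
        collectCodes(buildTree(text), "", codes);
        int bits = 0;
        for (char c : text.toCharArray()) bits += codes.get(c).length();
        return bits;
    }
}
```

Comparing encodedBits(text) against 7 * text.length() gives the savings for a fixed 7-bit encoding.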
Programming problem
The goal of this programming exercise is to develop a search tool for
email using a tree-based dictionary structure. There are two major
parts to this assignment: modifying the Tree data structure to support
this data, and writing an app that parses an email folder file and
inserts the relevant data into the tree, thereby allowing lookup.
The keys that will be used are every word in an email body, and the
associated value will be the email header. In other words, you'll
build the email header from the header fields, and then will insert it
against every word key. Since a word may be contained in multiple
emails, we'll use a linked list to store a list of email headers
associated with each key word.
The mail format that we're going to read is the UNIX mbox format -
this is the format your mail is stored in if you use CUBMail or Pine.
If you look on your CUNIX account, there may be a mail folder containing
all your mail, with one file per folder. Your INBOX isn't stored
there, though -- that may be stored in a special file called mbox
in your home directory, or may be located in a more esoteric location.
You're welcome to test your code against your mail, but I've also
provided a sample mailbox here -- it's a
collection of the public emails I've gotten from
Dean Zvi Galil, of SEAS,
from the last year or so.
The mbox format can be described as follows (look at the sample
mailbox I've provided). For a full technical description, try
typing "man 5 mbox" on CUNIX.
- Each new message starts with the string "From " -- that is,
the word From, with a capital F, followed by a space.
That's the beginning of the headers of a new set of email.
- That line is not the From field in an email. The From field,
along with the others, is stored in the header lines starting with the
strings "From:", "To:", "Date:", and "Subject:".
Apart from those four, you can safely ignore the rest of the headers for
the purposes of this assignment (except the "From " line above, which acts
as the separator between emails). Note: There can potentially be emails
missing one or more of the headers, as they're optional, so make sure
your program doesn't get confused if it doesn't find one of them.
- The body of the email is separated from the headers by
exactly one blank line. We will treat each word case-insensitively,
so that "Hello" is the same as "hello", but for simplicity we will treat
words with punctuation as distinct, so "Hello" is different
from "Hello!"
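The three rules above amount to classifying each line of the file. A minimal sketch of that classification follows; the class MboxLines and its method names are hypothetical, and the address in the usage note is made up.

```java
/** Illustrative helpers for classifying mbox lines (hypothetical names). */
class MboxLines {
    /** True for the separator line that begins a new message: "From" plus a space. */
    static boolean startsNewMessage(String line) {
        return line.startsWith("From ");
    }

    /** True for the blank line that separates the headers from the body. */
    static boolean endsHeaders(String line) {
        return line.length() == 0;
    }

    /**
     * Returns the trimmed value if the line begins with the given header name
     * (for example "Subject:"), or null if it is some other line.
     */
    static String headerValue(String line, String headerName) {
        if (!line.startsWith(headerName)) {
            return null;
        }
        return line.substring(headerName.length()).trim();
    }
}
```

For instance, startsNewMessage("From dean@example.edu Tue Nov 30 11:00:00 2004") is true, while startsNewMessage("From: Dean") is false because the fifth character is a colon rather than a space.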
So, what are you going to do with this?
- (2 points) Write an EmailHeader class that will store four
Strings: the From address, the To address, the Subject,
and the Date. Note that you can store the date as a String
without any side-effects for this assignment. Also write a
constructor that takes those four parameters and stores them, and a
toString method that generates a String representation of the
email header (recommendation: use a tab whitespace character,
e.g., "\t", between each field so that they appear nicely in
columns when the EmailHeaders are later printed out).
- (4 points) Modify the Tree code as supplied with the book
(downloadable from here) as follows.
- (1 point) In the Node class, change the key to be of
type String and the associated value to be of type
LinkedList. By doing so, we will be able to look up any
word in the tree and get a list of matching EmailHeader
objects (i.e., those that would be stored in the LinkedList).
Like the previous homework assignment, you can use Java's
LinkedList class.
- (2 points) In the Tree class, modify the insert
method so that it takes a String key and an EmailHeader
value. It should then search for the key in the tree. If
it's not found, it should add it, create a new LinkedList
associated with it, and add the EmailHeader as the first
item in that LinkedList. If it is found, it should
first search the LinkedList to see if this EmailHeader
has been added (to take care of the situation where an email has two
instances of the same word), and if not, add it to the
LinkedList.
- (1 point) In the Tree class, modify the find
method so that it takes a String key and returns the
associated LinkedList, or null if it's not found.
- For simplicity's sake, you can remove all the other methods from
the Tree class - we won't be using them.
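The insert and find behavior described above could look roughly like the sketch below. It is not the book's Tree code: the class name WordTree, the field names, and the generic value type V (standing in for EmailHeader so that the block compiles on its own) are assumptions, and generics are used where a pre-generics LinkedList would need casts.

```java
import java.util.LinkedList;

/** Illustrative binary search tree keyed by String (hypothetical names). */
class WordTree<V> {
    private static class Node<V> {
        final String key;
        final LinkedList<V> values = new LinkedList<>();
        Node<V> leftChild, rightChild;
        Node(String key) { this.key = key; }
    }

    private Node<V> root;

    /** Adds value under key, creating the node if needed and skipping duplicates. */
    void insert(String key, V value) {
        if (root == null) {
            root = new Node<>(key);
            root.values.add(value);
            return;
        }
        Node<V> current = root;
        while (true) {
            int cmp = key.compareTo(current.key);
            if (cmp == 0) {
                // Key already present: add only if this value is not yet listed.
                if (!current.values.contains(value)) current.values.add(value);
                return;
            }
            Node<V> next = cmp < 0 ? current.leftChild : current.rightChild;
            if (next == null) {
                Node<V> created = new Node<>(key);
                created.values.add(value);
                if (cmp < 0) current.leftChild = created;
                else current.rightChild = created;
                return;
            }
            current = next;
        }
    }

    /** Returns the list stored under key, or null if the key is absent. */
    LinkedList<V> find(String key) {
        Node<V> current = root;
        while (current != null) {
            int cmp = key.compareTo(current.key);
            if (cmp == 0) return current.values;
            current = cmp < 0 ? current.leftChild : current.rightChild;
        }
        return null;
    }
}
```

Because contains() relies on equals(), inserting the same header object twice for one word leaves a single entry in the list, which covers the case of an email that repeats a word.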
- (7 points) Write an EmailSearcher app class that reads mail
files, inserts the matches into the tree, and lets us search for the
results. You will only need one main method.
- (1 point) Make sure the main method takes one argument -- the
filename of the mailbox to read -- and opens up a BufferedReader
for it. Also make sure to instantiate a new Tree to
store our information.
- (4 points) Parse the mailbox. To do this, read the file one
line (String) at a time and decide what each line represents.
- If it's the start of a new email, i.e., the String
starts with "From ", start up another loop that
reads the headers (i.e., until the blank line that starts the
body). This loop should look for the four aforementioned
headers and store them in a new instance of EmailHeader.
- Otherwise, read each line of body text, split it into
individual words, and insert the (word, email header) pair into
the tree for every word encountered on every line.
You should use the last EmailHeader object built from
the headers. This sounds wasteful, but in fact it's not:
each LinkedList stores only a reference to that one EmailHeader
object, so nothing is copied until you build a new one for the next email.
- (2 points) Finally, prompt the user, repeatedly, to enter a
word. Once the user enters that word, search the tree for that
word. Print out the resulting email headers, or "not found"
if the word is not found. Input should end when the user
enters a blank line.
- (2 points extra credit) Handle punctuation by stripping it
out before inserting it into the tree, so that the limitation as
described at the beginning of this section is no longer an issue.
Make sure to indicate in your README if you have done so.
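The parsing steps above can be sketched end to end as follows. To keep the block self-contained, a java.util.TreeMap stands in for the modified Tree class, and EmailHeaderSketch stands in for your EmailHeader; both names, along with MailboxIndexer and its methods, are assumptions for illustration rather than the required design. Missing headers are stored as empty strings here, which is one possible choice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.LinkedList;
import java.util.Map;
import java.util.TreeMap;

/** Stand-in for the EmailHeader class (hypothetical name). */
class EmailHeaderSketch {
    final String from, to, subject, date;
    EmailHeaderSketch(String from, String to, String subject, String date) {
        this.from = from; this.to = to; this.subject = subject; this.date = date;
    }
    @Override public String toString() {
        return from + "\t" + to + "\t" + subject + "\t" + date;
    }
}

/** Illustrative mailbox parser; a TreeMap stands in for the assignment's Tree. */
class MailboxIndexer {
    static Map<String, LinkedList<EmailHeaderSketch>> index(BufferedReader in)
            throws IOException {
        Map<String, LinkedList<EmailHeaderSketch>> index = new TreeMap<>();
        EmailHeaderSketch current = null;   // header of the email being read
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("From ")) {
                current = readHeaders(in);  // consumes lines up to the blank line
            } else if (current != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (word.isEmpty()) continue;
                    LinkedList<EmailHeaderSketch> list =
                            index.computeIfAbsent(word.toLowerCase(),
                                                  k -> new LinkedList<>());
                    // The same object is shared by every word of this email.
                    if (!list.contains(current)) list.add(current);
                }
            }
        }
        return index;
    }

    /** Reads header lines until the blank line; absent headers stay empty. */
    private static EmailHeaderSketch readHeaders(BufferedReader in) throws IOException {
        String from = "", to = "", subject = "", date = "";
        String line;
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            if (line.startsWith("From:")) from = line.substring(5).trim();
            else if (line.startsWith("To:")) to = line.substring(3).trim();
            else if (line.startsWith("Subject:")) subject = line.substring(8).trim();
            else if (line.startsWith("Date:")) date = line.substring(5).trim();
        }
        return new EmailHeaderSketch(from, to, subject, date);
    }
}
```

A search loop would then read a word from the user, look it up in lowercase, and print each header in the returned list (or "not found" when the lookup returns null), stopping at a blank line.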
Tips:
- Use the LinkedList Javadoc to your advantage. There
are lots of methods to make your life easier (such as the contains()
method). Also, remember the rules about casting from HW4.
- Avoid using mailboxes with attachments -- it'll add lots of garbage
to your tree. It should work, but it may cause your program
to eat more memory than you desired. We will grade you on simple
mailboxes, like the one provided.
- There are several useful methods in the String class.
(Once again, use the Javadocs to your advantage!)
- trim() removes whitespace on either side of the
String. In particular, you may find this useful when
processing email headers, as the actual From, etc. may be separated
from the header with an indeterminate amount of whitespace.
- split() lets you split a String into multiple
pieces. In particular, s.split(" ") splits the
String into multiple pieces separated by a space. This is
the "modern" equivalent of StringTokenizer, although you're
welcome to use that as well if you prefer.
- And, of course, there are the case-insensitive comparison methods
(e.g., equalsIgnoreCase() and compareToIgnoreCase()).
- As a simple test, search for the word "snow" in the email
folder I provided - there's exactly one email from Zvi on that note.
You can also open the file in Pine or CUBMail and verify that your
searcher is producing sensible results.
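As an illustration of the String tips above, here is a small sketch that turns one line of body text into lowercase words, optionally stripping punctuation as in the extra credit. The class WordSplitter and its words method are hypothetical names. It splits on the regular expression "\\s+" rather than " ", since split(" ") yields empty strings when words are separated by more than one space; treating every \p{Punct} character as removable is an assumption (it turns "it's" into "its", for example).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative word splitter (hypothetical names). */
class WordSplitter {
    /** Returns the lowercase words of a line, optionally without punctuation. */
    static List<String> words(String line, boolean stripPunctuation) {
        List<String> result = new ArrayList<>();
        // trim() drops leading/trailing whitespace; "\\s+" splits on runs of whitespace.
        for (String piece : line.trim().split("\\s+")) {
            String word = piece.toLowerCase();
            if (stripPunctuation) {
                word = word.replaceAll("\\p{Punct}", "");
            }
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }
}
```

With stripPunctuation set to false, "Hello" and "Hello!" remain distinct keys, matching the default rule of the assignment; with it set to true, both become "hello".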