Theory |
Do the following questions from Schneider & Gersting Third Edition: |
Programming |
The frequency of letters as they appear in written documents is often used in fields like cryptography and language processing. In this assignment, you will write a program that reads a text file and reports the ten most common letters in the text, along with the frequencies (for our purposes, you can assume the text file is all lower case letters and no punctuation). The name of the text file will be provided by the user (you can read it as a program input argument or prompt the user to enter it and use the Scanner to read it), and the output will be the ten most common letters and their frequencies, in sorted order. There are many ways to design the Java classes to implement a solution to this assignment. The suggested one is as follows: First, create a class called LetterCounter that has a single attribute, which is an array of 26 ints. The first element of the array represents the number of occurrences of the character 'a', the second element is for 'b', and so on. You should have an accessor method for this attribute. You should also have a method called "updateCount" which takes a character as its parameter, and then increments the corresponding element in the occurrence array (i.e., if the parameter is 'a', then increment the first element). Next, create a class called FileReader. It should have two attributes: a LetterCounter, and a String representing the name of the file to be read. The file name should be provided as a parameter of the constructor, and you should have an accessor for the LetterCounter attribute. The FileReader should then have the following methods:
Then, create a class called FrequencyAnalysis, which has a single method called "reportFrequency". This static method takes a LetterCounter as its parameter and then prints out (using System.out.println) the top ten most common letters and their frequencies (do not just print out the number of occurrences; you need to report the frequency as a percentage). Last, create a class called FrequencyAnalysisTest that has a main method, and then creates the objects and calls their methods as necessary. The most challenging part of this assignment is: how will you determine the most frequent characters? Do you need to sort the entire array? If so, how will you do that and make sure that you don't lose the character information? If you don't sort it, how will you find the ten most common? Your submission must also include a README file, in which you explain how your program works and how to use it. Also, you should discuss the decisions you made when it came to finding the most frequent letters.
While developing your program, you should create your own small, simple text files to use so that
you can be sure you are correctly reading the words, finding the letters, and keeping track of the number
of occurrences. Once
you think you have your program in a good state, you can try it on a larger file. One such file is the
Declaration of Independence; you can copy a version of it from
http://www.cs.columbia.edu/~cmurphy/1004/declaration.txt.
When you run your program on that file, you should have output that is similar to the following: e: 13.18007662835249% t: 9.777777777777779% o: 7.8773946360153255% n: 7.402298850574713% s: 7.295019157088123% a: 7.264367816091954% i: 6.881226053639847% r: 6.498084291187739% h: 5.3486590038314175% d: 3.89272030651341% Keep in mind that the text file you want to read should be in the directory that is the root of your Eclipse project. If you're not sure where this is, right click on the project name in Eclipse, select "Properties", then "Resource", and the "Location" setting will tell you which folder to put the text file into. A final word of advice: in this assignment, part of your grade will be based on how "elegant" your code is, particularly in how you implement the "reportFrequency" method. If you find yourself copying and pasting large pieces of code, or writing code that looks very similar, then you may want to think twice about the algorithm that you are using. In particular, you should not be creating classes that have 26 attributes (that's way too many), nor should you be writing if/else statements that have 26 clauses (that's way too long). Think about simpler ways to implement the updateCount and reportFrequency methods, and ask a member of the teaching staff if you need help. Extra Credit: for 20 points extra credit, modify your program so that it reports the most frequent bigrams, which are pairs of letters. This is substantially harder than reporting the most frequent letters, since there are only 26 letters but 26*26 bigrams. If you try to do this, you will need to reconsider how you implement your LetterCounter and FrequencyAnalysis classes, and discuss your solution in your README file. You may not use the Map or List classes in the Java API (such as ArrayList or HashMap), you must use arrays. |
Grading |
The Theory assignment will be worth a total of 35 points. All questions are worth five points, except Chapter 6 question 9, which is worth 10. The Programming assignment will be worth a total of 65 points, distributed as follows:
|
Submitting your assignment |
For the Theory assignment, print a paper copy and submit it at the beginning of class. You do not need to submit it electronically. For the Programming assignment, submit a paper copy of all your source code and submit it electronically by the due date. Put all your .java files and your README file in a .zip or .tar file, then follow these instructions:
|