Spam Analysis

Authors

Mohit Vazirani
Columbia University
New York, NY 10027
USA
mcv2107@columbia.edu

Abstract

This project analyzes message headers such as From, Return-Path, Received, List-Id and Message-Id, which contain e-mail addresses, IP addresses, host domain names, mailing list information and the message identifier. This data is used as input to several tests that help distinguish the characteristics of spam from those of non-spam messages. The code operates on three sources of messages:

The classes MailHeaderParser, SpamArchive and IETF implement the parsers for these three datasets, respectively.
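
As an illustration of the kind of header extraction described above, the following is a minimal sketch and not the project's actual code. The class HeaderFieldSketch and its methods are hypothetical names. It uses only the Java standard library rather than mail.jar, and it assumes that relay IP addresses appear as bracketed IPv4 literals in Received lines, which is common but not guaranteed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not one of the project's classes.
final class HeaderFieldSketch {

    // Assumption: relay addresses look like "[192.0.2.10]" inside Received lines.
    private static final Pattern BRACKETED_IPV4 =
            Pattern.compile("\\[(\\d{1,3}(?:\\.\\d{1,3}){3})\\]");
    // Simplified domain matcher; real addresses can be more complex.
    private static final Pattern ADDRESS_DOMAIN =
            Pattern.compile("@([A-Za-z0-9.-]+)");

    /** Maps lower-cased header names to their values, unfolding continuation lines. */
    static Map<String, List<String>> parseHeaders(String raw) {
        Map<String, List<String>> headers = new LinkedHashMap<>();
        String name = null;
        StringBuilder value = new StringBuilder();
        for (String line : raw.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line ends the header section
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded line: append to the header currently being read.
                if (name != null) {
                    value.append(' ').append(line.trim());
                }
                continue;
            }
            if (name != null) {
                headers.computeIfAbsent(name, k -> new ArrayList<>()).add(value.toString());
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                name = null; // not a "Name: value" line, skip it
                continue;
            }
            name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            value = new StringBuilder(line.substring(colon + 1).trim());
        }
        if (name != null) {
            headers.computeIfAbsent(name, k -> new ArrayList<>()).add(value.toString());
        }
        return headers;
    }

    /** Returns every bracketed IPv4 literal found in a Received header value. */
    static List<String> extractIpv4(String received) {
        List<String> ips = new ArrayList<>();
        Matcher m = BRACKETED_IPV4.matcher(received);
        while (m.find()) {
            ips.add(m.group(1));
        }
        return ips;
    }

    /** Returns the lower-cased domain of the first address in the value, or null. */
    static String domainOf(String headerValue) {
        Matcher m = ADDRESS_DOMAIN.matcher(headerValue);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        String raw = "From: Alice <alice@example.org>\r\n"
                + "Received: from mail.example.org (mail.example.org [192.0.2.10])\r\n"
                + "\tby mx.example.net; Mon, 1 Jan 2007 00:00:00 -0500\r\n"
                + "Message-Id: <1234@example.org>\r\n\r\nbody";
        Map<String, List<String>> headers = parseHeaders(raw);
        System.out.println(domainOf(headers.get("from").get(0)));
        System.out.println(extractIpv4(headers.get("received").get(0)));
    }
}
```

The extracted domains and IP addresses are the sort of values that could then be fed to the tests mentioned above, for example the reachability test.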

System Requirements

The project was written in Java under Windows XP and modified slightly to work under Linux. The MailHeaderParser project requires the libraries activation.jar and mail.jar to compile and run. To test hosts for reachability, fping must be installed, and it needs root access to execute.

Installation

To install the fping utility, type the following:
$ tar -xvf fping.tar
$ cd fping-2.4b2_to
$ ./configure
$ make
$ sudo make install

Extract the compressed project archive by typing:
$ tar -xvzf spam-analysis.tar.gz

Change into the project directory by typing:
$ cd spam-archive

To compile the CUNIX parser, type:
$ make buildcunix

To compile the IETF parser, type:
$ make buildietf

To compile the SpamArchive parser, type:
$ make buildsa

To compile all parsers, type:
$ make buildall

To execute the CUNIX parser, type:
$ make runcunix

To determine the number of hosts reachable by ping for the CUNIX parser, type:
$ fping -c 1 -f MailHeaderParser/cunix_folderName_ping 2>/dev/null | wc -l

To determine the total number of hosts counted in the reachability test for the CUNIX parser, type:
$ wc -l MailHeaderParser/cunix_folderName_ping

To determine the number of hosts reachable by ping for the IETF parser, type:
$ fping -c 1 -f IETF/ietf_ping 2>/dev/null | wc -l

To determine the total number of hosts counted in the reachability test for the IETF parser, type:
$ wc -l IETF/ietf_ping

To determine the number of hosts reachable by ping for the SpamArchive parser, type:
$ fping -c 1 -f SpamArchive/spam_archive_ping 2>/dev/null | wc -l

To determine the total number of hosts counted in the reachability test for the SpamArchive parser, type:
$ wc -l SpamArchive/spam_archive_ping
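
The two counts produced for each parser (hosts reachable by ping, and total hosts in the ping list) can be combined into a reachability percentage. The following is a minimal sketch of that arithmetic. The class ReachabilitySketch, its method names and the sample counts are hypothetical and are not part of the project. Note that countNonBlank skips blank lines, whereas wc -l counts every newline.

```java
import java.util.List;

// Hypothetical helper for illustration; not one of the project's classes.
final class ReachabilitySketch {

    /** Counts non-blank lines, e.g. the host entries read from a ping list file. */
    static long countNonBlank(List<String> lines) {
        return lines.stream().filter(l -> !l.trim().isEmpty()).count();
    }

    /** Percentage of hosts that answered, given the two counts. */
    static double reachablePercent(long reachable, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        if (reachable < 0 || reachable > total) {
            throw new IllegalArgumentException("reachable must be between 0 and total");
        }
        return 100.0 * reachable / total;
    }

    public static void main(String[] args) {
        // Made-up sample values, not results from the project.
        long total = countNonBlank(List.of("192.0.2.1", "192.0.2.2", "", "192.0.2.3", "192.0.2.4"));
        long reachable = 1;
        System.out.printf("%.1f%% of %d hosts reachable%n",
                reachablePercent(reachable, total), total);
    }
}
```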

To execute the IETF parser, type:
$ make runietf

To execute the SpamArchive parser, type:
$ make runsa

To delete the class files and generated output files for the CUNIX parser, type:
$ make cleancunix

To delete the class files and generated output files for the IETF parser, type:
$ make cleanietf

To delete the class files and generated output files for the SpamArchive parser, type:
$ make cleansa

To delete the class files and generated output files for all parsers, type:
$ make cleanall

Operation

On execution, every project prints debugging output to standard output and writes statistics, successful outcomes, failed outcomes and other output files to the project folder.
Detailed information about class members is available in the generated javadoc files.

Restrictions

Useful Enhancements

Further work may consist of the following enhancements:

Acknowledgements

I would like to thank Prof. Henning Schulzrinne for his guidance and continued support on this project.