The design of the server-based approach and the Email Source Analysis module is as follows:
As shown in Figure 1.1, whenever a new email is received, the parser passes the message_id of the received message to the Email Source Analysis module through a controller.
Using that message_id, the module retrieves from the database the sender's address that the parser stored for the message.
The friend list consists of all the addresses to which mail has been sent from the mail server.
The module queries the database, in which the friend list is populated from the IMAP sent folder, and compares the sender's address with every entry in the friend list.
If a match is found, the sender is classified as a friend of the server; otherwise, the sender is classified as a non-friend. The results are stored in the database for further analysis.
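The comparison step described above can be sketched in Java as follows. This is only an illustrative, in-memory sketch and not the module's actual code: the class name FriendClassifier, its methods, and the choice to compare addresses case-insensitively after trimming whitespace are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: holds the friend list in memory and classifies senders.
class FriendClassifier {
    private final Set<String> friends = new HashSet<>();

    // sentAddresses: every address to which mail has been sent from the server.
    FriendClassifier(List<String> sentAddresses) {
        for (String address : sentAddresses) {
            friends.add(normalize(address));
        }
    }

    // Assumption: addresses are compared case-insensitively, ignoring
    // surrounding whitespace. The original module may compare differently.
    static String normalize(String address) {
        return address.trim().toLowerCase(Locale.ROOT);
    }

    // True if the sender's address matches an entry in the friend list.
    boolean isFriend(String senderAddress) {
        return friends.contains(normalize(senderAddress));
    }

    // Returns the label that would be stored as the result of the test.
    String classify(String senderAddress) {
        return isFriend(senderAddress) ? "friend" : "non-friend";
    }
}
```

Keeping the friend list in a hash set makes each lookup a constant-time operation on average, instead of a scan over every entry of the list.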
Implementation
The Email Source Analysis module is written in Java and uses JDBC to connect to the database. The module uses the following database tables:
1. 'TO' is the table that stores the recipient's information retrieved from the message header. It is populated by the parser.
2. 'FROM' is the table that stores the sender's information retrieved from the message header. It is populated by the parser.
3. 'FRIEND_LIST' is the table that stores each address to which mail has been sent, together with the server from which that mail was sent. It will be populated using IMAP retrieval at regular intervals.
4. 'FRIEND_RESULT' is the table that stores the results of this test. For each received mail, it records whether the mail is from a known sender.
Database Schema can be found at: http://wiki.cs.columbia.edu:8080/display/spam/DatabaseSchema
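A minimal JDBC sketch of how these tables might be used together is given below. It is not the module's actual code. The column names (address, message_id, is_friend), the string type of message_id, and the class name FriendResultDao are hypothetical; the real names are defined in the schema linked above. Because FROM is an SQL reserved word, the table name is written here as a double-quoted identifier, which is standard SQL but may need to be adapted to the database in use.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical data-access helper; table names come from the text above,
// column names are assumptions made for this sketch.
class FriendResultDao {
    static final String SELECT_SENDER =
        "SELECT address FROM \"FROM\" WHERE message_id = ?";
    static final String COUNT_FRIEND =
        "SELECT COUNT(*) FROM FRIEND_LIST WHERE address = ?";
    static final String INSERT_RESULT =
        "INSERT INTO FRIEND_RESULT (message_id, is_friend) VALUES (?, ?)";

    // Looks up the sender of a message, checks it against FRIEND_LIST,
    // stores the outcome in FRIEND_RESULT, and returns it.
    static boolean classifyAndStore(Connection conn, String messageId)
            throws SQLException {
        String sender = null;
        try (PreparedStatement ps = conn.prepareStatement(SELECT_SENDER)) {
            ps.setString(1, messageId);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    sender = rs.getString(1);
                }
            }
        }
        if (sender == null) {
            throw new SQLException("No sender stored for message " + messageId);
        }

        boolean friend;
        try (PreparedStatement ps = conn.prepareStatement(COUNT_FRIEND)) {
            ps.setString(1, sender);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next(); // COUNT(*) always yields exactly one row
                friend = rs.getInt(1) > 0;
            }
        }

        try (PreparedStatement ps = conn.prepareStatement(INSERT_RESULT)) {
            ps.setString(1, messageId);
            ps.setBoolean(2, friend);
            ps.executeUpdate();
        }
        return friend;
    }
}
```

Prepared statements are used so that addresses and message identifiers taken from untrusted mail headers are passed as parameters rather than concatenated into the SQL text.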
Problems and Solutions
Because the honeypot was very new, it did not attract enough spam. We tried to attract spam to the honeypot by publishing its address in various places on the Internet, such as blogs and personal websites. However, very little spam was attracted.
To obtain concrete results, we needed to run this test on many messages, so we decided to switch to the standalone approach, which analyzes messages from existing mailboxes.
In addition, the approach used by Phil Bradley [13] could be used to publicize the server address.
Next: Standalone Approach