By :
Preethi Narayan
Columbia University
Department of Computer Science
New York, NY 10027
This project gathers statistics about trends in e-mails so that an analysis can be made of what causes e-mails to be classified as “spam”. This is done by running a number of tests on e-mails from various sources and drawing conclusions from the results. The project deals with recognizing patterns associated with e-mails classified as “spam” or “ham” (non-spam); it does not decide whether individual e-mails are themselves “spam”. There are two basic approaches to deciding whether mails are “spam” or “ham”. The first is to examine the body of the mail and decide whether it is legitimate. The second is to examine the information about the e-mail that is present in its headers. This project uses the second approach to study the trends in e-mails classified as “spam” or “ham”.
E-mail headers contain a wide variety of information, which is used here to observe the behavior of both “non-spam” and “spam” e-mails. To gather statistics, an application containing the different tests is run on various IMAP e-mail accounts. The statistics generated are based on the headers of each e-mail in the account. They are produced by tests such as Blacklist or Friendlist checks, Received Header vs From and Sent Header, etc. The application considers three folders for each account: “sent mail”, “inbox” and “spam”. The results are gathered by running the tests on these folders. The module covered in this report is Received Header vs From and Sent Header. Running the application on a large number of IMAP e-mail accounts helps in deciding whether the tests on the headers are good indicators or not. The main purpose of the application is to analyse which tests performed on e-mail headers are good indicators for recognising “non-spam” and “spam” messages.
The project is divided into two parts. The first part is the server based approach. The second part involves developing a stand alone statistics generator which can be run on individual IMAP mail boxes. The first approach consists of configuring a server to receive e-mails from various sources. These e-mails are then parsed into different portions, and the parsed portions are stored in a database. The database has a number of fields, each corresponding to a different portion of the header. The different checks are performed on the e-mails using the information available in the database. The problems faced in this approach led to the second approach.
The second approach deals with the development of a standalone statistics generator. This application is used for IMAP enabled e-mail accounts. This involves the retrieval of e-mail messages from the server where the messages are stored and running the tests on them. The results of running these tests are displayed in the graphical user interface. A separate module was developed for each of the tests to be made. These modules were integrated and used in both the approaches. The modules developed for the project are as follows:
Friend Check
Pingable Hosts
Black Lists
Domain Check
In-reply-to
DKIM and SPF
Received Header
DHCP and DSL
Attachments
Getting hour,date,time information from the message
Whether the To field and the Body contain the name of the person.
Whether there is any image in the body of the message.
Columbia Internal mails
Each of these modules operates on different parts of the headers and body of e-mails. The modules were implemented by different members of the team.
The design of the project is briefly described below:
Java is the programming language used for the modules.
MailStats is the main module. It connects to a given account if the account is IMAP enabled and calls all the modules synchronously.
MailStats connects to the user's account on an IMAP server and starts up a basic user interface with which the user can categorize his mail folders into spam, ham and sent.
A Graphical User Interface is used to perform the operations of connecting to the IMAP server and indicating the progress of the tests.
First the e-mail headers are retrieved from the account specified. Then the modules run the tests on the retrieved headers.
The external library used is the javamail-1.4 library. This library is used extensively by all the modules for the retrieval of the headers of the e-mails.
The main module MailStats then passes javax.mail.Message arrays containing the three folders spam, ham and sent messages to all the modules. The modules use these to individually find statistics and print out a result that can be later used for analysis.
2.Introduction
The modules that I implemented for the server based approach and the standalone generator are:
Received Header vs Sent and From header for the server based approach.
Received Header vs Sent and From header for the stand alone generator.
The received header in e-mails is one of the trace fields. The "Received:" field contains a (possibly empty) list of name/value pairs followed by a semicolon and a date-time specification. The first item of the name/value pair is defined by item-name, and the second item is either an address-specification, an atom, a domain, or a message-id. The received field was chosen as a classifier because it records the route the e-mail took from its origin. Each time an e-mail reaches a hop, a received header is added to the list of headers with details of the domain of the current hop and of the host from which the e-mail was received.
An example of a message header for an email sent from MrJones@emailprovider.com to MrSmith@gmail.com:
Delivered-To: MrSmith@gmail.com
Received: by 10.36.81.3 with SMTP id e3cs239nzb; Tue, 29 Mar 2005 15:11:47 -0800 (PST)
Return-Path: MrJones@emailprovider.com
Received: from mail.emailprovider.com (mail.emailprovider.com [111.111.11.111]) by mx.gmail.com with SMTP id h19si826631rnb.2005.03.29.15.11.46; Tue, 29 Mar 2005 15:11:47 -0800 (PST)
Message-ID: <20050329231145.62086.mail@mail.emailprovider.com>
Received: from [11.11.111.111] by mail.emailprovider.com via HTTP; Tue, 29 Mar 2005 15:11:45 PST
Date: Tue, 29 Mar 2005 15:11:45 -0800 (PST)
From: Mr Jones
Subject: Hello
To: Mr Smith
In the example, headers are added to the message three times:
1.When Mr. Jones composes the email
Date: Tue, 29 Mar 2005 15:11:45 -0800 (PST)
From: Mr Jones
Subject: Hello
To: Mr Smith
2.When the email is sent through the servers of Mr. Jones' email provider, mail.emailprovider.com
Message-ID: <20050329231145.62086.mail@mail.emailprovider.com>
Received: from [11.11.111.111] by mail.emailprovider.com via HTTP; Tue, 29 Mar 2005 15:11:45 PST
3.When the message transfers from Mr. Jones' email provider to Mr. Smith's Gmail address
Delivered-To: MrSmith@gmail.com
Received: by 10.36.81.3 with SMTP id e3cs239nzb;Tue, 29 Mar 2005 15:11:47 -0800 (PST)
Return-Path: MrJones@emailprovider.com
Received: from mail.emailprovider.com (mail.emailprovider.com [111.111.11.111]) by mx.gmail.com with SMTP id h19si826631rnb; Tue, 29 Mar 2005 15:11:47 -0800 (PST)
Below is a description of each section of the email header:
Delivered-To: MrSmith@gmail.com
The email address the message will be delivered to.
Received: by 10.36.81.3 with SMTP id e3cs239nzb;
Tue, 29 Mar 2005 15:11:47 -0800 (PST)
The time the message reached Gmail's servers.
Return-Path: MrJones@emailprovider.com
The address from which the message was sent.
Received: from mail.emailprovider.com
(mail.emailprovider.com [111.111.11.111])
by mx.gmail.com with SMTP id h19si826631rnb.2005.03.29.15.11.46;
Tue, 29 Mar 2005 15:11:47 -0800 (PST)
The message was received from mail.emailprovider.com, by a Gmail server on March 29, 2005 at approximately 3 pm.
Message-ID: 20050329231145.62086.mail@mail.emailprovider.com
A unique number assigned by mail.emailprovider.com to identify the message.
Received: from [11.11.111.111] by mail.emailprovider.com via HTTP;
Tue, 29 Mar 2005 15:11:45 PST
Mr. Jones used an email composition program to write the message, and it was then received by the email servers of mail.emailprovider.com.
Date: Tue, 29 Mar 2005 15:11:45 -0800 (PST)
From: Mr Jones
Subject: Hello
To: Mr Smith
The date, sender, subject, and destination -- Mr. Jones entered this information (except for the date) when he composed the email.
The "Received:" header field can be used to count the e-mails received from known domains and to check whether they were actually received from the domain from which they claim to have been sent. E-mail headers have the “From:” and “Sent:” fields, although these fields are not always present. To perform this test, the “From:” header is compared with the “Received:” header. Alternatively, if the “Sent:” header is present, a comparison of the “Received:” and “Sent:” headers is made. A number of existing spam filters, such as SpamAssassin, use the received header in the tests that decide the points to be assigned to a particular mail. The statistics were gathered by running the mail statistics generator on my inbox and on the mail accounts of friends who have IMAP enabled e-mail accounts.
The architecture of the components involved in the module is described below:
The “Received:” header field in the e-mail header holds trace information about the mail hops, either as domain names or as the IP addresses of the hosts. In both the server approach and the stand alone generator, parsing the “Received:” header returns the domain name if one is found and otherwise the IP address. If only the IP address is obtained, the domain name has to be derived from it using a reverse DNS lookup. This is the first step in the process of testing the “Received:” header against the “From:” and “Sent:” headers.
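The first step described above can be illustrated with a minimal Java sketch. It is not the module's actual code: the class name ReceivedHopSketch and its method names are hypothetical, and the regular expression only handles the simple "from host" and "from [IP]" forms seen in the example header earlier in this report. The reverse lookup relies on the standard java.net.InetAddress class, which returns the textual IP address again when no name can be found.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ReceivedHopSketch {
    // Matches "from <token>" at the start of a Received: value, where
    // <token> is either a bracketed IP literal or a run of non-space characters.
    private static final Pattern FROM_TOKEN =
            Pattern.compile("^\\s*from\\s+(\\[[^\\]]+\\]|\\S+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern IPV4 = Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    /** Returns the host name or IP after "from" (brackets stripped), or null if there is none. */
    static String fromToken(String receivedValue) {
        Matcher m = FROM_TOKEN.matcher(receivedValue);
        if (!m.find()) {
            return null;
        }
        String token = m.group(1);
        if (token.startsWith("[") && token.endsWith("]")) {
            token = token.substring(1, token.length() - 1);
        }
        return token;
    }

    /** True if the token looks like a dotted IPv4 address (the octet range is not checked). */
    static boolean isIpv4Literal(String token) {
        return IPV4.matcher(token).matches();
    }

    /** Reverse DNS lookup; falls back to the IP text when no name is available. */
    static String reverseLookup(String ip) throws UnknownHostException {
        return InetAddress.getByName(ip).getCanonicalHostName();
    }

    public static void main(String[] args) throws UnknownHostException {
        String header = "from [11.11.111.111] by mail.emailprovider.com via HTTP; "
                + "Tue, 29 Mar 2005 15:11:45 PST";
        String token = fromToken(header);
        // Only contact the resolver when the header carried an IP instead of a name.
        String host = isIpv4Literal(token) ? reverseLookup(token) : token;
        System.out.println(host);
    }
}
```

A "Received:" line that starts with "by" rather than "from", such as the first one in the example header, yields no token in this sketch and would simply be skipped.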
This component does the actual comparison of the domain names obtained from the “Received:” header and from the “From:” and “Sent:” headers. Domain names consist of the ASCII letters "a" through "z" (case-insensitive), the digits "0" through "9", and the hyphen, with some other restrictions. Examples are "imap.gmail.com" and "cs.mit.edu". Domain names are classified as:
Top-Level Domains - These are part of a list of generic names (.com, .net, etc.) or two-character territory codes, e.g. .in, .uk, .jp
Second Level Domains - These are the names directly to the left of .com, .net, and the other top-level domains, e.g. columbia in columbia.edu
Third Level Domains - These domains are immediately to the left of a second-level domain, e.g. cs in cs.columbia.edu
Sub-Domains - Domains of third or higher level, e.g. cs.columbia.edu
The domain names are split into their component parts: top level domain, second level domain, third level domain and sub domains. Then the corresponding parts are compared.
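The split-and-compare step can be sketched in Java as follows. The report does not state the exact matching rule, so this sketch assumes that two names match when their top-level and second-level labels are equal, ignoring case. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class DomainCompareSketch {
    /** Splits a domain name into labels ordered from the top-level domain leftwards. */
    static String[] labelsFromTop(String domain) {
        String d = domain.trim().toLowerCase(Locale.ROOT);
        if (d.endsWith(".")) {
            d = d.substring(0, d.length() - 1); // drop the trailing root dot, if any
        }
        String[] parts = d.split("\\.");
        String[] reversed = new String[parts.length];
        for (int i = 0; i < parts.length; i++) {
            reversed[i] = parts[parts.length - 1 - i];
        }
        return reversed;
    }

    /** Assumed rule: the names match when both the top-level and second-level labels agree. */
    static boolean sameSecondLevelDomain(String a, String b) {
        String[] x = labelsFromTop(a);
        String[] y = labelsFromTop(b);
        if (x.length < 2 || y.length < 2) {
            return false; // a bare label such as "localhost" cannot be compared this way
        }
        return x[0].equals(y[0]) && x[1].equals(y[1]);
    }

    public static void main(String[] args) {
        System.out.println(sameSecondLevelDomain("mail.emailprovider.com", "emailprovider.com"));
        System.out.println(sameSecondLevelDomain("mx.gmail.com", "emailprovider.com"));
    }
}
```

Comparing only two labels is a simplification. For territory codes with an extra registry level, such as names ending in .co.uk, the third-level label would also have to be compared.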
This component deals with tokenising the headers received. All the e-mail headers have to be tokenised to extract only the required components. Once the IP addresses and the domain names from the “Received:”, “From:” and “Sent:” headers have been extracted, they have to be compared. This is done by the CheckForDomain component of the module. If the domain names match, then the sender's e-mail address corresponds to the domain from which the mail actually came.
The design and implementation of the module for the received header check involves two parts: the design and implementation for the server based approach, and the design and implementation for the standalone approach. The details of the design for each of the components of the module are described below.
4.1 Server Based Approach
4.2 Stand Alone Based Approach
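As an illustration of the tokenising step for the “From:” and “Sent:” headers, the following Java sketch extracts the domain part of an address. It is a simplified assumption of how this could be done, not the module's actual tokeniser: it handles only the plain forms "user@domain" and "Name <user@domain>", and the class name AddressDomainSketch is hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class AddressDomainSketch {
    /** Returns the lower-cased domain of the address in a header value, or null if none is found. */
    static String domainOf(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        String v = headerValue.trim();
        // Prefer the part inside angle brackets, as in "Mr Jones <MrJones@emailprovider.com>".
        int lt = v.lastIndexOf('<');
        int gt = v.lastIndexOf('>');
        if (lt >= 0 && gt > lt) {
            v = v.substring(lt + 1, gt).trim();
        }
        int at = v.lastIndexOf('@');
        if (at < 0 || at == v.length() - 1) {
            return null; // no address, or nothing after the '@'
        }
        return v.substring(at + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(domainOf("Mr Jones <MrJones@emailprovider.com>"));
    }
}
```

The domain returned here is what would be handed to the CheckForDomain component for comparison with the domain taken from the “Received:” header.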
4.1 Server Based Approach - Design and Implementation
In this approach, a server was configured to send and receive e-mails. Whenever an incoming message is received, it goes through a parser module which breaks the message down into the header and the body. The message header is further broken down into individual components based on their fields. These parsed values are stored in the database. Each of the individual modules obtains its data from the database and performs its tests on them. The results of the tests are stored back in the database.
Figure 4.1
Figure 4.1 shows the class diagram for the design in the server based approach. The class diagram has three classes, one for each of the components which are part of this approach. Here the controller is used as an interface for all the modules, and it decides the order of operations. Each of the modules is called independently to perform the analysis on each message. The controller holds a message id which is unique for each message. It also holds a vector of all the modules present in the system to perform the corresponding tests. It passes the message id to each module. Using this message id as a primary key to the tables in the database, the values required by the corresponding module are retrieved. The JDBC connection class is used to get the handle for the connection and establish the connection. The data is retrieved from the database using this connection, after which the connection is closed. In the case of the Received Header module, the message headers corresponding to “Received:”, “From:” and “Sent:” are retrieved from the database. The Received Header module then performs the tests.
It compares the domain names of the “From:” field and the “Sent:” field with the domain names from the “Received:” field. The number of e-mails in which the domain names of the Received header and the From header match is stored in the database. Similarly, the number of e-mails in which the Received header and the Sent header match is stored in the database.
The following shows the flow of information in the Received Header Analysis module.
Figure 4.2
Figure 4.2 is the data flow diagram for the server based approach. Here the messages are retrieved from the server. The messages are parsed and the parsed contents are stored in the database. The handle of the controller is passed to the check for domain module which gets the corresponding fields from the database and sends it to the received header check module. The results of this are stored in the database.
4.2 Stand Alone Approach - Design and Implementation
Figure 5.1
Figure 5.1 shows the class diagram for this approach. Here the classes defined are Received Header Analysis, Check For Domain and Get Domain Name. The Received Header Analysis class retrieves all the messages from a corresponding folder. The folders taken into consideration are the folders with “Ham” messages and “Spam” messages. The headers are extracted from each message in each of the folders.
The headers are parsed and, if only the IP address is present, the domain name for the corresponding IP address is retrieved. This is done using the Get Domain Names module.
Using the domain name retrieved from this module, the Check For Domain class compares the domain names. Here the domain names are broken down into individual components like primary domain name, secondary domain name, etc. A comparison is made with each subset of the names, and if a match is found then a true value is returned.
In the received header class a hash map is maintained to store the domain names of IP addresses already seen in the messages, so that the speed of the test is increased. The first result of this module is the number of e-mails in which the received header and the from field matched. The other result is the number of e-mails in which the received header and the sent field matched.
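The hash map described above can be sketched as a small caching wrapper around the lookup. This is only an illustration under assumptions: the class name CachingResolverSketch is hypothetical, and the actual lookup is passed in as a function so that the sketch does not depend on the network. In the real module the function would perform the reverse DNS lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, for illustration only.
final class CachingResolverSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> lookup;

    /** The lookup function maps an IP address to a domain name (for example by reverse DNS). */
    CachingResolverSketch(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    /** Returns the domain name for the IP, calling the lookup only the first time the IP is seen. */
    String domainFor(String ip) {
        String cached = cache.get(ip);
        if (cached != null) {
            return cached;
        }
        String name = lookup.apply(ip);
        if (name != null) {
            cache.put(ip, name); // remember successful lookups only
        }
        return name;
    }

    int cachedEntries() {
        return cache.size();
    }

    public static void main(String[] args) {
        // A stand-in lookup that just labels the IP; no DNS query is made here.
        CachingResolverSketch resolver = new CachingResolverSketch(ip -> "host-for-" + ip);
        System.out.println(resolver.domainFor("111.111.11.111"));
        System.out.println(resolver.domainFor("111.111.11.111")); // served from the cache
    }
}
```

Because many messages in a mailbox pass through the same few mail servers, most lookups after the first few are answered from the map rather than from the network.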
The results are of the format as shown below :
Inbox:
From match count : 1512/1631
Sender match count : 16/1631
Spam:
From match count : 345/976
Sender match count : 5/976
Here the total number of messages in the inbox (the ham messages) is 1631. The number of messages in which the “received:” header matched the “from:” header is 1512, and the number of messages in which the “received:” header matched the “sent:” header is 16. The same process is repeated for the spam messages.
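A minimal Java sketch of how such counts could be accumulated and printed in the format shown above is given below. The class MatchTallySketch and its methods are hypothetical, and rounding the percentage to the nearest whole number is an assumption about how the percentage columns of the results tables were obtained.

```java
// Hypothetical helper, for illustration only.
final class MatchTallySketch {
    private int total;
    private int fromMatches;
    private int senderMatches;

    /** Records the outcome of the two comparisons for one message. */
    void record(boolean fromMatched, boolean senderMatched) {
        total++;
        if (fromMatched) {
            fromMatches++;
        }
        if (senderMatched) {
            senderMatches++;
        }
    }

    /** Formats the counts for one folder in the layout used by the report. */
    String report(String folder) {
        return folder + ":\n"
                + "From match count : " + fromMatches + "/" + total + "\n"
                + "Sender match count : " + senderMatches + "/" + total;
    }

    /** Percentage of matched messages, rounded to the nearest integer; 0 for an empty folder. */
    static long percent(int matched, int total) {
        return total == 0 ? 0 : Math.round(100.0 * matched / total);
    }

    public static void main(String[] args) {
        MatchTallySketch inbox = new MatchTallySketch();
        inbox.record(true, false);
        inbox.record(true, true);
        inbox.record(false, false);
        System.out.println(inbox.report("Inbox"));
    }
}
```

The guard for an empty folder matters here because several of the mailboxes considered contain no spam messages at all.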
Mail Boxes Considered
The sample data consists of 22 mailboxes on which the tests were performed. The details of these mailboxes are given in the following table:
Mail Box Number | Number Of Inbox Mails | Number of Spam Mails | Total Number of Mails |
1 | 435 | 186 | 621 |
2 | 1631 | 976 | 2607 |
3 | 133 | 145 | 278 |
4 | 1703 | 207 | 1910 |
5 | 1072 | 0 | 1072 |
6 | 857 | 0 | 857 |
7 | 365 | 61 | 426 |
8 | 212 | 137 | 349 |
9 | 358 | 61 | 419 |
10 | 566 | 141 | 707 |
11 | 2187 | 0 | 2187 |
12 | 351 | 0 | 351 |
13 | 202 | 1 | 203 |
14 | 151 | 0 | 151 |
15 | 352 | 0 | 352 |
16 | 1119 | 21 | 1140 |
17 | 1047 | 0 | 1047 |
18 | 1104 | 39 | 1143 |
19 | 1237 | 0 | 1237 |
20 | 416 | 73 | 489 |
21 | 1638 | 0 | 1638 |
22 | 1334 | 2 | 1336 |
Figure 5.0
The following Figure 5.1 shows the statistics and the data gathered for the Received Header Analysis. This data corresponds to the ham message folder.
Figure 5.1
The following Figure 5.2 shows the statistics and the data gathered for the Received Header Analysis. This data corresponds to the spam message folder.
Figure 5.2
Figure 5.3
It can be seen from the plot that, on average, the percentage match between the received header and the from header is 75% for the ham messages, while for spam messages the average percentage match is less than 10%. However, no strong conclusion can be drawn from this, as the number of spam messages in the mailboxes used was very small. From the results table we get the scatter plots indicating the match of sent headers with the received headers. This is done for both the ham and spam message folders.
Figure 5.4
It can be seen from the plot that, on average, the percentage match between the received header and the sent header is 40% for the ham messages, while for spam messages it is less than 5%. The percentage match between the sent header and the received header is very low. One of the main reasons for this is that very few messages have the sent header. So the check of the received header against the from header provides better results than the check against the sent header. Thus, as seen from the statistics generated for mailboxes of different sizes, the Received Header test can be considered a good test based on the following two properties:
The number of messages in the ham folder whose domains match the received header is high.
The number of messages in the spam folder whose domains match the received header is low.
Another property observed is that this check fails for messages which originate from a mailing list, as the headers of such messages differ from those of regular messages.
Result Summary:
Some of the problems faced during the course of the project are listed below.
The server based approach did not attract enough e-mails, so no analysis could be made with the modules in that approach. This in turn paved the way for the standalone mail statistics generator. One of the main problems faced was the lack of rigid rules for the format of headers. Each e-mail service adds its own variations, which makes generalization difficult. This problem was mainly dealt with while parsing the header. The second problem faced was the execution time of the module. The execution time depends on the speed of the network used to retrieve the e-mails from the server.
Without optimization, the module took more than 20 minutes to analyze a mailbox of more than 1K e-mails. I then started caching the IP addresses once a lookup was done, so that if the same IP address was found again, the locally cached IP address and its corresponding domain name were used instead of contacting the server for another lookup. This made the whole module work much faster, and it could process around 1K mails in around 2-3 minutes.
Another problem is that the number of mail boxes on which the stand alone application was run was small. To be able to decide how good a parameter the received header is for indicating whether e-mails are ham or spam, the stand alone application has to be run on a larger number of mailboxes.
The tools used were as follows:
Navicat MySQL for creating the database for the Server based Approach
Netbeans IDE for writing code for the modules, as the code was written in Java
OpenOffice and EditPlus HTML based tools for writing the report
OpenOffice Excel draw for drawing the figures
Open Office Drawing tool for making the class diagrams
Excel to Html converter for converting result tables in Excel into the HTML form
The following link contains the source code for the individual modules:
RFC 2822 [http://tools.ietf.org/html/rfc2822]
RFC 2821 [http://tools.ietf.org/html/rfc2821]
JavaMail API [http://java.sun.com/products/javamail/]
Spam Analysis and Reputation Project [http://www1.cs.columbia.edu:8080/display/spam/Home]
SARP Modules [http://www1.cs.columbia.edu:8080/display/spam/IMAP+analyzer+modules]
Professor Henning Schulzrinne - Advisor for the Project.
Adrian Frei - Spam Analysis and Reputation Project: DNS Blacklists
Tejas Nadkarni - Parser and Standalone Framework
Aditi Rajoriya - Spam Analysis and Reputation Project: IMAP Retrieval and To/Body Module
Dhrumin Shah - Spam Analysis and Reputation Project: Domain Check and Image Analysis
Nirav Shah - Spam Analysis and Reputation Project: Email Source, Date/Time and Attachment Analysis
Swati Kumar - Spam Analysis and Reputation Project: Email Encryption Headers and Database Schema
5.1 RECEIVED HEADER MODULE RESULTS
To compute the results for the Received Header module, the two folders considered for running the tests were ham and spam. The number of e-mail messages in which the "From:" header field matched the "Received:" header is counted, as is the number of e-mail messages in which the "Sent:" header field matched the "Received:" header.
Once the statistics are analyzed, the result is computed. In the table of ham results:
The first column is the mailbox number.
The second and third columns are the number and the percentage of e-mails in which the from header matches the received header.
The fourth and fifth columns are the number and the percentage of e-mails in which the sent header matches the received header.
The sixth column is the total number of messages in the folder.
Mail Box Number | Ham mails matched between Received and From Headers | Percentage Ham mails matched between Received and From Headers | Ham mails matched between Received and Sent Headers | Percentage Ham mails matched between Received and Sent Headers | Total Messages |
1 | 255 | 59 | 16 | 4 | 435 |
2 | 1431 | 88 | 63 | 4 | 1631 |
3 | 101 | 76 | 1 | 1 | 133 |
4 | 1451 | 85 | 71 | 4 | 1703 |
5 | 738 | 69 | 27 | 3 | 1072 |
6 | 665 | 78 | 42 | 5 | 857 |
7 | 289 | 79 | 21 | 6 | 365 |
8 | 159 | 71 | 1 | 0 | 212 |
9 | 221 | 61 | 10 | 3 | 358 |
10 | 513 | 90 | 1 | 2 | 566 |
11 | 1897 | 87 | 15 | 1 | 2187 |
12 | 322 | 92 | 6 | 2 | 351 |
13 | 196 | 97 | 2 | 1 | 202 |
14 | 132 | 87 | 1 | 1 | 151 |
15 | 281 | 80 | 24 | 7 | 352 |
16 | 721 | 64 | 198 | 18 | 1119 |
17 | 551 | 52 | 245 | 23 | 1047 |
18 | 984 | 89 | 27 | 2 | 1104 |
19 | 757 | 61 | 160 | 13 | 1237 |
20 | 300 | 72 | 10 | 2 | 416 |
21 | 837 | 51 | 384 | 23 | 1638 |
22 | 919 | 69 | 282 | 21 | 1334 |
The same statistics are computed for the spam folder. In the table of spam results:
The first column is the mailbox number.
The second and third columns are the number and the percentage of e-mails in which the from header matches the received header.
The fourth and fifth columns are the number and the percentage of e-mails in which the sent header matches the received header.
The sixth column is the total number of messages in the folder.
Mail Box Number | Spam mails matched between Received and From Headers | Percentage Spam mails matched between Received and From Headers | Spam mails matched between Received and Sent Headers | Percentage Spam mails matched between Received and Sent Headers | Total Messages |
1 | 16 | 9 | 0 | 0 | 186 |
2 | 212 | 22 | 5 | 1 | 976 |
3 | 33 | 23 | 0 | 0 | 145 |
4 | 77 | 37 | 0 | 0 | 207 |
5 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 |
7 | 20 | 33 | 0 | 0 | 61 |
8 | 51 | 37 | 1 | 0 | 137 |
9 | 13 | 21 | 0 | 0 | 61 |
10 | 51 | 36 | 1 | 0 | 141 |
11 | 0 | 0 | 0 | 0 | 0 |
12 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 0 | 1 |
14 | 0 | 0 | 0 | 0 | 0 |
15 | 0 | 0 | 0 | 0 | 0 |
16 | 11 | 52 | 0 | 0 | 21 |
17 | 0 | 0 | 0 | 0 | 0 |
18 | 15 | 38 | 1 | 3 | 39 |
19 | 0 | 0 | 0 | 0 | 0 |
20 | 31 | 32 | 0 | 0 | 73 |
21 | 0 | 0 | 0 | 0 | 0 |
22 | 0 | 0 | 0 | 0 | 2 |
5.2 RECEIVED HEADER MODULE ANALYSIS
From the results table we get the scatter plots indicating the match of from headers with the received headers. This is done for both the ham and spam message folders.
Received Header Module : This module can be used as a fairly good filter to understand and classify messages as spam or ham.
Received Header vs From and Sent Module