Spam Analysis and Reputation Project Report: Received Header vs From and Sent Header

By:

Preethi Narayan

Columbia University

Department of Computer Science

New York, NY 10027

Abstract

This project gathers statistics about trends in e-mail in order to analyse which properties lead an e-mail to be classified as “spam”. The statistics are obtained by running a number of tests on e-mails from various sources and drawing conclusions from the results. The project is concerned with recognizing the patterns associated with “spam” and “ham” (non-spam) e-mails, not with deciding whether an individual e-mail is spam. There are two basic approaches to deciding whether a mail is “spam” or “ham”. The first is to examine the body of the mail and decide whether it is legitimate. The second is to examine the information carried in the e-mail headers. This project uses the second approach. E-mail headers contain a wide variety of information, which is used here to observe the behavior of both “ham” and “spam” e-mails. To gather statistics, an application containing the different tests is run on various IMAP e-mail accounts. The statistics are based on the headers of each e-mail in the account and are produced by tests such as Blacklist, Friendlist, and Received Header vs From and Sent Header. For each account the application considers three folders, “sent mail”, “inbox” and “spam”, and the results are gathered by running the tests on these folders. The module covered in this report is Received Header vs From and Sent Header. Running the application on a large number of IMAP e-mail accounts helps to decide whether each test on the headers is a good indicator. The main purpose of the application is to analyse which tests performed on e-mail headers are good indicators for distinguishing “ham” from “spam” messages.

1. Project Overview

The project is divided into two parts. The first part is the server based approach. The second part involves developing a stand alone statistics generator which can be run on individual IMAP mail boxes. In the first approach, a server is configured to receive e-mails from various sources. These e-mails are then parsed into their different portions, and the parsed portions are stored in a database. The database has a number of fields, each corresponding to a different portion of the header. The different checks are performed on the e-mails using the information available in the database. The problems faced in this approach led to the second approach.

The second approach deals with the development of a standalone statistics generator. This application is used for IMAP enabled e-mail accounts. This involves the retrieval of e-mail messages from the server where the messages are stored and running the tests on them. The results of running these tests are displayed in the graphical user interface. A separate module was developed for each of the tests to be made. These modules were integrated and used in both the approaches. The modules developed for the project are as follows:

Friend Check

Pingable Hosts

Black Lists

Domain Check

In-reply-to

DKIM and SPF

Received Header

DHCP and DSL

Attachments

Getting hour, date and time information from the message

Whether the To field and the Body contain the name of the person

Whether there is any image in the body of the message

Columbia Internal mails

Each of these modules operates on different parts of the headers and body of e-mails. Each module was implemented by a different member of the team.

The design of the project is briefly described below:

Java is the programming language used for the modules.

MailStats is the main module. It connects to a given account if the account is IMAP enabled and calls all the modules synchronously.

MailStats connects to the user's account on an IMAP server and starts up a basic user interface with which the user can categorize his mail folders into spam, ham and sent.

A Graphical User Interface is used to connect to the IMAP server and to indicate the progress of the tests.

First the e-mail headers are retrieved from the specified account. Then the modules run the tests on the retrieved headers.

The external library used is the javamail-1.4 library. This library is used extensively by all the modules for the retrieval of the headers of the e-mails.
The main module MailStats then passes javax.mail.Message arrays containing the messages of the three folders (spam, ham and sent) to all the modules. The modules use these to compute their statistics individually and print out a result that can later be used for analysis.
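The hand-off from MailStats to the modules can be pictured with the following minimal sketch. It is not the project's code: the `StatsModule` interface, the `Folders` holder, the `HeaderPresenceModule` example and the use of plain header maps in place of `javax.mail.Message` arrays are all assumptions, chosen so that the example compiles without the JavaMail library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the three javax.mail.Message arrays (one header map per message). */
final class Folders {
    final List<Map<String, String>> spam;
    final List<Map<String, String>> ham;
    final List<Map<String, String>> sent;

    Folders(List<Map<String, String>> spam,
            List<Map<String, String>> ham,
            List<Map<String, String>> sent) {
        this.spam = spam;
        this.ham = ham;
        this.sent = sent;
    }
}

/** Hypothetical contract implemented by every test module. */
interface StatsModule {
    String name();
    String run(Folders folders);   // returns a printable result line
}

/** Example module: counts the messages that carry a given header. */
final class HeaderPresenceModule implements StatsModule {
    private final String header;

    HeaderPresenceModule(String header) { this.header = header; }

    public String name() { return "Has " + header; }

    public String run(Folders f) {
        return "ham " + count(f.ham) + "/" + f.ham.size()
             + ", spam " + count(f.spam) + "/" + f.spam.size();
    }

    private long count(List<Map<String, String>> messages) {
        return messages.stream().filter(m -> m.containsKey(header)).count();
    }
}

final class MailStatsSketch {
    /** Calls every module in turn (synchronously) and collects one result line per module. */
    static List<String> runAll(List<StatsModule> modules, Folders folders) {
        List<String> results = new ArrayList<>();
        for (StatsModule module : modules) {
            results.add(module.name() + ": " + module.run(folders));
        }
        return results;
    }
}
```

Keeping every module behind one small interface is what allows the same modules to be reused by both the server based approach and the standalone generator.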

2. Introduction

The modules that I implemented for the server based approach and the standalone generator are:

Received Header vs Sent and From header for the server based approach.

Received Header vs Sent and From header for the stand alone generator.

The Received header in e-mails is one of the trace fields. The "Received:" field contains a (possibly empty) list of name/value pairs followed by a semicolon and a date-time specification. The first item of each name/value pair is the item name, and the second item is an address specification, an atom, a domain, or a message-id. The Received field was chosen as a classifier because it records the route the e-mail took from its origin. Each time an e-mail reaches a hop, a new Received header is added to the list of headers, giving the domain of the current hop and the host from which the e-mail was received.

An example of a message header for an email sent from MrJones@emailprovider.com to MrSmith@gmail.com:

Delivered-To: MrSmith@gmail.com
Received: by 10.36.81.3 with SMTP id e3cs239nzb; Tue, 29 Mar 2005 15:11:47 -0800 (PST)
Return-Path:
Received: from mail.emailprovider.com (mail.emailprovider.com [111.111.11.111]) by mx.gmail.com with SMTP id h19si826631rnb.2005.03.29.15.11.46; Tue, 29 Mar 2005 15:11:47 -0800 (PST)
Message-ID: <20050329231145.62086.mail@mail.emailprovider.com>
Received: from [11.11.111.111] by mail.emailprovider.com via HTTP; Tue, 29 Mar 2005 15:11:45 PST
Date: Tue, 29 Mar 2005 15:11:45 -0800 (PST)
From: Mr Jones
Subject: Hello
To: Mr Smith

In the example, headers are added to the message three times:

1. When Mr. Jones composes the email
Date: Tue, 29 Mar 2005 15:11:45 -0800 (PST)
From: Mr Jones
Subject: Hello
To: Mr Smith

2. When the email is sent through the servers of Mr. Jones' email provider, mail.emailprovider.com
Message-ID: <20050329231145.62086.mail@mail.emailprovider.com>
Received: from [11.11.111.111] by mail.emailprovider.com via HTTP; Tue, 29 Mar 2005 15:11:45 PST

3. When the message transfers from Mr. Jones' email provider to Mr. Smith's Gmail address
Delivered-To: MrSmith@gmail.com
Received: by 10.36.81.3 with SMTP id e3cs239nzb; Tue, 29 Mar 2005 15:11:47 -0800 (PST)
Return-Path: MrJones@emailprovider.com
Received: from mail.emailprovider.com (mail.emailprovider.com [111.111.11.111]) by mx.gmail.com with SMTP id h19si826631rnb; Tue, 29 Mar 2005 15:11:47 -0800 (PST)

Below is a description of each section of the email header:
Delivered-To: MrSmith@gmail.com
The email address the message will be delivered to.

Received: by 10.36.81.3 with SMTP id e3cs239nzb;
Tue, 29 Mar 2005 15:11:47 -0800 (PST)
The time the message reached Gmail's servers.

Return-Path:
The address from which the message was sent.

Received: from mail.emailprovider.com
(mail.emailprovider.com [111.111.11.111])
by mx.gmail.com with SMTP id h19si826631rnb.2005.03.29.15.11.46;
Tue, 29 Mar 2005 15:11:47 -0800 (PST)
The message was received from mail.emailprovider.com, by a Gmail server on March 29, 2005 at approximately 3 pm.

Message-ID: 20050329231145.62086.mail@mail.emailprovider.com
A unique number assigned by mail.emailprovider.com to identify the message.

Received: from [11.11.111.111] by mail.emailprovider.com via HTTP;
Tue, 29 Mar 2005 15:11:45 PST
Mr. Jones used an email composition program to write the message, and it was then received by the email servers of mail.emailprovider.com.

Date: Tue, 29 Mar 2005 15:11:45 -0800 (PST)
From: Mr Jones
Subject: Hello
To: Mr Smith
The date, sender, subject, and destination -- Mr. Jones entered this information (except for the date) when he composed the email.
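The structure of the "Received:" lines described above can be pulled apart with a few lines of Java. This is a minimal sketch, not the parser used in the project: the class name `ReceivedField` and its fields are hypothetical, and the regular expressions only cover the simple "from ... by ...; date" shape of the example header, not every variation that mail servers produce.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for the parts of one "Received:" header value. */
final class ReceivedField {
    final String from;   // host or IP after "from", or null if absent
    final String by;     // host or IP after "by", or null if absent
    final String date;   // text after the last ';', or null if absent

    private static final Pattern FROM =
        Pattern.compile("\\bfrom\\s+(\\S+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern BY =
        Pattern.compile("\\bby\\s+(\\S+)", Pattern.CASE_INSENSITIVE);

    private ReceivedField(String from, String by, String date) {
        this.from = from;
        this.by = by;
        this.date = date;
    }

    /** Parses the header value, i.e. the text after "Received:". */
    static ReceivedField parse(String value) {
        // The date-time follows the last semicolon; the name/value pairs precede it.
        int semi = value.lastIndexOf(';');
        String pairs = semi >= 0 ? value.substring(0, semi) : value;
        String date = semi >= 0 ? value.substring(semi + 1).trim() : null;
        return new ReceivedField(first(FROM, pairs), first(BY, pairs), date);
    }

    private static String first(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Strip the square brackets around address literals such as [11.11.111.111].
        return m.group(1).replaceAll("^\\[|\\]$", "");
    }
}
```

Applied to the third "Received:" line of the example, `parse` yields `11.11.111.111` as the `from` value and `mail.emailprovider.com` as the `by` value.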

The "Received:" header field can be used to count the e-mails received from known domains and to check whether they actually came from the domain from which they claim to have been sent. E-mail headers may carry “From:” and “Sent:” fields, but these are not always present. To perform the test, the “From:” header is compared with the “Received:” header. Alternatively, if the “Sent:” header is present, the “Received:” and “Sent:” headers are compared. A number of existing spam filters, such as SpamAssassin, use the Received header in the tests that decide the points assigned to a particular mail. The statistics in this report were gathered by running the mail statistics generator on my own inbox and on the mail accounts of friends who have IMAP enabled e-mail accounts.

3. Architecture

The architecture of the components involved in the module is described below:

3.1 Get Domain Name

The “Received:” header field in the e-mail header holds trace information about the mail hops, given either as domain names or as the IP addresses of the hosts. In both the server approach and the stand alone generator, parsing the “Received:” header yields the domain name if one is found and otherwise the IP address. If only the IP address is obtained, the domain name has to be derived from it, which is done with a reverse DNS lookup. This is the first step in the process of testing the “Received:” header against the “From:” and “Sent:” headers.
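A reverse lookup of this kind can be sketched with the standard `java.net.InetAddress` class. The class name `DomainNameLookup` and its methods are hypothetical and only IPv4 literals are recognised; the project's Get Domain Name component may work differently.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;
import java.util.regex.Pattern;

final class DomainNameLookup {
    // Four dot-separated groups of one to three digits (the value range is not checked).
    private static final Pattern IPV4 = Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    static boolean isIpv4Literal(String token) {
        return IPV4.matcher(token).matches();
    }

    /** Returns a name unchanged (lower-cased); for an IP address, tries a reverse DNS lookup. */
    static String toDomainName(String token) {
        if (!isIpv4Literal(token)) {
            return token.toLowerCase(Locale.ROOT);
        }
        try {
            // getCanonicalHostName performs the reverse lookup and falls back to
            // the textual IP address when no name can be determined.
            return InetAddress.getByName(token).getCanonicalHostName().toLowerCase(Locale.ROOT);
        } catch (UnknownHostException e) {
            return token;
        }
    }
}
```

Because a failed lookup simply returns the IP address, callers should be prepared to receive an address rather than a name, in which case no domain comparison is possible for that hop.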

3.2 Check For Domain

This component does the actual comparison of the domain names obtained from the “Received:” header with those obtained from the “From:” and “Sent:” headers. Domain names consist of the ASCII letters "a" through "z" (case-insensitive), the digits "0" through "9", and the hyphen, with some other restrictions. Examples are "imap.gmail.com" and "cs.mit.edu". Domain names are classified as:

Top-Level Domains - These are either part of a list of generic names, e.g. .com, .net, or a two-character territory code, e.g. .in, .uk, .jp

Second Level Domains - These are the names directly to the left of .com, .net, and the other top-level domains, e.g. columbia in columbia.edu

Third Level Domains - These are the names immediately to the left of a second-level domain, e.g. cs in cs.columbia.edu

Sub-Domains - Domains of third or higher level, e.g. cs.columbia.edu

The domain names are split into their component parts: top level domain, second level domain, third level domain and sub-domains. Then the corresponding parts of the two domain names are compared.
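The split and comparison can be illustrated as follows. This is a sketch under an assumption: the report does not state exactly which levels must agree, so the rule used here (the top-level and second-level labels must be equal) is a guess, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class DomainLevels {
    /** Labels ordered from the top-level domain downwards: "cs.columbia.edu" gives [edu, columbia, cs]. */
    static List<String> levels(String domain) {
        List<String> labels = Arrays.asList(domain.toLowerCase(Locale.ROOT).split("\\."));
        Collections.reverse(labels);
        return labels;
    }

    /** Assumed rule: two names correspond when their top-level and second-level labels agree. */
    static boolean sameSecondLevel(String a, String b) {
        List<String> la = levels(a);
        List<String> lb = levels(b);
        return la.size() >= 2 && lb.size() >= 2
            && la.get(0).equals(lb.get(0))
            && la.get(1).equals(lb.get(1));
    }
}
```

With this rule, `mail.emailprovider.com` corresponds to `emailprovider.com`, while `cs.columbia.edu` does not correspond to `cs.mit.edu`. The rule is too coarse for territory codes with an extra registry level (for example names ending in `.co.uk`), which would need a longer suffix to be compared.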

3.3 Received Header Analysis

This component deals with tokenising the headers. All the e-mail headers have to be tokenised so that only the required components are extracted. Once the IP addresses and the domain names from the “Received:”, “From:” and “Sent:” headers have been obtained, they have to be compared. This is done by the CheckForDomain component of the module. If the domain names match, then there is a correspondence between the sender's e-mail id and the domain from which the mail came.
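Put together, the check for a single message might look like the sketch below. The names `ReceivedVsFrom`, `addressDomain` and `matches` are hypothetical, and the matching rule (equal top-level and second-level labels) is an assumption rather than the rule documented for CheckForDomain.

```java
import java.util.List;
import java.util.Locale;

final class ReceivedVsFrom {
    /** Domain part of a value such as "Mr Jones <MrJones@emailprovider.com>", or null if none. */
    static String addressDomain(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        int at = headerValue.lastIndexOf('@');
        if (at < 0) {
            return null;
        }
        // Read letters, digits, dots and hyphens that follow the '@'.
        int end = at + 1;
        while (end < headerValue.length()) {
            char c = headerValue.charAt(end);
            if (!Character.isLetterOrDigit(c) && c != '.' && c != '-') {
                break;
            }
            end++;
        }
        String domain = headerValue.substring(at + 1, end).toLowerCase(Locale.ROOT);
        return domain.isEmpty() ? null : domain;
    }

    /** Last two labels of a name, e.g. "mx.gmail.com" gives "gmail.com". */
    static String lastTwoLabels(String name) {
        String[] parts = name.toLowerCase(Locale.ROOT).split("\\.");
        if (parts.length < 2) {
            return name.toLowerCase(Locale.ROOT);
        }
        return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }

    /** True when some Received host shares its last two labels with the address domain. */
    static boolean matches(String headerValue, List<String> receivedHosts) {
        String domain = addressDomain(headerValue);
        if (domain == null) {
            return false;   // header missing or without an address
        }
        String wanted = lastTwoLabels(domain);
        for (String host : receivedHosts) {
            if (lastTwoLabels(host).equals(wanted)) {
                return true;
            }
        }
        return false;
    }
}
```

The same `matches` call can be used for the “From:” value and, when it is present, for the “Sent:” value, which gives the two counts that the module reports.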

4. Design and Implementation

The design and implementation of the module for the received header check involves two parts: the design and implementation for the server based approach, and the design and implementation for the standalone approach. The details of the design for each of the components of the module are described below.

4.1 Server Based Approach
4.2 Stand Alone Based Approach

4.1 Server Based Approach - Design and Implementation

In this approach, a server was configured to send and receive e-mails. Whenever an incoming message is received, it goes through a parser module which breaks the message down into the header and the body. The message header is further broken down into individual components based on their fields. These parsed values are stored in the database. Each of the individual modules obtains its data from the database and performs its tests on them. The results of the tests are stored back in the database.
The class diagram for this approach is shown below.

Domain Data Flow Diagram
Figure 4.1

Figure 4.1 shows the class diagram for the design in the server based approach. The class diagram has three classes, which represent the components that are part of this approach. The controller is used as an interface for all the modules and decides the order of operations. Each of the modules is called independently to perform the analysis on each message. The controller holds a message id, which is unique for each message, and a vector of all the modules present in the system that perform the corresponding tests. It passes the message id to each module. Using this message id as a primary key into the tables in the database, the values required by the corresponding module are retrieved. The JDBC connection class is used to get the handle for the connection and establish the connection, and the data is retrieved from the database through it. Once that is done, the connection is closed. In the case of the Received Header module, the message headers corresponding to “Received:”, “From:” and “Sent:” are retrieved from the database. The Received Header module then performs the tests.

It compares the domain names of the “From:” field and the “Sent:” field with the domain names from the “Received:” field. The number of e-mails in which the domain names of the Received header and the From header matched is stored in the database. Similarly, the number of e-mails in which the Received header and the Sent header matched is stored in the database.

The following shows the flow of information in the Received Header Analysis module.

Domain Data Flow Diagram
Figure 4.2

Figure 4.2 is the data flow diagram for the server based approach. Here the messages are retrieved from the server. The messages are parsed and the parsed contents are stored in the database. The handle of the controller is passed to the check for domain module, which gets the corresponding fields from the database and sends them to the received header check module. The results of this are stored in the database.

4.2 Stand Alone Approach - Design and Implementation

Since the server we had configured did not attract enough e-mails to perform all the tests, a stand alone application was developed. This application can connect to IMAP servers and retrieve e-mails from them. The messages are separated into message body and message header, and the individual header values are then retrieved. However, in this approach the data from the header is not completely parsed into its different components, so a significant part of the module involves parsing the data.

Domain Class
Figure 5.1

Figure 5.1 shows the class diagram for this approach. The classes defined are Received Header Analysis, Check For Domain and Get Domain Name. Received Header Analysis retrieves all the messages from a given folder. The folders taken into consideration are the folders with “Ham” messages and “Spam” messages. The headers are extracted from each message in each of these folders.

The headers are parsed and if only the IP address is present, then the domain name for the corresponding IP address is retrieved. This is done using the Get Domain Names module.

Using the domain name retrieved from this module, the check for domain class compares the domain names. Here the domain names are broken down into individual components such as primary domain name, secondary domain name, etc. A comparison is made with each subset of the names, and if a match is found a true value is returned.
In the received header class, a hash map is maintained to store the domain names of IP addresses already seen in the messages, so that the test runs faster. The first result of this module is the number of e-mails in which the received header and the from field matched. The second result is the number of e-mails in which the received header and the sent field matched.
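The hash map of already resolved addresses can be sketched as below. The class `CachingResolver` is hypothetical; the slow lookup is passed in as a function so that the example runs without network access, and the counter exists only to make the effect of the cache visible.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class CachingResolver {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> lookup;   // the slow reverse DNS call
    private int lookups = 0;

    CachingResolver(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    /** Returns the domain name for an IP address, calling the slow lookup at most once per address. */
    String resolve(String ip) {
        String name = cache.get(ip);
        if (name == null) {
            name = lookup.apply(ip);
            lookups++;
            cache.put(ip, name);
        }
        return name;
    }

    int lookupsPerformed() {
        return lookups;
    }
}
```

Since many messages in a mailbox pass through the same few mail servers, most calls to `resolve` are answered from the map rather than from the network.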

The results have the format shown below:

Inbox:

From match count : 1512/1631
Sender match count : 16/1631

Spam:

From match count : 345/976
Sender match count : 5/976

Here the total number of messages in the inbox, that is, the ham messages, is 1631. The number of messages in which the “Received:” header matched the “From:” header is 1512, and the number of messages in which the “Received:” header matched the “Sent:” header is 16. The same process is repeated for the spam messages.

5. Results and Analysis

Mail Boxes Considered

The sample data consists of 22 mailboxes on which the tests were performed. The details of these mailboxes are given in the following table:

Mail Box Number	Number Of Inbox Mails	Number of Spam Mails	Total Number of Mails
1	435	186	621
2	1631	976	2607
3	133	145	278
4	1703	207	1910
5	1072	0	1072
6	857	0	857
7	365	61	426
8	212	137	349
9	358	61	419
10	566	141	707
11	2187	0	2187
12	351	0	351
13	202	1	203
14	151	0	151
15	352	0	352
16	1119	21	1140
17	1047	0	1047
18	1104	39	1143
19	1237	0	1237
20	416	73	489
21	1638	0	1638
22	1334	2	1336

Figure 5.0

5.1 RECEIVED HEADER MODULE RESULTS

To compute the results for the Received Header module, the two folders considered for running the tests were ham and spam. Two counts are taken for each folder: the number of e-mail messages in which the "From:" header field matched the "Received:" header, and the number of e-mail messages in which the "Sent:" header field matched the "Received:" header.

Figure 5.1 below shows the statistics and the data gathered for the Received Header Analysis. This data corresponds to the ham message folder.
Once the statistics are analyzed, the result is then computed.
The first column is the mailbox number.
The second and third columns give the number and the percentage of e-mails in which the From header matches the Received header.
The fourth and fifth columns give the number and the percentage of e-mails in which the Sent header matches the Received header.
The last column is the total number of messages in the folder.

Mail Box Number	Ham mails matched between Received and From Headers	Percentage Ham mails matched between Received and From Headers	Ham mails matched between Received and Sent Headers	Percentage Ham mails matched between Received and Sent Headers	Total Messages
1	255	59	16	4	435
2	1431	88	63	4	1631
3	101	76	1	1	133
4	1451	85	71	4	1703
5	738	69	27	3	1072
6	665	78	42	5	857
7	289	79	21	6	365
8	159	71	1	0	212
9	221	61	10	3	358
10	513	90	1	2	566
11	1897	87	15	1	2187
12	322	92	6	2	351
13	196	97	2	1	202
14	132	87	1	1	151
15	281	80	24	7	352
16	721	64	198	18	1119
17	551	52	245	23	1047
18	984	89	27	2	1104
19	757	61	160	13	1237
20	300	72	10	2	416
21	837	51	384	23	1638
22	919	69	282	21	1334

Figure 5.1
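The percentage columns appear to be the match count divided by the folder total, rounded to a whole number. The report does not state this, so the sketch below is an assumption; it reproduces the values listed for mailboxes 1 and 2, although some other rows of the table differ from it. The class and method names are hypothetical.

```java
final class MatchPercentage {
    /** Whole-number percentage of matched messages; 0 for an empty folder. */
    static long percent(int matched, int total) {
        return total == 0 ? 0 : Math.round(100.0 * matched / total);
    }
}
```

For mailbox 1, `percent(255, 435)` gives 59 and `percent(16, 435)` gives 4, as in the table. Returning 0 for an empty folder mirrors the rows of mailboxes that contain no messages.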

Figure 5.2 below shows the statistics and the data gathered for the Received Header Analysis. This data corresponds to the spam message folder.
Once the statistics are analyzed, the result is then computed.
The first column is the mailbox number.
The second and third columns give the number and the percentage of e-mails in which the From header matches the Received header.
The fourth and fifth columns give the number and the percentage of e-mails in which the Sent header matches the Received header.
The last column is the total number of messages in the folder.

Mail Box Number	Spam mails matched between Received and From Headers	Percentage Spam mails matched between Received and From Headers	Spam mails matched between Received and Sent Headers	Percentage Spam mails matched between Received and Sent Headers	Total Messages
1	16	9	0	0	186
2	212	22	5	1	976
3	33	23	0	0	145
4	77	37	0	0	207
5	0	0	0	0	0
6	0	0	0	0	0
7	20	33	0	0	61
8	51	37	1	0	137
9	13	21	0	0	61
10	51	36	1	0	141
11	0	0	0	0	0
12	0	0	0	0	0
13	0	0	0	0	1
14	0	0	0	0	0
15	0	0	0	0	0
16	11	52	0	0	21
17	0	0	0	0	0
18	15	38	1	3	39
19	0	0	0	0	0
20	31	32	0	0	73
21	0	0	0	0	0
22	0	0	0	0	2

Figure 5.2

5.2 RECEIVED HEADER MODULE ANALYSIS

From the results tables we obtain scatter plots showing how often the From header matches the Received headers. This is done for both the ham and the spam message folders.

Figure 5.3

It can be seen from the plot that, on average, the percentage match between the received header and the from header is 75% for ham messages, while for spam messages the average percentage match is less than 10%. However, no firm conclusion can be drawn from this, as the number of spam messages in the mailboxes used was very small.

From the results tables we obtain scatter plots showing how often the Sent header matches the Received headers. This is done for both the ham and the spam message folders.

Figure 5.4

It can be seen from the plot that, on average, the percentage match between the received header and the sent header is 40% for ham messages, while for spam messages the average percentage match is less than 5%. The percentage match between the sent header and the received header is very low. One of the main reasons for this is that very few messages have the sent header at all. The check of the received header against the from header therefore provides better results than the check against the sent header.

Thus, as seen from the statistics generated for mailboxes of different sizes, the Received Header test can be considered a good test based on the following two properties:

The number of messages in the ham folder whose domains match the received header is high.
The number of messages in the spam folder whose domains match the received header is low.

It was also observed that this check fails for messages which originate from a mailing list, as the headers of messages from a mailing list are different from those of regular messages.

Result Summary:

Received Header Module : This module can be used as a fairly good filter to understand and classify messages as spam or ham.

6. Problems and Solutions

Some of the problems faced during the course of the project are listed below.

The server based approach did not attract enough e-mails, so no analysis could be made with the modules in that approach. This in turn paved the way for the standalone mail statistics generator.
One of the main problems faced was the lack of rigid rules for the format of headers. Each e-mail service adds its own variation, which makes generalization difficult. This problem had to be dealt with mainly while parsing the header.
The second problem faced was the execution time of the module.
The execution time depends on the speed of the network used to retrieve the e-mails from the server.
Without optimization, the module took more than 20 minutes to analyze a mailbox of more than 1K e-mails. I therefore started caching each IP address once its lookup was done, so that when the same IP address is found again, its domain name is read from the local cache instead of contacting the server for another lookup. This made the whole module work much faster, and it could process around 1K mails in around 2-3 minutes.
Another problem is that the number of mail boxes on which the stand alone application was run was small. To be able to decide how good the received header is as a parameter for indicating whether e-mails are ham or spam, the stand alone application has to be run on a larger number of mailboxes.

7. Tools Used

The tools used were as follows:

Navicat MySQL for creating the database for the server based approach
Netbeans IDE for writing the code for the modules, as the code was written in Java
OpenOffice and EditPlus HTML based tools for writing the report
OpenOffice Excel draw for drawing the figures
OpenOffice Drawing tool for making the class diagrams
Excel to HTML converter for converting the result tables in Excel into HTML form

8. Appendix

The following link contains the source code for the individual modules:

Received Header vs From and Sent Module

9. References

RFC 2822 [http://tools.ietf.org/html/rfc2822]
RFC 2821 [http://tools.ietf.org/html/rfc2821]
JavaMail API [http://java.sun.com/products/javamail/]
Spam Analysis and Reputation Project [http://www1.cs.columbia.edu:8080/display/spam/Home]
SARP Modules [http://www1.cs.columbia.edu:8080/display/spam/IMAP+analyzer+modules]
Professor Henning Schulzrinne - Advisor for the Project
Adrian Frei - Spam Analysis and Reputation Project: DNS Blacklists
Tejas Nadkarni - Parser and Standalone Framework
Aditi Rajoriya - Spam Analysis and Reputation Project: IMAP Retrieval and To/Body Module
Dhrumin Shah - Spam Analysis and Reputation Project: Domain Check and Image Analysis
Nirav Shah - Spam Analysis and Reputation Project: Email Source, Date/Time and Attachment Analysis
Swati Kumar - Spam Analysis and Reputation Project: Email Encryption Headers and Database Schema