Spam Analysis and Reputation Project : Domain Check and Image Analysis

By:

Dhrumin Shah

Columbia University

Department of Computer Science

New York, NY 10027

USA

Abstract

The project aims to gather statistical data about the various header and body fields present in emails and thereby differentiate between two large collections of messages: spam and non-spam (or ham). Based on the data gathered for spam and ham mails, we decide whether a particular field is good enough to be used for this classification. The two modules covered in this report are the Domain Check module and the Image Test module. If a particular field in the header or body of an email is a good indicator, its values (e.g. spam score) will differ between spam and ham mails. To gather the statistics, we run our program on folders such as Inbox, Spam and Sent that contain a substantial number of messages and collect data about the different header and body fields. After examining this data, we conclude whether the particular field is a good metric for the classification. The statistics can be generated using parameters such as the number of mails, comparisons between header or body fields, or the individual results of sources such as a blacklist or friend list. When derived for a sufficient number of mailboxes, and hence for a sufficient variety of mails, these statistics can be used to classify a particular message as spam or ham. For example, if a mail comes from a domain to which the receiver has never sent a mail, the probability of the message being spam is high; if the receiver has sent mail to that domain before, the mail will probably not be spam. So we count such mails across the set of mailboxes used for testing and, based on these counts for spam and ham mails, infer whether the given field can be used for classification.

Table of Contents

i. Abstract

1. Project Overview

2. Introduction

3. Architecture

3.1 Checking for Domain Names

3.1.1 Domain Names

3.1.2 Matching Domain

3.2 Image Analysis

4. Design and Implementation - A StandAlone Approach

4.1 DomainCheck Module
4.2 Image Analysis Module

5. Server Based Approach

5.1 Domain Check Module

5.2 Image Analysis Module

6. Results

6.1 Check Domain Module

6.2 Image Analysis Module

7. Problems and Solutions

1. Project Overview

The project is divided into modules for which statistics are to be gathered. These modules consist of the different header fields that seem important and can potentially be used for classification. The various modules are :

Friend Check
Pingable Hosts
Black Lists
Domain Check
In-reply-to
DKIM and SPF
Received Header
DHCP and DSL
Attachments
Getting hour, date and time information from the message
Whether the To field and the Body contain the name of the person
Whether there is any image in the body of the message
Columbia Internal mails

These are the main parts of the header and body on which data has to be gathered. They have been implemented as a joint effort by the team members.
The design of the project is briefly outlined below:

All modules are implemented in Java.

The javamail-1.4 library is used extensively by all the modules.

A main file called MailStats.java calls all the modules synchronously, one after the other, so that all the checks are performed on each message sequentially.

MailStats connects to the user's account on an IMAP server and starts a basic user interface with which the user can categorize their mail folders into spam, ham and sent.

The GUI has a progress bar that indicates which module is currently running and so gives feedback to the user.

MailStats passes javax.mail.Message arrays containing the spam, ham and sent messages to all the modules. The modules use these to compute their statistics individually and print a result that can later be used for analysis. The end result is the combination of the results of the individual modules.


2. Introduction

Out of the modules listed in the previous section, this report mainly concentrates on the following two modules:

Domain Check : This module deals with finding the domain name of the sender of the message and checking whether the receiver has sent any message to that domain in the past.

Image Analysis : This module checks whether there was any image present in the body of the message.

The above modules were chosen because they are used by existing spam filters such as SpamAssassin and hence can act as good classifiers of incoming mail. The results of the tests indicate whether the chosen parameters are good enough for such a classification. To gather statistics, the mailboxes of a small number of people (both Columbia and non-Columbia students) were used to provide the three message arrays, namely sent, mail and spam, to all the modules. The statistics gathered can be further improved by increasing the accuracy of the tests conducted. These statistics are then used to form the results of the test.

To gather such information for classifying the mails, two approaches were proposed:

Server Based Approach:
Initially, the statistics were to be gathered using an IMAP based server that could be used as a honey pot to attract spam mails. For this approach, the overall design was quite different from that of the other approach called the standalone program.

StandAlone Program:
The incoming mail was parsed on the fly and individual fields were then stored as a part of a database. The individual modules queried the database to get data for their individual analysis. The modules extracted the stored messages from the required tables in the database and then performed the same test on every mail received. The following sections give a general description of the standalone program and server based program.


3. Architecture

The basic architecture of the modules and what each module does is described below.

3.1 Check Domain Module

The first of the modules covered in this report is the Check Domain module. The main aim of this module is to determine which of the incoming mails (both spam and ham) have a known domain. The following sections give more details about the module.

3.1.1 Domain Names

The most common types of domain names are host names that provide memorable names to stand in for numeric IP addresses. They allow for any service to move to a different location in the topology of the Internet (or an intranet), which would then have a different IP address. By allowing the use of unique alphabetical addresses instead of numeric ones, domain names allow Internet users to more easily find and communicate with web sites and other server-based services. The flexibility of the domain name system allows multiple IP addresses to be assigned to a single domain name, or multiple domain names to be assigned to a single IP address.

Domain names are restricted to the ASCII letters "a" through "z" (case-insensitive), the digits "0" through "9", and the hyphen, with some other restrictions. For example "cs.columbia.edu", "123.abc.com", etc. There are a number of types of Domain names:

Top-Level Domains (either one of a small list of generic names of three or more characters, or a two-character territory code, e.g. .in, .uk, .us, etc.)
Second-Level Domains (the names directly to the left of .com, .net and the other top-level domains)
Third-Level Domains (the names immediately to the left of a second-level domain)
Sub-Domains (domains of third or higher level)

A domain name is one's own unique identity and remains so as long as one continues to use that name. One can easily be aware of someone else's presence by knowing their domain name. No two parties can hold the same domain name simultaneously; therefore an Internet identity is unique. Whenever messages are exchanged between a sender and a receiver, domains are exchanged as well, meaning each party is aware of who is sending the message and to whom their own message goes. The domains of the sender and the receiver can be read from the header of the message itself.

3.1.2 Matching Domain Names

Checking the domains is an important test as far as classification of the messages is concerned. If a known sender has sent a message from a known domain, then the message is not considered spam. We define a known sender as follows:

If a receiver gets a message from a sender, and the receiver has already sent a message to the same sender in the past (meaning the sender is on the sent list of the receiver), then the sender is a known sender for the receiver. The domain of this known sender is called the known domain.

MailStats connects to the user's account on an IMAP server and starts a basic user interface with which the user can categorize their mail folders into spam, ham and sent. Each message is parsed by every module to get the required data. The idea is to get the domains for each of the messages in the ham and spam folders and compare them with the domains of the sent folder. If there is a match, the message is taken not to be spam. This can be done by extracting the host name from the Received headers of the mail and using this host name to get the domain from which the message was sent. The same procedure is repeated on the messages in the sent folder of the receiver, and the two domains can be matched. If a match exists, the mail is from a known domain. The only concern here is how the match is made.

The levels to which a domain match can occur, will differ from one message to the other. The different levels are classified as follows:

Level 1 match is done on the entire host name. That is the host name of the sender will be matched completely with the host names in the messages of the sent folder of the receiver.
Level 2 match is performed only if the Level 1 match fails. In this case we take out the first token from the original host name and compare the remaining part of the host name with that of the messages in the sent folder, to check if the match occurs.
Level 3 match occurs only if the previous levels fail. Similar to Level 2, here we remove the first token from the remaining hostname, and compare it with the sent messages, to check if the match occurs.
This can be continued for several levels, until there are no more tokens left, but the last one. But usually the domains are restricted to around 3 to 4 levels, before concluding that the domains do not match.

For example:

Suppose person A received a mail from dms2169@cs.columbia.edu. To perform a match, first the entire host name, that is dms2169@cs.columbia.edu, is matched against the host names in the sent folder of A. If there is no match, we remove the first token "dms2169" and compare cs.columbia.edu at the next level of the match. If there is again no match, columbia.edu is chosen for the next level. If the match still does not occur, we conclude that the domain does not match.
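The level-by-level match described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the project's CheckDomain code: the class name DomainLevelMatcher, the method matchLevel and the maxLevels parameter are hypothetical, the input is assumed to be a bare host name (no local part or "@"), and the set of sent domains is assumed to have been built beforehand.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not the report's CheckDomain class.
class DomainLevelMatcher {

    /**
     * Returns the level (1 = full host name) at which the host matches one of
     * the sent domains, or 0 if no match is found within maxLevels.
     */
    static int matchLevel(String host, Set<String> sentDomains, int maxLevels) {
        String candidate = host.toLowerCase(Locale.ROOT);
        for (int level = 1; level <= maxLevels; level++) {
            if (sentDomains.contains(candidate)) {
                return level;
            }
            int dot = candidate.indexOf('.');
            // Stop if removing the first token would leave only the last token.
            if (dot < 0 || candidate.indexOf('.', dot + 1) < 0) {
                return 0;
            }
            candidate = candidate.substring(dot + 1);
        }
        return 0;
    }

    public static void main(String[] args) {
        Set<String> sent = new HashSet<>(Arrays.asList("columbia.edu", "gmail.com"));
        // Level 1 tries mail.cs.columbia.edu, level 2 cs.columbia.edu, level 3 columbia.edu.
        System.out.println(matchLevel("mail.cs.columbia.edu", sent, 4));
        System.out.println(matchLevel("example.org", sent, 4));
    }
}
```

The maxLevels argument mirrors the remark that matching is usually restricted to around 3 to 4 levels before concluding that the domains do not match.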

3.2 Image Analysis Module

The observation that ham mails usually do not contain images can be used as a metric or test for classification. Hence, if a mail contains an image, it is more likely to be spam, though other tests might be necessary along with this one to reach such a conclusion. A crucial step is therefore to determine whether the mail contains an image or not. Multipurpose Internet Mail Extensions (MIME) is an Internet Standard that extends the format of e-mail to support:

text in character sets other than US-ASCII;
non-text attachments;
multi-part message bodies; and
header information in non-ASCII character sets.

Thus, a MIME message may have a number of different parts, each having a different type of content. A number of Content types are supported:

simple text messages using text/plain (the default value for "Content-type:")
text plus attachments (multipart/mixed with a text/plain part and other non-text parts). A MIME message including an attached file generally indicates the file's original name with the "Content-disposition:" header, so the type of file is indicated both by the MIME content-type and the (usually OS-specific) filename extension
reply with original attached (multipart/mixed with a text/plain part and the original message as a message/rfc822 part)
alternative content, such as a message sent in both plain text and another format such as HTML (multipart/alternative with the same content in text/plain and text/html forms)
image, audio, video and application (for example, image/jpeg, audio/mpeg, video/mp4, application/msword and so on)
many other message constructs

The Content-type header indicates the Internet media type of the message content. A media type is composed of at least two parts: a type, a subtype, and optionally one or more parameters. For example, subtypes of the image type can be gif or jpeg, etc., depending on the type of the image. Thus, to check whether a mail contains an image or not, we simply need to check the Content-type header and see if its type is "image", with any subtype.
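As a small illustration of the type/subtype structure, the following Java sketch extracts the primary type from a Content-type value and tests whether it is "image". The class and method names (MediaTypeCheck, primaryType, isImage) are hypothetical and use plain string handling only; they are not part of the project or of the JavaMail library.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
class MediaTypeCheck {

    /** Returns the primary type, e.g. "image" for "image/gif; name=a.gif". */
    static String primaryType(String contentType) {
        String value = contentType.trim();
        int semicolon = value.indexOf(';');
        if (semicolon >= 0) {
            value = value.substring(0, semicolon); // drop optional parameters
        }
        int slash = value.indexOf('/');
        String type = slash >= 0 ? value.substring(0, slash) : value;
        return type.trim().toLowerCase(Locale.ROOT); // media types are case-insensitive
    }

    /** True when the primary type is "image", whatever the subtype. */
    static boolean isImage(String contentType) {
        return primaryType(contentType).equals("image");
    }
}
```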

Spammers generally make use of a technique called image spam, in which the text of the message is stored as a GIF or JPEG image and displayed in the email. Filtering messages with image spam is more difficult than filtering text-only messages, as traditional methods are not effective. Image-based spam is a particularly difficult problem for two reasons: first, it is much harder to detect with conventional spam filtering and blocking technologies, and second, it is typically much larger than normal text-based spam, consuming much more bandwidth and storage.

Thus, checking for images in incoming mails could be used for classification, up to a certain extent. Generally, legitimate mails, i.e. ham, contain only text for communication. So the percentage of ham messages containing images is usually low, which indicates that a message containing an image is more likely to be spam, though this may not always be the case, as discussed later.


4. Design and Implementation - A Standalone Approach

This section of the report provides the basic design and implementation of the modules using the Standalone Approach. It gives a detailed description of how the modules are programmed to produce the desired results. The Standalone Approach was designed to run on all the mails in the spam and ham folders of a particular mailbox. Each mail was checked, one after another, by all the modules, and in the end the results of all the individual modules were gathered for the required classification. The section begins with the basic class structure of the modules, describing how they are arranged, followed by the implementation, describing the basic idea and functionality behind the modules. We study each module separately, as follows:

Domain Check - Design and Implementation

Design

The design of the Domain Check module is object-oriented and follows a class structure with methods and variables. The class diagram in the following figure shows the basic representation of how the classes are arranged in order to provide the required functionality.

Domain Class
Figure 4.1

As shown in Figure 4.1 above, Module forms the base class. Another class, CheckDomain, inherits from this base class and accesses all the methods and variables defined with public and protected access in the Module class. This specialized class gets the messages in the form of three message arrays, as shown, on which the domain test is performed. Hashset is a global data structure in this class, used to store the domains of the sent messages, and domainresult is a string used to store the results of the Check Domain module. The following methods are also used in this module:
sentList(): This function divides the domain of each sent message into a number of tokens and stores these tokens in the hashset.
domainCheck(): This function performs the matching test on spam as well as non-spam messages.

Implementation

The CheckDomain class in Figure 4.1 receives three message arrays, sent, mail and spam, from the base class. It begins by working on the messages in the sent folder and storing the domains of these messages in the hashset. This can be done using the message.getAllRecipients() function, which returns all the recipient addresses of a particular message as an array of Address objects. Once a domain is extracted from an address, it is divided into a number of tokens. Tokens can be formed from a domain using the split function, which takes a delimiter string as input. If a token is already present in the hashset, it is not added again.
For example: if the receiver sent a message to the domain cs.columbia.edu, then the tokens that could be added to the hashset are cs.columbia.edu and columbia.edu, but if any of these domains is already present in the hashset, it is not entered again. This speeds up the search process, as there are fewer tokens to compare, and thus improves efficiency.
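The tokenizing step can be sketched as follows. This is an illustrative sketch, not the project's sentList() method: the class SentDomainIndex and its build method are hypothetical names, and the recipients are assumed to be already available as bare address strings (such as "user@cs.columbia.edu") rather than javax.mail.Address objects.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not the report's sentList() method.
class SentDomainIndex {

    /** Collects the domain of every recipient, plus its shorter suffixes, in a set. */
    static Set<String> build(List<String> recipientAddresses) {
        Set<String> domains = new LinkedHashSet<>();
        for (String address : recipientAddresses) {
            int at = address.lastIndexOf('@');
            if (at < 0 || at == address.length() - 1) {
                continue; // no domain part to index
            }
            String domain = address.substring(at + 1).toLowerCase(Locale.ROOT);
            String[] tokens = domain.split("\\.");
            // Keep every suffix that still has at least two tokens,
            // e.g. cs.columbia.edu and columbia.edu, but not edu alone.
            for (int start = 0; start + 2 <= tokens.length; start++) {
                String suffix = String.join(".", Arrays.copyOfRange(tokens, start, tokens.length));
                domains.add(suffix); // a set silently ignores duplicates
            }
        }
        return domains;
    }
}
```

Because the set ignores duplicates, a second message sent to columbia.edu adds nothing new, which is the behaviour the paragraph above describes.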

Once the domains of the sent mails have been stored in the hashset by means of a sentList() call, the actual process of matching the domains begins. The method domainCheck() is called once each for the messages of the non-spam and the spam folders. For each folder, the concept of matching domains described in section 3.1.2 is used to find a match. At each level, we check whether the remaining domain matches any of the tokens stored in the hashset. If it does not match, the first token is removed and the process repeats for the remaining domain, until there is a match or no tokens are left to remove. An example is given at the end of section 3.1.2. For each match, the count of matched domains is incremented. We are interested in the following counts: ham domain count, spam domain count, ham mailing list domain count, spam mailing list domain count, ham mailing list sender domain count and spam mailing list sender domain count. The result is computed and the values are written into the public variable result of the base class. The ham and spam domain counts are the number of domains matched for the ham folder and the spam folder respectively. The mailing list domain counts and mailing list sender domain counts (for both ham and spam) relate to mails from mailing lists. These are described in detail below.

A mailing list is a collection of names and addresses used by an individual or an organization to send material to multiple recipients. The term is often extended to include the people subscribed to such a list, so the group of subscribers is referred to as "the mailing list". A mailing list is simply a list of e-mail addresses of people who are interested in the same subject. When a member of the list sends a note to the group's special address, the e-mail is broadcast to all the members of the list.

Thus there are two domains of interest: the domain of the mailing list itself and the domain of the sender who has sent a mail to the special address of the mailing list. The sender may not be from a domain known to the receiver, while the domain of the mailing list is known, or it could be the other way around. Hence we obtain these counts separately and analyse them separately as well. This indicates how many mailing lists are known, and how many of the senders who post to mailing lists to which the receiver has subscribed are known to the receiver. The first step is therefore to identify whether or not the mail comes from a mailing list. This can be done by checking for the List-Id field in the header of the message. If the List-Id header is present, the mail is from a mailing list. Once the mail has been identified as coming from a mailing list, we can use the same domain match implementation as for non-mailing list mails, except that the intermediate results are stored in different variables: mailing list domain count and mailing list sender domain count.
For example, when the mailbox "dpr2110@columbia.edu" was tested with the Domain Check module, it gave the following results:

Domain Testing Module
HAM
Total Mails:1174
Total Non-MailingList Mails: 830
Total Mails whose Domain Matched: 746/830
Total Mails whose Domain did not Match: 84/830
Ham mailing List Senders domain match: 324/344
Ham mailing List domain match: 344/344

The mailbox contains 1174 mails, of which 830 are non-mailing list mails. Among the non-mailing list mails, there are 746 mails whose domains matched and 84 mails whose domains did not match. Of the 344 mailing list mails, 324 have senders with known domains and 20 do not. All the mailing list mails have known mailing list domains.
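The way the separate counters are kept can be sketched as below. This is a hypothetical illustration rather than the project's code: DomainTally and MailSummary are invented names, headers are modelled as a plain Map, and the two boolean flags are assumed to come from the domain match described in section 3.1.2. Only the List-Id test is taken from the report.

```java
import java.util.Map;

// Hypothetical tally for illustration; field names are not from the report.
class DomainTally {

    /** Per-message summary: headers plus the outcome of the two domain checks. */
    static class MailSummary {
        final Map<String, String> headers;
        final boolean senderDomainKnown;
        final boolean listDomainKnown;

        MailSummary(Map<String, String> headers, boolean senderDomainKnown, boolean listDomainKnown) {
            this.headers = headers;
            this.senderDomainKnown = senderDomainKnown;
            this.listDomainKnown = listDomainKnown;
        }
    }

    int nonListTotal;
    int nonListMatched;
    int listTotal;
    int listSenderMatched;
    int listDomainMatched;

    /** A mail is treated as mailing list mail when a List-Id header is present. */
    static boolean isMailingList(Map<String, String> headers) {
        for (String name : headers.keySet()) {
            if (name.equalsIgnoreCase("List-Id")) {
                return true;
            }
        }
        return false;
    }

    /** Updates the mailing list or non-mailing list counters for one message. */
    void add(MailSummary mail) {
        if (isMailingList(mail.headers)) {
            listTotal++;
            if (mail.senderDomainKnown) listSenderMatched++;
            if (mail.listDomainKnown) listDomainMatched++;
        } else {
            nonListTotal++;
            if (mail.senderDomainKnown) nonListMatched++;
        }
    }
}
```

Applied to the mailbox above, such a tally would end with nonListTotal = 830 and listTotal = 344, which together give the 1174 mails reported.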

Image Analysis - Design and Implementation

Design

The design of the Image Analysis module is object oriented and follows a class structure with methods and variables. The class diagram in the following Figure 4.2 shows the basic representation of how the classes are arranged in order to provide the required functionality.

Figure 4.2

As shown above in Figure 4.2, there is no individual class for the Image Analysis module. This module was made a method of another class because it requires the entire body of the message to be read, which takes a long time. Hence, to save time and ensure that the body is read only once, the module was integrated into the "From_Body" module, implemented by Aditi Rajoria. This program was also separated from the remaining ones, to ensure that the time taken by the other modules as a whole to produce results does not increase further. Hence the main module is run again to pass the message arrays to the From_Body module and produce the results.

Just as in the previous module, there is a base class called Module which reads in the messages from the folders of the mailbox, forms three message arrays and passes them to all the individual modules. The inherited class From_Body accepts these arrays and performs the required task. The image_test() method in this inherited class is the only method of interest for the Image Test module. It checks whether a message (either spam or non-spam) has an image as part of its content.

Implementation

The image testing module is implemented using the image_test method in the inherited class From_Body. This method tests all the mails in the non-spam and spam folders and counts the number of messages having images in the body. This is done by checking the content type of every part of the message body. As discussed earlier, a mail can be divided into a number of parts, each containing a different multimedia object of the mail, e.g. text, image, video, etc. Hence the content type is checked for each of the parts until a part with the primary type "IMAGE" is found. A content type consists of a primary type and a secondary type (subtype). We always check for the primary type "IMAGE", since the secondary type can be anything, e.g. GIF, JPEG, etc. For each image found, we increment the image count by one. The above process is done for all the mails in the spam as well as the non-spam folders.
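The part-by-part check can be sketched as a recursive walk over the MIME structure. The sketch below is an assumption-laden stand-in, not the image_test method itself: ImagePartCounter and its nested Part class are hypothetical and replace the JavaMail message objects so that the example is self-contained. With JavaMail, the equivalent test on a real part could use Part.isMimeType("image/*"), with Multipart.getCount() and Multipart.getBodyPart(int) to visit the children.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the JavaMail structures, for illustration only.
class ImagePartCounter {

    /** Minimal MIME part: a content type and, for multipart types, child parts. */
    static class Part {
        final String contentType;
        final List<Part> children = new ArrayList<>();

        Part(String contentType, Part... kids) {
            this.contentType = contentType;
            for (Part kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Counts the parts whose primary type is "image", at any nesting depth. */
    static int countImages(Part part) {
        String type = part.contentType.trim().toLowerCase(Locale.ROOT);
        if (type.startsWith("image/")) {
            return 1;
        }
        int count = 0;
        for (Part child : part.children) {
            count += countImages(child);
        }
        return count;
    }

    /** Counts the messages that contain at least one image part. */
    static int countMessagesWithImages(List<Part> messages) {
        int withImages = 0;
        for (Part message : messages) {
            if (countImages(message) > 0) {
                withImages++;
            }
        }
        return withImages;
    }
}
```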

According to this test, the image count of the spam mails should be higher, since spam messages are likely to contain images, and that of the non-spam mails should be low, as legitimate mail conversation rarely includes images.


5. Server Based Approach

For the server based approach, a database schema was used. The database contained a number of tables and, depending on the requirements of the different modules and the RFC 2822/RFC 2821 architecture, various header fields were stored in the database, e.g. the To and From fields, by, Return-Path, etc.

All the modules that were implemented for the standalone program were also implemented for the server based approach. The Domain Check and Image Testing modules were triggered by the parser itself and, based on the requirements of the individual modules, various fields of a particular mail were retrieved from the database. The actual test was then done on these field values of the message. The database stored every incoming mail and passed its identifier to the different modules, each performing a specific test. The server based approach was used for the following modules:

Domain Check Module - Design and Implementation : A Server Based Approach

The design of the Domain Check module was object-oriented and followed a class structure with methods and variables. The class diagram in the following figure shows the basic representation of how the classes were arranged in order to provide the required functionality, using a Server based approach.

Domain Data Flow Diagram
Figure 5.1

As shown in Figure 5.1 above, the basic design of the Domain Check module using the Server Based Approach consisted of three classes.

The Controller class is the basic class. It is responsible for reading the text file containing the module names, using a readFile() method, and for calling these modules for each incoming message, using the callModules() method. The msg_id of each message is stored in the moduleArray and passed to all the individual modules, so that each module can perform its check on that particular message.

The DomainCheck class is responsible for performing the domain match test on the incoming mail, using the concept of matching domains described in section 3.1.2. Using the msg_id passed by the Controller class, the messages of the sent, non-spam and spam folders are stored in the "sent", "mail" and "spam" strings, and the domain test is then performed on these strings. As the results are obtained, they are appended to the string "result" and retrieved using the getResult() method.

The JdbcConnection class is responsible for database connectivity and for retrieving the required fields, such as to and from, from the database for the incoming message whose msg_id is known. The data is retrieved using the mails object of type Mail, in the getTo() and getFrom() methods of the class. An object of the JdbcConnection class is created in the DomainCheck class, so that the data can be retrieved from the database for each incoming message. After all the data has been retrieved, the closeConnection() method is called to close the connection established with the database by the Connection object con.
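The interaction between the controller and a module can be sketched roughly as follows. Everything beyond the names callModules and msg_id is an assumption made for illustration: the CheckModule interface, the register method and the DomainCheckModule class are hypothetical, and an in-memory Map stands in for the database that the report accesses through JdbcConnection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the controller/module interaction; not the report's code.
class ControllerSketch {

    /** Assumed module contract: run one check on the message with the given id. */
    interface CheckModule {
        String check(int msgId);
    }

    /** Stand-in for DomainCheck; a Map replaces the JDBC lookup used in the report. */
    static class DomainCheckModule implements CheckModule {
        private final Map<Integer, String> fromDomainById; // assumed: msg_id -> sender domain
        private final Set<String> sentDomains;             // assumed: domains from the sent table

        DomainCheckModule(Map<Integer, String> fromDomainById, Set<String> sentDomains) {
            this.fromDomainById = fromDomainById;
            this.sentDomains = sentDomains;
        }

        @Override
        public String check(int msgId) {
            String domain = fromDomainById.get(msgId);
            if (domain == null) {
                return "DomainCheck: unknown message";
            }
            return sentDomains.contains(domain) ? "DomainCheck: match" : "DomainCheck: no match";
        }
    }

    private final List<CheckModule> modules = new ArrayList<>();

    void register(CheckModule module) {
        modules.add(module);
    }

    /** Passes the message id to every registered module, in registration order. */
    List<String> callModules(int msgId) {
        List<String> results = new ArrayList<>();
        for (CheckModule module : modules) {
            results.add(module.check(msgId));
        }
        return results;
    }
}
```

Passing only the identifier keeps the modules independent of how the message is stored, which matches the description of each module fetching the fields it needs by msg_id.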

The following shows the flow of information in the Domain Check module.

Domain Data Flow Diagram
Figure 5.2

The Domain Check module, like the other modules, extracted messages from the database using the message id passed to it by the Controller program. Each of the messages in the non-spam and spam folders was extracted from the database.

For domain matching, the domains of the received message were compared with the IMAP sent folder of the receiver, which consisted of the domains to which the receiver had sent messages in the past. An IMAP Retrieval Module was responsible for populating the data in the Message Database. Whenever a message id was passed to the Domain Check module, the database was scanned, the received domain was compared with all the domains stored in the Imap Sent table, and the results of the match were stored back in the database.

Image Analysis - Design and Implementation : A Server Based Approach

The Image Testing module would have been implemented in a similar way to the Domain Check module, using the Server Based Approach. The messages would be stored in the database, the msg_id would be passed by the controller and, using this msg_id, messages would be extracted from the database for image testing. Only the required tables would be accessed, such as the spam and to tables, containing the various fields of the spam and non-spam messages.

Domain Data Flow Diagram
Figure 5.3

The following flow model was developed for the Image Testing Module.

Image Test Data Flow Diagram
Figure 5.4

The image testing would be done in the same way as described for the standalone program, using the checkImage() method. The only difference is that it would have been done on the fly: for every incoming mail that is stored, the Image Test program would run as well, after getting the identifier of the new message. With this msg_id from the controller, the Image Test module would extract the corresponding message from the database and perform the image test on it. The results of this test would have been stored back in the database.

The Domain Check and the Image Testing modules using the Server Based Approach, faced a lot of problems, as mentioned in Problems and Solutions in Section 7, due to which the approach of implementation was changed to Standalone Program, as described in the previous sections. However, the Server Based Approach is a good idea to implement and could be used, provided that the honey-pot server continues to attract more spam messages in future.


6. Results

SAMPLE DATA SET

The sample data set consists of 22 mailboxes on which the tests were performed. The details of these mailboxes are given in the following table:

Mailbox #	Mailbox Name	Ham Mails	Spam Mails	Total Mails
1	aditi_columbia	1818	0	1818
2	aditi_gmail	497	96	593
3	Deepti_columbia	1174	0	1174
4	Deepti_gmail	576	65	641
5	dhrumin_gmail	5002	103	5105
6	pinank_gmail	1418	264	1682
7	Preetinarayan_columbia	1230	0	1230
8	Preetinarayan_gmail	1788	204	1992
9	sneha_gmail	133	227	360
10	spinank_gmail	524	355	879
11	vasa_columbia	168	0	168
12	dms2169_columbia	1301	21	1322
13	nirav_gmail	1360	48	1408
14	nns_2108	934	0	934
15	manish_gmail	414	45	459
16	pragni_gmail	1999	184	2183
17	preetimalik_columbia	527	0	527
18	preetimalik_gmail	380	0	380
19	sak2144	749	0	749
20	shradha_columbia	140	0	140
21	shradha_gmail	1151	371	1522
22	vasa_gmail	2367	890	3257

Figure 6.0

6.1 DOMAIN CHECK MODULE

While computing the results and statistics for the Domain Check module, we divide the messages from the mailboxes into two classes: Mailing List messages and Non-Mailing List messages.
We take each class in turn and compute the statistics for the mails belonging to that class only.

Non-Mailing List Analysis
Figure 6.1 below shows the statistics gathered for non-mailing list mails, for both ham and spam messages. The mailbox numbers correspond to the mailboxes in the sample data set (Figure 6.0).
Once the statistics are analyzed, the result is computed.

Column 2 : Number of mails in the Ham folders that do not come from mailing lists.
Column 3 : Number of mails from Column 2 whose domains match.
Column 4 : Percentage of mails in Column 2 whose domains match.
Columns 5, 6 and 7 correspond to Columns 2, 3 and 4 respectively, but for the Spam folders.

Mailbox #	# Non-Mailing List Ham Mails	# Non-Mailing List Ham Domains Matched	% Non-Mailing List Ham Domains Matched	# Non-Mailing List Spam Mails	# Non-Mailing List Spam Domains Matched	% Non-Mailing List Spam Domains Matched
1	1175	1050	89	0	0	NA
2	438	281	64	96	0	0
3	830	746	90	0	0	NA
4	576	430	75	65	4	6
5	1717	806	47	103	1	1
6	867	202	23	264	7	3
7	764	710	93	0	0	NA
8	1059	786	74	204	3	2
9	133	105	79	227	1	1
10	522	23	4	223	1	1
11	157	122	78	0	0	NA
12	858	715	83	21	1	5
13	1290	274	21	48	4	8
14	670	541	81	0	0	NA
15	414	208	50	45	1	2
16	1590	1021	64	179	6	3
17	480	392	82	0	0	NA
18	326	311	95	0	0	NA
19	684	624	91	0	0	NA
20	140	85	61	0	0	NA
21	1151	983	85	371	50	13
22	1468	1318	90	890	51	6

Figure 6.1
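
The percentage columns of figure 6.1 follow directly from the two count columns. A minimal Java sketch of that computation is given below. The class and method names are illustrative and not taken from the project source, and rounding to the nearest integer is an assumption, since the report does not state its rounding rule. A folder with no mails yields NA rather than a division by zero.

```java
import java.util.OptionalInt;

class MatchFraction {
    /** Percentage of matched mails, or empty when the folder holds no mails (NA in the tables). */
    static OptionalInt percentMatched(int matched, int total) {
        if (total == 0) {
            return OptionalInt.empty();
        }
        // Assumption: round to the nearest whole percent.
        return OptionalInt.of((int) Math.round(100.0 * matched / total));
    }

    /** Renders the value the way the tables do: a number, or "NA". */
    static String format(OptionalInt pct) {
        return pct.isPresent() ? Integer.toString(pct.getAsInt()) : "NA";
    }

    public static void main(String[] args) {
        // Mailbox 1 of figure 6.1: 1050 matched out of 1175 ham mails, no spam mails.
        System.out.println(format(percentMatched(1050, 1175)));
        System.out.println(format(percentMatched(0, 0)));
    }
}
```

Returning an empty value for an empty folder keeps NA entries distinct from a genuine 0%, which matters when the values are later plotted or counted against a threshold.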

Based on the data gathered, a scatter plot of the percentage of mails whose domains matched, for both ham and spam, is constructed as shown in figure 6.2. It can be observed that there are a few mailboxes for which the '% ham domains matched' is low. This is possible if the user does not use this particular mailbox for sending mails to others, but has advertised his/her mailbox address to receive mails. In that case there are very few domains in the list of sent domains, and hence the percentage of matched domains for such a mailbox is low, as shown below.

Non-Mailing List Analysis

Figure 6.2

As the scatter plot shows, the percentage of non-mailing list ham mails whose domains match the domains of the sent messages lies in the 60%-100% region for most mailboxes (17 of the 22 in figure 6.1), whereas it is below 20% for the spam mails in every mailbox. Based on this, it can be concluded that most non-mailing list ham mails come from matching domains (domains to which the user has sent mail), indicating that these mails are indeed not spam. This is not true for the spam mails: for most of them the domains do not match, so a mail whose domain does not match is more likely to be spam.

Mailing List Analysis
Figure 6.3 shows the statistics and data gathered for Mailing List mails, for both Ham and Spam messages. The mailbox numbers correspond to the mailboxes in the ResultSet.

Column 2 : Number of mails in the Ham folders that come from Mailing Lists (ML).
Column 3 : Number of mails in Column 2 whose Mailing List domain matches.
Column 4 : Percentage of mails in Column 2 whose Mailing List domain matches.
Column 5 : Number of mails in Column 2 whose sender domain matches.
Column 6 : Percentage of mails in Column 2 whose sender domain matches.
Columns 7 to 11 give the same quantities for Spam folders, in a slightly different order: Columns 7, 8 and 9 are the counts (corresponding to Columns 2, 3 and 5), and Columns 10 and 11 are the percentages (corresponding to Columns 4 and 6).

MailBox #	# ML Ham Mails	# ML Ham Domains Matched	% ML Ham Domains Matched	# ML Sender Ham Domains Matched	% ML Sender Ham Domains Matched	# ML Spam Mails	# ML Spam Domains Matched	# ML Spam Sender Domains Matched	% ML Spam Domains Matched	% ML Spam Sender Domains Matched
1	643	523	81	623	97	0	0	0	NA	NA
2	59	59	100	36	61	0	0	0	NA	NA
3	344	344	100	324	94	0	0	0	NA	NA
4	0	0	NA	0	NA	0	0	0	NA	NA
5	3285	3266	99	3035	92	0	0	0	NA	NA
6	551	540	98	519	94	0	0	0	NA	NA
7	466	376	81	450	97	0	0	0	NA	NA
8	729	680	93	716	98	0	0	0	NA	NA
9	0	0	NA	0	NA	0	0	0	NA	NA
10	2	0	0	0	0	132	0	0	0	0
11	11	11	100	11	100	0	0	0	NA	NA
12	443	437	99	437	99	0	0	0	NA	NA
13	70	0	0	12	17	0	0	0	NA	NA
14	264	264	100	264	100	0	0	0	NA	NA
15	0	0	NA	0	NA	0	0	0	NA	NA
16	409	407	99	401	98	5	5	5	100	100
17	47	47	100	47	100	0	0	0	NA	NA
18	54	54	100	50	93	0	0	0	NA	NA
19	65	65	100	65	100	0	0	0	NA	NA
20	0	0	NA	0	NA	0	0	0	NA	NA
21	0	0	NA	0	NA	0	0	0	NA	NA
22	899	885	98	872	97	0	0	0	NA	NA

Figure 6.3

As seen in the table of figure 6.3, a mailing list message carries two domains: the domain of the sender, who sent the mail to the mailing list address before it was distributed to the subscribers, and the domain of the mailing list itself. For example, if a person with email id abc@columbia.edu sends a mail to the mailing list xyz@yahoogroups.com, then all the receivers who have subscribed to this mailing list will receive a message carrying two domains, columbia.edu and yahoogroups.com, both of which can be checked for a match. The following observations were made after the data was collected:

1. In figure 6.3, most of the fields in the last two columns (% ML Spam Domains Matched and % ML Spam Sender Domains Matched) contain the value NA. This is because, for such mailboxes, there are no mailing list spam mails. Very few mails that come from a mailing list are spam, since the user has subscribed to the mailing list himself. So the possibility of a mail coming from a mailing list being spam is very low.

2. There are also a few NA values in the '% ML Ham Domains Matched' and '% ML Sender Ham Domains Matched' columns. This is because, for such mailboxes, there were no mails coming from mailing lists. This typically happens when either the user has not subscribed to any mailing list, or the mailing lists to which the user has subscribed are not active.
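
The two-domain check described above for mailing list messages can be sketched in Java as follows. This is an illustration under stated assumptions, not the project's CheckDomain code: the class and method names are hypothetical, the sender address and the list address are assumed to have already been read from the message headers, and the set of sent domains is assumed to have been built from the Sent folder.

```java
import java.util.Locale;
import java.util.Set;

class MailingListDomainCheck {
    /** Returns the lower-cased part after the last '@', or null if the address has none. */
    static String domainOf(String address) {
        int at = address.lastIndexOf('@');
        if (at < 0 || at == address.length() - 1) {
            return null;
        }
        return address.substring(at + 1).trim().toLowerCase(Locale.ROOT);
    }

    /** True if the address's domain is one the user has sent mail to. */
    static boolean matches(String address, Set<String> sentDomains) {
        String domain = domainOf(address);
        return domain != null && sentDomains.contains(domain);
    }

    public static void main(String[] args) {
        // Hypothetical sent-domain list; the addresses are the ones used in the example above.
        Set<String> sentDomains = Set.of("columbia.edu");
        System.out.println("sender matches: " + matches("abc@columbia.edu", sentDomains));
        System.out.println("list matches:   " + matches("xyz@yahoogroups.com", sentDomains));
    }
}
```

Checking the sender domain and the list domain separately is what produces the two pairs of count and percentage columns in figure 6.3.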

The percentages for the sender's domain and the mailing list domain are both plotted (only for non-NA values) on a scatter graph, for both ham and spam mails, as shown in figure 6.4.

Mailing List Analysis

Figure 6.4

The scatter plots in figure 6.4 show that, for most of the mailboxes, the domains of nearly 100% of the ham mails coming from mailing lists match the domains of the mails sent by the user. In other words, the user knows the domains of the mails coming from the mailing lists, which is as expected, since the user has subscribed to these mailing lists himself and these mails should not be spam.
On the other hand, as mentioned in the observations above, there are very few spam messages from mailing lists, and where there are any, very few of them have a domain that matches that of the sent messages.

The mailbox # refers to the mailbox in the Resultset, which was created at the time of testing and a link to which is available in the References. If we observe the graph and the Resultset, there are some mailboxes for which the number of ham messages matching the sent mails is low. This is possible because the receiver might not send mails using this mailbox, or because the sent folder contains very few mails, meaning the receiver does not send mails often.

Matching Analysis

Thus, as seen from the statistics generated for mailboxes of different sizes, the Domain Check test can be considered a good test based on the following two properties:

The number of messages in the non-spam folder whose domains match, should be high.
The number of messages in the spam folder whose domains match, should be low.

For any test to be considered a good test, the above two properties need to be satisfied. In order to generate the statistics, we divided the received mails into two broad classes, Mailing list and Non-mailing list. We took the percentage of mails whose domains matched and plotted it on a scatter graph, for every mailbox and for each mail folder (ham and spam).
After careful analysis, it was found that the percentage of mails in the ham folder whose domains matched was high for both Mailing list and Non-mailing list mails, and that this percentage was low for the mails in the spam folder. This result is based on the mailboxes taken for testing.

Hence, in order to generalize this test, we can define a threshold value of, say, 60% (or any value above 50%), count the number of mailboxes for which the percentage of ham mails whose domains matched is above the threshold, and then conclude whether the test is suitable for a particular resultset. In most cases this count is much higher than the count of the remaining mailboxes in the resultset. Domain Check can thus be a good test for the required classification.
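
The threshold rule described above can be illustrated with the ham percentages of figure 6.1. The Java sketch below is only an illustration: the class and method names are hypothetical, NA entries are represented as null and skipped, and "above the threshold" is read as strictly greater than the threshold.

```java
import java.util.Arrays;
import java.util.List;

class ThresholdCount {
    /** Counts mailboxes whose percentage exceeds the threshold; null entries (NA) are skipped. */
    static long countAbove(List<Integer> percentages, int threshold) {
        return percentages.stream()
                .filter(p -> p != null && p > threshold)
                .count();
    }

    public static void main(String[] args) {
        // "% Non-Mailing List Ham Domains matched" column of figure 6.1, mailboxes 1 to 22.
        List<Integer> hamPct = Arrays.asList(
                89, 64, 90, 75, 47, 23, 93, 74, 79, 4, 78,
                83, 21, 81, 50, 64, 82, 95, 91, 61, 85, 90);
        // "% Non-Mailing List spam Domains Matched" column, with NA stored as null.
        List<Integer> spamPct = Arrays.asList(
                null, 0, null, 6, 1, 3, null, 2, 1, 1, null,
                5, 8, null, 2, 3, null, null, null, null, 13, 6);
        System.out.println("ham above 60%:  " + countAbove(hamPct, 60) + " of " + hamPct.size());
        System.out.println("spam above 60%: " + countAbove(spamPct, 60));
    }
}
```

For this resultset the sketch reports 17 of 22 mailboxes above 60% for ham and none for spam, which is the separation the two properties ask for.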

6.2 IMAGE ANALYSIS MODULE

Proceeding in the same manner as for the Domain Check test, the received mails are divided into Mailing List and Non-mailing list mails; each class is then tested individually, and statistics for each are generated and studied to find the results of the Image test.

Non-Mailing List Analysis
From the total mails received, the mails that do not come from a mailing list are separated out and data is gathered for them, as shown in figure 6.5.

Column 2 : Number of mails in the Ham folders that do not come from Mailing lists.
Column 3 : Number of mails from Column 2 that contain images in the body.
Column 4 : Percentage of mails in Column 2 that contain images.
Columns 5, 6 and 7 have the same meaning as Columns 2, 3 and 4 respectively, but for Spam folders.

MailBox #	Non-mailingList Ham Messages	# Non_mailing list ham Images	% Ham Image Count	Non-mailingList Spam Messages	# Non-Mailing list Spam Images	% Spam Image Count
1	1175	363	31	0	0	NA
2	438	150	34	96	1	1
3	830	82	10	0	0	NA
4	576	268	47	65	4	6
5	1717	311	18	103	0	0
6	867	145	17	264	31	12
7	764	90	12	0	0	NA
8	1059	314	30	204	6	3
9	133	57	43	227	9	4
10	522	15	3	223	0	0
11	157	30	19	0	0	NA
12	858	81	9	21	1	5
13	1290	423	33	48	0	0
14	670	130	19	0	0	NA
15	414	138	33	45	1	2
16	1590	381	24	179	0	0
17	480	160	33	0	0	NA
18	326	100	31	0	0	NA
19	684	137	20	0	0	NA
20	140	60	43	0	0	NA
21	1151	300	26	371	10	3
22	1468	866	59	890	0	0

Figure 6.5

As shown in figure 6.5, some of the fields are NA, for the same reason as mentioned before: the mailbox has no spam mails of this class. The percentage of mails containing images is computed for each mailbox and each mail folder (ham and spam), and a scatter graph of these values is plotted to see in which regions they lie. Figure 6.6 shows the scatter plot.

Non-Mailing List Image Analysis

Figure 6.6

As seen above, the '% Non-mailing list ham image count' values are not concentrated in the high-value region, unlike in the Domain Check module. For nearly all the mailboxes, the value is under 50% for ham messages. This indicates that ham messages are communicated mostly through text and that not many of them contain images. Thus, if a message contains an image, the possibility of that message being ham is low.

But at the same time, in figure 6.6, all the mailboxes have a value of less than 30% for the spam messages. This might not be the general case and could depend on the type of spam message, the type of spammer, etc. Hence we cannot classify only on the basis of whether a message contains an image or not, since a message not containing an image can be either ham or spam, as per the statistics generated.

Thus, for Non-mailing list messages, the Image test may not be of any use on its own, or it may be used alongside other relevant tests to provide the required classification.
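
For reference, one way the "contains an image" decision behind the counts above could be made is sketched below in Java. This is an assumption for illustration only and not the project's image_test() function: it treats a message as containing an image when any of its MIME parts declares an image/* content type, and it takes the content types as plain strings so that the sketch stays self-contained. Images that are only linked from an HTML body would not be detected this way.

```java
import java.util.List;
import java.util.Locale;

class ImageTestSketch {
    /** True if any MIME part of the message declares an image/* content type. */
    static boolean containsImage(List<String> partContentTypes) {
        for (String contentType : partContentTypes) {
            if (contentType != null
                    && contentType.trim().toLowerCase(Locale.ROOT).startsWith("image/")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical content types of the parts of two messages.
        List<String> withImage = List.of("text/plain; charset=UTF-8", "image/jpeg; name=\"a.jpg\"");
        List<String> textOnly = List.of("text/html; charset=UTF-8");
        System.out.println(containsImage(withImage));
        System.out.println(containsImage(textOnly));
    }
}
```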

Mailing List Analysis

Proceeding in the same manner as for the Non-Mailing List mails, the mails from Mailing lists are separated out and data is gathered for them, as shown in figure 6.7.

Column 2 : Number of mails in the Ham folders that come from Mailing lists.
Column 3 : Number of mails from Column 2 that contain images in the body.
Column 4 : Percentage of mails in Column 2 that contain images.
Columns 5, 6 and 7 have the same meaning as Columns 2, 3 and 4 respectively, but for Spam folders.

MailBox #	Mailing list Ham Messages	# MailingList Ham Images	% Ham image count	Mailing list Spam Messages	# MailingList Spam Images	% Spam image count
1	643	67	10	0	0	NA
2	59	8	14	0	0	NA
3	344	70	20	0	0	NA
4	0	0	NA	0	0	NA
5	3285	676	21	0	0	NA
6	551	0	0	0	0	NA
7	466	0	0	0	0	NA
8	729	26	4	0	0	NA
9	0	0	NA	0	0	NA
10	2	0	0	132	2	2
11	11	2	18	0	0	NA
12	443	48	11	0	0	NA
13	70	3	4	0	0	NA
14	264	18	7	0	0	NA
15	0	0	NA	0	0	NA
16	409	27	7	5	1	20
17	47	7	15	0	0	NA
18	54	12	22	0	0	NA
19	65	13	20	0	0	NA
20	0	0	NA	0	0	NA
21	0	0	NA	0	0	NA
22	899	397	44	0	0	NA

Figure 6.7

Once the required data is gathered, it can be represented with the help of a scatter graph. One such plot is shown in figure 6.8.

Mailing List Image Analysis

Figure 6.8

As shown in the scatter plot in figure 6.8, the test does not give good results. We cannot infer anything from this test for Mailing list messages, due to the following two observations:

1. There are no spam messages from mailing lists for most of the mailboxes (in figure 6.7, only mailboxes 10 and 16 have any). This is normal, since messages from a mailing list are sent only to users who have subscribed to the list, and a message received from a mailing list to which the user has subscribed should not be spam.

Since there are almost no spam mailing list mails, we cannot say anything about the type of mail received, and hence we cannot classify on the basis of whether an image is present or not.

2. The values of the mailing list spam image count are scattered across the regions of the plot, with most of them below 45%. Ideally, the image count for spam should be high in order to clearly classify a message as spam, since it is believed that legitimate communication mostly takes place through text and not images, so the presence of an image should increase the possibility of the mail being spam. But the value for spam is neither very high nor concentrated in the high regions of the plot, for the same reason as mentioned before.

Hence, based on the observations made, we cannot infer anything from the image test, since the results may vary depending on the type of mails and the time at which the test is performed. We may get a higher value for the number of ham images because Mailing list mails sent by friends and known people often contain one or more images. Spam, on the other hand, can simply be full of undesired text and hyperlinks. Thus it is difficult to use the Image test alone for classification; it must be used in combination with some other test, such as FriendList or Domain Check, to provide the required classification.

Result Summary:

1. Domain Check Module : Can be used as a good filter and gives good results most of the time.
2. Image Analysis Module : is a weak filter and cannot be used alone for the purpose of classification of mails.

7. Problems and Solutions

One of the major problems faced with the server-based approach was that it did not attract many spam mails to work with, so accurate results could not be obtained. As a result, the entire approach shifted to a standalone module: rather than passing messages one at a time, whole message arrays were passed at once to all the individual modules.
In the standalone program, the biggest concern was the amount of time the modules took to finish running, because the messages needed to be parsed and the host name extracted from each of them and then compared to the stored names. Hence, a HashSet data structure was used instead of an array to store messages, which at least sped up the comparison process.
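
The benefit of the HashSet can be seen in a small Java sketch. It is a simplified illustration, not the project's code: the class name is hypothetical, and it assumes that what is kept in the set is the host names extracted from the sent messages. A lookup in a HashSet takes expected constant time, whereas searching an array means scanning its elements one by one for every received message.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SentHostIndex {
    private final Set<String> hosts = new HashSet<>();

    /** Builds the index once from the host names found in the sent messages. */
    SentHostIndex(List<String> sentHosts) {
        for (String host : sentHosts) {
            hosts.add(normalize(host));
        }
    }

    private static String normalize(String host) {
        return host.trim().toLowerCase(Locale.ROOT);
    }

    /** Expected constant-time membership test, independent of how many hosts are stored. */
    boolean contains(String host) {
        return hosts.contains(normalize(host));
    }

    /** Counts how many received host names are already known from the sent messages. */
    int countMatches(List<String> receivedHosts) {
        int matched = 0;
        for (String host : receivedHosts) {
            if (contains(host)) {
                matched++;
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical host names, for illustration only.
        SentHostIndex index = new SentHostIndex(List.of("columbia.edu", "gmail.com"));
        System.out.println(index.countMatches(List.of("columbia.edu", "example.org", "GMAIL.COM")));
    }
}
```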
A major hurdle was faced while integrating each of the correctly working standalone modules, including CheckDomain and Image Analysis, into the base module, i.e. the Module class. There were a number of smaller and bigger issues, which the group handled and solved collectively. The Folder Closing Exception was faced too often; because of it the program got stuck midway and did not compute the right results for some modules. This was seen for mailboxes containing over 2500 messages. The exception occurred because the message folder was closed unexpectedly due to a timeout. The issue was handled by opening the folder only on demand, ensuring that the folder is not open at a time when it is not being used.
Another difficulty, faced in the Image Testing module, was that the dataset collected could not give accurate results for the test. The gmail mailboxes had a large number of mails in the non-spam folders containing images, either as attachments or embedded in the body itself. This was partly because people often join mail groups or are part of several mailing lists and receive about 5-6 mails daily containing images as well as text. Mails containing images are also received from sites on which the user has registered, as advertisements from known companies, etc. Also, spam messages are sometimes very short and contain only text; instead of images, they contain links that direct the user to the spammer's page. So the count of spam mails with images could often be low.

A similar observation was made on the Columbia mailboxes. Firstly, these mailboxes hardly received any spam, since their addresses are known to very few people. Also, images were present in some non-spam mails, due to which the analysis could not be done accurately on the Resultset that was taken for testing. It varied for different mailboxes.

Due to the above two facts, it became very difficult to analyze the results obtained and to infer from them whether the test could be used for classification or not. Because of such variations, the Image test would in many cases fail to give the desired accuracy for classification.

8. Appendix

The source code for the individual modules can be found here: Source Code. This zipped folder contains the following:

1. Domain Check Module : "SpamTestLatest.jar"
2. Image Analysis Module : "SpamTestImage.jar"

The Java files for the individual modules can be obtained by extracting the jar files. The files are named "CheckDomain.java" and "From_Body.java" (image_test() function). The Image Test Module was implemented as a part of the From_Body module, which checks whether the "To field" or the "Body" of the mail contains the real name of the person or not. This module was implemented by Aditi Rajoria. The reason for doing this has been mentioned earlier.

The java source code for these modules can be found here :

1. Domain Check Module

2. Image Analysis Module

9. Tools Used

The tools used were as follows:

Navicat MySQL for creating the database for the Server based Approach
Netbeans IDE for writing the code for the modules, as the code was written in Java
OpenOffice and EditPlus HTML based tools for writing the report.
OpenOffice Excel draw for drawing the figures.
Open Office Drawing tool for making the class Diagrams
Concept Draw Pro tool for making the Data Flow diagrams for the Server Based Approach.
Excel to Html converter for converting result tables in excel into the html form

10. References

RFC 2822 [http://tools.ietf.org/html/rfc2822]

RFC 2821 [http://tools.ietf.org/html/rfc2821]

JavaMail API [http://java.sun.com/products/javamail/]

Spam Analysis and Reputation Project [http://wiki.cs.columbia.edu:8080/display/spam/Home]
SARP Modules [http://wiki.cs.columbia.edu:8080/display/spam/IMAP+analyzer+modules]
Professor Henning Schulzrinne – Advisor for the Project.
Adrian Frei - Spam Analysis and Reputation Project: DNS Blacklists
Tejas Nadkarni – Parser and Standalone Framework
Aditi Rajoriya - Spam Analysis and Reputation Project: IMAP Retrieval and To/Body Module.
Preethi Narayan – Spam Analysis and Reputation Project : From And Received Header Analysis
Nirav Shah - Spam Analysis and Reputation Project: Email Source, Date/Time and Attachment Analysis
Swati Kumar - Spam Analysis and Reputation Project : Email Encryption Headers and Database Schema
The results of running the modules on the mailboxes considered during testing and result computation are available at the link below:
[http://wiki.cs.columbia.edu:8080/display/spam/Resultset]
