By:
Dhrumin Shah
Columbia University
Department of Computer Science
New York, NY 10027
USA
The project aims to gather statistical data about the various header and body fields present in emails, and to use this data to differentiate between two large collections of messages: Spam and Non-Spam (or Ham). Based on the data gathered for the spam and ham mails, we decide whether a particular field is good enough to be used for the required classification. The two modules covered in this report are the Domain Check module and the Image Test module.
If a particular field in the header or body of an email is a good indicator, it will take noticeably different values (e.g. spam score) for the spam and ham mails. To gather the statistics, we run our program on folders such as Inbox, Spam and others like Sent that contain a substantial number of messages, and collect data about the different header and body fields. After examining this data, we conclude whether the particular field is a good metric for the required classification.
The statistics can be generated using parameters like the number of mails, comparisons between the header or body fields, or the individual results of various sources like Blacklist or Friendlist, etc. These statistics, when derived for a sufficient number of mailboxes and hence for a sufficient number of different kinds of mails, can be used to classify a particular message as spam or ham. For example, if a mail comes from a domain to which the receiver has never sent a mail, then the probability of the message being regarded as spam is high. On the other hand, if the mail comes from a domain to which the receiver has sent a mail before, then this mail will probably not be regarded as spam. So, we find the count of such mails in the set of mailboxes used for testing, and based on these counts for spam and ham mails, we infer whether the given field can be used for classification.
Table of Contents
1. Abstract
2. Introduction
3. Architecture
4. Design and Implementation - A Standalone Approach
The project is divided into modules for which statistics are to be gathered. These modules cover the different header and body fields that seem important and can potentially be used for classification. The modules are:
Friend Check
Pingable Hosts
Black Lists
Domain Check
In-reply-to
DKIM and SPF
Received Header
DHCP and DSL
Attachments
Getting hour, date and time information from the message
Whether the To field and the Body contain the name of the person.
Whether there is any image in the body of the message.
Columbia Internal mails
These are the main parts of the header and body on which data has to be gathered. They have been implemented as a joint effort by the team members.
The design of the project is briefly outlined below:
All modules are implemented in Java.
The JavaMail 1.4 library (javamail-1.4) is used extensively by all the modules.
There is a main file called MailStats.java that calls all the modules synchronously, one after the other, so that all the checks can be performed on the message sequentially.
MailStats connects to the user's account on an IMAP server and starts up a basic user interface with which the user can categorize his or her mail folders into spam, ham and sent.
The GUI has a progress bar to indicate which module is currently running and hence gives feedback to the user.
MailStats passes javax.mail.Message arrays containing the spam, ham and sent messages to all the modules. The modules use these to individually find statistics and print out a result that can be later used for analysis. The end result is the combination of the results of the individual modules.
2. Introduction
Out of the modules listed in the previous section, this report mainly concentrates on the following two modules:
Domain Check : This module deals with finding the domain name of the sender of the message and checking whether the receiver has sent any message to that domain in the past.
Image Analysis : This module checks whether there was any image present in the body of the message.
The above modules were chosen since they are used by existing spam filters like SpamAssassin and hence can act as good classifiers of the incoming mail. The results of the tests indicate whether the chosen parameters are good enough for such a classification. To gather statistics, mailboxes of a small number of people (both Columbia and non-Columbia students) were used to provide the three message arrays, namely sent, mail and spam, to all the modules. These statistics are then used to form the results of the test, and they can be further improved by increasing the accuracy of the tests conducted.
To gather such information for classifying the mails, two approaches were proposed:
Initially, the statistics were to be gathered using an IMAP based server that could be used as a honey pot to attract spam mails. For this approach, the overall design was quite different from that of the other approach, called the standalone program. The incoming mail was parsed on the fly and the individual fields were then stored in a database. The individual modules queried the database to get the data for their analysis: they extracted the stored messages from the required tables and then performed the same test on every mail received. The following sections give a general description of the standalone program and the server based program.
The basic architecture of the modules and what each module does is described below.
The first of the modules covered in this report is the Check Domain module. The main aim of this module is to determine which of the incoming mails (both spam and ham) come from a known domain. The following sections give more details about the module.
The most common types of domain names are host names that provide memorable names to stand in for numeric IP addresses. They allow for any service to move to a different location in the topology of the Internet (or an intranet), which would then have a different IP address. By allowing the use of unique alphabetical addresses instead of numeric ones, domain names allow Internet users to more easily find and communicate with web sites and other server-based services. The flexibility of the domain name system allows multiple IP addresses to be assigned to a single domain name, or multiple domain names to be assigned to a single IP address.
Domain names are restricted to the ASCII letters "a" through "z" (case-insensitive), the digits "0" through "9", and the hyphen, with some other restrictions. Examples are "cs.columbia.edu" and "123.abc.com". There are a number of types of domain names:
Top-Level Domains (either one of a small list of generic names of three or more characters, or a two-character territory code, e.g. .in, .uk, .us, etc.)
Second Level Domains ( These are the names directly to the left of .com, .net, and the other top-level domains )
Third Level Domains ( These domains are immediately to the left of a second-level domain )
Sub-Domains ( Domains of third or higher level ).
A domain name is one's own unique identity and remains so for as long as one continues to use that name. One can easily be aware of someone else's presence by knowing their domain name. No two parties can hold the same domain name simultaneously; therefore an Internet identity is unique. Whenever messages are exchanged between a sender and a receiver, domains are exchanged as well, meaning each party is aware of who is sending the message and to whom its own message goes. The domains of the sender and the receiver can be read from the header of the message itself.
Checking the domains is an important test as far as classification of the messages is concerned. If a known sender has sent a message from a known domain, then the message is not considered a spam message. We define a known sender as follows:
If a receiver gets a message from a sender, and the receiver has already sent a message to the same sender in the past (meaning the sender is on the sent list of the receiver), then the sender is a known sender for the receiver. The domain of this known sender is called the known domain.
MailStats connects to the user's account on an IMAP server and starts up a basic user interface with which the user can categorize his or her mail folders into spam, ham and sent. Each message is parsed by every module to get the required data. The idea is to get the domains for each of the messages in the ham and spam folders and compare them with the domains of the sent folder. If there is a match, the message is treated as not being spam. This can be done by extracting the host name from the Received headers of the mail and using this host name to get the domain from which the message was sent. The same procedure is repeated on the messages in the sent folder of the receiver, and the two domains can be matched. If a match exists, then the mail is from a known domain. The only concern here is how the match is made.
The level at which a domain match occurs differs from one message to another. The levels are classified as follows:
Level 1 match is done on the entire host name. That is, the host name of the sender is matched completely against the host names in the messages of the sent folder of the receiver.
Level 2 match is performed only if the Level 1 match fails. In this case we remove the first token from the original host name and compare the remaining part with the host names of the messages in the sent folder.
Level 3 match is performed only if the previous levels fail. As in Level 2, we remove the first token from the remaining host name and compare the result with the sent messages.
This can be continued for further levels, until only the last token is left. In practice the matching is restricted to around 3 to 4 levels before concluding that the domains do not match.
Suppose person A received a mail from dms2169@cs.columbia.edu. To perform a match, first the entire host name, dms2169@cs.columbia.edu, is matched against the host names in the sent folder of A. If there is no match, we remove the first token “dms2169” and compare cs.columbia.edu at the next level. If there is again no match, columbia.edu is chosen for the next level. If there is still no match, we conclude that the domain does not match.
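The level-by-level matching described above can be sketched as follows. This is only an illustrative sketch, not the project's actual code: the class name DomainLevelMatcher, the method matchLevel and the maxLevels parameter are hypothetical, and it assumes that the part before "@" counts as the first token and that matching stops before the last token alone would be compared.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper illustrating the Level 1 / Level 2 / Level 3 matching.
final class DomainLevelMatcher {

    /**
     * Returns the level (1, 2, 3, ...) at which the sender matches an entry
     * of knownHosts, or 0 if no level up to maxLevels produces a match.
     */
    static int matchLevel(String sender, Set<String> knownHosts, int maxLevels) {
        String candidate = sender.trim().toLowerCase(Locale.ROOT);
        for (int level = 1; level <= maxLevels; level++) {
            if (knownHosts.contains(candidate)) {
                return level;
            }
            int cut = firstDelimiter(candidate);
            if (cut < 0) {
                break; // a single token is left, nothing more to strip
            }
            String rest = candidate.substring(cut + 1);
            if (firstDelimiter(rest) < 0) {
                break; // only the last token would remain, so stop here
            }
            candidate = rest; // drop the first token and try the next level
        }
        return 0;
    }

    // Assumption: the whole part before '@' is the first token; after that,
    // tokens are separated by '.'.
    private static int firstDelimiter(String s) {
        int at = s.indexOf('@');
        return at >= 0 ? at : s.indexOf('.');
    }
}
```

With a set of known hosts containing only columbia.edu, the address from the example above would match at Level 3; with a limit of two levels it would be reported as not matching.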
The observation that ham mails usually do not contain images can be used as a metric for classification. If a mail contains an image, it is more likely to be spam, though other tests may be necessary alongside this one to reach such a conclusion. A crucial step is therefore to determine whether the mail contains an image or not. Multipurpose Internet Mail Extensions (MIME) is an Internet Standard that extends the format of e-mail to support:
text in character sets other than US-ASCII;
non-text attachments;
multi-part message bodies; and
header information in non-ASCII character sets.
Thus, a MIME message may have a number of different parts, each having a different type of content. A number of Content types are supported:
simple text messages using text/plain (the default value for "Content-type:")
text plus attachments (multipart/mixed with a text/plain part and other non-text parts). A MIME message including an attached file generally indicates the file's original name with the "Content-disposition:" header, so the type of file is indicated both by the MIME content-type and the (usually OS-specific) filename extension
reply with original attached (multipart/mixed with a text/plain part and the original message as a message/rfc822 part)
alternative content, such as a message sent in both plain text and another format such as HTML (multipart/alternative with the same content in text/plain and text/html forms)
image, audio, video and application (for example, image/jpeg, audio/mp3, video/mp4, application/msword and so on)
many other message constructs
The Content-type header indicates the Internet media type of the message content. A media type is composed of at least two parts: a type and a subtype, optionally followed by one or more parameters. For example, the image type has subtypes such as gif or jpeg, depending upon the format of the image. Thus, to check whether a part of a mail is an image, we simply need to check whether the type in its Content-type header is "image", with any subtype.
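As a minimal sketch of this check, the following helper extracts the primary type from a raw Content-type value and compares it with "image". The class and method names (ContentTypeCheck, primaryType, isImage) are hypothetical, and the parsing is deliberately simplified: it ignores comments and other header syntax that a full MIME parser would handle.

```java
import java.util.Locale;

// Hypothetical helper: decides from a raw Content-type value whether the
// primary media type is "image", whatever the subtype or parameters.
final class ContentTypeCheck {

    /** Returns the primary type, e.g. "image" for "image/gif; name=a.gif". */
    static String primaryType(String contentType) {
        if (contentType == null) {
            return "";
        }
        String value = contentType.trim();
        int semicolon = value.indexOf(';');
        if (semicolon >= 0) {
            value = value.substring(0, semicolon); // drop the parameters
        }
        int slash = value.indexOf('/');
        if (slash >= 0) {
            value = value.substring(0, slash); // drop the subtype
        }
        return value.trim().toLowerCase(Locale.ROOT); // types are case-insensitive
    }

    static boolean isImage(String contentType) {
        return primaryType(contentType).equals("image");
    }
}
```

When the JavaMail library is used, the same test can be expressed with Part.isMimeType("image/*"), which compares only the primary type when the subtype is given as "*".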
Spammers generally make use of a technique called Image Spam, in which the text of the message is stored as a GIF or JPEG image and displayed in the email. Filtering messages with image spam is more difficult than filtering text-only messages, as traditional methods are not effective. Image-based spam is a particularly difficult problem for two reasons: first, it is much harder to detect with conventional spam filtering and blocking technologies, and second, it is typically much larger than normal text-based spam, consuming much more bandwidth and storage.
Thus, checking for images in the incoming mails could be used for classification, up to a certain extent. Generally the legitimate mails, i.e. ham, contain only text for communication. So, the percentage of ham messages containing images is usually low, and this gives an indication that a message containing an image is more likely to be spam, though this may not always be the case, as discussed later.
4. Design and Implementation - A Standalone Approach
This section of the report provides the basic design and the implementation of the modules using the Standalone Approach. It gives a detailed description of how the modules are programmed in order to produce the desired results. The Standalone Approach was designed to run on all the mails of the spam and the ham folders of a particular mailbox. Each mail was checked by all the modules, one after another, and in the end the results of the individual modules were gathered for the required classification. The section begins with the basic class structure of the modules, describing how the modules are arranged, followed by the implementation, describing the basic idea and the functionality behind the modules. We study each module separately, as follows:
Figure 4.1
As shown in Figure 4.1 above, Module forms the base class. The CheckDomain class inherits from this base class and can access all the methods and variables defined in the public and protected sections of the Module class. This specialized class receives the messages in the form of three message arrays, as shown, on which the domain test is performed. hashset is a global data structure in this class that stores the domains of the sent messages, and domainresult is a string that stores the results of the Check Domain module. The following methods are also used in this module:
sentList(): This function divides the domain of each sent message into a number of tokens and stores these tokens in the hashset.
domainCheck(): This function performs the matching test on spam as well as non-spam messages.
The CheckDomain class in figure 4.1 receives three message arrays, sent, mail and spam, from the base class. It starts by working on the messages in the sent folder and storing the domains of these messages in the hashset. This can be done using the message.getAllRecipients() function, which returns all the recipient addresses of a particular message as an array of Address objects. Once a domain is extracted from an address, it is divided into a number of tokens. Tokens can be formed from a domain using the split function, which takes a delimiter string as input. If a token is already present in the hashset, it is not added again.
For example, if the receiver sent a message to the domain cs.columbia.edu, the tokens that could be added to the hashset are cs.columbia.edu and columbia.edu; any of these already present in the hashset is not entered again. This speeds up the search, as there are fewer tokens to compare.
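The way the set of sent domains is filled can be sketched as below. This is an assumption-laden illustration rather than the actual sentList() code: the class SentDomainSet and the method addRecipient are hypothetical names, the recipient is passed as a plain address string instead of a JavaMail Address object, and suffixes consisting of a single token (such as edu) are deliberately not stored.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: stores the domain of a recipient and its parent
// domains in the set of sent domains.
final class SentDomainSet {

    /** Adds e.g. cs.columbia.edu and columbia.edu for abc@cs.columbia.edu. */
    static void addRecipient(Set<String> sentDomains, String address) {
        int at = address.lastIndexOf('@');
        String domain = address.substring(at + 1).trim().toLowerCase(Locale.ROOT);
        String[] tokens = domain.split("\\.");
        // Keep every suffix that still has at least two tokens.
        for (int start = 0; start + 2 <= tokens.length; start++) {
            String suffix = String.join(".",
                    Arrays.copyOfRange(tokens, start, tokens.length));
            sentDomains.add(suffix); // a Set silently ignores duplicates
        }
    }
}
```

Because a set ignores entries it already holds, sending a second mail to columbia.edu after one to cs.columbia.edu leaves the set unchanged, which is the behaviour described in the example above.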
Once the domains of the sent mails have been stored in the hashset by means of a sentList() function call, the actual process of matching the domains begins. The method domainCheck() is called once each for the messages of the non-spam and the spam folders. For each folder, the concept of the Matching Domain, as described earlier in section 3.1.2 of the report, is used to find a match. At each level, we check whether the remaining domain matches any of the tokens stored in the hashset. If it does not match, the first token is removed and the process repeats for the remaining domain, until there is a match or no further level is left. An example of this process was given with the description of the matching levels.
For each match, the count of matched domains is incremented. We are interested in the following counts: ham domain count, spam domain count, ham mailing list domain count, spam mailing list domain count, ham mailing list sender domain count and spam mailing list sender domain count. The result is computed and the values are written into the public variable result of the base class. The ham and spam domain counts are the number of domains matched for the ham folder and the spam folder respectively. The mailing list domain counts and mailing list sender domain counts (for both ham and spam) relate to mails from mailing lists. These are described in detail below.
A mailing list is a collection of names and addresses used by an individual or an organization to send material to multiple recipients. The term is often extended to include the people subscribed to such a list, so the group of subscribers is referred to as "the mailing list". A mailing list is simply a list of e-mail addresses of people who are interested in the same subject. When a member of the list sends a note to the group's special address, the e-mail is broadcast to all the members of the list.
Thus there are two domains of interest:
the domain of the mailing list itself and the domain of the sender who has sent a mail to the special address of the mailing list. The sender may not be from a domain known to the receiver while the domain of the mailing list is known, or it could be the other way around. Hence we collect these counts separately and analyse them separately as well. This indicates how many mailing lists are known, and how many of the senders who write to a mailing list to which the receiver has subscribed are known to the receiver. The first step is therefore to identify whether the mail comes from a mailing list or not. This can be done by checking for the List-Id field in the header of the message.
If the List-Id header is present, then the mail is from a mailing list. Once the mail has been identified as coming from a mailing list, we can use the same implementation of the domain match as for the non-mailing list mails; only the intermediate results are stored in different variables: mailing list domain count and mailing list sender domain count.
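The separation of the counters can be illustrated with the sketch below. It is not the module's actual code: the class MailingListCounts, its fields and the record method are hypothetical, the headers are modelled as a simple name-to-value map rather than a JavaMail message, and the two boolean arguments stand for the outcome of the domain match on the sender domain and on the mailing list domain.

```java
import java.util.Map;

// Hypothetical counters for one folder (ham or spam).
final class MailingListCounts {
    int mailingListMails;
    int mailingListDomainMatches;
    int mailingListSenderMatches;
    int otherMails;
    int otherDomainMatches;

    /** A mail is treated as mailing list mail if it has a non-empty List-Id header. */
    static boolean isFromMailingList(Map<String, String> headers) {
        for (Map.Entry<String, String> header : headers.entrySet()) {
            String value = header.getValue();
            // Header field names are case-insensitive.
            if (header.getKey().equalsIgnoreCase("List-Id")
                    && value != null && !value.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    /** Adds one mail to the counters that apply to it. */
    void record(Map<String, String> headers,
                boolean senderDomainKnown, boolean listDomainKnown) {
        if (isFromMailingList(headers)) {
            mailingListMails++;
            if (listDomainKnown) mailingListDomainMatches++;
            if (senderDomainKnown) mailingListSenderMatches++;
        } else {
            otherMails++;
            if (senderDomainKnown) otherDomainMatches++;
        }
    }
}
```

Keeping the two mailing list counters apart makes it possible to report the mailing list domain match and the sender domain match as separate fractions of the mailing list mails.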
For example, when the mailbox "dpr2110@columbia.edu", was tested for the Domain Check module, it gave the following results:
Domain Testing Module
HAM
Total Mails:1174
Total Non-MailingList Mails: 830
Total Mails whose Domain Matched: 746/830
Total Mails whose Domain did not Match: 84/830
Ham mailing List Senders domain match: 324/344
Ham mailing List domain match: 344/344
The mailbox contains 1174 mails, of which 830 are non-mailing list mails. Among the non-mailing list mails, there are 746 mails whose domains matched and 84 mails whose domains did not match. Of the 344 mailing list mails, 324 have senders with known domains and 20 do not. All the mailing list mails come from mailing lists with known domains.
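Counts like these can be turned into percentages as sketched below. The helper MatchPercentage is hypothetical; rounding to the nearest whole number is an assumption, and "NA" is returned when there are no mails to divide by, so that an empty folder is not reported as a 0 % match.

```java
// Hypothetical helper for turning matched/total counts into a percentage.
final class MatchPercentage {

    /** Returns the rounded percentage as text, or "NA" if total is zero. */
    static String percent(int matched, int total) {
        if (total == 0) {
            return "NA"; // no mails of this kind, so no percentage is defined
        }
        return String.valueOf(Math.round(100.0 * matched / total));
    }
}
```

For the counts quoted above, 746 out of 830 gives about 90 % and 324 out of 344 gives about 94 %.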
Figure 4.2
As shown above in figure 4.2, there is no individual class for the Image Analysis module. This module was made a method of another class because it requires the entire body of the message to be read, which takes a long time. Hence, to save time and ensure that the body is read only once, the module was integrated into the "From_Body" module, implemented by Aditi Rajoria. This program was also separated from the remaining ones, to ensure that the time taken by the other modules as a whole to produce results does not increase further. Hence the main module is run again to pass the message arrays to the From_Body module and produce the results.
Just as for the previous module, there is a base class called Module which reads in the messages from the folders of the mailbox, forms three message arrays and passes them to all the individual modules. The inherited class From_Body accepts these arrays and performs the required task. The image_test() method in this inherited class is the only method of interest for the Image Test module. It checks whether a message (either spam or non-spam) has an image as a part of its content.
Implementation
The image testing module is implemented using the image_test method in the inherited class From_Body. This method tests all the mails in the non-spam and spam folders and counts the number of messages having images in the body of the message. This is done by checking the content type of every part of the message body. As discussed earlier, a mail can be divided into a number of parts, each containing a different multimedia object of the mail, e.g. text, image, video, etc. Hence the content type is checked for each of the parts until one with the type “IMAGE” is found. A content type consists of a primary type and a secondary type (subtype). We always check for the primary type “IMAGE”, since the secondary type can be anything, e.g. GIF, JPEG, etc. For each image found, we increment the image count by one. This process is carried out for all the mails in the spam as well as the non-spam folders.
According to this test, the image count of the spam mails should be high, since spam messages are likely to contain images, and that of the non-spam mails should be low, as legitimate mail conversation rarely includes images.
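The part-by-part check can be sketched as a recursive walk over the parts of a message. The sketch below does not use the JavaMail classes: MimeNode is a hypothetical stand-in for one MIME part (a content type plus nested parts), and ImageCounter, countImages and countMessagesWithImages are illustrative names, not the methods of the From_Body module.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of counting image parts in nested message bodies.
final class ImageCounter {

    /** Stand-in for one MIME part: its content type and its nested parts. */
    static final class MimeNode {
        final String contentType;
        final List<MimeNode> children = new ArrayList<>();

        MimeNode(String contentType, MimeNode... parts) {
            this.contentType = contentType;
            for (MimeNode part : parts) {
                children.add(part);
            }
        }
    }

    /** Counts the parts whose primary type is "image", at any nesting depth. */
    static int countImages(MimeNode node) {
        String type = node.contentType.trim().toLowerCase(Locale.ROOT);
        int count = type.startsWith("image/") ? 1 : 0;
        for (MimeNode child : node.children) {
            count += countImages(child);
        }
        return count;
    }

    /** Counts the messages of a folder that contain at least one image. */
    static int countMessagesWithImages(List<MimeNode> messages) {
        int withImages = 0;
        for (MimeNode message : messages) {
            if (countImages(message) > 0) {
                withImages++;
            }
        }
        return withImages;
    }
}
```

Running countMessagesWithImages once on the spam messages and once on the non-spam messages would give the two counts that the test compares.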
For the server based approach, a database schema was used. The database contained a number of tables and, depending upon the requirements of the different modules and the RFC 2822/RFC 2821 architecture, various header fields were stored in the database, e.g. the to and from fields, by, Return-Path, etc.
All the modules that were implemented for the standalone program were also implemented for the server based approach. The Domain Check and the Image Testing modules were triggered by the parser itself and, based on the requirements of the individual modules, various fields of a particular mail were retrieved from the database. The actual test was then performed on these field values. The database stored every incoming mail, and its identifier was passed to the different modules, each performing a specific test. The server based approach was used for the following modules:
Figure 5.1
As shown in Figure 5.1 above, the basic design of the Domain Check module using the Server Based Approach consisted of three classes:
The Controller class is the basic class, responsible for reading the text file containing the module names, using a readFile() method, and for calling these modules for each incoming message, using the callModules() method. The msg_id of each message is stored in the moduleArray so that it can be passed to the individual modules, allowing each module to perform its check on that particular message.
The DomainCheck class is responsible for performing the domain match test on the incoming mail, using the concept of Matching Domain as described in section 3.1.2. Using the msg_id passed by the Controller class, the messages of the sent, non-spam and spam folders are stored in the "sent", "mail" and "spam" strings, and the domain test is then performed using these strings. As the results are obtained, they are appended to the string "result" and retrieved using the getResult() method.
The JdbcConnection class is responsible for database connectivity and for retrieving the required fields, like to and from, from the database for the incoming message whose msg_id is known. The data is retrieved using the mails object of type Mail, in the getTo() and getFrom() methods of the class. An object of the JdbcConnection class is created in the DomainCheck class, so that the data can be retrieved from the database for each incoming message. After all the data has been retrieved, the closeConnection() method is called to close the connection to the database established by the Connection object con.
The following shows the flow of information in the Domain Check module.
Figure 5.2
The Domain Check module, like the other modules, extracted messages from the database using the message Id passed to it by the Controller program. Each of the messages in the non-spam and spam folders was extracted from the database.
For domain matching, the domains of the received message were compared with the IMAP sent folder of the receiver, which consisted of the domains to which the receiver had sent messages in the past. An IMAP Retrieval Module was responsible for populating the data in the Message Database. Whenever a message Id was passed to the Domain Check module, the database was scanned, the received domain was compared with all the domains stored in the Imap Sent table, and the results of the match were stored back in the database.
Figure 5.3
Figure 5.4
The image testing would be done in the same way as described for the standalone program, using the checkImage() method. The only difference is that it would be done on the fly: for every incoming mail that is stored, the Image Test program would run as well, after getting the identifier of the new message. With this msg_id from the controller, the Image Test module would extract the corresponding message from the database and perform the Image Test on it. The results of this test would be stored back into the database.
The Domain Check and the Image Testing modules using the Server Based Approach faced a number of problems, as mentioned in Problems and Solutions in Section 7, due to which the implementation was changed to the Standalone Program described in the previous sections. However, the Server Based Approach is a good idea and could be used, provided that the honey-pot server continues to attract more spam messages in future.
SAMPLE DATA SET
The sample data set consists of 22 mailboxes on which the tests were performed. The details of these mailboxes are given in the following table:
Mailbox # | Mailbox Name | Ham Mails | Spam Mails | Total Mails |
---|---|---|---|---|
1 | aditi_columbia | 1818 | 0 | 1818 |
2 | aditi_gmail | 497 | 96 | 593 |
3 | Deepti_columbia | 1174 | 0 | 1174 |
4 | Deepti_gmail | 576 | 65 | 641 |
5 | dhrumin_gmail | 5002 | 103 | 5105 |
6 | pinank_gmail | 1418 | 264 | 1682 |
7 | Preetinarayan_columbia | 1230 | 0 | 1230 |
8 | Preetinarayan_gmail | 1788 | 204 | 1992 |
9 | sneha_gmail | 133 | 227 | 360 |
10 | spinank_gmail | 524 | 355 | 879 |
11 | vasa_columbia | 168 | 0 | 168 |
12 | dms2169_columbia | 1301 | 21 | 1322 |
13 | nirav_gmail | 1360 | 48 | 1408 |
14 | nns_2108 | 934 | 0 | 934 |
15 | manish_gmail | 414 | 45 | 459 |
16 | pragni_gmail | 1999 | 184 | 2183 |
17 | preetimalik_columbia | 527 | 0 | 527 |
18 | preetimalik_gmail | 380 | 0 | 380 |
19 | sak2144 | 749 | 0 | 749 |
20 | shradha_columbia | 140 | 0 | 140 |
21 | shradha_gmail | 1151 | 371 | 1522 |
22 | vasa_gmail | 2367 | 890 | 3257 |
Figure 6.0
6.1 DOMAIN CHECK MODULE
While computing the results and the statistics for the Domain Check module, we divide the messages from the mailboxes into two classes: Mailing List messages and Non-Mailing List messages. We take each class in turn and find the statistics for the mails belonging to that class only.
Non-Mailing List Analysis
Figure 6.1 below shows the statistics and the data gathered for Non-Mailing List mails, for both Ham and Spam messages. The mailbox numbers correspond to the mailboxes of the sample data set. Once the statistics are analyzed, the result is computed.
Column 2 : Number of mails in the Ham folders that do not come from Mailing Lists.
Column 3 : Number of mails from Column 2 whose domains match.
Column 4 : Percentage of mails in Column 2 whose domains match.
Columns 5, 6 and 7 correspond to Columns 2, 3 and 4 respectively, but for the Spam folders.
MailBox # | # Non-Mailing List Ham Mails | # Non-Mailing List Ham Domains Matched | % Non-Mailing List Ham Domains Matched | # Non-Mailing List Spam Mails | # Non-Mailing List Spam Domains Matched | % Non-Mailing List Spam Domains Matched |
---|---|---|---|---|---|---|
1 | 1175 | 1050 | 89 | 0 | 0 | NA |
2 | 438 | 281 | 64 | 96 | 0 | 0 |
3 | 830 | 746 | 90 | 0 | 0 | NA |
4 | 576 | 430 | 75 | 65 | 4 | 6 |
5 | 1717 | 806 | 47 | 103 | 1 | 1 |
6 | 867 | 202 | 23 | 264 | 7 | 3 |
7 | 764 | 710 | 93 | 0 | 0 | NA |
8 | 1059 | 786 | 74 | 204 | 3 | 2 |
9 | 133 | 105 | 79 | 227 | 1 | 1 |
10 | 522 | 23 | 4 | 223 | 1 | 1 |
11 | 157 | 122 | 78 | 0 | 0 | NA |
12 | 858 | 715 | 83 | 21 | 1 | 5 |
13 | 1290 | 274 | 21 | 48 | 4 | 8 |
14 | 670 | 541 | 81 | 0 | 0 | NA |
15 | 414 | 208 | 50 | 45 | 1 | 2 |
16 | 1590 | 1021 | 64 | 179 | 6 | 3 |
17 | 480 | 392 | 82 | 0 | 0 | NA |
18 | 326 | 311 | 95 | 0 | 0 | NA |
19 | 684 | 624 | 91 | 0 | 0 | NA |
20 | 140 | 85 | 61 | 0 | 0 | NA |
21 | 1151 | 983 | 85 | 371 | 50 | 13 |
22 | 1468 | 1318 | 90 | 890 | 51 | 6 |
Figure 6.1
Based on the data gathered, a scatter plot of the percentage of mails whose domains matched, for both ham and spam, is constructed as shown in figure 6.2. It can be observed that there are a few mailboxes for which the '% ham domains matched' is low. This is possible if the user does not use this particular mailbox for sending mails to others, but has advertised his/her mailbox address to receive mails. In that case there are very few domains in the list of sent domains, and hence the percentage of matched domains for such a mailbox is low, as shown below.
Figure 6.2
As shown in the scatter plot, the percentage of Non-Mailing List ham mails whose domains match the domains of the sent messages lies mostly within the region of 60 %-100 %, whereas it is less than 20 % for the spam mails. Based on this, it can be concluded that for most of the non-mailing list ham mails the domains match the domains of the sent mails, indicating that these mails are indeed not spam. This is not true for the spam mails: for most of them the domains do not match, hence these mails are more likely to be spam.
Mailing List Analysis
Figure 6.3 below shows the statistics and the data gathered for Mailing List mails, for both Ham and Spam messages. The mailbox numbers correspond to the mailboxes of the sample data set.
Column 2 : Number of mails in the Ham folders coming from Mailing Lists (ML).
Column 3 : Number of mails in Column 2 whose Mailing List domains match.
Column 4 : Percentage of mails in Column 2 whose Mailing List domains match.
Column 5 : Number of mails in Column 2 whose sender domains match.
Column 6 : Percentage of mails in Column 2 whose sender domains match.
Columns 7 to 11 give the corresponding values for the Spam folders, in the following order: number of Mailing List mails, number of Mailing List domains matched, number of sender domains matched, percentage of Mailing List domains matched and percentage of sender domains matched.
MailBox # | # ML Ham Mails | # ML Ham Domains Matched | % ML Ham Domains Matched | # ML Sender Ham Domains Matched | % ML Sender Ham Domains Matched | # ML Spam Mails | # ML Spam Domains Matched | # ML Spam Sender Domains Matched | % ML Spam Domains Matched | % ML Spam Sender Domains Matched |
---|---|---|---|---|---|---|---|---|---|---|
1 | 643 | 523 | 81 | 623 | 97 | 0 | 0 | 0 | NA | NA |
2 | 59 | 59 | 100 | 36 | 61 | 0 | 0 | 0 | NA | NA |
3 | 344 | 344 | 100 | 324 | 94 | 0 | 0 | 0 | NA | NA |
4 | 0 | 0 | NA | 0 | NA | 0 | 0 | 0 | NA | NA |
5 | 3285 | 3266 | 99 | 3035 | 92 | 0 | 0 | 0 | NA | NA |
6 | 551 | 540 | 98 | 519 | 94 | 0 | 0 | 0 | NA | NA |
7 | 466 | 376 | 81 | 450 | 97 | 0 | 0 | 0 | NA | NA |
8 | 729 | 680 | 93 | 716 | 98 | 0 | 0 | 0 | NA | NA |
9 | 0 | 0 | NA | 0 | NA | 0 | 0 | 0 | NA | NA |
10 | 2 | 0 | 0 | 0 | 0 | 132 | 0 | 0 | 0 | 0 |
11 | 11 | 11 | 100 | 11 | 100 | 0 | 0 | 0 | NA | NA |
12 | 443 | 437 | 99 | 437 | 99 | 0 | 0 | 0 | NA | NA |
13 | 70 | 0 | 0 | 12 | 17 | 0 | 0 | 0 | NA | NA |
14 | 264 | 264 | 100 | 264 | 100 | 0 | 0 | 0 | NA | NA |
15 | 0 | 0 | NA | 0 | NA | 0 | 0 | 0 | NA | NA |
16 | 409 | 407 | 99 | 401 | 98 | 5 | 5 | 5 | 100 | 100 |
17 | 47 | 47 | 100 | 47 | 100 | 0 | 0 | 0 | NA | NA |
18 | 54 | 54 | 100 | 50 | 93 | 0 | 0 | 0 | NA | NA |
19 | 65 | 65 | 100 | 65 | 100 | 0 | 0 | 0 | NA | NA |
20 | 0 | 0 | NA | 0 | NA | 0 | 0 | 0 | NA | NA |
21 | 0 | 0 | NA | 0 | NA | 0 | 0 | 0 | NA | NA |
22 | 899 | 885 | 98 | 872 | 97 | 0 | 0 | 0 | NA | NA |
Figure 6.3
As seen in the table of figure 6.3, a mailing list message carries two domains: the domain of the sender, who sent the mail to the mailing list address before it was distributed, and the domain of the mailing list itself. For example, if a person with the email address abc@columbia.edu sends a mail to the mailing list xyz@yahoogroups.com, then all the receivers subscribed to this mailing list will receive a mail with two domains, columbia.edu and yahoogroups.com, both of which can be checked for a match. The following observations were made after the data was collected:
1. In figure 6.3, most of the fields in the last two columns (% mailing list spam domains matched and % mailing list spam sender's domains matched) contain the value NA. This is because, for such mailboxes, there are no mailing list spam mails. Very few mails that come from a mailing list are spam, since the user has himself subscribed to the mailing list. So the possibility of a mail coming from a mailing list being spam is very low.
2. There are also a few NA values in the '% mailing list ham domains' and '% mailing list ham sender's domain' columns. This is because, for such mailboxes, there were no mails coming from mailing lists. This typically happens when either the user has not subscribed to any mailing list or the mailing lists to which the user has subscribed are not active.
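The NA entries in figure 6.3 arise whenever a percentage would have a zero denominator. As a minimal sketch (the class and method names are hypothetical and not taken from the project's source), the fraction could be computed so that an empty folder yields NA instead of a division by zero. Rounding to the nearest integer is an assumption made for this illustration.

```java
import java.util.OptionalInt;

// Hypothetical helper for illustration; not part of the project's source code.
final class MatchPercent {

    // Returns the rounded percentage of matched mails, or an empty value ("NA")
    // when the folder holds no mails of the kind being counted.
    static OptionalInt of(int matched, int total) {
        if (total == 0) {
            return OptionalInt.empty();
        }
        return OptionalInt.of((int) Math.round(100.0 * matched / total));
    }

    // Formats the result the way the tables do: a number or "NA".
    static String format(OptionalInt percent) {
        return percent.isPresent() ? Integer.toString(percent.getAsInt()) : "NA";
    }

    public static void main(String[] args) {
        // Mailbox 1 of figure 6.3: 523 of 643 mailing list ham mails matched.
        System.out.println(format(of(523, 643)));
        // Mailbox 4 of figure 6.3: no mailing list ham mails at all.
        System.out.println(format(of(0, 0)));
    }
}
```

Keeping NA as an explicit empty value, rather than 0, prevents mailboxes without any mailing list mails from being plotted as if none of their domains matched.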
The percentages of matching Sender domains and Mailing List domains are both plotted (only for non-NA values) using a scatter graph, for both ham and spam mails, as shown in figure 6.4.
Mailing List Analysis
Figure 6.4
The scatter plots in figure 6.4 show that for most of the mailboxes, the domains of nearly 100 % of the ham mails coming from mailing lists match the domains of the mails sent by the user. In other words, the user is aware of the domains coming from the mailing list, which is to be expected, since the user has himself subscribed to the mailing lists, and these mails should not be spam.
On the other hand, as mentioned in the observations, there are very few spam messages from mailing lists, and where there are, very few of them have a domain that matches those of the sent messages.
The mailbox # refers to the mailbox in the Resultset, which was created at the time of testing and whose link is available in the References. If we observe the graph and the Resultset, there are some mailboxes for which the number of Ham messages matching the sent mails is low. This is possible because the receiver might not send mails using this mailbox, or because the sent folder contains very few mails, meaning the receiver does not send mails often.
Matching Analysis
Thus, as seen from the statistics generated for mailboxes of different sizes, the Domain Check test can be considered a good test based on the following two properties:
The number of messages in the non-spam folder whose domains match, should be high.
The number of messages in the spam folder whose domains match, should be low.
For any test to be a good test, the above two properties need to be satisfied. In order to generate the statistics, we divided the mails received into two broad classes, Mailing list and Non-mailing list. We took the fraction of mails whose domains matched and represented this fraction with a scatter graph, for every mailbox and for each mail folder (ham and spam).
After careful analysis, it was found that the percentage of mails in the ham folder whose domains matched was high for both Mailing list and Non-mailing list mails, and that this percentage was low for the mails in the spam folder. This result is based on the mailboxes taken for testing.
Hence, in order to generalize this test, we can define a threshold value of, say, 60 % (or any value above 50 %), count the number of mailboxes for which the fraction of mails whose domains matched is above the threshold, and then conclude whether the test is suitable for a particular resultset. In most cases this count is much higher than the count of the remaining mailboxes in the resultset. Domain Check can thus be a good test for the required classification.
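The threshold count described above can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, mailboxes with an NA percentage are assumed to be left out of the input, and a value exactly equal to the threshold is counted as passing.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the project's source code.
final class ThresholdCount {

    // Counts the mailboxes whose matched percentage is at or above the threshold.
    // Mailboxes with an NA percentage are assumed to be omitted from the list.
    static long countAtOrAbove(List<Integer> percentages, int threshold) {
        return percentages.stream().filter(p -> p >= threshold).count();
    }

    public static void main(String[] args) {
        // '% ML Ham Domains Matched' column of figure 6.3, NA rows omitted.
        List<Integer> hamPercent = Arrays.asList(
                81, 100, 100, 99, 98, 81, 93, 0, 100, 99, 0, 100, 99, 100, 100, 100, 98);
        long passing = countAtOrAbove(hamPercent, 60);
        System.out.println(passing + " of " + hamPercent.size()
                + " mailboxes are at or above 60 %");
    }
}
```

For the '% ML Ham Domains Matched' column of figure 6.3, this gives 15 of the 17 mailboxes that have a non-NA value.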
6.2 IMAGE ANALYSIS MODULE
Proceeding in the same manner as for the Domain Check test, the received mails are divided into Mailing List and Non-mailing list mails. Each class is then tested individually, and statistics for each are generated and studied in order to find the results of the Image test.
Non-Mailing List Analysis
From the total mails received, the mails that do not come from a mailing list are separated out and data is gathered for them as shown in figure 6.5.
Column 2 : Number of mails in the Ham folders that do not come from Mailing lists.
Column 3 : Number of mails from amongst Column 2 that contain images in the body.
Column 4 : Fraction of mails containing images, out of the total mails in the Ham folders not coming from Mailing lists.
Columns 5, 6 and 7 have the same meaning as Columns 2, 3 and 4 respectively, but for Spam folders.
MailBox # | # Non-Mailing List Ham Messages | # Non-Mailing List Ham Images | % Ham Image Count | # Non-Mailing List Spam Messages | # Non-Mailing List Spam Images | % Spam Image Count |
---|---|---|---|---|---|---|
1 | 1175 | 363 | 31 | 0 | 0 | NA |
2 | 438 | 150 | 34 | 96 | 1 | 1 |
3 | 830 | 82 | 10 | 0 | 0 | NA |
4 | 576 | 268 | 47 | 65 | 4 | 6 |
5 | 1717 | 311 | 18 | 103 | 0 | 0 |
6 | 867 | 145 | 17 | 264 | 31 | 12 |
7 | 764 | 90 | 12 | 0 | 0 | NA |
8 | 1059 | 314 | 30 | 204 | 6 | 3 |
9 | 133 | 57 | 43 | 227 | 9 | 4 |
10 | 522 | 15 | 3 | 223 | 0 | 0 |
11 | 157 | 30 | 19 | 0 | 0 | NA |
12 | 858 | 81 | 9 | 21 | 1 | 5 |
13 | 1290 | 423 | 33 | 48 | 0 | 0 |
14 | 670 | 130 | 19 | 0 | 0 | NA |
15 | 414 | 138 | 33 | 45 | 1 | 2 |
16 | 1590 | 381 | 24 | 179 | 0 | 0 |
17 | 480 | 160 | 33 | 0 | 0 | NA |
18 | 326 | 100 | 31 | 0 | 0 | NA |
19 | 684 | 137 | 20 | 0 | 0 | NA |
20 | 140 | 60 | 43 | 0 | 0 | NA |
21 | 1151 | 300 | 26 | 371 | 10 | 3 |
22 | 1468 | 866 | 59 | 890 | 0 | 0 |
Figure 6.5
As shown in figure 6.5, some of the fields are NA, for the same reason as mentioned before (the folder contains no mails of that kind). The fraction of mails containing images is computed for each mailbox and each mail folder (ham and spam), and a scatter graph of these fractions is plotted to see in which regions they lie. Figure 6.6 shows the scatter plot.
Non-Mailing List Image Analysis
Figure 6.6
As seen above, the scatter of the '% Non-mailing list ham image count' is not concentrated in the high value region, unlike in the Domain Check module. For nearly all the mailboxes, the value of the fraction is under 50 % for ham messages. This indicates that ham messages are communicated mostly through text and that not many of them contain images. Taken on its own, this would suggest that a message containing an image is less likely to be ham.
At the same time, in figure 6.6, all the mailboxes have a fraction value of less than 30 % for the spam messages. This might not be the general case and could depend on the type of spam message, the type of spammer, etc. Hence we cannot classify only on the basis of whether a message contains an image or not, since, as per the statistics generated, a message not containing an image can be either ham or spam.
Thus for the Non-mailing list messages, the Image test may not be of any use, or it may be used alongside other relevant tests to provide the required classification.
Mailing List Analysis
Proceeding in the same manner as for the Non-Mailing List mails, the mails from Mailing lists are separated out and data is gathered for them as shown in figure 6.7.
Column 2 : Number of mails in the Ham folders that come from Mailing lists.
Column 3 : Number of mails from amongst Column 2 that contain images in the body.
Column 4 : Fraction of mails containing images, out of the total mails in the Ham folders coming from Mailing lists.
Columns 5, 6 and 7 have the same meaning as Columns 2, 3 and 4 respectively, but for Spam folders.
MailBox # | # Mailing List Ham Messages | # Mailing List Ham Images | % Ham Image Count | # Mailing List Spam Messages | # Mailing List Spam Images | % Spam Image Count |
---|---|---|---|---|---|---|
1 | 643 | 67 | 10 | 0 | 0 | NA |
2 | 59 | 8 | 14 | 0 | 0 | NA |
3 | 344 | 70 | 20 | 0 | 0 | NA |
4 | 0 | 0 | NA | 0 | 0 | NA |
5 | 3285 | 676 | 21 | 0 | 0 | NA |
6 | 551 | 0 | 0 | 0 | 0 | NA |
7 | 466 | 0 | 0 | 0 | 0 | NA |
8 | 729 | 26 | 4 | 0 | 0 | NA |
9 | 0 | 0 | NA | 0 | 0 | NA |
10 | 2 | 0 | 0 | 132 | 2 | 2 |
11 | 11 | 2 | 18 | 0 | 0 | NA |
12 | 443 | 48 | 11 | 0 | 0 | NA |
13 | 70 | 3 | 4 | 0 | 0 | NA |
14 | 264 | 18 | 7 | 0 | 0 | NA |
15 | 0 | 0 | NA | 0 | 0 | NA |
16 | 409 | 27 | 7 | 5 | 1 | 20 |
17 | 47 | 7 | 15 | 0 | 0 | NA |
18 | 54 | 12 | 22 | 0 | 0 | NA |
19 | 65 | 13 | 20 | 0 | 0 | NA |
20 | 0 | 0 | NA | 0 | 0 | NA |
21 | 0 | 0 | NA | 0 | 0 | NA |
22 | 899 | 397 | 44 | 0 | 0 | NA |
Figure 6.7
Once the required data is gathered, it can be represented with the help of a scatter graph. One such plot is shown in figure 6.8.
Mailing List Image Analysis
Figure 6.8
As shown in the scatter plot in figure 6.8, the test does not give good results. We cannot infer anything from this test for Mailing list messages, due to the following two observations:
1. There are no spam messages from mailing lists for most of the mailboxes (only two mailboxes in figure 6.7 have any). This is normal for most mailboxes, since messages from a mailing list are only sent to users who have subscribed to the list, and a message received from a mailing list to which the user has subscribed should not be spam. Since there are almost no spam mailing list mails, we cannot say anything about the type of mail received, and hence we cannot classify on the basis of whether an image is present or not.
2. The fraction values for the mailing list spam image count are scattered, with most of them below 45 %. Ideally, the image count for spam should be high in order to clearly classify a mail as spam, since it is believed that legitimate communication mostly takes place through text and not images, so the presence of an image should increase the possibility of the mail being spam. But the count value for spam is neither very high nor concentrated in the high regions of the plot, for the same reason as mentioned before.
Hence, based on the observations made, we cannot infer anything from the Image test, since the results may vary depending on the type of mails and the time at which the test is performed. We may get a higher value for the number of ham images because Mailing list mails sent from friends and known people often contain one or more images. Spam, on the other hand, can simply be full of undesired text and hyperlinks. Thus it is difficult to use the Image test alone for classification; it must be used in combination with some other test, such as FriendList or Domain Check, to provide the required classification.
Result Summary:
1. Domain Check Module : Can be used as a good filter and gives good results most of the time.
2. Image Analysis Module : Is a weak filter and cannot be used alone for the purpose of classification of mails.
One of the major problems faced with the server-based approach was that it did not attract many spam mails to work with, so accurate results were not obtained. As a result, the entire approach shifted to a standalone module: rather than passing messages one at a time, whole message arrays were passed at once to all the individual modules.
In the standalone program, the biggest concern was the amount of time the modules took to finish running, because the messages needed to be parsed and the host name extracted from each of them and compared to the stored names. Hence, a HashSet data structure was used instead of an array to store the names, which sped up the comparison process.
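The benefit of a HashSet over an array for this comparison can be illustrated with a minimal sketch. It is not the project's CheckDomain.java: the class name, the method names and the assumption that addresses arrive as bare strings of the form user@host (without display names or angle brackets) are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch for illustration; not the project's CheckDomain.java.
final class SentDomainIndex {

    private final Set<String> sentDomains = new HashSet<>();

    // Builds the set once from the recipient addresses found in the sent folder.
    SentDomainIndex(List<String> sentAddresses) {
        for (String address : sentAddresses) {
            String domain = domainOf(address);
            if (domain != null) {
                sentDomains.add(domain);
            }
        }
    }

    // Extracts the lower-cased part after the last '@', or null if there is none.
    // Assumes a bare address such as "abc@columbia.edu".
    static String domainOf(String address) {
        int at = address.lastIndexOf('@');
        if (at < 0 || at == address.length() - 1) {
            return null;
        }
        return address.substring(at + 1).trim().toLowerCase(Locale.ROOT);
    }

    // A hash lookup per received mail, instead of a scan over an array of names.
    boolean matches(String address) {
        String domain = domainOf(address);
        return domain != null && sentDomains.contains(domain);
    }

    public static void main(String[] args) {
        SentDomainIndex index = new SentDomainIndex(
                Arrays.asList("xyz@yahoogroups.com", "friend@columbia.edu"));
        System.out.println(index.matches("abc@columbia.edu"));
        System.out.println(index.matches("spammer@example.com"));
    }
}
```

With an array, every received mail requires a scan over all stored names, whereas the set answers each lookup in roughly constant time on average, which matters for mailboxes holding thousands of messages.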
A major hurdle was faced while integrating each of the correctly working standalone modules, including CheckDomain and Image Analysis, into the base module, i.e. the Module class. There were a number of smaller and bigger issues, which the group handled and solved collectively. The Folder Closing Exception occurred too often, causing the program to get stuck midway and fail to compute the right results for some modules. This was seen for mailboxes containing over 2500 messages. The exception arose because the message folder was closed unexpectedly after a timeout, which caused the errors. The issue was handled by opening the folder only on demand, ensuring that the folder is not open at a time when it is not being used.
Another difficulty faced in the Image Testing module was that the dataset collected could not give accurate results for the test. Most of the Gmail mailboxes had a large number of mails in the non-spam folders containing images, either as attachments or embedded in the body itself. This is partly because people often join mail groups or are part of several mailing lists and receive about 5-6 mails daily containing images as well as text. Mails are also received from sites on which the user has registered, advertisements from known companies, etc., which contain images. In addition, spam messages are sometimes very short and contain only text; instead of images, they contain links to sites that navigate the user to their page. So the count of spam mails with images could often be low.
A similar observation was made on the Columbia mailboxes too. Firstly, these mailboxes hardly received any spam, since their addresses are known to very few people. Also, there were images present in some non-spam mails, due to which the analysis could not be done accurately on the Result set that was taken for testing. It varied for different mailboxes.
Due to the above two facts, it became very difficult to analyze the results obtained and infer on the basis of these results whether the test could be used for the classification or not. As a result, due to such variations, the Image test would fail to give out the desired accuracy for the classification in many cases.
The source code of the individual modules can be found here: Source Code. This zipped folder contains the following:
1. Domain Check Module : "SpamTestLatest.jar"
2. Image Analysis Module : "SpamTestImage.jar"
The Java files for the individual modules can be obtained by extracting the jar files. The files are named "CheckDomain.java" and "From_Body.java" (image_test() function). The Image Test Module was implemented as a part of the From_Body module, which checks whether the "To field" or the "Body" of the mail contains the real name of the person or not. This module was implemented by Aditi Rajoria. The reason for doing this had been mentioned earlier.
The tools used were as follows:
Navicat MySQL for creating the database for the Server based Approach
NetBeans IDE for writing the code for the modules, as the code was written in Java
OpenOffice and EditPlus HTML based tools for writing the report.
OpenOffice Excel draw for drawing the figures.
Open Office Drawing tool for making the class Diagrams
Concept Draw Pro tool for making the Data Flow diagrams for the Server Based Approach.
Excel to Html converter for converting result tables in excel into the html form
RFC 2822 [http://tools.ietf.org/html/rfc2822]
RFC 2821 [http://tools.ietf.org/html/rfc2821]
JavaMail API [http://java.sun.com/products/javamail/]
Spam Analysis and Reputation Project [http://wiki.cs.columbia.edu:8080/display/spam/Home]
SARP Modules [http://wiki.cs.columbia.edu:8080/display/spam/IMAP+analyzer+modules]
Professor Henning Schulzrinne – Advisor for the Project.
Adrian Frei - Spam Analysis and Reputation Project: DNS Blacklists
Tejas Nadkarni – Parser and Standalone Framework
Aditi Rajoriya - Spam Analysis and Reputation Project: IMAP Retrieval and To/Body Module.
Preethi Narayan – Spam Analysis and Reputation Project : From And Received Header Analysis
Nirav Shah - Spam Analysis and Reputation Project: Email Source, Date/Time and Attachment Analysis
Swati Kumar - Spam Analysis and Reputation Project : Email Encryption Headers and Database Schema
The results of running the modules on a few mailboxes, which were considered during testing and result computation, are provided at the link below:
[http://wiki.cs.columbia.edu:8080/display/spam/Resultset]