Swati Kumar
Columbia University
New York, NY 10027
USA
sak2144@columbia.edu
The aim of this project is to gather statistical data about the various fields present in an email header or body that can help us differentiate between spam and non-spam (or ham) mails. If a particular field in the header or body of an email is a good indicator, the statistics gathered for it will differ for ham and spam mails. They should also be consistent over the sample dataset. To gather statistics we performed several checks on the email accounts of various users that formed the sample dataset and recorded data such as the total number of mails in ham and spam folders, number of spam and ham mails passing the test, number of spam and ham mails failing the test, number of mails for which the test is relevant and so on. After examining this data we determine if a particular field is good enough for classification purposes. The statistics are gathered for a substantial number of mailboxes to make our tests reliable and robust.
i. Abstract
This project determines whether certain parameters present in an email header or body are good enough for classification of mails into ham and spam. A separate module was developed for each of the parameters which was used to analyze the mails. The various modules are as follows:
Friend Check
Pingable hosts
Black Lists
Domain Check
In-reply-to
DKIM and SPF
Received Header
DHCP and DSL
Attachments
Hour, date and time information from the message
Whether the To field and the body contain the name of the person.
Columbia Internal mails
All modules are implemented in Java.
The JavaMail library (javamail-1.4) is used extensively by all the modules.
The main class, MailStats.java, calls all the modules synchronously, one after the other.
MailStats connects to the user's account on an IMAP server and starts a basic user interface in which the user can categorize the mail folders into spam, ham and sent.
The GUI has a progress bar that indicates which module is currently running, giving the user feedback.
MailStats passes javax.mail.Message arrays containing the spam, ham and sent messages to all the modules, which compute statistics and print a result that can be used for analysis. The final output consists of the combined results of the individual modules.
2. Introduction
The modules discussed in this report are as follows:
Check the message headers for information about dynamic hosts that use DHCP (Dynamic Host Configuration Protocol) or DSL (Digital Subscriber Line).
Check if the hosts present in the message headers are reachable. (A host is reachable if it responds to a ping by sending back a reply.)
Check the message headers for DomainKeys Identified Mail (DKIM) and Sender Policy Framework (SPF)
Check the message headers for Mailing Lists
The basic architecture of the modules and what each module does is described below.
3.1.1 DomainKeys Identified Mail - DKIM
DKIM lets an organization take responsibility for a message in transit. The domain owner generates one or more private/public key-pairs that will be used to secure messages originating from that domain. The domain owner places the public-key in his domain namespace (i.e., in a DNS record associated with that domain), and makes the private-key available to the outbound email system. When an email is submitted by an authorized user of that domain, the email system uses the private-key to digitally sign the email associated with the sending domain. The signature is added as a header to the email, and the message is transferred to its recipients in the usual way.
For example:
DomainKey-Signature: a=rsa-sha1; q=dns; d=example.com; i=user@eng.example.com; s=jun2005.eng; c=relaxed/simple; t=1117574938; x=1118006938; h=from:to:subject:date; b=dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSb av+yuU4zGeeruD00lszZVoG4ZHRNiYzR
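The signature header in the example above is a list of tag=value pairs separated by semicolons. As a minimal sketch of how such a value could be read, the class below splits it into a map so that the signing domain (d) and selector (s) can be looked up. The class name SignatureTagSketch and the method parseTags are hypothetical and are not part of the project's modules.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SignatureTagSketch {
    /** Splits a "tag=value; tag=value" header value into a map of tags (hypothetical helper). */
    static Map<String, String> parseTags(String headerValue) {
        Map<String, String> tags = new LinkedHashMap<>();
        for (String part : headerValue.split(";")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                // The tag name is the text before the first '='; the value is everything after it.
                tags.put(part.substring(0, eq).trim(), part.substring(eq + 1).trim());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        Map<String, String> tags = parseTags(
                "a=rsa-sha1; q=dns; d=example.com; s=jun2005.eng; c=relaxed/simple");
        System.out.println("domain=" + tags.get("d") + " selector=" + tags.get("s"));
    }
}
```

The sketch only reads the tags; it does not verify the signature against the public key published in DNS.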
Mails that have been authenticated with the signature will therefore be present in the user's inbox. If a mail provider uses DKIM validation, then the ham mails will have authenticated signatures and spam mails won't. This is the basis of searching for “DomainKey-Signature” in the email header.
3.1.2 Sender Policy Framework Protocol Test
Domain owners may authorize hosts to use their domain name in the "MAIL FROM" or "HELO" identity. Compliant domain holders publish Sender Policy Framework (SPF) records specifying which hosts are permitted to use their names, and compliant mail receivers use the published SPF records to test the authorization of sending Mail Transfer Agents (MTAs) using a given "HELO" or "MAIL FROM" identity during a mail transaction.
A mail receiver can perform a set of SPF checks for each mail message it receives. An SPF check tests the authorization of a client host to send mail with a given identity. Typically, such checks are done by a receiving MTA and they result in adding a header in the email as “Received-SPF”.
The Received-SPF header field is followed by a result and a comment conveying supporting information for the result, such as <ip>, <sender>, and <domain>. The values of the result field are:
Pass – the message meets the publishing domain's definition of legitimacy.
Fail – the message does not meet a domain's definition of legitimacy.
SoftFail – the message does not meet a domain's strict definition of legitimacy, but the domain cannot confidently state that the message is a forgery.
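Error – indicates an error during lookup.
Neutral – the domain makes no definite assertion, positive or negative, about the legitimacy of the message.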
Mails in the ham folder should typically have the result "pass" or "neutral", while mails in the spam folder should typically have the result "fail", "softfail" or "neutral".
3.2 Dynamic Host Configuration Protocol and Digital Subscriber Line
Hosts that use the Dynamic Host Configuration Protocol (DHCP) or a Digital Subscriber Line (DSL) are assigned dynamic IP addresses.
The Dynamic Host Configuration Protocol automates the assignment of IP addresses. When a DHCP-configured client connects to a network, it sends a broadcast query requesting necessary information from a DHCP server. The DHCP server manages a pool of IP addresses and information about client configuration parameters such as the default gateway, the domain name, the DNS servers, other servers such as time servers, and so forth. Upon receipt of a valid request the server will assign the computer an IP address, a lease (the length of time for which the allocation is valid), and other TCP/IP configuration parameters, such as the subnet mask and the default gateway. Thus, mail that comes from a dynamic host cannot be verified based on the host name.
DSL is a family of technologies that provide digital data transmission over the wires of a local telephone network. The customer end of the connection consists of a DSL modem. This converts data from the digital signals used by computers into a voltage signal of a suitable frequency range which is then applied to the phone line. Thus, a permanent IP address will not be available for DSL hosts.
The email headers do not directly contain this information, but by analyzing the headers we can find out if the sender's server was using DHCP or DSL. The DHCP and DSL module looks at the email headers and generates statistics for senders and mail servers that use DHCP and DSL. These statistics are interesting because we may be able to establish a relationship between senders of spam mails and hosts for which the IP address is assigned dynamically.
The mailing list headers are List-ID, List-Subscribe and so on. They provide information about the corresponding mailing list. This information can be used to find out how often mailing lists appear in ham and spam mails. Generally, mails received from a mailing list to which the user has subscribed will not be spam.
This module checks whether the host names in the from and by fields of the Received header of a mail can be pinged. If a host can be pinged, then that Internet address exists and can accept requests. An authentic mail server should be reachable because it will be up and running most of the time and should be able to accept TCP/IP packets. The “by” field of the first Received header of the trace gives information about the sender. Consider for example:
Received: from [192.168.123.110] (user-387gp1m.cable.mindspring.com [208.120.100.54]) (user=sak2144 mech=PLAIN bits=0) by serrano.cc.columbia.edu (8.14.1/8.14.1) with ESMTP id lBJ7Ww2E028346 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Wed, 19 Dec 2007 02:32:58 -0500 (EST) Message-ID: <4768C929.3070105@columbia.edu>
Here the first Received header contains the “from” field, indicating the sender machine's IP address, and the “by” field, containing the name of the server that received the mail from the sender. The by field may contain the name of another mail server or of the sender's mail server. In both cases, it should be reachable. The “from” host may or may not be reachable, because it depends on the sender's computer. Thus, as ham mails use authentic Mail Sending Agents, the by host name should generally be reachable.
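As an illustration of how the from and by hosts could be pulled out of a Received header like the one above, the following sketch uses two regular expressions. The class name ReceivedHostsSketch and both patterns are assumptions made for this example; the project's own parsing code may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReceivedHostsSketch {
    // "from" followed by a host name or a bracketed address literal (assumed pattern).
    private static final Pattern FROM = Pattern.compile("\\bfrom\\s+\\[?([^\\s\\]]+)\\]?");
    // "by" followed by the name of the receiving server (assumed pattern).
    private static final Pattern BY = Pattern.compile("\\bby\\s+([^\\s(;]+)");

    static String fromHost(String received) { return firstGroup(FROM, received); }

    static String byHost(String received) { return firstGroup(BY, received); }

    /** Returns the first captured group of the first match, or null if there is no match. */
    private static String firstGroup(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String received = "from [192.168.123.110] (user-387gp1m.cable.mindspring.com "
                + "[208.120.100.54]) by serrano.cc.columbia.edu (8.14.1/8.14.1) with ESMTP";
        System.out.println("from=" + fromHost(received) + " by=" + byHost(received));
    }
}
```

Both methods return null when the sub-field is absent, so a caller can skip headers that do not follow the usual layout.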
This section details the design and implementation of the modules for the standalone program. All the modules implement the Module interface, which the main class uses to call the individual modules. The main class connects to a particular user's mail account on an IMAP-enabled server and caches the mail headers required by each module for the ham and spam folders. The Module interface is used to pass data from the main class to the individual modules. The output of the program is a combined result that can be stored and later used for analysis. The checks for DHCP, DSL and reachable (pingable) hosts were combined in a single module called the Dynamic and Reachable Hosts Module. The checks for DKIM, SPF and mailing lists were combined in another module called the DKIM, SPF and Mailing List Module. The detailed design and implementation of both modules is given below:
The design of this module is object oriented where a single class represents the entire module. The dynamic and reachable hosts use the same data for getting the statistics, hence they could be combined together in the same class. The class diagram is shown below:
Figure 4.1
Figure 4.1 shows the classes of the module. There is one main class DhcpDslPing that inherits from the base class Module and a helper class called PingLookup that is called from DhcpDslPing class to ping a given hostname. The DhcpDslPing class is the main class that calls the ping() and parseReceivedforDhcp() methods to gather data for reachable and dynamic hosts for a given set of messages belonging to a particular folder like spam or ham. Some of the design decisions made were as follows:
The fundamental design for the module is to get each message, parse the Received header from the trace field of each message header and then check if the parsed host name is reachable or dynamic.
Dynamic Hosts
The received field is checked for DHCP and DSL hosts using the method parseReceivedforDhcp(). In this method, the first Received field from each message is parsed and the from and by domains are extracted. These domains are then checked for the following:
String dhcp – Checks to see if the string “dhcp” is a part of the from and by fields of the first received header.
String dsl - Checks to see if the string “dsl” is a part of the from and by fields of the first received header.
String dclient - Checks to see if the string “dclient” is a part of the from and by fields of the first received header.
String cable - Checks to see if the string “cable” is a part of the from and by fields of the first received header.
IP address separated by dashes - The IP address appears separated by dashes and then appears again separated by dots. For example: [192-34-45-66] followed by [192.34.45.66].
Mails that satisfy any of the above checks are said to have come from dynamic hosts. The method keeps a counter that it increments every time a match occurs, and the final counter value is added to the result. For the string checks, Java's indexOf(string) method is used; if the string is present in the domain name, the returned index is greater than -1. For the IP address separated by dashes and then by dots, the regular expression from.*((\\d+-){3}\\d+) is used.
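The checks listed above can be sketched as follows. The keyword list and the regular expression are the ones given in the text; the class name DynamicHostSketch, the method names, and the choice to lower-case the host name before matching are assumptions made for this illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class DynamicHostSketch {
    private static final String[] KEYWORDS = {"dhcp", "dsl", "dclient", "cable"};
    // Regular expression quoted in the text: an IP address written with dashes after "from".
    private static final Pattern DASHED_IP = Pattern.compile("from.*((\\d+-){3}\\d+)");

    /** True if the host name contains one of the keywords (indexOf returns more than -1). */
    static boolean hasDynamicKeyword(String host) {
        String lower = host.toLowerCase(Locale.ROOT); // assumption: match case-insensitively
        for (String keyword : KEYWORDS) {
            if (lower.indexOf(keyword) > -1) {
                return true;
            }
        }
        return false;
    }

    /** True if the Received header contains a dash-separated IP address after "from". */
    static boolean hasDashedIp(String receivedHeader) {
        return DASHED_IP.matcher(receivedHeader).find();
    }

    /** Combines the checks for one message's first Received header. */
    static boolean isDynamic(String fromHost, String byHost, String receivedHeader) {
        return hasDynamicKeyword(fromHost) || hasDynamicKeyword(byHost)
                || hasDashedIp(receivedHeader);
    }

    public static void main(String[] args) {
        System.out.println(hasDynamicKeyword("user-387gp1m.cable.mindspring.com"));
        System.out.println(hasDashedIp("from host-192-34-45-66.example.net [192.34.45.66]"));
    }
}
```

Because the keyword test is a plain substring match, a host name that merely happens to contain "dsl" or "cable" is also counted, which is worth keeping in mind when reading the statistics.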
Reachable Hosts
The Received header needs to be parsed differently for the ping module. The host names of the from and by fields are extracted and passed to the ping() method. Since the same host may appear in more than one mail, the host names are stored in a hashmap and only hosts that have not been seen before are pinged. The result for a host name is "true" if the host is reachable and "false" if it is not; if the host was reachable, a counter is incremented. After the ping method returns, the host name and its result are stored in the hashmap. If the host name appears again, it is looked up in the hashmap first, and the actual ping command is executed only if it is not found. This improves the efficiency of the module, because a ping is a time-consuming operation: the fewer the pings, the faster the program. To further increase efficiency, all the domain names of the from and by parts of the Received fields are first extracted and stored in an arraylist. This eliminates the need to parse the Received header separately for from hosts and by hosts.
For example:
Received: from [128.59.21.187] (photon.win.cs.columbia.edu [128.59.21.187])
(user=skn3 mech=PLAIN bits=0)
by serrano.cc.columbia.edu (8.14.1/8.14.1) with ESMTP id m0HMnJdt006183
(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT)
for <sak2144@columbia.edu> Thu, 17 Jan 2008 17:49:19 -0500 (EST)
The from host is extracted as 128.59.21.187 and stored in an arraylist. The by host is extracted as serrano.cc.columbia.edu and stored in another arraylist. Once both lists have been populated, the ping() method is called with each list in turn. The ping method uses a Java thread pool to ping the hosts in the list. The result of each ping is stored in a hashmap along with the host name, and this hashmap is checked first before pinging. The helper class PingLookup is used for pinging the host. It implements the "Runnable" interface and performs the ping function synchronously.
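The caching and thread pool idea described here can be illustrated with the sketch below. It is not the project's PingLookup class: the class name CachedPingSketch, the pool size of 4, and the pluggable Predicate that stands in for the real ping are all assumptions, chosen so that the example runs without network access.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

final class CachedPingSketch {
    // Host name -> ping result; consulted before any new ping is issued.
    private final Map<String, Boolean> results = new ConcurrentHashMap<>();
    private final Predicate<String> pinger; // stand-in for the real ping

    CachedPingSketch(Predicate<String> pinger) {
        this.pinger = pinger;
    }

    /** Pings each unseen host once on a thread pool; returns how many list entries were reachable. */
    int countReachable(List<String> hosts) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // assumed pool size
        for (String host : new HashSet<>(hosts)) {               // hosts must be non-null
            if (!results.containsKey(host)) {
                pool.execute(() -> results.put(host, pinger.test(host)));
            }
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int hits = 0;
        for (String host : hosts) {
            if (Boolean.TRUE.equals(results.get(host))) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        CachedPingSketch ping = new CachedPingSketch(host -> host.endsWith(".edu"));
        System.out.println(ping.countReachable(Arrays.asList(
                "serrano.cc.columbia.edu", "unknown.invalid", "serrano.cc.columbia.edu")));
    }
}
```

Counting hits per list entry rather than per distinct host matches the way the results are reported per message, while the cache keeps the number of actual pings down to one per distinct host.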
The ping() method is implemented using the operating system (OS) ping command. It checks for the OS on the machine where the stand-alone program is running and forms a ping command accordingly. For example, ping -n 3 -w 200 <some host name> is the ping command for Windows, where n gives the number of packets to be sent and w is the wait time.
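A minimal sketch of forming the ping command per operating system is given below. The Windows form is the one quoted in the text. The Unix-like form (ping -c 3), the use of the exit code as the reachability signal, and the names PingCommandSketch, buildCommand and isReachable are assumptions for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class PingCommandSketch {
    /** Builds the ping command line for the given OS name (as reported by os.name). */
    static List<String> buildCommand(String osName, String host) {
        if (osName.toLowerCase(Locale.ROOT).startsWith("windows")) {
            // Windows form quoted in the text: 3 packets, wait time 200.
            return Arrays.asList("ping", "-n", "3", "-w", "200", host);
        }
        // Assumption: Unix-like ping takes the packet count via -c.
        return Arrays.asList("ping", "-c", "3", host);
    }

    /** Runs the OS ping command; assumes exit code 0 means the host replied. */
    static boolean isReachable(String host) {
        try {
            Process process = new ProcessBuilder(buildCommand(System.getProperty("os.name"), host))
                    .redirectErrorStream(true)
                    .start();
            try (InputStream output = process.getInputStream()) {
                while (output.read() != -1) {
                    // Discard ping output so the child process cannot block on a full pipe.
                }
            }
            return process.waitFor() == 0;
        } catch (IOException e) {
            return false; // ping command missing or could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("Windows XP", "example.org"));
        System.out.println(buildCommand("Linux", "example.org"));
    }
}
```

Passing the command as a list of arguments to ProcessBuilder avoids building a shell string from a host name taken out of an untrusted mail header.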
After the module has been executed for all the messages in a particular folder, such as inbox or spam, the result string is formed and returned to Module. This result, which consists of the number of dynamic hosts, the number of ping hits for “by” and the number of ping hits for “from”, is displayed on the GUI.
An example of a result is as follows:
DHCP and Ping module
Non-Spam
Dynamic senders : 187/749
Ping hits for trace field By : 504/749
Ping hits for trace field From : 179/749
Spam
Dynamic senders : 0/0
Ping hits for trace field By : 0/0
Ping hits for trace field From : 0/0
The class diagram, representing the design of this module is shown below:
Figure 4.1.1
Figure 4.1.1 shows the class diagram that gives the design of the module. For every mailbox, the set of messages is iterated over, and the DomainKey-Signature, Received-SPF and List-ID headers are parsed for each message and analyzed for the DKIM, SPF and mailing list checks respectively. If the DKIM header is present, it means that the message recipient has verified the signature by querying the signer's domain directly to retrieve the appropriate public key, and has thereby confirmed that the message was attested to by a party in possession of the private key for the signing domain, validating the authenticity of the sender. If the Received-SPF header is present, it has a result value associated with it, as given in section 3.1.2, and depending on that value we confirm the authenticity of the sender. The List-ID header, which provides an identifier for an email distribution list, indicates the presence of a mailing list.
For DKIM and mailing lists, checking the header field suffices, but not all receiving mail servers perform an SPF check. Thus, for mails that do not contain the Received-SPF header, SPF checking is performed by the module using information from the email headers. SPF checking is performed using the MAIL FROM host name, the HELO identity domain name or the MAIL FROM domain name, and the IP address of the sender.
The domain owners wishing to be SPF compliant must publish SPF records for the hosts that are used in the "MAIL FROM" and "HELO" identities.
The check_host() function uses the following arguments:
The module consists of just one method, DkimSpf(), that gets the Received-SPF field, DomainKey-Signature field and List-ID field for each message in a mailbox. The mails from each of the mailboxes are read by the base class Module and stored in arrays of the Message class. The message array is iterated over and each message is checked for the three fields. If the Received-SPF field is present, the method also gets the SpfResult for that message, stores it in the final result and updates the corresponding result counter. For example, if the result is Fail, the counter for Fail is incremented by one; the counters for Pass, Error, Neutral and SoftFail are updated in the same way. If SPF checking was not done by the Mail Transfer Agent (MTA), the jSPF library is used to perform the check. The jSPF library has a checkSPF() method that uses three parameters: the MAIL FROM host name, the HELO identity domain name and the client IP address. These three parameters are found from the message headers as follows:
While performing SPF checking, a special case arises when the domain of the sender is the same as that of the receiver. For example, if a mail is sent from one Columbia server to another, all IP addresses will begin with 128.59.x.x, so we cannot determine where the actual hop (the transfer of the mail from the sender's mail transfer agent to the receiver's mail server) takes place. This holds true for internal mails on all mail servers. Also, internal mails do not require any checking as they come from a reputed and trusted mail server. Hence, we ignore the internal mails and keep a separate count for them.
The DKIM check is simply checking for the presence or absence of the DomainKey-Signature header and if the header is present, a DKIM counter is incremented by one. Similarly for the mailing lists, presence or absence of List-ID header gives us the information if the message is from a mailing list or not. If the message is from a mailing list, a mailing list counter is incremented by one.
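The counting described in this section can be sketched as below. To keep the example self-contained, each message is represented by a plain map from header name to value instead of a javax.mail.Message. That representation, the class name HeaderCountSketch, and the assumption that the SPF result is the first word of the Received-SPF value are all choices made for this illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class HeaderCountSketch {
    int total;         // messages seen
    int dkim;          // messages with a DomainKey-Signature header
    int mailingLists;  // messages with a List-ID header
    final Map<String, Integer> spf = new TreeMap<>(); // SPF result -> count

    /** Updates the counters for one message, given as header name -> header value. */
    void add(Map<String, String> headers) {
        total++;
        if (headers.containsKey("DomainKey-Signature")) {
            dkim++;
        }
        if (headers.containsKey("List-ID")) {
            mailingLists++;
        }
        String spfValue = headers.get("Received-SPF");
        if (spfValue != null && !spfValue.trim().isEmpty()) {
            // Assumption: the result (pass, fail, softfail, ...) is the first word of the value.
            String result = spfValue.trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
            spf.merge(result, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        HeaderCountSketch counts = new HeaderCountSketch();
        Map<String, String> message = new HashMap<>();
        message.put("DomainKey-Signature", "a=rsa-sha1; d=example.com");
        message.put("Received-SPF", "pass (example.org: 192.0.2.1 is a permitted sender)");
        counts.add(message);
        System.out.println("Dkim : " + counts.dkim + "/" + counts.total + " Spf : " + counts.spf);
    }
}
```

The header lookups here are case-sensitive; with real messages, a case-insensitive lookup would be needed.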
Finally the result string is updated with all this information and is displayed on the Graphical User Interface (GUI). The result is as follows:
DKIM AND SPF module
Non-Spam
Dkim encryption : 211/749
Mailing Lists : 65/749
Spf Result for 348 out of 748 mails
Fail : 2
Pass : 238
Error : 0
Neutral : 104
SoftFail : 4
internal mails found = 400
Spam
Dkim encryption : 0/0
Mailing Lists : 0/0
Spf Result for 0 out of 0 mails
Fail : 0
Pass : 0
Error : 0
Neutral : 0
SoftFail : 0
internal mails found = 0
The progress bar is updated each time a message has been parsed and its result has been added to the result string. This is to give the user continuous feedback about the module's progress.
The server-based approach consisted of a mail server on which all the mails addressed to the server were stored. When a mail was received, the parser was invoked and parsed the message into header names and their values, which were then stored in the database. The message was parsed so that the other modules could retrieve information from the database. The database had all the fields based on the RFC 2821/2822 architecture.
The Dynamic and Reachable Hosts, DKIM, SPF and Mailing List modules were implemented for the server-based approach, along with the database schema.
The DKIM, SPF and mailing list modules were a part of the parser itself, since they needed to retrieve the header values directly present in the email and find the total number of messages for which these fields were present. The dynamic and reachable hosts module was implemented in a manner similar to the standalone program. The internal working of the modules did not differ much between the server-based and standalone approaches, except that the values of the message headers were queried from the database in the server-based implementation and were parsed for each message on the fly in the standalone approach.
The Server Based Approach was used for the following modules:
For the server based approach, the following database schema was used. The database system used was MySQL.
Database schema
TABLE 1 - MESSAGE:
create table message (
  message-id      varchar2(50) NOT NULL,
  date            datetime NOT NULL,
  sender          varchar2(50),
  return-path     varchar2(50),
  list_subscribe  tinyint(1),
  subject         longtext,
  body            blob,
  PRIMARY KEY (msgcount)
);
Explanation of the fields:
message-id - This is the unique id assigned to every message. It consists of characters and numbers. It is stored as a string.
date - Gives the date and time. It will be stored as yyyy-mm-dd hh:mm:ss.
sender - This field is sometimes used even when multiple "from" fields are not present, e.g. when a Gmail account is used to send a message to Columbia mail.
return-path - Same as the from/sender field for ham mails; it represents the MAIL FROM identity of the sender.
list_subscribe - It represents the mailing list, if present. It is for our convenience to know if there is a mailing list present.
subject - It is the subject in the mail header. It will be stored as longtext which is similar to clob.
body - The entire body of the message will be stored as a blob object, since it may contain characters that need to be escaped.
TABLE 2 - IN-REPLY-TO:
create table in-reply-to (
  CONSTRAINT in-reply-to_fk FOREIGN KEY (parent-msg-id) REFERENCES message(message-id),
  CONSTRAINT message_count_fk FOREIGN KEY (msgcount_id) REFERENCES message(message-id),
  PRIMARY KEY (parent-msg-id)
);
Explanation of fields:
in-reply-to - This field may be used to identify the message (or messages) to which the new message is a reply. As it can have more than one value, it is kept in a separate table. It has a one-to-many relationship with the message-id of the given message. Also, since in-reply-to denotes the message id of earlier messages, this message id should already be present in the database; thus, parent-msg-id is a foreign key from the message table.
parent-msg-id - Gives the message-id of the parents (the earlier messages in the thread) of the current message.
message-id - Gives the current message and is a foreign key for representing the one-to-many relationship.
TABLE 3 - REFERENCES:
create table references (
  CONSTRAINT references_fk FOREIGN KEY (thread-msg-id) REFERENCES message(message-id),
  CONSTRAINT message_count_fk FOREIGN KEY (msgcount_id) REFERENCES message(message-id),
  PRIMARY KEY (thread-msg-id)
);
Explanation of fields:
thread-msg-id - Contains all the message ids belonging to a particular thread. Similar to parent-msg-id of the previous table.
message-id - Links the thread to a particular message.
TABLE 4 - FROM
create table from (
  from_display_name  varchar2(50),
  from_addr_spec     varchar2(50) NOT NULL,
  CONSTRAINT message_count_from_fk FOREIGN KEY (msgcount_id) REFERENCES message(message-id),
  PRIMARY KEY (msgcount, from_addr_spec)
);
Explanation of fields:
from_display_name : Gives the optional display name, e.g. in "Swati" <sak2144@columbia.edu>, we store Swati.
from_addr_spec : Gives the second part, i.e. sak2144@columbia.edu.
The display name is optional but the addr-spec is not. So, the primary key will be message-id that associates the entries with a particular message and the addr-spec. Thus, the primary key is a combination of the foreign key and the unique identifier.
TABLE 5 - TO
create table to (
  to_display_name  varchar2(50),
  to_addr_spec     varchar2(50) NOT NULL,
  CONSTRAINT message_count_from_fk FOREIGN KEY (msgcount_id) REFERENCES message(message-id),
  PRIMARY KEY (msgcount, to_addr_spec)
);
It represents the "to" field of the message header.
Explanation of fields:
to_display_name : Gives the optional display name, e.g. in "Swati" <sak2144@columbia.edu>, we store Swati.
to_addr_spec : Gives the second part, i.e. sak2144@columbia.edu.
The display name is optional but the addr-spec is not. So, the primary key will be message-id that associates the entries with a particular message and the addr-spec.
TABLE 6 - REPLY-TO
create table reply-to (
  reply-to-name       varchar2(50),
  reply-to_addr_spec  varchar2(50) NOT NULL,
  CONSTRAINT message_count_from_fk FOREIGN KEY (msgcount_id) REFERENCES message(message-id),
  PRIMARY KEY (msgcount, reply-to_addr_spec)
);
Explanation of fields:
The fields name and addr-spec are the same as for the to and from tables. The reply-to table stores the address of the mailbox(es) to which the reply is to be sent. If this is not present, all the replies will be sent to the "from" field mailbox.
TABLE 7 - HTTP_LINKS
create table http_links (
  url      varchar2(50),
  link_id  NOT NULL AUTO_INCREMENT,
  CONSTRAINT message_count_from_fk FOREIGN KEY (msgcount_id) REFERENCES message(message-id),
  PRIMARY KEY (message-id, link_id)
);
Explanation of fields:
These fields are used for storing URLs found in the message body. The link_id is just to uniquely determine the link.
msgcount associates the URL with the message.
TABLE 8 - TRACE
create table trace (
  tracecount          INT NOT NULL,
  received_from_host  varchar2(50),
  received_from_addr  varchar2(50),
  received_by_host    varchar2(50),
  received_by_addr    varchar2(50),
  via                 varchar2(10),
  with                varchar2(10),
  id                  varchar2(50),
  for_display_name    varchar2(50),
  for_addr_spec       varchar2(50),
  CONSTRAINT message_count_from_fk FOREIGN KEY (msgcount_id) REFERENCES message(message-id),
  PRIMARY KEY (message-id, tracecount)
);
Explanation of fields:
received_from_host - Gives the hostname present in the "from" sub-field of the received fields.
received_from_addr - Specifies the IP address.
Similarly, received_by_host and received_by_addr represent the hostname and IP address in the "by" sub-field.
via - Gives the protocols, e.g. TCP.
with - Gives additional details, like with ESMTP.
id - Each trace may or may not contain an id (unique).
for - name and addr-spec, e.g. for "Swati"
TABLE 9 - MAILBOX
create table mailbox (
  mailbox_display_name  varchar2(50),
  mailbox_addr_spec     varchar2(50),
  CONSTRAINT message_count_from_fk FOREIGN KEY (msgcount_id) REFERENCES message(msgcount),
  PRIMARY KEY (mailbox_addr_spec, message-id)
);
Explanation of fields:
Each message is linked to a person/mailbox for which it is meant. This is a convenience table meant for fast sorting of messages for a particular mailbox, e.g. sak2144@columbia.edu received messages with msgcount 1,2,4,10,14.
The primary key is a combination of mailbox_name and message-id.
Using this table, we can get all the other details for the messages for a particular mailbox.
The database was populated using a database connectivity class that uses a JDBC connection and is called by the parser every time a new message arrives. The implementation of the database, the database connectivity class and the parser is detailed below.
Database and Database Connectivity class:
Figure 5.1
The data flow diagram (DFD) in Figure 5.1 shows the parser and the DB Connectivity class. Insert_DB is a function inside the DB Connectivity class that calls the appropriate function for inserting the values in the database. The flow of information between modules is as follows:
The parser parses the message and calls the DB Connectivity class. This class uses the parser's getter methods to obtain all the parsed values. For example, the database connectivity class gets the From field value by using the getFrom() method of the parser. It then calls the insert_from() method to insert the value in the FROMS table of the database. Also, for each message a unique identifier called Message-ID is used, through which all the tables and fields of the database can be accessed. This is passed by the parser to the controller class, which uses it while calling the other modules.
As shown in figure 5.2.1, the parser calls the Controller class and populates the database. The Controller class calls the module, that uses the populated data in the database and performs the checking. The module talks to a database connectivity class as database querying is involved.
The design is made as modular as possible, as querying to the database is separated from the business logic of performing the check and finding the results. The results are found for each mail, whose message-id - a unique identifier to the mail, is passed to the module by the controller.
The advantage gained by following this approach is that a change in the database schema, or in the database itself, will not affect the actual implementation of the module.
The Controller is used to invoke the modules because the mail server is always up and running, so if any of the modules is being modified, the Controller can be configured not to invoke that module until the change has been completed. Thus, the server need not be stopped and started to recompile the code, as that can be done remotely and the module then added back to the list of modules called by the Controller.
The Controller class keeps the list of modules to be invoked in an array called moduleArray. It reads the array and invokes each module by calling its run() method.
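The Controller behaviour described above can be illustrated with the following sketch. Only the array name moduleArray and the run() method come from the text. The ServerModule interface, the name() method, and the disable and enable calls are hypothetical, introduced to show how a module could be skipped while it is being changed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ControllerSketch {
    /** Hypothetical stand-in for the module type used by the report's Controller. */
    interface ServerModule {
        String name();
        String run(String messageId);
    }

    private final ServerModule[] moduleArray;
    private final Set<String> disabled = new HashSet<>(); // modules currently skipped

    ControllerSketch(ServerModule... modules) {
        this.moduleArray = modules;
    }

    void disable(String moduleName) { disabled.add(moduleName); }

    void enable(String moduleName) { disabled.remove(moduleName); }

    /** Calls run() on every enabled module for the given message id and collects the results. */
    List<String> handle(String messageId) {
        List<String> results = new ArrayList<>();
        for (ServerModule module : moduleArray) {
            if (!disabled.contains(module.name())) {
                results.add(module.run(messageId));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        ServerModule echo = new ServerModule() {
            public String name() { return "echo"; }
            public String run(String messageId) { return "checked " + messageId; }
        };
        ControllerSketch controller = new ControllerSketch(echo);
        System.out.println(controller.handle("<4768C929.3070105@columbia.edu>"));
        controller.disable("echo");
        System.out.println(controller.handle("<4768C929.3070105@columbia.edu>"));
    }
}
```

Keeping the skip list separate from moduleArray means a module can be taken out of rotation and restored without rebuilding the array or restarting the server process.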
The run method of DhcpDslPing class contains an object of the database connectivity class. When the getBy() and getFrom() methods are called, the database is queried and "by" and "from" hosts of the trace field for that message are returned. These are used by the parseReceivedforPing() and parseReceivedforDhcp() methods in the same way as given in section 4.1 of the standalone program. The getResult() returns the result of the check which is then stored in a result file.
The JdbcConnection class is the database connectivity and querying class. It has methods like getBy() and getFrom() that serve as the interface for retrieving the "by" and "from" dynamic hosts.
The DFD for Dynamic and Reachable hosts is given as follows:
As shown in figure 5.2.2, the dynamic and reachable hosts module queries the FROMS and TRACE tables to get the host names. It then stores the results of its processing back to the database in a result file. The message id is given to it by the Controller, which also calls the module.
The concept behind the module is the same for the standalone and server-based approaches, so the server-based approach can be used in future if the honey-pot attracts spam and gathers enough data for analysis. In this approach, we are not limited by the population size or type, and data can be stored permanently for further analysis. Thus, some of the drawbacks of the standalone program can be overcome by using the server-based approach.
Design
The DKIM, SPF and mailing list checks consist of extracting fields from the mail header, so they could be included in the parser. This is done because the parser is used to parse the message header into its various parts and store them in the database. The module for DKIM, SPF and Mailing List just checks for the DomainKey-Signature, Received-SPF and List-ID fields. The values that are parsed and stored in the database for DKIM, SPF and mailing lists are as follows:
DKIM - The header name is DomainKey-Signature; the corresponding value is the signing domain. Refer to section 5 for a detailed explanation of the working of DKIM.
SPF - The header name is Received-SPF; the corresponding value is the result, such as pass, fail, softfail, neutral or error. The domain for which the SPF result was obtained is also stored.
Mailing List - The header name is List-ID; the corresponding value is the domain name of the mailing list to which the user has subscribed.
Implementation
The implementation is based on parsing the appropriate header fields, which are read using the getHeader(header_name) method. This method returns the entire value of the header, and to parse this value we define methods such as parseDkim(), parseSpf() and parseList(). The parsing is performed using string manipulation and regular expressions. After the entire message has been parsed, the values are held in local variables of the parser class. The database connectivity class retrieves these values using getter methods and stores them in the database. The parser roughly implements a bean structure for passing data to and from other classes, as it uses getter and setter methods for all the fields and values parsed.
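The parsing step described above can be sketched as follows. This is a minimal illustration and not the project's parser: only the method names parseDkim(), parseSpf() and parseList() come from the report, while the class name HeaderFieldParser, the regular expressions and the choice to return null on a missing value are assumptions made for the example. It assumes the signing domain is carried in the d= tag of the DomainKey-Signature value, that the SPF result is the first word of the Received-SPF value, and that the list identifier is enclosed in angle brackets in the List-ID value.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; only the three parse method names come from the report. */
final class HeaderFieldParser {

    // "d=" tag of a DomainKey-Signature value, e.g. "a=rsa-sha1; d=example.com; ..."
    private static final Pattern DKIM_DOMAIN =
            Pattern.compile("(?:^|;)\\s*d=([^;\\s]+)");

    // First word of a Received-SPF value, e.g. "pass (...) client-ip=..."
    private static final Pattern SPF_RESULT =
            Pattern.compile("^\\s*([A-Za-z]+)");

    // Identifier between angle brackets in a List-ID value
    private static final Pattern LIST_ID =
            Pattern.compile("<([^>]+)>");

    private HeaderFieldParser() { }

    /** Returns the signing domain, or null if no d= tag is found. */
    static String parseDkim(String headerValue) {
        return firstGroup(DKIM_DOMAIN, headerValue);
    }

    /** Returns the SPF result in lower case (pass, fail, ...), or null. */
    static String parseSpf(String headerValue) {
        String result = firstGroup(SPF_RESULT, headerValue);
        return result == null ? null : result.toLowerCase(Locale.ROOT);
    }

    /** Returns the list identifier found between angle brackets, or null. */
    static String parseList(String headerValue) {
        return firstGroup(LIST_ID, headerValue);
    }

    // Returns capture group 1 of the first match, or null if there is none.
    private static String firstGroup(Pattern pattern, String value) {
        if (value == null) {
            return null;
        }
        Matcher m = pattern.matcher(value);
        return m.find() ? m.group(1) : null;
    }
}
```

In JavaMail, getHeader(name) returns a String array that may be null, so a caller would typically pass the first element of that array to these methods. The sketch does not extract the domain for which the SPF result was obtained.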
The DKIM header is added to the mail by the sender's mail server, but the SPF checking is done at the receiving server. This was implemented at the sarp mail server by using policyd-spf, which is basically a tool that performs SPF checking and adds the Received-SPF header to the email with the result and domain name. The List-ID header will be present only if the mail is sent by a mailing list.
The concept behind the module is the same for the standalone approach and the server based approach, so the server based approach can be used in future if the honey-pot attracts spam and gathers enough data for analysis. In this approach, we are not limited by the population size or type, and data can be stored permanently for further analysis. Thus, some of the drawbacks of the standalone program can be overcome by using the server based approach.
Sample Data Set
The sample data set consists of 22 mailboxes on which the tests were performed. The details of these mailboxes are given in the following table:
Mailbox # | Mailbox Name | Ham Mails | Spam Mails | Total Mails |
---|---|---|---|---|
1 | aditi_columbia | 1818 | 0 | 1818 |
2 | aditi_gmail | 497 | 96 | 593 |
3 | Deepti_columbia | 1174 | 0 | 1174 |
4 | Deepti_gmail | 576 | 65 | 641 |
5 | dhrumin_gmail | 5002 | 103 | 5105 |
6 | pinank_gmail | 1418 | 264 | 1682 |
7 | Preetinarayan_columbia | 1230 | 0 | 1230 |
8 | Preetinarayan_gmail | 1788 | 204 | 1992 |
9 | sneha_gmail | 133 | 227 | 360 |
10 | spinank_gmail | 524 | 355 | 879 |
11 | vasa_columbia | 168 | 0 | 168 |
12 | dms2169_columbia | 1301 | 21 | 1322 |
13 | nirav_gmail | 1360 | 48 | 1408 |
14 | nns_2108 | 934 | 0 | 934 |
15 | manish_gmail | 414 | 45 | 459 |
16 | pragni_gmail | 1999 | 184 | 2183 |
17 | preetimalik_columbia | 527 | 0 | 527 |
18 | preetimalik_gmail | 380 | 0 | 380 |
19 | sak2144 | 749 | 0 | 749 |
20 | shradha_columbia | 140 | 0 | 140 |
21 | shradha_gmail | 1151 | 371 | 1522 |
22 | vasa_gmail | 2367 | 890 | 3257 |
Table 6.0
6.1 Dynamic and Reachable Hosts
The check for dynamic hosts and reachable hosts is performed for both the ham and spam mails of a sample mailbox. There are cases when no spam messages exist for the mailbox, and in this case the check is performed for ham mails only.
The results are represented using a scatter graph. The x axis represents the Mailbox number and the y axis represents the Percentage Mails that pass the check among ham and spam mails of all the sample mailboxes.
Dynamic Hosts Check
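The table below gives, for each mailbox, the percentage of ham mails and the percentage of spam mails that were sent from dynamic hosts. If the column value is No Spam, it means that there were no spam mails.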
Mailbox # | % Dynamic Hosts for Ham | % Dynamic Hosts for Spam |
---|---|---|
1 | 28 | No Spam |
2 | 1 | 10 |
3 | 24 | No Spam |
4 | 1 | 20 |
5 | 2 | 16 |
6 | 1 | 10 |
7 | 32 | No Spam |
8 | 4 | 7 |
9 | 0 | 15 |
10 | 0 | 0 |
11 | 11 | No Spam |
12 | 36 | 0 |
13 | 12 | 0 |
14 | 37 | No Spam |
15 | 0 | 9 |
16 | 1 | 1 |
17 | 19 | No Spam |
18 | 2 | No Spam |
19 | 25 | No Spam |
20 | 5 | No Spam |
21 | 1 | 6 |
22 | 0 | 7 |
Table 6.1
The scatter graph representing the above data is given below:
Figure 6.1
The graph in Figure 6.1 indicates that for most of the mailboxes, the percentage of dynamic hosts is higher for spam mails than for ham mails. With the exception of mailboxes 12 and 13, where the ham percentage is higher, and mailboxes 10 and 16, where the two percentages are equal, the red dots are above the blue dots. This tells us that spam mails are more likely to be sent from dynamic hosts than ham mails. The percentage of dynamic hosts present also determines whether this check can be used as an effective filter. All the figures for both ham and spam are below 40%, implying that not many mails are sent from dynamic senders. Thus, the dynamic host check can be used as a filter, but with lesser importance.
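The comparison read off the graph can also be tallied directly from Table 6.1. The sketch below is only an illustration of that reasoning: the class name DynamicHostComparison, the NO_SPAM marker and the tally() method are hypothetical, and the two arrays are the ham and spam columns of Table 6.1 copied in mailbox order.

```java
/** Hypothetical tally of Table 6.1; not part of the project's code. */
final class DynamicHostComparison {

    // Marker used here for mailboxes listed as "No Spam" in Table 6.1.
    static final int NO_SPAM = -1;

    // % dynamic hosts for ham, mailboxes 1 to 22 (Table 6.1, column 2).
    static final int[] HAM = {
        28, 1, 24, 1, 2, 1, 32, 4, 0, 0, 11,
        36, 12, 37, 0, 1, 19, 2, 25, 5, 1, 0
    };

    // % dynamic hosts for spam, mailboxes 1 to 22 (Table 6.1, column 3).
    static final int[] SPAM = {
        NO_SPAM, 10, NO_SPAM, 20, 16, 10, NO_SPAM, 7, 15, 0, NO_SPAM,
        0, 0, NO_SPAM, 9, 1, NO_SPAM, NO_SPAM, NO_SPAM, NO_SPAM, 6, 7
    };

    private DynamicHostComparison() { }

    /**
     * Counts mailboxes where the spam percentage is higher, where the ham
     * percentage is higher, and where both are equal. Mailboxes without
     * spam are skipped. Returns {spamHigher, hamHigher, equal}.
     */
    static int[] tally(int[] ham, int[] spam) {
        int spamHigher = 0;
        int hamHigher = 0;
        int equal = 0;
        for (int i = 0; i < ham.length; i++) {
            if (spam[i] == NO_SPAM) {
                continue; // nothing to compare against
            }
            if (spam[i] > ham[i]) {
                spamHigher++;
            } else if (spam[i] < ham[i]) {
                hamHigher++;
            } else {
                equal++;
            }
        }
        return new int[] {spamHigher, hamHigher, equal};
    }
}
```

Calling DynamicHostComparison.tally(HAM, SPAM) on these values counts 9 mailboxes where spam is higher, 2 where ham is higher and 2 ties, out of the 13 mailboxes that contain spam.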
Reachable Hosts Check
The host names in both the "From" and "By" fields of the email header were checked for reachability, and the statistics for the two fields were gathered separately.
The table for reachable hosts in "From" field and its column description is given below:
Column 1 - corresponds to the mailbox number
Column 2 - gives the percentage of mails among the ham mails for each mailbox that responded to a ping request sent to the host of their "From" field and were reachable.
Column 3 - gives the percentage of mails among the spam mails for each mailbox that responded to a ping request sent to the host of their "From" field and were reachable. If the column value is No Spam, it means that there were no spam mails.
Mailbox # | % Reachable Hosts in "From" field - Ham | % Reachable Hosts in "From" field - Spam |
---|---|---|
1 | 28 | No Spam |
2 | 21 | 18 |
3 | 30 | No Spam |
4 | 12 | 29 |
5 | 27 | 26 |
6 | 33 | 20 |
7 | 22 | No Spam |
8 | 22 | 15 |
9 | 7 | 22 |
10 | 30 | 31 |
11 | 18 | No Spam |
12 | 39 | 81 |
13 | 12 | 2 |
14 | 34 | No Spam |
15 | 7 | 18 |
16 | 19 | 6 |
17 | 0 | No Spam |
18 | 5 | No Spam |
19 | 24 | No Spam |
20 | 6 | No Spam |
21 | 6 | 14 |
22 | 30 | 25 |
Table 6.2
The scatter graph representing the above data is given below:
Figure 6.2
The graph in Figure 6.2 shows that the blue and red dots are randomly scattered and do not show any consistent pattern. For some mailboxes the red dots are above the blue dots, indicating that the percentage of reachable hosts for spam mails is higher than that for ham mails. There can be many reasons for this behavior. The "From" field of the email header indicates the host machine of the sender, which may be a laptop or desktop and is not always reachable. Also, the machine may be situated in a secure network and cannot be pinged. Thus, the host names in the "From" field should not be used as a classifier for spam filtering, as they lead to inconsistent results.
The table for reachable hosts in "By" field and its column description is given below:
Column 1 - corresponds to the mailbox number
Column 2 - gives the percentage of mails among the ham mails for each mailbox that responded to a ping request sent to the host of their "By" field and were reachable.
Column 3 - gives the percentage of mails among the spam mails for each mailbox that responded to a ping request sent to the host of their "By" field and were reachable. If the column value is No Spam, it means that there were no spam mails.
Mailbox # | % Reachable Hosts in "By" field - Ham | % Reachable Hosts in "By" field - Spam |
---|---|---|
1 | 79 | No Spam |
2 | 42 | 11 |
3 | 79 | No Spam |
4 | 48 | 20 |
5 | 62 | 10 |
6 | 52 | 24 |
7 | 85 | No Spam |
8 | 53 | 33 |
9 | 30 | 25 |
10 | 90 | 30 |
11 | 79 | No Spam |
12 | 24 | 5 |
13 | 48 | 79 |
14 | 78 | No Spam |
15 | 21 | 9 |
16 | 64 | 32 |
17 | 0 | No Spam |
18 | 32 | No Spam |
19 | 68 | No Spam |
20 | 46 | No Spam |
21 | 52 | 22 |
22 | 75 | 28 |
Table 6.3
The scatter graph representing the above data is given below:
Figure 6.3
The graph in Figure 6.3 shows that the hosts in the "By" field of ham mails are much more reachable than those of spam mails. We can see that the blue dots representing the reachable hosts for ham mails are concentrated in the upper portion of the graph, and there is a distinct gap between the scatter of ham and spam mails that were found to be reachable. With the exception of mailbox 13, where the percentage for spam mails is greater than that for ham mails, the mailboxes display consistent results. Sometimes a mail server used for sending mail is reachable but does not respond to ping requests, in order to protect itself from ping attacks; this could be the reason fewer ham mail hosts pass the reachable hosts check in mailbox 13. As the test displays consistent results with just one exception, it can be used as a strong filter to determine spam mails.
6.2 DKIM, Mailing List and SPF Check
DKIM Check
The DKIM test simply checks the ham mails for the presence of the DomainKeys signature and the spam mails for its absence. This is because, if the DomainKey-Signature header is present in the mail, the mail has already been authenticated and there is little need for further checks. Thus, DKIM provides a very strong filter and should be used early on for segregation of ham and spam mails. If an email passes the DKIM test, it is very likely ham; in the sample data, at most 5% of the spam mails in any mailbox carry the header. Only if the header is not present should the mail be subjected to further tests to classify it. The table for DKIM test data is given below:
Mailbox # | % Ham Mails with DKIM Header | % Spam Mails with DKIM Header |
---|---|---|
1 | 17 | No Spam |
2 | 31 | 0 |
3 | 13 | No Spam |
4 | 23 | 0 |
5 | 70 | 1 |
6 | 56 | 5 |
7 | 11 | No Spam |
8 | 44 | 2 |
9 | 5 | 0 |
10 | 12 | 0 |
11 | 5 | No Spam |
12 | 4 | 1 |
13 | 33 | 0 |
14 | 8 | No Spam |
15 | 28 | 0 |
16 | 27 | 0 |
17 | 5 | No Spam |
18 | 25 | No Spam |
19 | 28 | No Spam |
20 | 3 | No Spam |
21 | 21 | 2 |
22 | 78 | 4 |
Table 6.4
The graph of the percentage of ham mails that contain DomainKeys signature and are authenticated is shown below. This graph helps us to determine the popularity of the DKIM method as it depends on the domains that have published their public key.
Figure 6.4
The graph in Figure 6.4 shows that only 3 mailboxes have more than 50% of their ham mails satisfying the DKIM check. On average, 25% of ham mails have the DomainKeys header. To further increase the scope of DKIM, more domains need to register and publish their public key.
Mailing List Check
The ham and spam mails for all the mailboxes were checked for the mailing list headers. If the mail has a mailing list header, then it is not spam as the user has subscribed to the mailing list and thus, the mailing list domain is a known domain for the user. This can also be used as a strong filter to categorize the mails into ham and spam.
The table containing mailing list data is given below:
Mailbox # | % Ham Mails with Mailing List Header | % Spam Mails with Mailing List Header |
---|---|---|
1 | 36 | No Spam |
2 | 12 | 0 |
3 | 30 | No Spam |
4 | 0 | 0 |
5 | 66 | 0 |
6 | 40 | 0 |
7 | 38 | No Spam |
8 | 42 | 0 |
9 | 0 | 0 |
10 | 2 | 0 |
11 | 8 | No Spam |
12 | 36 | 0 |
13 | 6 | 0 |
14 | 30 | No Spam |
15 | 0 | 0 |
16 | 22 | 0 |
17 | 10 | No Spam |
18 | 16 | No Spam |
19 | 10 | No Spam |
20 | 0 | No Spam |
21 | 0 | 0 |
22 | 38 | 0 |
Table 6.5
Thus, we can see that none of the spam mails came from a mailing list. The table also indicates the percentage of mails sent from a mailing list for an average user, which helps in determining the scope of the mailing list check. On average, 20% of the ham mails are sent from mailing lists. However, the standard deviation from the mean is about 19, showing that the percentage of mails sent from mailing lists differs widely from one mailbox to the next and depends on the user. Nonetheless, the mailing list check can be implemented as a good filter to classify mails.
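The mean and standard deviation quoted above can be reproduced from the ham column of Table 6.5. The sketch below is an illustration only; the class and method names are hypothetical, and it assumes the sample form of the standard deviation (dividing by n - 1), which the report does not state but which agrees with the quoted value of about 19.

```java
/** Hypothetical helper reproducing the summary statistics of Table 6.5. */
final class MailingListStats {

    // % ham mails with a mailing list header, mailboxes 1 to 22 (Table 6.5).
    static final double[] HAM_PERCENT = {
        36, 12, 30, 0, 66, 40, 38, 42, 0, 2, 8,
        36, 6, 30, 0, 22, 10, 16, 10, 0, 0, 38
    };

    private MailingListStats() { }

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample standard deviation (divides by n - 1); an assumption here. */
    static double sampleStdDev(double[] values) {
        double m = mean(values);
        double squaredDeviations = 0.0;
        for (double v : values) {
            squaredDeviations += (v - m) * (v - m);
        }
        return Math.sqrt(squaredDeviations / (values.length - 1));
    }
}
```

With the values of Table 6.5, mean(HAM_PERCENT) is approximately 20.1 and sampleStdDev(HAM_PERCENT) is approximately 18.7. A deviation almost as large as the mean is what indicates that the share of mailing list mail varies strongly between users.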
SPF Check
The SPF test was performed on both ham and spam mails, but the mails sent from the same domain as the recipient were not included in the test, as it is not possible to determine the sender and receiver hosts after the mail has been received. These mails are known as domain internal mails. Also, the chance of a domain internal mail being spam is minimal.
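One possible way to recognise a domain internal mail is to compare the domain parts of the sender and recipient addresses. The report does not describe how the project made this decision, so the sketch below is purely an assumption: the class DomainInternalCheck and its methods are hypothetical, and the comparison is an exact, case-insensitive match, so a subdomain such as cs.columbia.edu is not treated as internal to columbia.edu.

```java
import java.util.Locale;

/** Hypothetical check; the project's actual rule is not given in the report. */
final class DomainInternalCheck {

    private DomainInternalCheck() { }

    /**
     * Returns the lower-cased domain part of an address such as
     * "user@example.org" or "Name <user@example.org>", or null if
     * no domain can be found.
     */
    static String domainOf(String address) {
        if (address == null) {
            return null;
        }
        int at = address.lastIndexOf('@');
        if (at < 0) {
            return null;
        }
        String domain = address.substring(at + 1).trim();
        if (domain.endsWith(">")) {
            // Strip the closing bracket of the "Name <addr>" form.
            domain = domain.substring(0, domain.length() - 1).trim();
        }
        return domain.isEmpty() ? null : domain.toLowerCase(Locale.ROOT);
    }

    /** True when both addresses have the same, non-empty domain. */
    static boolean isDomainInternal(String from, String to) {
        String fromDomain = domainOf(from);
        String toDomain = domainOf(to);
        return fromDomain != null && fromDomain.equals(toDomain);
    }
}
```

Under this assumption, a mail for which isDomainInternal() returns true would be counted in the "Internal" column and left out of the SPF check.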
The table for the SPF results and column description is shown below:
Column 1: Mailbox number
Column 2: The total number of ham mails in each mailbox
Column 3: The number of mails on which the SPF check was performed. (excludes domain internal mails)
Column 4: The number of mails that produced the result "Fail" for the SPF check
Column 5: The number of mails that produced the result "Pass" for the SPF check
Column 6: The number of mails that produced the result "Error" for the SPF check
Column 7: The number of mails that produced the result "Neutral" for the SPF check
Column 8: The number of mails that produced the result "Softfail" for the SPF check
Column 9: The number of domain internal mails on which the SPF check was not performed
SPF Results for Ham mails:
Mailbox# | Total ham mails | Mails for SPF check | Fail | Pass | Error | Neutral | Softfail | Internal |
---|---|---|---|---|---|---|---|---|
1 | 1818 | 420 | 0 | 393 | 0 | 25 | 2 | 1398 |
2 | 497 | 288 | 5 | 221 | 0 | 49 | 13 | 209 |
3 | 1174 | 234 | 0 | 194 | 0 | 38 | 2 | 940 |
4 | 576 | 373 | 6 | 239 | 0 | 126 | 2 | 203 |
5 | 5002 | 4759 | 18 | 4468 | 1 | 189 | 83 | 243 |
6 | 1418 | 1314 | 3 | 1219 | 2 | 65 | 25 | 104 |
7 | 1230 | 192 | 0 | 179 | 0 | 13 | 0 | 1038 |
8 | 1788 | 1259 | 14 | 1108 | 0 | 129 | 8 | 529 |
9 | 133 | 51 | 0 | 27 | 1 | 22 | 1 | 82 |
10 | 524 | 521 | 7 | 426 | 10 | 76 | 2 | 3 |
11 | 168 | 48 | 0 | 39 | 0 | 8 | 1 | 120 |
12 | 1301 | 237 | 0 | 147 | 0 | 69 | 21 | 1064 |
13 | 1360 | 1154 | 13 | 865 | 0 | 258 | 18 | 206 |
14 | 934 | 269 | 0 | 208 | 0 | 57 | 4 | 665 |
15 | 414 | 290 | 80 | 240 | 0 | 31 | 1 | 124 |
16 | 1999 | 1527 | 86 | 1282 | 2 | 153 | 4 | 472 |
17 | 527 | 113 | 0 | 98 | 0 | 13 | 2 | 414 |
18 | 380 | 173 | 8 | 147 | 0 | 18 | 0 | 207 |
19 | 749 | 349 | 2 | 239 | 0 | 104 | 4 | 400 |
20 | 140 | 33 | 0 | 18 | 0 | 14 | 1 | 107 |
21 | 1151 | 649 | 12 | 296 | 2 | 322 | 17 | 502 |
22 | 2367 | 2282 | 56 | 1129 | 4 | 1093 | 0 | 85 |
TOTAL -> | 25650 | 16535 | 310 | 13182 | 22 | 2872 | 211 | 9115 |
Table 6.6
A pie chart is used to represent the fraction of ham mails whose SPF result is fail, pass, error, neutral or softfail. The percentages are found using the column totals of Table 6.6.
Figure 6.5
Thus, from Figure 6.5 we can see that very few ham mails fail the SPF test. The results for SPF checking are consistent and SPF check can be used as a good classifier for ham mails.
The table for SPF results on spam mails is given below:
Mailbox# | Total spam | Mails for SPF Check | Fail | Pass | Error | Neutral | Softfail | Internal |
---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 96 | 96 | 7 | 0 | 8 | 81 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 65 | 65 | 14 | 0 | 1 | 48 | 2 | 0 |
5 | 103 | 103 | 5 | 7 | 2 | 86 | 3 | 0 |
6 | 264 | 260 | 46 | 9 | 11 | 183 | 11 | 4 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 204 | 204 | 22 | 8 | 5 | 164 | 5 | 0 |
9 | 227 | 227 | 29 | 1 | 11 | 172 | 14 | 0 |
10 | 355 | 355 | 0 | 312 | 4 | 39 | 0 | 0 |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | 21 | 21 | 0 | 1 | 0 | 0 | 20 | 0 |
13 | 48 | 48 | 10 | 0 | 2 | 34 | 2 | 0 |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | 45 | 45 | 10 | 0 | 0 | 33 | 2 | 0 |
16 | 184 | 184 | 15 | 45 | 1 | 117 | 6 | 0 |
17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
21 | 371 | 369 | 29 | 69 | 5 | 255 | 11 | 2 |
22 | 890 | 890 | 43 | 491 | 31 | 285 | 40 | 0 |
TOTAL -> | 2873 | 2867 | 230 | 943 | 81 | 1497 | 116 | 6 |
Table 6.7
A pie chart is used to represent the fraction of spam mails whose SPF result is fail, pass, error, neutral or softfail. The percentages are found using the column totals of Table 6.7.
Figure 6.6
The chart indicates that the SPF check is not sufficient to categorize spam mails. Only 8% of the spam mails fail the SPF check, while 33% and 52% of them give pass and neutral, respectively, as their result. Thus, we can use the SPF check as a filter with medium weight. It should be used in conjunction with other filters for better performance.
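The percentages quoted for spam mails follow from the column totals of Table 6.7, taking the 2867 mails on which the SPF check was performed as the base. A minimal sketch of that arithmetic is given below; the class SpfShare and its methods are hypothetical names introduced for the example, and rounding to the nearest whole percent is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper turning the SPF column totals into percentages. */
final class SpfShare {

    // Number of spam mails on which the SPF check was performed (Table 6.7).
    static final int SPAM_CHECKED = 2867;

    private SpfShare() { }

    /** Column totals of Table 6.7, keyed by SPF result. */
    static Map<String, Integer> spamTotals() {
        Map<String, Integer> totals = new LinkedHashMap<>();
        totals.put("fail", 230);
        totals.put("pass", 943);
        totals.put("error", 81);
        totals.put("neutral", 1497);
        totals.put("softfail", 116);
        return totals;
    }

    /** Share of each result, rounded to the nearest whole percent. */
    static Map<String, Long> percentages(Map<String, Integer> counts, int checked) {
        if (checked <= 0) {
            throw new IllegalArgumentException("checked must be positive");
        }
        Map<String, Long> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            shares.put(entry.getKey(), Math.round(100.0 * entry.getValue() / checked));
        }
        return shares;
    }
}
```

For example, percentages(spamTotals(), SPAM_CHECKED) gives 8 for fail, 33 for pass and 52 for neutral, matching the figures in the text.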
Result Analysis Summary
7. Problems Faced and their Solutions
One of the major problems faced with the server based approach was that it did not attract any spam mails, and hence there was no sample data from which to gather statistics. Thus, the entire approach shifted to a standalone module, where instead of using a new server or honeypot to attract mails, already existing IMAP or POP enabled mail accounts on servers like Columbia or Gmail were used.
For a standalone program, IMAP based or POP based mail servers that are commonly used by people were needed. The limitation was in our ability to find a lot of spam mails, as Columbia has an effective spam filter in place. Also, the population for which the data was gathered did not vary and represented a particular set of people, namely university students. The solution is straightforward: distribute the standalone program to a larger and more diverse set of people.
Another major problem faced during the final stages of implementation was with MailStats.java. It was throwing MailBoxClosed exception for mailboxes containing more than 1000 mails. The solution was to open the mailbox only when the messages were to be read and not before that. After this bug was fixed, data could be gathered easily, since the mailbox size was not a limitation.
There were some problems with the threads used in the main module, due to which it seemed that the individual modules were not working. This bug was later fixed, and threads could then be used in parallel to update the progress bar as well as within the individual modules.
In the standalone program the biggest concern was the amount of time it took for the ping module to complete, because the messages needed to be parsed and the host name extracted from each of them was then pinged. This took about 200 ms to 1 s for each message. To overcome this problem, a thread pool was used.
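The thread pool idea can be sketched as follows. This is a minimal illustration under stated assumptions and not the project's PingLookup code: the class PooledReachability, its method names and the pool size parameter are hypothetical. The reachability probe is passed in as a function so that the pooling logic can be exercised without network access; the ping() method shows one possible probe based on the standard InetAddress.isReachable call.

```java
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/** Hypothetical sketch of checking many hosts with a fixed-size thread pool. */
final class PooledReachability {

    private PooledReachability() { }

    /** One possible probe; any lookup or I/O failure counts as unreachable. */
    static boolean ping(String host, int timeoutMillis) {
        try {
            return InetAddress.getByName(host).isReachable(timeoutMillis);
        } catch (Exception e) {
            return false;
        }
    }

    /** Runs the probe for every host on a pool of worker threads. */
    static Map<String, Boolean> checkAll(List<String> hosts, int threads,
                                         Predicate<String> probe) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit all probes first so that they run concurrently.
            List<Future<Boolean>> futures = new ArrayList<>();
            for (String host : hosts) {
                Callable<Boolean> task = () -> probe.test(host);
                futures.add(pool.submit(task));
            }
            // Collect the results in the original host order.
            Map<String, Boolean> results = new LinkedHashMap<>();
            for (int i = 0; i < hosts.size(); i++) {
                boolean reachable;
                try {
                    reachable = futures.get(i).get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    reachable = false;
                } catch (ExecutionException e) {
                    reachable = false; // the probe itself threw an exception
                }
                results.put(hosts.get(i), reachable);
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller could use checkAll(hosts, 10, h -> PooledReachability.ping(h, 1000)), where the pool size and timeout are arbitrary example values. Because most of the time per host is spent waiting for a reply, running the probes concurrently shortens the total run time roughly in proportion to the pool size.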
Lastly, the user needed to be given constant feedback and for this a progress bar was implemented that showed the progress of the entire program as well as individual modules.
The link to the source code is SpamTestLatest.zip. The code for the first module consists of 3 files, namely DhcpDslPing.java, DkimSpfMailList.java and PingLookup.java.
The link to the result set page where all the cumulative results can be accessed is http://wiki.cs.columbia.edu:8080/display/spam/Resultset
The tools used were as follows:
Navicat MySQL for creating the database
Netbeans IDE for writing code for the modules
OpenOffice for writing the report
Visual Paradigm for UML diagrams
OpenOffice excel for drawing graphs and representing data
Concept Draw Pro for DFD
Convert Excel Spreadsheet to HTML
RFC 2822 [http://tools.ietf.org/html/rfc2822]
RFC 2821 [http://tools.ietf.org/html/rfc2821]
JavaMail API [http://java.sun.com/products/javamail/]
Spam Analysis and Reputation Project [http://wiki.cs.columbia.edu:8080/display/spam/Home]
SARP Modules [http://wiki.cs.columbia.edu:8080/display/spam/IMAP+analyzer+modules]
Adrian Frei - Spam Analysis and Reputation Project: DNS Blacklists
Preethi Narayan - Spam Analysis and Reputation Project: Received Header Vs Sent and From Header
Tejas Nadkarni - Parser and Standalone Framework
Aditi Rajoriya - Spam Analysis and Reputation Project: IMAP Retrieval and To/Body Module
Dhrumin Shah - Spam Analysis and Reputation Project: Domain Check and Image Analysis Modules
Nirav Shah - Spam Analysis and Reputation Project: Email Source, Date/Time and Attachment Analysis
Wikipedia - SPF and DKIM
Professor Henning G. Schulzrinne - Project Advisor and Mentor