Spam Analysis

Mohit Vazirani
Columbia University
New York, NY 10027
USA
mcv2107@columbia.edu

Abstract

With an increase in e-mail spam over recent years, the number of spam classification techniques being used has also increased. Broadly speaking, automated spam classification can be performed in two ways. One approach is to use content-based filtering which scans the message body for the occurrence of specific disallowed phrases or regular expressions. The other approach to classify messages is to analyze the information (particularly address information) stored in the message envelope in order to determine whether the source of the message is authorized to send messages using the identity claimed in the message headers. In this project, we utilize the second approach by running tests on the message headers and generate statistics that help in understanding the outcome of these tests.

Introduction

This project involves analyzing message headers such as From, Return-Path, Received, List-Id and Message-Id, which contain relevant information such as e-mail addresses, IP addresses, domain names of hosts, mailing list information and the message identifier. This data is used as input to several tests that help differentiate the characteristics of spam from those of non-spam messages.

In order to run tests on message headers, we first need both spam and non-spam message stores. The sources of messages used in this project include:

The first data set was a CUNIX mailbox containing a mix of spam and non-spam messages. The messages were first manually classified as spam or non-spam and placed in separate folders. Because the CUNIX mailbox used IMAP over SSL, the JavaMail API was used together with the Java Secure Socket Extension (JSSE) and the JavaBeans Activation Framework (JAF) to connect to it.

The second data set was the SpamArchive public spam corpus. The headers of many messages in this corpus did not follow the specification, which made it necessary to preprocess these messages and, in some cases, to remove inconsistent ones.

The final data set consisted of recent messages from the IETF mailing list archives. This corpus also required significant preprocessing to remove spam messages.

The next step was to implement parsers specific to each of the above data sets. The parsers extract the message headers of interest and parse them to obtain data such as the earliest host to add a Received header, the IP address of the sender, and the domain part of the From header. This data is used to perform the tests and generate the relevant statistics.
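
As an illustration of the header extraction step, the following is a minimal sketch that collects header fields from a raw message and unfolds continuation lines. It is not the project's parser (which was built on JavaMail); the class name HeaderExtractor and its method are hypothetical, and the sketch assumes the message is available as a single string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps lower-cased header names to their values,
// in the order in which the values appear in the message.
final class HeaderExtractor {
    static Map<String, List<String>> headers(String rawMessage) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        String name = null;
        StringBuilder value = new StringBuilder();
        for (String line : rawMessage.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // a blank line ends the header section
            }
            boolean continuation = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (continuation && name != null) {
                value.append(' ').append(line.trim()); // unfold a folded header
                continue;
            }
            store(out, name, value);
            name = null;
            int colon = line.indexOf(':');
            if (colon <= 0 || line.substring(0, colon).matches(".*\\s.*")) {
                continue; // not a header field, e.g. an mbox "From " separator
            }
            name = line.substring(0, colon).toLowerCase(Locale.ROOT);
            value.append(line.substring(colon + 1).trim());
        }
        store(out, name, value);
        return out;
    }

    // Adds the pending header (if any) to the map and clears the buffer.
    private static void store(Map<String, List<String>> out, String name, StringBuilder value) {
        if (name != null) {
            out.computeIfAbsent(name, k -> new ArrayList<>()).add(value.toString());
        }
        value.setLength(0);
    }

    public static void main(String[] args) {
        String raw = "From: a@example.org\r\n"
                + "Received: from x\r\n\tby y\r\n"
                + "Received: from z by x\r\n"
                + "\r\n"
                + "body text\r\n";
        System.out.println(headers(raw).get("received"));
    }
}
```

The values of a repeated header such as Received stay in message order, so the last element of the list corresponds to the earliest hop.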

The report is organized as follows:

  1. Introduction
  2. Related Work
  3. Background
  4. Program Documentation
  5. Tests
  6. Future Work
  7. References

Related Work

Most anti-spam tools that claim to use header analysis scan the content or headers for known spam patterns and add extra labels to the headers, which the end user's e-mail client can use to accept or reject messages. SPF is an interesting approach to header-based classification since it permits only certain hosts to send e-mail on behalf of a domain. Because SPF is a fairly new proposal, little research has been published in this area.

Background

This section provides a brief explanation for some of the background concepts that this project is built upon.

SMTP

Simple Mail Transfer Protocol (SMTP) is the protocol used for the reliable transport and delivery of electronic messages on the Internet. An SMTP server can be either the final destination or an intermediate relay server. When a message is relayed by an intermediate relay server, the relay server plays the role of an SMTP client to the next server (which plays the role of an SMTP server). An SMTP client determines the address of the SMTP server host by resolving the destination domain name to either an intermediate Mail eXchanger (MX) host or a final target host. Thus a message transfer can occur in a series of hops through intermediary systems. SMTP clients and servers act as Mail Transfer Agents (MTAs) to send messages between the source and destination Mail User Agents (MUAs). When a message passes through a server, the server adds a Received header to the message above any existing Received headers. The Received header includes the name and the IP address of the source host, the domain name of the SMTP server that adds the header, and a timestamp. Thus a chain of Received headers is observed when the message is transported through many intermediate servers, with the most recent hop at the top and the earliest hop at the bottom.
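
To make the structure of a Received header concrete, here is a minimal sketch that pulls the three fields named above out of one header value. It assumes the sendmail-style layout seen in the CUNIX examples later in this report ("from NAME (NAME [IP]) by SERVER"); other MTAs format the header differently, and such values are simply reported as unparsed. The class ReceivedHeader is a hypothetical helper, not part of the project's code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper covering only the sendmail-style layout:
//   from <name> ([<reverse-dns-name> ][<ipv4>]) by <server> ...
final class ReceivedHeader {
    private static final Pattern SHAPE = Pattern.compile(
            "from\\s+(\\S+)\\s+\\((?:\\S+\\s+)?\\[(\\d{1,3}(?:\\.\\d{1,3}){3})\\]\\)\\s+by\\s+(\\S+)");

    final String claimedName; // name the source host gave for itself
    final String sourceIp;    // IP address recorded by the receiving server
    final String byHost;      // server that added this header

    private ReceivedHeader(String claimedName, String sourceIp, String byHost) {
        this.claimedName = claimedName;
        this.sourceIp = sourceIp;
        this.byHost = byHost;
    }

    // Returns an empty Optional when the value does not have the expected layout.
    static Optional<ReceivedHeader> parse(String headerValue) {
        Matcher m = SHAPE.matcher(headerValue);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new ReceivedHeader(m.group(1), m.group(2), m.group(3)));
    }

    public static void main(String[] args) {
        String value = "from play.cs.columbia.edu (play.cs.columbia.edu [128.59.21.100]) "
                + "by jujube.cc.columbia.edu (8.13.0/8.13.0) with ESMTP id j79GnHur005599";
        parse(value).ifPresent(h ->
                System.out.println(h.claimedName + " " + h.sourceIp + " " + h.byHost));
    }
}
```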

SPF

Sender Policy Framework (SPF) is a mechanism for specifying which mail servers are authorized to send mail for a particular domain. The SPF information for a domain is stored as a DNS TXT resource record and is available to the receiving MTA through a DNS query. The receiving MTA can thus decide to accept or reject the message based on the result of the SPF query. The SPF record defines one or more tests to carry out to verify the sender. The tests are evaluated in order, and the first test to match terminates SPF processing. SPF also allows macro expansion of the result.

The content of the returned SPF record is in the form:
v=spf1 [[pre] type ] ... [mod]

For example, if the domain's SPF record is "v=spf1 a -all" and its A record resolves to IP address 192.168.0.10, only the machine with IP address 192.168.0.10 is authorized to send e-mail for that domain.
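
The record format can be illustrated with a small tokenizer. The sketch below splits a record into its tests, each with a prefix ('+', '-', '~' or '?', with '+' assumed when none is written, as in RFC 4408) and a type, and skips modifiers of the form name=value. The class SpfTerms is hypothetical and performs no DNS lookups or evaluation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits "v=spf1 ..." into prefix/type pairs.
final class SpfTerms {
    static final class Term {
        final char qualifier;   // '+', '-', '~' or '?'
        final String mechanism; // e.g. "a", "mx", "ip4:192.168.0.10", "all"

        Term(char qualifier, String mechanism) {
            this.qualifier = qualifier;
            this.mechanism = mechanism;
        }

        @Override
        public String toString() {
            return qualifier + mechanism;
        }
    }

    static List<Term> parse(String record) {
        String[] parts = record.trim().split("\\s+");
        if (!parts[0].equalsIgnoreCase("v=spf1")) {
            throw new IllegalArgumentException("not an SPF version 1 record: " + record);
        }
        List<Term> terms = new ArrayList<>();
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (part.matches("[A-Za-z][A-Za-z0-9._-]*=.*")) {
                continue; // modifier such as redirect= or exp=, not a test
            }
            char qualifier = '+'; // default prefix when none is written
            if ("+-~?".indexOf(part.charAt(0)) >= 0) {
                qualifier = part.charAt(0);
                part = part.substring(1);
            }
            terms.add(new Term(qualifier, part));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(parse("v=spf1 a -all")); // prints [+a, -all]
    }
}
```

For the record in the example above, the first term passes hosts that match the domain's A record, and the final "-all" fails every other host.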

SORBS

The Spam and Open Relay Blocking System (SORBS) is an open proxy and open mail relay DNSBL that maintains lists of IP addresses related to spamming. It is built on top of the DNS and can be queried like any regular DNS name. This project uses its Dynamic User and Host List (DUHL) to determine whether an IP address has been dynamically assigned.

For example, to check whether 172.128.0.10 is a dynamic IP address, the DNS A resource record for 10.0.128.172.dnsbl.sorbs.net is retrieved and if it contains 127.0.0.10, it means that the queried IP is dynamically assigned.
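
The lookup described above can be sketched as follows. The query-name construction follows the example directly; the JNDI part uses the JDK's DNS provider and is only one plausible way to issue the query, since the project's actual code is not shown here. The class DnsblQuery and its method names are hypothetical, and the lookup needs network access and a list that still answers queries.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NameNotFoundException;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Hypothetical helper for DNSBL lookups through JNDI.
final class DnsblQuery {
    // Reverses the octets of an IPv4 address and appends the DNSBL zone.
    static String queryName(String ipv4, String zone) {
        String[] o = ipv4.split("\\.");
        if (o.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        return o[3] + "." + o[2] + "." + o[1] + "." + o[0] + "." + zone;
    }

    // True if the A record set for the query name contains 127.0.0.10.
    static boolean isListedAsDynamic(String ipv4) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        DirContext ctx = new InitialDirContext(env);
        try {
            Attributes attrs = ctx.getAttributes(
                    queryName(ipv4, "dnsbl.sorbs.net"), new String[] {"A"});
            Attribute a = attrs.get("A");
            if (a == null) {
                return false;
            }
            for (int i = 0; i < a.size(); i++) {
                if ("127.0.0.10".equals(a.get(i))) {
                    return true;
                }
            }
            return false;
        } catch (NameNotFoundException e) {
            return false; // no such name: the address is not listed
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws NamingException {
        System.out.println(queryName("172.128.0.10", "dnsbl.sorbs.net"));
        if (args.length > 0) { // only touch the network when an address is given
            System.out.println(isListedAsDynamic(args[0]));
        }
    }
}
```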

Program Documentation

Tests

This section lists tests performed by the parser on the relevant message headers and analyzes their outcomes.

Figures of success and failures of these tests along with specific cases and other relevant information are logged to files to help us analyze the outcome of the tests.

From domain and earliest Received hostname compatibility

In this test, the parser extracts the Received headers and parses them to obtain fields such as the earliest non-local host to have received the message and the IP address of the first MTA. It then checks whether the earliest host agrees with the domain part of the From header. The parser also reads the List-Id header to determine whether the message belongs to a mailing list; such messages are excluded from the test, since their From header is generally the name of the mailing list. Messages originating from a Gmail account (sent either through the Gmail web interface or through other POP/IMAP accounts configured in Gmail) carry no information about the earliest Received host; instead, the earliest Received host shows up as an address in 10.0.0.0/8. The parser was modified to disregard such messages. The test is also inapplicable to messages sent from an e-mail client configured to use an outgoing server that is not one of the MX hosts for the domain of the sender's e-mail address. Statistics from this part of the project are shown below.
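
A minimal sketch of the comparison itself is given here. It assumes, based on the examples in this section (where twinkle@cs.columbia.edu is reported with From domain columbia.edu), that the From domain is reduced to its last two labels before being compared with the earliest Received host; that reduction would be wrong for domains such as example.co.uk. The class FromDomainCheck is hypothetical.

```java
import java.util.Locale;

// Hypothetical helper: does the earliest Received host fall under the From domain?
final class FromDomainCheck {
    // Domain part of the address in a From header value.
    static String domainOf(String fromHeader) {
        int at = fromHeader.lastIndexOf('@');
        if (at < 0) {
            throw new IllegalArgumentException("no address in: " + fromHeader);
        }
        String domain = fromHeader.substring(at + 1).trim();
        if (domain.endsWith(">")) {
            domain = domain.substring(0, domain.length() - 1);
        }
        return domain.toLowerCase(Locale.ROOT);
    }

    // Assumption: keep only the last two labels, e.g. cs.columbia.edu -> columbia.edu.
    static String lastTwoLabels(String domain) {
        String[] labels = domain.split("\\.");
        int n = labels.length;
        return n <= 2 ? domain : labels[n - 2] + "." + labels[n - 1];
    }

    static boolean hostMatches(String fromHeader, String earliestHost) {
        String domain = lastTwoLabels(domainOf(fromHeader));
        String host = earliestHost.toLowerCase(Locale.ROOT);
        return host.equals(domain) || host.endsWith("." + domain);
    }

    public static void main(String[] args) {
        System.out.println(hostMatches(
                "Twinkle Edwards twinkle@cs.columbia.edu", "play.cs.columbia.edu"));
        System.out.println(hostMatches(
                "\"Gail Kaiser\" kaiser@cs.columbia.edu", "ms-smtp-03.rdc-nyc.rr.com"));
    }
}
```

Comparing on a label boundary (an exact match or a "." followed by the domain) avoids treating a host such as notcolumbia.edu as part of columbia.edu.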

Corpus: Columbia University (CUNIX) mailbox
Corpus Type: Not Spam
Total Messages: 2773
Messages where Received test produced a result: 2164
Messages where earliest received host matches from domain: 2082
Percentage of messages where received host matches from domain: 96.21%

An example where this test succeeded for non-spam is shown below:

Subject: Computer Science Dept. Graduate Student Orientation
From: Twinkle Edwards twinkle@cs.columbia.edu
From domain: columbia.edu
Earliest Received Host: play.cs.columbia.edu
Received: from play.cs.columbia.edu (play.cs.columbia.edu [128.59.21.100]) by jujube.cc.columbia.edu (8.13.0/8.13.0) with ESMTP id j79GnHur005599 Tue, 9 Aug 2005 12:49:20 -0400 (EDT)
Received: from play.cs.columbia.edu (localhost [127.0.0.1]) by play.cs.columbia.edu (8.12.10/8.12.10) with ESMTP id j79GnG0M009426 Tue, 9 Aug 2005 12:49:16 -0400 (EDT)

An example where this test failed for non-spam is shown below:

From: "Gail Kaiser" kaiser@cs.columbia.edu
From domain: columbia.edu
Earliest Received Host: ms-smtp-03.rdc-nyc.rr.com
Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20]) by jujube.cc.columbia.edu (8.13.0/8.13.0) with ESMTP id jA4Gjvoq013508 Fri, 4 Nov 2005 11:46:21 -0500 (EST)
Received: from ms-smtp-03.rdc-nyc.rr.com (ms-smtp-03-smtplb.rdc-nyc.rr.com [24.29.109.7]) by cs.columbia.edu (8.12.10/8.12.10) with ESMTP id jA4GSKhn005617 Fri, 4 Nov 2005 11:28:20 -0500 (EST)
Received: from study (cpe-24-193-125-37.nj.res.rr.com [24.193.125.37]) by ms-smtp-03.rdc-nyc.rr.com (8.12.10/8.12.7) with SMTP id jA4GR3ME029697 Fri, 4 Nov 2005 11:27:04 -0500 (EST)

The test failed because the user used an SMTP server which isn't the designated SMTP server for the domain columbia.edu.

Corpus: Columbia University (CUNIX) mailbox
Corpus Type: Spam
Total Messages: 700
Messages where Received test produced a result: 372
Messages where earliest received host matches from domain: 8
Percentage of messages where received host matches from domain: 2.15%

An example where this test succeeded for spam is shown below:

From: Phil Stapleton jfmontek@cs.columbia.edu
From domain: columbia.edu
Earliest Received Host: cs.columbia.edu
Received: from teewurst.cc.columbia.edu ([unix socket]) by teewurst.cc.columbia.edu (Cyrus v2.3-alpha) with LMTPA Tue, 18 Apr 2006 11:10:33 -0400
Received: from feta.cc.columbia.edu (feta.cc.columbia.edu [128.59.28.164]) by teewurst.cc.columbia.edu (8.13.1/8.13.1) with ESMTP id k3IFAWhi012303 Tue, 18 Apr 2006 11:10:33 -0400
Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20]) by feta.cc.columbia.edu (8.13.6/8.13.6) with ESMTP id k3IF8sRa008320 Tue, 18 Apr 2006 11:09:03 -0400 (EDT)
Received: from 425828E8 ([222.138.141.235]) by cs.columbia.edu (8.12.10/8.12.10) with SMTP id k3IF0JcO021227 Tue, 18 Apr 2006 11:00:54 -0400 (EDT)

This test succeeded because the From header was spoofed to use an e-mail address in the domain columbia.edu.

An example where this test failed for spam is shown below.

From: "Jamie Browning" xztggevvoj@yahoo.com
From domain: yahoo.com
Earliest Received Host: cs.columbia.edu
Received: from teewurst.cc.columbia.edu ([unix socket]) by teewurst.cc.columbia.edu (Cyrus v2.3-alpha) with LMTPA Fri, 14 Apr 2006 20:14:44 -0400
Received: from feta.cc.columbia.edu (feta.cc.columbia.edu [128.59.28.164]) by teewurst.cc.columbia.edu (8.13.1/8.13.1) with ESMTP id k3F0Eih2015676 Fri, 14 Apr 2006 20:14:44 -0400
Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20]) by feta.cc.columbia.edu (8.13.6/8.13.6) with ESMTP id k3F0DGW8015771 Fri, 14 Apr 2006 20:13:18 -0400 (EDT)
Received: from 128.59.16.20 ([60.20.179.216]) by cs.columbia.edu (8.12.10/8.12.10) with SMTP id k3F05DcK007441 Fri, 14 Apr 2006 20:05:18 -0400 (EDT)

Reachability of source hosts

In this test, we check whether the host in the earliest Received header is still reachable through ping. Raw sockets cannot be created in Java, so it is not possible to send ICMP pings directly from a Java application. The parser code was therefore modified to write a plain text file listing the IP addresses of the earliest hosts from the different messages, and this file was provided as input to the fping utility, which pings hosts in parallel. Many messages had either localhost or a private IP address as their source IP address; such messages were not included in this test. The statistics for this test are shown below.
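
The filtering and file-writing step might look like the sketch below. It drops loopback, private and link-local IPv4 addresses, removes duplicates, and writes one address per line. The class PingTargets, its method names and the file name targets.txt are hypothetical; the project's own code is not reproduced here.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: prepares the list of addresses handed to fping.
final class PingTargets {
    // Parses a dotted-quad literal without any DNS lookup; null if malformed.
    static InetAddress parseIpv4(String s) {
        String[] parts = s.trim().split("\\.");
        if (parts.length != 4) {
            return null;
        }
        byte[] bytes = new byte[4];
        try {
            for (int i = 0; i < 4; i++) {
                int v = Integer.parseInt(parts[i]);
                if (v < 0 || v > 255) {
                    return null;
                }
                bytes[i] = (byte) v;
            }
            return InetAddress.getByAddress(bytes);
        } catch (NumberFormatException | UnknownHostException e) {
            return null;
        }
    }

    static boolean isPublic(String ip) {
        InetAddress a = parseIpv4(ip);
        return a != null && !a.isLoopbackAddress() && !a.isSiteLocalAddress()
                && !a.isLinkLocalAddress() && !a.isAnyLocalAddress();
    }

    // Keeps the first occurrence of each public address, in input order.
    static List<String> select(List<String> ips) {
        Set<String> unique = new LinkedHashSet<>();
        for (String ip : ips) {
            if (isPublic(ip)) {
                unique.add(ip.trim());
            }
        }
        return new ArrayList<>(unique);
    }

    static void writeTargets(List<String> ips, Path file) throws IOException {
        Files.write(file, select(ips), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> ips = Arrays.asList(
                "127.0.0.1", "10.91.34.44", "208.42.140.127", "208.42.140.127");
        System.out.println(select(ips));
        if (args.length > 0) { // write a file only when a path is given
            writeTargets(ips, Paths.get(args[0]));
        }
    }
}
```

A file produced this way could then be passed to fping, for example with `fping -a -f targets.txt`; the exact options used in the project are not recorded in this report.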

Corpus: Subset of corpus from IETF mailing list archives
Corpus Type: Not Spam
Hosts inspected: 4897
Hosts reachable via ping: 1816 (37.08%)

Corpus: Subset of corpus from SpamArchive
Corpus Type: Spam
Hosts inspected: 13263
Hosts reachable via ping: 4077 (30.74%)

Messages with statically assigned IP addresses

This test determines what fraction of spam and non-spam messages originate from statically assigned IP addresses. Dynamic IP addresses are generally assigned to cable and DSL connections. To determine whether a source IP address was statically or dynamically assigned, the SORBS DUHL database was used. The SORBS DNS records were queried through the Java Naming and Directory Interface (JNDI). The statistics for this test are shown below.

Corpus: Subset of corpus from IETF mailing list archives
Corpus Type: Not Spam
Corpus Size: 70804
Messages originating from hosts with statically assigned IP addresses: 63280 (89.37%)

Corpus: Subset of corpus from SpamArchive
Corpus Type: Spam
Corpus Size: 44603
Messages originating from hosts with statically assigned IP addresses: 31849 (71.41%)

Domains with existing SPF records

This test determines the fraction of the sender domains for which SPF resource records exist.
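
Since SPF information is published as a DNS TXT record, the existence check can be sketched with the JDK's JNDI DNS provider as follows. This is an assumption about how such a lookup can be written, not a copy of the project's code; the class SpfLookup and its method names are hypothetical. The DNS provider may return TXT strings wrapped in double quotes, so the helper strips them before testing for the "v=spf1" version tag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import javax.naming.Context;
import javax.naming.NameNotFoundException;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Hypothetical helper: finds the SPF record among a domain's TXT records.
final class SpfLookup {
    static String unquote(String s) {
        String t = s.trim();
        if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
            t = t.substring(1, t.length() - 1);
        }
        return t;
    }

    // Picks the first TXT value that carries the SPF version tag.
    static Optional<String> selectSpf(List<String> txtValues) {
        for (String raw : txtValues) {
            String v = unquote(raw);
            String lower = v.toLowerCase(Locale.ROOT);
            if (lower.equals("v=spf1") || lower.startsWith("v=spf1 ")) {
                return Optional.of(v);
            }
        }
        return Optional.empty();
    }

    // Empty result: the domain has no SPF record. Other NamingExceptions
    // (timeouts, server failures) propagate to the caller.
    static Optional<String> lookup(String domain) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        DirContext ctx = new InitialDirContext(env);
        try {
            Attributes attrs = ctx.getAttributes(domain, new String[] {"TXT"});
            Attribute txt = attrs.get("TXT");
            List<String> values = new ArrayList<>();
            if (txt != null) {
                for (int i = 0; i < txt.size(); i++) {
                    values.add(String.valueOf(txt.get(i)));
                }
            }
            return selectSpf(values);
        } catch (NameNotFoundException e) {
            return Optional.empty(); // the domain itself does not exist
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws NamingException {
        System.out.println(selectSpf(Arrays.asList("\"some other text\"", "\"v=spf1 a -all\"")));
        if (args.length > 0) { // only touch the network when a domain is given
            System.out.println(lookup(args[0]));
        }
    }
}
```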

Corpus: Subset of corpus from IETF mailing list archives
Corpus Type: Not Spam
Unique domains inspected: 1405
Unique domains where SPF test failed: 217
Unique domains not having SPF Records: 909
Unique domains having SPF Records: 279 (23.48% of the 1188 domains where the test did not fail)

Corpus: Subset of corpus from SpamArchive
Corpus Type: Spam
Unique domains inspected: 11226
Unique domains where SPF test failed: 3618
Unique domains not having SPF Records: 5958
Unique domains having SPF Records: 1650 (21.69% of the 7608 domains where the test did not fail)

Messages where SPF verification succeeds

This test checks whether the IP address in the earliest Received header agrees with the IP range(s) present in the SPF records for the domains that use SPF records. A very small percentage of domains had non-compliant SPF resource records. The code had to be modified to ignore these special cases. The statistics for this test are shown below.
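
For the simplest kind of SPF test, an explicit ip4 range, the comparison can be sketched as below. The sketch only evaluates ip4 terms that carry the default or an explicit '+' prefix; the a, mx, ptr and include tests need further DNS lookups and are ignored here, so a false result means only that no ip4 range matched. The class Ip4Range and its method names are hypothetical.

```java
// Hypothetical helper: checks an IPv4 address against ip4 ranges in an SPF record.
final class Ip4Range {
    // Converts a dotted-quad address to an unsigned 32-bit value held in a long.
    static long toLong(String ipv4) {
        String[] parts = ipv4.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        long result = 0;
        for (String part : parts) {
            int v = Integer.parseInt(part);
            if (v < 0 || v > 255) {
                throw new IllegalArgumentException("octet out of range: " + ipv4);
            }
            result = (result << 8) | v;
        }
        return result;
    }

    // True if ip lies inside the range written as "a.b.c.d/prefix" (or a single address).
    static boolean contains(String cidr, String ip) {
        int slash = cidr.indexOf('/');
        String base = slash < 0 ? cidr : cidr.substring(0, slash);
        int prefix = slash < 0 ? 32 : Integer.parseInt(cidr.substring(slash + 1));
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("bad prefix length: " + cidr);
        }
        long mask = prefix == 0 ? 0L : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        return (toLong(base) & mask) == (toLong(ip) & mask);
    }

    // True if any passing ip4 term of the record contains the address.
    static boolean passesViaIp4(String spfRecord, String ip) {
        for (String term : spfRecord.trim().split("\\s+")) {
            String t = term.startsWith("+") ? term.substring(1) : term;
            if (t.startsWith("ip4:") && contains(t.substring(4), ip)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(contains("128.59.16.0/21", "128.59.16.20"));
        System.out.println(passesViaIp4("a mx ptr ip4:128.59.16.0/21 ~all", "147.28.0.16"));
    }
}
```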

Corpus: Subset of corpus from IETF mailing list archives
Corpus Type: Not Spam
Total messages inspected: 71480
Messages where SPF verification test was performed (without errors): 14446
Messages where IP address validation against SPF Record succeeded: 8317 (57.57%)

An example where this test succeeds for non-spam:

From: "Timothy J. Salo" salo@saloits.com
Domain: saloits.com
SPF Record: +mx -all
IP address of earliest MTA: 208.42.140.127
Received: from [127.0.0.1] (helo=stiedprmman1.va.neustar.com) by megatron.ietf.org with esmtp (Exim 4.43) id 1FReud-0006Tp-6I Thu, 06 Apr 2006 20:33:19 -0400
Received: from [10.91.34.44] (helo=ietf-mx.ietf.org) by megatron.ietf.org with esmtp (Exim 4.43) id 1FReuc-0006Tk-ER for 6lowpan@ietf.org Thu, 06 Apr 2006 20:33:18 -0400
Received: from saloits.com ([208.42.140.127] helo=newbsd.saloits.com) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1FReub-0005Y6-Tc for 6lowpan@ietf.org Thu, 06 Apr 2006 20:33:18 -0400
Received: from newbsd.saloits.com (localhost.saloits.com [127.0.0.1]) by newbsd.saloits.com (8.13.1/8.13.1) with ESMTP id k370XH73037014 for 6lowpan@ietf.org Thu, 6 Apr 2006 19:33:17 -0500 (CDT) (envelope-from salo@newbsd.saloits.com)
Received: (from salo@localhost) by newbsd.saloits.com (8.13.1/8.13.1/Submit) id k370XGpf037013 for 6lowpan@ietf.org Thu, 6 Apr 2006 19:33:16 -0500 (CDT) (envelope-from salo)

The test succeeded in this case because the address obtained from the retrieved MX resource record equalled the IP address of the earliest MTA.

An example where this test fails for non-spam:

From: "Steven M. Bellovin" smb@cs.columbia.edu
Domain: cs.columbia.edu
SPF Record: a mx ptr mx:ober.cs.columbia.edu mx:opus.cs.columbia.edu mx:firebird.cs.columbia.edu ip4:128.59.16.0/21 ~all
IP address of earliest MTA: 147.28.0.16
Received: from [127.0.0.1] (helo=stiedprmman1.va.neustar.com) by megatron.ietf.org with esmtp (Exim 4.43) id 1Fy8p7-0001xl-4F Wed, 05 Jul 2006 10:57:53 -0400
Received: from [10.91.34.44] (helo=ietf-mx.ietf.org) by megatron.ietf.org with esmtp (Exim 4.43) id 1Fy8p4-0001x6-Te Wed, 05 Jul 2006 10:57:50 -0400
Received: from machshav.com ([147.28.0.16]) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1Fy8p1-0001mK-HF Wed, 05 Jul 2006 10:57:50 -0400
Received: from berkshire.machshav.com (localhost [127.0.0.1]) by machshav.com (Postfix) with ESMTP id F348AFB2DF Wed, 5 Jul 2006 14:57:45 +0000 (UTC)
Received: by berkshire.machshav.com (Postfix, from userid 54047) id BB88D3C04C2 Wed, 5 Jul 2006 10:57:38 -0400 (EDT)

The test failed because the mail was sent through an SMTP server (machshav.com) which isn't one of the designated ones for this domain.

Corpus: Subset of corpus from SpamArchive
Corpus Type: Spam
Total messages inspected: 50265
Messages where SPF verification test was performed (without errors): 6193
Messages where IP address validation against SPF Record succeeded: 2037 (32.89%)

An example where this test fails for spam:

From: "BOBBI FOBERG" bobbifoberg7181@hotmail.com
Domain: hotmail.com
SPF Record: ip4:209.185.128.0/23 ip4:209.185.130.0/23 ip4:209.185.240.0/22 ip4:216.32.180.0/22 ip4:216.32.240.0/22 ip4:216.33.148.0/22 ip4:216.33.151.0/24 ip4:216.33.236.0/22 ip4:216.33.240.0/22 ip4:216.200.206.0/24 ip4:204.95.96.0/20 ~all
IP address of earliest MTA: 64.4.60.60
Received: from (209.240.205.149) by with WTV-SMTP Mon, 2 Jan 2006 01:30:46 -0800
Received: from hotmail.com (bay0-dav-038.bay0.hotmail.com [64.4.60.60]) by smtpin-3301.bay.webtv.net (WebTV_Postfix+sws) with ESMTP id 67124100A27 for scot8@webtv.net Mon, 2 Jan 2006 01:30:46 -0800 (PST)
Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC Mon, 2 Jan 2006 01:30:37 -0800

Future Work

Further work may consist of the following enhancements:

References

  1. J. Klensin, RFC 2821 - Simple Mail Transfer Protocol, AT&T Laboratories, April 2001.
  2. P. Resnick, RFC 2822 - Internet Message Format, Qualcomm Incorporated, April 2001.
  3. M. Wong, W. Schlitt, RFC 4408 - Sender Policy Framework (SPF) for Authorizing Use of Domains in E-Mail, Version 1, April 2006.
  4. James Kurose, Keith Ross, Computer Networking: A Top-Down Approach Featuring the Internet, 2005.
  5. JavaMail API
  6. Java Secure Socket Extension (JSSE)
  7. Java Naming and Directory Interface (JNDI)