Spring 2003 - IRT Project

Technical Analysis of E-mail Spam

In this research I conducted a statistical analysis of spam, also known as unsolicited commercial email. The analysis is based mostly on spam characteristics, that is, features extracted from header fields and body contents. The goal of the research is not merely to find identifying marks of spam but to compile statistical information about it.

Collecting Spam

Collecting spam from January 1997 to April 2003 from Google Groups

The source spam I used in this research came from the Google Groups website, through the "Advanced Groups Search" page at google.com. It can be retrieved manually by following the link: "http://groups.google.com/advanced_group_search?q=group:*&hl=en&lr=&ie=UTF-8&oe=UTF-8&group=*"

To automate the collection, I used a small program that connected to Google's site and downloaded about 100 spam messages for each month from January 1997 through December 2002, and 1,000 spam messages for each month of 2003. I grouped the messages by year and used them throughout the research.

Spam Origin

Statistical report about spam origin

Spam origin information comes mostly from the Received header. This is the only place in an email that provides reasonably reliable information. Assuming a reasonably standard and recent sendmail setup, a Received header line normally looks like:

Received: from host1 (host2 [ww.xx.yy.zz]) by host3 (8.7.5/8.7.3) 
          with SMTP id MAA04298; Thu, 18 Jul 1996 12:18:06 -0600
or
Received: from ww.xx.yy.zz (HELO host1) (ww.xx.yy.zz) by host3 
          with SMTP; 22 Apr 2003 06:35:19 -0700 (PDT)

In either case, the Received line shows four pieces of useful information (reading from back to front, in order of decreasing reliability):

 - The host that added the Received line (host3)
 - The IP address of the incoming SMTP connection (ww.xx.yy.zz)
 - The reverse-DNS lookup of that IP address (host2)
 - The name the sender used in the SMTP HELO command when they
   connected (host1).
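
As a minimal sketch of how those four fields might be pulled out of the first format shown above, the Java fragment below uses a regular expression. The class name ReceivedLine, the pattern, and the sample host names are assumptions made for illustration. This is not the Received.java listed under Source codes, and real Received headers vary far more than this single pattern allows.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper. It handles only the sendmail-style form
// "from HELO-name (rdns-name [ip]) by host ...".
class ReceivedLine {
    final String helo; // name given in the SMTP HELO command (host1)
    final String rdns; // reverse-DNS name of the connecting IP (host2)
    final String ip;   // IP address of the incoming SMTP connection
    final String by;   // host that added the Received line (host3)

    // Assumed pattern for the first example format only.
    private static final Pattern SENDMAIL = Pattern.compile(
        "from\\s+(\\S+)\\s+\\((\\S+)\\s+\\[([0-9.]+)\\]\\)\\s+by\\s+(\\S+)");

    private ReceivedLine(String helo, String rdns, String ip, String by) {
        this.helo = helo;
        this.rdns = rdns;
        this.ip = ip;
        this.by = by;
    }

    // Returns null when the line does not match the assumed format.
    static ReceivedLine parse(String header) {
        Matcher m = SENDMAIL.matcher(header);
        if (!m.find()) {
            return null;
        }
        return new ReceivedLine(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    public static void main(String[] args) {
        // Sample values are made up (documentation domains and addresses).
        ReceivedLine r = parse(
            "Received: from mail.example.com (relay.example.net [192.0.2.7]) "
            + "by mx.example.org (8.7.5/8.7.3) with SMTP id MAA04298");
        System.out.println(r.by + " <- " + r.ip + " (" + r.rdns + ", HELO " + r.helo + ")");
    }
}
```

The fields are stored in the order of the list above so that a caller can prefer the more reliable ones (by, ip) over the less reliable ones (rdns, helo).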

An important property of the Received field is that Received lines are like links in a chain. The message is passed from one computer to the next with no breaks in the chain (i.e. the "by" host on any line should match the "from" host on the line above it). A break between Received header lines means that the spammer inserted faked headers. Moreover, when an email is sent directly from one host to another, at most two source hosts are involved in the header lines, so if there are more than two source hosts we conclude that an open relay was used.

Based on that idea, I used a program to trace spam origin from the Received header. The program parsed each Received line, selected the host name or IP address, invoked the Unix "host" command, and checked the consistency of hosts from line to line.
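
The consistency check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's program: the class ReceivedChain, the Hop record, and the three result labels are hypothetical, host names are compared as plain strings (the DNS lookups done with the "host" command are left out), and the "more than two source hosts" rule is applied by counting distinct "from" hosts.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the chain check; names and labels are assumptions.
class ReceivedChain {
    // One Received line reduced to its "from" and "by" host names.
    static final class Hop {
        final String from;
        final String by;

        Hop(String from, String by) {
            this.from = from;
            this.by = by;
        }
    }

    // hops are in message order, newest Received line first.
    // Returns the index of the first line whose "by" host differs from the
    // "from" host of the line above it, or -1 if the chain is unbroken.
    static int firstBreak(List<Hop> hops) {
        for (int i = 1; i < hops.size(); i++) {
            if (!hops.get(i).by.equalsIgnoreCase(hops.get(i - 1).from)) {
                return i;
            }
        }
        return -1;
    }

    // Counts distinct "from" hosts, ignoring case.
    static int sourceHosts(List<Hop> hops) {
        Set<String> seen = new HashSet<>();
        for (Hop h : hops) {
            seen.add(h.from.toLowerCase(Locale.ROOT));
        }
        return seen.size();
    }

    // Applies the two rules of thumb from the text. The labels are made up.
    static String classify(List<Hop> hops) {
        if (firstBreak(hops) >= 0) {
            return "FAKED_HEADERS";
        }
        return sourceHosts(hops) > 2 ? "OPEN_RELAY" : "DIRECT";
    }
}
```

A real tracer would also have to resolve names and IP addresses before comparing them, since the same machine can appear under different names on adjacent lines.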

Figure 1 illustrates the fractions of spam sent from a direct source or through an open relay. The fraction of spam from open relays appears to have decreased in recent years. In fact, only 9% of the spam I received at my own email account last month came through an open relay. The spam used to plot this figure came from the Google site: about 1,000 messages for each year from 1997 to 2002 and 4,000 messages for 2003.

Figure 1. Spam from direct source or open relay

Note that "untraceable spam" refers to messages that were perhaps sent directly from the spam host with completely faked Received headers, which made it impossible to trace the source host.

Spam from the US or from other countries

After classifying spam as direct or open relay, I divided the spam in each category into spam from the US and spam from foreign countries. Table 1 shows the fraction of spam sent directly from the US or from other countries; Table 2 shows the fraction of open relays in the US and in other countries.

Year	1997	1998	1999	2000	2001	2002	2003
From the US	85.3%	76.1%	74.7%	75.0%	69.9%	66.4%	77.6%
From foreign countries	14.7%	23.9%	25.3%	25.0%	30.1%	33.6%	22.1%
Popular spam hosts	mindspring.com	mindspring.com	att.net	itctel.com	megabaud.fi	yahoo.com	blarg.net
Popular foreign spam	China	China	Hong Kong	Argentina	Finland	Brazil	Brazil

Table 1. Spam from the US or from foreign countries

Year	1997	1998	1999	2000	2001	2002	2003
From the US	85.3%	82.3%	61.8%	65.7%	73.8%	79.0%	76.5%
From foreign countries	14.7%	17.7%	38.2%	34.3%	26.2%	21.0%	23.5%
Popular open relay	usit.net	interlog.net	std.com	earthlink.net	well.com	bigfoot.com	demon.net
Popular foreign open relay	Australia	Germany	Italy	Japan	China	Denmark	England

Table 2. Open relay from the US or from foreign countries

Note that the US and foreign fractions in this research were computed from spam at Google (i.e. spam distributed from different places). In reality, these fractions could differ for each email account, depending on where its owner lives.

Spam with Faked Information in Header Fields

Spam with faked or real address in the From header

Figure 2 illustrates the fraction of spam using a faked or apparently real sender email address. It confirms what we already knew: the sender address in spam is unreliable.

Figure 2. Spam with faked or real address in the From header

To determine whether a spammer's address is real, I checked the address against the source host that originated the spam (taken from the Received header). If the address matches the host, it is likely a real address; otherwise, I counted it as faked.
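
One way to express that comparison is sketched below. It is only an approximation under my own assumptions: the class SenderCheck is hypothetical, "matches" is taken to mean that the source host equals the address's domain or is a subdomain of it, and the three labels mirror the categories used in this section. The actual rule in Address.java may differ.

```java
import java.util.Locale;

// Hypothetical sketch; the matching rule is an assumption, not the
// author's exact test.
class SenderCheck {
    // Returns the part after the last '@', lower-cased, or "" if absent.
    static String domainOf(String address) {
        int at = address.lastIndexOf('@');
        if (at < 0 || at == address.length() - 1) {
            return "";
        }
        String domain = address.substring(at + 1).trim();
        // Tolerate the "Name <user@domain>" form.
        if (domain.endsWith(">")) {
            domain = domain.substring(0, domain.length() - 1);
        }
        return domain.toLowerCase(Locale.ROOT);
    }

    // sourceHost is the originating host taken from the Received header.
    static String classify(String fromAddress, String sourceHost) {
        if (fromAddress == null || fromAddress.trim().isEmpty()) {
            return "empty";
        }
        String domain = domainOf(fromAddress);
        String host = sourceHost.toLowerCase(Locale.ROOT);
        boolean match = !domain.isEmpty()
            && (host.equals(domain) || host.endsWith("." + domain));
        return match ? "appeared real" : "faked";
    }

    public static void main(String[] args) {
        // Made-up sample values.
        System.out.println(classify("bob@example.com", "mail.example.com"));
        System.out.println(classify("bob@example.com", "dialup.example.net"));
    }
}
```

The label "appeared real" is deliberate: a matching domain only shows that the address is consistent with the sending host, not that the mailbox exists.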

As Figure 2 shows, the fractions of spam with an apparently real or an empty address increased slightly over the years, while the percentage of faked addresses dropped from 94% to 80%. This is probably due to tighter constraints on spam in federal law: spammers are likely more willing to provide accurate information about their spam in order to avoid lawsuits.

In addition, many email providers offer blocking or reporting utilities in client email accounts (i.e. users can block a specific email address, or report spam to a central reporting service). Those utilities often rely on the From header, so leaving the From header empty is one of the tactics spammers use to avoid being blocked or reported by users.

Spam with faked or real address in the Reply-To header

As with the From header, spammers often use a faked reply address in the Reply-To header. The statistics in Table 3 also show that spam with a real reply address increased slightly in the later years, probably for the same reasons mentioned above.

Year	1997	1998	1999	2000	2001	2002	2003
Faked	96.5%	92.9%	84.0%	82.4%	86.8%	67.7%	78.2%
Empty	0.0%	2.1%	7.1%	13.2%	8.3%	17.5%	7.8%
Appeared real	3.5%	5.0%	8.9%	4.4%	4.9%	14.8%	14.0%

Table 3. Spam with faked or real address in the Reply-To header

Note that when an email has no Reply-To header, replies go to the address in the From header. The Return-Path header is added by the mail transport service at the time of final delivery; it records the envelope sender's address and is used mainly for delivery failure notices.

Destination address formats: individual vs. group vs. undisclosed-recipients

Spam from Google is not a reliable basis for investigating destination addresses, because I cannot tell who actually owns a destination email address. I therefore used only the spam in my own account to survey the To header. Most of the time, the destination address is my email address. At other times, the To header contains a list of email addresses that are similar to mine but not identical. It can also hold a faked, invalid, or hidden address (e.g. "undisclosed-recipients" or "@subscribers"), or someone else's address. Finally, in some spam my email address appears not in the To header but in the CC header. Figure 3 shows how destination addresses were used in the spam at my email account.

Figure 3. Destination address formats

Content-based Analysis

There are many content-based spam filtering programs available. They are all similar in that they use signature algorithms to identify individual spam features and then use those features to decide whether a message is spam. Although they can stop some portion of spam, they all suffer from one drawback: a signature-based technique is effective against known spam but unable to detect and prevent new spam.

In this milestone I take a statistical approach to content-based analysis of the spam collection from Google. I want to show that spam has changed a lot in recent years, and that a "spam score" system like the one in SpamAssassin is therefore not the ideal way to fight spam.

I started by scanning the entire content, including header and body, of each spam message in each year. I currently consider only English-alphabet characters and ignore case. Any character not in the English alphabet is treated as a token separator. Of course, other tokenization rules could be applied in the future.

Then I computed the frequency of each word in each year by dividing the count of that word (i.e., the number of spam messages that contain the word, regardless of how many times it appears in each message) by the total number of spam messages I used for that year.
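
The tokenization and counting just described can be sketched as follows. The class WordFrequency and its method names are hypothetical, and this is not the SpamWord.java or Words.java listed under Source codes; it only restates the rule that a word's frequency is the number of messages containing it divided by the total number of messages.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the per-year word frequency computation.
class WordFrequency {
    // Lower-cases the message and splits on every character outside a-z,
    // so digits, punctuation and non-English letters act as separators.
    static Set<String> distinctWords(String message) {
        Set<String> words = new HashSet<>();
        for (String token : message.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // For each word, the fraction of messages that contain it at least once.
    static Map<String, Double> frequencies(List<String> messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String message : messages) {
            // Using the distinct set means repeats inside one message count once.
            for (String word : distinctWords(message)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), e.getValue() / (double) messages.size());
        }
        return result;
    }
}
```

With this definition a word that appears ten times in a single message and nowhere else still gets a frequency of one divided by the number of messages.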

Most interesting spam words

SpamAssassin uses spam words as one of the key features in its scoring rules. However, spam now contains fewer spam words than it did 5 years ago. Instead, spammers are changing spam to resemble legitimate email. In some cases the spam content is mixed into a regular document format. In other cases the spam is a short message containing links to a commercial site. Table 4 lists the frequency of some of the popular spam words that we meet every day in our bulk mail. The results show that the use of those spam words has decreased over the past few years.

Spam words / Year	1997	1998	1999	2000	2001	2002	2003
Financial	0.060	0.108	0.062	0.097	0.093	0.083	0.063
Money	0.366	0.295	0.250	0.307	0.252	0.203	0.179
Dollars	0.250	0.112	0.098	0.115	0.102	0.116	0.077
Business	0.231	0.290	0.250	0.268	0.252	0.206	0.137
Order	0.140	0.227	0.216	0.236	0.187	0.154	0.145
Credit	0.052	0.149	0.119	0.145	0.179	0.150	0.076
Payment	0.023	0.063	0.039	0.066	0.067	0.051	0.039
Legal	0.267	0.098	0.074	0.101	0.079	0.058	0.038
Fees	0.022	0.048	0.032	0.039	0.037	0.022	0.017
Hardcore	0.054	0.045	0.015	0.021	0.027	0.025	0.014

Table 4. Some most popular spam words

To support the claim that spam words are now used less often than in the past, I picked the word "Money" and plotted its frequency of use in Figure 4.

Figure 4. The use of the spam word "Money" is decreasing over the years

Subject header format

For most of us, spam is easily recognizable. Normally, we never have to open an email to know it is spam: by looking at the subject, we have little trouble recognizing it. The subject is therefore a feature that most signature-based spam filters try to use. Spammers, on the other hand, use many different formats for the Subject header. One tactic that spammers often use to deliver their spam is to send the same message many times, changing the subject slightly each time. For example, we might get spam with the subject "GOOD NEWS" on the first day, then "GOOD NEWS YOU CAN USE" on the second day, and finally "GOOD NEWS, The Good News Electronic Journal" on the third day.

Table 5 shows some of the other formats of spam subject header.

Year	1997	1998	1999	2000	2001	2002	2003
Randomizing with extra letters/digits	0%	2%	6%	5%	10%	6%	7%
Using all capitals	30%	20%	17%	15%	12%	9%	6%
Using many !, $, # or other symbols	26%	20%	16%	15%	14%	13%	8%
Not using the alphabet	0%	1.4%	1.6%	1.2%	1.9%	1.8%	4%

Table 5. Spam Subject header format
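
Three of the subject formats in Table 5 lend themselves to simple character tests, sketched below. The class SubjectFormat is hypothetical, and the exact criteria (for instance, how many symbols count as "many") are my assumptions; the text does not state the thresholds used by Subject.java. Detecting randomized extra letters or digits is harder and is not attempted here.

```java
// Hypothetical sketch of three subject-format tests; thresholds are assumed.
class SubjectFormat {
    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // True when the subject has English letters and none is lower case.
    static boolean allCapitals(String subject) {
        boolean hasLetter = false;
        for (char c : subject.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                return false;
            }
            if (c >= 'A' && c <= 'Z') {
                hasLetter = true;
            }
        }
        return hasLetter;
    }

    // True when the subject contains no English-alphabet letter at all.
    static boolean noAlphabet(String subject) {
        for (char c : subject.toCharArray()) {
            if (isLetter(c)) {
                return false;
            }
        }
        return true;
    }

    // Counts printable ASCII characters that are not letters, digits or spaces.
    static int symbolCount(String subject) {
        int count = 0;
        for (char c : subject.toCharArray()) {
            boolean digit = c >= '0' && c <= '9';
            if (c > ' ' && c < 127 && !isLetter(c) && !digit) {
                count++;
            }
        }
        return count;
    }

    // The caller picks the threshold; the source does not give one.
    static boolean manySymbols(String subject, int threshold) {
        return symbolCount(subject) >= threshold;
    }
}
```

A subject can fall into more than one category, which may be why the columns of Table 5 are not expected to add up to 100%.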

Using HTML in spam

As I mentioned earlier, spammers have been increasingly using HTML in spam messages. Content-based spam filtering software considers HTML an important feature for recognizing spam. Figure 5 illustrates the rapid shift from plain text to HTML in spam over the past 7 years.

Figure 5. Spam with plain text or HTML

Table 6 gives more detailed statistics on spam features related to HTML.

Year	1997	1998	1999	2000	2001	2002	2003
Spam using HTML format	14%	19%	21%	28%	42%	75%	66%
Using font color	2%	4%	4%	7%	18%	35%	28%
Containing images	0.2%	0.7%	2%	3%	10%	26%	30%
Having URL links	3%	8%	9%	12%	27%	53%	58%
Having "click" on something	15%	19%	22%	27%	33%	50%	45%
Using HTML table	2%	2.2%	3%	6%	16%	35%	30%
Using input form	3%	10%	11%	12%	16%	14%	9%

Table 6. Spam with HTML format
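
Features like those in Table 6 could be detected with simple case-insensitive patterns, as in the sketch below. The class HtmlFeatures, the feature names, and every pattern are assumptions chosen for illustration (for example, "URL links" is approximated here by an anchor tag with an href attribute). The text does not say how the original programs recognized each feature.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch; each pattern is an assumed approximation.
class HtmlFeatures {
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();

    static {
        rule("html format", "<html[\\s>]|content-type:\\s*text/html");
        rule("font color", "<font\\b[^>]*\\bcolor\\s*=");
        rule("images", "<img\\b");
        rule("url links", "<a\\b[^>]*\\bhref\\s*=");
        rule("click", "\\bclick\\b");
        rule("html table", "<table\\b");
        rule("input form", "<form\\b|<input\\b");
    }

    private static void rule(String name, String regex) {
        RULES.put(name, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
    }

    // Returns the names of all features whose pattern occurs in the message.
    static Set<String> detect(String message) {
        Set<String> found = new LinkedHashSet<>();
        for (Map.Entry<String, Pattern> e : RULES.entrySet()) {
            if (e.getValue().matcher(message).find()) {
                found.add(e.getKey());
            }
        }
        return found;
    }
}
```

Counting, for each year, how many messages return a given feature name and dividing by the number of messages in that year would give percentages of the kind shown in Table 6.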

Spam concerned with privacy policy

Finally, I end this page with statistics on how spammers use the unsubscribe or removal instructions that we often see at the bottom of spam. Offering users a way to be removed or to unsubscribe is a trick that spammers often use to confirm that a user's email address is real. Many people have naively submitted their email address to spammers through these realistic-looking removal systems, and they are likely to receive more spam once the spammers receive their feedback. Table 7 presents the statistics. The fraction of spam with removal instructions rose from 30% in 1997 to 73% in 2001 before falling back to 46% in 2003, while unsubscribe information grew from 0.2% in 1997 to 17% in 2002.

Year	1997	1998	1999	2000	2001	2002	2003
Having removal information	30%	48%	54%	60%	73%	65%	46%
Having unsubscribe information	0.2%	1%	2%	4%	7%	17%	15%
Talking about privacy	1%	3%	5%	6%	5%	5%	5%
Claim you were on the list	32%	29%	24%	24%	30%	29%	22%

Table 7. Spam with removal instruction

Conclusions and Thoughts

The statistics presented in this research cover some of the most important characteristics of spam. Such statistics are important for understanding the nature of spam so that we can find the best way to deal with it. I think compiling a statistical report should be the first step before actually writing any spam filtering software. With this research I also want to show that spam has changed a lot in the past few years in order to get around spam-blocking systems.

Along the way, I had the chance to investigate several spam filtering approaches: content-based techniques (SpamAssassin), distributed notification systems (SpamNet from Cloudmark), blacklist providers (MAPS), and the Bayesian algorithm (suggested by Paul Graham). They are fine techniques and are widely used. However, all of them share the same weakness in fighting unknown spam. I believe that a good anti-spam system will not only find yesterday's spam but will also evolve and help to find tomorrow's spam. We still get spam because we don't have an effective algorithm to recognize spammers' new tactics. In other words, we cannot update our information about spam fast enough, and thus we are always behind the spammers.

Source code

GetSpam.java : Collect spam from Google's group website
Parser.java  : Extract spam header fields and body contents
Received.java: Get spam origin from Received header
Origin.java  : Compute statistic value of spam origin
Address.java : Check if sender or reply address is real or faked 
SpamWord.java: Compute the most common spam words in each year
Words.java   : Compute the used frequency of input word in each year
Subject.java : Subject header analysis

References

RFC 821 • RFC 822 • RFC 2045 • RFC 2046

SpamAssassin.org

Cloudmark SpamNet

Figuring out fake E-Mail & Posts

Paul Graham, A Plan for Spam

David Mertz, Six approaches to eliminating unwanted e-mail

Brandon M. Browning, Getting Rid of Spam

Linh Bui, May 17, 2003
