In this research I conducted a statistical analysis of spam, also known as unsolicited commercial email. The analysis is based mostly on spam characteristics, that is, features extracted from the header fields and body contents. The goal of the research is not merely to identify spam but to gather statistical information about it.
The spam used in this research came from Google Groups, through the "Advanced Groups Search" page at google.com. It can be retrieved manually by following the link: "http://groups.google.com/advanced_group_search?q=group:*&hl=en&lr=&ie=UTF-8&oe=UTF-8&group=*"
Collecting spam from January 1997 to April 2003 at Google Groups
To automate this, I used a small program to connect to Google's site and download about 100 spam messages for each month from January 1997 through December 2002, and 1,000 for each month of 2003. I grouped the messages by year and used them throughout the research.
Statistical report about spam origin
Spam origin information comes mostly from the Received header, the only place in an email that provides reasonably reliable information. Assuming a reasonably standard and recent sendmail setup, a Received header line normally looks like:
Received: from host1 (host2 [ww.xx.yy.zz]) by host3 (8.7.5/8.7.3) with SMTP id MAA04298; Thu, 18 Jul 1996 12:18:06 -0600
or
Received: from ww.xx.yy.zz (HELO host1) (ww.xx.yy.zz) by host3 with SMTP; 22 Apr 2003 06:35:19 -0700 (PDT)
In either case, the Received line shows four pieces of useful information (reading from back to front, in order of decreasing reliability):
- The host that added the Received line (host3)
- The IP address of the incoming SMTP connection (ww.xx.yy.zz)
- The reverse-DNS lookup of that IP address (host2)
- The name the sender used in the SMTP HELO command when they connected (host1)
An important fact about the Received field is that Received lines are like links in a chain. The message is passed from one computer to the next with no breaks in the chain (i.e. the "by" host on any line should match the "from" host on the line above it). A break between Received lines means the spammer has inserted faked headers. Moreover, when an email is sent directly from one host to another, at most two source hosts appear in the header lines, so more than two source hosts indicates that an open relay was used.
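As an illustration, the four fields described above can be pulled out of a Received line with two regular expressions, one per example form. This is only a minimal sketch: the function name `parse_received`, the pattern names, and the returned dictionary keys are assumptions made for this example, and real-world Received lines come in many more variants than these two.

```python
import re

# Form 1: from HELO (RDNS [IP]) by HOST ...
HELO_FIRST = re.compile(
    r"from\s+(?P<helo>\S+)\s+\((?P<rdns>[^\s\[\]]*)\s*\[(?P<ip>[^\]]+)\]\)"
    r"\s+by\s+(?P<by_host>\S+)"
)
# Form 2: from PEER (HELO name) (IP) by HOST ...
HELO_IN_PARENS = re.compile(
    r"from\s+(?P<rdns>\S+)\s+\(HELO\s+(?P<helo>[^)\s]+)\)\s+\((?P<ip>[^)]+)\)"
    r"\s+by\s+(?P<by_host>\S+)"
)


def parse_received(line):
    """Return the four fields of a Received line as a dict, or None."""
    text = line.strip()
    if text.lower().startswith("received:"):
        text = text[len("received:"):].strip()
    for pattern in (HELO_FIRST, HELO_IN_PARENS):
        match = pattern.match(text)
        if match:
            fields = match.groupdict()
            # An empty name, or a "name" that is just the IP, means no reverse DNS.
            if not fields["rdns"] or fields["rdns"] == fields["ip"]:
                fields["rdns"] = None
            return fields
    return None


example = ("Received: from host1 (host2 [192.0.2.7]) by host3 (8.7.5/8.7.3) "
           "with SMTP id MAA04298; Thu, 18 Jul 1996 12:18:06 -0600")
print(parse_received(example))
```

Lines that match neither form return None, so a caller can count them separately instead of guessing at their contents.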
Based on that idea, I used a program to track down spam origin from the Received headers. The program parsed each Received line, selected the host name or IP address, invoked the Unix "host" command, and checked the consistency of hosts from line to line.
Figure 1 illustrates the fractions of spam sent from a direct source and through an open relay. The fraction sent through open relays seems to have decreased in recent years. In fact, only 9% of the spam I received at my own email account last month came through an open relay. The spam used to plot this figure is from the Google site: about 1,000 messages for each year from 1997 to 2002 and 4,000 for 2003.
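The line-to-line consistency check might be sketched as follows. This is not the original program: it omits the "host" lookups, compares host names as plain strings, and applies the "more than two source hosts" rule literally. The function names and the three labels ("forged", "open relay", "direct") are assumptions chosen for illustration.

```python
def chain_is_consistent(hops):
    """hops: list of (from_host, by_host) pairs, top-most Received line first."""
    for upper, lower in zip(hops, hops[1:]):
        # The "from" host of a line should be the "by" host of the line below it.
        if upper[0].lower() != lower[1].lower():
            return False
    return True


def classify_origin(hops):
    """Label a message as 'forged', 'open relay' or 'direct' (hypothetical labels)."""
    if not chain_is_consistent(hops):
        return "forged"
    source_hosts = {from_host.lower() for from_host, _ in hops}
    # Literal reading of the rule: more than two source hosts means a relay.
    return "open relay" if len(source_hosts) > 2 else "direct"


hops = [
    ("relay.example.net", "mx.example.org"),      # top line, added last
    ("sender.example.com", "relay.example.net"),  # bottom line, added first
]
print(classify_origin(hops))
```

Working on (from, by) pairs keeps the check independent of how the Received lines were parsed, so the parsing step can be swapped without touching the classification.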
Spam from the US or from other countries
After classifying spam as direct or open relay, I divided each category into spam from the US and spam from foreign countries. Table 1 shows the fraction of spam sent directly from the US or from other countries; Table 2 shows the fraction of open relays in the US and in other countries.
| | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 |
|---|---|---|---|---|---|---|---|
| From the US | 85.3% | 76.1% | 74.7% | 75.0% | 69.9% | 66.4% | 77.6% |
| From foreign countries | 14.7% | 23.9% | 25.3% | 25.0% | 30.1% | 33.6% | 22.1% |
| Popular spam hosts | mindspring.com | mindspring.com | att.net | itctel.com | megabaud.fi | yahoo.com | blarg.net |
| Popular foreign spam | China | China | Hong Kong | Argentina | Finland | Brazil | Brazil |

Table 1. Spam from the US or from foreign countries
| | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 |
|---|---|---|---|---|---|---|---|
| From the US | 85.3% | 82.3% | 61.8% | 65.7% | 73.8% | 79.0% | 76.5% |
| From foreign countries | 14.7% | 17.7% | 38.2% | 34.3% | 26.2% | 21.0% | 23.5% |
| Popular open relay | usit.net | interlog.net | std.com | earthlink.net | well.com | bigfoot.com | demon.net |
| Popular foreign open relay | Australia | Germany | Italy | Japan | China | Denmark | England |

Table 2. Open relay from the US or from foreign countries
Spam with faked or real address at From header
Figure 2 illustrates the fraction of spam using a faked or an apparently real sender email address. It confirms what we all knew: the sender address supplied by a spammer is unreliable.
Figure 2. Spam with faked or real address at From header
To decide whether a spammer's address is real, I checked the address against the source host that originated the spam (from the Received header). If the address matches the host, it is likely a real address; otherwise, I treated it as faked.
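A minimal sketch of this check is shown below. The text does not say exactly what "match" means, so the rule used here (the source host equals the address's domain or is a subdomain of it) is an assumption, as are the function name and the three labels.

```python
from email.utils import parseaddr


def classify_sender(from_header, source_host):
    """Label a From header as 'empty', 'appears real' or 'faked' (hypothetical labels)."""
    _, address = parseaddr(from_header or "")
    if not address:
        return "empty"
    domain = address.rpartition("@")[2].lower()
    host = source_host.lower().rstrip(".")
    # Assumed matching rule: host is the domain itself or one of its subdomains.
    if domain and (host == domain or host.endswith("." + domain)):
        return "appears real"
    return "faked"


print(classify_sender("Bob <bob@example.com>", "mail.example.com"))
```

A suffix comparison is deliberately loose: it accepts any machine inside the sender's domain, which is why the result is only labeled as appearing real.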
As Figure 2 shows, the fractions of spam with an apparently real or an empty address have increased slightly over the years, while the share of faked addresses dropped from 94% to 80%. This may be due to tighter federal law on spam: spammers are likely more willing to provide accurate information about their spam in order to avoid lawsuits.
In addition, many email providers now offer blocking or reporting utilities in client email accounts (i.e. users can block a specific email address or report spam to a distributed center). Those utilities often rely on the From header, so leaving the From header empty is one of the tactics spammers use to avoid being blocked or reported by users.
Spam with faked or real address at Reply-To header
As with the From header, spammers often use a faked reply address in the Reply-To header. The statistics in Table 3 also show that spam with a real reply address has increased slightly in later years, probably for reasons similar to those mentioned above.
| | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 |
|---|---|---|---|---|---|---|---|
| Faked | 96.5% | 92.9% | 84.0% | 82.4% | 86.8% | 67.7% | 78.2% |
| Empty | 0.0% | 2.1% | 7.1% | 13.2% | 8.3% | 17.5% | 7.8% |
| Appeared real | 3.5% | 5.0% | 8.9% | 4.4% | 4.9% | 14.8% | 14.0% |

Table 3. Spam with faked or real address at Reply-To header
Destination address in format of individual vs. group vs. undisclosed-recipients
Spam from Google is not reliable for investigating destination addresses (i.e. I cannot tell who actually owns the destination email address), so I used only the spam received at my own account to survey the To header. Most of the time, the destination address is my email address. At other times, it is a list of email addresses that are similar to mine but not identical. It can also be a faked, invalid, or hidden address (e.g. "undisclosed-recipients" or "@subscribers") or someone else's address. Finally, in some spam my email address appears not in the destination address but in the CC address. Figure 3 shows how destination addresses were used in the spam at my email account.
Figure 3. Destination address formats
There are many content-based spam filtering programs available. They are all similar in that they use signature algorithms to identify individual spam features and then use those features to determine whether a message is spam. Although they can stop some portion of spam, they all suffer from one drawback: a signature-based technique is effective against known spam but unable to detect and prevent new spam.
In this milestone I take a statistical approach to content-based analysis of the spam collection from Google. I want to show that spam has changed a lot in recent years and that, as a result, a "spam score" system like the one in SpamAssassin is not the ideal way to fight spam.
I started by scanning the entire content, including header and body, of each spam message in each year. I currently consider only English-alphabet characters and ignore case. Any character not in the English alphabet is treated as a token separator. Of course, alternative approaches could be applied in the future.
Then I computed the spam frequency of each word in each year by dividing the count of the word (i.e. the number of spam messages that contain the word, regardless of how many times it appears in each message) by the total number of spam messages I used for that year.
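The tokenization and frequency computation described above can be sketched in a few lines. The function names are hypothetical, and the sample messages are made up for illustration. Each message contributes a word at most once because its tokens are collected in a set.

```python
import re
from collections import Counter


def tokenize(message):
    """Return the set of distinct lowercase words in a message.

    Only English letters form tokens; every other character is a separator.
    """
    return {word.lower() for word in re.findall(r"[A-Za-z]+", message)}


def spam_frequency(messages):
    """Map each word to the fraction of messages that contain it."""
    total = len(messages)
    if total == 0:
        return {}
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))  # a set, so one count per message
    return {word: n / total for word, n in counts.items()}


sample = ["Make MONEY fast, money now!", "Order today", "money-back order"]
freq = spam_frequency(sample)
print(freq["money"], freq["order"])
```

Running the function once per yearly collection gives one frequency per word per year, which is the form of the values reported in Table 4.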
Most interesting spam words
SpamAssassin uses spam words as one of the key features in its scoring rules. However, spam contains fewer and fewer spam words than it did five years ago. Instead, spammers are changing spam to look like legitimate email. In some cases the spam content is mixed into a regular document format. In other cases the spam is a short message with links to a commercial site. Table 4 gives the frequency of some popular spam words that we meet every day in our bulk mail. The results show that the use of these words has declined over the past few years: every word listed was less frequent in 2003 than in 2000.
| Spam words / Year | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 |
|---|---|---|---|---|---|---|---|
| Financial | 0.060 | 0.108 | 0.062 | 0.097 | 0.093 | 0.083 | 0.063 |
| Money | 0.366 | 0.295 | 0.250 | 0.307 | 0.252 | 0.203 | 0.179 |
| Dollars | 0.250 | 0.112 | 0.098 | 0.115 | 0.102 | 0.116 | 0.077 |
| Business | 0.231 | 0.290 | 0.250 | 0.268 | 0.252 | 0.206 | 0.137 |
| Order | 0.140 | 0.227 | 0.216 | 0.236 | 0.187 | 0.154 | 0.145 |
| Credit | 0.052 | 0.149 | 0.119 | 0.145 | 0.179 | 0.150 | 0.076 |
| Payment | 0.023 | 0.063 | 0.039 | 0.066 | 0.067 | 0.051 | 0.039 |
| Legal | 0.267 | 0.098 | 0.074 | 0.101 | 0.079 | 0.058 | 0.038 |
| Fees | 0.022 | 0.048 | 0.032 | 0.039 | 0.037 | 0.022 | 0.017 |
| Hardcore | 0.054 | 0.045 | 0.015 | 0.021 | 0.027 | 0.025 | 0.014 |

Table 4. Some most popular spam words
In support of the claim that spam words are now used less often than in the past, I picked the word "Money" and plotted its frequency of use in Figure 4.
Figure 4. The use of the spam word "Money" is decreasing over the years
Subject header format
For most of us, spam is easily recognizable. Normally, we never have to open an email to know it is spam; by looking at the subject, we have little trouble recognizing it. The email subject is therefore a feature that most signature-based spam filters try to use. Spammers, on the other hand, have many different formats for the Subject. One tactic they often use to deliver spam is to send the same message many times, changing the subject a little each time. For example, we might get spam with the subject "GOOD NEWS" on the first day, then "GOOD NEWS YOU CAN USE" on the second day, and finally "GOOD NEWS, The Good News Electronic Journal" on the third day.
Table 5 shows some of the other formats of the spam Subject header.
| | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 |
|---|---|---|---|---|---|---|---|
| Randomizing with extra letters/digits | 0% | 2% | 6% | 5% | 10% | 6% | 7% |
| Using all capitals | 30% | 20% | 17% | 15% | 12% | 9% | 6% |
| Using many !, $, # or other symbols | 26% | 20% | 16% | 15% | 14% | 13% | 8% |
| Not using alphabet | 0% | 1.4% | 1.6% | 1.2% | 1.9% | 1.8% | 4% |

Table 5. Spam Subject header format
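The four Subject formats in Table 5 could be detected with simple heuristics such as the ones below. The exact rules behind the table are not stated, so everything here is an assumption made for illustration: the function name, the threshold of three symbols, and the idea that "randomizing" shows up as a trailing token mixing letters and digits.

```python
import re

# Hypothetical rule: a trailing token of 4+ characters mixing letters and digits.
MIXED_TOKEN = re.compile(r"(?=.*\d)(?=.*[a-z])[a-z0-9]{4,}")


def subject_features(subject):
    """Return one boolean per Subject format listed in Table 5."""
    letters = [c for c in subject if c.isascii() and c.isalpha()]
    words = subject.split()
    return {
        "random_extra": len(words) > 1
        and bool(MIXED_TOKEN.fullmatch(words[-1].lower())),
        "all_capitals": bool(letters) and all(c.isupper() for c in letters),
        # Assumed threshold: three or more of the symbols !, $ and #.
        "many_symbols": sum(subject.count(s) for s in "!$#") >= 3,
        "no_alphabet": bool(subject.strip()) and not letters,
    }


print(subject_features("GOOD NEWS!!!"))
```

The features are independent flags rather than exclusive categories, so one subject can count toward several rows of the table.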
Using HTML in spam
As I mentioned earlier, spammers have been increasingly using HTML in spam messages. Content-based spam filtering software considers HTML an important feature for recognizing spam. Figure 5 illustrates the rapid shift from plain text to HTML in spam over the past 7 years.
Figure 5. Spam with plain text or HTML
Table 6 gives more detailed statistics on spam features related to HTML.
| | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 |
|---|---|---|---|---|---|---|---|
| Spam using HTML format | 14% | 19% | 21% | 28% | 42% | 75% | 66% |
| Using font color | 2% | 4% | 4% | 7% | 18% | 35% | 28% |
| Containing images | 0.2% | 0.7% | 2% | 3% | 10% | 26% | 30% |
| Having URL links | 3% | 8% | 9% | 12% | 27% | 53% | 58% |
| Having "click" on something | 15% | 19% | 22% | 27% | 33% | 50% | 45% |
| Using HTML table | 2% | 2.2% | 3% | 6% | 16% | 35% | 30% |
| Using input form | 3% | 10% | 11% | 12% | 16% | 14% | 9% |

Table 6. Spam with HTML format
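Features like those in Table 6 can be approximated by searching each message body for a few tag patterns and then taking the fraction of messages that match. The patterns below are crude assumptions for illustration (for example, a URL link is taken to mean an anchor tag with an href attribute), not the rules actually used to produce the table.

```python
import re

# Hypothetical patterns, one per row of Table 6.
HTML_FEATURES = {
    "html": re.compile(r"<\s*(?:html|head|body|font|table|img|form|div)\b", re.I),
    "font_color": re.compile(r"<\s*font\b[^>]*\bcolor\s*=", re.I),
    "image": re.compile(r"<\s*img\b", re.I),
    "url_link": re.compile(r"<\s*a\b[^>]*\bhref\s*=", re.I),
    "click": re.compile(r"\bclick\b", re.I),
    "table": re.compile(r"<\s*table\b", re.I),
    "input_form": re.compile(r"<\s*(?:form|input)\b", re.I),
}


def html_features(body):
    """Return one boolean per feature for a single message body."""
    return {name: bool(p.search(body)) for name, p in HTML_FEATURES.items()}


def feature_rates(bodies):
    """Return, per feature, the fraction of bodies in which it occurs."""
    total = len(bodies)
    if total == 0:
        return {}
    return {
        name: sum(1 for body in bodies if pattern.search(body)) / total
        for name, pattern in HTML_FEATURES.items()
    }


bodies = [
    '<html><body><font color="red">Click <a href="http://example.com">here</a>'
    "</font></body></html>",
    "Click here to order now",
]
print(feature_rates(bodies))
```

Regular expressions are enough for counting purposes; a full HTML parser would be more robust against malformed markup but is not needed to reproduce this kind of tally.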
Spam concerning privacy policy
Finally, I end this page with statistics on how spammers use the unsubscribe or removal instructions that we often see at the bottom of spam. Giving users removal or unsubscribe information is a trick that spammers often use to confirm that a user's email address is real. Many people have been naive enough to submit their email address to spammers through these real-looking removal systems, and they will surely receive more spam once the spammers get their feedback. Table 7 presents the data. The fraction of spam with removal instructions rose from 30% in 1997 to a peak of 73% in 2001 before falling to 46% in 2003, while the fraction with unsubscribe instructions grew from 0.2% in 1997 to 15% in 2003.
| | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 |
|---|---|---|---|---|---|---|---|
| Having removal information | 30% | 48% | 54% | 60% | 73% | 65% | 46% |
| Having unsubscribe information | 0.2% | 1% | 2% | 4% | 7% | 17% | 15% |
| Talking about privacy | 1% | 3% | 5% | 6% | 5% | 5% | 5% |
| Claiming you were on the list | 32% | 29% | 24% | 24% | 30% | 29% | 22% |

Table 7. Spam with removal instruction
The statistics presented in this research are perhaps some of the most important information about spam characteristics. Having them helps us understand the nature of spam so that we can find the best way to deal with it. I think a statistical report should be the first step before actually writing any spam filtering software. Through this research, I also want to show that spam has changed a lot in the past few years in order to get around spam-blocking systems. Along the way, I had the chance to investigate several spam filtering systems: content-based techniques (SpamAssassin), distributed notification systems (SpamNet at Cloudmark), blacklist providers (MAPS), and the Bayesian algorithm (suggested by Paul Graham). They are fine techniques and are widely used in the market. However, all of them share the same weakness in fighting unknown spam. I believe that a good anti-spam system will not only find yesterday's spam but will also evolve and help to find tomorrow's spam. We still get spam because we do not have an effective algorithm for recognizing spammers' new tactics. In other words, we cannot get updated information about spam fast enough, and thus we are always behind the spammers.