The volume of unsolicited bulk e-mail (spam) continues to grow, with no sign of abating. Current spam-detection tools sometimes produce false negatives (spam classified as non-spam) and false positives (non-spam classified as spam), both of which are undesirable. Metrics such as the reputation of the sender and network characteristics can be used to distinguish ham (non-spam) from spam messages more accurately. In this project we analyzed new methods of filtering email messages by gathering statistics about metrics obtained from the email header and body. These statistics are then used to determine which metrics are good predictors of spam, either by themselves or in combination with other metrics. This report describes the implementation details, the results obtained, and the conclusions drawn while analyzing the following metrics: email source and attachments.
Introduction
The Spam Analysis and Reputation Project aimed to analyze various characteristics of the email header and body that can be used to distinguish between spam and ham (non-spam) messages. To analyze these metrics and obtain concrete results, we needed a sufficient number of spam and ham messages.
In the initial approach, we set up a mail server (a honeypot) to attract spam. The idea was to run a series of tests on the messages received by the honeypot to gather information about their source and network characteristics. That information could then be used to assess how effectively these characteristics distinguish spam from ham. To attract spam, we publicized the honeypot's email address at various places on the Internet. However, because the honeypot was very new, it did not attract enough spam.
We therefore switched to a second approach: analyzing email messages already present in existing mailboxes. In this standalone approach, we ran our tests on 22 different mailboxes containing about 25,000 ham messages and about 3,000 spam messages. The following sections discuss these two approaches in detail, covering their implementation, the tests conducted, the results obtained, and the conclusions drawn.
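To make the idea of gathering per-metric statistics concrete, here is a minimal Python sketch of how the two metrics named above (email source and attachments) might be read from already-labelled messages. It is an illustration under stated assumptions, not the project's actual implementation: the function names `source_ip`, `has_attachment` and `tally` are hypothetical, the source is assumed to appear as a bracketed IPv4 address in a Received header, and an attachment is assumed to be any MIME part with an "attachment" disposition.

```python
import re
from collections import Counter
from email import message_from_string
from email.message import Message

# Assumption: relays record the peer address as a bracketed IPv4, e.g. "[192.0.2.7]".
_IP_RE = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def source_ip(msg: Message):
    """Return the IPv4 address from the earliest Received header that has one."""
    received = msg.get_all("Received") or []
    # Each relay prepends its header, so the last header is the earliest hop.
    for header in reversed(received):
        match = _IP_RE.search(header)
        if match:
            return match.group(1)
    return None


def has_attachment(msg: Message) -> bool:
    """True if any MIME part declares an 'attachment' content disposition."""
    return any(part.get_content_disposition() == "attachment"
               for part in msg.walk())


def tally(labelled_messages):
    """Return, per label, the fraction of messages carrying an attachment."""
    totals, with_attachment = Counter(), Counter()
    for label, msg in labelled_messages:
        totals[label] += 1
        if has_attachment(msg):
            with_attachment[label] += 1
    return {label: with_attachment[label] / totals[label] for label in totals}
```

Real mailboxes could be fed to `tally` through the standard library's `mailbox.mbox`, assuming they are stored in mbox format, which this report does not state. Comparing the resulting fractions for the "spam" and "ham" labels gives a first indication of whether a metric separates the two classes.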
Here are the links to the corresponding sections in the report:
1. Server Based Approach