spamarchive
Class SpamArchive

java.lang.Object
  extended by spamarchive.SpamArchive

public class SpamArchive
extends java.lang.Object

SpamArchive is the main class used to parse the e-mail headers of recent messages from the SpamArchive.org website. The data has been preprocessed to remove errors and deviations from the specifications wherever necessary.


Constructor Summary
SpamArchive()
          Creates a new instance of SpamArchive and initializes the member variables.
 
Method Summary
 void checkDSL()
          Checks if the source IP address of the message is statically or dynamically assigned by querying the SORBS DUHL.
 int checkSPFExists()
          Queries the DNS records for the existence of an SPF record belonging to the domain of the earliest Received host.
 int checkSPFMatches()
          Checks if the domain argument's SPF record permits messages to be sent from a host whose address is the ipAddress argument.
 boolean checkWellFormed()
          Checks whether the current message is well formed.
 void clearHeaders()
          Clears relevant message headers before a new message is about to be parsed.
 java.lang.String expandBinary(java.lang.String str)
          Expands the binary str argument to an eight bit binary string.
 java.lang.String expandMacro(java.lang.String inputStr, java.lang.String ipAddress, java.lang.String domain)
          Performs macro expansion on the inputStr argument according to section 8 of RFC 4408.
 java.lang.String expandMacroTerm(java.lang.String macroTerm, java.lang.String ipAddress, java.lang.String domain)
          Expands the macroTerm argument as specified by section 8 of RFC 4408.
 java.util.Date getDate(java.lang.String str)
          Retrieves the timestamp of the Received header passed as its argument.
 java.lang.String getEarliestReceivedHost()
          Gets the first host that adds its Received header to the message.
 java.lang.String getSourceIP(boolean publicIP)
          Gets the IP address of the machine from which the message was sent.
 boolean isCountryCodeTLD(java.lang.String str)
          Determines whether the domain component represented by the str argument is a country code top level domain.
 boolean isGenericTLD(java.lang.String str)
          Determines whether the domain component represented by the str argument is a generic top level domain.
 boolean isPrivateAddress(java.lang.String address)
          Checks whether the address argument is a private address.
static void main(java.lang.String[] args)
          The main entry point into the SpamArchive class.
 java.lang.String normalize(java.lang.String str)
          Removes redundant spaces and other unwanted symbols from the Received headers.
 void printSPFFailure(java.lang.String reason, java.lang.String domain, java.lang.String spfRecord)
          Prints the relevant headers of messages for which SPF verification against the source IP address failed, along with what led to the failure.
 void printSPFSuccess(java.lang.String reason, java.lang.String domain, java.lang.String spfRecord)
          Prints the relevant headers of messages for which SPF verification against the source IP address succeeded, along with what led to the success.
 void printStatistics()
          Prints relevant statistics to standard output and to the file spam_archive_statistics after processing every input data file.
 void processFiles()
          Reads data files from the SpamArchive.org website line by line and extracts values from relevant headers into its member variables.
 java.lang.String reverseMacro(java.lang.String text)
          Reverses the representation of the given text splitting at dot boundaries.
 void sortReceivedHeaders()
          Reverses the order of the Received headers so that they are arranged from earliest to latest.
 int spfLookUp(java.lang.String ipAddress, java.lang.String domain)
          Performs verification of the ipAddress argument against the SPF record of the domain argument.
 boolean testIpMatch(java.lang.String ipAddress, java.lang.String ipAddressRange)
          Checks whether the ipAddress argument falls in the range of the ipAddressRange argument.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SpamArchive

public SpamArchive()
            throws java.lang.Exception
Creates a new instance of SpamArchive and initializes the member variables.

Throws:
java.io.FileNotFoundException - if the file exists but is a directory rather than a regular file, does not exist but cannot be created, or cannot be opened for any other reason
javax.naming.NamingException - if a naming exception is encountered
java.lang.Exception
Method Detail

clearHeaders

public void clearHeaders()
Clears relevant message headers before a new message is about to be parsed.


printStatistics

public void printStatistics()
Prints relevant statistics to standard output and to the file spam_archive_statistics after processing every input data file.


printSPFFailure

public void printSPFFailure(java.lang.String reason,
                            java.lang.String domain,
                            java.lang.String spfRecord)
Prints the relevant headers of messages for which SPF verification against the source IP address failed, along with what led to the failure.

Parameters:
reason - the reason that led to SPF verification failure.
domain - the domain currently being looked up.
spfRecord - the SPF Record of the current domain.

printSPFSuccess

public void printSPFSuccess(java.lang.String reason,
                            java.lang.String domain,
                            java.lang.String spfRecord)
Prints the relevant headers of messages for which SPF verification against the source IP address succeeded, along with what led to the success.

Parameters:
reason - the reason that led to SPF verification success.
domain - the domain currently being looked up.
spfRecord - the SPF Record of the current domain.

getDate

public java.util.Date getDate(java.lang.String str)
                       throws java.lang.Exception
Retrieves the timestamp of the Received header passed as its argument.

Parameters:
str - the Received header.
Returns:
the Date object corresponding to the timestamp present in str in the format "d MMM yyyy HH:mm:ss Z";
null if the timestamp in the Received header doesn't match the regex pattern.
Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
java.lang.IllegalArgumentException - if the pattern describing the date and time format is invalid.
java.lang.IllegalStateException - if no match has yet been attempted or the previous match operation has failed.
java.lang.Exception
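
As an illustration of the documented format "d MMM yyyy HH:mm:ss Z", the following is a minimal sketch rather than the class's actual implementation. The class name ReceivedDateSketch, the method name parse, and the regular expression are assumptions made for this example; the sketch also returns null on a parse failure, which the original may not do.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReceivedDateSketch {
    // Assumed pattern: a timestamp such as "12 Mar 2007 08:15:30 -0500".
    private static final Pattern TIMESTAMP = Pattern.compile(
        "(\\d{1,2} [A-Za-z]{3} \\d{4} \\d{2}:\\d{2}:\\d{2} [+-]\\d{4})");

    /** Returns the parsed timestamp, or null if none can be extracted. */
    static Date parse(String receivedHeader) {
        Matcher m = TIMESTAMP.matcher(receivedHeader);
        if (!m.find()) {
            return null; // mirrors the documented "null if no match" result
        }
        // Locale.ENGLISH so that month names like "Mar" parse on any system.
        SimpleDateFormat format =
            new SimpleDateFormat("d MMM yyyy HH:mm:ss Z", Locale.ENGLISH);
        try {
            return format.parse(m.group(1));
        } catch (ParseException e) {
            return null; // e.g. an unknown month abbreviation
        }
    }

    public static void main(String[] args) {
        String header = "from a.example by mx.example.net; Mon, 12 Mar 2007 08:15:30 -0500";
        System.out.println(parse(header));
    }
}
```

The pattern letter Z accepts RFC 822 style zone offsets such as -0500, so two headers that name the same instant in different zones yield equal Date objects.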

getEarliestReceivedHost

public java.lang.String getEarliestReceivedHost()
                                         throws java.lang.Exception
Gets the first host that adds its Received header to the message.

Returns:
the first host that adds its Received header to this message.
"by missing" if the earliest Received header doesn't contain "by";
"received missing" if no Received headers exist.
Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
java.lang.Exception
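
The three documented results can be sketched as follows. This is an illustrative approximation, not the class's code: the class name EarliestHostSketch, the method signature, and the regular expression are assumptions, and the list is assumed to be ordered from earliest to latest already.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EarliestHostSketch {
    // Assumed pattern: the token following the word "by" names the receiving host.
    private static final Pattern BY = Pattern.compile("\\bby\\s+(\\S+)");

    static String earliestHost(List<String> headersEarliestFirst) {
        if (headersEarliestFirst.isEmpty()) {
            return "received missing";
        }
        Matcher m = BY.matcher(headersEarliestFirst.get(0));
        return m.find() ? m.group(1) : "by missing";
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList(
            "from a.example ([192.0.2.1]) by mx.example.net with SMTP",
            "from mx.example.net by inbox.example.org with ESMTP");
        System.out.println(earliestHost(headers));
    }
}
```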

getSourceIP

public java.lang.String getSourceIP(boolean publicIP)
                             throws java.lang.Exception
Gets the IP address of the machine from which the message was sent.

Parameters:
publicIP - flag indicating whether a public IP address is strictly necessary.
Returns:
IP address of the machine from which the message originated;
"received missing" if no Received headers are found or none contain the source IP address.
Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
java.lang.IllegalStateException - if no match has yet been attempted or the previous match operation failed.
java.lang.IndexOutOfBoundsException - if there is no capturing group in the pattern with the given index.
java.lang.Exception

isPrivateAddress

public boolean isPrivateAddress(java.lang.String address)
                         throws java.lang.Exception
Checks whether the address argument is a private address.

Parameters:
address - the IP address that needs to be checked.
Returns:
true if address argument is a private address;
false otherwise.
Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
java.lang.NumberFormatException - if the string doesn't contain a parseable integer.
java.lang.Exception
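
The page does not say which ranges count as private. The sketch below assumes the three RFC 1918 IPv4 blocks (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16); the class and method names are hypothetical.

```java
final class PrivateAddressSketch {
    /** Returns true if the dotted-quad address lies in an RFC 1918 block (assumed definition). */
    static boolean isPrivate(String address) {
        String[] parts = address.split("\\.");
        if (parts.length != 4) {
            return false; // not a dotted-quad IPv4 address
        }
        // Integer.parseInt throws NumberFormatException on non-numeric octets,
        // in line with the exception documented for isPrivateAddress.
        int first = Integer.parseInt(parts[0]);
        int second = Integer.parseInt(parts[1]);
        return first == 10
            || (first == 172 && second >= 16 && second <= 31)
            || (first == 192 && second == 168);
    }

    public static void main(String[] args) {
        System.out.println(isPrivate("172.20.5.9"));
        System.out.println(isPrivate("198.51.100.7"));
    }
}
```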

checkWellFormed

public boolean checkWellFormed()
                        throws java.lang.Exception
Checks whether the current message is well formed. The checks performed include:

Returns:
true if none of the above conditions are true;
false otherwise.
Throws:
java.lang.Exception

normalize

public java.lang.String normalize(java.lang.String str)
Removes redundant spaces and other unwanted symbols from the Received headers.

Parameters:
str - the Received header to be normalized.
Returns:
the normalized Received header.
Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
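
Which "unwanted symbols" are removed is not stated here. As a partial sketch, the example below only collapses runs of whitespace, including the line breaks and tabs left by header folding, into single spaces; the class name NormalizeSketch is hypothetical.

```java
final class NormalizeSketch {
    /** Collapses every whitespace run to one space and trims both ends. */
    static String normalize(String receivedHeader) {
        return receivedHeader.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("from  a.example\r\n\tby   b.example "));
    }
}
```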

sortReceivedHeaders

public void sortReceivedHeaders()
                         throws java.lang.Exception
Reverses the order of the Received headers so that they are arranged from earliest to latest.

Throws:
java.lang.Exception
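
Each relay prepends its Received header, so a message read from top to bottom lists the latest hop first. A minimal sketch of the reordering, assuming the headers are held in a list (the class name ReceivedOrderSketch and the list representation are assumptions):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ReceivedOrderSketch {
    /** Returns a copy of the headers ordered from earliest hop to latest hop. */
    static List<String> earliestFirst(List<String> headersAsRead) {
        List<String> copy = new ArrayList<>(headersAsRead);
        Collections.reverse(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> asRead = Arrays.asList("hop 3 (latest)", "hop 2", "hop 1 (earliest)");
        System.out.println(earliestFirst(asRead));
    }
}
```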

isGenericTLD

public boolean isGenericTLD(java.lang.String str)
Determines whether the domain component represented by the str argument is a generic top level domain.

Parameters:
str - the domain component to be checked.
Returns:
true if str is a generic TLD;
false otherwise.

isCountryCodeTLD

public boolean isCountryCodeTLD(java.lang.String str)
Determines whether the domain component represented by the str argument is a country code top level domain.

Parameters:
str - the domain component to be checked.
Returns:
true if str is a country code TLD;
false otherwise.

checkSPFExists

public int checkSPFExists()
                   throws java.lang.Exception
Queries the DNS records for the existence of an SPF record belonging to the domain of the earliest Received host.

Returns:
-6 if the ServiceUnavailableException is encountered;
-5 if the InvalidNameException is encountered;
-4 if the NameNotFoundException is encountered;
-3 if the CommunicationException is encountered;
-2 if the earliest Received host has an invalid top level domain;
-1 if the domain hasn't been encountered previously and lacks an SPF record in its DNS entry;
0 if the domain has been encountered previously and lacks an SPF record in its DNS entry;
1 if the domain hasn't been encountered previously and has an SPF record in its DNS entry;
2 if the domain has been encountered previously and has an SPF record in its DNS entry.
Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
javax.naming.CommunicationException - if the client cannot communicate with the server and a timeout occurs.
javax.naming.NameNotFoundException - if a component of the name cannot be resolved because it is not bound.
javax.naming.InvalidNameException - if the name doesn't conform to the naming syntax of the naming system.
javax.naming.ServiceUnavailableException - if the directory or naming service is unavailable.
java.lang.Exception
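
The DNS query itself is not shown here. The sketch below covers only the step of deciding whether a TXT record string is an SPF record, assuming the RFC 4408 rule that a record starts with the version term "v=spf1" followed by a space or the end of the record. The class name SpfRecordSketch and the quote stripping are assumptions for this example.

```java
import java.util.Locale;

final class SpfRecordSketch {
    /** Returns true if the TXT record text begins with the SPF version term. */
    static boolean isSpfRecord(String txt) {
        String s = txt.trim();
        // Some resolvers return TXT data wrapped in double quotes (assumption).
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1);
        }
        s = s.toLowerCase(Locale.ROOT);
        return s.equals("v=spf1") || s.startsWith("v=spf1 ");
    }

    public static void main(String[] args) {
        System.out.println(isSpfRecord("v=spf1 mx -all"));
        System.out.println(isSpfRecord("v=spf10 mx"));
    }
}
```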

expandBinary

public java.lang.String expandBinary(java.lang.String str)
Expands the binary str argument to an eight bit binary string.

Parameters:
str - the binary string that is to be expanded.
Returns:
the expanded eight bit binary string.
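
A minimal sketch of the expansion, assuming that "expanding" means left-padding with zeros (for example, turning the output of Integer.toBinaryString into a full octet). The class name ExpandBinarySketch is hypothetical.

```java
final class ExpandBinarySketch {
    /** Left-pads a binary string with '0' until it is eight characters long. */
    static String expand(String bits) {
        StringBuilder sb = new StringBuilder();
        for (int i = bits.length(); i < 8; i++) {
            sb.append('0');
        }
        return sb.append(bits).toString();
    }

    public static void main(String[] args) {
        // Integer.toBinaryString(10) is "1010", which expands to "00001010".
        System.out.println(expand(Integer.toBinaryString(10)));
    }
}
```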

testIpMatch

public boolean testIpMatch(java.lang.String ipAddress,
                           java.lang.String ipAddressRange)
                    throws java.lang.Exception
Checks whether the ipAddress argument falls in the range of the ipAddressRange argument.

Parameters:
ipAddress - the IP address to be tested.
ipAddressRange - the range of IP addresses that the ipAddress argument is to be checked against. The range can be either an IP address or an IP address with a network prefix (CIDR notation).
Returns:
true if the ipAddress argument falls in the ipAddressRange argument's range;
false otherwise.
Throws:
java.lang.NumberFormatException - if the string doesn't contain a parseable integer.
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
java.lang.Exception
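
The range test can be sketched for IPv4 as a mask comparison: convert both addresses to 32-bit values and compare only the leading prefix bits. This is an illustration under assumed names (IpMatchSketch, toLong, matches), not the class's implementation; a range without a prefix is treated as /32.

```java
final class IpMatchSketch {
    /** Converts a dotted-quad IPv4 address to its 32-bit value held in a long. */
    static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            value = (value << 8) | Integer.parseInt(part);
        }
        return value;
    }

    /** Returns true if ipAddress lies inside range ("a.b.c.d" or "a.b.c.d/n"). */
    static boolean matches(String ipAddress, String range) {
        int slash = range.indexOf('/');
        String network = slash < 0 ? range : range.substring(0, slash);
        int prefix = slash < 0 ? 32 : Integer.parseInt(range.substring(slash + 1));
        // Build a mask with the top `prefix` bits set, e.g. /24 -> 0xFFFFFF00.
        long mask = prefix == 0 ? 0L : (0xFFFFFFFFL << (32 - prefix)) & 0xFFFFFFFFL;
        return (toLong(ipAddress) & mask) == (toLong(network) & mask);
    }

    public static void main(String[] args) {
        System.out.println(matches("192.168.1.77", "192.168.1.0/24"));
        System.out.println(matches("192.168.2.77", "192.168.1.0/24"));
    }
}
```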

reverseMacro

public java.lang.String reverseMacro(java.lang.String text)
Reverses the representation of the given text splitting at dot boundaries.

For example, if the text argument is "aw.bx.cy.dz", the text returned is "dz.cy.bx.aw".

Parameters:
text - the text to be reversed.
Returns:
the reversed text.
Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
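
The example above can be reproduced with a short sketch that splits at dots and joins the parts in reverse order. The class name ReverseMacroSketch is hypothetical.

```java
final class ReverseMacroSketch {
    /** Reverses the dot-separated parts of text, e.g. an IPv4 address or a domain. */
    static String reverse(String text) {
        String[] parts = text.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) {
                sb.append('.');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("aw.bx.cy.dz")); // prints "dz.cy.bx.aw"
    }
}
```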

expandMacroTerm

public java.lang.String expandMacroTerm(java.lang.String macroTerm,
                                        java.lang.String ipAddress,
                                        java.lang.String domain)
                                 throws java.lang.Exception
Expands the macroTerm argument as specified by section 8 of RFC 4408.

The following macro letters are expanded in term arguments:

Parameters:
macroTerm - the macro term to be expanded.
ipAddress - the IP address of the host where the current message originated.
domain - the domain part of the Return-Path header.
Returns:
the expanded macro term if expansion succeeded;
"invalid headers" if the Return-Path header contains a syntax error;
"null" if PTR query doesn't succeed.
Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
java.lang.IllegalStateException - if no match has yet been attempted or the previous match operation failed.
javax.naming.CommunicationException - if the client is unable to communicate with the directory or naming service.
javax.naming.NameNotFoundException - if a component of the name cannot be resolved, because it is not bound.
java.lang.Exception

expandMacro

public java.lang.String expandMacro(java.lang.String inputStr,
                                    java.lang.String ipAddress,
                                    java.lang.String domain)
                             throws java.lang.Exception
Performs macro expansion on the inputStr argument according to section 8 of RFC 4408.

Parameters:
inputStr - the text to be macro-expanded.
ipAddress - the IP address of the host where the current message originated.
domain - the domain part of the Return-Path header.
Returns:
the macro-expanded inputStr if successful;
"invalid headers" if the Return-Path header contains a syntax error;
"null" if the PTR query doesn't succeed.
Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
java.lang.Exception
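
To make the expansion concrete, the sketch below handles a small subset of RFC 4408 section 8: the macro letters i (IP address) and d (domain), the optional r (reverse) transformer, and the literals %%, %_ and %-. Other letters, digit transformers, custom delimiters and the upper-case (URL-escaping) forms are deliberately omitted. The class name MacroExpandSketch and its structure are assumptions, not the class's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MacroExpandSketch {
    // Group 1: macro letter, group 2: optional "r", group 3: literal escape.
    private static final Pattern MACRO =
        Pattern.compile("%(?:\\{([id])(r?)\\}|([%_-]))");

    static String reverse(String text) {
        String[] parts = text.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    static String expand(String input, String ipAddress, String domain) {
        Matcher m = MACRO.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            if (m.group(3) != null) {
                switch (m.group(3)) {
                    case "%": replacement = "%"; break;   // "%%" is a literal percent
                    case "_": replacement = " "; break;   // "%_" is a space
                    default:  replacement = "%20"; break; // "%-" is a URL-encoded space
                }
            } else {
                String value = m.group(1).equals("i") ? ipAddress : domain;
                replacement = m.group(2).isEmpty() ? value : reverse(value);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("%{ir}._ip.%{d}", "192.0.2.3", "example.com"));
    }
}
```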

spfLookUp

public int spfLookUp(java.lang.String ipAddress,
                     java.lang.String domain)
              throws java.lang.Exception
Performs verification of the ipAddress argument against the SPF record of the domain argument.

Parameters:
ipAddress - the IP address of the host where the current message originated.
domain - the domain part of the Return-Path header.
Returns:
-6 if infinite recursion is avoided due to the SPF record of the domain argument containing itself as an included domain;
-5 if the CommunicationException is encountered;
-4 if the NameNotFoundException is encountered;
-3 if the InvalidNameException is encountered;
-2 if the ServiceUnavailableException is encountered;
-1 if the domain argument doesn't have an SPF record in its DNS entry;
0 if the SPF record for the domain argument doesn't permit messages from the ipAddress argument;
1/10 if the SPF record for the domain argument permits messages from the ipAddress argument.
Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
javax.naming.CommunicationException - if the client cannot communicate with the server and a timeout occurs.
javax.naming.NameNotFoundException - if a component of the name cannot be resolved because it is not bound.
javax.naming.InvalidNameException - if the name doesn't conform to the naming syntax of the naming system.
javax.naming.ServiceUnavailableException - if the directory or naming service is unavailable.
java.lang.Exception

checkSPFMatches

public int checkSPFMatches()
                    throws java.lang.Exception
Checks if the domain argument's SPF record permits messages to be sent from a host whose address is the ipAddress argument.

Returns:
-5 if getSourceIP returns "NULL";
-4 if Received headers are missing;
-3 if SPF verification test couldn't be performed;
-2 if the From header's syntax is invalid;
-1 if the Return-Path header's syntax is invalid;
0 if the SPF record for the domain argument doesn't permit messages from the ipAddress argument;
1 if the SPF record for the domain argument permits messages from the ipAddress argument.
Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
java.lang.IllegalStateException - if no match has yet been attempted or the previous match operation failed.
java.lang.Exception

checkDSL

public void checkDSL()
              throws java.lang.Exception
Checks if the source IP address of the message is statically or dynamically assigned by querying the SORBS DUHL. After performing this check, it increments the relevant counters.

It also outputs the source IP addresses to the file spam_archive_ping which is later used as input to fping to determine which of the hosts are reachable via ping.

Throws:
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
javax.naming.CommunicationException - if the client cannot communicate with the server and a timeout occurs.
javax.naming.NameNotFoundException - if a component of the name cannot be resolved because it is not bound.
javax.naming.InvalidNameException - if the name doesn't conform to the naming syntax of the naming system.
javax.naming.ServiceUnavailableException - if the directory or naming service is unavailable.
java.lang.Exception
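
DNS blocklists are conventionally queried by reversing the octets of the IPv4 address and appending the list's zone. The sketch below builds only that query name; the zone dul.dnsbl.sorbs.net is an assumption about the SORBS DUHL, and the class name DnsblQuerySketch is hypothetical. The DNS lookup and the counters mentioned above are not shown.

```java
final class DnsblQuerySketch {
    // Assumed zone name for the SORBS DUHL; not confirmed by this page.
    static final String ZONE = "dul.dnsbl.sorbs.net";

    /** Builds the DNSBL query name for an IPv4 address, e.g. d.c.b.a.<zone>. */
    static String queryName(String ipAddress) {
        String[] octets = ipAddress.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipAddress);
        }
        return octets[3] + "." + octets[2] + "." + octets[1] + "." + octets[0] + "." + ZONE;
    }

    public static void main(String[] args) {
        System.out.println(queryName("192.0.2.99"));
    }
}
```

A successful A record lookup on the resulting name conventionally means the address is listed, while a name-not-found answer means it is not.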

processFiles

public void processFiles()
                  throws java.lang.Exception
Reads data files from the SpamArchive.org website line by line and extracts values from relevant headers into its member variables. After reaching the end of a message, the following tests are performed: After performing the above tests and updating member variables for all messages in the file, printStatistics() is called. The process then repeats for all the files in the input data set.

Throws:
java.lang.SecurityException - if a security manager exists and its SecurityManager.checkRead(String) method denies read access to the directory.
java.io.FileNotFoundException - if the file does not exist or is a directory rather than a regular file.
java.io.IOException - if an I/O error occurs.
java.util.regex.PatternSyntaxException - if the regular expression's syntax is invalid.
java.lang.Exception

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
The main entry point into the SpamArchive class. Creates an instance of this class and performs the following tasks:

Parameters:
args - the command line arguments
Throws:
java.lang.Exception