This module is implemented in Java, in the file FriendAnalyze.java.
The main program MailStats.java passes javax.mail.Message arrays containing the spam, non-spam and sent messages to this module.
The program first populates the friend list of the user whose account is being tested by calling the getAllRecipients() method on every sent message.
The friend list is implemented as a hash set so that lookups are fast.
The senders of the non-spam messages are then obtained using the getFrom() method.
Each sender is classified as a friend or a non friend by checking whether the sender's address appears in the user's friend list.
The numbers of friends and non friends among the non-spam messages are then counted.
Similarly, the numbers of friends and non friends among the spam messages are counted using the array containing the spam messages.
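The steps above can be sketched roughly as follows. This is a minimal illustration and not the code of FriendAnalyze.java: addresses are modeled as plain strings instead of the values returned by getAllRecipients() and getFrom(), the class and method names are hypothetical, and the case-insensitive address comparison is an assumption that the report does not state.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; all names here are hypothetical.
final class FriendListSketch {

    // Assumption: addresses are compared after trimming and lower-casing.
    static String normalize(String address) {
        return address.trim().toLowerCase(Locale.ROOT);
    }

    // Build the friend list from the recipients of every sent message.
    static Set<String> buildFriendList(List<List<String>> recipientsPerSentMessage) {
        Set<String> friends = new HashSet<>();
        for (List<String> recipients : recipientsPerSentMessage) {
            for (String recipient : recipients) {
                friends.add(normalize(recipient));
            }
        }
        return friends;
    }

    // Count senders that are in the friend list; returns {friends, nonFriends}.
    static int[] countFriends(Set<String> friends, List<String> senders) {
        int fromFriends = 0;
        for (String sender : senders) {
            if (friends.contains(normalize(sender))) {
                fromFriends++;
            }
        }
        return new int[] { fromFriends, senders.size() - fromFriends };
    }

    public static void main(String[] args) {
        // Two sent messages; the second repeats a recipient with different casing.
        Set<String> friends = buildFriendList(Arrays.asList(
            Arrays.asList("alice@example.com", "bob@example.com"),
            Arrays.asList("Bob@Example.com")));

        int[] nonSpam = countFriends(friends,
            Arrays.asList("alice@example.com", "carol@example.com"));
        int[] spam = countFriends(friends, Arrays.asList("offers@example.net"));

        System.out.println("non-spam: " + nonSpam[0] + " friends, " + nonSpam[1] + " non friends");
        System.out.println("spam: " + spam[0] + " friends, " + spam[1] + " non friends");
    }
}
```

The same counting method is applied to the non-spam and the spam senders, which mirrors how the two sets of counts are described above.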
The following results (Table 2.1.1 (a)) show the fraction of non-spam and spam messages that are from friends for each mailbox:
MailBox # | # of Mails | # of Non-Spam Mails | # of Non-Spam from Friends | (%) Non-Spam from Friends | # of Spam Mails | # of Spam from Friends | (%) Spam from Friends |
1 | 1818 | 1818 | 1105 | 60.78 | 0 | 0 | No Spam |
2 | 593 | 497 | 95 | 19.11 | 96 | 0 | 0.00 |
3 | 1174 | 1174 | 725 | 61.75 | 0 | 0 | No Spam |
4 | 641 | 576 | 450 | 78.13 | 65 | 3 | 4.62 |
5 | 5105 | 5002 | 1412 | 28.23 | 103 | 0 | 0.00 |
6 | 1682 | 1418 | 153 | 10.79 | 264 | 9 | 3.41 |
7 | 1230 | 1230 | 487 | 39.59 | 0 | 0 | No Spam |
8 | 1992 | 1788 | 880 | 49.22 | 204 | 8 | 3.92 |
9 | 360 | 133 | 84 | 63.16 | 227 | 0 | 0.00 |
10 | 879 | 524 | 10 | 1.91 | 355 | 0 | 0.00 |
11 | 168 | 168 | 91 | 54.17 | 0 | 0 | No Spam |
12 | 1322 | 1301 | 828 | 63.64 | 21 | 1 | 4.76 |
13 | 1408 | 1360 | 183 | 13.46 | 48 | 7 | 14.58 |
14 | 934 | 934 | 578 | 61.88 | 0 | 0 | No Spam |
15 | 459 | 414 | 144 | 34.78 | 45 | 0 | 0.00 |
16 | 2183 | 1999 | 1164 | 58.23 | 184 | 3 | 1.63 |
17 | 527 | 527 | 339 | 64.33 | 0 | 0 | No Spam |
18 | 380 | 380 | 308 | 81.05 | 0 | 0 | No Spam |
19 | 749 | 749 | 553 | 73.83 | 0 | 0 | No Spam |
20 | 140 | 140 | 25 | 17.86 | 0 | 0 | No Spam |
21 | 1522 | 1151 | 891 | 77.41 | 371 | 4 | 1.08 |
22 | 3316 | 2370 | 1647 | 69.49 | 946 | 5 | 0.53 |
Total | 28582 | 25653 | 12152 | 47.37 | 2929 | 40 | 1.37 |
The following scatter plot shows the fraction of non-spam and spam messages that are from friends for each mailbox:
From the above results we observe that about 47.37% of all non-spam messages are from friends, while only 1.37% of all spam messages are from friends. To measure the effectiveness of this test, we now calculate the ratio of the number of non-spam messages from friends (B) to the total number of messages from friends (A).
The results obtained are as follows:
MailBox # | # of Mails | A: # of Mails from Friends | B: # of non-spam from Friends | B/A |
1 | 1818 | 1105 | 1105 | 1.00 |
2 | 593 | 95 | 95 | 1.00 |
3 | 1174 | 725 | 725 | 1.00 |
4 | 641 | 453 | 450 | 0.99 |
5 | 5105 | 1412 | 1412 | 1.00 |
6 | 1682 | 162 | 153 | 0.94 |
7 | 1230 | 487 | 487 | 1.00 |
8 | 1992 | 888 | 880 | 0.99 |
9 | 360 | 84 | 84 | 1.00 |
10 | 879 | 10 | 10 | 1.00 |
11 | 168 | 91 | 91 | 1.00 |
12 | 1322 | 829 | 828 | 1.00 |
13 | 1408 | 190 | 183 | 0.96 |
14 | 934 | 578 | 578 | 1.00 |
15 | 459 | 144 | 144 | 1.00 |
16 | 2183 | 1167 | 1164 | 1.00 |
17 | 527 | 339 | 339 | 1.00 |
18 | 380 | 308 | 308 | 1.00 |
19 | 749 | 553 | 553 | 1.00 |
20 | 140 | 25 | 25 | 1.00 |
21 | 1522 | 895 | 891 | 1.00 |
22 | 3316 | 1652 | 1647 | 1.00 |
Total | 28582 | 12192 | 12152 | 1.00 |
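The B/A column can be reproduced from the counts in Table 2.1.1 (a), since in every row A equals the non-spam from friends plus the spam from friends (for example, 162 = 153 + 9 for mailbox 6). The following is a minimal sketch of that arithmetic; the class and method names are hypothetical, and returning "N/A" when A is zero is a defensive assumption and not something this table requires.

```java
import java.util.Locale;

// Illustrative sketch only; names are hypothetical.
final class FriendRatioSketch {

    // B/A, where B = non-spam from friends and A = B + spam from friends.
    // Returns "N/A" when the mailbox has no mail from friends (A = 0).
    static String ratio(int nonSpamFromFriends, int spamFromFriends) {
        int totalFromFriends = nonSpamFromFriends + spamFromFriends;
        if (totalFromFriends == 0) {
            return "N/A";
        }
        double value = (double) nonSpamFromFriends / totalFromFriends;
        // Locale.ROOT keeps the decimal separator a dot on every system.
        return String.format(Locale.ROOT, "%.2f", value);
    }

    public static void main(String[] args) {
        System.out.println("Mailbox 6:  " + ratio(153, 9));
        System.out.println("Mailbox 13: " + ratio(183, 7));
        System.out.println("Total:      " + ratio(12152, 40));
    }
}
```

Because the values are rounded to two decimals, a ratio such as 12152/12192 is reported as 1.00 even though a small number of spam messages do come from friends.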
For a feature to be useful, it must appear either only in spam or almost exclusively in non-spam. The asymmetry arises because it is acceptable for a few spam messages to carry a non-spam feature and slip through, whereas classifying non-spam as spam is not acceptable.
We observe that almost all mail from friends is non-spam: B/A is 1.00 overall and at least 0.94 for every mailbox. Thus, this test is very useful for identifying non-spam messages. Because the test takes all sent messages into consideration, it can be run on existing mailboxes to identify the mail from friends.
This module is implemented in Java, in the file FriendTillDateAnalyze.java.
The main program MailStats.java passes javax.mail.Message arrays containing the spam, non-spam and sent messages to this module.
The date on which each sent message was sent is obtained using the getSentDate() method.
A hashtable stores the earliest date on which a mail was sent to each address; the address is the key and the earliest date is the value.
Each sender of a non-spam message is then looked up in the hashtable. If there is no entry for that sender, the sender is classified as a non friend.
If there is an entry for that sender, the corresponding earliest sent date is compared with the received date of the message being tested.
If the earliest sent date is before the received date of the message, the sender is categorized as a friend, and otherwise as a non friend. Let us call them 'friends till date' and 'non friends till date' respectively.
The numbers of friends till date and non friends till date among the non-spam messages are then counted.
Similarly, the numbers of friends till date and non friends till date among the spam messages are counted using the array containing the spam messages.
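The earliest-date lookup described above might look roughly like the sketch below. It is not the code of FriendTillDateAnalyze.java: addresses are plain strings, dates are passed in directly instead of being read with getSentDate(), the class and method names are hypothetical, and a java.util.HashMap stands in for the hashtable (a java.util.Hashtable would behave the same way here).

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical.
final class FriendTillDateSketch {

    // Key: recipient address. Value: earliest date the user sent mail to it.
    private final Map<String, Date> earliestSent = new HashMap<>();

    // Record one recipient of one sent message, keeping only the earliest date.
    void recordSent(String recipient, Date sentDate) {
        Date known = earliestSent.get(recipient);
        if (known == null || sentDate.before(known)) {
            earliestSent.put(recipient, sentDate);
        }
    }

    // A sender is a 'friend till date' only if the user had already sent mail
    // to that address before the tested message was received.
    boolean isFriendTillDate(String sender, Date receivedDate) {
        Date first = earliestSent.get(sender);
        return first != null && first.before(receivedDate);
    }

    public static void main(String[] args) {
        FriendTillDateSketch sketch = new FriendTillDateSketch();
        // Dates are given as milliseconds since the epoch to keep the example short.
        sketch.recordSent("bob@example.com", new Date(2000L));
        sketch.recordSent("bob@example.com", new Date(1000L));

        System.out.println(sketch.isFriendTillDate("bob@example.com", new Date(1500L)));
        System.out.println(sketch.isFriendTillDate("bob@example.com", new Date(500L)));
        System.out.println(sketch.isFriendTillDate("carol@example.com", new Date(5000L)));
    }
}
```

Storing only the earliest sent date per address keeps the table small and makes each classification a single lookup followed by one date comparison.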
The following results (Table 2.1.2 (a)) show the fraction of non-spam and spam messages that are from friends till date for each mailbox:
MailBox # | # of Mails | # of Non-Spam Mails | # of Non-Spam from Friends till date | (%) Non-Spam from Friends till date | # of Spam Mails | # of Spam from Friends till date | (%) Spam from Friends till date |
1 | 1818 | 1818 | 915 | 50.33 | 0 | 0 | No Spam |
2 | 593 | 497 | 18 | 3.62 | 96 | 0 | 0.00 |
3 | 1174 | 1174 | 580 | 49.40 | 0 | 0 | No Spam |
4 | 641 | 576 | 387 | 67.19 | 65 | 3 | 4.62 |
5 | 5105 | 5002 | 870 | 17.39 | 103 | 0 | 0.00 |
6 | 1682 | 1418 | 132 | 9.31 | 264 | 9 | 3.41 |
7 | 1230 | 1230 | 446 | 36.26 | 0 | 0 | No Spam |
8 | 1992 | 1788 | 715 | 39.99 | 204 | 8 | 3.92 |
9 | 360 | 133 | 0 | 0.00 | 227 | 0 | 0.00 |
10 | 879 | 524 | 7 | 1.34 | 355 | 0 | 0.00 |
11 | 168 | 168 | 81 | 48.21 | 0 | 0 | No Spam |
12 | 1322 | 1301 | 688 | 52.88 | 21 | 1 | 4.76 |
13 | 1408 | 1360 | 118 | 8.68 | 48 | 7 | 14.58 |
14 | 934 | 934 | 489 | 52.36 | 0 | 0 | No Spam |
15 | 459 | 414 | 134 | 32.37 | 45 | 0 | 0.00 |
16 | 2183 | 1999 | 969 | 48.47 | 184 | 3 | 1.63 |
17 | 527 | 527 | 281 | 53.32 | 0 | 0 | No Spam |
18 | 380 | 380 | 253 | 66.58 | 0 | 0 | No Spam |
19 | 749 | 749 | 450 | 60.08 | 0 | 0 | No Spam |
20 | 140 | 140 | 15 | 10.71 | 0 | 0 | No Spam |
21 | 1522 | 1151 | 745 | 64.73 | 371 | 4 | 1.08 |
22 | 3316 | 2370 | 1425 | 60.13 | 946 | 5 | 0.53 |
Total | 28582 | 25653 | 9718 | 37.88 | 2929 | 40 | 1.37 |
The following scatter plot shows the fraction of non-spam and spam messages that are from friends till date for each mailbox:
From the above results we observe that about 37.88% of all non-spam messages are from friends till date, while only 1.37% of all spam messages are from friends till date. To measure the effectiveness of this test, we now calculate the ratio of the number of non-spam messages from friends till date (B) to the total number of messages from friends till date (A).
The results obtained are as follows:
MailBox # | # of Mails | A: # of Mails from Friends till date | B: # of Non-Spam from Friends till date | B/A |
1 | 1818 | 915 | 915 | 1.00 |
2 | 593 | 18 | 18 | 1.00 |
3 | 1174 | 580 | 580 | 1.00 |
4 | 641 | 390 | 387 | 0.99 |
5 | 5105 | 870 | 870 | 1.00 |
6 | 1682 | 141 | 132 | 0.94 |
7 | 1230 | 446 | 446 | 1.00 |
8 | 1992 | 723 | 715 | 0.99 |
9 | 360 | 0 | 0 | N/A |
10 | 879 | 7 | 7 | 1.00 |
11 | 168 | 81 | 81 | 1.00 |
12 | 1322 | 689 | 688 | 1.00 |
13 | 1408 | 125 | 118 | 0.94 |
14 | 934 | 489 | 489 | 1.00 |
15 | 459 | 134 | 134 | 1.00 |
16 | 2183 | 972 | 969 | 1.00 |
17 | 527 | 281 | 281 | 1.00 |
18 | 380 | 253 | 253 | 1.00 |
19 | 749 | 450 | 450 | 1.00 |
20 | 140 | 15 | 15 | 1.00 |
21 | 1522 | 749 | 745 | 0.99 |
22 | 3316 | 1430 | 1425 | 1.00 |
Total | 28582 | 9758 | 9718 | 1.00 |
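We observe that almost all mail from friends till date is non-spam: B/A is 1.00 overall and at least 0.94 for every mailbox that has any mail from friends till date.
As discussed for the previous test, this test is very useful for identifying non-spam messages because the feature is present almost exclusively in non-spam mail.
Because it relies only on mail sent before the tested message was received, this test can also be used to identify whether an incoming mail is non-spam.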
Last updated: 2008-08-19 by Nirav Shah