Creating Network
1. Check that the email is valid
2. Check that "from" is a valid name and extract it
3. Check whether the name is already in the network with the first and last name reversed; if so, flip it to match the existing node
4. For each "to" and "cc" recipient:
   1. Check that "to" is a valid name and extract it
   2. Check whether the name is already in the network with the first and last name reversed; if so, flip it to match the existing node
   3. If all checks passed, insert a new edge "from" -> "to" into the network
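The steps above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the email layout (a dict with "from", "to", and "cc" keys), the helper names canonical and add_email, and the extract_name callable passed in are all assumptions, and the validity check in step 1 is a stand-in because the notes do not say what it tests.

```python
def canonical(name, nodes):
    """Return the spelling of `name` already used in the network.

    Tries the name as given, then with first and last word swapped.
    Falls back to the name as given if neither is a known node.
    """
    if name in nodes:
        return name
    parts = name.split()
    if len(parts) < 2:
        return name
    flipped = " ".join([parts[-1]] + parts[1:-1] + [parts[0]])
    return flipped if flipped in nodes else name


def add_email(nodes, edges, email, extract_name):
    """Add the edges of one email to the network (sets `nodes` and `edges`).

    `email` is a hypothetical dict with keys "from", "to", "cc".
    `extract_name` is any callable returning a cleaned name or None.
    """
    recipients = list(email.get("to", [])) + list(email.get("cc", []))
    # Step 1 (assumed check): an email needs a sender and a recipient.
    if not email.get("from") or not recipients:
        return
    # Step 2: the sender must be a valid name.
    sender = extract_name(email["from"])
    if sender is None:
        return
    # Step 3: reuse an existing node if the reversed name is known.
    sender = canonical(sender, nodes)
    # Step 4: repeat the checks for every recipient.
    for raw in recipients:
        recipient = extract_name(raw)
        if recipient is None:
            continue
        recipient = canonical(recipient, nodes)
        nodes.update((sender, recipient))
        edges.add((sender, recipient))
```

Storing edges in a set means repeated emails between the same pair count once; whether the original network weighted repeated edges is not stated here.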
Getting rid of "spam" Emails
1. blacklist
2. If an email is cc'd to more than x people, don't use it in the network
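A minimal sketch of the two filters, under stated assumptions: the blacklist is taken to be a set of names, an email from a blacklisted sender is dropped entirely, blacklisted recipients are removed from the remaining emails, and "more than x people" is counted on the cc list only. The function name filter_email and the dict layout are hypothetical.

```python
def filter_email(email, blacklist, x):
    """Return a filtered copy of `email`, or None if it should be skipped.

    `email` is a hypothetical dict with keys "from", "to", "cc".
    """
    cc = list(email.get("cc", []))
    # Blacklist: ignore anything sent by a blacklisted name (assumption).
    if email["from"] in blacklist:
        return None
    # Mass-email removal: skip emails cc'd to more than x people.
    if len(cc) > x:
        return None
    return {
        "from": email["from"],
        "to": [r for r in email.get("to", []) if r not in blacklist],
        "cc": [r for r in cc if r not in blacklist],
    }
```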
Extract Name
1. Remove troublesome characters such as: & . ? * ( ) '
2. If it contains digits or @, it's not a name
3. Remove titles such as Mr, Mrs, Dr and suffixes such as Jr, Sr
4. If more than 3 words remain, it's not a name
5. Convert all names to the same format (note case)
Valid formats (2 or 3 words):
a. LAST, FIRST
b. FIRST LAST
c. LAST, FIRST MIDDLE|INITIAL.
d. FIRST MIDDLE|INITIAL. LAST
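The name-extraction rules above could look something like this in Python. It is a sketch under assumptions: the output format is taken to be "First [Middle] Last" in capitalized form, a comma is treated as the marker for the LAST, FIRST formats, and the title and suffix lists contain only the examples given. The function name extract_name is hypothetical.

```python
import re

# Step 1: characters to delete (the examples listed in the notes).
STRIP_CHARS = re.compile(r"[&.?*()']")
# Step 3: titles and suffixes to drop (only the examples listed).
TITLES = {"mr", "mrs", "dr", "jr", "sr"}


def extract_name(raw):
    """Return a cleaned "First [Middle] Last" name, or None if not a name."""
    text = STRIP_CHARS.sub("", raw)
    # Step 2: digits or @ mean this is an address or ID, not a name.
    if "@" in text or any(ch.isdigit() for ch in text):
        return None
    # A comma signals the LAST, FIRST [MIDDLE] formats.
    head, comma, tail = text.partition(",")
    head_words = [w for w in head.split() if w.lower() not in TITLES]
    tail_words = [w for w in tail.replace(",", " ").split()
                  if w.lower() not in TITLES]
    words = tail_words + head_words if comma else head_words
    # Step 4 and valid formats: only 2 or 3 words are accepted.
    if not 2 <= len(words) <= 3:
        return None
    # Step 5: one format and one case for every name.
    return " ".join(w.capitalize() for w in words)
```

For example, extract_name("Kaminski, Vince J.") gives "Vince J Kaminski", while extract_name("Janet De La Paz") gives None because four words remain.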
Analysis
10-24-00, 11-29-00, and 12-20-00 (Skilling named as CEO) Networks
Date | Blacklist | Mass Email Removal at (x) | # Nodes | # Edges |
10/24/00 | Yes | 10 | 436 | 640 |
10/24/00 | No | 10 | 441 | 1090 |
10/24/00 | Yes | 20 | 541 | 1379 |
11/29/00 | Yes | 5 | 397 | 451 |
11/29/00 | Yes | 10 | 567 | 728 |
11/29/00 | No | 10 | 572 | 733 |
11/29/00 | Yes | 20 | 770 | 1154 |
11/29/00 | No | 20 | 775 | 1158 |
12/20/00 | Yes | 10 | 2080 | 6016 |
12/20/00 | No | 10 | 2099 | 6043 |
12/20/00 | No | 20 | 2675 | 8719 |
Note: Using a blacklist can remove more nodes than just the blacklisted "names", since some people converse only with blacklisted names and are left without any edges.
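The effect described in the note can be reproduced on a toy edge list. This is an illustrative sketch only: the names are made up, and remove_blacklisted is a hypothetical helper that assumes a node exists only while it has at least one edge.

```python
def remove_blacklisted(edges, blacklist):
    """Drop every edge touching a blacklisted name; return (nodes, edges)."""
    kept = {(s, d) for s, d in edges
            if s not in blacklist and d not in blacklist}
    # Nodes are whatever still appears on a surviving edge.
    nodes = {name for edge in kept for name in edge}
    return nodes, kept


# Toy data: "Ann Lee" only ever writes to the blacklisted list address.
edges = {("Ann Lee", "All Staff"), ("Bob Ray", "Cy Dunn")}
nodes, kept = remove_blacklisted(edges, {"All Staff"})
```

Here one blacklisted name removes two nodes, because "Ann Lee" has no other edges.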
Name Problems:
1. Some nodes should have been listed in the blacklist but were not
2. Some names are more than 3 words. Example: Janet De La Paz
In the 12-20-00 network (blacklist and x=10), 26 of 2080 nodes had errors, a mere 1.25%. This can be reduced even further by adding more names to the blacklist.
Top 10 - 12-20-00 (blacklist and x=10)
# | person | degree |
1 | Jeff Dasovich | 145 |
2 | Vince Kaminski | 121 |
3 | Sara Shackleton | 120 |
4 | Steven Kean | 119 |
5 | Tana Jones | 112 |
6 | Kay Mann | 93 |
7 | Jeffrey Shankman | 90 |
8 | Mark Taylor | 86 |
9 | Chris Germany | 85 |
10 | David Delainey | 83 |
... |
35 | Jeff Skilling | 43 |
Top 10 - 11-29-00 (blacklist and x=10)
# | person | degree |
1 | Tana Jones | 32 |
2 | Sara Shackleton | 27 |
3 | Jeff Dasovich | 24 |
4 | Vince Kaminski | 23 |
5 | Susan Scott | 20 |
6 | Chris Germany | 19 |
7 | Kate Symes | 19 |
8 | Mike Mcconnell | 15 |
9 | Jeffrey Shankman | 13 |
10 | Karen Denne | 13 |
... |
133 | Jeff Skilling | 3 |
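A ranking like the ones above can be computed from the edge list along these lines. The notes do not say how degree was defined, so this sketch assumes degree means the number of distinct neighbours, ignoring edge direction and self-loops; top_by_degree is a hypothetical helper name.

```python
from collections import defaultdict


def top_by_degree(edges, n=10):
    """Return the n highest-degree nodes as (name, degree) pairs.

    Degree is assumed to be the count of distinct neighbours,
    regardless of direction. Ties are broken alphabetically.
    """
    neighbours = defaultdict(set)
    for src, dst in edges:
        if src == dst:
            continue  # ignore self-loops (assumption)
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    ranked = sorted(neighbours.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(name, len(peers)) for name, peers in ranked[:n]]
```

If the original tables counted in-degree and out-degree separately, the numbers would differ from this definition.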
Networks
11-29-00 Network (x=10)
12-20-00 Network (x=10)
11-29-00 Strongly Connected Component Subnet