This homework is designed to give you
experience doing corpus-based
research. You will collect a corpus of email messages, anonymize
their email addresses, put them into a canonical form, classify
them
in several ways, extract some features from them, and perform
some
analyses on them using rule-based and machine learning techniques
on
the features you extract to perform several kinds of automatic
classification. Your result will be some email filters that work
more
or less well. An example of a similar study for voicemail can be
found in Hirschberg
& Ringel, CHI 2001.
The homework will be due in two stages. Stage I involves
collection
and preparation of the data for analysis. This will be due on 14
November. All collected messages will be combined for use by the
whole class in Stage II, which will involve corpus analysis; a
larger
corpus will permit more interesting analyses and, hopefully,
produce
better results. For this reason, it is essential that you follow
the
specifications for corpus collection and preparation described
below
and pay careful attention to the format of the sample files. Your
classmates will be depending upon you to produce high quality,
correct
data.
Stage I: Corpus collection, clean-up and annotation. 150 pts. Due 14 November.
1. (35 pts) Collect a corpus of 100 email messages in English
either from your
own incoming and/or outgoing email or someone else's donated
email.
This corpus should contain no more than 25% spam and no more than
25% broadcast messages (e.g. talk announcements). You will be
given a unique id to use in numbering the messages. Ids will be of
the form "fall02-N", where each of you will be assigned
a unique N.
Each msg should be placed in a separate file, numbered
"fall02-N-M.msg", where M is a number you will assign to each
individual message, and thus will range between 001 and 100. So,
if you are assigned the id 9999, the first message in your corpus
will be fall02-9999-001.msg and the last will be
fall02-9999-100.msg. This file will contain the original of the
message, without any annotation or labeling, but with all email
addresses anonymized, as in (2) below. Cf. fall02-9999.msg.
Do *not* include any messages in your corpus that might embarrass
you or anyone else if read by others or that refer to anything
illegal. Include a README file in
your submission that states you are willing to allow your
messages to be used for research purposes and that they do not
contain anything that might cause others embarrassment or harm.
2. (15 pts) Write a program to anonymize all email addresses in
these messages,
translating the username for each address into a corresponding
anonymous alias. You should preserve translation correspondence
across your corpus; i.e., maryb@cs.columbia.edu should always be
translated the same in all messages in which this address
appears,
e.g. as janec@cs.columbia.edu. You may use any correspondence you
like to translate these addresses, but please preserve at least
the
final 3-letter suffix (e.g. .edu, .gov,...). fall02-9999-001.msg,
e.g., should contain only these anonymized email addresses. (If
you wish to also anonymize proper names, make sure the result is
also a proper name.) Include your anonymizer program in your
submission.
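A minimal Python sketch of the consistent translation described in (2), offered as one possible approach rather than a required design. The alias pool, the regular expression, and the class name Anonymizer are all illustrative assumptions; the sketch keeps the whole domain, which is more than the required final suffix.

```python
import re

# Hypothetical alias pool; any list of plausible usernames would do.
ALIASES = ["janec", "bobd", "alicew", "tomr", "suek"]

# Simplified address pattern (an assumption; real addresses can be more exotic).
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


class Anonymizer:
    """Maps each address to a stable alias, keeping the domain intact."""

    def __init__(self):
        self.mapping = {}  # original address (lowercased) -> anonymized address

    def _alias_for(self, user, domain):
        key = f"{user}@{domain}".lower()
        if key not in self.mapping:
            n = len(self.mapping)
            base = ALIASES[n % len(ALIASES)]
            # Once the pool is exhausted, append a number to keep aliases unique.
            suffix = "" if n < len(ALIASES) else str(n // len(ALIASES))
            self.mapping[key] = f"{base}{suffix}@{domain}"
        return self.mapping[key]

    def anonymize(self, text):
        """Replace every address in text with its alias."""
        return EMAIL_RE.sub(
            lambda m: self._alias_for(m.group(1), m.group(2)), text
        )
```

To preserve the correspondence across the corpus, the same Anonymizer object would have to be reused for all 100 messages (or its mapping saved and reloaded between runs).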
3. (15 pts) Write a script to transform all message files fall02-N-M.msg
(e.g. fall02-9999-001.msg) into a
canonical form by creating an ascii file fall02-N-M.txt
(e.g. fall02-9999-001.txt) in the
following format:
Date: <day, time and date information as it appears in
dateline>
From: <all names and email addresses as they appear in
fromline>
To: <all names and email addresses as they appear in
toline>
cc: <all names and email addresses as they appear in cc
line>
Subject: <subject line information>
Body: <body of message in plain ascii, preserving
capitalization,
punctuation, line breaks and paragraphing>. (NB: Everything
that
follows the keyword 'Body:' here should be plain ascii. You
should
remove all non-ascii attachments.) Cf. fall02-9999.txt. Include your script in your submission.
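A rough Python sketch of the conversion in (3), using the standard library email package. It assumes the .msg file holds a standard RFC 822 style message; the function name canonicalize, the empty value for a missing header, and the choice to silently drop non-ascii characters are assumptions, not part of the specification.

```python
import email


def canonicalize(raw):
    """Return the canonical .txt form of one raw message string."""
    msg = email.message_from_string(raw)
    lines = []
    # Left: label used in the canonical file. Right: header looked up.
    for label, header in [("Date", "Date"), ("From", "From"), ("To", "To"),
                          ("cc", "Cc"), ("Subject", "Subject")]:
        # Assumption: a missing header becomes an empty value.
        lines.append(f"{label}: {msg.get(header, '')}")
    # Keep only text/plain parts; attachments and HTML parts are skipped.
    body_parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain" and not part.is_multipart():
            payload = part.get_payload(decode=True) or b""
            # Assumption: characters outside ascii are simply dropped.
            body_parts.append(payload.decode("ascii", errors="ignore"))
    lines.append("Body: " + "\n".join(body_parts))
    return "\n".join(lines)
```

The result would be written to fall02-N-M.txt for each fall02-N-M.msg; line breaks and capitalization in the body are passed through unchanged.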
4. (35 pts) Annotate the .txt files produced in (3) as follows:
a) Use the (corrected if necessary) time and date delimiter
program
you wrote for Homework I to identify and label all absolute and
deictic dates in the body of your messages. This time, label the
times of day as <TIME> 3:47 a.m. </TIME> and dates
<DATE> Tuesday,
June 1st </DATE> separately. A guiding principle for
determining a
time or date is, can you specify it on a clock or a calendar;
e.g. I can look at my watch and tell when 'now' is; I can look at
a
calendar and tell when 'next year' is. If you can't tell whether
something is a time or a date (e.g. early Tuesday morning), label
it all
as a date. If you run into a tricky example, ask. Include your
(corrected or uncorrected) program in your submission.
b) Hand-correct your time and date delimiter's output so that
your
final version of <prefix>N.txt correctly delimits all times
and
dates in the corpus. NB: The better your delimiter program works,
the less hand labeling you will need to do...
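To illustrate the label format in (4a), here is a deliberately small Python sketch. It is not the Homework I delimiter: the patterns cover only a few clock times, calendar dates and deictic expressions, and all names (TIME_RE, DATE_RE, tag) are assumptions. Cases such as 'early Tuesday morning' are left to the hand-correction step in (4b).

```python
import re

# Clock times such as 3:47, 3:47 a.m. or 10:15pm.
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?:\s*[ap]\.m\.|\s*[ap]m\b)?",
                     re.IGNORECASE)

MONTH = (r"(?:January|February|March|April|May|June|July|August"
         r"|September|October|November|December)")
DAY = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"

# A few absolute and deictic date expressions; far from complete.
DATE_RE = re.compile(
    rf"\b(?:{DAY},?\s+{MONTH}\s+\d{{1,2}}(?:st|nd|rd|th)?"  # Tuesday, June 1st
    rf"|{MONTH}\s+\d{{1,2}}(?:st|nd|rd|th)?"                 # June 1st
    rf"|{DAY}|today|tomorrow|yesterday"
    rf"|next\s+(?:week|month|year))\b",
    re.IGNORECASE,
)


def tag(text):
    """Wrap matched dates in <DATE> and matched times in <TIME>."""
    text = DATE_RE.sub(lambda m: f"<DATE> {m.group(0)} </DATE>", text)
    return TIME_RE.sub(lambda m: f"<TIME> {m.group(0)} </TIME>", text)
```

Dates are tagged before times so that the two passes do not interfere; the inserted tags contain no digits, so the time pattern cannot match inside them.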
5. (35 pts) Classify each of the messages in your corpus by hand
as follows: In
3 separate ascii files with format specified below, rate each
message
in your corpus from 1 to 3, where 1 is 'not at all', 2 is 'sort
of'
and 3 is 'definitely' along three dimensions:
a) To what extent would you say this message is spam? (fall02-N.spam)
b) To what extent is this message personal? (fall02-N.pers)
c) To what extent did you consider this message 'urgent' when
you received it? (i.e., something you would have wanted to read
immediately or which required immediate action) (fall02-N.urg)
For each rating, create an ascii file with a 100x2 matrix (one row per message)
containing the msg id (e.g. fall02-9999-001) in the first column
and the
message rating (1, 2 or 3) in the second, separated by a space.
You will thus produce 3 files, <prefix>.spam,
<prefix>.pers, and
<prefix>.urg. Cf. fall02-9999.spam, fall02-9999.pers,
fall02-9999.urg.
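The rating file layout in (5) can be sketched in Python as follows. The function name format_ratings and the validation checks are assumptions added for illustration; the sample files remain the authoritative reference for the format.

```python
def format_ratings(student_id, ratings):
    """Return the text of one rating file: a 'msg-id rating' line per message."""
    if len(ratings) != 100:
        raise ValueError("expected exactly 100 ratings")
    lines = []
    for m, rating in enumerate(ratings, start=1):
        if rating not in (1, 2, 3):
            raise ValueError(f"message {m}: rating must be 1, 2 or 3")
        # Message numbers are zero-padded to three digits: 001 ... 100.
        lines.append(f"fall02-{student_id}-{m:03d} {rating}")
    return "\n".join(lines) + "\n"
```

Calling this once per dimension and writing the results to fall02-N.spam, fall02-N.pers and fall02-N.urg would yield the three files, for example format_ratings(9999, spam_ratings) for the spam file, where spam_ratings is a hypothetical list of 100 hand-assigned ratings.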
6. (15 pts) In your README file,
describe any difficulties you had in deciding how
to anonymize (2), label (4) or classify (5) the data.
7. Place your README file, the programs you used to anonymize
messages, produce canonical format, automatically delimit times
and dates, and all of your .msg, .txt, .spam, .pers, and .urg
files in a single directory for submission. Follow the submission guidelines for Homework 1 to submit Homework 2. All programs
must run on a CS cluster machine under unix.