Statistical Methods for Natural Language Processing (NLP)
Spring 2010
Course Information:
Time: Tuesday, 4:10 - 6pm
Location: 606 Lewisohn
Office Hours: Tuesday, 2 - 4pm (or by appointment), Speech Lab, CEPSR building (7th floor)
Instructor:
Dr. Sameer Maskey
smaskey@cs.columbia.edu
914 945 1573
Teaching Assistant:
Kapil Thadani, kapil@cs.columbia.edu
Office Hours: Thursday, 3 - 5pm (or by appointment), Office 724, CEPSR building
Guest lectures by:
Dr. Salim Roukos
Dr. Bowen Zhou
Course Description
This course explores topics in statistical methods and machine learning for real-world Natural Language Processing (NLP) problems. We will study ML topics commonly used in NLP, such as Maximum Entropy Models, Hidden Markov Models, clustering techniques, Conditional Random Fields, the Expectation-Maximization algorithm, Active Learning, and Support Vector Machines. We will see how these methods are applied to real-world NLP problems such as information extraction, stochastic parsing, text segmentation and classification, topic/document clustering, and word sense disambiguation. We will also study the details of inference algorithms such as Viterbi, synchronous chart parsing, and beam search. Students will get hands-on experience by implementing some of these ML techniques for classification, clustering, and the complex NLP task of machine translation.
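To give a flavor of the kind of hands-on implementation work involved, below is a minimal sketch of the Viterbi algorithm mentioned above, written in Python over a toy two-state part-of-speech model. The state names, words, and all probabilities are invented for illustration only; they are not course material, and homework implementations will be more substantial.

# A minimal Viterbi decoder for a toy HMM. All state names and
# probabilities below are invented for illustration.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # best[t][s]: probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]  # backpointers for recovering the best path

    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # pick the predecessor that maximizes the path probability
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev

    # trace back from the most probable final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = back[t][state]
        path.insert(0, state)
    return path

# Toy usage: tag a two-word sentence with invented probabilities.
states = ["Noun", "Verb"]
start_p = {"Noun": 0.6, "Verb": 0.4}
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7},
           "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit_p = {"Noun": {"dogs": 0.7, "run": 0.3},
          "Verb": {"dogs": 0.1, "run": 0.9}}
print(viterbi(["dogs", "run"], states, start_p, trans_p, emit_p))  # ['Noun', 'Verb']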
Academic Integrity
Presenting copied work as your own will not be tolerated and will result in an automatic zero. If you believe you need extra time to complete an assignment, please email the TA or the instructor in advance.
Prerequisites
Background knowledge in probability, statistics, and linear algebra, and some experience in at least one programming language.
Grading
There will be three homework assignments, a final exam, and a final project; there will be no mid-term exam. Each homework will contain a programming assignment (some may also contain a brief written assignment). Grading breakdown: HW1 (15%), HW2 (15%), HW3 (15%), Final Project (40%), Final Exam (15%). You have 3 'no penalty' late days in total that may be used during the semester. Each additional late day (without approval) will be penalized by 20% per day.
Tentative Class Schedule
Week | Date | Topics | Slides | Assignments | Readings and Remarks
Week 1 | 19 Jan | Introduction, Text Mining, Linear Models of Regression | | |
Week 2 | 26 Jan | Text Categorization, Linear Methods of Classification | | | 23.1.1-23.1.3 J&M Book; 1.1, 3.1, 4.1 Bishop Book
Week 3 | 2 Feb | Text Categorization, Support Vector Machines | | HW1 Assigned; HW1 Solutions | 6.1, 6.2, 7.1 (up to 7.1.1 only) Bishop Book; 3.1, 4.5.1 J&M Book; Sebastiani, F., "Machine Learning in Automated Text Categorization", ACM Computing Surveys 2002; Optional: Christopher Burges's SVM tutorial
Week 4 | 9 Feb | Information Extraction, Sequential Stochastic Models, HMMs | | | 22.1, 6.1-6.5 J&M Book
Week 5 | 16 Feb | Hidden Markov Models II | | HW1 Due; HW2 Assigned; HW2 Solutions | 22.2 J&M Book; 13.1, 13.2 Bishop Book
Week 6 | 23 Feb | Maximum Entropy Models | | Project Proposal Due (11:59pm) | 6.6-6.8 J&M Book
Week 7 | 2 Mar | Semantics, Brief Introduction to Graphical Models | | | Come prepared with at least one question for each paper: Liang P., Jordan M., Klein D., "Learning Semantic Correspondences with Less Supervision", ACL 2009; Shen D. and Lapata M., "Using Semantic Roles to Improve Question Answering", EMNLP 2007; Carlson A. et al., "Coupling Semi-Supervised Learning of Categories and Relations", HLT 2009 Workshop
Week 8 | 9 Mar | Topic/Document Clustering, K-means, Mixture Models, Expectation Maximization | | HW2 Due (March 14) | 9.1-9.4 Bishop Book
Week 9 | 16 Mar | No Class, Spring Break | | Project Information; Project Intermediate Results Due (March 25) |
Week 10 | 23 Mar | Conditional Random Fields | | HW3 Assigned; HW3 Solutions | 8.3 Bishop Book; Sutton C. and McCallum A., "An Introduction to Conditional Random Fields for Relational Learning", 2006
Week 11 | 30 Mar | Machine Translation I | | | 25.1-25.13 J&M Book; Invited Lecture: Dr. Salim Roukos
Week 12 | 6 Apr | Machine Translation II | | HW3 Due (April 9) | Invited Lecture: Dr. Bowen Zhou
Week 13 | 13 Apr | Language Models, Graphical Models | | | 4.2-4.7 J&M Book; 8.1-8.3 Bishop Book
Week 14 | 20 Apr | Part I: Markov Random Fields; Part II: Equations to Implementation | | |
Week 15 | 27 Apr | Project Presentations | | Final Projects Due (April 25, 11:59pm) |
Books
For the NLP topics of the course, we will use the following book:
Speech and Language Processing (2nd Edition) by Daniel Jurafsky and James H. Martin (ISBN-13: 9780131873216)
For statistical methods/machine learning topics, we will partly use:
Pattern Recognition and Machine Learning by Christopher M. Bishop (ISBN-13: 9780387310732)
We may also use an online textbook, and there will be assigned readings from various published papers.
Another good ML book is "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" (Second Edition) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.