Statistical NLP for the Web
Fall 2012
Course Information
Time: Wednesday, 4:10 - 6:00pm
Location: 627 Mudd
Instructor: Dr. Sameer Maskey
smaskey [at] cs.columbia.edu
Office Hours: Wednesday, 2 - 4pm (or by appointment), 457 CS Building
TA: Morgan Ulinski
mulinski [at] cs.columbia.edu
Office Hours: Tuesday, 2 - 4pm, Speech Lab, CEPSR Building (7th floor)
Course Description
Are you interested in developing a sentiment analysis algorithm that
uses Twitter firehose data? Do you want to learn how Hidden Markov
Models and Finite State Machines can be used to implement a spoken
dialog system like Siri? Would you like to understand news clustering
algorithms or Maximum Entropy-based question answering systems? This
course explores topics that juxtapose statistical/machine learning
algorithms with real-world NLP/speech tasks that use large amounts of
web data. We will study NLP/speech topics such as text mining,
document classification, topic clustering, summarization, and dialog
systems. We will explore statistical methods and machine learning
techniques such as linear classifiers, clustering techniques,
inference algorithms, and ranking methods that are used to address some
of these NLP/speech problems. Students will get hands-on experience
implementing some of these techniques efficiently to build an
NLP/speech system that can handle a significant amount of unstructured
web data (text, speech and video).
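To give a flavor of the hands-on work, here is a minimal illustrative sketch (not a course assignment) of one technique named above: a perceptron-style linear classifier over bag-of-words features for text categorization. The toy sentiment data is invented for illustration.

```python
# Illustrative sketch: a perceptron linear classifier with bag-of-words
# features, the kind of model covered in the text categorization lectures.
from collections import defaultdict

def featurize(text):
    """Bag-of-words feature counts for a document."""
    feats = defaultdict(int)
    for tok in text.lower().split():
        feats[tok] += 1
    return feats

def train_perceptron(examples, epochs=10):
    """examples: list of (text, label) pairs with label in {+1, -1}."""
    w = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            feats = featurize(text)
            score = sum(w[f] * v for f, v in feats.items())
            if label * score <= 0:  # misclassified: nudge weights toward label
                for f, v in feats.items():
                    w[f] += label * v
    return w

def predict(w, text):
    score = sum(w[f] * v for f, v in featurize(text).items())
    return 1 if score > 0 else -1

# Toy training set (invented data): +1 = positive, -1 = negative.
train = [("great movie loved it", 1), ("terrible boring film", -1),
         ("loved the acting", 1), ("boring and terrible", -1)]
w = train_perceptron(train)
print(predict(w, "loved the movie"))  # prints 1 (positive)
```

A real course project would swap the toy data for web-scale text and the perceptron for the regularized linear models discussed in class; the training loop stays essentially this simple.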
Academic Integrity
Presenting copied work as your own is strictly prohibited and will result in an automatic zero. If you believe you need extra time to complete an assignment, please email the TA or the instructor in advance.
Prerequisites
Background knowledge in probability, statistics, and linear algebra, and experience in at least one programming language.
Grading
There will be 3 homeworks and a final project (no final exam). Each homework will contain a programming assignment (some homeworks may also contain a brief written assignment). HW1 (15%), HW2 (15%), HW3 (15%), Final Project (55%). You have 3 'no penalty' late days in total that can be used during the semester. Each additional late day (without approval) will be penalized 20% per day.
Tentative Class Schedule
Week | Date | Topics | Assignments | Readings and Remarks | Additional Material
Week1 | September 5, 2012 | Introduction, Text Mining and Linear Methods of Regression | | |
Week2 | September 12, 2012 | Text Categorization and Linear Classifiers | Final Project Information (updated Dec 4th) | 23.1.1, 23.1.2, 23.1.3 J&M Book; 1.1, 3.1, 4.1 Bishop Book | Elkan's intro
Week3 | September 19, 2012 | Topic/Document Clustering, Unsupervised Learning, K-Means, Expectation Maximization Algorithms, Hierarchical Clustering | Homework1 assigned; Project Proposal Draft due (11:59pm) | 9.1 to 9.4 Bishop Book | Document Clustering Overview; Eisner's excel
Week4 | September 26, 2012 | Non-Metric Methods, Statistical Parsing, PCFGs, Synchronous PCFGs | Project Proposal due (11:59pm) | Chapters 12, 13 and 14 J&M Book |
Week5 | October 3, 2012 | Information Extraction, Tagging, Stochastic Sequential Models, Hidden Markov Models | Homework1 due (Oct 4, 11:59pm); Homework2 assigned | 22.2 J&M Book; 13.1 and 13.2 Bishop Book; Rabiner Paper (Sections I, II and III) | Eisner's excel; F-measure Example excel
Week6 | October 10, 2012 | Hidden Markov Models II, MapReduce | | 6.1 to 6.5 J&M Book |
Week7 | October 17, 2012 | MapReduce for Statistical NLP/Machine Learning | Project Intermediate Report I due (October 17, 11:59pm) | MapReduce paper | MapReduce ML; Language Model MapReduce
Week8 | October 24, 2012 | Neural Networks | | 5.1 to 5.3 Bishop Book |
Week9 | October 31, 2012 | Deep Belief Networks | Homework2 due (Oct 30, 11:59pm) | Hinton's Deep Belief Network Paper; Collobert's DBN for NLP Tasks | Semantic hashing
Week10 | November 7, 2012 | Machine Translation I | Homework3 assigned | 25.1 to 25.7 J&M Book | Brown Paper; Kevin Knight's Workbook
Week11 | November 14, 2012 | Maximum Entropy Models | | 6.6 to 6.8 J&M Book; MaxEnt for NLP | Eisner's excel
Week12 | November 21, 2012 | Machine Translation Decoding; Invited Guest Lecture: Dr. Ahmad Emami | Project Intermediate Report Oral (Nov 21, 10:00 - 4:00) | 25.8 to 25.12 J&M Book |
Week13 | November 28, 2012 | Log-Linear Models in General, Conditional Random Fields, Question Answering | Homework3 due (Dec 5, 11:59pm) | Charles Elkan's CIKM tutorial |
Week14 | December 5, 2012 | Equations to Implementation / Building Scalable Statistical Web NLP Applications | | |
Week15 | December 12, 2012 | Final Project Demo/Presentation Day | Final Project Report due (Dec 12, 11:59pm); Demo and Presentation (Dec 12, 10:00 - 2:00), CS Conference Room (last week of classes) | |
Week16 | December 19, 2012 | Finals Week (no final exam for this class) | | |
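Several of the scheduled topics reduce to short, implementable algorithms. As one illustration (not course material), here is a minimal sketch of K-Means, the clustering algorithm from Week3, run on invented 2-D toy points; real document clustering would use high-dimensional term vectors instead.

```python
# Illustrative sketch: plain K-Means (Lloyd's algorithm) on 2-D points.
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

# Two well-separated toy groups (invented data).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
print(sorted(map(sorted, clusters)))
```

The EM algorithm covered the same week generalizes this hard assignment step to soft, probabilistic responsibilities under a mixture model.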
Examples of Previous Student Projects
Section Classification in Clinical Notes using Supervised HMM - Ying
Automatic Summarization of Recipe Reviews - Benjamin
Classifying Kid-submitted Comments using Machine Learning Techniques - Tony
Towards An Effective Feature Selection Framework - Boyi
Using Output Codes as a Boosting Mechanism - Green
Enriching CATiB Treebank with Morphological Features - Sarah
SuperWSD: Supervised Word Sense Disambiguation by Cross-Lingual Lexical Substitution - Wenhan
L1 regularization in log-linear Models - Tony
A System for Routing Papers to Proper Reviewers - Zhihai
Books
We will provide handouts in class. Besides the handouts, we will also use the following books.
For statistical methods/machine learning topics we will partly use:
Pattern Recognition and Machine Learning by Christopher M. Bishop (ISBN-13: 9780387310732)
For NLP topics of the course we will partly use:
Speech and Language Processing (2nd Edition) by Daniel Jurafsky and James H. Martin (ISBN-13: 9780131873216)
We may also use one of the online textbooks, and we will have assigned readings from various published papers.
Another good ML book is "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" (2nd Edition) by Trevor Hastie, Robert Tibshirani and Jerome Friedman.