Statistical NLP for the Web
Fall 2012
Course Information
Time: Wednesday, 4:10 - 6:00pm
Location: 627 Mudd
Instructor: Dr. Sameer Maskey
smaskey [at] cs.columbia.edu
Office Hours: Wednesday, 2 - 4pm (or by appointment), 457 CS Building
TA: Morgan Ulinski
mulinski [at] cs.columbia.edu
Office Hours: Tuesday, 2 - 4pm, Speech Lab, CEPSR Building (7th floor)
Course Description
Are you interested in developing a sentiment analysis algorithm that
uses Twitter firehose data? Do you want to learn how Hidden Markov
Models and Finite State Machines can be used to implement a spoken
dialog system like Siri? Would you like to understand news clustering
algorithms or Maximum Entropy-based question answering systems? This
course explores topics that juxtapose statistical/machine learning
algorithms with real-world NLP/speech tasks that use large amounts of
web data. We will study NLP/speech topics such as text mining,
document classification, topic clustering, summarization, and dialog
systems. We will explore statistical methods and machine learning
techniques such as linear classifiers, clustering techniques,
inference algorithms, and ranking methods that are used to address some
of these NLP/speech problems. Students will get hands-on experience
implementing some of these techniques efficiently to build an
NLP/speech system that can handle a significant amount of unstructured
web data (text, speech and video).
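To give a flavor of the hands-on work, here is a minimal illustrative sketch (not a course assignment) of one technique named above: a perceptron-style linear classifier over bag-of-words features for text categorization. The toy sentiment data is invented for illustration.

```python
# Illustrative sketch: a perceptron linear classifier with bag-of-words
# features, the kind of model covered in the text categorization lectures.
from collections import defaultdict

def featurize(text):
    """Bag-of-words feature counts for a document."""
    feats = defaultdict(int)
    for tok in text.lower().split():
        feats[tok] += 1
    return feats

def train_perceptron(examples, epochs=10):
    """examples: list of (text, label) pairs with label in {+1, -1}."""
    w = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            feats = featurize(text)
            score = sum(w[f] * v for f, v in feats.items())
            if label * score <= 0:  # misclassified: nudge weights toward label
                for f, v in feats.items():
                    w[f] += label * v
    return w

def predict(w, text):
    score = sum(w[f] * v for f, v in featurize(text).items())
    return 1 if score > 0 else -1

# Toy training set (invented data): +1 = positive, -1 = negative.
train = [("great movie loved it", 1), ("terrible boring film", -1),
         ("loved the acting", 1), ("boring and terrible", -1)]
w = train_perceptron(train)
print(predict(w, "loved the movie"))  # prints 1 (positive)
```

A real course project would swap the toy data for web-scale text and the perceptron for the regularized linear models discussed in class; the training loop stays essentially this simple.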
Academic Integrity
Presenting copied work as your own is strictly prohibited and will result in an automatic zero. If you believe you need extra time to complete an assignment, please email the TA or the instructor in advance.
Prerequisites
Background knowledge in probability, statistics, and linear algebra, and experience in at least one programming language.
Grading
There will be 3 homeworks and a final project (no final exam). Each homework will contain a programming assignment (some homeworks may also contain a brief written assignment). HW1 (15%), HW2 (15%), HW3 (15%), Final Project (55%). You have 3 'no penalty' late days in total that can be used during the semester. Each additional late day (without approval) will be penalized 20% per day.
Tentative Class Schedule
Week | Date | Topics | Assignments | Readings and Remarks | Additional Material
Week1 | September 5, 2012 | Introduction, Text Mining and Linear Methods of Regression | | |
Week2 | September 12, 2012 | Text Categorization and Linear Classifiers | Final Project Information (updated Dec 4th) | 23.1.1, 23.1.2, 23.1.3 J&M Book; 1.1, 3.1, 4.1 Bishop Book | Elkan's intro
Week3 | September 19, 2012 | Topic/Document Clustering, Unsupervised Learning, K-Means, Expectation Maximization Algorithms, Hierarchical Clustering | Homework1 assigned; Project Proposal Draft due (11:59pm) | 9.1 to 9.4 Bishop Book | Document Clustering Overview; Eisner's excel
Week4 | September 26, 2012 | Non-Metric Methods, Statistical Parsing, PCFGs, Synchronous PCFGs | Project Proposal due (11:59pm) | Chapters 12, 13 and 14 J&M Book |
Week5 | October 3, 2012 | Information Extraction, Tagging, Stochastic Sequential Models, Hidden Markov Models | Homework1 due (Oct 4, 11:59pm); Homework2 assigned | 22.2 J&M Book; 13.1 and 13.2 Bishop Book; Rabiner Paper (Sections I, II and III) | Eisner's excel; F-measure Example excel
Week6 | October 10, 2012 | Hidden Markov Models II, MapReduce | | 6.1 to 6.5 J&M Book |
Week7 | October 17, 2012 | MapReduce for Statistical NLP/Machine Learning | Project Intermediate Report I due (October 17, 11:59pm) | MapReduce paper | MapReduce ML; Language Model MapReduce
Week8 | October 24, 2012 | Neural Networks | | 5.1 to 5.3 Bishop Book |
Week9 | October 31, 2012 | Deep Belief Networks | Homework2 due (Oct 30, 11:59pm) | Hinton's Deep Belief Network Paper; Collobert's DBN for NLP Tasks | Semantic hashing
Week10 | November 7, 2012 | Machine Translation I | Homework3 assigned | 25.1 to 25.7 J&M Book | Brown Paper; Kevin Knight's Workbook
Week11 | November 14, 2012 | Maximum Entropy Models | | 6.6 to 6.8 J&M Book; MaxEnt for NLP | Eisner's excel
Week12 | November 21, 2012 | Machine Translation Decoding; Invited Guest Lecture: Dr. Ahmad Emami | Project Intermediate Report Oral (Nov 21, 10:00 - 4:00) | 25.8 to 25.12 J&M Book |
Week13 | November 28, 2012 | Log-Linear Models in General, Conditional Random Fields, Question Answering | Homework3 due (Dec 5, 11:59pm) | Charles Elkan's CIKM tutorial |
Week14 | December 5, 2012 | Equations to Implementation / Building Scalable Statistical Web NLP Applications | | |
Week15 | December 12, 2012 | Final Project Demo/Presentation Day | Final Project Report due (Dec 12, 11:59pm); Demo and Presentation (Dec 12, 10:00 - 2:00), CS Conference Room (last week of classes) | |
Week16 | December 19, 2012 | Finals Week (no final exam for this class) | | |
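Several of the scheduled topics reduce to short, implementable algorithms. As one illustration (not course material), here is a minimal sketch of K-Means, the clustering algorithm from Week3, run on invented 2-D toy points; real document clustering would use high-dimensional term vectors instead.

```python
# Illustrative sketch: plain K-Means (Lloyd's algorithm) on 2-D points.
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

# Two well-separated toy groups (invented data).
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
print(sorted(map(sorted, clusters)))
```

The EM algorithm covered the same week generalizes this hard assignment step to soft, probabilistic responsibilities under a mixture model.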
Examples of Previous Student Projects
Section Classification in Clinical Notes using Supervised HMM - Ying
Automatic Summarization of Recipe Reviews - Benjamin
Classifying Kid-submitted Comments using Machine Learning Techniques - Tony
Towards An Effective Feature Selection Framework - Boyi
Using Output Codes as a Boosting Mechanism - Green
Enriching CATiB Treebank with Morphological Features - Sarah
SuperWSD: Supervised Word Sense Disambiguation by Cross-Lingual Lexical Substitution - Wenhan
L1 regularization in log-linear Models - Tony
A System for Routing Papers to Proper Reviewers - Zhihai
Books
We will provide handouts in class. Besides the handouts, we will also use the following books.
For statistical methods/machine learning topics we will partly use:
Pattern Recognition and Machine Learning by Christopher M. Bishop (ISBN-13: 9780387310732)
For NLP topics of the course we will partly use:
Speech and Language Processing (2nd Edition) by Daniel Jurafsky and James H. Martin (ISBN-13: 9780131873216)
We may also use one of the online textbooks, and we will have assigned readings from various published papers.
Another good ML book is "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" (2nd Edition) by Trevor Hastie, Robert Tibshirani and Jerome Friedman.