COMS W4705 - Fall 2021

Getting into NLP

HW 0 will be released on Sept 6th and must be turned in by Sept 13th, the first day of class.

Students must do well on HW0 to get into the class. While CS majors are given preference, non-majors can get in if they do well on HW0, although this depends on the number of CS majors who get in. Watch for the posting of HW0 on this website. Everyone on the waitlist is welcome to do HW0, which must be turned in by the required date. Students are ranked by their HW0 grade, and I will take students off the waitlist once the ranking is determined. In the past, I've had juniors and seniors majoring in CS, MS students and PhD students. Most students come from CS, but students from other departments also get in.

As prerequisites, you should have taken at least one of AI, ML or a class that uses deep learning (e.g., Applied Deep Learning or one of the vision classes). You should also be well versed in programming. Classes such as Advanced Programming and Software Engineering are essential. Programming Languages and Translators or a class in Linguistics can be helpful but is not required.

Course Information

Time	MW 4:10-5:25pm
Place	451 Computer Science Building
Professor	Kathleen McKeown
Office Hours	M 1:00-2:00, 722 CEPSR; W 5:30-6:30, CS courtyard, CSB 452B
Email	kathy@cs.columbia.edu
Phone	212-939-7114

Weekly TA hours (EST) are listed below. TA hours will be held in the NLP Lab unless otherwise noted.

Monday	Faisal Ladhak (Head TA)	faisal@cs.columbia.edu	7:00pm-9:00pm
Tuesday	Antonio Camara	a.camara@columbia.edu	4:00pm-6:00pm
Wednesday	Andrew Sirenko	andrew.sirenko@columbia.edu	7:00pm-9:00pm
Thursday	Amith Ananthram	amith.ananthram@columbia.edu	1:00pm-3:00pm
Sunday	Bobby Hua	yh3228@columbia.edu	2:00pm-4:00pm

Here is where to find these rooms on the 7th floor of CEPSR.

Course Description

This course provides an introduction to the field of natural language processing (NLP). We will learn how to create systems that can analyze, understand and produce language. We will begin by discussing machine learning methods for NLP as well as core NLP tasks, such as language modeling, part-of-speech tagging and parsing. We will also discuss applications such as information extraction, machine translation, text generation and automatic summarization. The course will primarily cover statistical and machine learning based approaches to language processing, but it will also introduce the linguistic concepts that play a role. We will study machine learning methods currently used in NLP, including supervised machine learning, hidden Markov models, and neural networks. Homework assignments will include both written components and programming assignments.

The class will be held in person this fall, but to provide increased flexibility students will also be able to watch classes remotely using pre-recorded lectures. On the first day of class (which people should attend in person), I will discuss how this will work.

Requirements

Four homework assignments, a midterm and a final exam. Each student in the course is allowed a total of 4 late days on homeworks with no questions asked; after that, 10% per late day will be deducted from the homework grade, unless you have a note from your doctor. Do not use these up early! Save them for real emergencies.
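As a rough illustration of the late-day arithmetic above, here is a minimal Python sketch. It rests on assumptions the policy does not spell out: that the 4 free late days are used up in the order the homeworks are submitted, and that each late day beyond them costs 10 percentage points of that homework's grade, capped at 100. The function name `late_penalties` and its parameters are hypothetical, and the doctor's-note exception is not modeled. If in doubt, ask the instructor or a TA how the policy is applied.

```python
def late_penalties(days_late, free_days=4, points_per_day=10):
    """Return the percentage points deducted from each homework, in order.

    Assumptions (not stated in the syllabus): free late days are consumed
    in submission order, and the deduction is capped at 100 points.
    """
    remaining = free_days
    penalties = []
    for late in days_late:
        covered = min(late, remaining)  # late days absorbed by the free allowance
        remaining -= covered
        uncovered = late - covered      # late days that incur a deduction
        penalties.append(min(100, points_per_day * uncovered))
    return penalties


# Four homeworks turned in 1, 2, 3 and 0 days late.
print(late_penalties([1, 2, 3, 0]))  # [0, 0, 20, 0]
```

In this example the first three late days are free, the third homework uses the last free day and is then two days over, so under these assumptions it loses 20 points.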

We will use Google Cloud for the course. Instructions for setting up the cloud can be found here.

Textbook

Main textbook: Speech and Language Processing (SLP), 3rd Edition, by Jurafsky and Martin.

Recommended: Neural Network Methods for Natural Language Processing (NNNLP) by Yoav Goldberg. It is available online through Columbia's library but you can also purchase a hard copy from the publisher.

Recommended: Deep Learning (DL) by Goodfellow, Bengio and Courville.

Syllabus

This syllabus is still subject to change, and readings may change, but it will give you a good idea of what we will cover.

Week	Class	Topic	Reading	Assignments
1	Sept 13	Introduction and Course Overview		HW 0: Provided code
	Sept 15	Language modeling	C. 3 (through 3.6), SLP
2	Sept 20	Supervised machine learning, text classification	C. 5, SLP
	Sept 22	Supervised machine learning, Scikit Learn Tutorial	C 4 SLP	HW1
3	Sept 27	Sentiment and transition to NN	C 4.4 SLP
	Sept 29	Neural Nets	C 3 and 4, NNNLP, also see Michael Collins' Notes
4	Oct 4	Distributional Hypothesis and Word Embeddings	C 8 (through 8.5), C 10 (through 10.5.3) NNNLP
	Oct 6	RNNs / POS tagging	C15, 16.1 NNNLP, C 8-8.2, 8.4 SLP	HW1 due; HW2
5	Oct 11	Syntax	C 12-12.5 SLP
	Oct 13	Dependency Parsing	C 14-14.4 SLP
6	Oct 18	Introduction to Semantics	C 15-15.1, SLP
	Oct 20	Semantics and Midterm Review	Sample Midterm Questions; Sample Midterm Questions and Answers	HW 2 due
7	Oct 25	Midterm
	Oct 27	Intro to Machine Translation	C 11.1-11.2, 11.8 SLP	HW3
8	Nov 1	Academic holiday
	Nov 3	Neural MT	C 11.3-11.7 SLP	Guest speaker: Kapil Thadani
9	Nov 8	Advanced Word embeddings and semantics	BERT paper
	Nov 10	Word Sense Disambiguation	C 18 SLP; SenseBERT
10	Nov 15	Summarization	Extractive Neural Net Approach 1; Extractive Neural Net Approach 2; Extractive approach using BERT	HW 3
	Nov 17	Summarization	Abstractive Neural net approach 1; Abstractive Neural net approach 2; Abstractive approach with BART	HW 4
11	Nov 22	Language Generation	Seq2seq language generation; A Good Sample is Hard to Find
	Nov 24	Academic holiday
12	Nov 29	Information Extraction	C. 17 SLP1; IE paper 1: wikification; IE paper 2: relation extraction
	Dec 1	Dialog	Dialog paper	Guest speaker: Or Biran
13	Dec 6	Bias	Research paper 1; Research paper 2; Research paper 3
	Dec 8	Research and Review	Sample Final Questions	HW4 due
14	Dec 13	Final Exam - in class

Announcements

Check EdStem for announcements, and check Courseworks for your grades (only you will see them) and discussion. All questions should be posted through Piazza rather than emailed to Professor McKeown or the TAs, who will monitor the discussion lists to answer questions.

Academic Integrity

Copying or paraphrasing someone's work (code included), or permitting your own work to be copied or paraphrased, even if only in part, is not allowed, and will result in an automatic grade of 0 for the entire assignment or exam in which the copying or paraphrasing was done. Your grade should reflect your own work. If you believe you are going to have trouble completing an assignment, please talk to the instructor or TA in advance of the due date.

CS 4705: Introduction to Natural Language Processing, Fall 2021