- Who?
Senior research scientist at Yahoo Research NYC
PhD in computer science from Columbia University
Into natural language processing and machine learning
- More:
Curriculum vitae
LinkedIn
Google Scholar
Current research interests
- Structured prediction: information extraction, parsing, summarization, alignment and generation
- Representation learning: semantic, cross-lingual and multimodal representations, transfer learning
- Continual learning: online learning, active learning, deep reinforcement learning
Affiliations
- Natural language processing research for Yahoo News, Finance and Sports
- Columbia University NLP group, machine learning group and the Center for Computational Learning Systems
- The Association for Computational Linguistics
Teaching
- EECS 6984: Deep Learning for Computer Vision, Speech and Language
Columbia University, Fall 2018
- EECS 6984: Deep Learning for Computer Vision, Speech and Language
Columbia University, Spring 2017
Papers
- SCOT: Self-Supervised Contrastive Pretraining for Zero-Shot Compositional Retrieval
To appear in proceedings of WACV 2025 in Tucson, AZ
-
arXiv preprint
-
In IEEE Transactions on Knowledge and Data Engineering, 2024
-
In proceedings of the Workshop on Generative Models for Computer Vision at CVPR 2024 in Seattle, WA
-
In proceedings of WACV 2023 in Waikoloa, HI
-
In proceedings of KDD 2022 in Washington, D.C.
-
In proceedings of the Workshop on Knowledge Injection in Neural Networks at CIKM 2021 in the Gold Coast, Australia
-
In proceedings of COLING 2020 in Barcelona, Spain
-
In proceedings of CIKM 2020 in Galway, Ireland
-
In proceedings of WebSci 2020 in Southampton, UK
-
In proceedings of the Workshop on Noisy User-generated Text at EMNLP 2019 in Hong Kong, China
-
In proceedings of WSDM 2017 in Cambridge, UK
-
In proceedings of SIGDIAL 2016 in Los Angeles, California (Nominated for best paper)
-
In proceedings of LREC 2016 in Portorož, Slovenia
-
In the Journal of the American Society for Information Science and Technology, 2016
-
In proceedings of ACL 2014 in Baltimore, Maryland
-
In proceedings of IJCNLP 2013 in Nagoya, Japan
-
In proceedings of IJCNLP 2013 in Nagoya, Japan
-
In proceedings of CoNLL 2013 in Sofia, Bulgaria
-
In proceedings of COLING 2012 in Mumbai, India
-
In proceedings of Interspeech 2012 in Portland, Oregon
-
In proceedings of IJCNLP 2011 in Chiang-Mai, Thailand
-
In proceedings of the Workshop on Monolingual Text-to-Text Generation at ACL-HLT 2011 in Portland, Oregon
-
In proceedings of ACL-HLT 2011 in Portland, Oregon
-
In proceedings of ACL-HLT 2011 in Portland, Oregon
-
In proceedings of NAACL-HLT 2010 in Los Angeles, California
-
In proceedings of the Workshop on Creating Speech and Text Language Data with Amazon's Mechanical Turk at NAACL-HLT 2010 in Los Angeles, California
-
In proceedings of LREC 2010 in Valletta, Malta
-
In proceedings of COLING 2008 in Manchester, UK
-
In proceedings of NIPS 2007 in Vancouver, Canada
-
In proceedings of ECML 2007 in Warsaw, Poland
Patents
- Systems and Methods for Image Compositing via Machine Learning
US Patent filed Mar 2024
- Systems and Methods for Using AI to Facilitate Image Editing
US Patent filed Mar 2024
-
US Patent filed Aug 2023
-
US Patent filed Jul 2022
-
US Patent filed Jun 2021
-
US Patent filed Jun 2019, granted Apr 2023
-
US Patent filed Dec 2018, granted Jan 2022
-
US Patent filed Oct 2018, granted Oct 2020
-
US Patent filed Feb 2017, granted Feb 2024
-
US Patent filed Feb 2016, granted Oct 2020
-
US Patent filed Dec 2012, granted Apr 2016
Dissertations and other publications
-
PhD Dissertation, Columbia University, 2015
- Decreasing Textual Redundancy
Master's Thesis, Columbia University, 2007
-
In proceedings of the 2007 New York Academy of Sciences Symposium on Machine Learning in New York City
Datasets
- A collection of document IDs in the New York Times Annotated Corpus for which the corresponding summaries on the nytimes.com homepage are genuinely extractive or near-extractive. Code to extract these documents from the corpus is available here.
Download (239 KB) README BibTeX
- A corpus of 1020 phrase-based alignments derived from the Edinburgh paraphrase corpus, including tokenization fixes, dependency graphs, named entity annotations and baseline alignments generated by METEOR. See Scott Martin's description for more details.
Download (1.6 MB) README BibTeX
- A small corpus featuring 297 pairs of related newswire sentences, each with 10 fusions of varying correctness (5 intersections and 5 unions) generated by Mechanical Turk users.
Download (91 KB) README BibTeX
- A collection of 941 prepositional phrase attachment cases over unstructured blog text. Candidates were chosen automatically and final judgments were made by humans responding to multiple-choice questions on Mechanical Turk.
Download (130 KB) README BibTeX
Miscellany
- Candidacy exam on text-to-text generation
- Erdős number: 4
Me → { Tony Jebara → Tommi Jaakkola; Kathy McKeown → Zvi Galil } → Noga Alon → Paul Erdős
- Bacon number: ∞