Ranking in a Domain Specific Search Engine
CS6998-03 - NLP for the Web, Spring 2008, Semester Project
Sara Stolbach, ss3067 [at] columbia.edu
Final Report and Presentation
Report: [pdf]
Presentation: [pdf] [ppt]
Search Engine: web interface
Interim Report
Due: March 13th
Report: [pdf]
Data, Report, and Code: [tar.gz]
Corpus
- Sample File: LE1492814.txt
- Complete Dataset: http://www.cs.columbia.edu/~sara/nlpForWeb/corpus (password protected)
Code
The code is included in the interim report (see above).
Javadocs: http://www1.cs.columbia.edu/~sara/nlpForWeb/doc/
Stats
- The dataset consists of items from 3 clothing sites
- There are 4988 documents
- There are 29263 unique terms without stemming
Important Features
This is a sample of some of the important features in the clothing domain:

# | feature | frequency | # | feature | frequency | # | feature | frequency | # | feature | frequency |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | blue | 990 | 2 | button | 858 | 3 | pants | 842 | 4 | white | 839 |
5 | men | 823 | 6 | pink | 661 | 7 | girls | 615 | 8 | red | 609 |
9 | women | 2054 |
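
As a rough illustration of how feature frequencies and a vocabulary size like those above could be tallied, here is a minimal Python sketch. It is not the project's code (the project is built on Lucene in Java). The tokenizer, the function names, and the three sample item descriptions are all assumptions made for the example. It counts total occurrences across documents; whether the table reports occurrences or document counts is not stated here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters; no stemming is applied.

    This simple tokenizer is an assumption for illustration only.
    """
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(documents):
    """Count how often each term occurs across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(tokenize(text))
    return counts


# Hypothetical item descriptions standing in for corpus files.
docs = [
    "Blue button-down shirt for men",
    "Girls pink pants with white trim",
    "Women blue pants",
]

counts = term_frequencies(docs)
vocabulary_size = len(counts)      # number of unique unstemmed terms
top_terms = counts.most_common(2)  # most frequent terms first

print(vocabulary_size)
print(top_terms)
```

In practice a stop-word list would likely be applied before ranking, so that words such as "for" and "with" do not compete with domain terms.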
Initial Project Proposal
Due: February 7th
Proposal: [doc]
Resources Used
- Lucene: http://lucene.apache.org/ (Java-Based Search Engine)
- Stanford POS Tagger