MultiGen

MultiGen is a multi-document summarization tool developed at Columbia University.

Many sources describe the same event in a similar manner (e.g., on-line news sources), so a single summary of multiple related documents would be helpful to the end user. Multi-document summarization could be useful, for example, in large information retrieval systems, where it can help the user determine which documents are relevant. Such summaries can reduce the amount of reading by synthesizing information common to all retrieved documents and by explicitly highlighting the differences among them.

We are developing a multi-document summarization system, MultiGen, to automatically generate a concise summary by identifying similarities and differences across a set of related documents. Input to the system is a set of related documents, such as those retrieved by a search engine in response to a particular query. Our work to date has focused on generating a summary of the similarities across documents. Our approach uses machine learning over linguistic features extracted from the input documents to identify several groups of paragraph-sized text units, such that all units in each group convey approximately the same information. Shallow linguistic analysis and comparison of the phrases in these units are then used to select the phrases that adequately convey the shared information. This task is performed by the content planner of the language generation component, and it determines the summary content. Sentence planning and generation are then used to combine the selected phrases into a coherent whole.
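
The grouping and comparison steps described above can be illustrated with a small sketch. This is not MultiGen's implementation: it assumes plain word-overlap (Jaccard) similarity in place of the learned similarity over linguistic features, and a set intersection of words in place of phrase-level comparison. The function names (`tokens`, `jaccard`, `group_units`, `shared_words`) and the threshold value are hypothetical and chosen only for illustration.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercased word set; a crude stand-in for real linguistic features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two token sets, in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_units(units, threshold=0.4):
    """Group text units by single-link clustering.

    Any two units whose similarity meets the (assumed) threshold are
    merged into the same group, using a small union-find structure.
    Returns a sorted list of groups, each a list of unit indices.
    """
    parent = list(range(len(units)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    toks = [tokens(u) for u in units]
    for i, j in combinations(range(len(units)), 2):
        if jaccard(toks[i], toks[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(units)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def shared_words(units, group):
    """Words that occur in every unit of a group (a toy 'intersection')."""
    return set.intersection(*(tokens(units[i]) for i in group))
```

Given the three units "The plane crashed near the airport on Monday.", "On Monday the plane crashed close to the airport.", and "Officials said the investigation will take months.", `group_units` places the first two in one group and the third in its own, and `shared_words` on the first group returns the words the two units have in common. A real system would compare phrases rather than bare words and would pass the selected content to sentence planning and generation.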