MultiGen is a multi-document summarization tool developed at Columbia University.
Because many sources describe the same event in a similar manner
(e.g., online news sources), end users would benefit from a single
summary of multiple related documents. Multi-document summarization could
be useful, for example, in large information retrieval systems, where it
can help determine which documents are relevant. Such summaries can cut
down on the amount of reading by synthesizing the information common to
all retrieved documents and by explicitly highlighting the distinctions
among them.
We are developing a multi-document summarization system, MultiGen, to
automatically
generate a concise summary by identifying similarities and differences
across a set of related documents. Input to the system is a set of related
documents, such as those retrieved by a
search engine in response to a particular query. Our work to date
has focused on generating a summary including similarities across
documents. Our approach uses machine learning over linguistic features
extracted from the input documents to identify several groups of
paragraph-sized text units so that all units in each group convey
approximately the same information. Shallow linguistic analysis and
comparison between phrases of these units are used to select the phrases
that adequately convey the shared information. This task is performed
by the content planner of the language generation component and
results in the determination of the summary content. Sentence planning and
generation are then used to combine the phrases into a coherent
whole.
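
The grouping and selection steps described above can be illustrated with a
small sketch. This is not MultiGen's implementation: plain word overlap
(Jaccard similarity) stands in for the learned similarity over linguistic
features, single-link grouping stands in for the system's clustering, and
shared words stand in for the selected phrases. All function names and the
0.5 threshold are hypothetical choices made for illustration.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercased word set; a crude stand-in for real linguistic features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity in [0, 1] between two text units."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def group_units(units, threshold=0.5):
    """Group units linked by pairwise similarity >= threshold (single-link)."""
    parent = list(range(len(units)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(units)), 2):
        if jaccard(units[i], units[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(units)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def shared_words(units, group):
    """Words that appear in every unit of a group (assumed proxy for phrases)."""
    return set.intersection(*(tokens(units[i]) for i in group))


if __name__ == "__main__":
    units = [
        "The storm hit the coast on Monday",
        "On Monday the storm hit the coast hard",
        "Stock prices fell sharply in Tokyo",
    ]
    for group in group_units(units):
        print(group, sorted(shared_words(units, group)))
```

In this toy input the first two units land in one group and the third stays
alone. A real system would replace the word-overlap score with a trained
classifier and would compare phrases rather than isolated words.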