ABSTRACT
We describe a topic tracking system developed at GE R&D Center in connection with our participation in DARPA TDT evaluations. The TDT tracking task is specified as follows: given Nt training news stories on a topic, the system must find all subsequent stories on the same topic in all tracked news sources. These sources include radio and television news broadcasts, as well as newswire feeds. The initial set of training stories (usually 1, 2 or 4) is the only information about the topic available to the tracking system (1). The tracking performance is gauged using the False Alarm Rate and the Miss Rate metrics, reflecting the incidence of incorrect classification decisions made by the automatic system.
1. OVERALL DESIGN
GE.Tracker was developed over the course of a few weeks during the summer of 1998. It is designed to achieve reasonably high-accuracy performance using lightweight, extremely portable, and robust algorithms that rely on content compression rather than on corpus statistics to detect relevance and assess the topicality of the source material. The tracker operates by first creating a topic tracking query (TQ) out of the available training stories. Subsequently, it accepts incoming stories, summarizes them topically, scores the summaries for content, and then assesses content relevance to the tracking query. Stories whose compressed content summaries clear an empirically established threshold are classified as being "on topic".
2. BUILDING TOPIC TRACKING QUERY
The topic tracking query (TQ) is built out of the available training material, which consists of Nt news stories on a topic of interest. In the TDT2 evaluation, the default value of Nt was 4 topical stories. The initial tracking query, TQ0, is formed out of the most frequent non-stop words and collocations found in the training set. The frequency threshold is set so that the words selected for TQ0 occur at least Nt-1 times in the training set (the exact formula will be explained in the final paper). Collocations are all pairs of TQ0 terms that occur together in 2 or more training stories. At present, all within-story co-occurrences are collected; more advanced proximity or co-dependency calculations are planned for the future.
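To make the construction concrete, here is a minimal sketch in Python; the whitespace tokenization, the tiny stoplist, and all function and variable names are our own simplifications, not GE.Tracker's actual code:

    from collections import Counter
    from itertools import combinations

    STOP_WORDS = {"the", "a", "of", "in", "said", "today", "time"}  # assumed stoplist

    def build_tq0(training_stories, nt):
        # Count non-stop words across the Nt training stories.
        tf = Counter(w for story in training_stories
                     for w in story.lower().split() if w not in STOP_WORDS)
        # Keep words occurring at least Nt-1 times in the training set.
        terms = {t for t, c in tf.items() if c >= nt - 1}
        # Collocations: pairs of TQ0 terms co-occurring within a story,
        # kept if they co-occur in 2 or more training stories.
        pair_df = Counter()
        for story in training_stories:
            present = sorted({w for w in story.lower().split() if w in terms})
            pair_df.update(combinations(present, 2))
        collocations = {p for p, c in pair_df.items() if c >= 2}
        return terms, collocations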
All terms and collocations are weighted to reflect their distribution within the training stories. Terms and collocations that occur in every training story are assigned greater weights than those occurring in fewer stories. The overall term frequency in the training set is factored in, but no external statistics are used. In particular, no collection-level term weights, such as inverted document frequency (idf), are computed.(2)
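One weighting consistent with this description pairs a document-frequency factor that peaks for terms present in all Nt stories with a damped training-set frequency; the functional form below is our assumption, not a formula from the paper:

    import math

    def term_weight(tf_t, df_t, nt):
        # tf_t: total frequency of the term in the Nt training stories
        # df_t: number of training stories containing the term
        # nt:   number of training stories
        # Assumed form: terms present in all stories (df_t == nt) get
        # the largest boost; raw frequency enters only logarithmically.
        return (df_t / nt) * math.log(1 + tf_t)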
3. TRACKING INCOMING STORIES
The current version of GE.Tracker assumes that the input consists of a continuous broadcast stream pre-segmented into stories, although this segmentation does not have to be accurate. A future version is planned in which an unsegmented broadcast stream can be used. The incoming stories are either text (newswire stories) or transcripts of audio tracks from television or radio news programs. These transcripts do not have to be accurate either, so the output of an automatic speech recognition system is acceptable; indeed, we report here mostly on tracking results obtained with automatically derived transcripts.
The incoming stories, whether text or speech transcripts, are summarized topically using the GE.Summarizer developed under the Tipster program (Strzalkowski et al., 1999). For the TDT evaluations the summarizer has been modified to derive compressed content capsules of news stories rather than true summaries. Since the capsules are not meant to be human-readable, we make no effort to maintain readability (there are no explicit paragraph or sentence boundaries available in automated broadcast transcripts). Furthermore, a tighter content compression is required than the summarizer normally provides.
Summaries are scored for content density and topic coverage. The scoring method has been designed so that a summary obtains a high score only if it captures the dominating theme of the full story rather than some side aspect. Furthermore, only those high-scoring summaries that "cover" the tracking query TQ0 are selected as representing stories that are on topic. The concept of tracking query coverage is an attempt to quantify content distribution in a highly concentrated, yet naturally flowing news-style language. We use an empirically validated formula that requires specific coverage ratios depending upon the query length and term rankings (e.g., 45% for the 15 top-ranking terms in the query, 20% for the next 10 terms, etc.).
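A sketch of the coverage test under the example ratios quoted above (the band sizes and percentages are those from the text; the code organization and names are ours):

    # (band size, required fraction of the band found in the summary)
    COVERAGE_BANDS = [(15, 0.45), (10, 0.20)]

    def covers_query(summary_terms, ranked_query_terms):
        # summary_terms: set of terms in the content capsule
        # ranked_query_terms: TQ0 terms, highest-weighted first
        start = 0
        for size, required in COVERAGE_BANDS:
            band = ranked_query_terms[start:start + size]
            if not band:        # query shorter than the band layout
                break
            hits = sum(1 for t in band if t in summary_terms)
            if hits / len(band) < required:
                return False
            start += size
        return True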
4. AUTOMATIC SUMMARIZATION
We summarize incoming stories in order to compress their content; specifically, we are only interested in what a given story has to say on the topic being tracked. Currently, 5% topical summaries are derived. This may or may not be an optimal length, and further experiments are needed to determine what the optimal summary length might be for each story. The basic summarization algorithm is outlined below. We present a simplified version of our original text summarization algorithm; the present version ignores all readability considerations we normally impose upon summaries. This allows us to produce shorter, tightly compressed summaries while requiring no special provisions for dealing with a continuous stream of words with no paragraph or sentence boundaries.
The summarizer can work in two modes: generic and topical. In the generic mode, it simply summarizes the main points of the original document. In the topical mode, it takes a user-supplied statement of interest, a topic, and derives a summary related to this topic. A topical summary is thus usually different from a generic summary of the same document. The summarizer can produce both indicative and informative summaries. An indicative summary, typically 5-10% of the original text, retains just enough material from the original document to indicate its content. An informative summary, typically 20-30% of the text, retains all the relevant facts that a user may need from the original document; that is, it serves as a condensed surrogate, a digest. The summarization proceeds in three broad steps: candidate passages are extracted from the word stream, scored (against the topic in topical mode, or against the document's own salient terms in generic mode), and the top-scoring passages are concatenated up to the target length, as sketched below.
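A minimal sketch of such a pipeline, with plain query-term overlap standing in for the actual passage-scoring function (the window size, the scoring, and all names are our simplifications):

    def topical_summary(words, query_terms, ratio=0.05, window=40):
        # Slide a fixed-size window over the raw word stream; no sentence
        # or paragraph boundaries are assumed (cf. broadcast transcripts).
        passages = [words[i:i + window] for i in range(0, len(words), window)]
        # Score each passage by its overlap with the topic query.
        scored = sorted(passages,
                        key=lambda p: sum(1 for w in p if w in query_terms),
                        reverse=True)
        # Concatenate top-scoring passages up to the target compression.
        budget = max(1, int(ratio * len(words)))
        summary = []
        for p in scored:
            if len(summary) >= budget:
                break
            summary.extend(p)
        return summary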
5. ADAPTIVE TRACKING
For tracking long-running topics that evolve over time, adaptive tracking is achieved through continuous modification of the tracking query: TQ0, TQ1, TQ2, etc. This is done by augmenting the initial set of training stories with selected topical stories from the broadcast stream and recomputing the tracking query, as sketched below. In TDT evaluations, adaptive tracking has not proved useful for most topics.
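Schematically, adaptation amounts to re-running the query construction of Section 2 over the enlarged training set (build_tq0 refers to the earlier sketch; this function is likewise only illustrative):

    def adapt_query(training_stories, new_on_topic_story, nt):
        # Add a story classified as on-topic and recompute the query;
        # the result serves as TQ1, TQ2, ... in subsequent tracking.
        training_stories.append(new_on_topic_story)
        return build_tq0(training_stories, nt)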
6. PRELIMINARY RESULTS
Preliminary results of the dry run evaluation (8/98) show that GE.Tracker achieves a 13% miss rate and a 0.8% false alarm rate for automated speech transcription input, and 11% and 0.6%, respectively, for manual speech transcription. These results are already very close to the target values of 10% and 0.1%, respectively. In the final TDT2 run we scored a 14% miss rate and a 1.9% false alarm rate under automated speech transcription conditions, and 13% and 1.4%, respectively, for manually transcribed sources (FDCH data). It may be worth noting that the final evaluation test was harder than the dry run tests, which is reflected in the increased false alarm rates. We are planning to perform a few more contrastive tests before the workshop.
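For reference, the two metrics quoted above are simple error ratios over the system's classification decisions:

    Miss Rate        = (on-topic stories rejected)  / (all on-topic test stories)
    False Alarm Rate = (off-topic stories accepted) / (all off-topic test stories)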
7. EXTENSIONS
We plan to revise how we do tracking by changing the way we compute weights for terms. The revised process is as follows:
7.1. Training
Using the Nt training stories, we remove stop words, stem the remaining words, and calculate term frequencies for the resulting stems. We then form the training query TQ out of the most frequent stems found in the training stories. Each stem tqi is assigned a weight wi reflecting its distribution in the training set.
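As a hypothetical instance (the functional form below is our own illustration, chosen to match the distribution-based weighting of Section 2, not a formula quoted from GE.Tracker):

    wi = (dfi / Nt) * log(1 + tfi)

where tfi is the total frequency of stem tqi in the training set and dfi is the number of training stories in which it occurs.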
7.2. Tracking
During tracking we summarize each story and keep counts of how many stories have been read and of how many stories (summaries) each term in TQ has occurred in:

    N  = number of stories read thus far
    DF = { dfi : number of stories (summaries) in which term i has occurred }
Using this information, we re-compute the weight wi of each term appearing in TQ.
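Since N and dfi are the only statistics maintained during tracking, an inverse-document-frequency style weight suggests itself; one plausible instantiation (again our illustration) is

    wi = log(N / dfi)

under which terms that occur in many tracked stories, topical or not, are progressively downgraded (cf. footnote 2).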
7.3. Discussion
The reader may notice that in this revised model the system learns as it tracks more stories, so its performance is expected to improve over time. To improve initial performance, the tracker can be pre-trained on similar material before any tracking begins; in TDT-2, the stories preceding the last training story for each topic could be used for this purpose. The more pre-training material is available, the better; note, however, that pre-training is required only when the tracker is used for the first time, or when it is restarted after a long idle period. In general, pre-training on old news (a few months old) is likely to be less effective than pre-training on more recent material.
Note also that our training method does not distinguish between on-topic and off-topic stories while gathering term statistics. This extra information was available (nominally) in TDT-2, but may not be available in TDT-3. In general, adaptive training using positive and negative examples, as opposed to overall distribution statistics, should converge faster and produce better tracking queries, as evidenced by relevance feedback methods used in document retrieval. However, it is unrealistic to expect that an appropriately varied set of samples will be available in news tracking applications.
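The classical instance of such feedback in document retrieval is the Rocchio update, which moves the query vector toward the positive examples and away from the negative ones:

    Q' = a*Q + (b/|R|) * sum_{d in R} d - (c/|S|) * sum_{d in S} d

where R and S are the sets of on-topic and off-topic example vectors, and a, b, c are tuning constants.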
REFERENCES
Strzalkowski, T., G. Stein, and B. Wise. 1999. Robust, Practical Text Summarization. To appear in M. Maybury and I. Mani (eds.), Advances in Text Summarization.
Strzalkowski, T., G. Stein, and B. Wise. 1998. Natural Language Information Retrieval: TREC-7 Report. In Proceedings of the 7th Text Retrieval Conference (TREC-7). NIST.
FOOTNOTES
(1). In the recently completed TDT2 evaluation, a large set (on the order of hundreds) of non-topical stories was also available. These were the stories reported before and in between the topical training stories in the continuous broadcast streams. This additional information could be used in training the tracking system, although it made the test slightly less realistic. GE.Tracker does not use the non-topical training stories.
(2). Determining the "typical" term distribution within news stories is useful for eliminating or "downgrading" terms that are relatively common across many unrelated stories. In addition to the usual list of stopwords (such as determiners, prepositions, and various pro-forms), we identify and eliminate a certain number of words commonly used in news reporting: "said", "today", "time", etc.