CASS preprocessing of articles. We preprocess the articles using some well-known NLP tools. First, the corpus of articles are run through CASS (http://www.sfs.nphil.uni-tuebingen.de/~abney/), a well-known deterministic partial parser for large amounts of text. It parses sentences using a grammar (which was defined by us) into fragments in a machine-readable form. The grammar is defined bottom-up, and looks like a combination of BNF and regular expressions:
:np NOM = nx | mx | cdx | place | person | name | ci-st | doll | date | today; np -> only? (NOM | nmess) pos h=NOM?;The :np delineates this section as a noun-phrase section; then, the noun phrase can be developed from any of the rules such as nx, mx, cdx, and so on, which have already been defined ahead of time in the grammar. This allows us to build up a grammar from the low-level constituients into higher-level phrases and finally to sentences and entire documents. Having done this, we can then run CASS on the text, and output similar to the following excerpt is generated.
[nbest [nx [name [nnp Clinton]]]] [dsh :] [nbest [nx [prp It]]is a brief excerpt of CASS output for a text article. As you can see, Clinton is assigned a noun term, which is classified within name, and so on. Classifications are made based on the grammar and a tagfile, which exists to eliminate ambiguities about certain terms (e.g. CASS has no knowledge about the word "Clinton", so we can add a tag for it). CASS is also extremely useful because it has a part-of-speech tagger. All told, after having specified the desired grammar it takes care of most of the necessary sentence parsing.
The major modification made to the CASS POS tagger was to recognized user defined temporal words, and their surrounding events. Six words were defined as key words, and were designated (with the option of modifiers) as temporal words. These words are: before, after, during, until, while, and meanwhile. In addition, a second new classification was devised. This classification is an umbrella noun, it recognizes sets of noun phrases that make up an event occuring before or after a temporal event.
The next stage in processing was to search the CASS tagged article and extract the events and temporal phrases and pair them appropriately. A parse program was written in C that searched the CASS tagged file for an event. Once an event was found, the parser searched for a temporal phrase. If one was found, a temporal relation was writen to the output file. If not, the parser reset and searched for another another temporal event.
Conversion into the temporal network format. The output generated by the C program to parse the CASS input was then turned over to a Java program which read in the data and translated it into a temporal format, by using the appropriate Allen constraints. At this point, the data was then fed into the network, and propagation was completed.
Flattening of the temporal network. The next step, and perhaps one of the trickiest, was to take the temporal network and extract the phrases in chronological order, since this requires a "flattening" of the temporal network. A complete flattening algorithm is non-trivial; after several attemps, a simpler, recursive greedy algorithm was adopted. The algorithm works as follows:
flattenNetwork(network net) { while(there are unprocessed nodes in net) { nextNode = first unprocessed node in net; place nextNode randomly; flag nextNode as having been processed; process(nextNode); } return array of processed nodes; } process(node n) { scan all relationships of n; while(any nodes related to n are unprocessed) { newNode = next unprocessed node related to n; determine constraint with newNode; place newNode appropriately; flag newNode as having been processed; process(newNode); } }This algorithms is no by means complete. Since it is greedy, it will not go back and check to see if the constraints keep on being maintained. However, the advantage of this algorithm is that it is straightforward, fast, and that it works quite decently for most practical domains.
Having done that, we simply output the flattened array.
--> [0]: 600 sorties over the previous 24 hours --> [1]: the measure to Clinton by week 's end --> [2]: 240 bombing runs made --> [3]: its effort --> [4]: That would let him wage his fight later --> [5]: Belgrade embassy attacked prgrph --> [6]: the first air raid on the Yugoslav capital in three days --> [7]: President Boris Yeltsin said Russia could quit its role in --> [8]: Russia --> [9]: House members argued that --> [10]: The Chinese embassy in Belgrade was damaged --> [11]: a NATO airstrike late Friday --> [12]: Fish and Wildlife officials --> [13]: blocked a species from being listed as --> [14]: it '' prgrph --> [15]: The embassy was damaged --> [16]: Kosovo --> [17]: our bombing campaign goes on --> [18]: our demands are met --> [19]: NATO has gone through in --> [20]: one is going to come --> [21]: you and --> [22]: UN administration for the province --> [23]: Friday 's bombing --> [24]: were made only hours --> [25]: the House approved $13 billion last week --> [26]: the NATO airstrikes --> [27]: House Republicans balked --> [28]: anti-violence programs in schools went --> [29]: more might be added --> [30]: the measure is finished --> [31]: a crackdown in February --> [32]: special police it had deployed in Kosovo --> [33]: commercial fishing in the waters off Glacier Bay National Park --> [34]: courts decide the issue --> [35]: a framework for self-rule within Yugoslavia can be developed --> [36]: is that --> [37]: YELTSIN MADE his remarks --> [38]: a meeting of his Security Council held hours after --> [39]: the United Nations -- not --> [40]: Chinese demand UN meetingSome problems are immediately apparent, and are discussed below.
Problems. It turns out that this particular approach to the problem does not generate the best results, for a number of reasons:
Before, Kosovars and Serbs lived in harmony for many years.The question that must be asked is before WHAT? Unfortunately, this is taken from the context of the situation, and assumes readers' prior knowledge. The temporal constraint network does not alleviate this particular situation. Other problems include situations with words like while and meanwhile:
Meanwhile, a bomb fell.Again, the same problem is encountered.