Due: Monday, May 5, 2008, by 2:40pm.
In this homework, you will design and build your own speech understanding system for reserving train tickets. You should assume as input a grammatical English utterance (one sentence only) that indicates the departure and destination cities, and the departure day and time. We will provide training data to help you test your system, but you should also record your own utterances for further testing.
The system will consist of two main components:
(a) An Automatic Speech Recognition (ASR) component: We provide a script that builds the ASR component using HTK (an HMM toolkit). The ASR acoustic models will be trained on the TIMIT, BDC, and Columbia Games corpora. The input to this component is a wav file (audio format: mono, sampling rate: 8 kHz), and the output will be the automatic transcript in MLF file format (see an example below).
(b) An Understanding component: The input to this component is the ASR transcript from (a), and the output will be a table containing the following concepts, which your system must extract automatically from the transcript:
Departure city:
Destination city: (same set of cities as above)
Departure day: Sunday, Monday, …, Saturday
Departure Time: Morning, Noon, Afternoon, Evening, Night, Anytime
Here are two examples.

Example 1. Given the utterance:

I would like a ticket from

the output of your system should be:

Departure city:	
Destination:	
Day:	Friday
Time:	Morning
Example 2. Given the utterance:

I need to go to

the output of your system should be:

Departure city:	
Destination:	
Day:	Monday
Time:	Evening
You should create a grammar that covers as many different ways of expressing these sorts of requests as you can think of, so that your system is flexible. However, your grammar should be limited enough that the ASR perplexity is not so high as to hurt recognition performance. Part of the assignment is experimenting to find the trade-off between coverage and performance that works best. Note that your success will be judged on concept accuracy, not transcription accuracy.
Here is an example of a grammar that covers the two examples above.
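(The grammar example itself did not survive in this copy. Purely as an illustrative sketch, a grammar covering the two examples might look like the following in HTK's HParse notation, where $var defines a sub-network, | separates alternatives, and [ ] marks optional words. The city names are placeholders; substitute the city set specified for the assignment.)

```
$city = NEW_YORK | BOSTON | PHILADELPHIA;
$day  = SUNDAY | MONDAY | TUESDAY | WEDNESDAY | THURSDAY | FRIDAY | SATURDAY;
$time = MORNING | NOON | AFTERNOON | EVENING | NIGHT | ANYTIME;
( SENT-START
  ( I WOULD LIKE A TICKET FROM $city TO $city [ON] $day [IN THE] $time |
    I NEED TO GO TO $city [FROM $city] [ON] $day [IN THE] $time )
  SENT-END )
```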
To build the ASR component, run the following commands:
1. cd /proj/speech/users/cs4706/asrhw
2. mkdir USERNAME (e.g., fb2175)
3. cd USERNAME
# The following command will take about 2 hours! While it is running, you should read Chapters 1, 2, and 3 of the HTK book to understand what the script is doing.
4. /proj/speech/users/cs4706/tools/htk/htk/asr/train-asr.sh USERNAME
# When this command completes, you have your speech recognizer ready. The acoustic models (monophones and triphones) are trained on the TIMIT, BDC, and Games corpora.
Next, test your ASR system:
1. mkdir /proj/speech/users/cs4706/asrhw/USERNAME/test/
2. Record the two utterances above as wav files (8 kHz, mono) in Praat. For best recognition performance, leave ~1 second of silence at the beginning and ~1 second at the end of each file when recording. Name your files test1.wav and test2.wav and save them in /proj/speech/users/cs4706/asrhw/USERNAME/test/.
Also save your grammar to a file named gram (not gram.txt) in /proj/speech/users/cs4706/asrhw/USERNAME.
3. cd /proj/speech/users/cs4706/asrhw/USERNAME
4. Run /proj/speech/users/cs4706/tools/htk/htk/asr/recognizePath.sh USERNAME ./test
# Note that the script in step 4 takes a path as its argument and runs the recognizer on all the wave files in that directory. Feel free to change the script to accept a filename as its argument, so that you get output for each utterance in a separate file.
5. Check the output of your recognizer in /proj/speech/users/cs4706/asrhw/USERNAME/out.mlf.
Now you have a speech recognizer that takes a speech wav file (or a directory containing a set of wav files) and generates the transcript in MLF file format.
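(The linked MLF example did not survive in this copy. In general, an HTK master label file has a header line, a quoted label-file name for each utterance, one recognized word per line, and a terminating period; an illustrative transcript for test1.wav might look like this:)

```
#!MLF!#
"*/test1.rec"
I
WOULD
LIKE
A
TICKET
FROM
.
```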
Next, you must write a program that takes a wav file, runs the ASR system, and generates the concept table shown in Examples 1 and 2 above. Put a tab between the field name and its value and a newline after each concept/value pair. (Your scripts must be able to run on Speech Lab machines, so be sure there are no version conflicts or other issues with them. Test them before submission on a lab machine such as vox, voix, veux, fluffy, ….)
Example:
RecognizeConcepts.sh ~/test/test2.wav
Output example:

Departure city:	
Destination:	
Day:	Monday
Time:	Evening
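As one possible starting point for the understanding component, here is a minimal Python sketch. It is illustrative only, not the provided code: the MLF parsing, the simple "word after FROM/TO" heuristics, and the city name BOSTON are all assumptions; in a real system your grammar's city list should drive the matching.

```python
# Illustrative sketch of the understanding step (not the provided scripts).
# Assumptions: MLF layout as in the example above; keyword heuristics for
# cities; BOSTON is a placeholder city name.

DAYS = {"SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY",
        "THURSDAY", "FRIDAY", "SATURDAY"}
TIMES = {"MORNING", "NOON", "AFTERNOON", "EVENING", "NIGHT", "ANYTIME"}

def parse_mlf(text):
    """Return the recognized words from an HTK MLF transcript."""
    words = []
    for line in text.splitlines():
        line = line.strip()
        # Skip the header, per-utterance label-file lines, and the final period.
        if line in ("#!MLF!#", ".") or line.startswith('"'):
            continue
        if line:
            words.append(line.split()[-1])  # last field is the word label
    return words

def extract_concepts(words):
    """Map a word sequence to the four concept fields; unmatched stay empty."""
    concepts = {"Departure city": "", "Destination": "", "Day": "", "Time": ""}
    upper = [w.upper() for w in words]
    for i, w in enumerate(upper):
        if w in DAYS:
            concepts["Day"] = w.capitalize()
        elif w in TIMES:
            concepts["Time"] = w.capitalize()
        elif w == "FROM" and i + 1 < len(upper) and upper[i + 1] not in DAYS:
            concepts["Departure city"] = upper[i + 1].capitalize()
        elif w == "TO" and i + 1 < len(upper):
            concepts["Destination"] = upper[i + 1].capitalize()
    return concepts

def format_table(concepts):
    # Tab between field name and value, newline after each pair, as required.
    return "".join(f"{field}:\t{value}\n" for field, value in concepts.items())

mlf = '#!MLF!#\n"*/test2.rec"\nI\nNEED\nTO\nGO\nTO\nBOSTON\nON\nMONDAY\nEVENING\n.\n'
print(format_table(extract_concepts(parse_mlf(mlf))))
```

Note that a naive "word after TO" rule misfires on phrases like "need to go"; the sketch lets the last TO-phrase win, which handles that case, but matching against your grammar's actual city list is more robust.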
We will test your system on the training data provided above as well as a test set in the train reservation domain spoken by the same speaker. You will be graded based on grammar coverage and concept accuracy on the training and test data. Note that your system should be flexible enough to recognize the test utterances, which will be grammatical English sentences.
Submission: You should submit 3 files (possible points for each in parentheses):
A readme.txt file that explains how to run your program with a command line example. This file should also briefly explain the coverage of your grammar and any heuristics you employed or other interesting aspects of your approach. (10 points)
gram: a file containing your grammar in the format specified above (20 points)
Your program that runs the ASR and extracts concepts (see (b) above) (20 points)
(Include a make.sh file to compile your code if necessary.)
***
Upload these files in one zip file USERNAME.zip (e.g., fb2175.zip) to Courseworks.
The remaining 50 points will be based on your system’s concept accuracy on the training and test data.