Festvox and Clustergen Scripts for Training Voices from Babel Data

Many thanks to Alan Black for providing these scripts.

Please note that these instructions are meant for Columbia Speech Lab students only.

Run these scripts on kucing, where all the dependencies are installed and working. The old instructions for full clustergen voice training can be found here. However, since we mainly use these scripts for frontend processing (to get utts), the instructions below cover only that.

Data and Setup

Always do this first, or make sure these are in your .bashrc (recommended):

export ESTDIR=/proj/tts/tools/babel_scripts/build/speech_tools
export FESTVOXDIR=/proj/tts/tools/babel_scripts/build/festvox
export SPTKDIR=/proj/tts/tools/babel_scripts/build/SPTK
export BABELDIR=/proj/tts/data/babeldir
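A quick sanity check before starting can catch a missing export. The sketch below is not part of the Babel scripts: check_babel_env is a hypothetical helper that only confirms each of the four variables above is set and names an existing directory.

```shell
# Hypothetical helper (not part of the Babel scripts): verify that the four
# variables are set and that each one names an existing directory.
check_babel_env() {
    status=0
    for name in ESTDIR FESTVOXDIR SPTKDIR BABELDIR; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name=$value is not a directory" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, run check_babel_env && echo "environment looks OK" after sourcing your .bashrc.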

You will need to make sure that the language's Babel data is present in $BABELDIR. E.g., to add Amharic:

ln -s /proj/speech/corpora/babel/IARPA/IARPA-babel307b-v1.0b-build/BABEL_OP3_307 /proj/tts/data/babeldir/

You'll have to find the original directory under /proj/speech/corpora/babel by looking around, as each is named somewhat differently. The numerical language codes for each language can be found here.
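Searching by the numerical language code can shorten the looking around. This is only a sketch: find_babel_pack is a hypothetical helper, and it assumes that pack directories are named BABEL_<something>_<code> (as in BABEL_OP3_307 and BABEL_BP_105) and sit at most three levels below the corpus root, as in the Amharic example. Packs laid out differently will not be found this way.

```shell
# Hypothetical helper: list candidate Babel pack directories for a language code.
# Assumes directory names of the form BABEL_<something>_<code>, at most three
# levels below the corpus root (both inferred from the examples in this page).
find_babel_pack() {
    root=$1   # e.g. /proj/speech/corpora/babel
    code=$2   # numerical language code, e.g. 307
    find "$root" -maxdepth 3 -type d -name "BABEL_*_${code}"
}
```

For example, find_babel_pack /proj/speech/corpora/babel 307 should print the Amharic directory used in the ln -s command above, if the layout matches these assumptions.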

Then, when you run each of the voice-training commands, replace BABEL_BP_105 (the Turkish language directory name) with the directory for your new language everywhere it appears, and substitute the name of your voice directory for turkish.

These scripts are supposed to work on the Babel language packs as-is, and for the most part they do. However, we have run into issues with languages that have both .wav and .sph-format audio data, since the scripts expect .sph data only. (.sph files are telephone conversations; .wav files come from other recording conditions.) So before you start on a new language, check whether there are .wav files mixed in with the audio data under
$BABELDIR/[yourlanguagecode]/conversational/training/audio
If so, create two directories of your own: one containing just the .sph files, and another containing just the .txt transcript files that correspond to those .sph files. Use them in place of $BABELDIR/[yourlanguagecode]/conversational/training/audio and $BABELDIR/[yourlanguagecode]/conversational/training/transcription, respectively, in all of the commands that require them.
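One possible way to build the two .sph-only directories is with symlinks, so the original language pack stays untouched. This is a sketch rather than the procedure we actually used (see /proj/tts/examples/move_txt.py for that): make_sph_only is a hypothetical helper, and it assumes each transcript shares its audio file's basename (foo.sph and foo.txt). It has not been confirmed that every script follows file-level symlinks, so replace ln -sf with cp if that turns out to be a problem.

```shell
# Hypothetical helper: build .sph-only versions of the audio and transcription
# directories using symlinks. Pass absolute paths so the links resolve.
# Assumes each transcript shares its audio file's basename (foo.sph <-> foo.txt).
make_sph_only() {
    audio_dir=$1   # original .../conversational/training/audio
    trans_dir=$2   # original .../conversational/training/transcription
    out_audio=$3   # new directory for the .sph files only
    out_trans=$4   # new directory for the matching .txt files
    mkdir -p "$out_audio" "$out_trans"
    for sph in "$audio_dir"/*.sph; do
        [ -e "$sph" ] || continue   # glob matched nothing: no .sph files
        base=$(basename "$sph" .sph)
        if [ -f "$trans_dir/$base.txt" ]; then
            ln -sf "$sph" "$out_audio/$base.sph"
            ln -sf "$trans_dir/$base.txt" "$out_trans/$base.txt"
        else
            echo "no transcript for $base.sph, skipping" >&2
        fi
    done
}
```

To see whether a language needs this at all, count the .wav files first, for example with ls on the audio directory piped through grep -c '\.wav$'. The output directory names are up to you; for Telugu we used telugu_sphonly and telugu_sphtxt.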

Voice Setup:

e.g. for Turkish:

cd /proj/tts/tools/babel_scripts
mkdir yourusername
cd yourusername
cp ../make_build .
mkdir turkish
cd turkish
../make_build setup_voice turkish \
    $BABELDIR/BABEL_BP_105/conversational/reference_materials/lexicon.txt \
    $BABELDIR/BABEL_BP_105/conversational/training/transcription \
    $BABELDIR/BABEL_BP_105/conversational/training/audio

Check Phoneset:

Under the festvox directory, check the phoneset file to make sure there are no phonemes with special characters that will break things later on. Brackets should have gotten replaced already, but we have also been replacing things like underscores (just removing them) and tildes (replace with TL).

Also, make sure that all the vowels are in fact set as vowels in the phoneset file. Any vowel that's not already in the default Festival phoneset ('radio') will not be marked as a vowel automatically. Check the LSP file for the language if you are unsure.

Also, in the Babel lexicon files, the symbol # is commonly used to denote word boundaries. This should get converted to wb because # is a delimiter character in the label file format.

Check Lexicon:

Check the file cmu_babel_lex.out by find-and-replacing any phonemes that you've renamed in the phones file (make sure to ONLY replace them on the phoneme side, not on the word side).
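A plain find-and-replace cannot tell the word side from the phoneme side, so a small filter can help. The sketch below makes a loud assumption about the file format, which you should check against your own cmu_babel_lex.out: each entry starts with the word in double quotes, the word contains no embedded double quotes, and the phonemes after it are separated by spaces or parentheses. rename_phone is a hypothetical helper, not one of the Babel scripts.

```shell
# Hypothetical helper: rename one phoneme on the phoneme side only.
# ASSUMPTION: each line starts with the word in double quotes, and everything
# after the closing quote is the phoneme side, with phonemes delimited by
# spaces, tabs, or parentheses. Reads the lexicon on stdin, writes to stdout.
rename_phone() {
    OLD_PHONE=$1 NEW_PHONE=$2 awk '
    BEGIN { target = ENVIRON["OLD_PHONE"]; repl = ENVIRON["NEW_PHONE"] }
    {
        q1 = index($0, "\"")                    # opening quote of the word
        if (q1 == 0) { print; next }
        q2 = index(substr($0, q1 + 1), "\"")    # closing quote, relative to q1
        if (q2 == 0) { print; next }
        head = substr($0, 1, q1 + q2)           # word side, left unchanged
        tail = substr($0, q1 + q2 + 1)          # phoneme side
        out = ""; tok = ""
        n = length(tail)
        for (i = 1; i <= n; i++) {
            c = substr(tail, i, 1)
            if (c == " " || c == "\t" || c == "(" || c == ")") {
                out = out (tok == target ? repl : tok) c
                tok = ""
            } else {
                tok = tok c
            }
        }
        out = out (tok == target ? repl : tok)
        print head out
    }'
}
```

For example, rename_phone v vBS < festvox/cmu_babel_lex.out > cmu_babel_lex.new writes a renamed copy; compare it against the original with diff before moving it into place.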

Also check whether there are any weird characters on the word side. E.g. for Lithuanian, letters which were spoken as letters were in the lexicon like this: /C/ /D/ /T/ etc. This broke the scripts, and the fix was to remove the slashes in the lexicon entries.

Segment audio into utterances:

Back in the top-level directory for your language, run these commands one by one:

If you are working with conversational data:

../make_build make_raw_waves /path/to/babel/audio
../make_build make_prompts /path/to/babel/transcripts
../make_build reduce_prompts etc/txt.done.data.all
../make_build make_extract_subutts etc/txt.done.data
./bin/do_build parallel build_prompts etc/txt.done.data
./bin/do_build label etc/txt.done.data
./bin/do_clustergen parallel build_utts etc/txt.done.data

If you are working with scripted data:

../make_build make_raw_waves /path/to/babel/audio      This should create recording/*.wav

../make_build make_scripted_prompts /path/to/babel/transcripts      This should create etc/txt.done.data.all

../make_build reduce_prompts etc/txt.done.data.all      This should create etc/txt.done.data

../make_build clean_conv_subutts      This does some audio cleanup and should create wav/*.wav

./bin/do_build parallel build_prompts etc/txt.done.data      This creates prompt-utt/*.utt and prompt-lab/*.lab

./bin/do_build label etc/txt.done.data      This does EHMM alignment. It takes a long time. Save the output when done so you can get the log likelihoods later on.

./bin/do_clustergen parallel build_utts etc/txt.done.data      This produces utterance files in festival/utts/*.utt
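Since the alignment step runs for a long time and its output is needed later, it helps to capture the output in a file as it runs. The sketch below is one way to do that; run_logged and the log file name label.log are hypothetical, not part of the Babel scripts.

```shell
# Hypothetical helper: run a command, save its stdout and stderr to a log
# file, and return the command's own exit status.
run_logged() {
    log=$1
    shift
    status=0
    "$@" > "$log" 2>&1 || status=$?
    echo "exit status $status, output saved to $log"
    return "$status"
}
```

For example, run_logged label.log ./bin/do_build label etc/txt.done.data keeps the alignment output in label.log; you can watch progress from another terminal with tail -f label.log.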

[[TODO this is still buggy]]

Your .utt files should be present under festival/utts.
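To confirm that every prompt actually produced an utterance file, the prompt list can be compared against festival/utts. This is a sketch: list_missing_utts is a hypothetical helper, and it assumes etc/txt.done.data uses the usual festvox prompt format, with one ( utt_id "text" ) entry per line.

```shell
# Hypothetical helper: print the utterance IDs listed in a prompt file that
# have no matching .utt file.
# ASSUMPTION: one `( utt_id "text" )` entry per line (usual festvox format).
list_missing_utts() {
    prompts=$1   # e.g. etc/txt.done.data
    utt_dir=$2   # e.g. festival/utts
    # Strip the leading parenthesis, then take the first field as the ID.
    awk '{ sub(/^[ \t]*\([ \t]*/, ""); if (NF > 0) print $1 }' "$prompts" |
    while read -r id; do
        [ -f "$utt_dir/$id.utt" ] || echo "$id"
    done
}
```

For example, list_missing_utts etc/txt.done.data festival/utts prints nothing when every prompt has a .utt file.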

Languages we have run these scripts on so far, and issues we ran into:

Turkish
Amharic
Telugu conversational:
- Had the same .wav/.sph issue as Amharic. See /proj/tts/examples/move_txt.py for an example of separating out the txt files which only correspond to the .sph files. For Telugu, use telugu_sphonly and telugu_sphtxt.
- Had a similar phoneset issue to Amharic -- v\ was getting written as v in some places and vBS in other places. This was fixed by find-and-replacing v with vBS in festvox/cmu_babel_lex.out after setup and before running training.
- More notes on Amharic here.
Bengali conversational:
- Only built frontend and got .utt files for conversational data; did not train full voice.
- Made some changes to the phoneset (in both the phoneset file and the lexicon): underscores were removed, and tildes were replaced with TL.
- The lexicon contains some entries like <hes>, and the scripts replaced the brackets with LT and GT (which is incorrect), so I put the brackets back (on the word side of the lexicon).
Bengali scripted:
- EHMM alignment somehow fails on this data. Will have to debug sometime.
Tamil conversational:
- Only built frontend and got .utt files for conversational data; did not train full voice.
- Made some changes to the phoneset (in both the phoneset file and the lexicon): changed 4 to four; : to CL.
- Got this error at some point:
  'phoneme v has no duration info'
  so copied phoneme vBS as v in the phoneset. (This is likely due to an error in the scripts somewhere.)
Lithuanian conversational:
- Had to separate .wav and .sph.
- Phoneme renamings: ' to j; 5 to five; : to CL; { to LB; _ to nothing; # to wb (to match the other languages).
- Some of the words in the lexicon contained characters that broke things later on, in particular letters pronounced in isolation, like /C/, /D/, /T/, etc. I removed the slashes from the lexicon, and also had to make a new copy of the transcripts with these slashes removed.
- Some audio files produced segfaults during acoustic feature extraction; there were few enough of them that I just excluded them from the training data.
- Current status: label files were created and acoustic feature files produced, but the acoustic model fails during training. To be debugged....
