Please note that these instructions are meant for Columbia Speech Lab students only.
Run these scripts on kucing, where all the dependencies are installed and working. The old instructions for full clustergen voice training can be found here; however, since we mainly use these scripts for frontend processing (to get utts), the instructions below cover only that.
export ESTDIR=/proj/tts/tools/babel_scripts/build/speech_tools
export FESTVOXDIR=/proj/tts/tools/babel_scripts/build/festvox
export SPTKDIR=/proj/tts/tools/babel_scripts/build/SPTK
export BABELDIR=/proj/tts/data/babeldir
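Before going further, it can help to confirm that the four variables above are set and point at real directories. The following is a minimal sketch; check_babel_env is a hypothetical helper name and is not part of the Babel scripts.

```shell
# check_babel_env (hypothetical helper, not part of babel_scripts):
# reports every variable that is unset or does not point at a directory.
check_babel_env() {
    status=0
    for name in ESTDIR FESTVOXDIR SPTKDIR BABELDIR; do
        # Look up the value of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            status=1
        fi
    done
    return $status
}
```

After the exports, running check_babel_env should print nothing and return 0 if everything is in place.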
You will need to make sure that the language's Babel data is present in $BABELDIR. E.g., to add Amharic:
ln -s /proj/speech/corpora/babel/IARPA/IARPA-babel307b-v1.0b-build/BABEL_OP3_307 /proj/tts/data/babeldir/
You'll have to find the original directory under /proj/speech/corpora/babel by looking around, as each is named somewhat differently. The numerical language codes for each language can be found here.
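One way to do the looking around is to list every directory whose name starts with BABEL_. This is only a sketch: list_babel_dirs is a hypothetical helper, and the search depth of 4 is a guess based on the Amharic path above, so it may need adjusting.

```shell
# list_babel_dirs [ROOT] (hypothetical helper):
# lists directories named BABEL_* under ROOT, which defaults to the corpus root.
# The -maxdepth value is an assumption; raise it if a language does not show up.
list_babel_dirs() {
    find "${1:-/proj/speech/corpora/babel}" -maxdepth 4 -type d -name 'BABEL_*' 2>/dev/null | sort
}
```

Calling list_babel_dirs with no arguments searches /proj/speech/corpora/babel; the numeric part of each directory name can then be matched against the language codes.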
Then, when you run each of the commands for voice training, replace BABEL_BP_105 (the Turkish language directory name) with the directory for your new language everywhere it appears, and substitute the name of your voice directory for turkish.
These scripts are supposed to work on the Babel language packs as-is, and for the most part they do. However, we have run into issues with languages that have both .wav and .sph audio data, since the scripts expect .sph data only. (.sph files are telephone conversations; .wav files come from other recording conditions.) So before you start on a new language, check whether there are .wav files mixed in with the audio data under
$BABELDIR/[yourlanguagecode]/conversational/training/audio
If so, create two directories of your own: one containing just the .sph files, and another containing just the .txt transcript files that correspond to those .sph files. Use those directories instead of
$BABELDIR/[yourlanguagecode]/conversational/training/audio
and $BABELDIR/[yourlanguagecode]/conversational/training/transcription
respectively, in all of the commands that require them.
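The check and the directory split described above could be sketched as follows. Both helper names are hypothetical. The sketch assumes each transcript has the same base name as its audio file with a .txt extension, and it uses symlinks rather than copies to save space; whether the build scripts follow symlinked audio files has not been verified here, so switch to cp if they do not.

```shell
# count_wav AUDIO_DIR (hypothetical helper):
# prints how many .wav files sit directly in AUDIO_DIR.
count_wav() {
    find "$1" -maxdepth 1 -name '*.wav' | wc -l
}

# make_sph_only AUDIO_DIR TRANSCRIPT_DIR OUT_DIR (hypothetical helper):
# links every .sph file into OUT_DIR/audio and the transcript with the same
# base name into OUT_DIR/transcription. Pass absolute paths, otherwise the
# symlinks will dangle.
make_sph_only() {
    audio_dir=$1
    transcript_dir=$2
    out_dir=$3
    mkdir -p "$out_dir/audio" "$out_dir/transcription"
    for sph in "$audio_dir"/*.sph; do
        [ -e "$sph" ] || continue      # directory contains no .sph files
        base=$(basename "$sph" .sph)
        ln -sf "$sph" "$out_dir/audio/"
        if [ -e "$transcript_dir/$base.txt" ]; then
            ln -sf "$transcript_dir/$base.txt" "$out_dir/transcription/"
        else
            echo "no transcript found for $base" >&2
        fi
    done
}
```

If count_wav reports anything other than 0 for the training audio directory, run make_sph_only and pass OUT_DIR/audio and OUT_DIR/transcription to the commands that ask for the audio and transcription directories.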
cd /proj/tts/tools/babel_scripts
mkdir yourusername
cd yourusername
cp ../make_build .
mkdir turkish
cd turkish
../make_build setup_voice turkish \
$BABELDIR/BABEL_BP_105/conversational/reference_materials/lexicon.txt \
$BABELDIR/BABEL_BP_105/conversational/training/transcription \
$BABELDIR/BABEL_BP_105/conversational/training/audio
Also, make sure that all the vowels are in fact set as vowels in the phoneset file. Any vowel that's not already in the default Festival phoneset ('radio') will not be marked as a vowel automatically. Check the LSP file for the language if you are unsure.
Also, in the Babel lexicon files, the symbol # is commonly used to denote word boundaries. This should get converted to wb because # is a delimiter character in the label file format.
Also check whether there are any weird characters on the word side. E.g. for Lithuanian, letters which were spoken as letters were in the lexicon like this: /C/ /D/ /T/ etc. This broke the scripts, and the fix was to remove the slashes in the lexicon entries.
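For the slash problem, a cleaned copy of the lexicon can be produced without touching the original corpus files. This is a sketch only: strip_word_slashes is a hypothetical helper, and it assumes the lexicon is tab-separated with the word in the first field, which should be checked against the actual file.

```shell
# strip_word_slashes LEXICON (hypothetical helper):
# prints the lexicon with every "/" removed from the word field only.
# Assumes tab-separated fields with the word in field 1; pronunciation
# fields are left untouched.
strip_word_slashes() {
    awk 'BEGIN { FS = OFS = "\t" } { gsub("/", "", $1); print }' "$1"
}
```

The output can be redirected to a file in your own voice directory, and that file used in place of the original lexicon.txt path.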
If you are working with conversational data:
../make_build make_raw_waves /path/to/babel/audio
../make_build make_prompts /path/to/babel/transcripts
../make_build reduce_prompts etc/txt.done.data.all
../make_build make_extract_subutts etc/txt.done.data
./bin/do_build parallel build_prompts etc/txt.done.data
./bin/do_build label etc/txt.done.data
./bin/do_clustergen parallel build_utts etc/txt.done.data
If you are working with scripted data:
| ../make_build make_raw_waves /path/to/babel/audio | This should create recording/*.wav |
| ../make_build make_scripted_prompts /path/to/babel/transcripts | This should create etc/txt.done.data.all |
| ../make_build reduce_prompts etc/txt.done.data.all | This should create etc/txt.done.data |
| ../make_build clean_conv_subutts | This does some audio cleanup and should create wav/*.wav |
| ./bin/do_build parallel build_prompts etc/txt.done.data | This creates prompt-utt/*.utt and prompt-lab/*.lab |
| ./bin/do_build label etc/txt.done.data | This does EHMM alignment. It takes a long time. Save the output when done so you can get the log likelihoods later on. |
| ./bin/do_clustergen parallel build_utts etc/txt.done.data | This produces utterance files in festival/utts/*.utt |
[[TODO this is still buggy]]
Your .utt files should be present under festival/utts.
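Two small helpers may be handy around these steps: one to save the output of the long-running label step, and one to check which prompts ended up without an utterance file. Both names are hypothetical, and the second assumes the usual festvox conventions: one ( id "text" ) entry per line in etc/txt.done.data, and utterance files named after the prompt id.

```shell
# run_logged LOGFILE COMMAND [ARGS...] (hypothetical helper):
# runs COMMAND, showing its output on screen and also saving it to LOGFILE.
run_logged() {
    log=$1
    shift
    "$@" 2>&1 | tee "$log"
}

# list_missing_utts PROMPT_FILE UTT_DIR (hypothetical helper):
# prints the id of every prompt that has no UTT_DIR/<id>.utt file.
list_missing_utts() {
    # Pull the id out of each '( id "text" )' line.
    sed -n 's/^[[:space:]]*([[:space:]]*\([^[:space:]]*\).*/\1/p' "$1" |
    while read -r id; do
        [ -e "$2/$id.utt" ] || echo "$id"
    done
}
```

For example, run_logged label.log ./bin/do_build label etc/txt.done.data keeps the alignment output for later, and list_missing_utts etc/txt.done.data festival/utts prints nothing when every prompt has a .utt file.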