Creating .utt Files for English
Create prompt file and general setup
First you need to make a .data file with the base filenames of all the
utterances and the text of each utterance, e.g.:
( uniph_0001 "a whole joy was reaping." )
( uniph_0002 "but they've gone south." )
( uniph_0003 "you should fetch azure mike." )
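If you already have your transcripts in some machine-readable list, a
one-liner can generate this file. A sketch, assuming a hypothetical
tab-separated file transcripts.tsv with one "id<TAB>text" pair per line:
# build the .data file from a hypothetical transcripts.tsv
awk -F'\t' '{ printf("( %s \"%s\" )\n", $1, $2) }' transcripts.tsv > mydatafile.data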
You will also need to do some general setup to get Festival and
related tools on your path. Put the following lines in your .bashrc
file (these point to the newest version of Festival (2.4), which
includes EHMM, from the Babel Festvox scripts), and don't forget to
source .bashrc afterwards:
export PATH=/proj/tts/tools/babel_scripts/build/festival/bin:$PATH
export PATH=/proj/tts/tools/babel_scripts/build/speech_tools/bin:$PATH
export FESTVOXDIR=/proj/tts/tools/babel_scripts/build/festvox
export ESTDIR=/proj/tts/tools/babel_scripts/build/speech_tools
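To apply the changes in your current shell:
source ~/.bashrc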
** Note that these are the paths that Speech Lab students should use.
If you are not in Speech Lab, then set these paths to wherever you
have Festival, Festvox, and EST installed.
** Also note that any labels created using the old version of Festival
(in /proj/speech/tools) will be missing the feature "vowel in
current syllable," which especially affects the quality of Merlin
voices. Make sure the labels you are using are consistent: e.g. if
using old-style labels, compare the voice against a baseline that
also uses old-style labels.
Fullcontext labels using EHMM alignment
EHMM stands for "ergodic HMM" and is an alignment method that
accounts for the possibility of pauses in between phoneme labels,
which should in theory result in better duration models. This method
is fairly commonly used and is built into Festival. More information
on EHMM can be found in this paper: Sub-Phonetic Modeling for
Capturing Pronunciation Variations for Conversational Speech
Synthesis (Prahallad et al. 2006).
Source: modified from
http://www.nguyenquyhy.com/2014/07/create-full-context-labels-for-hts/
- Prepare the directory:
$FESTVOXDIR/src/clustergen/setup_cg cmu us awb_arctic
Instead of cmu and awb_arctic you can pick any names you want, but
please keep us so that Festival knows to use the US English
pronunciation dictionary.
- Copy or symlink your .wav files into the wav/
folder -- these should be 16 kHz, 16-bit mono, RIFF (WAV) format.
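If your audio is not already in that format, sox can convert it; a
sketch, where in_0001.wav is a hypothetical source file:
# resample to 16 kHz, 16-bit, mono RIFF/WAV
sox in_0001.wav -r 16000 -b 16 -c 1 wav/uniph_0001.wav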
- Put all the transcriptions into the
file etc/txt.done.data -- this is the file you created in
the very first step above
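Assuming your prompt file from the first step is called
mydatafile.data, this is just:
cp mydatafile.data etc/txt.done.data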
- Run the following 3 commands:
./bin/do_build build_prompts
./bin/do_build label
./bin/do_build build_utts
- The .utt files should now be under festival/utts.
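As a quick sanity check, the number of .utt files should match the
number of prompts in etc/txt.done.data:
ls festival/utts/*.utt | wc -l
wc -l < etc/txt.done.data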
In the label step, if you get the error "Wave files are
missing. Aborting ehmm." then check the file names in txt.done.data
against those in wav/ -- something is likely missing or duplicated;
the set of utterances in both places must match exactly (see the
snippet below). If you only removed a transcript line from
txt.done.data and did not remove any .wav files, you can just
continue with label; you don't have to re-run the build_prompts step.
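One way to find the mismatch, assuming the standard ( name "text" )
prompt format shown above:
# utterance names according to the prompt file vs. the wav directory
cut -d' ' -f2 etc/txt.done.data | sort > /tmp/data_names
ls wav/ | sed 's/\.wav$//' | sort > /tmp/wav_names
diff /tmp/data_names /tmp/wav_names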
Getting alignment score from EHMM
EHMM will tell you the average likelihood after each round
(under ehmm/mod/log100.txt), but does not by default record
the likelihood for each utterance. Speech Lab students: our version
of Festvox will print this out to stdout when it runs. Everyone else:
I added this by going into $FESTVOXDIR/src/ehmm/src/ehmm.cc,
in the function ProcessSentence, and adding this line:
cout << "Utterance: " << tF << " LL: " << lh << endl;
after the part where the variable lh gets computed for the
utterance. Then recompile by going to the top level of $FESTVOXDIR
and running make.
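With the per-utterance scores being printed, you can pull out the
worst-aligned utterances from the build output; a sketch, assuming
you redirected the output of the label step to a file build.log:
# lowest log-likelihoods (poorest alignments) first
grep 'Utterance:' build.log | sort -g -k4 | head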
Fullcontext labels using DTW alignment [DEPRECATED]
This method synthesizes all of the utterances with an existing English
Festival voice, then uses dynamic time warping (DTW) between the
synthesized and actual audio to align our actual audio with the
text. This is what we have used for many of our English voices so
far, but there are better methods out there that we should use
instead (see EHMM above); this method is included for reference.
Source: modified from
http://festvox.org/bsv/x3082.html
- You must select a name for the voice. By convention we use
three-part names consisting of an institution name, a language, and a
speaker (or the corpus name). Make a directory of that name and
change directory into it:
mkdir cmu_us_awb
cd cmu_us_awb
- There is a basic setup script that will construct the directory
structure and copy in the template files for voice building. If a
fourth argument is given, it should name one of the standard prompt
lists.
$FESTVOXDIR/src/unitsel/setup_clunits cmu us awb
cp mydatafile.data etc/
** Note that while you can rename cmu and awb to
whatever you want, us needs to stay in order to tell
Festival to use the US English dictionary for pronunciations.
- The next stage is to generate waveforms to act as prompts, or
timing cues even if the prompts are not actually played. The files
are also used in aligning the spoken data.
festival -b festvox/build_clunits.scm '(build_prompts "etc/mydatafile.data")'
Substitute the name of whichever prompt file you intend to use.
- This step creates prompt-lab/*, prompt-utt/*,
and prompt-wav/*
- Copy or symlink your .wav files into the wav/
directory.
- Now we must label the spoken prompts. We do this by matching the
synthesized prompts with the spoken ones. As we know where the
phonemes begin and end in the synthesized prompts we can map that
onto the spoken ones and find the phoneme segments. This technique
works fairly well, but it is far from perfect and it is worthwhile
to check the result and probably fix by hand.
./bin/make_labs prompt-wav/*.wav
- This creates cep/*, lab/*,
and prompt-cep/*
- After labeling we can build the utterance structure using the
prompt list and the now labeled phones and durations.
festival -b festvox/build_clunits.scm '(build_utts "etc/mydatafile.data")'
- This creates festival/utts/* -- the finished .utt files are now
under festival/utts/.
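One way to spot-check the output is to load a .utt file into
Festival's interactive interpreter (the utterance name here is just
an example):
festival
festival> (utt.load nil "festival/utts/uniph_0001.utt")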