Creating .utt Files for English
Create prompt file and general setup
First you need to make a .data file with the base filenames of all the
utterances and the text of each utterance, e.g.:
( uniph_0001 "a whole joy was reaping." )
( uniph_0002 "but they've gone south." )
( uniph_0003 "you should fetch azure mike." )
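If you already have your transcripts in some machine-readable list, a
one-liner can generate this file. A sketch, assuming a hypothetical
tab-separated file transcripts.tsv with one "id<TAB>text" pair per line:
# build the .data file from a hypothetical transcripts.tsv
awk -F'\t' '{ printf("( %s \"%s\" )\n", $1, $2) }' transcripts.tsv > mydatafile.data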
You will also need to do some general setup to get Festival and
related tools on your path. Put the following lines in your .bashrc
file (these point to the newest version of Festival (2.4), which
includes EHMM, from the Babel Festvox scripts), and don't forget to
source .bashrc afterwards:
export PATH=/proj/tts/tools/babel_scripts/build/festival/bin:$PATH
export PATH=/proj/tts/tools/babel_scripts/build/speech_tools/bin:$PATH
export FESTVOXDIR=/proj/tts/tools/babel_scripts/build/festvox
export ESTDIR=/proj/tts/tools/babel_scripts/build/speech_tools
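To apply the changes in your current shell:
source ~/.bashrc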
** Note that these are the paths that Speech Lab students should use.
If you are not in Speech Lab, then set these paths to wherever you
have Festival, Festvox, and EST installed.
** Also note that any labels created using the old version of Festival
(in /proj/speech/tools) will be missing the feature "vowel in
current syllable," which especially affects the quality of Merlin
voices. Make sure the labels you are using are consistent: e.g. if
using old-style labels, compare the voice against a baseline that
also uses old-style labels.
Fullcontext labels using EHMM alignment
EHMM stands for "ergodic HMM" and is an alignment method that
accounts for the possibility of pauses in between phoneme labels,
which should in theory result in better duration models. This method
is fairly commonly used and is built into Festival. More information
on EHMM can be found in this paper: Sub-Phonetic Modeling for
Capturing Pronunciation Variations for Conversational Speech
Synthesis (Prahallad et al. 2006).
Source: modified from
http://www.nguyenquyhy.com/2014/07/create-full-context-labels-for-hts/
- Prepare the directory:
$FESTVOXDIR/src/clustergen/setup_cg cmu us awb_arctic
Instead of cmu and awb_arctic you can pick any names you want, but
please keep us so that Festival knows to use the US English
pronunciation dictionary.
- Copy or symlink your .wav files into the wav/
folder -- these should be 16 kHz, 16-bit mono, RIFF (WAV) format.
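If your audio is not already in that format, sox can convert it; a
sketch, where in_0001.wav is a hypothetical source file:
# resample to 16 kHz, 16-bit, mono RIFF/WAV
sox in_0001.wav -r 16000 -b 16 -c 1 wav/uniph_0001.wav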
- Put all the transcriptions into the
file etc/txt.done.data -- this is the file you created in
the very first step above
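Assuming your prompt file from the first step is called
mydatafile.data, this is just:
cp mydatafile.data etc/txt.done.data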
- Run the following 3 commands:
./bin/do_build build_prompts
./bin/do_build label
./bin/do_build build_utts
- The .utt files should now be under festival/utts.
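As a quick sanity check, the number of .utt files should match the
number of prompts in etc/txt.done.data:
ls festival/utts/*.utt | wc -l
wc -l < etc/txt.done.data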
In the label step, if you get the error "Wave files are
missing. Aborting ehmm." then check the file names in txt.done.data
against those in wav/ -- something is likely missing or duplicated;
the set of utterances in both places must match exactly (see the
snippet below). If you only removed a transcript line from
txt.done.data and did not remove any .wav files, you can just
continue with label; you don't have to re-run the build_prompts step.
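One way to find the mismatch, assuming the standard ( name "text" )
prompt format shown above:
# utterance names according to the prompt file vs. the wav directory
cut -d' ' -f2 etc/txt.done.data | sort > /tmp/data_names
ls wav/ | sed 's/\.wav$//' | sort > /tmp/wav_names
diff /tmp/data_names /tmp/wav_names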
Getting alignment score from EHMM
EHMM will tell you the average likelihood after each round
(under ehmm/mod/log100.txt), but does not by default record
the likelihood for each utterance. Speech Lab students: our version
of Festvox will print this out to stdout when it runs. Everyone else:
I added this by going into $FESTVOXDIR/src/ehmm/src/ehmm.cc,
in the function ProcessSentence, and adding this line:
cout << "Utterance: " << tF << " LL: " << lh << endl;
after the part where the variable lh gets computed for the
utterance. Then recompile by going to the top level of $FESTVOXDIR
and running make.
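With the per-utterance scores being printed, you can pull out the
worst-aligned utterances from the build output; a sketch, assuming
you redirected the output of the label step to a file build.log:
# lowest log-likelihoods (poorest alignments) first
grep 'Utterance:' build.log | sort -g -k4 | head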
Fullcontext labels using DTW alignment [DEPRECATED]
This method synthesizes all of the utterances with an existing English
Festival voice, then uses dynamic time warping (DTW) between the
synthesized and actual audio to align our actual audio with the
text. This is what we have used for many of our English voices so
far, but there are better methods out there that we should use
instead (see EHMM above); this method is included for reference.
Source: modified from
http://festvox.org/bsv/x3082.html
- You must select a name for the voice. By convention we use
three-part names consisting of an institution name, a language, and a
speaker (or the corpus name). Make a directory of that name and
change directory into it:
mkdir cmu_us_awb
cd cmu_us_awb
- There is a basic setup script that will construct the directory
structure and copy in the template files for voice building. If a
fourth argument is given, it should name one of the standard prompt
lists.
$FESTVOXDIR/src/unitsel/setup_clunits cmu us awb
cp mydatafile.data etc/
** Note that while you can rename cmu and awb to
whatever you want, us needs to stay in order to tell
Festival to use the US English dictionary for pronunciations.
- The next stage is to generate waveforms to act as prompts, or
timing cues even if the prompts are not actually played. The files
are also used in aligning the spoken data.
festival -b festvox/build_clunits.scm '(build_prompts "etc/mydatafile.data")'
Substitute the name of whichever prompt file you intend to use.
- This step creates prompt-lab/*, prompt-utt/*,
and prompt-wav/*
- Copy or symlink your .wav files into the wav/
directory.
- Now we must label the spoken prompts. We do this by matching the
synthesized prompts with the spoken ones. As we know where the
phonemes begin and end in the synthesized prompts we can map that
onto the spoken ones and find the phoneme segments. This technique
works fairly well, but it is far from perfect and it is worthwhile
to check the result and probably fix by hand.
./bin/make_labs prompt-wav/*.wav
- This creates cep/*, lab/*,
and prompt-cep/*
- After labeling we can build the utterance structure using the
prompt list and the now labeled phones and durations.
festival -b festvox/build_clunits.scm '(build_utts "etc/mydatafile.data")'
- This creates festival/utts/* -- the finished .utt files are now
under festival/utts/.
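One way to spot-check the output is to load a .utt file into
Festival's interactive interpreter (the utterance name here is just
an example):
festival
festival> (utt.load nil "festival/utts/uniph_0001.utt")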