Adding OOV Words to the Lexicon
This assumes that you already have a baseline Clustergen voice for
your language with a corresponding frontend. Please make sure you have
done that first.
0. Make a List of your OOVs
Make a file that contains one OOV word per line, with hyphens,
punctuation, etc. removed. If you are running the Clustergen training,
it will tell you which words were not in the lexicon.
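As a rough illustration of the cleanup step, here is a minimal Python sketch. It is not part of the Clustergen tooling, and the function name clean_oov_list is hypothetical. It assumes that "without hyphens, punctuation" means deleting every Unicode punctuation character (so a hyphenated word is joined into one token) and that duplicates should be dropped. If apostrophes or similar marks are real letters in your language, adjust accordingly.

```python
import unicodedata


def clean_oov_list(words):
    """Return the words with punctuation removed, one entry per unique word.

    Assumption: every character whose Unicode category starts with "P"
    (punctuation, including hyphens and apostrophes) is simply deleted.
    """
    seen = set()
    cleaned = []
    for raw in words:
        word = "".join(
            ch for ch in raw.strip()
            if not unicodedata.category(ch).startswith("P")
        )
        # Skip lines that end up empty and words we have already kept.
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned


if __name__ == "__main__":
    # Write the result one word per line, as step 0 asks for.
    for w in clean_oov_list(["Adapazarı,", "co-op", "", "Adapazarı"]):
        print(w)
```

The order of first appearance is preserved, so the output can be compared line by line against the original OOV report.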
1. Get Pronunciations
Use Sequitur or Sphinx G2P to obtain pronunciations for these words.
Sphinx is recommended, as it is newer and has better Unicode support.
2. Get Syllabification
Options for syllabification:
- Syllabify by hand, if there are few enough words, using a naive
scheme like cv cvc cv.
- Use the naive syllabification script (see Simple Syllabification
below).
- Use the LegaliPy syllabification pipeline that Chevy has
developed under /proj/tts/syllabification (usage described
below).
- Use Festival's built-in syllabifier (usage also described below).
Choose based on how much data you have and what works best for your language.
Simple Syllabification
This is a very naive syllabification algorithm that basically just
looks at clusters of vowels.
- Put your OOVs into a file that contains only the phonemic
pronunciations, one per line, with all the spaces between phonemes
removed. Give the file a name that contains the language code and no
other numbers (e.g. 307_oovs.txt).
- Run:
/proj/tts/syllabification/scripts/simple.py yourfile.txt
- The output is your phoneme string with syllable boundaries
marked by =.
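To make the idea of "looking at clusters of vowels" concrete, here is a small Python sketch of one such naive scheme. It is only an illustration and is not the code in simple.py. It assumes that every phoneme is a single character, that you supply the vowel set yourself, and that each boundary goes just before the last consonant preceding the next vowel cluster (which yields the cv cvc cv pattern). The name naive_syllabify is hypothetical.

```python
def naive_syllabify(phonemes, vowels):
    """Insert "=" between syllables of a space-free phoneme string.

    Assumptions: one character per phoneme; each maximal run of vowels
    is one syllable nucleus; a single consonant before a nucleus is its
    onset, and any remaining consonants stay with the previous syllable.
    """
    # Collect the start index of each maximal run of vowels.
    nucleus_starts = []
    i = 0
    while i < len(phonemes):
        if phonemes[i] in vowels:
            nucleus_starts.append(i)
            while i < len(phonemes) and phonemes[i] in vowels:
                i += 1
        else:
            i += 1

    # Zero or one nucleus: nothing to split.
    if len(nucleus_starts) < 2:
        return phonemes

    # Cut one position before every nucleus except the first.
    pieces = []
    prev = 0
    for start in nucleus_starts[1:]:
        cut = start - 1
        pieces.append(phonemes[prev:cut])
        prev = cut
    pieces.append(phonemes[prev:])
    return "=".join(pieces)


if __name__ == "__main__":
    print(naive_syllabify("adapazar1", set("a1")))
```

With a and 1 treated as vowels, the Turkish example used later in this page comes out as a=da=pa=za=r1.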
Amharic Syllabification
Elshadai modified the simple syllabifier to be better suited to
Amharic; it can be found here:
/proj/tts/tools/babel_scripts/amharic_aau/outfiles/my_syllabify.py
LegaliPy Syllabification
- Get all your OOVs into one file as described in the first step
above.
- Run:
python3 legal.py youroovfile.txt
- If it's a new language, you need to tell it which phonemes are
vowels, and also which phoneme symbols are more than one character.
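The reason multi-character phoneme symbols have to be declared is that the input has no spaces between phonemes, so a symbol such as tS would otherwise be read as two phonemes. The Python sketch below shows the general idea with a greedy longest-match split. It is not how legal.py is implemented or configured. The function name and the example symbol tS are hypothetical.

```python
def tokenize_phonemes(text, multi_char_symbols):
    """Split a space-free phoneme string into a list of phoneme symbols.

    Assumption: any symbol not listed in multi_char_symbols is exactly
    one character long. Longer symbols are tried first.
    """
    symbols = sorted(multi_char_symbols, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # No declared multi-character symbol matches here.
            tokens.append(text[i])
            i += 1
    return tokens


if __name__ == "__main__":
    print(tokenize_phonemes("tSatS", ["tS"]))
```

Without the declaration, the same string would be split into five single-character phonemes instead of three.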
Festival Syllabification
If you pass Festival's lex.compile a lexicon where the words
are not already grouped into syllables, then it will output a lexicon
that is both sorted and syllabified "in an old special way". For example, if this is your
file lex.scm:
( "Adapazarı" nil (a d a p a z a r 1 ))
You can run:
$ESTDIR/../festival/bin/festival -b yourphoneset.scm '(set! lex_syllabification nil)' '(lex.compile "lex.scm" "cmu_babel_lex.out")'
which will output cmu_babel_lex.out, containing:
MNCL
("Adapazarı" nil (((a d a) 0) ((p a z a r 1) 0)))
Obviously this does not do a very good job! It turns out that the
algorithm relies on knowing which phonemes are vowels as specified in
the phoneset, and our default Turkish phoneset did not have 'a' or
'1' marked as vowels. If you mark those as vowels and re-run, it
gives this output:
MNCL
("Adapazarı" nil (((a d) 0) ((a p) 0) ((a z) 0) ((a r) 0) ((1) 0)))
which is also not perfect but a lot more reasonable.
Phoneset: If you don't specify a phoneset, then it will
default to phoneset "radio" and give a warning if your lexicon
contains phones outside of that set. It also won't know which
phonemes in your phoneset are vowels, and thus will probably give bad output.
Stress markers: If you specify stress markers on your vowels in
your input lexicon, Festival will parse that out and apply the stress
to the syllable. For example, giving this input:
( "Adapazarı" nil (a0 d a0 p a1 z a0 r 10 ))
will produce this output:
MNCL
("Adapazarı" nil (((a d a) 0) ((p a z a r 1) 1)))
3. Add to Lexicon
In your language's baseline Clustergen voice, under festvox,
add your words to the lexicon lex.scm, anywhere in the file. The format looks
like this:
( "Adapazarı" nil (((a ) 0) ((d a ) 0) ((p a ) 0) ((z a ) 0) ((r 1 ) 0 )) )
The nil is a part-of-speech tag, which we don't have information
about and are not using right now, so it should just be nil. After
that, the phonetic pronunciation is grouped into its syllables, and the
zeroes are stress information, which we are also not using right now,
so it is fine to just put zeroes.
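If you have many words, writing these entries by hand is error-prone. The following Python sketch builds an entry string from a word and its syllables. It is not part of Festival or Festvox, and the name format_lex_entry is hypothetical. It assumes the part of speech is always nil and defaults every stress value to 0, as described above. The spacing follows the compact form that lex.compile itself prints, which differs from the hand-written example only in whitespace.

```python
def format_lex_entry(word, syllables, stress=None):
    """Build one lexicon entry in the syllabified Festival format.

    word      -- the orthographic form, e.g. "Adapazarı"
    syllables -- a list of syllables, each a list of phoneme symbols
    stress    -- optional list of stress values, one per syllable
                 (assumption: defaults to 0 for every syllable)
    """
    if stress is None:
        stress = [0] * len(syllables)
    if len(stress) != len(syllables):
        raise ValueError("need exactly one stress value per syllable")

    parts = []
    for phones, value in zip(syllables, stress):
        parts.append("(({}) {})".format(" ".join(phones), value))
    return '("{}" nil ({}))'.format(word, " ".join(parts))


if __name__ == "__main__":
    sylls = [["a"], ["d", "a"], ["p", "a"], ["z", "a"], ["r", "1"]]
    print(format_lex_entry("Adapazarı", sylls))
```

The printed line can be pasted into lex.scm as is.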
Next, sort it by running (all on one line) in your top-level
Clustergen voice directory:
$ESTDIR/../festival/bin/festival -b festvox/yourvoicename_phoneset.scm '(set! lex_syllabification nil)' '(lex.compile "festvox/lex.scm" "festvox/cmu_babel_lex.out")'
Then, when you load that voice in Festival and use it to make
utterances etc., it should use the updated lexicon cmu_babel_lex.out.