Adding OOV Words to the Lexicon

This assumes that you already have a baseline Clustergen voice for your language with a corresponding frontend. Please make sure you have done that first.

0. Make a List of your OOVs

Make a file that contains one OOV per line, with hyphens, punctuation, and similar characters removed. If you are running Clustergen training, it will tell you which words were not in the lexicon.
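A minimal sketch of the cleanup step, in case you are collecting the OOVs by hand. The function names are hypothetical, and it assumes that "punctuation" means any Unicode punctuation character (which includes hyphens and apostrophes) and that removing a hyphen should join the two halves of the word; adjust if your language needs something different.

```python
import unicodedata


def clean_oov(word):
    """Strip whitespace and remove every Unicode punctuation character."""
    # NFC normalization is an assumption here, so that visually identical
    # words compare equal when de-duplicating.
    word = unicodedata.normalize("NFC", word.strip())
    return "".join(
        ch for ch in word if not unicodedata.category(ch).startswith("P")
    )


def build_oov_list(words):
    """Return cleaned, de-duplicated OOVs in first-seen order."""
    seen = set()
    result = []
    for raw in words:
        cleaned = clean_oov(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


if __name__ == "__main__":
    # Print one OOV per line, ready to be redirected into a file.
    for oov in build_oov_list(["Adapazarı,", "well-known", "Adapazarı"]):
        print(oov)
```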

1. Get Pronunciations

Use Sequitur or Sphinx G2P to obtain pronunciations for these words. Sphinx is recommended because it is newer and has better Unicode support.

2. Get Syllabification

Options for syllabification: Choose based on how much data you have and what works best for your language.

Simple Syllabification

This is a very naive syllabification algorithm that only looks at clusters of vowels.

  1. Put your OOVs into a file that contains only the phonemic pronunciations, one per line, with all spaces between phonemes removed. Give the file a name that includes the language code and no other numbers (e.g. 307_oovs.txt).
  2. Run:
    /proj/tts/syllabification/scripts/simple.py yourfile.txt
  3. The output will be your phoneme string with syllable boundaries marked by =.
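To give a feel for what a vowel-cluster syllabifier does, here is a rough sketch. It is not the code in simple.py, which may behave differently; the function name is hypothetical, and it assumes one character per phoneme and a vowel set that you supply. Each run of adjacent vowels is treated as one syllable nucleus, and the boundary is placed before the last consonant preceding the next nucleus.

```python
def syllabify(phones, vowels):
    """Insert '=' between syllables of a phoneme string (one char per phoneme)."""
    # Index of the first vowel in every maximal run of adjacent vowels.
    nuclei = [
        i
        for i, p in enumerate(phones)
        if p in vowels and (i == 0 or phones[i - 1] not in vowels)
    ]
    # Two runs are always separated by at least one consonant, so the
    # position just before each later nucleus is a consonant: break there.
    boundaries = {start - 1 for start in nuclei[1:]}
    out = []
    for i, p in enumerate(phones):
        if i in boundaries:
            out.append("=")
        out.append(p)
    return "".join(out)


if __name__ == "__main__":
    print(syllabify("adapazar1", {"a", "1"}))  # a=da=pa=za=r1
```

Leading consonants stay with the first syllable and trailing consonants with the last, so a word with a single vowel cluster comes back unchanged.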

Amharic Syllabification

Elshadai modified the simple syllabifier to be better suited to Amharic; it can be found here:

/proj/tts/tools/babel_scripts/amharic_aau/outfiles/my_syllabify.py

LegaliPy Syllabification

  1. Get all your OOVs into one file as described in the first step above.
  2. Run: python3 legal.py youroovfile.txt
  3. For a new language, you need to tell it which phonemes are vowels and which phoneme symbols are longer than one character.

Festival Syllabification

If you pass Festival's lex.compile a lexicon where the words are not already grouped into syllables, then it will output a lexicon that is both sorted and syllabified "in an old special way". For example, if this is your file lex.scm:

( "Adapazarı" nil (a d a p a z a r 1 ))

You can run:

$ESTDIR/../festival/bin/festival -b yourphoneset.scm '(set! lex_syllabification nil)' '(lex.compile "lex.scm" "cmu_babel_lex.out")'

which will output cmu_babel_lex.out, containing:

MNCL
("Adapazarı" nil (((a d a) 0) ((p a z a r 1) 0)))

Obviously this does not do a very good job! It turns out that the algorithm relies on knowing which phonemes are vowels as specified in the phoneset, and our default Turkish phoneset did not have 'a' or '1' marked as vowels. If you mark those as vowels and re-run, it gives this output:

MNCL
("Adapazarı" nil (((a d) 0) ((a p) 0) ((a z) 0) ((a r) 0) ((1) 0)))

which is also not perfect but a lot more reasonable.

Phoneset: If you don't specify a phoneset, then it will default to phoneset "radio" and give a warning if your lexicon contains phones outside of that set. It also won't know which phonemes in your phoneset are vowels, and thus will probably give bad output.

Stress markers: If you specify stress markers on your vowels in your input lexicon, Festival will parse that out and apply the stress to the syllable. For example, giving this input:

( "Adapazarı" nil (a0 d a0 p a1 z a0 r 10 ))

will produce this output:

MNCL
("Adapazarı" nil (((a d a) 0) ((p a z a r 1) 1)))
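The example above suggests the convention: a trailing digit on a phone is read as a stress value (so "10" is the phone "1" with stress 0), and the syllable carries the stress of its vowels. The following sketch only illustrates that reading of the example. It is not Festival's implementation; the function names are hypothetical, and taking the highest mark in a syllable is an assumption.

```python
def split_stress(token):
    """Split 'a1' into ('a', 1); a token with no trailing digit gets stress 0."""
    # A single-character token such as '1' is a bare phone, not a stress mark.
    if len(token) > 1 and token[-1].isdigit():
        return token[:-1], int(token[-1])
    return token, 0


def syllable_stress(tokens):
    """Return (phones, stress) for one syllable, using the highest mark found."""
    pairs = [split_stress(t) for t in tokens]
    phones = [phone for phone, _ in pairs]
    stress = max(mark for _, mark in pairs)
    return phones, stress


if __name__ == "__main__":
    print(syllable_stress(["p", "a1", "z", "a0", "r", "10"]))
```

Note the ambiguity this creates when a phone name is itself a digit: an unmarked "1" and a marked "10" have to be told apart by length alone.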

3. Add to Lexicon

In your language's baseline Clustergen voice, under festvox, add your words to the lexicon lex.scm, anywhere in the file. The format looks like this:

( "Adapazarı" nil (((a ) 0) ((d a ) 0) ((p a ) 0) ((z a ) 0) ((r 1 ) 0 )) )

The nil is a part-of-speech tag. We have no part-of-speech information and are not using it right now, so it should just be nil. The phonetic pronunciation is then grouped into syllables, and the zeroes are stress values, which we are also not using right now, so it is fine to just put zeroes.
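If you have many words to add, the entries can be generated rather than typed. This is a small sketch with a hypothetical helper name; it assumes you already have each word's syllables as lists of phonemes, always writes nil for the part of speech, and uses the same stress value for every syllable. It does not escape double quotes inside the word.

```python
def lexicon_entry(word, syllables, stress=0):
    """Format one lex.scm entry; syllables is a list of phoneme lists."""
    parts = []
    for phones in syllables:
        # Each syllable is written as ((p1 p2 ...) stress).
        parts.append("((" + " ".join(phones) + ") " + str(stress) + ")")
    return '("' + word + '" nil (' + " ".join(parts) + "))"


if __name__ == "__main__":
    sylls = [["a"], ["d", "a"], ["p", "a"], ["z", "a"], ["r", "1"]]
    print(lexicon_entry("Adapazarı", sylls))
```

The spacing differs slightly from the hand-written entry above, but whitespace inside the parentheses should not matter to the Scheme reader.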

Next, sort and compile it by running the following (all on one line) from your top-level Clustergen voice directory:

$ESTDIR/../festival/bin/festival -b festvox/yourvoicename_phoneset.scm '(set! lex_syllabification nil)' '(lex.compile "festvox/lex.scm" "festvox/cmu_babel_lex.out")'

Then, when you load that voice in Festival and use it to make utterances, etc., it should use the updated lexicon cmu_babel_lex.out.
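A quick sanity check after compiling is to confirm that your new words actually made it into cmu_babel_lex.out. The sketch below assumes the compiled file looks like the examples on this page: an MNCL header line followed by one entry per line, each starting with the quoted word. The function name is hypothetical.

```python
def compiled_lexicon_has_word(lines, word):
    """Check an iterable of compiled-lexicon lines for an entry for word."""
    it = iter(lines)
    header = next(it, "").strip()
    if header != "MNCL":
        raise ValueError("expected an MNCL header, got: " + repr(header))
    # Entries are assumed to start with ("word" followed by a space.
    prefix = '("' + word + '" '
    return any(line.startswith(prefix) for line in it)


if __name__ == "__main__":
    # Assumes the lexicon is UTF-8 and lives at the path used on this page.
    with open("festvox/cmu_babel_lex.out", encoding="utf-8") as f:
        print(compiled_lexicon_has_word(f, "Adapazarı"))
```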