Get Out-of-Vocabulary Pronunciations using Sequitur G2P
Sequitur G2P is a trainable grapheme-to-phoneme converter which can be
found here:
https://www-i6.informatik.rwth-aachen.de/web/Software/g2p.html
According to Q&A threads we have found online, this software no longer
appears to be actively supported, and we have had trouble training it on
languages written in non-Latin scripts, such as Amharic. We are therefore
looking to switch over to CMU Sphinx G2P. For now, we still use
Sequitur G2P for English and Turkish.
1. Train pronunciation models on your existing pronunciation dictionary
Skip this step if you already have trained models for your language.
Speech Lab students: we already have trained models for English and
Turkish. Our version of Sequitur G2P lives here:
/proj/tts/tools/g2p/bin/g2p.py
It runs only on kucing, because the other machines do not have its dependencies installed.
From the Sequitur G2P README file:
- Obtain a pronunciation dictionary for training.
The format is one word per line. Each line contains the
orthographic form of the word followed by the corresponding
phonemic transcription. The word and all phonemes need to be
separated by white space. The word and phoneme symbols may thus
not contain blanks. We'll assume your training lexicon is called
train.lex, and that you set aside some portion for testing purposes
as test.lex, which is disjoint from train.lex.
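One way to prepare the two files is sketched below in Python. This is a minimal sketch and not part of Sequitur G2P: the helper names (read_lexicon, split_lexicon, format_entry), the 10% test fraction, and the decision to treat a line without phonemes as an error are assumptions made for illustration. It splits by word rather than by line, so that a word with several pronunciations cannot end up in both train.lex and test.lex.

```python
import random


def read_lexicon(lines):
    """Parse 'word phoneme phoneme ...' lines into (word, phonemes) pairs."""
    entries = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # word and phonemes are whitespace-separated
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            # Assumption of this sketch: every word needs at least one phoneme.
            raise ValueError(f"line {number}: no phonemes for {fields[0]!r}")
        entries.append((fields[0], fields[1:]))
    return entries


def split_lexicon(entries, test_fraction=0.1, seed=0):
    """Split entries by word so that train and test share no words."""
    words = sorted({word for word, _ in entries})
    random.Random(seed).shuffle(words)  # fixed seed for a repeatable split
    n_test = max(1, round(len(words) * test_fraction))
    test_words = set(words[:n_test])
    train = [entry for entry in entries if entry[0] not in test_words]
    test = [entry for entry in entries if entry[0] in test_words]
    return train, test


def format_entry(word, phonemes):
    """Render one entry back into the one-word-per-line lexicon format."""
    return word + " " + " ".join(phonemes)
```

To use it, read your full dictionary with read_lexicon, call split_lexicon, and write each half out with format_entry, one entry per line, as train.lex and test.lex.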
- Train a model.
To create a first model, type:
g2p.py --train train.lex --devel 5% --write-model model-1
This first model will be rather poor because it is only a unigram.
To create higher order models you need to run g2p.py again:
g2p.py --model model-1 --ramp-up --train train.lex --devel 5% --write-model model-2
Repeat this a couple of times:
g2p.py --model model-2 --ramp-up --train train.lex --devel 5% --write-model model-3
g2p.py --model model-3 --ramp-up --train train.lex --devel 5% --write-model model-4
...
Speech Lab students: We have typically been training up to model-3.
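The ramp-up sequence above is repetitive, so it can be generated rather than typed by hand. The following Python sketch builds the same command lines using only the flags shown in the README excerpt. The function names training_commands and train are hypothetical, and the sketch assumes g2p.py is on your PATH (pass the full path as the g2p argument otherwise).

```python
import subprocess


def training_commands(lexicon, n_models, devel="5%", g2p="g2p.py"):
    """Return the g2p.py command lines for model-1 .. model-<n_models>."""
    commands = []
    for order in range(1, n_models + 1):
        command = [g2p]
        if order > 1:
            # Every model after the first is ramped up from the previous one.
            command += ["--model", f"model-{order - 1}", "--ramp-up"]
        command += ["--train", lexicon, "--devel", devel,
                    "--write-model", f"model-{order}"]
        commands.append(command)
    return commands


def train(lexicon, n_models, g2p="g2p.py"):
    """Run the commands in order, stopping at the first failure."""
    for command in training_commands(lexicon, n_models, g2p=g2p):
        subprocess.run(command, check=True)
```

For example, train("train.lex", 3) would run the three commands that produce model-1, model-2, and model-3, matching the depth we typically train to.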
2. Use trained models to generate pronunciations for unseen words
Speech Lab students: Trained models for English and Turkish live
here:
/proj/tts/resources/g2p/cmudict/
/proj/tts/resources/g2p/babel_turkish/
From the Sequitur G2P README:
Prepare a list of words you want to transcribe as a simple text
file words.txt with one word per line (and no phonemic
transcription), then type:
/proj/tts/tools/g2p/bin/g2p.py --model model-3 --apply words.txt
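If you start from a larger word list, words.txt only needs the words your dictionary does not already cover. The Python sketch below shows one way to pick those out. The function name oov_words is hypothetical, and the comparison is case-sensitive, which is an assumption: if your dictionary stores words in a different case from your word list, normalize the case first.

```python
def oov_words(candidates, lexicon_lines):
    """Return candidate words missing from the lexicon, in first-seen order."""
    # The first whitespace-separated field of each lexicon line is the word.
    known = {line.split()[0] for line in lexicon_lines if line.strip()}
    seen = set()
    result = []
    for word in candidates:
        word = word.strip()
        # Keep only non-empty words that are not in the lexicon and that
        # have not been emitted already.
        if word and word not in known and word not in seen:
            seen.add(word)
            result.append(word)
    return result
```

Write the returned words to words.txt, one per line, and pass that file to --apply as shown above.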