Merlin Instructions and Troubleshooting
Before You Get Started
- Install Merlin from GitHub:
https://github.com/CSTR-Edinburgh/merlin
Do not install any of the dependencies from pip. The dependencies are
already installed on Speech Lab machines kata, paka, muur, and
iring. If you install your own versions, they will be the wrong
versions and Merlin will not work. Please only run the compilation
step from the installation instructions (compile_tools.sh).
- Only one process may be running on a given GPU at a time.
Please check that no one is using the GPU on the machine you are
working on. You can check by
running merlin/src/gpu_lock.py. You can also see if anyone
is logged into the machine (and possibly about to start using the
GPU) by running who. If someone is already
using the GPU, then your process will run on CPU and take much
longer. Other speech lab students may also be using the GPU but not
with Merlin, so in general, always check who to see if anyone else
is using the machine.
- Always make sure your GPU lock has been released when you are done.
If your voice trains successfully, then the lock should get released
automatically. If you kill the process manually or if it errors
out, then it is possible that the lock will not get released. Check
and release manually by using gpu_lock.py.
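For example, to check the lock status and manually free a lock (the --free flag here is from the lock script bundled with Merlin; run the script with no arguments to see its current status report, and check its usage message if the flags differ in your version):
python merlin/src/gpu_lock.py
python merlin/src/gpu_lock.py --free 0
The first command reports which GPU boards are locked and by whom; the second releases the lock on board 0.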
Speech Lab GPU Machines
Our GPU machines in lab are kata, paka, muur, iring, and hecate.
- If you want to use hecate, you will need an account on the
machine. Please ask Rose if you need one.
- If you are using paka, muur, or iring, the device setting in
your recipename/scripts/submit.sh should be:
gpu$gpu_id
- If you are using kata, the device setting in
your recipename/scripts/submit.sh should be:
cuda$gpu_id
This is because kata has a newer version of the GPU drivers than the
other machines.
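In both cases the device value is the part you change in the THEANO_FLAGS line of submit.sh, which looks like this:
THEANO_FLAGS="mode=FAST_RUN,device=gpu$gpu_id,"$MERLIN_THEANO_FLAGS
on paka, muur, and iring, and like this on kata:
THEANO_FLAGS="mode=FAST_RUN,device=cuda$gpu_id,"$MERLIN_THEANO_FLAGS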
Merlin Instructions:
- Check that the GPU on the machine you are using is currently
free by running python merlin/src/gpu_lock.py
- Navigate to egs/build_your_own_voice/s1
- Run ./01_setup.sh voice_name
- Navigate to the experiments/voice_name directory. You should see
directories labeled duration_model and acoustic_model.
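For example (my_voice is a placeholder for whatever name you chose):
./01_setup.sh my_voice
ls experiments/my_voice
acoustic_model  duration_model  test_synthesis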
- Go to duration_model/data and create a text file with the
filenames of every utterance you want to train on, one per line with
no file extensions. Name this file file_id_list.scp. (If you're
training on a subset, you can probably get this file by copying over
a file we've already made of filenames. If you're training on the
whole corpus, you can use the command "ls [directory containing
label files] | sed 's/.\{4\}$//' > file_id_list.scp".)
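For example, if your label files are utt0001.lab, utt0002.lab, and so on (hypothetical names), file_id_list.scp should contain:
utt0001
utt0002
...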
- Using the script
merlin/misc/scripts/frontend/utils/normalize_lab_for_merlin.py,
normalize the label files and create a directory of labels inside
the data directory named label_phone_align. It takes as command line
arguments the input directory of label files, the output directory,
the label style (which will be phone_align), and the text file with
the filenames.
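For example, run from duration_model/data (the bracketed path is a placeholder; argument order is as described above):
python merlin/misc/scripts/frontend/utils/normalize_lab_for_merlin.py [dir_of_raw_labels] label_phone_align phone_align file_id_list.scp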
- Copy the label_phone_align directory and the file_id_list.scp text file to acoustic_model/data.
- Run ./03_prepare_acoustic_features.sh [path_to_wav_dir] [path_to_feat_dir] in merlin/egs/build_your_own_voice/s1
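In the demos the feature directory is the voice's acoustic_model/data directory, so the call typically looks like this (paths illustrative):
./03_prepare_acoustic_features.sh [path_to_wav_dir] experiments/[voice_name]/acoustic_model/data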
- Go to experiments/[voice_name]/test_synthesis, add your own test files, and list their names in test_id_list.scp.
- Then, make a directory within test_synthesis named prompt-lab
containing normalized label files for your test utterances. Because
Merlin's normalization script requires timestamps and our test label
files don't have them, first use the Python script
/proj/tts/examples/addtimestamps.py, which takes the input directory
of the label files, the output directory for the label files, and
the text file with the list of filenames as command line arguments. Once
you've output those label files, use the same normalization script
you used to set up your training data.
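For example (bracketed paths are placeholders; argument order as described above):
python /proj/tts/examples/addtimestamps.py [test_lab_dir] [timestamped_lab_dir] test_id_list.scp
python merlin/misc/scripts/frontend/utils/normalize_lab_for_merlin.py [timestamped_lab_dir] prompt-lab phone_align test_id_list.scp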
- Return to the s1 directory and open up the file
conf/global_settings.cfg and edit the Train, Valid, and Test values
to be the sizes of your training, validation, and test sets. (You
can check the total size of your corpus by running the wc
command on the file_id_list.scp file you've created. I
generally follow the demos and make the test and validation sets
each 1/10 the size of the training set; that is, 5/6 of the
corpus is training, 1/12 is validation, and 1/12 is test.) You will
also need to edit QuestionFile to point to the question file
associated with your language. Question files in Merlin are located
in merlin/misc/questions.
- Also in global_settings.cfg, change the label style setting to match your own labels (phone_align if you normalized them as above).
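For example, for a 1,200-utterance corpus split as above (5/6 training, 1/12 each for validation and test), the edited values would look something like this (variable names follow the demo configs, and the question file shown is the standard English one shipped with Merlin; substitute your own):
Train=1000
Valid=100
Test=100
QuestionFile=questions-radio_dnn_416.hed
Labels=phone_align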
- Run scripts 04 through 07 in merlin/egs/build_your_own_voice/s1.
- Synthesized wav files can be found in
experiments/[your_voice]/acoustic_model/gen for the validation and
test portions of your training corpus and in
experiments/[your_voice]/test_synthesis/wav.
Things to check if you get an error:
- Are any of your data files (either label files or acoustic
features) empty? The latter can be caused by the feature extraction
script throwing errors (even if the wav files aren't empty). In
these cases, you can either try regenerating the relevant file or
simply remove that utterance from the file id list.
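A quick way to find empty files (adjust the paths to your own setup):
find experiments/[voice_name]/acoustic_model/data -type f -size 0
find experiments/[voice_name]/duration_model/data -type f -size 0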
- Are the values for the number of utterances correct in
conf/global_settings.cfg? Double check this, especially if you've
removed utterances due to other problems.
- Are your labels normalized for Merlin? If you're getting a lot of
"silence not found" warnings, you may have forgotten to run the
normalization script on your labels (since Merlin indicates silence
differently than in HTS). The quick way to check this is to open up your
label file and check if silence is indicated using "sil" (correct
for Merlin) or "pau" (needs to be normalized).
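For example, assuming standard HTS-style context labels where the current phone sits between "-" and "+", this lists any files that still use "pau" (directory name is illustrative):
grep -l -e '-pau+' label_phone_align/*.lab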
- Are you using the most recent version of the label files? If you're
getting unintelligible voices, you may be using an old version of
the label files that is missing a feature. Open up a label file and
check what the last feature before "/C:" is in a random line. If
it's a vowel, you're good to go. If it's 0, regenerate the labels
using the latest version of festival.
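One quick way to eyeball this (the file name is a placeholder):
sed 's|/C:.*||' label_phone_align/[some_file].lab | head
The last feature printed at the end of each line is the one to check.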
- Are your label files aligned properly? If you're getting errors about
mismatched numbers of frames, you are probably using label files
generated from either misaligned utterances or an older alignment
script. Again, regenerate the labels (using the latest version of
festival and ehmm alignment).
- Is validation loss increasing on every iteration after the first 5?
Merlin doesn't output the model during the first 5 rounds of
training, and it doesn't update the model if validation loss
increases, so in some rare cases, if validation loss increases every
time after the first 5 iterations, the acoustic model will finish
training without ever outputting a model. If this is the case, you
can edit the source code so that it's forced to output a model
anyway. (The easiest way to do this is probably to edit the code so
it saves the best model even in the first 5 iterations, which you
can modify around line 325 in run_merlin.py.) However, if validation
loss is increasing that much, it's likely the voice won't be very
good anyway, so you should consider going back and checking for
errors in your data instead.
- If you get a MemoryError during acoustic model
training, you can set the buffer size smaller
in conf/acoustic_voicename.conf. However, don't set
it too low, or you will get a "ValueError: could not
broadcast input array" instead. I have found that a buffer
size of 10000 avoids both errors, but it may depend on your data.
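The setting is a single line in the voice's acoustic config (typically under the [Data] section, though section placement may differ between Merlin versions):
buffer_size: 10000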
Miscellaneous Tips:
- If you need to free up some space, the following directories under
experiments/voicename/[acoustic,duration]_model/inter_module can be
deleted, and the voice will still be able to synthesize new
utterances:
- binary_label_425
- nn_mgc_lf0_vuv_bap_187
- nn_norm_mgc_lf0_vuv_bap_187
- nn_no_silence_lab_425
- nn_no_silence_lab_norm_425
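For example, to delete these from the acoustic model (run from experiments/voicename, and repeat for duration_model, skipping any directory that doesn't exist there; double-check paths before deleting):
rm -r acoustic_model/inter_module/{binary_label_425,nn_mgc_lf0_vuv_bap_187,nn_norm_mgc_lf0_vuv_bap_187,nn_no_silence_lab_425,nn_no_silence_lab_norm_425}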
- Does your voice just sound a lot worse than you'd expect, given
your data? Check your label files. Is the "syllable vowel" feature
correct? If not, then Festival didn't know which phonemes in your
phoneset were vowels. Are there symbols in your phoneset that are
also delimiter symbols in the label file format? This will break
things. Also make sure that the phoneset you are using in your
training labels matches the phoneset used in the test synthesis labels.
- If you are running the slt_arctic demo and you see an
error about pygpu, then go into this
file: merlin/egs/slt_arctic/s1/scripts/submit.sh and change
this line:
THEANO_FLAGS="mode=FAST_RUN,device=cuda$gpu_id,"$MERLIN_THEANO_FLAGS
Change cuda$gpu_id to gpu$gpu_id and
the error should no longer appear.
Adding In Phrasing:
If you have information about where the phrase breaks are in your file, you can do the following to train your voice to incorporate this phrasing.
- Add an extra feature to each line of each of your label files. Add "/K:" to the end of each line, followed by "B" if the line corresponds to a phone in the last word before a phrase break or "NB" if it does not.
- Add the following line to your questions file:
QS "Word_Brk" {/K:B}
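With this added, the tail of each label line goes from something like this (the context values shown are illustrative):
.../J:13+9-2
to:
.../J:13+9-2/K:B
for phones in the last word before a phrase break, and /K:NB otherwise.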
Last updated 10/31/2018 by ecooper
Speech lab students: to edit this page, go to /proj/speech/html/merlin.html