Turkish

Questions File

The questions files for Turkish live here:

/proj/tts/data/turkish/babel/data/questions

To be compatible with HTS, we had to rename some of the phones (e.g. '2') to spelled-out versions (e.g. 'two'), since HTS expects alphabetic-only phoneme names. It is not clear whether Merlin also requires this.

Babel Conversational Data

Alan's frontend scripts worked as-is on this data, probably because they were developed and tested for this data.

Mert did a lot of debugging and data cleanup to get a Merlin baseline using the Babel conversational data. A modification to the Merlin code was required to force it to output a model even when it would rather not (because the model is not very good). For the fix, see the commented part at line 331 of /local/users/mert/merlin/src/run_merlin.py on hecate; it forces Merlin to generate models regardless of validation errors.

A first baseline for Turkish Babel conversational is on hecate under
/local/users/mert/merlin/egs/build_your_own_voice/s1/experiments/turkish_new

The voice was judged not to sound very good, so some cleanup of the data was done. Mert's cleaned-up list of utterance transcriptions can be found here: /proj/tts/tools/babel_scripts/turkish_merlin/etc/txt.done.data

This consists of hand-corrected transcripts, with bad or unintelligible utterances removed. The beginnings of a voice trained on these are here, on hecate: /local/users/mert/merlin/egs/build_your_own_voice/s1/experiments/turkish2
The label files and audio have a frame mismatch, so they need to be regenerated before a voice can be built.
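
One way to spot the frame mismatch before regenerating is to compare the number of frames implied by a label file with the number of frames in the corresponding acoustic features. The following is only a rough sketch. It assumes HTS-style label lines of the form "start end label" with times in units of 100 ns, and a 5 ms frame shift; both should be checked against the actual setup. The function names are made up for illustration.

```python
HTS_UNITS_PER_MS = 10_000  # HTS label times are in units of 100 ns


def label_frame_count(label_lines, frame_shift_ms=5):
    """Number of frames implied by the largest end time in a label file.

    Assumes each line looks like "start end label" (HTS style).
    """
    units_per_frame = round(frame_shift_ms * HTS_UNITS_PER_MS)
    end_time = 0
    for line in label_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        end_time = max(end_time, int(fields[1]))
    return end_time // units_per_frame


def frame_mismatch(label_lines, n_feature_frames, frame_shift_ms=5):
    """Label frames minus feature frames (0 means they agree)."""
    return label_frame_count(label_lines, frame_shift_ms) - n_feature_frames
```

A small difference of a frame or two may just be rounding at the end of the file; a large difference suggests the labels and audio do not belong together.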

I think we were originally using a "nonstandard" phone mapping for this data, so check that the label files use acceptable phone names (i.e., names that match the other data you are using, if combining). E.g., we were using "oneone" instead of "oneCL".
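
A quick way to do this check is to collect the phone names that actually occur in the label files and compare them against the phone set you expect. The sketch below assumes HTS-style full-context labels, where the current phone sits between the first "-" and the following "+" (e.g. "x^pau-m+oneone=r..."). The function names and the example phone set are hypothetical.

```python
import re

# In HTS-style full-context labels the current phone is the text
# between the first "-" and the next "+".
CENTER_PHONE = re.compile(r"-([^+]+)\+")


def phones_in_labels(label_lines):
    """Set of current-phone names found in the given label lines."""
    phones = set()
    for line in label_lines:
        if not line.strip():
            continue
        context = line.split()[-1]  # last field is the context label
        match = CENTER_PHONE.search(context)
        if match:
            phones.add(match.group(1))
    return phones


def unexpected_phones(label_lines, expected):
    """Phone names in the labels that are not in the expected set."""
    return sorted(phones_in_labels(label_lines) - set(expected))
```

For example, running unexpected_phones over labels that still contain "oneone", with an expected set that contains "oneCL", would report "oneone".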

Babel Scripted Data

Frontend processing is in progress. EHMM would not align this data, so we are currently looking into other aligners.

Radio Broadcast News

Cindy prepared the data for use in voice training and created a lexicon with OOVs added; that lexicon is here: /proj/tts/data/turkish/bn/txt_files/new_babel_lex_phonemapped.out

Kai-Zhan created a Merlin baseline voice using this data; that voice is here:
/proj/tts/habanero_archive/kl2792/tarballs/turkish_news.tgz

Erica has created a cleaned-up set of female utterances, since the data contains a lot of background music and lower-quality audio. That 4.5-hour set is defined here:
/proj/tts/data/turkish/bn/4.5hr_cleaned_f.scp

We also had to do the following phone mapping:

# -> wb
1 -> one
1: -> oneCL
2 -> two
2: -> twoCL
5 -> five
a: -> aCL
e: -> eCL
i: -> iCL
o: -> oCL
u: -> uCL
y: -> yCL
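
The mapping above can be written as a small lookup table. This is a minimal sketch, not the script that was actually used: the table entries are copied from the list above, while the function names are made up for illustration. Phones not in the table are passed through unchanged.

```python
# Phone mapping copied from the list above (original phone -> renamed phone).
PHONE_MAP = {
    "#": "wb",
    "1": "one",
    "1:": "oneCL",
    "2": "two",
    "2:": "twoCL",
    "5": "five",
    "a:": "aCL",
    "e:": "eCL",
    "i:": "iCL",
    "o:": "oCL",
    "u:": "uCL",
    "y:": "yCL",
}


def map_phone(phone):
    """Return the renamed phone, or the phone itself if it has no mapping."""
    return PHONE_MAP.get(phone, phone)


def map_pronunciation(phones):
    """Apply the mapping to a list of phones, e.g. one lexicon entry."""
    return [map_phone(p) for p in phones]


def non_alphabetic(phones):
    """Phones that would still violate an alphabetic-only naming rule."""
    return [p for p in phones if not (p.isascii() and p.isalpha())]
```

Running non_alphabetic over the mapped phone set is a cheap way to confirm that no digits or punctuation remain in the phone names.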

A baseline voice trained on this data is currently in progress.