Trustworthy Voices Project
We are using BURNC data and features that correlate with trusted or
untrusted speech. The features are grouped below by whether a high
value corresponds with trustworthiness or with untrustworthiness:
Trusted:
- vcd2tot
- min energy
- shimmer
Untrusted:
- max f0
- mean f0
- median f0
- stdv f0
- max energy
- stdv energy
Basic Voice
We did a first pass at experimentally training a voice with these switches built in,
so that it can synthesize in "trusted" or "untrusted" styles. There are many
things we could experiment with doing differently, so the basic
process is documented here, along with suggestions for future experiments
in each step.
1. Feature extraction using Praat
You shouldn't really have to redo this unless you want to look at
different features. See these two scripts in /proj/tts/data/english/brn/trustworthy/scripts/:
extractAcousticFeatures.praat
extractVoiceQualityFeatures.praat
Output is under trustworthy/ftrs/ (raw feature csv files).
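If the extraction ever does need to be redone, it can be scripted along the
following lines. This is only a hedged sketch: the argument lists the two
Praat scripts actually expect are not documented here, so the WAV_DIR path and
the argument order are assumptions to adjust before use.

    # Hypothetical re-run of step 1; adjust the script arguments to match what
    # the Praat scripts actually expect.
    import subprocess

    SCRIPT_DIR = "/proj/tts/data/english/brn/trustworthy/scripts"
    WAV_DIR = "/proj/tts/data/english/brn/trustworthy/wav"   # assumed audio location
    OUT_DIR = "/proj/tts/data/english/brn/trustworthy/ftrs"

    for script in ("extractAcousticFeatures.praat", "extractVoiceQualityFeatures.praat"):
        # "praat --run" executes a Praat script non-interactively.
        subprocess.run(["praat", "--run", f"{SCRIPT_DIR}/{script}", WAV_DIR, OUT_DIR],
                       check=True)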
2. Z-score normalization
The lab's prior studies on deceptive speech examined features normalized by
speaker, so we normalize the features by speaker here as well.
See scripts/zscore.py; output is ftrs/*_znorm.csv.
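For reference, per-speaker z-scoring amounts to the following. This is a
minimal pandas sketch, not the actual scripts/zscore.py; the "speaker" column
and the feature/file names are assumptions.

    # Minimal sketch of per-speaker z-score normalization (not the actual
    # scripts/zscore.py); column and file names are assumptions.
    import pandas as pd

    FEATURES = ["vcd2tot", "min_energy", "shimmer", "max_f0", "mean_f0",
                "median_f0", "stdv_f0", "max_energy", "stdv_energy"]

    df = pd.read_csv("ftrs/features.csv")   # hypothetical raw feature csv

    # z-score each feature within each speaker: (x - speaker mean) / speaker stdev
    df[FEATURES] = df.groupby("speaker")[FEATURES].transform(
        lambda col: (col - col.mean()) / col.std())

    df.to_csv("ftrs/features_znorm.csv", index=False)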
3. Thirds partitions
In accordance with our prior work on using frontend features to alter the
style of synthesized speech, we decided to set thresholds for what counts as a
"high," "medium," or "low" value for each feature such that each partition
contains a third of the data. This is very simplistic and there are probably
better ways to partition the data, for example based on standard deviations
around the mean, or based on the feature value at which it becomes salient for
trustworthiness. These are the steps for getting hi/med/lo partitions of the
data based on each feature (a minimal sketch of the partitioning follows the
list):
- scripts/make_ascending.py: Sorts the utterances from
low to high based on each zscore-normalized feature, one at a time.
Output is in ftrs/ascending/*.csv.
- scripts/make_thirds_subsets.py: Breaks up the data
into approximately thirds, based on each feature. Output is
in sets_thirds/z*.[hi,med,lo].
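The partitioning itself is simple enough to sketch for one feature. The real
logic lives in the two scripts above; the "utt_id" column name and the
"max_f0" feature name here are illustrative assumptions.

    # Illustrative hi/med/lo partitioning for a single feature.
    import pandas as pd

    feature = "max_f0"
    df = pd.read_csv("ftrs/features_znorm.csv")

    # Sort utterances from low to high on the z-score-normalized feature.
    ascending = df.sort_values(feature).reset_index(drop=True)
    ascending.to_csv(f"ftrs/ascending/{feature}.csv", index=False)

    # Break into approximate thirds: bottom third -> lo, middle -> med, top -> hi.
    n = len(ascending)
    subsets = {"lo": ascending.iloc[:n // 3],
               "med": ascending.iloc[n // 3:2 * n // 3],
               "hi": ascending.iloc[2 * n // 3:]}

    for name, subset in subsets.items():
        subset["utt_id"].to_csv(f"sets_thirds/z{feature}.{name}",
                                index=False, header=False)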
4. Label files
Label files are needed for voice training, and the method we are using
to alter the speaking style relies on augmenting the frontend label
files with this additional information about acoustic and prosodic
features. The fullcontext label format represents a list of phonemes
in context for one utterance. The
different context elements are extracted by a frontend tool (we use
Festival). The label file basically has one phoneme-in-context per
line (with the start and end times of that phoneme in units of ten-millionths
of a second, i.e., 100 ns), and each context element on the line is separated by
unique delimiters to enable pattern matching. See lab_format.pdf from
the HTS demo for more information on the format and the standard
contextual features. We are adding features at the utterance level;
that is, every phoneme in the utterance gets the same value (a rough sketch of
this augmentation follows the list below).
- The original Merlin-formatted label files for BURNC are here: orig_merlin_labels/*.lab
- Make an intermediate helper file
using scripts/features_per_utt.py; output
is sets_thirds/utt_labels.txt.
- Run scripts/add_labels.py to augment the original files
with the new features; output is
in sets_thirds/labels/*.lab.
- Test synthesis label files are also needed; originals are
in orig_test_labels/*.lab.
- Make 'trusted' and 'untrusted' versions of the label
files with flags set properly for each new feature: scripts/augment_test_labels.py. Output is in augmented_test_labels/*.lab.
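To make the utterance-level augmentation concrete, the sketch below appends
the same feature flags to every line of a label file. The "/T<n>:<value>"
delimiter style is invented purely for illustration; the real delimiters are
whatever scripts/add_labels.py and questions-trustworthy.hed agree on.

    # Rough sketch of utterance-level label augmentation (not the actual
    # scripts/add_labels.py); the "/T<n>:<value>" delimiters are invented.
    def augment_label_file(in_path, out_path, feature_values):
        """feature_values: e.g. {"maxf0": "hi", "vcd2tot": "lo", ...}.
        Every phoneme line in the utterance gets the same suffix appended."""
        suffix = "".join(f"/T{i}:{feature_values[name]}"
                         for i, name in enumerate(sorted(feature_values), start=1))
        with open(in_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                fout.write(line.rstrip("\n") + suffix + "\n")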
5. Questions file
In order for the new features in the label files to actually get used
in voice training, they have to be included in the questions
file. The questions file is used to parse the label files and feed features
to the model: each line in the label file is converted into a binary
representation corresponding to the answer to each yes/no question, where the
answer is "yes" (1) if the pattern indicated in the question matches the
label.
The default English question file is
in $MERLINDIR/misc/questions/questions-radio_dnn_416.hed.
On the left side is the name of the question (e.g. "LL-Vowel" is
basically asking, "Is the phoneme two to the left of the current one a
vowel?") and on the right side is a list of things for which a match
would mean the answer is "yes" (e.g. all vowels in the English
phoneset). The fullcontext label file uses symbols to delimit the
different features, which is why everything pertaining to, e.g., the
current phoneme (questions starting with "C-") has its possible
matches placed between - and +; that is how the current phoneme is
located in the label, according to lab_format.pdf.
We made a custom questions file for this project by just copying the
default English questions file and adding new questions pertaining to
our new features -- this questions file is here: /proj/tts/tools/ecooper/merlin/misc/questions/questions-trustworthy.hed
The questions that were added are at the top of the file, starting
with Trust.
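As a concrete picture of how a question turns into a model input: if any of a
question's patterns matches the fullcontext label line, the corresponding
binary feature is 1, otherwise 0. The label text and pattern below are
invented examples (the actual Trust questions are in
questions-trustworthy.hed), and the fnmatch call is only an approximation of
the HTS-style wildcard matching.

    # Rough illustration of question matching as a 0/1 feature.
    from fnmatch import fnmatch

    def answer(label_line, patterns):
        # 1 if any wildcard pattern matches the label line, else 0.
        return 1 if any(fnmatch(label_line, p) for p in patterns) else 0

    label_line = "0 250000 x^x-ao+th=er@1_2/T1:hi/T2:lo"   # invented example
    question = ["*/T1:hi*"]   # "is this utterance's first trust flag 'hi'?"

    print(answer(label_line, question))   # -> 1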
The representation in the questions file is a good area for future
experimentation. For instance, right now the qfile only checks
whether a given feature value is "hi" or "not hi" -- basically a
binary switch. Merlin allows for numeric features as well as discrete
symbolic pattern-match-type features -- see the CQS question type at the end
of the default English qfile, and see also the documentation section on
continuous numerical features for more info. Since it is common practice to
normalize numeric features, it may make sense to use the actual
z-score-normalized feature values for each utterance in place of the 'hi,'
'med,' and 'lo' symbolic values. As a somewhat simpler experiment, it may also
be worth using 1, 0, and -1 in place of hi, med, and lo in a numeric-feature
setting, and then seeing whether extrapolation can be done (e.g., setting a
value of '2' in the test label files); in theory this is possible, but we have
not tried it.
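For the simpler numeric experiment just described, the hi/med/lo symbols would
be mapped to numbers when the labels are written, and the corresponding
questions would become CQS (numeric) rather than QS (pattern-match) questions.
A hypothetical value mapping:

    # Hypothetical mapping for the "1 / 0 / -1 instead of hi / med / lo" idea;
    # only the label-side value mapping is sketched here.
    NUMERIC = {"hi": "1", "med": "0", "lo": "-1"}

    def numeric_feature_values(symbolic_values):
        """e.g. {"maxf0": "hi", "vcd2tot": "lo"} -> {"maxf0": "1", "vcd2tot": "-1"}.
        At synthesis time an out-of-range value such as "2" could be written
        directly into the test labels to probe extrapolation."""
        return {name: NUMERIC.get(value, value)
                for name, value in symbolic_values.items()}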
6. Voice Training
We based voice training on the basic "build your own voice" recipe in
Merlin. The voice training directory is
here: /proj/tts/tools/ecooper/merlin/egs/trustworthy/s1. The
training recipe is thirds_voice.sh, which is run by
uncommenting each section one by one. The test synthesis output
is in experiments/thirds_voice/test/synthesis/wav/*.wav.
If you train a new voice based on this recipe, the important things to
change are the voice name (so the old voice doesn't get overwritten),
and the label files and questions file if new ones were created.
7. Synthesis
See this script:
/proj/tts/tools/ecooper/merlin/egs/trustworthy/s1/synthesize.py
It will synthesize your input sentences in both trusted and untrusted styles.