More recent projects can be found on the Speech Lab page.
Past Projects
Emotional Speech
Jennifer Venditti, Jackson Liscombe, and I have looked at methods of eliciting both subjective and objective judgments of emotional speech, and of correlating judgments of single tokens on multiple emotion scales -- i.e., if subjects rate a token high for frustration, which other emotional states do they also rate it high (or low) for ("Classifying Subjective Ratings of Emotional Speech," Eurospeech 2003). We conducted eye-tracking experiments that allowed us to compare subjective judgments to more objective cues to the decision process. We have also worked with colleagues at the University of Pittsburgh to study speaker state in student speech in a tutoring system, targeting emotional states such as anger, frustration, confidence, and uncertainty ("Detecting Certainness in Spoken Tutorial Dialogues," INTERSPEECH 2005). We have also studied question form and function in this domain and performed machine learning experiments to identify question-bearing turns, and to classify their form and function, automatically (“Detecting question-bearing turns in spoken tutorial dialogues” and “Intonational cues to student questions in tutoring dialogs”, INTERSPEECH 2006).
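As a rough illustration of the kind of cross-scale rating correlation described at the start of this section (the actual analysis in the Eurospeech 2003 paper may differ), here is a minimal sketch using made-up per-token ratings on several emotion scales:

import numpy as np
from scipy.stats import spearmanr

# Hypothetical ratings: rows are speech tokens, columns are emotion scales.
scales = ["frustration", "anger", "sadness", "confidence"]
ratings = np.array([
    [4, 5, 2, 1],
    [5, 4, 3, 1],
    [1, 1, 2, 5],
    [2, 1, 1, 4],
    [3, 3, 4, 2],
])

# Spearman rank correlation between every pair of emotion scales:
# high positive values mean tokens rated high on one scale tend to be
# rated high on the other as well.
rho, _ = spearmanr(ratings)
for i in range(len(scales)):
    for j in range(i + 1, len(scales)):
        print(f"{scales[i]} vs {scales[j]}: {rho[i, j]:+.2f}")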
Agus Gravano, Elisa Sneed, Gregory Ward and I
have also looked at intonational contour and syntactic
construction in the conveyance of speaker certainty (“The
effect of contour type and epistemic modality on the
assessment of speaker certainty”, Speech Prosody
2008), and Frank Enos and I
have proposed a new methodology for eliciting emotional
speech in “A
framework for eliciting emotional speech: Capitalizing
on the actor's process”, LREC Workshop on
Emotional Corpora.
Deceptive Speech
Frank
Enos, Stefan Benus, and I are working with colleagues at
SRI/ICSI and the University of Colorado on automatic
methods of distinguishing deceptive from non-deceptive
speech ("Distinguishing Deceptive from
Non-Deceptive Speech," INTERSPEECH 2005; “Detecting
deception using critical segments”, INTERSPEECH
2007). For this work
we collected and annotated a large corpus of deceptive
and non-deceptive speech, the CSC Deception Corpus. We
have also looked at the role of pausing in deception (“Pauses
in deceptive speech”, Speech
Prosody 2006) and examined the role of personality in
the accuracy of human judges of deception (“Personality
factors in human deception detection: Comparing human
to machine performance”, INTERSPEECH 2006).
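As a loose sketch of the general classification setup for this line of work (not the features, corpus, or learners used in the papers above), one might train a classifier over per-utterance acoustic/prosodic features such as pausing statistics; all values below are invented:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical per-utterance features: [pause rate (pauses/sec),
# mean pause duration (sec), speaking rate (syll/sec), mean pitch (Hz)].
X = np.array([
    [0.8, 0.45, 3.1, 190.0],
    [0.3, 0.20, 4.2, 175.0],
    [0.9, 0.50, 2.9, 210.0],
    [0.2, 0.15, 4.5, 170.0],
    [0.7, 0.40, 3.3, 200.0],
    [0.4, 0.25, 4.0, 180.0],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = deceptive, 0 = non-deceptive (toy labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Cross-validated accuracy on the toy data (meaningless at this size,
# shown only to illustrate the evaluation setup).
print(cross_val_score(clf, X, y, cv=3).mean())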
Charismatic Speech
Andrew Rosenberg, Fadi Biadsy, and I are studying the acoustic, prosodic, and lexical cues to charismatic speech in American English ("Acoustic/Prosodic and Lexical Correlates of Charismatic Speech", INTERSPEECH 2005). With Fadi Biadsy we have extended this work to Palestinian Arabic, and with Rolf Carlson (KTH) and Eva Strangert (Umeå) we have investigated cross-cultural perceptions of charisma and their acoustic, prosodic, and lexical features (“A cross-cultural comparison of American, Palestinian, and Swedish perception of charismatic speech”, Speech Prosody 2008).
Speech Summarization and Distillation
With Sameer Maskey, Andrew
Rosenberg, and Fadi Biadsy, I have worked on speech
summarization, exploring new techniques which take
advantage of prosodic and acoustic information, in
addition to lexical and structural cues, to 'gist' a news broadcast (“Automatic
speech summarization of broadcast news using
structural features”, EUROSPEECH 2003; "Comparing Lexical,
Acoustic/Prosodic, Structural and Discourse Features
for Speech Summarization," INTERSPEECH
2005; "Summarizing Speech
without Text Using Hidden Markov Models,"
HLT/NAACL 2006; and “Intonational Phrases for Speech
Summarization”, INTERSPEECH 2008). We have also
looked at the segmentation of news broadcasts into
stories ("Story
Segmentation of Broadcast News in English, Mandarin
and Arabic" HLT/NAACL 2006), the
determination of speaker roles (e.g. anchor, reporter,
interviewee ) (See R. Barzilay et al., "Identification of
Speaker Role in Radio Broadcasts", AAAI
2000 for earlier work.), and the extraction of soundbites
from
broadcasts (spoken ‘quotes’ included in a show) and
identification of their speaker, . “An
unsupervised approach to biography production using
Wikipedia”, ACL/NAACL 2008. Elena
Filatova, Martin Jansche, Mehrbod
Sharifi, and Wisam Dakka are co-authors of some of
this work also.
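For the HMM-based summarization mentioned above, the sketch below shows the general idea only: two hidden states (summary vs. non-summary sentence) with Gaussian emissions over per-sentence acoustic features, decoded with Viterbi. The features, parameters, and state design are illustrative assumptions, not the published model.

import numpy as np

# Toy per-sentence acoustic/prosodic feature vectors (e.g., normalized
# duration, mean pitch, mean energy); values are made up for the sketch.
feats = np.array([
    [1.2, 0.8, 0.9],
    [0.3, 0.1, 0.2],
    [1.1, 0.9, 1.0],
    [0.2, 0.2, 0.1],
])

# Two hidden states: 0 = non-summary sentence, 1 = summary sentence.
# Means/variances would normally be estimated from labeled training data.
means = np.array([[0.3, 0.2, 0.2], [1.0, 0.8, 0.9]])
var = np.full((2, 3), 0.1)
log_trans = np.log(np.array([[0.7, 0.3], [0.5, 0.5]]))  # P(state_t | state_{t-1})
log_start = np.log(np.array([0.6, 0.4]))

def log_gauss(x, mu, v):
    """Log density of a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (x - mu) ** 2 / v)

# Viterbi decoding: most likely state sequence given the observations.
n, k = len(feats), 2
score = np.full((n, k), -np.inf)
back = np.zeros((n, k), dtype=int)
for s in range(k):
    score[0, s] = log_start[s] + log_gauss(feats[0], means[s], var[s])
for t in range(1, n):
    for s in range(k):
        prev = score[t - 1] + log_trans[:, s]
        back[t, s] = np.argmax(prev)
        score[t, s] = prev[back[t, s]] + log_gauss(feats[t], means[s], var[s])

states = [int(np.argmax(score[-1]))]
for t in range(n - 1, 0, -1):
    states.append(back[t, states[-1]])
states.reverse()

summary_sentences = [i for i, s in enumerate(states) if s == 1]
print("Sentences selected for the summary:", summary_sentences)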
Spoken Dialogue Systems
The Columbia Games Corpus
Agus Gravano, Stefan Benus, and I have been collecting and analyzing a large corpus of spontaneous dialogues produced by subjects playing a computer game we created. We collected this data to test several theories of the way speakers produce ‘given’ (as opposed to ‘new’) information. We are currently labeling this corpus for intonation in the ToBI framework; we have also labeled turn-taking behaviors, cue phrases, questions (identified as to form and function), and other aspects of the corpus. This is joint work with Gregory Ward and Elisa Sneed at Northwestern University.
Cue Phrases
Work on cue
phrases, or discourse markers, is described in Julia
Hirschberg and Diane Litman, "Empirical Studies on the
Disambiguation of Cue Phrases," Computational
Linguistics, 1992; some figures are missing in this
version. More recently, Agus Gravano, Stefan Benus, Lauren Wilcox, Hector Chavez, Shira Mitchell, Ilia Vovsha, and I have been looking at cue phrase production
and detection in the Games corpus (“On
the role of context and prosody in the interpretation
of okay”, ACL 2007; “Classification
of discourse functions of affirmative words in spoken
dialogue”, Interspeech 2007; “The
prosody of backchannels in American English”,
ICPhS 2007).
Speaker Entrainment
Ani Nenkova,
Agus Gravano and I are looking at various types of
speaker entrainment in the Games Corpus (“High frequency
word entrainment in spoken dialogue”, ACL 2008). We are also
examining acoustic/prosodic entrainment.
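One rough way to quantify lexical entrainment (not necessarily the measure used in the ACL 2008 paper) is to compare how similarly the two speakers use the most frequent words in their dialogue:

from collections import Counter

def word_entrainment(turns_a, turns_b, k=25):
    """Toy lexical entrainment score in [-1, 0]: 0 means the two speakers
    use the k most frequent dialogue words in identical proportions."""
    tokens_a = [w.lower() for t in turns_a for w in t.split()]
    tokens_b = [w.lower() for t in turns_b for w in t.split()]
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # The k most frequent words in the dialogue as a whole.
    top = [w for w, _ in (counts_a + counts_b).most_common(k)]
    score = 0.0
    for w in top:
        fa = counts_a[w] / max(len(tokens_a), 1)
        fb = counts_b[w] / max(len(tokens_b), 1)
        if fa + fb > 0:
            score -= abs(fa - fb) / (fa + fb)
    return score / len(top)

# Hypothetical turns from two speakers in an Objects-game-style dialogue.
a = ["okay so the lion goes above the mermaid", "yeah okay got it"]
b = ["okay the lion is above the mermaid", "okay yeah"]
print(word_entrainment(a, b))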
The Given/New Distinction
Agus Gravano,
Ani Nenkova, Gregory Ward, Elisa Sneed and I have
studied the different ways speakers produce ‘given’ vs.
‘new’ information in “Effect
of genre, speaker, and word class on the realization
of given and new information”, INTERSPEECH 2006
and “Intonational
overload: Uses of the H* !H* L- L% contour in read and
spontaneous speech”, Laboratory Phonology 9.
Misrecognitions, Corrections, and Error Awareness
Diane Litman, Marc Swerts, and I have studied the prosodic consequences of recognition errors in Spoken Dialogue Systems. In particular, we have examined whether prosodic features of user utterances can tell us whether a speech recognition error has occurred, as a user reacts to it (e.g. System: "Did you say you want to go to ...").
Predicting Prosodic Events
Intonational Variation in Synthetic Speech
Most of my
early work on predicting intonational phrase boundaries
and prominences was done in the Text-to-Speech synthesis
group at Bell Labs.
Some papers describing that work are Philipp
Koehn, Steven Abney, Julia Hirschberg, and Michael
Collins, "Improving
Intonational Phrasing with Syntactic Information,"
ICASSP-00; Julia Hirschberg and Pilar Prieto, "Training
intonational phrasing rules automatically for English
and Spanish Text-to-Speech," Speech Communication,
1996; Julia Hirschberg, "Pitch Accent in Context: Predicting
Intonational Prominence from Text," Artificial
Intelligence, 1993; and Michelle Wang and Julia
Hirschberg, "Automatic
Classification of Intonational Phrase Boundaries,"
Computer Speech and Language, 1992. These
methods were used to assign intonational variation
automatically in the Bell Labs Text-to-Speech System. I also
collaborated on two projects in concept-to-speech
generation (generating speech from an abstract
representation of the concepts to be conveyed), one of them with Shimei Pan and Kathy McKeown of Columbia.
Detecting Prosodic Events
More recent
work on prosody detection has been done with Andrew
Rosenberg, who has developed new ways to combine
energy-based features with other acoustic and lexical
features to achieve very high accuracy in prediction. Papers documenting this work include “On the correlation between energy and pitch accent in read English speech”, INTERSPEECH 2006, and “Detecting pitch accent using pitch-corrected energy-based predictors”, INTERSPEECH 2007.
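As a loose illustration of what energy-based features can look like (not the feature set or pitch corrections from these papers), one might compute frame-level RMS energy and summarize it over each word, assuming word time boundaries are available from a forced alignment:

import numpy as np

def frame_rms_energy(samples, sr, frame_ms=25, hop_ms=10):
    """Frame-level RMS energy of a mono waveform (toy implementation)."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = max(1 + (len(samples) - frame) // hop, 0)
    return np.array([
        np.sqrt(np.mean(samples[i * hop:i * hop + frame] ** 2))
        for i in range(n_frames)
    ])

def word_energy_features(energy, word_bounds, hop_ms=10):
    """Mean and max frame energy per word, z-scored across the utterance.
    word_bounds is a list of (start_sec, end_sec) pairs, assumed given."""
    mu, sigma = energy.mean(), energy.std() + 1e-8
    feats = []
    for start, end in word_bounds:
        lo, hi = int(start * 1000 / hop_ms), max(int(end * 1000 / hop_ms), 1)
        seg = energy[lo:hi] if hi > lo else energy[lo:lo + 1]
        feats.append([(seg.mean() - mu) / sigma, (seg.max() - mu) / sigma])
    return np.array(feats)

# Synthetic example: 1 second of noise at 16 kHz with two "words",
# the second one louder (so it should get higher energy features).
sr = 16000
rng = np.random.default_rng(0)
samples = rng.normal(0, 0.1, sr) * np.concatenate(
    [np.ones(sr // 2), 3 * np.ones(sr - sr // 2)])
energy = frame_rms_energy(samples, sr)
print(word_energy_features(energy, [(0.0, 0.5), (0.5, 1.0)]))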
Audio Browsing and Retrieval
Work on our
SCAN
(Spoken Content-Based Audio Navigation) browsing and
retrieval system is summarized in John Choi et al., "Spoken
Content-Based Audio Navigation (SCAN)," ICPhS-99.
This project combines ASR and IR technology to enable
search of large audio databases, such as broadcast news
archives or voicemail. It started life as 'AudioGrep'.
Current collaborators include Steve Abney, Brian Amento,
Michiel Bacchiani, Phil Isenhour, Diane Litman, Larry
Stead, and Steve Whittaker. My particular interests lie
in the use of acoustic information to segment audio
(Julia Hirschberg and Christine Nakatani, "Acoustic
Indicators of Topic Segmentation," ICSLP-98) and
the study of how people browse and search audio
databases such as broadcast news collections (Steve
Whittaker et al., "SCAN: Designing and Evaluating User
Interfaces to Support Retrieval from Speech Archives",
SIGIR-99) and voicemail (Steve Whittaker, Julia
Hirschberg and Christine Nakatani, "Play it
again: a study of the factors underlying speech
browsing behavior," and Steve Whittaker, Julia
Hirschberg and Christine Nakatani, "All talk
and all action: strategies for managing voicemail
messages," both presented at CHI-98). We have also
studied how differences in ASR accuracy (comparing 100%,
84%, 69%, 50% accuracy transcripts) affect users'
ability to perform tasks, finding effects for transcript
accuracy on time to solution, amount of speech played,
the likelihood of subjects abandoning the transcript, and
various subjective measures; however, our results hold
only when we collapse our four categories into two;
i.e., there are no differences between perfect and 84%
accurate transcripts or between 69% and 50% accurate
ones (Litza Stark, Steve Whittaker, and Julia
Hirschberg, "ASR Satisficing: The effects of ASR
accuracy on speech retrieval", ICSLP-00).
In SCANMail, a new voicemail application now in friendly trial, we have ported SCAN technology to the voicemail domain: users are able to browse and retrieve
their voicemail by content. See J. Hirschberg et al., "SCANMail:
Browsing and Searching Speech Data by Content Domain"
and A. Rosenberg et al., "Caller Identification for the SCANMail
Voicemail Browser" (both presented at Eurospeech
2001). Meredith Ringel and I have also worked on ranking
voicemail messages as to urgency and distinguishing personal from business messages, using machine learning
techniques ("Automated Message Prioritization:
Making Voicemail Retrieval More Efficient,"
presented at CHI 2002).
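A minimal sketch of that kind of setup, treating prioritization as supervised classification of transcripts into urgent vs. non-urgent messages; the examples, features, and classifier here are invented placeholders, not those of the CHI 2002 work:

# Toy urgency classifier over voicemail transcripts, assuming ASR
# transcripts and urgency labels are available.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: (transcript, 1 = urgent, 0 = not urgent).
transcripts = [
    "please call me back as soon as possible it is urgent",
    "hi just checking in no rush talk whenever",
    "the meeting moved to nine you need to confirm today",
    "happy birthday hope you have a great day",
]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(transcripts, labels)

# Rank new messages by predicted urgency.
new_messages = ["call the client back before noon", "lunch next week maybe"]
scores = model.predict_proba(new_messages)[:, 1]
for msg, score in sorted(zip(new_messages, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {msg}")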
Intonation and Discourse Structure
Some results
of a long collaboration with Barbara Grosz and Christine
Nakatani on the intonational correlates of discourse
structure in read and spontaneous speech are described in
"A Prosodic
Analysis of Discourse Segments in Direction-Giving
Monologues," (ACL-96). The BDC corpus (with ToBI
labels) is available here.
Results of earlier studies of read speech are described
in "Some
Intonational Characteristics of Discourse Structure,"
(a reformatted version of ICSLP-92).
Intonational Disambiguation
Empirical
studies comparing the way native speakers of different
languages employ intonational variation to disambiguate
potentially ambiguous utterances are described in Julia
Hirschberg and Cinzia Avesani, "The Role of
Prosody in Disambiguating Potentially Ambiguous
Utterances in English and Italian," ESCA Tutorial
and Research Workshop on Intonation, Athens, 1997.
Disfluencies in Spontaneous Speech
Christine
Nakatani and Julia Hirschberg, "A Corpus-based
study of repair cues in spontaneous speech," JASA,
1994, describes studies of the acoustic/prosodic
characteristics of self-repairs.
Labeling Conventions and Labeled Corpora
I have been
an active participant in the development of the ToBI
Labeling Standard for the prosodic labeling of
Standard American English (see the ToBI conventions
for a quick overview). This standard was developed by
a number of researchers from industry and academia and
has been extended for other dialects of English and for
other languages, including Italian, German, Spanish,
Japanese and more. Interlabeler reliability ratings (see
John Pitrelli, Mary Beckman, and Julia Hirschberg, "Evaluation
of Prosodic Transcription Labeling Reliability in the
ToBI Framework," Proceedings of the Third
International Conference on Spoken Language Processing,
Yokohama, September 1994, pp. 123-126) are quite good, and there are tools and training materials available in PDF and HTML versions, along with Praat files. There is also a Wavesurfer
version and another Praat
version with cardinal examples done by Agus Gravano
and available from the Columbia
ToBI site. The Boston Directions Corpus
(with ToBI labels) is available here.
Julia Hirschberg
Percy K. and Vida L. W. Hudson Professor of Computer Science
Columbia University
Department of Computer Science
1214 Amsterdam Avenue
M/C 0401
450 CS Building
New York, NY 10027
email: julia@cs.columbia.edu
phone: (212) 853-8464