Multimodal Tools for Speech and Language Processing

 

NLP tools

·      Word embeddings: GloVe (https://nlp.stanford.edu/projects/glove/), Word2Vec (https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa), BERT (https://pypi.org/project/bert-embedding/), ELMo (https://allennlp.org/elmo), RoBERTa

·      Stanford NLP software (https://nlp.stanford.edu/software/)

·      Unigrams, bigrams, trigrams

·      Linguistic Inquiry and Word Count (LIWC) (https://repositories.lib.utexas.edu/bitstream/handle/2152/31333/LIWC2015_LanguageManual.pdf)

·      POS tags: NLTK toolkit (https://www.nltk.org)

·      Morphological analysis:

o   Polyglot: https://polyglot.readthedocs.io/en/latest/MorphologicalAnalysis.html

o   Morfessor : https://morfessor.readthedocs.io/en/latest/

o   LegaliPy: https://github.com/henchc/syllabipy

·      Flesch reading ease and other readability formulas (Kincaid et al 1975)

·      Speciteller: Specificity score (Li and Nenkova 2015)

·      Concreteness score (Brysbaert et al 2014)

·      Dictionary of Affect and revised 2009 version (Whissell 1989, Whissell 2009)

·      Hedge words and phrases (Ulinski et al 2018)

·      textstat: tools to extract readability measures from text (readability, complexity, and grade level)

·      Tools to restore punctuation in unpunctuated text/ASR results:

o   Punctuator

o   Bert-restore-punctuation

o   fastPunct

o   Ottokart/punctuator2

·      Information on NLP for Chinese data

·      Polyglot (multilingual text processing toolkit extracted from 136 languages in Wikipedia)

·      Other useful text features:

o   Number of filled pauses

o   Response latency

o   False starts and other speech disfluencies

o   Repetitions

o   Lexical diversity: determined by type/token ratio

o   Creativity: similarity of this response to other responses

·      Sentiment lexicons

o   The General Inquirer (Stone et al. 1966)

§  Positive (1915), Negative (2291), Strong vs Weak, Pleasure, Pain, etc.

o   MPQA Subjectivity Cues Lexicon

§  2718 positive, 4912 negative

o   Bing Liu Opinion Lexicon

§  2006 positive, 4783 negative

o   Product reviews on Amazon

§  Multidomain sentiment analysis dataset

§  Amazon product data, 143 million reviews

o   Movie reviews on IMDB

§  Cornell movie review data, labeled with sentiment polarity, scale, and subjectivity

§  Large Movie Review Dataset v1.0, 50k movie reviews (25k training, 25k test)

§  IMDB Movie Reviews Dataset, 50k movie reviews

§  Bag of Words Meets Bags of Popcorn, 50k movie reviews

o   Reviews from Rotten Tomatoes

§  Stanford Sentiment Treebank, about 11.8k sentences from movie reviews

o   Tweets with emoticons

§  Sentiment140, 1.6 million tweets

o   Twitter data on US airlines

§  Twitter US Airline Sentiment, with negative reasons (e.g. “rude service”)

o   Paper reviews

§  Paper Reviews

o   SentiWordNet

§  WordNet synsets automatically labeled with positivity, negativity, and objectiveness

o   NRC Word-emotion Association Lexicon (Mohammad and Turney 2011)

§  Labeled by Turkers for joy, sadness, anger, fear, trust, disgust, anticipation, surprise

o   Lexicon of Valence, Arousal, and Dominance (Warriner et al 2013)

§  AMT ratings of 14k words

o   Sentiment in Twitter (Go et al 2009) (Kouloumpis et al 2011)

o   Emoji in Twitter (Felbo et al 2017)

o   Attention Modeling for Targeted Sentiment (Liu and Zhang 2017)

o   BERT in Sentiment Analysis (Google AI Language)
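
Several of the simpler text features listed above (unigrams, bigrams, trigrams, and lexical diversity as a type/token ratio) need no external toolkit. The following is a minimal sketch using only the Python standard library. The function names `ngrams` and `type_token_ratio`, the example sentence, and the choice of lowercasing plus whitespace tokenization are assumptions made for illustration; a real pipeline would more likely use a tokenizer such as the one in NLTK.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def type_token_ratio(tokens):
    """Lexical diversity: number of distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical example response; whitespace tokenization is an assumption.
tokens = "the cat sat on the mat".lower().split()

unigram_counts = Counter(ngrams(tokens, 1))  # counts keyed by 1-tuples
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
ttr = type_token_ratio(tokens)
```

Note that the type/token ratio is sensitive to text length, so it is usually safest to compare it across responses of similar length.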

 

Speech approaches

·      Aeneas: text/speech alignment (https://www.readbeyond.it/aeneas/)

·      End2End speech processing tools: ESPnet

·      MFCC features

·      Speech processing benchmark: SUPERB

·      Acoustic-prosodic features

o   OpenSMILE (https://www.audeering.com/opensmile/)

o   Parselmouth (https://parselmouth.readthedocs.io/en/stable/)

o   Praat (https://www.fon.hum.uva.nl/praat/)

o   Prosodic labeling and detection:

§  http://www.speech.cs.cmu.edu/tobi/

§  https://www.ling.ohio-state.edu/research/phonetics/E_ToBI/

o   Prosodic analysis:  AuToBI – A Tool for Automatic ToBI annotation (https://github.com/AndrewRosenberg/AuToBI)

§  PyToBI: ToBI labeling with Python

o   Video series in speech acoustics:

·      Denoising

o   To remove background noise or music: Spleeter, Descript, Audacity, CMGAN, SEMamba

o   Denoising script (multiple methods included)

·      Audio analysis: Librosa

·      Timestamped transcription: https://transkriptor.com/audio-to-text-timestamps/

·      ASR

o   Kaldi (https://github.com/kaldi-asr/kaldi)

o   Google Cloud Speech-to-Text (https://cloud.google.com/speech-to-text)

o   Free: Kaldi, Simon, Mozilla DeepSpeech, Whisper, WhisperX, Coqui, SpeechBrain, Wav2letter

o   Paid: Labellerr, Google Cloud Speech-to-Text, Deepgram, Express Scribe, Microsoft Azure, Nuance Dragon, IBM Watson, Verbit, …

o   And more: https://www.goodfirms.co/blog/best-free-open-source-speech-recognition-software

o   Basic information: https://cmusphinx.github.io/wiki/tutorialconcepts/

o   https://www.goodfirms.co/speech-recognition-software/blog/best-free-open-source-speech-recognition-software

o   https://en.speechocean.com/about/newsdetails/62.html

o   https://fosspost.org/open-source-speech-recognition/

·      TTS

o   Simon King Merlin video tutorial:  http://www.speech.zone/courses/one-off/merlin-interspeech2017/

o   http://www.cs.cmu.edu/~awb/synthesizers.html

o   NaturalReader,  Balabolka, Panopreter Basic, WordTalk, Zabaware Text-to-Speech Reader, WaveNet and WaveNet2, Deep Voice AI, Tacotron2, Tacotron-3, Re-Flow-TTS,  WaveGlow, Deepmind TTS, Cartesia

o   Noise reduction: (https://dl.acm.org/doi/10.1145/2964284.2967306)

o   Calculating spectral centroids

o   Median filtering

·      Old and new speech software:  

o   SoX conversion software: http://sox.sourceforge.net

o   http://linux-sound.org/speech.html

·      Spectrogram reading practice:

o   https://home.cc.umanitoba.ca/~robh/howto.html

o   https://linguistics.ucla.edu/people/hayes/103/SpectrogramReading/index.htm
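
Two of the signal-processing items listed above, median filtering and spectral centroids, are small enough to sketch in plain Python. The code below is an illustrative sketch, not the method of any tool named in this list: the function names, the truncated-window handling at the signal edges, the naive DFT, and the synthetic 1000 Hz test tone are all assumptions. For real audio, a library such as Librosa would be the practical choice.

```python
import cmath
import math
from statistics import median


def median_filter(signal, width=3):
    """Replace each sample with the median of a window centred on it.

    Windows are truncated at the signal edges (an assumption; padding
    is another common choice).
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    return [median(signal[max(0, i - half):i + half + 1])
            for i in range(len(signal))]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of one frame, via a naive DFT."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        magnitudes.append(abs(coeff))
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum((k * sample_rate / n) * m
               for k, m in enumerate(magnitudes)) / total


# A single spike is removed by a width-3 median filter.
smoothed = median_filter([1, 1, 9, 1, 1], width=3)

# Synthetic 1000 Hz tone that falls exactly on a DFT bin (8000 Hz, 64 samples).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
centroid_hz = spectral_centroid(frame, sample_rate)
```

The naive DFT is quadratic in the frame length, which is fine for a short demonstration frame but too slow for full recordings; an FFT-based implementation is the usual replacement.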

 

Visual features

·      Fisher Vector encoding (FV) (https://papers.nips.cc/paper/1998/file/db1915052d15f7815c8b88e879465a1e-Paper.pdf)

·      Vector of Locally Aggregated Descriptors (VLAD) (https://lear.inrialpes.fr/pubs/2010/JDSP10/jegou_compactimagerepresentation.pdf)

·      Facial expression detection (FED) (https://www.jstor.org/stable/30204706?seq=1#metadata_info_tab_contents)

 

Statistical measures and z-score normalization

·      Pearson’s correlation

·      Krippendorff’s alpha

·      More here on ANOVA, Kruskal-Wallis test, regression, t-tests, Wilcoxon signed-rank test, F1 scores and z-score normalization: http://www.cs.columbia.edu/~julia/courses/Resources/Stats.pdf
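
As a quick illustration of two of the measures named above, the sketch below computes z-score normalization and Pearson's correlation with only the Python standard library. The function names, the use of the population standard deviation, and the example pitch values are assumptions made for illustration. In speech work, z-scoring is often applied per speaker so that features such as pitch are comparable across speakers.

```python
import math


def z_scores(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0] * len(values)  # constant input: no spread to normalize
    return [(v - mean) / std for v in values]


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Hypothetical mean pitch values (Hz) for five utterances by one speaker.
pitch_hz = [100, 110, 120, 130, 140]
pitch_z = z_scores(pitch_hz)

# Perfectly linear toy data gives r = 1 up to floating-point error.
r = pearson_r([1, 2, 3], [2, 4, 6])
```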

 

Machine Learning

·      Weka

·      Scikit-learn (https://scikit-learn.org/stable)

·      Deep learning models

o   ChatGPT: GPT-3.5, GPT-4 (zero-shot, fine-tuned) (OpenAI 2023)

o   Llama 2 (zero-shot, fine-tuned) (Touvron et al 2023)

o   PaLM 2 (Anil et al 2023)

o   Alpaca-LoRA (Hu et al 2021)

o   Transfer learning w/teacher-student network – many papers

o   Many BERT uses

o   Multi-Modality Multi-Loss Fusion Network: end2end model that optimizes both feature extraction and the ML process; used for corpora with multiple modalities (Multimodal Learning with Transformers: A Survey, Xu, Zhu, and Clifton 2023)

o   ImageBind (https://imagebind.metademolab.com)

o   Tensor Fusion Network (TFN): for emotion, with PyTorch (tutorials for PyTorch)

o   MulT: a transformer encoder with cross-modal attention

o   Reading-comprehension datasets:

§  MultiSpanQA (Li et al 2022)

§  SQuAD (Rajpurkar et al 2016)

§  Quoref (Dasigi et al 2019)

·      Some other potentially useful papers:

 

·      https://www.aclweb.org/anthology/W16-0301.pdf

 

·      https://www.aclweb.org/anthology/W17-3101.pdf

 

·      http://www.cs.columbia.edu/speech/PaperFiles/2019/clpsych19.pdf

 

·      http://www.cs.columbia.edu/speech/PaperFiles/2010/Hirschberg_etal2010.pdf

 

·      https://docs.google.com/spreadsheets/d/1xDiuK4l5JJvwZKI6kDlMwcO1Twe7h7ItgnpBfmO75wA/edit#gid=0