Title: The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language
Authors: Jian Zhu, Changbing Yang, Farhan Samir, Jahurul Islam
Published: 14th November 2023 (Tuesday) @ 17:09:07
Link: http://arxiv.org/abs/2311.08323v2

Abstract

In this project, we demonstrate that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages. We curated the IPAPACK, a massively multilingual speech corpus with phonemic transcriptions, encompassing more than 115 languages from diverse language families, selectively checked by linguists. Based on the IPAPACK, we propose CLAP-IPA, a multilingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences. The proposed model was tested on 95 unseen languages, showing strong generalizability across languages. Temporal alignments between phonemes and speech signals also emerged from contrastive training, enabling zero-shot forced alignment in unseen languages. We further introduced a neural forced aligner IPA-ALIGNER by finetuning CLAP-IPA with the Forward-Sum loss to learn better phone-to-audio alignment. Evaluation results suggest that IPA-ALIGNER can generalize to unseen languages without adaptation.



Quick Notes

  • adopt the same contrastive learning framework as CLAP (Wu et al., 2023): Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
  • SigLIP loss - a simpler sigmoid-based loss shown to be as effective as the softmax-based CLIP loss (Zhai et al., 2023): Sigmoid Loss for Language Image Pre-Training (a minimal loss sketch follows after this list)
  • speech encoder: Whisper encoder only (decoder dropped), initialised with pretrained Whisper weights + mean pooling over non-pad hidden states + SpecAugment
  • Phoneme
    • tokenizer: unigram SentencePiece Kudo 2018 Subword Regularization
    • encoder: BERT + 30% masking probability (higher than the usual 15%, since phoneme sequences are less complex than orthographic text)
  • Forced alignment:
    • adaptive pooling to average the BPE embeddings within a given token (mapping BPE units back to IPA characters)
    • the pairwise similarity matrix between speech and phonemes is used to derive a temporal monotonic alignment between phonetic units and speech frames through dynamic time warping (DTW), even though CLAP-IPA had never been trained on alignment labels (see the alignment sketch after this list)
    • Finetuning for FA: use Forward-Sum Loss, which has been shown to be effective in learning monotonic alignments between speech and phonemes (Shih et al., 2021; Badlani et al., 2022; Zhu et al., 2022c).
      • Forward Sum Loss: alignment learning loss function relies on the forward-sum algorithm in classic HMMs to maximize the likelihood of text sequence given speech sequences, while enforcing the monotonic constraint of alignment (see Shih et al. (2021) for detailed derivations).
      • requires a good prior alignment, so in practice a good initialisation (here, the pretrained CLAP-IPA weights)
  • Out-of-domain test: LibriPhrase, introduced in Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting
    • metrics: Equal Error Rate (EER) and Area Under the Curve (AUC)
  • Hold-out languages test (typologically diverse)
    • from MSWC-IPA and FLEURS-IPA:
      • Vietnamese
      • Tamil
      • Georgian
      • Hausa (Chadic language spoken by the Hausa people in northern Nigeria, Ghana, Cameroon, Benin and Togo, southern parts of Niger, and Chad)
      • Odia (language of Odisha, spoken natively by 82% of people in Odisha; also spoken in West Bengal, Jharkhand, Andhra Pradesh and Chhattisgarh)
    • 95 languages (81 unseen) from the UCLA Phonetic Corpus (Li et al., 2021)
    • 14 unseen languages from DORECO-IPA
    • retrieval performance metrics: Hit@1 and Mean Average Precision (mAP)
  • KWS evaluation results in Table 2 suggest that CLAP-IPA performs on par with state-of-the-art models on LibriPhrase-Easy, despite never being trained on the LibriPhrase training set. Yet CLAP-IPA failed to outperform the state-of-the-art CED (Nishu et al., 2023) on LibriPhrase-Hard, suggesting that language-specific finetuning is still necessary to maximize performance. Generally speaking, phoneme-based models are more effective than text-based models.
  • See An Introduction to Dynamic Time Warping for a primer on the DTW algorithm itself
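
A minimal sketch of the sigmoid contrastive loss mentioned above, assuming PyTorch and a batch in which matched speech/phoneme pairs sit on the diagonal; log_t (learnable log-temperature) and bias are illustrative names, not the paper's code:

```python
import torch
import torch.nn.functional as F

def siglip_loss(speech_emb, phone_emb, log_t, bias):
    """Sigmoid contrastive loss (Zhai et al., 2023) over a batch of
    speech/phoneme embeddings; matched pairs lie on the diagonal."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    phone_emb = F.normalize(phone_emb, dim=-1)
    logits = speech_emb @ phone_emb.t() * log_t.exp() + bias          # (N, N)
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 diag, -1 off-diag
    # softplus(-x) == -log(sigmoid(x)); sum over pairs, average over the batch
    return F.softplus(-labels * logits).sum(dim=-1).mean()
```

Per the SigLIP paper, the temperature and bias are learnable parameters; whether CLAP-IPA uses the same initialisation is not covered in these notes.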
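
And a rough sketch of how a monotonic alignment can be read off the frame-phoneme similarity matrix with a DTW-style dynamic program. This is my own simplified reconstruction, not the paper's implementation: each frame is assigned to exactly one phone, the phone index can only stay or advance by one, and there must be at least as many frames as phones.

```python
import numpy as np

def monotonic_align(sim):
    """Align speech frames (rows) to phonemes (cols) by dynamic programming
    on a similarity matrix. Returns (frame, phone) pairs on the best path;
    phone boundaries fall where the phone index changes along the path."""
    T, N = sim.shape
    cost = -sim                      # maximize similarity = minimize negative similarity
    acc = np.full((T, N), np.inf)
    acc[0, 0] = cost[0, 0]           # the first frame must start on the first phone
    for t in range(1, T):
        acc[t, 0] = acc[t - 1, 0] + cost[t, 0]
        for n in range(1, N):
            acc[t, n] = cost[t, n] + min(acc[t - 1, n],      # stay on the same phone
                                         acc[t - 1, n - 1])  # advance to the next phone
    t, n = T - 1, N - 1              # the last frame must end on the last phone
    path = [(t, n)]
    while t > 0:
        if n > 0 and acc[t - 1, n - 1] < acc[t - 1, n]:
            n -= 1
        t -= 1
        path.append((t, n))
    return path[::-1]
```

In a real pipeline, the frame indices on the path would then be converted back to timestamps using the speech encoder's frame rate.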

Word and phoneme boundaries

To evaluate the performance of forced alignment, we made use of F1 and R-Value, which were used in prior studies (Räsänen et al., 2009; Kreuk et al., 2020; Zhu et al., 2022c). If a predicted boundary falls within the tolerance interval of the true boundary, it is considered a hit, otherwise a miss. Since each boundary marks the onset and the offset of consecutive phones, we only evaluated phone onsets with a tolerance of 20 ms and word onsets with a tolerance of 100 ms. We used TIMIT (Garofolo et al., 1993) as the English benchmark. DORECO-IPA also contains phoneme-level and word-level alignments, so we partitioned DORECO-IPA into seen and unseen evaluation sets. Note that IPA-ALIGNER was never trained on any segmentation labels.
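
A small sketch of the boundary F1 and R-value computation described above, assuming boundary times in seconds and a simple greedy matching within the tolerance window (the exact matching procedure may differ from the paper's):

```python
import numpy as np

def boundary_metrics(pred, ref, tol=0.02):
    """Boundary detection F1 and R-value (Räsänen et al., 2009).
    pred, ref: boundary times in seconds; tol: tolerance window
    (e.g. 0.02 s for phone onsets, 0.1 s for word onsets)."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    hits, used = 0, set()
    for p in pred:
        j = int(np.argmin(np.abs(ref - p)))      # nearest reference boundary
        if abs(ref[j] - p) <= tol and j not in used:
            hits += 1
            used.add(j)
    precision = hits / max(len(pred), 1)
    recall = hits / max(len(ref), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    os = recall / max(precision, 1e-8) - 1        # over-segmentation measure
    r1 = np.sqrt((1 - recall) ** 2 + os ** 2)
    r2 = (-os + recall - 1) / np.sqrt(2)
    r_value = 1 - (abs(r1) + abs(r2)) / 2
    return f1, r_value
```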


Phoneme-based Speech Datasets

Efforts to open-source phoneme-based speech corpora (most collected from fieldwork recordings):

  • DoReCo Paschen et al. (2020)
  • Pangloss Collection Michailovsky et al. (2014)
  • the UCLA Phonetic Corpus Li et al. (2021)

👆 Notes on these:

  • Dozens of low-resource languages, most of which are transcribed phonemically.
  • Most languages are represented by only a few hours of recordings from a highly restricted pool of speakers
  • phonemic transcriptions often sparse and inconsistent across languages

VoxCommunis (A Corpus for Cross-linguistic Phonetic Analysis) is another effort to create a large-scale phoneme-based speech corpus, covering 57 languages from the Common Voice dataset Ardila et al. (2020) via the Epitran G2P system Mortensen et al. (2018).

Forced Alignment Background

Currently, some of the most popular forced alignment systems are still based on Hidden Markov Models (HMM), including:

  • the Montreal Forced Aligner (MFA) McAuliffe et al. (2017)
  • WebMAUS Kisler et al. (2012)
  • Forced Alignment and Vowel Extraction (FAVE) Rosenfelder et al. (2011)

Neural forced aligners:

  • Kelley and Tucker (2018); Kürzinger et al. (2020)
  • Schulze-Forster et al. (2020)
  • Teytaut and Roebel (2021)
  • Teytaut et al. (2022)
  • Zhu et al. (2022c)

Neural models usually exhibit stronger performance than HMM-based systems.

Forced alignment systems are mostly set up to work in monolingual settings.

Dataset Curation (§3)

As a first step, we created large-scale phonemic transcriptions for public speech corpora, encompassing 115 languages across diverse language families. Transcription can be automated through grapheme-to-phoneme (G2P) conversion.

Primarily used three existing multilingual speech datasets:

  1. FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech), Conneau et al. (2023)
  2. Multilingual Spoken Words Corpus (MSWC), from MLCommons Datasets, Mazumder et al. (2021b)
  3. DoReCo (Building a Time-Aligned Cross-Linguistic Reference Corpus from Language Documentation Data), Paschen et al. (2020)

FLEURS

We used two multilingual G2P systems, Epitran Mortensen et al. (2018) and CharsiuG2P Zhu et al. (2022b), to create phonemic transcriptions. As these two systems cover overlapping but slightly different sets of languages, combining them allowed us to maximize the diversity of languages. Before applying G2P, we removed any texts with Arabic numerals or code-switching, as G2P systems cannot process them correctly.

Some Asian languages, however, do not explicitly mark word boundaries with spaces, so word segmentation was required before G2P. For Mandarin Chinese, G2PW Chen et al. (2022) was used to create Pinyin romanizations, which were then mapped to IPA symbols. For Thai, we used PyThaiNLP Phatthiyaphaibun et al. (2016) to perform both word segmentation and G2P. For Japanese, word segmentation was first performed with Fugashi McCann (2020) before G2P was applied.
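
For reference, a minimal Epitran call looks roughly like this; the language code and example string are illustrative, not taken from the paper:

```python
import epitran

# Epitran selects a G2P model by an ISO 639-3 + script code,
# e.g. "spa-Latn" for Spanish in Latin script (illustrative choice).
epi = epitran.Epitran("spa-Latn")
ipa = epi.transliterate("buenos dias")  # returns the IPA transcription as a string
print(ipa)
```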

MSWC

As MSWC is a word-level speech corpus, creating phonemic transcriptions was straightforward: CharsiuG2P and Epitran were deployed to transcribe the orthographic words into phonemic sequences. To strike a balance between diversity and quantity, we capped the maximum frequency at 50 to prevent high-frequency words from dominating the dataset; for words with more than 50 samples, only 50 were randomly selected from the pool. After filtering, we ended up with 2.3 million spoken words, amounting to around 613 hours.
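
A minimal sketch of this per-word frequency cap, assuming a flat list of (word, clip) pairs; the function and variable names are mine, not the paper's:

```python
import random
from collections import defaultdict

def cap_per_word(samples, max_per_word=50, seed=0):
    """Keep at most `max_per_word` clips per word type so that
    high-frequency words do not dominate the dataset."""
    random.seed(seed)
    by_word = defaultdict(list)
    for word, clip in samples:
        by_word[word].append(clip)
    capped = []
    for word, clips in by_word.items():
        if len(clips) > max_per_word:
            clips = random.sample(clips, max_per_word)
        capped.extend((word, c) for c in clips)
    return capped
```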

DoReCo

The original DoReCo data were distributed as hour-long recordings, so we segmented them into individual utterances based on the sentence boundaries in the provided annotations. All DoReCo languages were transcribed phonemically in X-SAMPA notation Wells (1995). We simply converted the X-SAMPA transcriptions to IPA symbols, as there is a one-to-one mapping between the two systems. Utterances with incomplete transcriptions or loud background noise were discarded.
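
Since the X-SAMPA-to-IPA mapping is one-to-one, the conversion amounts to a table lookup; below is an illustrative fragment (not the full table), assuming the phones are already segmented:

```python
# A few entries of the X-SAMPA -> IPA table; a real converter needs the
# complete mapping, and multi-character symbols such as "tS" must be
# matched before their single-character prefixes when working on raw strings.
XSAMPA_TO_IPA = {
    "S": "ʃ", "Z": "ʒ", "N": "ŋ", "E": "ɛ", "O": "ɔ", "?": "ʔ", "tS": "tʃ",
}

def xsampa_to_ipa(phones):
    """Convert a list of X-SAMPA phone labels to IPA symbols."""
    return [XSAMPA_TO_IPA.get(p, p) for p in phones]
```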

Dataset validation (§3.2)

As G2P systems are based on rules or pronunciation dictionaries, they reflect how a word should be pronounced rather than how it is actually pronounced.

Two authors (trained phoneticians) listened to at least ten random samples in each language to determine the transcription quality. We applied a relatively relaxed standard for the generated transcriptions: as long as the speech signal approximately matches more than 80% of the transcription, it is considered valid.