Title: Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study
Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang
Published: 27th September 2023 (Wednesday) @ 17:21:13
Link: http://arxiv.org/abs/2309.15800v1
Abstract
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations, which significantly compresses the size of speech data. Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length. Hence, training time is significantly reduced while retaining notable performance. In this study, we undertake a comprehensive and systematic exploration into the application of discrete units within end-to-end speech processing models. Experiments on 12 automatic speech recognition, 3 speech translation, and 1 spoken language understanding corpora demonstrate that discrete units achieve reasonably good results in almost all the settings. We intend to release our configurations and trained models to foster future research efforts.
Summary of contributions
- A comparative analysis is conducted under fair conditions, evaluating the performance and training time reduction using discrete speech units as opposed to traditional speech features.
- A diverse range of benchmarks, including 12 ASR (Section 3.2), 3 ST (Section 3.3), and 1 SLU (Section 3.4) corpora, most of which are evaluated with discrete units for the first time.
- To demonstrate the wide applicability of discrete units, we adopt noisy speech, spontaneous speech, telephony speech, and several multilingual speech corpora; to our knowledge, this is the first work to explore these aspects (Section 3.2.1).
- We show the versatility of discrete units in various E2E frameworks, including connectionist temporal classification (CTC) [3], attention-based encoder-decoder (AED) [5], and RNN-Transducer [4] (Table 3).
- We share various tips based on our investigations to get better performance, including SSL feature choice and discretization. Selecting SSL features based on canonical correlation analysis (CCA) [26] improves performance significantly compared to prior work [17] (Section 3.1).
- We also explore other possible choices of discrete units, including clustering of SSL [14] or supervised representations, and vector quantization from neural codec models [27] (Section 3.2.5).
- We will release fully reproducible recipes and trained models on ESPnet [28], which can significantly benefit the community.
Discretisation Approaches
The paper contrasts two discretisation approaches:
- Clustering: offline (e.g. k-means over pre-trained features), as in HuBERT
- Vector quantization: a VQ/RVQ layer learned jointly during model training, as in EnCodec and SoundStream
They prefer the clustering-based method for its inherent versatility across diverse tasks (a minimal sketch of this pipeline follows the list below). Benefits of clustering:
- Arbitrary features: It enables a wide choice of feature extraction methods, including spectral features or intermediate representations from SSL or supervised learning-based models.
- Appropriate features / choose your layer: Distinct layers retain different information [14], and we can choose an optimal feature for different purposes.
- Flexible vocabulary size: The vocabulary size can be easily tuned for the balance of information distinctions and efficiency without modifying the pre-trained models.
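Below is a minimal sketch of the clustering route described above, assuming SSL features have already been extracted; the 1024-dimensional features, the vocabulary size of 500, and the use of scikit-learn's MiniBatchKMeans are illustrative assumptions, not the paper's exact tooling.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def fit_codebook(train_feats: np.ndarray, vocab_size: int = 500) -> MiniBatchKMeans:
    """Learn `vocab_size` centroids over pooled SSL frames of shape (num_frames, feat_dim)."""
    km = MiniBatchKMeans(n_clusters=vocab_size, batch_size=10_000, n_init=10, random_state=0)
    return km.fit(train_feats)

def discretize(utt_feats: np.ndarray, km: MiniBatchKMeans) -> list[int]:
    """Map each frame of one utterance to the index of its nearest centroid."""
    return km.predict(utt_feats).tolist()

# Toy usage with random stand-in features (feat_dim = 1024, e.g. one WavLM layer):
rng = np.random.default_rng(0)
km = fit_codebook(rng.standard_normal((20_000, 1024)).astype(np.float32))
units = discretize(rng.standard_normal((120, 1024)).astype(np.float32), km)
```

Swapping the feature extractor (spectral features, a different SSL layer, or a supervised encoder) only changes what is fed to `fit_codebook`, which is exactly the flexibility the list above emphasizes.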
Removing Redundancy and Reducing Sequence Length
Two techniques remove redundancy among commonly co-occurring units and reduce the sequence length. Both are cited from Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning, which is [17] in their references (a short sketch follows this subsection):
- De-duplication: This approach involves condensing consecutive subsequences featuring identical tokens into a single token to reduce redundancy.
- Subword Modeling: This technique combines frequent patterns of discrete unit subsequences and reassigns them to metatokens, to enhance the input token representation.
- Question: what does [17] use for this? BPE or something else?
Time masking (simple data augmentation) is additionally used in Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning for regularization during training.
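Here is a minimal sketch of de-duplication and time masking on a discrete-unit sequence, as described above. Subword modeling (merging frequent unit patterns into metatokens) is usually delegated to an external tokenizer such as SentencePiece and is omitted here; the mask token and span sizes are illustrative assumptions, not the settings of [17].

```python
import random
from itertools import groupby

def deduplicate(units: list[int]) -> list[int]:
    """Collapse runs of identical consecutive units into a single token."""
    return [u for u, _ in groupby(units)]

def time_mask(units: list[int], mask_token: int, max_span: int = 5, num_spans: int = 2) -> list[int]:
    """Replace a few random short spans with a mask token (training-time regularization)."""
    out = list(units)
    for _ in range(num_spans):
        span = random.randint(1, max_span)
        if len(out) <= span:
            continue
        start = random.randint(0, len(out) - span)
        out[start:start + span] = [mask_token] * span
    return out

print(deduplicate([7, 7, 7, 12, 12, 5, 5, 5, 5]))  # -> [7, 12, 5]
```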
Setup / Regime
- Encoder: E-Branchformer (Branchformer with Enhanced merging for Speech Recognition)
- Decoder: 6-layer transformer decoder
- No language model used (for ASR) for inference
- Beam size: 10
- Results may be worse than those of previous studies, which target state-of-the-art performance. Question: what beam size do they use? (See the decoding sketch below.)
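A minimal sketch of decoding with the settings listed above (beam size 10, no language model), using ESPnet2's generic Speech2Text interface; the model tag is a placeholder, not one of the models from this paper, and the released discrete-unit recipes may expose a different entry point.

```python
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text.from_pretrained(
    "placeholder-org/placeholder-asr-model",  # hypothetical model tag
    beam_size=10,   # beam size from the notes above
    lm_weight=0.0,  # no language model for ASR inference
)

speech, rate = sf.read("utterance.wav")  # placeholder audio path
text, tokens, token_ids, hyp = speech2text(speech)[0]  # best hypothesis
print(text)
```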
Constructing features for discretisation:
- Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning used final-layer WavLM features
- They take inspiration from Comparative layer-wise analysis of self-supervised speech models, which uses CCA (see the sketch after this list)
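The sketch below illustrates one way to pick a layer with CCA, in the spirit of the layer-wise analysis cited above. It assumes per-layer frame features aligned with frame-level phone labels; the use of scikit-learn's plain CCA (rather than the projection-weighted variants used in that literature) and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_score(layer_feats: np.ndarray, targets: np.ndarray, n_components: int = 8) -> float:
    """Mean canonical correlation between frame features and frame-level targets."""
    cca = CCA(n_components=n_components, max_iter=1000)
    x_c, y_c = cca.fit_transform(layer_feats, targets)
    corrs = [np.corrcoef(x_c[:, i], y_c[:, i])[0, 1] for i in range(n_components)]
    return float(np.mean(corrs))

def pick_layer(layer_feats: dict[int, np.ndarray], phone_onehot: np.ndarray) -> int:
    """Return the layer index whose features correlate most with the labels.

    layer_feats: {layer_index: (num_frames, feat_dim) array}
    phone_onehot: (num_frames, num_phones) one-hot frame-level phone labels
    """
    return max(layer_feats, key=lambda l: cca_score(layer_feats[l], phone_onehot))
```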
Questions
In relation to Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning, they say:
[the] approach condenses the information considerably, reducing the original 1024-sized float vector to a mere 12-bit binary number: over 3000 times less.
Question: 1024 × 32 / 12 ≈ 2731, which is not greater than 3000. What have I mistaken? (Note: fp16/bf16 would imply half the multiplier reduction.)
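For reference, a worked version of the arithmetic behind this question, assuming 32-bit floats for the original 1024-dimensional vector (the fp16/bf16 note halves the numerator):

```latex
% fp32: 1024 floats x 32 bits = 32768 bits per frame, quantized to 12 bits
\[
  \frac{1024 \times 32}{12} \approx 2731 \quad (\text{fp32}),
  \qquad
  \frac{1024 \times 16}{12} \approx 1365 \quad (\text{fp16/bf16}).
\]
```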