HuBERT Explained | by Miguel Aspis | Dev Genius


A short introduction to BERT

BERT is a bi-directional self-supervised NLP model based on the transformer architecture.

Let’s go step-by-step

The transformer architecture is a deep learning architecture based on the self-attention mechanism. Explaining it in detail is out of the scope of this post; to learn more, you can read this great guide [2].

Bi-directional means the model receives the entire sequence of words at once, thus being able to learn the context of a word based on both what words came before and what words came after it.

Self-supervised means BERT can learn from unlabeled data; to do that it uses two mechanisms: Masked Language Modelling and Next Sentence Prediction.

Masked Language Modelling consists of replacing a percentage of input tokens with [MASK] tokens and training the model to predict the original value of the masked inputs.
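To make this concrete, here is a minimal masked language modelling sketch with a pre-trained BERT checkpoint from the Hugging Face Transformers library; the sentence and the masked position are arbitrary examples:

import torch
from transformers import BertForMaskedLM, BertTokenizer

# load a pre-trained BERT checkpoint and its tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# one input token is replaced with the [MASK] token
inputs = tokenizer("The cat sat on the [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# find the masked position and decode the most likely original token
mask_position = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_position].argmax(dim=-1)
print(tokenizer.decode(predicted_id))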

Next Sentence Prediction consists of feeding the model 2 sentences and training it to learn if they are subsequent in the original document.

More in-depth explanations of BERT can be found in this great TDS post or in the original paper [3].

Why we can’t use NLP models on audio

There are 3 main problems when trying to apply BERT or other NLP models on speech data:

  1. There are multiple sound units in each input utterance
  2. There is no lexicon of discrete sound units
  3. Sound units have variable length and no explicit segmentation

Problem 1 prevents the usage of techniques such as instance classification, which are used in Computer Vision for pre-training.

Problem 2 hinders the usage of predictive losses, due to not having a reliable target to compare the prediction with.

Finally, problem 3 complicates masked prediction pre-training due to unknown borders between sound units.

To solve these problems the authors propose HuBERT.

HuBERT architecture and training procedure

HuBERT architecture — Image from the original paper

The HuBERT model architecture follows the wav2vec 2.0 architecture [4] and consists of:

  • Convolutional encoder
  • BERT encoder
  • Projection layer
  • Code embedding layer

The size of these components (for example, the number of transformer layers and their embedding dimensions) varies between the BASE, LARGE, and X-LARGE variants.

Each component and its role will be explained as we walk through the training procedure; a simplified sketch of how they fit together is shown below.
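To make the data flow easier to picture, here is a highly simplified, illustrative PyTorch sketch of these four components. The layer sizes, kernel size, and cluster count are placeholders, and the real model uses a multi-layer convolutional feature extractor rather than a single convolution:

import torch
import torch.nn as nn

class HubertSketch(nn.Module):
    """Illustrative sketch of the HuBERT pipeline (dimensions are placeholders)."""

    def __init__(self, hidden=768, proj=256, num_clusters=100):
        super().__init__()
        # convolutional encoder: raw waveform -> frame-level features (roughly 20 ms per frame)
        self.conv_encoder = nn.Conv1d(1, hidden, kernel_size=400, stride=320)
        # BERT encoder: transformer layers over the (partially masked) frame sequence
        encoder_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.bert_encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)
        # projection layer: maps transformer outputs into the label space
        self.projection = nn.Linear(hidden, proj)
        # code embedding layer: one embedding vector per hidden unit (k-means cluster)
        self.code_embeddings = nn.Embedding(num_clusters, proj)

    def forward(self, waveform):
        frames = self.conv_encoder(waveform).transpose(1, 2)  # (batch, time, hidden)
        # during pre-training a subset of frames would be masked here
        hidden_states = self.bert_encoder(frames)
        projected = self.projection(hidden_states)             # (batch, time, proj)
        # similarity against every code embedding gives the prediction logits
        return projected @ self.code_embeddings.weight.T       # (batch, time, num_clusters)

# toy forward pass on one second of random "audio" at 16 kHz
print(HubertSketch()(torch.randn(1, 1, 16000)).shape)

The masking and the loss are omitted here; they are covered in the training steps below.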

The training consists of 2 steps:

Generating hidden units

HuBERT initial clustering step — Image by Author

The first training step consists of discovering the hidden units, and the process begins with extracting MFCCs (Mel-frequency cepstral coefficients) from the audio waveform.

These are raw acoustic features useful for representing speech.

Each audio frame is then passed to the k-means clustering algorithm and assigned to one of K clusters.

All audio frames will then be labeled according to which cluster they belong to, and these are the hidden units.

Afterward, these units are converted into embedding vectors to be used in step B of training.
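As an illustration of this first clustering step (a sketch, not the authors' exact pipeline), the frame-level targets could be produced with librosa and scikit-learn roughly like this; the file path, MFCC settings, and cluster count are placeholder choices:

import librosa
from sklearn.cluster import KMeans

# load the audio at 16 kHz and extract frame-level MFCC features
speech, sr = librosa.load("audio.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=speech, sr=sr, n_mfcc=13).T   # (num_frames, 13)

# cluster the frames; each cluster id acts as a pseudo-label, i.e. a hidden unit
kmeans = KMeans(n_clusters=100, random_state=0).fit(mfcc)
hidden_units = kmeans.labels_   # one discrete unit per audio frame
print(hidden_units[:20])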

After the first training iteration, the model itself can generate better representations than the MFCCs, so the clustering is re-run on the output of an intermediate layer of the BERT encoder from the previous iteration:

HuBERT subsequent clustering step — Image by Author
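A possible sketch of this refinement step: extract the hidden states of an intermediate transformer layer from a pre-trained HuBERT checkpoint and re-cluster them. The checkpoint name, layer index, and cluster count are illustrative choices, not necessarily the exact setup from the paper:

import librosa
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# load a pre-trained (non-fine-tuned) HuBERT model and its feature extractor
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960")

speech, sr = librosa.load("audio.wav", sr=16000)
inputs = feature_extractor(speech, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    # output_hidden_states=True exposes every transformer layer, not just the last one
    outputs = model(**inputs, output_hidden_states=True)

# take an intermediate layer (here the 6th) and re-run k-means on its frame features
layer_features = outputs.hidden_states[6].squeeze(0).numpy()   # (num_frames, hidden_dim)
refined_units = KMeans(n_clusters=500, random_state=0).fit_predict(layer_features)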

Masked Prediction

HuBERT Prediction step training

HuBERT Prediction step — Image by Author

The second step is analogous to the training of the original BERT model, using masked language modeling.

The CNN is responsible for generating features from the raw audio, which are then randomly masked and fed into the BERT encoder.

The BERT encoder outputs a feature sequence, filling in the masked tokens. This output is then projected into a lower dimension to match the label embeddings, and the cosine similarity is computed between the projected outputs and each hidden-unit embedding generated in step A.

These similarities serve as logits, and the cross-entropy loss is applied to them to penalize wrong predictions at the masked positions.
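Putting the last few paragraphs together, a minimal sketch of this masked-prediction loss could look as follows (an illustration with simplified shapes, not the authors' implementation; the paper scales the cosine similarities by a temperature before the softmax):

import torch
import torch.nn.functional as F

def masked_prediction_loss(projected, code_embeddings, targets, mask, temperature=0.1):
    """projected: (time, proj) projected BERT encoder outputs
    code_embeddings: (num_clusters, proj) one embedding per hidden unit
    targets: (time,) cluster id assigned to each frame in step A
    mask: (time,) boolean, True where the frame was masked"""
    # cosine similarity between every frame and every code embedding
    sims = F.normalize(projected, dim=-1) @ F.normalize(code_embeddings, dim=-1).T
    logits = sims / temperature
    # cross-entropy penalizes wrong predictions at the masked positions only
    return F.cross_entropy(logits[mask], targets[mask])

# toy usage with random tensors, just to show the shapes involved
projected = torch.randn(50, 256)
code_embeddings = torch.randn(100, 256)
targets = torch.randint(0, 100, (50,))
mask = torch.rand(50) > 0.5
print(masked_prediction_loss(projected, code_embeddings, targets, mask))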

How to use HuBERT in your projects

All official versions of HuBERT (base, large, and X-large) are available in the Transformers library by Hugging Face:

import librosa
import torch
from transformers import HubertForCTC, Wav2Vec2Processor

# load the processor and the fine-tuned model
processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
model = HubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

# import the wav file (`file` is the path to a 16 kHz audio file)
speech, rate = librosa.load(file, sr=16000)

# convert the waveform into model inputs
input_values = processor(speech, return_tensors="pt", padding="longest", sampling_rate=rate).input_values

# retrieve the logits
logits = model(input_values).logits

# take the argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

# print the transcribed text
print(transcription)

Conclusion

Now that you know how the HuBERT model works, you can go and use it in one of your speech recognition projects, or even fine-tune it for other downstream tasks if you’d like!

To learn more about the model and the results go check the original paper!

References:

[1] Hsu, Wei-Ning, et al. “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units.” (2021).

[2] Transformers from scratch by Peter Bloem

[3] Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” (2018).

[4] Baevski, Alexei, et al. “wav2vec 2.0: A framework for self-supervised learning of speech representations.” (2020).


Notes

Very nice summary of problems and solutions for MLM for audio in the “Why we can’t use NLP models on audio” section above.
