Forced alignment for multilingual data — Torchaudio 2.2.0.dev20240214 documentation

Authors: Xiaohui Zhang, Moto Hira.

This tutorial shows how to align transcripts to speech for non-English languages.

The process of aligning a non-English (normalized) transcript is identical to aligning an English (normalized) transcript; the English case is covered in detail in the CTC forced alignment tutorial. In this tutorial, we use TorchAudio’s high-level API, torchaudio.pipelines.Wav2Vec2FABundle, which packages the pre-trained model, tokenizer and aligner, to perform forced alignment with less code.

import torch
import torchaudio

print(torch.__version__)
print(torchaudio.__version__)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

2.3.0.dev20240213
2.2.0.dev20240214
cuda

from typing import List

import IPython
import matplotlib.pyplot as plt

Creating the pipeline

First, we instantiate the model and pre/post-processing pipelines.

The following diagram illustrates the process of alignment.

https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2fabundle.png

The waveform is passed to an acoustic model, which produces a sequence of probability distributions over tokens. The transcript is passed to the tokenizer, which converts it into a sequence of tokens. The aligner takes the results from the acoustic model and the tokenizer and generates timestamps for each token.

Note

This process expects that the input transcript is already normalized. The process of normalization, which involves romanization of non-English languages, is language-dependent, so it is not covered in detail in this tutorial, but we will briefly look into it.

The acoustic model and the tokenizer must use the same set of tokens. To facilitate the creation of matching processors, Wav2Vec2FABundle associates a pre-trained acoustic model and a tokenizer. torchaudio.pipelines.MMS_FA is one such instance.

The following code instantiates a pre-trained acoustic model, a tokenizer which uses the same set of tokens as the model, and an aligner.

from torchaudio.pipelines import MMS_FA as bundle

model = bundle.get_model()
model.to(device)

tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()

Note

The model instantiated by MMS_FA’s get_model() method by default includes the feature dimension for the <star> token. You can disable this by passing with_star=False.

The acoustic model of MMS_FA was created and open-sourced as part of the research project, Scaling Speech Technology to 1,000+ Languages. It was trained with 23,000 hours of audio from 1100+ languages.

The tokenizer simply maps the normalized characters to integers. You can check the mapping as follows:

{'-': 0, 'a': 1, 'i': 2, 'e': 3, 'n': 4, 'o': 5, 'u': 6, 't': 7, 's': 8, 'r': 9, 'm': 10, 'k': 11, 'l': 12, 'd': 13, 'g': 14, 'h': 15, 'y': 16, 'b': 17, 'p': 18, 'w': 19, 'c': 20, 'v': 21, 'j': 22, 'z': 23, 'f': 24, "'": 25, 'q': 26, 'x': 27, '*': 28}
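To make the mapping concrete, here is a small self-contained sketch that applies the dictionary above by hand. The dictionary literal is copied from the output above; the tokenize helper is illustrative and not part of the torchaudio API.

```python
# Character-to-index mapping copied from the tokenizer output above.
dictionary = {
    "-": 0, "a": 1, "i": 2, "e": 3, "n": 4, "o": 5, "u": 6, "t": 7, "s": 8,
    "r": 9, "m": 10, "k": 11, "l": 12, "d": 13, "g": 14, "h": 15, "y": 16,
    "b": 17, "p": 18, "w": 19, "c": 20, "v": 21, "j": 22, "z": 23, "f": 24,
    "'": 25, "q": 26, "x": 27, "*": 28,
}


def tokenize(word):
    """Map each normalized character of a word to its integer index."""
    return [dictionary[c] for c in word]


print(tokenize("brot"))  # [17, 9, 5, 7]
```

This is exactly what the bundled tokenizer does, character by character, for each word of the normalized transcript.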

The aligner internally uses torchaudio.functional.forced_align() and torchaudio.functional.merge_tokens() to infer the time stamps of the input tokens.

The details of the underlying mechanism are covered in the CTC forced alignment API tutorial, so please refer to it.

We define a utility function that performs the forced alignment with the above model, the tokenizer and the aligner.
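Such a function might look like the following sketch. Passing the model, tokenizer and aligner as explicit parameters is a choice made here for readability; the exact signature is an assumption, not verbatim tutorial code.

```python
import torch


def compute_alignments(waveform, transcript, model, tokenizer, aligner, device="cpu"):
    """Align a list of normalized words against a waveform.

    Returns the emission tensor and one list of token spans per word.
    """
    with torch.inference_mode():
        # Acoustic model: waveform -> frame-wise token probability distributions.
        emission, _ = model(waveform.to(device))
        # Tokenizer: words -> token IDs; aligner: IDs + emission -> timed spans.
        token_spans = aligner(emission[0], tokenizer(transcript))
    return emission, token_spans
```

With the MMS_FA bundle, model, tokenizer and aligner would be the objects instantiated in the previous section.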

We also define utility functions for plotting the result and previewing the audio segments.

# Compute average score weighted by the span length
def _score(spans):
    return sum(s.score * len(s) for s in spans) / sum(len(s) for s in spans)


def plot_alignments(waveform, token_spans, emission, transcript, sample_rate=bundle.sample_rate):
    ratio = waveform.size(1) / emission.size(1) / sample_rate

    fig, axes = plt.subplots(2, 1)
    axes[0].imshow(emission[0].detach().cpu().T, aspect="auto")
    axes[0].set_title("Emission")
    axes[0].set_xticks([])

    axes[1].specgram(waveform[0], Fs=sample_rate)
    for t_spans, chars in zip(token_spans, transcript):
        t0, t1 = t_spans[0].start, t_spans[-1].end
        axes[0].axvspan(t0 - 0.5, t1 - 0.5, facecolor="None", hatch="/", edgecolor="white")
        axes[1].axvspan(ratio * t0, ratio * t1, facecolor="None", hatch="/", edgecolor="white")
        axes[1].annotate(f"{_score(t_spans):.2f}", (ratio * t0, sample_rate * 0.51), annotation_clip=False)

        for span, char in zip(t_spans, chars):
            t0 = span.start * ratio
            axes[1].annotate(char, (t0, sample_rate * 0.55), annotation_clip=False)

    axes[1].set_xlabel("time [second]")
    fig.tight_layout()
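The helper for previewing audio segments is not shown in this excerpt. A sketch of what it might look like, assuming spans carry frame-level start/end indices and a score as produced by the aligner (the signature and the 16 kHz default are assumptions):

```python
def preview_word(waveform, spans, num_frames, transcript, sample_rate=16000):
    """Print a word's score and time range, and return its audio segment.

    Assumes `spans` is a list of token spans with `start`, `end` (frame
    indices into the emission) and `score` attributes, where `len(span)`
    gives the span length. To listen to the result in a notebook, wrap it:
    IPython.display.Audio(segment.numpy(), rate=sample_rate).
    """
    # Average score weighted by span length (same formula as `_score`).
    score = sum(s.score * len(s) for s in spans) / sum(len(s) for s in spans)
    ratio = waveform.size(1) / num_frames  # samples per emission frame
    x0 = int(ratio * spans[0].start)
    x1 = int(ratio * spans[-1].end)
    print(f"{transcript} ({score:.2f}): {x0 / sample_rate:.3f} - {x1 / sample_rate:.3f} sec")
    return waveform[:, x0:x1]
```

This reproduces the per-word lines shown in the sections below, e.g. "aber (0.96): 0.222 - 0.464 sec".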

Normalizing the transcript

The transcripts passed to the pipeline must be normalized beforehand. The exact process of normalization depends on the language.

Languages that do not have explicit word boundaries (such as Chinese, Japanese and Korean) require segmentation first. There are dedicated tools for this, but let’s assume the transcript has already been segmented.

The first step of normalization is romanization. uroman is a tool that supports many languages.

Here are bash commands that romanize an input text file and write the output to another text file using uroman.

$ echo "Cette page concerne des événements d'actualité qui se sont produits durant l'année 1882" > text.txt
$ uroman/bin/uroman.pl < text.txt > text_romanized.txt
$ cat text_romanized.txt
Cette page concerne des evenements d'actualite qui se sont produits durant l'annee 1882

The next step is to remove non-alphabetic characters and punctuation. The following snippet normalizes the romanized transcript.

import re


def normalize_uroman(text):
    text = text.lower()
    text = text.replace("’", "'")
    text = re.sub("([^a-z' ])", " ", text)
    text = re.sub(' +', ' ', text)
    return text.strip()


with open("text_romanized.txt", "r") as f:
    for line in f:
        text_normalized = normalize_uroman(line)
        print(text_normalized)

Running the script on the above example produces the following.

cette page concerne des evenements d'actualite qui se sont produits durant l'annee

Note that, in this example, since “1882” was not romanized by uroman, it was removed in the normalization step. To avoid this, one needs to romanize numbers, but this is known to be a non-trivial task.
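To see this effect in isolation, the snippet below reruns the same normalization on a string containing a number (normalize_uroman is repeated so the example is self-contained):

```python
import re


def normalize_uroman(text):
    text = text.lower()
    text = text.replace("’", "'")
    text = re.sub("([^a-z' ])", " ", text)  # digits and punctuation become spaces
    text = re.sub(" +", " ", text)
    return text.strip()


print(normalize_uroman("l'annee 1882"))  # l'annee  -- the year is dropped
```

To keep the number, it would have to be spelled out ("mille huit cent ...") before this step; libraries exist for that, but correct number verbalization is language-dependent and non-trivial.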

Aligning transcripts to speech

Now we perform the forced alignment for multiple languages.

German

[Plot: emission and aligned spectrogram]

Raw Transcript:  aber seit ich bei ihnen das brot hole
Normalized Transcript:  aber seit ich bei ihnen das brot hole

aber (0.96): 0.222 - 0.464 sec
seit (0.78): 0.565 - 0.766 sec
ich (0.91): 0.847 - 0.948 sec
bei (0.96): 1.028 - 1.190 sec
ihnen (0.65): 1.331 - 1.532 sec
das (0.54): 1.573 - 1.774 sec
brot (0.86): 1.855 - 2.117 sec
hole (0.71): 2.177 - 2.480 sec

Chinese

Chinese is a character-based language, and there is no explicit word-level tokenization (separated by spaces) in its raw written form. In order to obtain word-level alignments, you need to first tokenize the transcripts at the word level using a word tokenizer like “Stanford Tokenizer”. However, this is not needed if you only want character-level alignments.

text_raw = "关 服务 高端 产品 仍 处于 供不应求 的 局面"
text_normalized = "guan fuwu gaoduan chanpin reng chuyu gongbuyingqiu de jumian"

[Plot: emission and aligned spectrogram]

Raw Transcript:  关 服务 高端 产品 仍 处于 供不应求 的 局面
Normalized Transcript:  guan fuwu gaoduan chanpin reng chuyu gongbuyingqiu de jumian

guan (0.33): 0.020 - 0.141 sec
fuwu (0.31): 0.221 - 0.583 sec
gaoduan (0.74): 0.724 - 1.065 sec
chanpin (0.73): 1.126 - 1.528 sec
reng (0.86): 1.608 - 1.809 sec
chuyu (0.80): 1.849 - 2.151 sec
gongbuyingqiu (0.93): 2.251 - 2.894 sec
de (0.98): 2.935 - 3.015 sec
jumian (0.95): 3.075 - 3.477 sec

Polish

[Plot: emission and aligned spectrogram]

Raw Transcript:  wtedy ujrzałem na jego brzuchu okrągłą czarną ranę
Normalized Transcript:  wtedy ujrzalem na jego brzuchu okragla czarna rane

wtedy (1.00): 0.783 - 1.145 sec
ujrzalem (0.96): 1.286 - 1.788 sec
na (1.00): 1.868 - 1.949 sec
jego (1.00): 2.009 - 2.230 sec
brzuchu (0.97): 2.330 - 2.732 sec
okragla (1.00): 2.893 - 3.415 sec
czarna (0.90): 3.556 - 3.938 sec
rane (1.00): 4.098 - 4.399 sec

Portuguese

[Plot: emission and aligned spectrogram]

Raw Transcript:  na imensa extensão onde se esconde o inconsciente imortal
Normalized Transcript:  na imensa extensao onde se esconde o inconsciente imortal

na (1.00): 0.020 - 0.080 sec
imensa (0.90): 0.120 - 0.502 sec
extensao (0.92): 0.542 - 1.205 sec
onde (1.00): 1.446 - 1.667 sec
se (0.99): 1.748 - 1.828 sec
esconde (0.99): 1.888 - 2.591 sec
o (0.98): 2.852 - 2.872 sec
inconsciente (0.80): 2.933 - 3.897 sec
imortal (0.86): 3.937 - 4.560 sec

Italian

[Plot: emission and aligned spectrogram]

Raw Transcript:  elle giacean per terra tutte quante
Normalized Transcript:  elle giacean per terra tutte quante

elle (1.00): 0.563 - 0.864 sec
giacean (0.99): 0.945 - 1.467 sec
per (1.00): 1.588 - 1.789 sec
terra (1.00): 1.950 - 2.392 sec
tutte (1.00): 2.533 - 2.975 sec
quante (1.00): 3.055 - 3.678 sec

Conclusion

In this tutorial, we looked at how to use torchaudio’s forced alignment API and a Wav2Vec2 pre-trained multilingual acoustic model to align speech data to transcripts in five languages.

Acknowledgement

Thanks to Vineel Pratap and Zhaoheng Ni for developing and open-sourcing the forced aligner API.

Total running time of the script: (0 minutes 3.673 seconds)