Forced alignment for multilingual data — Torchaudio 2.2.0.dev20240214 documentation
Authors: Xiaohui Zhang, Moto Hira.
This tutorial shows how to align transcripts to speech in non-English languages.

The process of aligning a non-English (normalized) transcript is identical to that for an English (normalized) transcript, and the process for English is covered in detail in the CTC forced alignment tutorial. In this tutorial, we use TorchAudio’s high-level API, torchaudio.pipelines.Wav2Vec2FABundle, which packages the pre-trained model, tokenizer and aligner, to perform forced alignment with less code.
The torch and torchaudio versions, and the device used to generate this tutorial:

2.3.0.dev20240213
2.2.0.dev20240214
cuda
from typing import List

import IPython
import matplotlib.pyplot as plt
Creating the pipeline
First, we instantiate the model and pre/post-processing pipelines.
The following diagram illustrates the process of alignment.
The waveform is passed to the acoustic model, which produces a sequence of probability distributions over tokens. The transcript is passed to the tokenizer, which converts it into a sequence of tokens. The aligner takes the results from the acoustic model and the tokenizer and generates timestamps for each token.
Note
This process expects that the input transcript is already normalized. The process of normalization, which involves romanization of non-English languages, is language-dependent, so it is not covered in this tutorial, but we will briefly look into it.
The acoustic model and the tokenizer must use the same set of tokens. To facilitate the creation of matching processors, Wav2Vec2FABundle associates a pre-trained acoustic model with a tokenizer. torchaudio.pipelines.MMS_FA is one such instance.
The following code instantiates a pre-trained acoustic model, a tokenizer which uses the same set of tokens as the model, and an aligner.
from torchaudio.pipelines import MMS_FA as bundle

model = bundle.get_model()
model.to(device)

tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()
Note

The model instantiated by MMS_FA’s get_model() method by default includes the feature dimension for the <star> token. You can disable this by passing with_star=False.
The acoustic model of MMS_FA was created and open-sourced as part of the research project Scaling Speech Technology to 1,000+ Languages. It was trained with 23,000 hours of audio from 1100+ languages.
The tokenizer simply maps the normalized characters to integers. You can check the mapping as follows:
{'-': 0, 'a': 1, 'i': 2, 'e': 3, 'n': 4, 'o': 5, 'u': 6, 't': 7, 's': 8, 'r': 9, 'm': 10, 'k': 11, 'l': 12, 'd': 13, 'g': 14, 'h': 15, 'y': 16, 'b': 17, 'p': 18, 'w': 19, 'c': 20, 'v': 21, 'j': 22, 'z': 23, 'f': 24, "'": 25, 'q': 26, 'x': 27, '*': 28}
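To see concretely what the tokenizer does, the mapping above can be applied by hand. The following sketch is plain Python using the dictionary printed above (abridged to the characters it needs), not the pipeline’s actual tokenizer object, but it behaves equivalently for normalized lowercase text.

```python
# Character-to-index mapping, abridged from the mapping printed above.
dictionary = {'-': 0, 'a': 1, 'i': 2, 'e': 3, 'n': 4, 'o': 5, 'u': 6,
              't': 7, 's': 8, 'r': 9, 'm': 10, 'k': 11, 'l': 12, 'h': 15}


def tokenize(word):
    # Map each normalized character to its integer index.
    return [dictionary[c] for c in word]


print(tokenize("hello"))  # [15, 3, 12, 12, 5]
```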
The aligner internally uses torchaudio.functional.forced_align() and torchaudio.functional.merge_tokens() to infer the timestamps of the input tokens. The details of the underlying mechanism are covered in the CTC forced alignment API tutorial, so please refer to it.
We define a utility function that performs the forced alignment with the above model, the tokenizer and the aligner.
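The alignment utility can be sketched as follows. The function name compute_alignments and its exact shape are our assumptions, modeled on the model, tokenizer, and aligner objects created above; it is a sketch, not the tutorial’s verbatim definition.

```python
import torch


def compute_alignments(waveform, transcript):
    # `model`, `tokenizer`, `aligner` and `device` are the objects created
    # earlier from torchaudio.pipelines.MMS_FA; `transcript` is a list of
    # normalized words.
    with torch.inference_mode():
        # Acoustic model: waveform -> frame-wise token probability distributions.
        emission, _ = model(waveform.to(device))
        # Tokenize the transcript and align it against the emission.
        token_spans = aligner(emission[0], tokenizer(transcript))
    return emission, token_spans
```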
We also define utility functions for plotting the result and previewing the audio segments.
# Compute average score weighted by the span length
def _score(spans):
    return sum(s.score * len(s) for s in spans) / sum(len(s) for s in spans)


def plot_alignments(waveform, token_spans, emission, transcript, sample_rate=bundle.sample_rate):
    ratio = waveform.size(1) / emission.size(1) / sample_rate

    fig, axes = plt.subplots(2, 1)
    axes[0].imshow(emission[0].detach().cpu().T, aspect="auto")
    axes[0].set_title("Emission")
    axes[0].set_xticks([])

    axes[1].specgram(waveform[0], Fs=sample_rate)
    for t_spans, chars in zip(token_spans, transcript):
        t0, t1 = t_spans[0].start, t_spans[-1].end
        axes[0].axvspan(t0 - 0.5, t1 - 0.5, facecolor="None", hatch="/", edgecolor="white")
        axes[1].axvspan(ratio * t0, ratio * t1, facecolor="None", hatch="/", edgecolor="white")
        axes[1].annotate(f"{_score(t_spans):.2f}", (ratio * t0, sample_rate * 0.51), annotation_clip=False)

        for span, char in zip(t_spans, chars):
            t0 = span.start * ratio
            axes[1].annotate(char, (t0, sample_rate * 0.55), annotation_clip=False)

    axes[1].set_xlabel("time [second]")
    fig.tight_layout()
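The audio-preview utility can be sketched as below. This is our assumption of its shape, not the tutorial’s verbatim code: spans is the list of token spans for a single word as produced by the aligner, num_frames is the time length of the emission, and 16000 Hz matches bundle.sample_rate of MMS_FA.

```python
def preview_word(waveform, spans, num_frames, transcript, sample_rate=16000):
    """Cut out the audio segment of one aligned word and wrap it for playback."""
    import IPython.display  # local import so the sketch can be defined without IPython

    # Number of audio samples per emission frame.
    ratio = waveform.size(1) / num_frames
    x0 = int(ratio * spans[0].start)  # first sample of the word
    x1 = int(ratio * spans[-1].end)   # last sample of the word
    print(f"{transcript} ({_score(spans):.2f}): {x0 / sample_rate:.3f} - {x1 / sample_rate:.3f} sec")
    return IPython.display.Audio(waveform[:, x0:x1].numpy(), rate=sample_rate)
```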
Normalizing the transcript
The transcripts passed to the pipeline must be normalized beforehand. The exact process of normalization depends on the language.

Languages that do not have explicit word boundaries (such as Chinese, Japanese and Korean) require segmentation first. There are dedicated tools for this, but let us assume here that the transcript is already segmented.
The first step of normalization is romanization. uroman is a tool that supports many languages.
Here are the Bash commands to romanize an input text file and write the output to another text file using uroman.
$ echo "Cette page concerne des événements d'actualité qui se sont produits durant l'année 1882" > text.txt
$ uroman/bin/uroman.pl < text.txt > text_romanized.txt
$ cat text_romanized.txt
Cette page concerne des evenements d'actualite qui se sont produits durant l'annee 1882
The next step is to remove non-alphabetic characters and punctuation. The following snippet normalizes the romanized transcript.
import re


def normalize_uroman(text):
    text = text.lower()
    text = text.replace("’", "'")
    text = re.sub("([^a-z' ])", " ", text)
    text = re.sub(' +', ' ', text)
    return text.strip()


with open("text_romanized.txt", "r") as f:
    for line in f:
        text_normalized = normalize_uroman(line)
        print(text_normalized)
Running the script on the above example produces the following.

cette page concerne des evenements d'actualite qui se sont produits durant l'annee
Note that, in this example, since “1882” was not romanized by uroman, it was removed in the normalization step. To avoid this, one needs to romanize numbers, but this is known to be a non-trivial task.
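To see the number-dropping behavior concretely, here is a small standalone demo. It re-defines the normalization function from above so the snippet is self-contained.

```python
import re


def normalize_uroman(text):
    # Same normalization as above: lowercase, unify apostrophes, blank out
    # everything outside [a-z' ], and squeeze runs of spaces.
    text = text.lower()
    text = text.replace("’", "'")
    text = re.sub("([^a-z' ])", " ", text)
    text = re.sub(" +", " ", text)
    return text.strip()


print(normalize_uroman("l'annee 1882"))  # -> "l'annee": the digits are blanked out
print(repr(normalize_uroman("1882")))   # -> '': a purely numeric token vanishes entirely
```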
Aligning transcripts to speech
Now we perform the forced alignment for multiple languages.
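For each language below, the tutorial loads an utterance and aligns it against its normalized transcript. The per-language step can be sketched as follows; this is a hypothetical outline in which the helper names compute_alignments and plot_alignments (the alignment and plotting utilities described above) and the audio_path argument are our assumptions, and bundle is the MMS_FA bundle from earlier.

```python
def align_utterance(audio_path, text_normalized):
    # Imported inside the function so the sketch can be defined even where
    # torchaudio is absent.
    import torchaudio

    waveform, sample_rate = torchaudio.load(audio_path)
    assert sample_rate == bundle.sample_rate  # MMS_FA expects 16 kHz audio

    transcript = text_normalized.split()  # word-level units
    emission, token_spans = compute_alignments(waveform, transcript)
    plot_alignments(waveform, token_spans, emission, transcript)
    return emission, token_spans
```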
German
Raw Transcript: aber seit ich bei ihnen das brot hole
Normalized Transcript: aber seit ich bei ihnen das brot hole

aber (0.96): 0.222 - 0.464 sec
seit (0.78): 0.565 - 0.766 sec
ich (0.91): 0.847 - 0.948 sec
bei (0.96): 1.028 - 1.190 sec
ihnen (0.65): 1.331 - 1.532 sec
das (0.54): 1.573 - 1.774 sec
brot (0.86): 1.855 - 2.117 sec
hole (0.71): 2.177 - 2.480 sec
Chinese
Chinese is a character-based language, and its raw written form has no explicit word-level tokenization (separation by spaces). In order to obtain word-level alignments, you need to first tokenize the transcripts at the word level using a word tokenizer like “Stanford Tokenizer”. However, this is not needed if you only want character-level alignments.
text_raw = "关 服务 高端 产品 仍 处于 供不应求 的 局面"
text_normalized = "guan fuwu gaoduan chanpin reng chuyu gongbuyingqiu de jumian"
Raw Transcript: 关 服务 高端 产品 仍 处于 供不应求 的 局面
Normalized Transcript: guan fuwu gaoduan chanpin reng chuyu gongbuyingqiu de jumian

guan (0.33): 0.020 - 0.141 sec
fuwu (0.31): 0.221 - 0.583 sec
gaoduan (0.74): 0.724 - 1.065 sec
chanpin (0.73): 1.126 - 1.528 sec
reng (0.86): 1.608 - 1.809 sec
chuyu (0.80): 1.849 - 2.151 sec
gongbuyingqiu (0.93): 2.251 - 2.894 sec
de (0.98): 2.935 - 3.015 sec
jumian (0.95): 3.075 - 3.477 sec
Polish
Raw Transcript: wtedy ujrzałem na jego brzuchu okrągłą czarną ranę
Normalized Transcript: wtedy ujrzalem na jego brzuchu okragla czarna rane

wtedy (1.00): 0.783 - 1.145 sec
ujrzalem (0.96): 1.286 - 1.788 sec
na (1.00): 1.868 - 1.949 sec
jego (1.00): 2.009 - 2.230 sec
brzuchu (0.97): 2.330 - 2.732 sec
okragla (1.00): 2.893 - 3.415 sec
czarna (0.90): 3.556 - 3.938 sec
rane (1.00): 4.098 - 4.399 sec
Portuguese
Raw Transcript: na imensa extensão onde se esconde o inconsciente imortal
Normalized Transcript: na imensa extensao onde se esconde o inconsciente imortal

na (1.00): 0.020 - 0.080 sec
imensa (0.90): 0.120 - 0.502 sec
extensao (0.92): 0.542 - 1.205 sec
onde (1.00): 1.446 - 1.667 sec
se (0.99): 1.748 - 1.828 sec
esconde (0.99): 1.888 - 2.591 sec
o (0.98): 2.852 - 2.872 sec
inconsciente (0.80): 2.933 - 3.897 sec
imortal (0.86): 3.937 - 4.560 sec
Italian
Raw Transcript: elle giacean per terra tutte quante
Normalized Transcript: elle giacean per terra tutte quante

elle (1.00): 0.563 - 0.864 sec
giacean (0.99): 0.945 - 1.467 sec
per (1.00): 1.588 - 1.789 sec
terra (1.00): 1.950 - 2.392 sec
tutte (1.00): 2.533 - 2.975 sec
quante (1.00): 3.055 - 3.678 sec
Conclusion
In this tutorial, we looked at how to use torchaudio’s forced alignment API and a Wav2Vec2 pre-trained multilingual acoustic model to align speech data to transcripts in five languages.
Acknowledgement
Thanks to Vineel Pratap and Zhaoheng Ni for developing and open-sourcing the forced aligner API.
Total running time of the script: (0 minutes 3.673 seconds)