Wav2Vec2

Overview

The Wav2Vec2 model was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

The abstract from the paper is the following:

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.

Tips:

  • Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
  • The Wav2Vec2 model was trained using connectionist temporal classification (CTC), so the model output has to be decoded using Wav2Vec2CTCTokenizer (see the sketch below).
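
A minimal sketch tying these two tips together, assuming the facebook/wav2vec2-base-960h checkpoint and the librispeech_asr_demo dataset used elsewhere on this page (Wav2Vec2Processor bundles the feature extractor and the CTC tokenizer):

>>> import torch
>>> from datasets import load_dataset
>>> from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

>>> processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

>>> # the model consumes the raw float waveform, sampled at 16 kHz
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds.features["audio"].sampling_rate, return_tensors="pt")

>>> # the CTC logits are decoded greedily back to text by the tokenizer
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)
>>> transcription = processor.batch_decode(predicted_ids)[0]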

This model was contributed by patrickvonplaten.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Wav2Vec2. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Automatic Speech Recognition

🚀 Deploy

Wav2Vec2Config

class transformers.Wav2Vec2Config

< source >

( vocab_size = 32, hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = 'gelu', hidden_dropout = 0.1, activation_dropout = 0.1, attention_dropout = 0.1, feat_proj_dropout = 0.0, feat_quantizer_dropout = 0.0, final_dropout = 0.1, layerdrop = 0.1, initializer_range = 0.02, layer_norm_eps = 1e-05, feat_extract_norm = 'group', feat_extract_activation = 'gelu', conv_dim = (512, 512, 512, 512, 512, 512, 512), conv_stride = (5, 2, 2, 2, 2, 2, 2), conv_kernel = (10, 3, 3, 3, 3, 2, 2), conv_bias = False, num_conv_pos_embeddings = 128, num_conv_pos_embedding_groups = 16, do_stable_layer_norm = False, apply_spec_augment = True, mask_time_prob = 0.05, mask_time_length = 10, mask_time_min_masks = 2, mask_feature_prob = 0.0, mask_feature_length = 10, mask_feature_min_masks = 0, num_codevectors_per_group = 320, num_codevector_groups = 2, contrastive_logits_temperature = 0.1, num_negatives = 100, codevector_dim = 256, proj_codevector_dim = 256, diversity_loss_weight = 0.1, ctc_loss_reduction = 'sum', ctc_zero_infinity = False, use_weighted_layer_sum = False, classifier_proj_size = 256, tdnn_dim = (512, 512, 512, 512, 1500), tdnn_kernel = (5, 3, 3, 1, 1), tdnn_dilation = (1, 2, 3, 1, 1), xvector_output_dim = 512, pad_token_id = 0, bos_token_id = 1, eos_token_id = 2, add_adapter = False, adapter_kernel_size = 3, adapter_stride = 2, num_adapter_layers = 3, output_hidden_size = None, adapter_attn_dim = None, **kwargs )

This is the configuration class to store the configuration of a Wav2Vec2Model. It is used to instantiate a Wav2Vec2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Wav2Vec2 facebook/wav2vec2-base-960h architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import Wav2Vec2Config, Wav2Vec2Model

>>> # Initializing a Wav2Vec2 facebook/wav2vec2-base-960h style configuration
>>> configuration = Wav2Vec2Config()

>>> # Initializing a model (with random weights) from the facebook/wav2vec2-base-960h style configuration
>>> model = Wav2Vec2Model(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

Wav2Vec2CTCTokenizer

class transformers.Wav2Vec2CTCTokenizer

< source >

( vocab_file, bos_token = '<s>', eos_token = '</s>', unk_token = '<unk>', pad_token = '<pad>', word_delimiter_token = '|', replace_word_delimiter_char = ' ', do_lower_case = False, target_lang = None, **kwargs )

Parameters

  • vocab_file (str) — File containing the vocabulary.

  • bos_token (str, optional, defaults to "<s>") — The beginning of sentence token.

  • eos_token (str, optional, defaults to "</s>") — The end of sentence token.

  • unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.

  • word_delimiter_token (str, optional, defaults to "|") — The token used for defining the end of a word.

  • do_lower_case (bool, optional, defaults to False) — Whether or not to accept lowercase input and lowercase the output when decoding.

  • target_lang (str, optional) — A target language the tokenizer should set by default. target_lang has to be defined for multi-lingual, nested vocabulary such as facebook/mms-1b-all.

    **kwargs — Additional keyword arguments passed along to PreTrainedTokenizer

Constructs a Wav2Vec2CTC tokenizer.

This tokenizer inherits from PreTrainedTokenizer which contains some of the main methods. Users should refer to the superclass for more information regarding such methods.

__call__

< source >

( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None, text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None, text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None, text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None, add_special_tokens: bool = True, padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None, max_length: typing.Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: typing.Optional[int] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, return_token_type_ids: typing.Optional[bool] = None, return_attention_mask: typing.Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs ) → BatchEncoding

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.

save_vocabulary

< source >

( save_directory: str, filename_prefix: typing.Optional[str] = None )

decode

< source >

( token_ids: typing.Union[int, typing.List[int], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = None, output_char_offsets: bool = False, output_word_offsets: bool = False, **kwargs ) → str or Wav2Vec2CTCTokenizerOutput

Converts a sequence of ids into a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces.

Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).

Example:

>>> 
>>> from transformers import AutoTokenizer, AutoFeatureExtractor, AutoModelForCTC
>>> from datasets import load_dataset
>>> import datasets
>>> import torch

>>> 
>>> model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/wav2vec2-base-960h")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")

>>> 
>>> dataset = load_dataset("common_voice", "en", split="train", streaming=True)
>>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
>>> dataset_iter = iter(dataset)
>>> sample = next(dataset_iter)

>>> 
>>> input_values = feature_extractor(sample["audio"]["array"], return_tensors="pt").input_values
>>> logits = model(input_values).logits[0]
>>> pred_ids = torch.argmax(logits, axis=-1)

>>> # retrieve word stamps (analogous commands for `output_char_offsets`)
>>> outputs = tokenizer.decode(pred_ids, output_word_offsets=True)
>>> # compute `time_offset` in seconds as product of downsampling ratio and sampling rate
>>> time_offset = model.config.inputs_to_logits_ratio / feature_extractor.sampling_rate

>>> word_offsets = [
...     {
...         "word": d["word"],
...         "start_time": round(d["start_offset"] * time_offset, 2),
...         "end_time": round(d["end_offset"] * time_offset, 2),
...     }
...     for d in outputs.word_offsets
... ]
>>> 
>>> 
>>> word_offsets[:3]
[{'word': 'WHY', 'start_time': 1.42, 'end_time': 1.54}, {'word': 'DOES', 'start_time': 1.64, 'end_time': 1.9}, {'word': 'MILISANDRA', 'start_time': 2.26, 'end_time': 2.9}]

batch_decode

< source >

( sequences: typing.Union[typing.List[int], typing.List[typing.List[int]], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = None, output_char_offsets: bool = False, output_word_offsets: bool = False, **kwargs ) → List[str] or Wav2Vec2CTCTokenizerOutput

Convert a list of lists of token ids into a list of strings by calling decode.

set_target_lang

Set the target language of a nested multi-lingual dictionary.
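
For example, switching between sub-vocabularies of a multi-lingual checkpoint such as facebook/mms-1b-all might look like this (a brief sketch; the language codes follow the MMS convention used elsewhere on this page):

>>> from transformers import Wav2Vec2CTCTokenizer

>>> # load the English sub-vocabulary, then switch the tokenizer to Spanish
>>> tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/mms-1b-all", target_lang="eng")
>>> tokenizer.set_target_lang("spa")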

Wav2Vec2FeatureExtractor

( feature_size = 1, sampling_rate = 16000, padding_value = 0.0, return_attention_mask = False, do_normalize = True, **kwargs )

Parameters

  • feature_size (int, defaults to 1) — The feature dimension of the extracted features.

  • sampling_rate (int, defaults to 16000) — The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).

  • padding_value (float, defaults to 0.0) — The value that is used to fill the padding values.

  • do_normalize (bool, optional, defaults to True) — Whether or not to zero-mean unit-variance normalize the input. Normalizing can help to significantly improve the performance for some models, e.g., wav2vec2-lv60.

  • return_attention_mask (bool, optional, defaults to False) — Whether or not call() should return attention_mask.

    Wav2Vec2 models that have set config.feat_extract_norm == "group", such as wav2vec2-base, have not been trained using attention_mask. For such models, input_values should simply be padded with 0 and no attention_mask should be passed.

    For Wav2Vec2 models that have set config.feat_extract_norm == "layer", such as wav2vec2-lv60, attention_mask should be passed for batched inference.

Constructs a Wav2Vec2 feature extractor.

This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

( raw_speech: typing.Union[numpy.ndarray, typing.List[float], typing.List[numpy.ndarray], typing.List[typing.List[float]]], padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, max_length: typing.Optional[int] = None, truncation: bool = False, pad_to_multiple_of: typing.Optional[int] = None, return_attention_mask: typing.Optional[bool] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, sampling_rate: typing.Optional[int] = None, **kwargs )

Main method to featurize and prepare for the model one or several sequence(s).
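
For example, a raw waveform can be featurized as follows (a brief sketch; the dummy zero-valued waveform is only a placeholder for real 16 kHz audio):

>>> import numpy as np
>>> from transformers import Wav2Vec2FeatureExtractor

>>> feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")

>>> # one second of dummy 16 kHz audio; in practice pass the decoded waveform array
>>> raw_speech = np.zeros(16_000, dtype=np.float32)

>>> # `input_values` has shape (batch_size, sequence_length) and can be fed directly to a Wav2Vec2 model
>>> inputs = feature_extractor(raw_speech, sampling_rate=16_000, return_tensors="pt")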

Wav2Vec2Processor

Constructs a Wav2Vec2 processor which wraps a Wav2Vec2 feature extractor and a Wav2Vec2 CTC tokenizer into a single processor.

Wav2Vec2Processor offers all the functionalities of Wav2Vec2FeatureExtractor and PreTrainedTokenizer. See the docstring of call() and decode() for more information.

When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor’s call() and returns its output. If used in the context as_target_processor() this method forwards all its arguments to PreTrainedTokenizer’s call(). Please refer to the docstring of the above two methods for more information.

When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor’s pad() and returns its output. If used in the context as_target_processor() this method forwards all its arguments to PreTrainedTokenizer’s pad(). Please refer to the docstring of the above two methods for more information.
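
A short sketch of the two code paths (the checkpoint name is the one used elsewhere on this page; text passed via the text keyword is tokenized, e.g. to build CTC labels):

>>> import numpy as np
>>> from transformers import Wav2Vec2Processor

>>> processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

>>> # audio is routed to the wrapped Wav2Vec2FeatureExtractor ...
>>> raw_speech = np.zeros(16_000, dtype=np.float32)  # placeholder for a real 16 kHz waveform
>>> inputs = processor(raw_speech, sampling_rate=16_000, return_tensors="pt")

>>> # ... while text is routed to the wrapped Wav2Vec2CTCTokenizer
>>> labels = processor(text="HELLO WORLD", return_tensors="pt").input_ids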

from_pretrained

< source >

( pretrained_model_name_or_path, **kwargs )

save_pretrained

< source >

( save_directory, push_to_hub: bool = False, **kwargs )

Parameters

  • save_directory (str or os.PathLike) — Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist).
  • push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
  • kwargs (Dict[str, Any], optional) — Additional key word arguments passed along to the push_to_hub() method.

Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.

This class method simply calls the feature extractor's save_pretrained() and the tokenizer's save_pretrained(). Please refer to the docstrings of the methods above for more information.
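
For instance, a save/reload round trip might look like this (the local directory name is an arbitrary example):

>>> from transformers import Wav2Vec2Processor

>>> processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

>>> # writes the feature extractor JSON file and the tokenizer files into the directory
>>> processor.save_pretrained("./my_wav2vec2_processor")

>>> # the same directory can then be passed back to from_pretrained()
>>> processor = Wav2Vec2Processor.from_pretrained("./my_wav2vec2_processor")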

This method forwards all its arguments to PreTrainedTokenizer’s batch_decode(). Please refer to the docstring of this method for more information.

This method forwards all its arguments to PreTrainedTokenizer’s decode(). Please refer to the docstring of this method for more information.

Wav2Vec2ProcessorWithLM

class transformers.Wav2Vec2ProcessorWithLM

< source >

( feature_extractor: FeatureExtractionMixin, tokenizer: PreTrainedTokenizerBase, decoder: BeamSearchDecoderCTC )

Parameters

  • feature_extractor (Wav2Vec2FeatureExtractor) — An instance of Wav2Vec2FeatureExtractor. The feature extractor is a required input.
  • tokenizer (Wav2Vec2CTCTokenizer) — An instance of Wav2Vec2CTCTokenizer. The tokenizer is a required input.
  • decoder (pyctcdecode.BeamSearchDecoderCTC) — An instance of pyctcdecode.BeamSearchDecoderCTC. The decoder is a required input.

Constructs a Wav2Vec2 processor which wraps a Wav2Vec2 feature extractor, a Wav2Vec2 CTC tokenizer and a decoder with language model support into a single processor for language model boosted speech recognition decoding.

When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor’s call() and returns its output. If used in the context as_target_processor() this method forwards all its arguments to Wav2Vec2CTCTokenizer’s call(). Please refer to the docstring of the above two methods for more information.

When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor’s pad() and returns its output. If used in the context as_target_processor() this method forwards all its arguments to Wav2Vec2CTCTokenizer’s pad(). Please refer to the docstring of the above two methods for more information.

from_pretrained

< source >

( pretrained_model_name_or_path, **kwargs )

Parameters

  • pretrained_model_name_or_path (str or os.PathLike) — This can be either:

    • a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
    • a path to a directory containing a feature extractor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
    • a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json.

    **kwargs — Additional keyword arguments passed along to both SequenceFeatureExtractor and PreTrainedTokenizer.

Instantiate a Wav2Vec2ProcessorWithLM from a pretrained Wav2Vec2 processor.

This class method is simply calling Wav2Vec2FeatureExtractor’s from_pretrained(), Wav2Vec2CTCTokenizer’s from_pretrained(), and pyctcdecode.BeamSearchDecoderCTC.load_from_hf_hub.

Please refer to the docstrings of the methods above for more information.

batch_decode

< source >

( logits: ndarray, pool: typing.Optional[multiprocessing.pool.Pool] = None, num_processes: typing.Optional[int] = None, beam_width: typing.Optional[int] = None, beam_prune_logp: typing.Optional[float] = None, token_min_logp: typing.Optional[float] = None, hotwords: typing.Optional[typing.Iterable[str]] = None, hotword_weight: typing.Optional[float] = None, alpha: typing.Optional[float] = None, beta: typing.Optional[float] = None, unk_score_offset: typing.Optional[float] = None, lm_score_boundary: typing.Optional[bool] = None, output_word_offsets: bool = False, n_best: int = 1 )

Batch decode output logits to audio transcription with language model support.

This function makes use of Python’s multiprocessing. Currently, multiprocessing is available only on Unix systems (see this issue).

If you are decoding multiple batches, consider creating a Pool and passing it to batch_decode. Otherwise, batch_decode will be very slow since it will create a fresh Pool for each call. See usage example below.

Example: See Decoding multiple audios.

decode

< source >

( logits: ndarray, beam_width: typing.Optional[int] = None, beam_prune_logp: typing.Optional[float] = None, token_min_logp: typing.Optional[float] = None, hotwords: typing.Optional[typing.Iterable[str]] = None, hotword_weight: typing.Optional[float] = None, alpha: typing.Optional[float] = None, beta: typing.Optional[float] = None, unk_score_offset: typing.Optional[float] = None, lm_score_boundary: typing.Optional[bool] = None, output_word_offsets: bool = False, n_best: int = 1 )

Decode output logits to audio transcription with language model support.

Example:

>>> 
>>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC
>>> from datasets import load_dataset
>>> import datasets
>>> import torch

>>> 
>>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
>>> processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")

>>> 
>>> dataset = load_dataset("common_voice", "en", split="train", streaming=True)
>>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
>>> dataset_iter = iter(dataset)
>>> sample = next(dataset_iter)

>>> 
>>> input_values = processor(sample["audio"]["array"], return_tensors="pt").input_values
>>> with torch.no_grad():
...     logits = model(input_values).logits[0].cpu().numpy()

>>> # retrieve word stamps (analogous commands for `output_char_offsets`)
>>> outputs = processor.decode(logits, output_word_offsets=True)
>>> # compute `time_offset` in seconds as product of downsampling ratio and sampling rate
>>> time_offset = model.config.inputs_to_logits_ratio / processor.feature_extractor.sampling_rate

>>> word_offsets = [
...     {
...         "word": d["word"],
...         "start_time": round(d["start_offset"] * time_offset, 2),
...         "end_time": round(d["end_offset"] * time_offset, 2),
...     }
...     for d in outputs.word_offsets
... ]
>>> 
>>> 
>>> word_offsets[:4]
[{'word': 'WHY', 'start_time': 1.42, 'end_time': 1.54}, {'word': 'DOES', 'start_time': 1.66, 'end_time': 1.9}, {'word': 'MILISANDRA', 'start_time': 2.26, 'end_time': 2.9}, {'word': 'LOOK', 'start_time': 3.0, 'end_time': 3.16}]

Decoding multiple audios

If you are planning to decode multiple batches of audios, you should consider using batch_decode() and passing an instantiated multiprocessing.Pool. Otherwise, batch_decode() performance will be slower than calling decode() for each audio individually, as it internally instantiates a new Pool for every call. See the example below:

>>> 
>>> from multiprocessing import get_context
>>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC
>>> from datasets import load_dataset
>>> import datasets
>>> import torch

>>> 
>>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm").to("cuda")
>>> processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")

>>> 
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))


>>> def map_to_array(batch):
...     batch["speech"] = batch["audio"]["array"]
...     return batch


>>> 
>>> dataset = dataset.map(map_to_array, remove_columns=["audio"])


>>> def map_to_pred(batch, pool):
...     inputs = processor(batch["speech"], sampling_rate=16_000, padding=True, return_tensors="pt")
...     inputs = {k: v.to("cuda") for k, v in inputs.items()}

...     with torch.no_grad():
...         logits = model(**inputs).logits

...     transcription = processor.batch_decode(logits.cpu().numpy(), pool).text
...     batch["transcription"] = transcription
...     return batch


>>> 
>>> 
>>> 
>>> with get_context("fork").Pool(processes=2) as pool:
...     result = dataset.map(
...         map_to_pred, batched=True, batch_size=2, fn_kwargs={"pool": pool}, remove_columns=["speech"]
...     )

>>> result["transcription"][:2]
['MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL', "NOR IS MISTER COULTER'S MANNER LESS INTERESTING THAN HIS MATTER"]

Wav2Vec2 specific outputs

class transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2DecoderWithLMOutput

< source >

( text: typing.Union[typing.List[typing.List[str]], typing.List[str], str], logit_score: typing.Union[typing.List[typing.List[float]], typing.List[float], float] = None, lm_score: typing.Union[typing.List[typing.List[float]], typing.List[float], float] = None, word_offsets: typing.Union[typing.List[typing.List[typing.List[typing.Dict[str, typing.Union[int, str]]]]], typing.List[typing.List[typing.Dict[str, typing.Union[int, str]]]], typing.List[typing.Dict[str, typing.Union[int, str]]]] = None )

Parameters

  • text (list of str or str) — Decoded logits in text form. Usually the speech transcription.
  • logit_score (list of float or float) — Total logit score of the beams associated with produced text.
  • lm_score (list of float) — Fused lm_score of the beams associated with produced text.
  • word_offsets (list of List[Dict[str, Union[int, str]]] or List[Dict[str, Union[int, str]]]) — Offsets of the decoded words. In combination with sampling rate and model downsampling rate word offsets can be used to compute time stamps for each word.

Output type of Wav2Vec2DecoderWithLM, with transcription.

class transformers.modeling_outputs.Wav2Vec2BaseModelOutput

< source >

( last_hidden_state: FloatTensor = None, extract_features: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )

Parameters

  • last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

  • extract_features (torch.FloatTensor of shape (batch_size, sequence_length, conv_dim[-1])) — Sequence of extracted feature vectors of the last convolutional layer of the model.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Base class for models that have been trained with the Wav2Vec2 loss objective.

class transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTrainingOutput

< source >

( loss: typing.Optional[torch.FloatTensor] = None, projected_states: FloatTensor = None, projected_quantized_states: FloatTensor = None, codevector_perplexity: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, contrastive_loss: typing.Optional[torch.FloatTensor] = None, diversity_loss: typing.Optional[torch.FloatTensor] = None )

Output type of Wav2Vec2ForPreTraining, with potential hidden states and attentions.

class transformers.models.wav2vec2.modeling_flax_wav2vec2.FlaxWav2Vec2BaseModelOutput

< source >

( last_hidden_state: Array = None, extract_features: Array = None, hidden_states: typing.Optional[typing.Tuple[jax.Array]] = None, attentions: typing.Optional[typing.Tuple[jax.Array]] = None )

Parameters

  • last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.

  • extract_features (jnp.ndarray of shape (batch_size, sequence_length, last_conv_dim)) — Sequence of extracted feature vectors of the last convolutional layer of the model with last_conv_dim being the dimension of the last convolutional layer.

  • hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the model at the output of each layer plus the initial embedding outputs.

  • attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Output type of FlaxWav2Vec2BaseModelOutput, with potential hidden states and attentions.

Returns a new object replacing the specified fields with new values.

class transformers.models.wav2vec2.modeling_flax_wav2vec2.FlaxWav2Vec2ForPreTrainingOutput

< source >

( projected_states: Array = None, projected_quantized_states: Array = None, codevector_perplexity: Array = None, hidden_states: typing.Optional[typing.Tuple[jax.Array]] = None, attentions: typing.Optional[typing.Tuple[jax.Array]] = None )

Output type of FlaxWav2Vec2ForPreTrainingOutput, with potential hidden states and attentions.

Returns a new object replacing the specified fields with new values.

Wav2Vec2Model

class transformers.Wav2Vec2Model

< source >

( config: Wav2Vec2Config )

Parameters

  • config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Wav2Vec2 Model transformer outputting raw hidden-states without any specific head on top. Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, mask_time_indices: typing.Optional[torch.FloatTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.Wav2Vec2BaseModelOutput or tuple(torch.FloatTensor)

The Wav2Vec2Model forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoProcessor, Wav2Vec2Model
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

>>> 
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 292, 768]

Wav2Vec2ForCTC

class transformers.Wav2Vec2ForCTC

< source >

( config, target_lang: typing.Optional[str] = None )

Parameters

  • config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
  • target_lang (str, optional) — Language id of adapter weights. Adapter weights are stored in the format adapter.<lang>.safetensors or adapter.<lang>.bin. Only relevant when using an instance of Wav2Vec2ForCTC with adapters. Uses 'eng' by default.

Wav2Vec2 Model with a language modeling head on top for Connectionist Temporal Classification (CTC). Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, labels: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)

The Wav2Vec2ForCTC forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoProcessor, Wav2Vec2ForCTC
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

>>> 
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)

>>> 
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription[0]
'MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL'

>>> inputs["labels"] = processor(text=dataset[0]["text"], return_tensors="pt").input_ids

>>> 
>>> loss = model(**inputs).loss
>>> round(loss.item(), 2)
53.48

load_adapter

< source >

( target_lang: str, force_load = True, **kwargs )

Load a language adapter model from a pre-trained adapter model.

Activate the special “offline-mode” to use this method in a firewalled environment.

Examples:

>>> from transformers import Wav2Vec2ForCTC, AutoProcessor

>>> ckpt = "facebook/mms-1b-all"
>>> processor = AutoProcessor.from_pretrained(ckpt)
>>> model = Wav2Vec2ForCTC.from_pretrained(ckpt, target_lang="eng")
>>> 
>>> processor.tokenizer.set_target_lang("spa")
>>> model.load_adapter("spa")

Wav2Vec2ForSequenceClassification

class transformers.Wav2Vec2ForSequenceClassification

< source >

( config )

Parameters

  • config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Wav2Vec2 Model with a sequence classification head on top (a linear layer over the pooled output) for tasks like SUPERB Keyword Spotting.

Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, labels: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)

The Wav2Vec2ForSequenceClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks")
>>> model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-ks")

>>> 
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> predicted_class_ids = torch.argmax(logits, dim=-1).item()
>>> predicted_label = model.config.id2label[predicted_class_ids]
>>> predicted_label
'_unknown_'

>>> 
>>> target_label = model.config.id2label[0]
>>> inputs["labels"] = torch.tensor([model.config.label2id[target_label]])
>>> loss = model(**inputs).loss
>>> round(loss.item(), 2)
6.54

Wav2Vec2ForAudioFrameClassification

class transformers.Wav2Vec2ForAudioFrameClassification

< source >

( config )

Parameters

  • config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Wav2Vec2 Model with a frame classification head on top for tasks like Speaker Diarization.

Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)

The Wav2Vec2ForAudioFrameClassification forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForAudioFrameClassification
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("anton-l/wav2vec2-base-superb-sd")
>>> model = Wav2Vec2ForAudioFrameClassification.from_pretrained("anton-l/wav2vec2-base-superb-sd")

>>> 
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], return_tensors="pt", sampling_rate=sampling_rate)
>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> probabilities = torch.sigmoid(logits[0])
>>> 
>>> labels = (probabilities > 0.5).long()
>>> labels[0].tolist()
[0, 0]

Wav2Vec2ForXVector

class transformers.Wav2Vec2ForXVector

< source >

( config )

Parameters

  • config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Wav2Vec2 Model with an XVector feature extraction head on top for tasks like Speaker Verification.

Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, labels: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.XVectorOutput or tuple(torch.FloatTensor)

The Wav2Vec2ForXVector forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForXVector
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("anton-l/wav2vec2-base-superb-sv")
>>> model = Wav2Vec2ForXVector.from_pretrained("anton-l/wav2vec2-base-superb-sv")

>>> 
>>> inputs = feature_extractor(
...     [d["array"] for d in dataset[:2]["audio"]], sampling_rate=sampling_rate, return_tensors="pt", padding=True
... )
>>> with torch.no_grad():
...     embeddings = model(**inputs).embeddings

>>> embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()

>>> 
>>> cosine_sim = torch.nn.CosineSimilarity(dim=-1)
>>> similarity = cosine_sim(embeddings[0], embeddings[1])
>>> threshold = 0.7  # the optimal threshold is dataset-dependent
>>> if similarity < threshold:
...     print("Speakers are not the same!")
>>> round(similarity.item(), 2)
0.98

Wav2Vec2ForPreTraining

class transformers.Wav2Vec2ForPreTraining

< source >

( config: Wav2Vec2Config )

Parameters

  • config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Wav2Vec2 Model with a quantizer and VQ head on top. Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, etc.).

This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

< source >

( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, mask_time_indices: typing.Optional[torch.BoolTensor] = None, sampled_negative_indices: typing.Optional[torch.BoolTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTrainingOutput or tuple(torch.FloatTensor)

The Wav2Vec2ForPreTraining forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> import torch
>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
>>> from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices, _sample_negative_indices
>>> from datasets import load_dataset

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
>>> model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> input_values = feature_extractor(ds[0]["audio"]["array"], return_tensors="pt").input_values  

>>> # compute masked indices
>>> batch_size, raw_sequence_length = input_values.shape
>>> sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length).item()
>>> mask_time_indices = _compute_mask_indices(
...     shape=(batch_size, sequence_length), mask_prob=0.2, mask_length=2
... )
>>> sampled_negative_indices = _sample_negative_indices(
...     features_shape=(batch_size, sequence_length),
...     num_negatives=model.config.num_negatives,
...     mask_time_indices=mask_time_indices,
... )
>>> mask_time_indices = torch.tensor(data=mask_time_indices, device=input_values.device, dtype=torch.long)
>>> sampled_negative_indices = torch.tensor(
...     data=sampled_negative_indices, device=input_values.device, dtype=torch.long
... )

>>> with torch.no_grad():
...     outputs = model(input_values, mask_time_indices=mask_time_indices)

>>> # compute cosine similarity between predicted (=projected_states) and target (=projected_quantized_states)
>>> cosine_sim = torch.cosine_similarity(outputs.projected_states, outputs.projected_quantized_states, dim=-1)

>>> # show that cosine similarity is much higher than random
>>> cosine_sim[mask_time_indices.to(torch.bool)].mean() > 0.5
tensor(True)

>>> # for contrastive loss training, the model should be put into train mode
>>> model = model.train()
>>> loss = model(
...     input_values, mask_time_indices=mask_time_indices, sampled_negative_indices=sampled_negative_indices
... ).loss

TFWav2Vec2Model

class transformers.TFWav2Vec2Model

< source >

( *args, **kwargs )

Parameters

  • config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare TFWav2Vec2 Model transformer outputting raw hidden-states without any specific head on top.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

  • having all inputs as keyword arguments (like PyTorch models), or
  • having all inputs as a list, tuple or dict in the first positional argument.

The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should “just work” for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

  • a single Tensor with input_values only and nothing else: model(input_values)
  • a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_values, attention_mask]) or model([input_values, attention_mask, token_type_ids])
  • a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_values": input_values, "token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
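
For example, the following two calls are equivalent ways of passing inputs (a brief sketch; the dummy waveform is only a placeholder for real 16 kHz audio):

>>> import numpy as np
>>> from transformers import AutoProcessor, TFWav2Vec2Model

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = TFWav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

>>> input_values = processor(np.zeros(16_000, dtype=np.float32), sampling_rate=16_000, return_tensors="tf").input_values

>>> # all inputs as keyword arguments (like PyTorch models) ...
>>> outputs = model(input_values=input_values)

>>> # ... or a single dict in the first positional argument (the format Keras methods prefer)
>>> outputs = model({"input_values": input_values})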

call

< source >

( input_values: tf.Tensor, attention_mask: tf.Tensor | None = None, token_type_ids: tf.Tensor | None = None, position_ids: tf.Tensor | None = None, head_mask: tf.Tensor | None = None, inputs_embeds: tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False ) → transformers.modeling_tf_outputs.TFBaseModelOutput or tuple(tf.Tensor)

The TFWav2Vec2Model forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoProcessor, TFWav2Vec2Model
>>> from datasets import load_dataset
>>> import soundfile as sf

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = TFWav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")


>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch


>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)

>>> input_values = processor(ds["speech"][0], return_tensors="tf").input_values  
>>> hidden_states = model(input_values).last_hidden_state

TFWav2Vec2ForSequenceClassification

class transformers.TFWav2Vec2ForSequenceClassification

< source >

( *args, **kwargs )

call

< source >

( input_values: tf.Tensor, attention_mask: tf.Tensor | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, labels: tf.Tensor | None = None, training: bool = False )
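
No example is given above, so here is a brief sketch, assuming the superb/wav2vec2-base-superb-ks checkpoint from the PyTorch example (from_pt=True converts the PyTorch weights on the fly if no TensorFlow weights are published for the checkpoint):

>>> import tensorflow as tf
>>> from datasets import load_dataset
>>> from transformers import AutoFeatureExtractor, TFWav2Vec2ForSequenceClassification

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks")
>>> model = TFWav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-ks", from_pt=True)

>>> # extract features and classify the clip
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="tf")
>>> logits = model(**inputs).logits
>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
>>> predicted_label = model.config.id2label[predicted_class_id]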

TFWav2Vec2ForCTC

class transformers.TFWav2Vec2ForCTC

< source >

( *args, **kwargs )

Parameters

  • config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

TFWav2Vec2 Model with a language modeling head on top for Connectionist Temporal Classification (CTC).

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

  • having all inputs as keyword arguments (like PyTorch models), or
  • having all inputs as a list, tuple or dict in the first positional argument.

The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should “just work” for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:

  • a single Tensor with input_values only and nothing else: model(input_values)
  • a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_values, attention_mask]) or model([input_values, attention_mask, token_type_ids])
  • a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_values": input_values, "token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!

call

< source >

( input_values: tf.Tensor, attention_mask: tf.Tensor | None = None, token_type_ids: tf.Tensor | None = None, position_ids: tf.Tensor | None = None, head_mask: tf.Tensor | None = None, inputs_embeds: tf.Tensor | None = None, output_attentions: Optional[bool] = None, labels: tf.Tensor | None = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False ) → transformers.modeling_tf_outputs.TFCausalLMOutput or tuple(tf.Tensor)

The TFWav2Vec2ForCTC forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> import tensorflow as tf
>>> from transformers import AutoProcessor, TFWav2Vec2ForCTC
>>> from datasets import load_dataset
>>> import soundfile as sf

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = TFWav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")


>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch


>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)

>>> input_values = processor(ds["speech"][0], return_tensors="tf").input_values  
>>> logits = model(input_values).logits
>>> predicted_ids = tf.argmax(logits, axis=-1)

>>> transcription = processor.decode(predicted_ids[0])

>>> 
>>> target_transcription = "A MAN SAID TO THE UNIVERSE SIR I EXIST"

>>> 
>>> labels = processor(text=transcription, return_tensors="tf").input_ids

>>> loss = model(input_values, labels=labels).loss

FlaxWav2Vec2Model

class transformers.FlaxWav2Vec2Model

< source >

( config: Wav2Vec2Config, input_shape: typing.Tuple = (1, 1024), seed: int = 0, dtype: dtype = <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )

Parameters

  • config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

  • dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).

    This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.

    Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters.

    If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

The bare Wav2Vec2 Model transformer outputting raw hidden-states without any specific head on top. Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer to the Flax documentation for all matter related to general usage and behavior.

Finally, this model supports inherent JAX features such as:

  • Just-In-Time (JIT) compilation
  • Automatic Differentiation
  • Vectorization
  • Parallelization

The FlaxWav2Vec2PreTrainedModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoProcessor, FlaxWav2Vec2Model
>>> from datasets import load_dataset
>>> import soundfile as sf

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-large-lv60")
>>> model = FlaxWav2Vec2Model.from_pretrained("facebook/wav2vec2-large-lv60")


>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch


>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)

>>> input_values = processor(
...     ds["speech"][0], sampling_rate=16_000, return_tensors="np"
... ).input_values  
>>> hidden_states = model(input_values).last_hidden_state

FlaxWav2Vec2ForCTC

class transformers.FlaxWav2Vec2ForCTC

< source >

( config: Wav2Vec2Config, input_shape: typing.Tuple = (1, 1024), seed: int = 0, dtype: dtype = <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )

Parameters

  • config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

  • dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).

    This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.

    Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters.

    If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

Wav2Vec2 Model with a language modeling head on top for Connectionist Temporal Classification (CTC). Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer to the Flax documentation for all matter related to general usage and behavior.

Finally, this model supports inherent JAX features such as:

  • Just-In-Time (JIT) compilation
  • Automatic Differentiation
  • Vectorization
  • Parallelization

__call__

< source >

( input_values, attention_mask = None, mask_time_indices = None, params: dict = None, dropout_rng: PRNGKey = None, train: bool = False, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, freeze_feature_encoder: bool = False, return_dict: typing.Optional[bool] = None ) → transformers.modeling_flax_outputs.FlaxMaskedLMOutput or tuple(torch.FloatTensor)

The FlaxWav2Vec2PreTrainedModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> import jax.numpy as jnp
>>> from transformers import AutoProcessor, FlaxWav2Vec2ForCTC
>>> from datasets import load_dataset
>>> import soundfile as sf

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-large-960h-lv60")
>>> model = FlaxWav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60")


>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch


>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)

>>> input_values = processor(
...     ds["speech"][0], sampling_rate=16_000, return_tensors="np"
... ).input_values  
>>> logits = model(input_values).logits
>>> predicted_ids = jnp.argmax(logits, axis=-1)

>>> transcription = processor.decode(predicted_ids[0])
>>> 

FlaxWav2Vec2ForPreTraining

class transformers.FlaxWav2Vec2ForPreTraining

< source >

( config: Wav2Vec2Config, input_shape: typing.Tuple = (1, 1024), seed: int = 0, dtype: dtype = <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )

Parameters

  • config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

  • dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).

    This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.

    Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters.

    If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

Wav2Vec2 Model with a quantizer and VQ head on top. Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer to the Flax documentation for all matter related to general usage and behavior.

Finally, this model supports inherent JAX features such as:

  • Just-In-Time (JIT) compilation
  • Automatic Differentiation
  • Vectorization
  • Parallelization

__call__

< source >

( input_values, attention_mask = None, mask_time_indices = None, gumbel_temperature: int = 1, params: dict = None, dropout_rng: PRNGKey = None, gumbel_rng: PRNGKey = None, train: bool = False, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, freeze_feature_encoder: bool = False, return_dict: typing.Optional[bool] = None ) → transformers.models.wav2vec2.modeling_flax_wav2vec2.FlaxWav2Vec2ForPreTrainingOutput or tuple(torch.FloatTensor)

The FlaxWav2Vec2ForPreTraining forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> import optax
>>> import numpy as np
>>> import jax.numpy as jnp
>>> from transformers import AutoFeatureExtractor, FlaxWav2Vec2ForPreTraining
>>> from transformers.models.wav2vec2.modeling_flax_wav2vec2 import _compute_mask_indices
>>> from datasets import load_dataset
>>> import soundfile as sf

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-large-lv60")
>>> model = FlaxWav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-large-lv60")


>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch


>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)

>>> input_values = feature_extractor(ds["speech"][0], return_tensors="np").input_values  

>>> 
>>> batch_size, raw_sequence_length = input_values.shape
>>> sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length)
>>> mask_time_indices = _compute_mask_indices((batch_size, sequence_length), mask_prob=0.2, mask_length=2)

>>> outputs = model(input_values, mask_time_indices=mask_time_indices)

>>> 
>>> cosine_sim = optax.cosine_similarity(outputs.projected_states, outputs.projected_quantized_states)

>>> 
>>> assert np.asarray(cosine_sim)[mask_time_indices].mean() > 0.5