Wav2Vec2
Overview
The Wav2Vec2 model was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
The abstract from the paper is the following:
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
Tips:
- Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- The Wav2Vec2 model was trained using connectionist temporal classification (CTC), so the model output has to be decoded using Wav2Vec2CTCTokenizer.
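As a rough illustration of the CTC tip above, the sketch below applies the greedy decoding rule in plain Python: collapse consecutive repeated ids, drop the blank token, and turn the word delimiter into a space. The toy vocabulary, the `ctc_greedy_decode` helper, and the use of `<pad>` as the blank are assumptions made for this example; it is not the Wav2Vec2CTCTokenizer implementation.

```python
from itertools import groupby


def ctc_greedy_decode(pred_ids, id_to_token, blank_token="<pad>", word_delimiter="|"):
    """Collapse repeats, remove blanks, and map the word delimiter to a space."""
    # 1. Merge runs of identical frame-level predictions into a single id.
    collapsed = [key for key, _ in groupby(pred_ids)]
    # 2. Look up tokens and drop the CTC blank.
    tokens = [id_to_token[i] for i in collapsed if id_to_token[i] != blank_token]
    # 3. Join characters, using a space wherever the word delimiter appears.
    return "".join(" " if t == word_delimiter else t for t in tokens).strip()


# Hypothetical character vocabulary, for illustration only.
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
# One predicted id per output frame, e.g. the argmax over the logits.
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1, 0, 2, 3, 3]

decoded = ctc_greedy_decode(frame_ids, vocab)
print(decoded)  # HELLO HE
```

The blank between the two runs of `L` is what keeps the double letter: without it, the repeated ids would be merged into one.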
This model was contributed by patrickvonplaten.
Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Wav2Vec2. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
- A notebook on how to leverage a pretrained Wav2Vec2 model for emotion classification. 🌎
- Wav2Vec2ForCTC is supported by this example script and notebook.
- Audio classification task guide
Automatic Speech Recognition
- A blog post on boosting Wav2Vec2 with n-grams in 🤗 Transformers.
- A blog post on how to finetune Wav2Vec2 for English ASR with 🤗 Transformers.
- A blog post on finetuning XLS-R for Multi-Lingual ASR with 🤗 Transformers.
- A notebook on how to create YouTube captions from any video by transcribing audio with Wav2Vec2. 🌎
- Wav2Vec2ForCTC is supported by a notebook on how to finetune a speech recognition model in English, and how to finetune a speech recognition model in any language.
- Automatic speech recognition task guide
🚀 Deploy
- A blog post on how to deploy Wav2Vec2 for Automatic Speech Recognition with Hugging Face’s Transformers & Amazon SageMaker.
Wav2Vec2Config
class transformers.Wav2Vec2Config
( vocab_size = 32, hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = 'gelu', hidden_dropout = 0.1, activation_dropout = 0.1, attention_dropout = 0.1, feat_proj_dropout = 0.0, feat_quantizer_dropout = 0.0, final_dropout = 0.1, layerdrop = 0.1, initializer_range = 0.02, layer_norm_eps = 1e-05, feat_extract_norm = 'group', feat_extract_activation = 'gelu', conv_dim = (512, 512, 512, 512, 512, 512, 512), conv_stride = (5, 2, 2, 2, 2, 2, 2), conv_kernel = (10, 3, 3, 3, 3, 2, 2), conv_bias = False, num_conv_pos_embeddings = 128, num_conv_pos_embedding_groups = 16, do_stable_layer_norm = False, apply_spec_augment = True, mask_time_prob = 0.05, mask_time_length = 10, mask_time_min_masks = 2, mask_feature_prob = 0.0, mask_feature_length = 10, mask_feature_min_masks = 0, num_codevectors_per_group = 320, num_codevector_groups = 2, contrastive_logits_temperature = 0.1, num_negatives = 100, codevector_dim = 256, proj_codevector_dim = 256, diversity_loss_weight = 0.1, ctc_loss_reduction = 'sum', ctc_zero_infinity = False, use_weighted_layer_sum = False, classifier_proj_size = 256, tdnn_dim = (512, 512, 512, 512, 1500), tdnn_kernel = (5, 3, 3, 1, 1), tdnn_dilation = (1, 2, 3, 1, 1), xvector_output_dim = 512, pad_token_id = 0, bos_token_id = 1, eos_token_id = 2, add_adapter = False, adapter_kernel_size = 3, adapter_stride = 2, num_adapter_layers = 3, output_hidden_size = None, adapter_attn_dim = None, **kwargs )
This is the configuration class to store the configuration of a Wav2Vec2Model. It is used to instantiate a Wav2Vec2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Wav2Vec2 facebook/wav2vec2-base-960h architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import Wav2Vec2Config, Wav2Vec2Model
>>>
>>> configuration = Wav2Vec2Config()
>>>
>>> model = Wav2Vec2Model(configuration)
>>>
>>> configuration = model.config
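The `conv_kernel` and `conv_stride` defaults in the configuration signature determine how many output frames the convolutional feature encoder produces for a given number of audio samples. The following is a minimal sketch of that arithmetic, assuming unpadded 1D convolutions; the helper name `num_output_frames` is hypothetical and not part of the library.

```python
import math

# Defaults copied from the Wav2Vec2Config signature.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Apply the unpadded 1D convolution length formula once per layer."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


# Overall downsampling factor: one output frame per 320 input samples.
samples_per_frame = math.prod(CONV_STRIDE)
print(samples_per_frame)  # 320

# One second of 16 kHz audio yields 49 frames.
one_second_frames = num_output_frames(16_000)
print(one_second_frames)  # 49
```

At a 16 kHz sampling rate, 320 samples per frame corresponds to a frame every 20 ms.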
Wav2Vec2CTCTokenizer
class transformers.Wav2Vec2CTCTokenizer
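( vocab_file, bos_token = '<s>', eos_token = '</s>', unk_token = '<unk>', pad_token = '<pad>', word_delimiter_token = '|', do_lower_case = False, target_lang = None, **kwargs )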
Parameters
- bos_token (str, optional, defaults to "<s>") — The beginning of sentence token.
- eos_token (str, optional, defaults to "</s>") — The end of sentence token.
- unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
- word_delimiter_token (str, optional, defaults to "|") — The token used for defining the end of a word.
- do_lower_case (bool, optional, defaults to False) — Whether or not to accept lowercase input and lowercase the output when decoding.
- target_lang (str, optional) — A target language the tokenizer should set by default. target_lang has to be defined for multi-lingual, nested vocabulary such as facebook/mms-1b-all.
- **kwargs — Additional keyword arguments passed along to PreTrainedTokenizer.
Constructs a Wav2Vec2CTC tokenizer.
This tokenizer inherits from PreTrainedTokenizer which contains some of the main methods. Users should refer to the superclass for more information regarding such methods.
__call__
( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None, text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None, text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None, text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None, add_special_tokens: bool = True, padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None, max_length: typing.Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: typing.Optional[int] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, return_token_type_ids: typing.Optional[bool] = None, return_attention_mask: typing.Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs ) → BatchEncoding
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.
save_vocabulary
( save_directory: str, filename_prefix: typing.Optional[str] = None )
decode
( token_ids: typing.Union[int, typing.List[int], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = None, output_char_offsets: bool = False, output_word_offsets: bool = False, **kwargs ) → str or Wav2Vec2CTCTokenizerOutput
Converts a sequence of ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).
Example:
>>>
>>> from transformers import AutoTokenizer, AutoFeatureExtractor, AutoModelForCTC
>>> from datasets import load_dataset
>>> import datasets
>>> import torch
>>>
>>> model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/wav2vec2-base-960h")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
>>>
>>> dataset = load_dataset("common_voice", "en", split="train", streaming=True)
>>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
>>> dataset_iter = iter(dataset)
>>> sample = next(dataset_iter)
>>>
>>> input_values = feature_extractor(sample["audio"]["array"], return_tensors="pt").input_values
>>> logits = model(input_values).logits[0]
>>> pred_ids = torch.argmax(logits, axis=-1)
>>>
>>> outputs = tokenizer.decode(pred_ids, output_word_offsets=True)
>>>
>>> time_offset = model.config.inputs_to_logits_ratio / feature_extractor.sampling_rate
>>> word_offsets = [
...     {
...         "word": d["word"],
...         "start_time": round(d["start_offset"] * time_offset, 2),
...         "end_time": round(d["end_offset"] * time_offset, 2),
...     }
...     for d in outputs.word_offsets
... ]
>>>
>>>
>>> word_offsets[:3]
[{'word': 'WHY', 'start_time': 1.42, 'end_time': 1.54}, {'word': 'DOES', 'start_time': 1.64, 'end_time': 1.9}, {'word': 'MILISANDRA', 'start_time': 2.26, 'end_time': 2.9}]
batch_decode
( sequences: typing.Union[typing.List[int], typing.List[typing.List[int]], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = None, output_char_offsets: bool = False, output_word_offsets: bool = False, **kwargs ) → List[str] or Wav2Vec2CTCTokenizerOutput
Convert a list of lists of token ids into a list of strings by calling decode.
set_target_lang
Set the target language of a nested multi-lingual dictionary.
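To make the idea of a nested multi-lingual vocabulary more concrete, here is a small sketch in which each language code maps to its own token-to-id dictionary, and setting the target language selects which one is active. The `NestedVocab` class, its attributes, and the toy vocabularies are hypothetical and only mimic the behaviour described above; they are not the tokenizer's actual implementation.

```python
class NestedVocab:
    """Toy stand-in for a tokenizer holding one sub-vocabulary per language."""

    def __init__(self, nested_vocab, target_lang):
        self.nested_vocab = nested_vocab
        self.set_target_lang(target_lang)

    def set_target_lang(self, target_lang):
        """Activate the sub-vocabulary of the requested language."""
        if target_lang not in self.nested_vocab:
            raise ValueError(f"{target_lang!r} not in {sorted(self.nested_vocab)}")
        self.target_lang = target_lang
        self.encoder = self.nested_vocab[target_lang]
        self.decoder = {idx: tok for tok, idx in self.encoder.items()}

    def convert_tokens_to_ids(self, tokens):
        return [self.encoder[t] for t in tokens]


# Hypothetical two-language vocabulary, for illustration only.
vocab = NestedVocab(
    {
        "eng": {"<pad>": 0, "|": 1, "a": 2, "b": 3},
        "spa": {"<pad>": 0, "|": 1, "a": 2, "ñ": 3},
    },
    target_lang="eng",
)

vocab.set_target_lang("spa")
spanish_ids = vocab.convert_tokens_to_ids(["ñ", "a"])
print(spanish_ids)  # [3, 2]
```

The same id can stand for different characters depending on the active language, which is why the target language has to be set before encoding or decoding.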
Wav2Vec2FeatureExtractor
( feature_size = 1, sampling_rate = 16000, padding_value = 0.0, return_attention_mask = False, do_normalize = True, **kwargs )
Parameters
- feature_size (int, defaults to 1) — The feature dimension of the extracted features.
- sampling_rate (int, defaults to 16000) — The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).
- padding_value (float, defaults to 0.0) — The value that is used to fill the padding values.
- do_normalize (bool, optional, defaults to True) — Whether or not to zero-mean unit-variance normalize the input. Normalizing can help to significantly improve the performance for some models, e.g., wav2vec2-lv60.
- return_attention_mask (bool, optional, defaults to False) — Whether or not __call__() should return attention_mask.
  Wav2Vec2 models that have set config.feat_extract_norm == "group", such as wav2vec2-base, have not been trained using attention_mask. For such models, input_values should simply be padded with 0 and no attention_mask should be passed.
  For Wav2Vec2 models that have set config.feat_extract_norm == "layer", such as wav2vec2-lv60, attention_mask should be passed for batched inference.
Constructs a Wav2Vec2 feature extractor.
This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
( raw_speech: typing.Union[numpy.ndarray, typing.List[float], typing.List[numpy.ndarray], typing.List[typing.List[float]]], padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, max_length: typing.Optional[int] = None, truncation: bool = False, pad_to_multiple_of: typing.Optional[int] = None, return_attention_mask: typing.Optional[bool] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, sampling_rate: typing.Optional[int] = None, **kwargs )
Main method to featurize and prepare for the model one or several sequence(s).
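The two main steps described by the parameters above, zero-mean unit-variance normalization and padding with an optional attention mask, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names and the small `eps` stabilizer are chosen for this example, and the order of operations may differ from the actual feature extractor.

```python
import math


def zero_mean_unit_var(waveform, eps=1e-7):
    """Normalize one waveform to zero mean and (approximately) unit variance."""
    mean = sum(waveform) / len(waveform)
    var = sum((x - mean) ** 2 for x in waveform) / len(waveform)
    return [(x - mean) / math.sqrt(var + eps) for x in waveform]


def pad_batch(waveforms, padding_value=0.0):
    """Right-pad waveforms to a common length and build a 1/0 attention mask."""
    max_len = max(len(w) for w in waveforms)
    input_values = [list(w) + [padding_value] * (max_len - len(w)) for w in waveforms]
    attention_mask = [[1] * len(w) + [0] * (max_len - len(w)) for w in waveforms]
    return input_values, attention_mask


# Two toy "recordings" of different lengths.
batch = [[0.1, 0.2, 0.3, 0.4], [1.0, -1.0]]
normalized = [zero_mean_unit_var(w) for w in batch]
input_values, attention_mask = pad_batch(normalized)
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

For models with config.feat_extract_norm == "group", only the zero-padded `input_values` would be passed on; for "layer" models, the `attention_mask` would be passed as well.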
Wav2Vec2Processor
Constructs a Wav2Vec2 processor which wraps a Wav2Vec2 feature extractor and a Wav2Vec2 CTC tokenizer into a single processor.
Wav2Vec2Processor offers all the functionalities of Wav2Vec2FeatureExtractor and PreTrainedTokenizer. See the docstring of __call__() and decode() for more information.
When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor’s __call__() and returns its output. If used in the context as_target_processor(), this method forwards all its arguments to PreTrainedTokenizer’s __call__(). Please refer to the docstring of the above two methods for more information.
When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor’s pad() and returns its output. If used in the context as_target_processor(), this method forwards all its arguments to PreTrainedTokenizer’s pad(). Please refer to the docstring of the above two methods for more information.
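The forwarding behaviour described above (feature extractor in normal mode, tokenizer inside the as_target_processor() context) can be pictured with a small delegation sketch. `ToyProcessor` and the two fake callables are hypothetical stand-ins written for this example; the real classes do considerably more.

```python
from contextlib import contextmanager


class ToyProcessor:
    """Routes calls to the feature extractor by default, or to the tokenizer
    while inside the target context."""

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.current_processor = feature_extractor

    def __call__(self, *args, **kwargs):
        # Forward everything to whichever component is currently active.
        return self.current_processor(*args, **kwargs)

    @contextmanager
    def as_target_processor(self):
        # Temporarily switch to the tokenizer, then always switch back.
        self.current_processor = self.tokenizer
        try:
            yield
        finally:
            self.current_processor = self.feature_extractor


def fake_feature_extractor(audio):
    return {"input_values": list(audio)}


def fake_tokenizer(text):
    return {"input_ids": [ord(c) for c in text]}


processor = ToyProcessor(fake_feature_extractor, fake_tokenizer)
inputs = processor([0.0, 0.1])          # handled by the feature extractor
with processor.as_target_processor():
    labels = processor("AB")            # handled by the tokenizer
print(inputs, labels)
```

Using a context manager keeps the switch local: once the block exits, calls go to the feature extractor again, even if an exception was raised inside it.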
from_pretrained
( pretrained_model_name_or_path, **kwargs )
save_pretrained
( save_directory, push_to_hub: bool = False, **kwargs )
Parameters
- save_directory (str or os.PathLike) — Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will be created if it does not exist).
- push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
- kwargs (Dict[str, Any], optional) — Additional keyword arguments passed along to the push_to_hub() method.
Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.
This method simply calls the feature extractor’s save_pretrained() and the tokenizer’s save_pretrained(). Please refer to the docstrings of those methods for more information.
This method forwards all its arguments to PreTrainedTokenizer’s batch_decode(). Please refer to the docstring of this method for more information.
This method forwards all its arguments to PreTrainedTokenizer’s decode(). Please refer to the docstring of this method for more information.
Wav2Vec2ProcessorWithLM
class transformers.Wav2Vec2ProcessorWithLM
( feature_extractor: FeatureExtractionMixin, tokenizer: PreTrainedTokenizerBase, decoder: BeamSearchDecoderCTC )
Parameters
- feature_extractor (Wav2Vec2FeatureExtractor) — An instance of Wav2Vec2FeatureExtractor. The feature extractor is a required input.
- tokenizer (Wav2Vec2CTCTokenizer) — An instance of Wav2Vec2CTCTokenizer. The tokenizer is a required input.
- decoder (pyctcdecode.BeamSearchDecoderCTC) — An instance of pyctcdecode.BeamSearchDecoderCTC. The decoder is a required input.
Constructs a Wav2Vec2 processor which wraps a Wav2Vec2 feature extractor, a Wav2Vec2 CTC tokenizer and a decoder with language model support into a single processor for language model boosted speech recognition decoding.
When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor’s __call__() and returns its output. If used in the context as_target_processor(), this method forwards all its arguments to Wav2Vec2CTCTokenizer’s __call__(). Please refer to the docstring of the above two methods for more information.
When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor’s pad() and returns its output. If used in the context as_target_processor(), this method forwards all its arguments to Wav2Vec2CTCTokenizer’s pad(). Please refer to the docstring of the above two methods for more information.
from_pretrained
( pretrained_model_name_or_path, **kwargs )
Parameters
- pretrained_model_name_or_path (str or os.PathLike) — This can be either:
  - a string, the model id of a pretrained feature_extractor hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.
  - a path to a directory containing a feature extractor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
  - a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json.
- **kwargs — Additional keyword arguments passed along to both SequenceFeatureExtractor and PreTrainedTokenizer.
Instantiate a Wav2Vec2ProcessorWithLM from a pretrained Wav2Vec2 processor.
This class method is simply calling Wav2Vec2FeatureExtractor’s from_pretrained(), Wav2Vec2CTCTokenizer’s from_pretrained(), and pyctcdecode.BeamSearchDecoderCTC.load_from_hf_hub.
Please refer to the docstrings of the methods above for more information.
batch_decode
( logits: ndarray, pool: typing.Optional[multiprocessing.pool.Pool] = None, num_processes: typing.Optional[int] = None, beam_width: typing.Optional[int] = None, beam_prune_logp: typing.Optional[float] = None, token_min_logp: typing.Optional[float] = None, hotwords: typing.Optional[typing.Iterable[str]] = None, hotword_weight: typing.Optional[float] = None, alpha: typing.Optional[float] = None, beta: typing.Optional[float] = None, unk_score_offset: typing.Optional[float] = None, lm_score_boundary: typing.Optional[bool] = None, output_word_offsets: bool = False, n_best: int = 1 )
Batch decode output logits to audio transcription with language model support.
This function makes use of Python’s multiprocessing. Currently, multiprocessing is available only on Unix systems (see this issue).
If you are decoding multiple batches, consider creating a Pool and passing it to batch_decode. Otherwise, batch_decode will be very slow since it will create a fresh Pool for each call. See usage example below.
Example: See Decoding multiple audios.
decode
( logits: ndarray, beam_width: typing.Optional[int] = None, beam_prune_logp: typing.Optional[float] = None, token_min_logp: typing.Optional[float] = None, hotwords: typing.Optional[typing.Iterable[str]] = None, hotword_weight: typing.Optional[float] = None, alpha: typing.Optional[float] = None, beta: typing.Optional[float] = None, unk_score_offset: typing.Optional[float] = None, lm_score_boundary: typing.Optional[bool] = None, output_word_offsets: bool = False, n_best: int = 1 )
Decode output logits to audio transcription with language model support.
Example:
>>>
>>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC
>>> from datasets import load_dataset
>>> import datasets
>>> import torch
>>>
>>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
>>> processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
>>>
>>> dataset = load_dataset("common_voice", "en", split="train", streaming=True)
>>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
>>> dataset_iter = iter(dataset)
>>> sample = next(dataset_iter)
>>>
>>> input_values = processor(sample["audio"]["array"], return_tensors="pt").input_values
>>> with torch.no_grad():
...     logits = model(input_values).logits[0].cpu().numpy()
>>>
>>> outputs = processor.decode(logits, output_word_offsets=True)
>>>
>>> time_offset = model.config.inputs_to_logits_ratio / processor.feature_extractor.sampling_rate
>>> word_offsets = [
...     {
...         "word": d["word"],
...         "start_time": round(d["start_offset"] * time_offset, 2),
...         "end_time": round(d["end_offset"] * time_offset, 2),
...     }
...     for d in outputs.word_offsets
... ]
>>>
>>>
>>> word_offsets[:4]
[{'word': 'WHY', 'start_time': 1.42, 'end_time': 1.54}, {'word': 'DOES', 'start_time': 1.66, 'end_time': 1.9}, {'word': 'MILISANDRA', 'start_time': 2.26, 'end_time': 2.9}, {'word': 'LOOK', 'start_time': 3.0, 'end_time': 3.16}]
Decoding multiple audios
If you are planning to decode multiple batches of audios, you should consider using batch_decode() and passing an instantiated multiprocessing.Pool. Otherwise, batch_decode() performance will be slower than calling decode() for each audio individually, as it internally instantiates a new Pool for every call. See the example below:
>>>
>>> from multiprocessing import get_context
>>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC
>>> from datasets import load_dataset
>>> import datasets
>>> import torch
>>>
>>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm").to("cuda")
>>> processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
>>>
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
>>> def map_to_array(batch):
...     batch["speech"] = batch["audio"]["array"]
...     return batch
>>>
>>> dataset = dataset.map(map_to_array, remove_columns=["audio"])
>>> def map_to_pred(batch, pool):
...     inputs = processor(batch["speech"], sampling_rate=16_000, padding=True, return_tensors="pt")
...     inputs = {k: v.to("cuda") for k, v in inputs.items()}
...     with torch.no_grad():
...         logits = model(**inputs).logits
...     transcription = processor.batch_decode(logits.cpu().numpy(), pool).text
...     batch["transcription"] = transcription
...     return batch
>>>
>>> # Create one pool and reuse it for every batch, so batch_decode() does not create a new one per call.
>>> with get_context("fork").Pool(processes=2) as pool:
...     result = dataset.map(
...         map_to_pred, batched=True, batch_size=2, fn_kwargs={"pool": pool}, remove_columns=["speech"]
...     )
>>> result["transcription"][:2]
['MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL', "NOR IS MISTER COULTER'S MANNER LESS INTERESTING THAN HIS MATTER"]
Wav2Vec2 specific outputs
class transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2DecoderWithLMOutput
( text: typing.Union[typing.List[typing.List[str]], typing.List[str], str], logit_score: typing.Union[typing.List[typing.List[float]], typing.List[float], float] = None, lm_score: typing.Union[typing.List[typing.List[float]], typing.List[float], float] = None, word_offsets: typing.Union[typing.List[typing.List[typing.List[typing.Dict[str, typing.Union[int, str]]]]], typing.List[typing.List[typing.Dict[str, typing.Union[int, str]]]], typing.List[typing.Dict[str, typing.Union[int, str]]]] = None )
Parameters
- text (list of str or str) — Decoded logits in text form. Usually the speech transcription.
- logit_score (list of float or float) — Total logit score of the beams associated with produced text.
- lm_score (list of float) — Fused lm_score of the beams associated with produced text.
- word_offsets (list of List[Dict[str, Union[int, str]]] or List[Dict[str, Union[int, str]]]) — Offsets of the decoded words. In combination with the sampling rate and the model downsampling rate, word offsets can be used to compute time stamps for each word.
Output type of Wav2Vec2DecoderWithLM, with transcription.
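The conversion from word offsets to time stamps mentioned for word_offsets comes down to one multiplication per offset. The sketch below works through it with the default conv_stride from Wav2Vec2Config and the default sampling_rate from Wav2Vec2FeatureExtractor; the helper name `offsets_to_seconds` and the frame offsets in the example are hypothetical and chosen for illustration only.

```python
import math

CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)  # default conv_stride from Wav2Vec2Config
SAMPLING_RATE = 16_000               # default sampling_rate of the feature extractor


def offsets_to_seconds(word_offsets, conv_stride=CONV_STRIDE, sampling_rate=SAMPLING_RATE):
    """Convert frame-level word offsets into start/end times in seconds."""
    # Model downsampling rate (320 samples per frame) divided by the sampling rate.
    seconds_per_frame = math.prod(conv_stride) / sampling_rate  # 0.02
    return [
        {
            "word": d["word"],
            "start_time": round(d["start_offset"] * seconds_per_frame, 2),
            "end_time": round(d["end_offset"] * seconds_per_frame, 2),
        }
        for d in word_offsets
    ]


# Hypothetical frame offsets, not taken from a real model run.
example_offsets = [{"word": "WHY", "start_offset": 71, "end_offset": 77}]
timestamps = offsets_to_seconds(example_offsets)
print(timestamps)  # [{'word': 'WHY', 'start_time': 1.42, 'end_time': 1.54}]
```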
class transformers.modeling_outputs.Wav2Vec2BaseModelOutput
( last_hidden_state: FloatTensor = None, extract_features: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None )
Parameters
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- extract_features (torch.FloatTensor of shape (batch_size, sequence_length, conv_dim[-1])) — Sequence of extracted feature vectors of the last convolutional layer of the model.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Base class for models that have been trained with the Wav2Vec2 loss objective.
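To make the shapes listed above easier to picture, the following sketch computes what each field would look like for the default configuration values (hidden_size = 768, conv_dim[-1] = 512, 12 layers, 12 attention heads). The helper `expected_output_shapes` is hypothetical and only does shape bookkeeping; it does not run a model.

```python
def expected_output_shapes(
    batch_size,
    num_frames,
    hidden_size=768,
    conv_dim_last=512,
    num_hidden_layers=12,
    num_attention_heads=12,
):
    """Return the documented shape of every output field as plain tuples."""
    hidden = (batch_size, num_frames, hidden_size)
    attention = (batch_size, num_attention_heads, num_frames, num_frames)
    return {
        "last_hidden_state": hidden,
        "extract_features": (batch_size, num_frames, conv_dim_last),
        # One entry for the embedding output plus one per transformer layer.
        "hidden_states": (hidden,) * (num_hidden_layers + 1),
        # One entry per transformer layer.
        "attentions": (attention,) * num_hidden_layers,
    }


shapes = expected_output_shapes(batch_size=1, num_frames=292)
print(shapes["last_hidden_state"])   # (1, 292, 768)
print(shapes["extract_features"])    # (1, 292, 512)
print(len(shapes["hidden_states"]))  # 13
print(shapes["attentions"][0])       # (1, 12, 292, 292)
```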
class transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTrainingOutput
( loss: typing.Optional[torch.FloatTensor] = None, projected_states: FloatTensor = None, projected_quantized_states: FloatTensor = None, codevector_perplexity: FloatTensor = None, hidden_states: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attentions: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, contrastive_loss: typing.Optional[torch.FloatTensor] = None, diversity_loss: typing.Optional[torch.FloatTensor] = None )
Output type of Wav2Vec2ForPreTraining, with potential hidden states and attentions.
class transformers.models.wav2vec2.modeling_flax_wav2vec2.FlaxWav2Vec2BaseModelOutput
( last_hidden_state: Array = None, extract_features: Array = None, hidden_states: typing.Optional[typing.Tuple[jax.Array]] = None, attentions: typing.Optional[typing.Tuple[jax.Array]] = None )
Parameters
- last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- extract_features (jnp.ndarray of shape (batch_size, sequence_length, last_conv_dim)) — Sequence of extracted feature vectors of the last convolutional layer of the model with last_conv_dim being the dimension of the last convolutional layer.
- hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the model at the output of each layer plus the initial embedding outputs.
- attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Output type of FlaxWav2Vec2BaseModelOutput, with potential hidden states and attentions.
Returns a new object replacing the specified fields with new values.
class transformers.models.wav2vec2.modeling_flax_wav2vec2.FlaxWav2Vec2ForPreTrainingOutput
( projected_states: Array = None, projected_quantized_states: Array = None, codevector_perplexity: Array = None, hidden_states: typing.Optional[typing.Tuple[jax.Array]] = None, attentions: typing.Optional[typing.Tuple[jax.Array]] = None )
Output type of FlaxWav2Vec2ForPreTrainingOutput, with potential hidden states and attentions.
Returns a new object replacing the specified fields with new values.
Wav2Vec2Model
class transformers.Wav2Vec2Model
( config: Wav2Vec2Config )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Wav2Vec2 Model transformer outputting raw hidden-states without any specific head on top. Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, mask_time_indices: typing.Optional[torch.FloatTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.Wav2Vec2BaseModelOutput or tuple(torch.FloatTensor)
The Wav2Vec2Model forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, Wav2Vec2Model
>>> import torch
>>> from datasets import load_dataset
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
>>>
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 292, 768]
Wav2Vec2ForCTC
class transformers.Wav2Vec2ForCTC
( config, target_lang: typing.Optional[str] = None )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- target_lang (str, optional) — Language id of adapter weights. Adapter weights are stored in the format adapter.<lang>.safetensors or adapter.<lang>.bin. Only relevant when using an instance of Wav2Vec2ForCTC with adapters. Uses 'eng' by default.
Wav2Vec2 Model with a language modeling head on top for Connectionist Temporal Classification (CTC). Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, labels: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)
The Wav2Vec2ForCTC forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, Wav2Vec2ForCTC
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>>
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)
>>>
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription[0]
'MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL'
>>> inputs["labels"] = processor(text=dataset[0]["text"], return_tensors="pt").input_ids
>>>
>>> loss = model(**inputs).loss
>>> round(loss.item(), 2)
53.48
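The `batch_decode` call in the example above turns per-frame argmax ids into text. As a rough sketch of what greedy CTC decoding does conceptually (collapse consecutive repeats, then drop the blank token), assuming a hypothetical toy vocabulary in which id 0 plays the role of the blank and `|` separates words. This illustrates the idea only and is not the tokenizer's actual implementation:

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration only; id 0 acts as the CTC blank.
VOCAB = {0: "<pad>", 1: "H", 2: "E", 3: "L", 4: "O", 5: "|"}
BLANK_ID = 0


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID, word_delimiter="|"):
    """Collapse repeated ids, remove blanks, and map the remaining ids to characters."""
    # Step 1: merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank token, which only separates repeated characters.
    kept = [token_id for token_id in collapsed if token_id != blank_id]
    # Step 3: look up characters and turn the word delimiter into spaces.
    text = "".join(vocab[token_id] for token_id in kept)
    return text.replace(word_delimiter, " ").strip()


# The blank between the two runs of 3 keeps the double "L" from being merged.
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0, 5, 1, 0]
decoded = greedy_ctc_decode(frame_ids)
print(decoded)
```

The blank token is what allows genuinely doubled letters to survive the collapse step, which is why the argmax ids cannot simply be mapped to characters one by one.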
load_adapter
( target_lang: str, force_load = True, **kwargs )
Load a language adapter model from a pre-trained adapter model.
Activate the special “offline-mode” to use this method in a firewalled environment.
Examples:
>>> from transformers import Wav2Vec2ForCTC, AutoProcessor
>>> ckpt = "facebook/mms-1b-all"
>>> processor = AutoProcessor.from_pretrained(ckpt)
>>> model = Wav2Vec2ForCTC.from_pretrained(ckpt, target_lang="eng")
>>>
>>> processor.tokenizer.set_target_lang("spa")
>>> model.load_adapter("spa")
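A small sketch of how a language id could map to adapter file names, assuming the `adapter.<lang>.safetensors` / `adapter.<lang>.bin` naming scheme described for `target_lang`. The helper `adapter_filenames` is hypothetical and only illustrates the naming convention; it does not download or load anything:

```python
def adapter_filenames(target_lang: str) -> dict:
    """Return candidate adapter file names for a language id (hypothetical helper)."""
    if not target_lang:
        raise ValueError("target_lang must be a non-empty language id such as 'eng' or 'spa'")
    return {
        "safetensors": f"adapter.{target_lang}.safetensors",
        "bin": f"adapter.{target_lang}.bin",
    }


names = adapter_filenames("spa")
print(names["safetensors"])
print(names["bin"])
```

In a firewalled environment, files with these names would need to be available locally before the adapter can be loaded.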
Wav2Vec2ForSequenceClassification
class transformers.Wav2Vec2ForSequenceClassification
( config )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Wav2Vec2 Model with a sequence classification head on top (a linear layer over the pooled output) for tasks like SUPERB Keyword Spotting.
Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, labels: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
The Wav2Vec2ForSequenceClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks")
>>> model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-base-superb-ks")
>>>
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_class_ids = torch.argmax(logits, dim=-1).item()
>>> predicted_label = model.config.id2label[predicted_class_ids]
>>> predicted_label
'_unknown_'
>>>
>>> target_label = model.config.id2label[0]
>>> inputs["labels"] = torch.tensor([model.config.label2id[target_label]])
>>> loss = model(**inputs).loss
>>> round(loss.item(), 2)
6.54
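The class description mentions "a linear layer over the pooled output". A simplified, dependency-free sketch of that idea, assuming mean pooling over the time axis followed by a single linear layer. All values and the label names are toy assumptions, and the real head may include an extra projection and padding-aware pooling that are omitted here:

```python
def mean_pool(hidden_states):
    """Average a (time, hidden) list of frame vectors over the time axis."""
    num_frames = len(hidden_states)
    hidden_size = len(hidden_states[0])
    return [sum(frame[i] for frame in hidden_states) / num_frames for i in range(hidden_size)]


def linear(vector, weight, bias):
    """Apply y = W x + b, with `weight` stored as (num_labels, hidden)."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]


# Toy values chosen only for illustration.
hidden_states = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]  # 3 frames, hidden size 2
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 labels
bias = [0.0, 0.5, 0.0]
id2label = {0: "yes", 1: "no", 2: "_unknown_"}

pooled = mean_pool(hidden_states)  # one vector for the whole utterance
logits = linear(pooled, weight, bias)  # one score per label
predicted_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
predicted_label = id2label[predicted_id]
print(pooled, logits, predicted_label)
```

Pooling reduces a variable number of frames to one fixed-size vector, so a single prediction is produced per utterance regardless of its duration.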
Wav2Vec2ForAudioFrameClassification
class transformers.Wav2Vec2ForAudioFrameClassification
( config )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Wav2Vec2 Model with a frame classification head on top for tasks like Speaker Diarization.
Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
The Wav2Vec2ForAudioFrameClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForAudioFrameClassification
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("anton-l/wav2vec2-base-superb-sd")
>>> model = Wav2Vec2ForAudioFrameClassification.from_pretrained("anton-l/wav2vec2-base-superb-sd")
>>>
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], return_tensors="pt", sampling_rate=sampling_rate)
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> probabilities = torch.sigmoid(logits[0])
>>>
>>> labels = (probabilities > 0.5).long()
>>> labels[0].tolist()
[0, 0]
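The example above applies a sigmoid to the frame logits and thresholds the result at 0.5. A minimal pure-Python sketch of the same step, assuming logits shaped (frames, speakers) where each entry is an independent "speaker active" score. The numbers are toy values, not model output:

```python
import math


def sigmoid(x):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def frame_labels(frame_logits, threshold=0.5):
    """Turn (frames, speakers) logits into 0/1 activity labels per frame and speaker."""
    return [[int(sigmoid(value) > threshold) for value in frame] for frame in frame_logits]


# Toy logits for 3 frames and 2 speakers (illustrative values only).
frame_logits = [[-2.0, -1.0], [1.5, -0.3], [0.7, 2.2]]
labels = frame_labels(frame_logits)
print(labels)
```

Because each speaker gets an independent sigmoid rather than a shared softmax, a frame can be labeled with no active speaker, one speaker, or several overlapping speakers.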
Wav2Vec2ForXVector
class transformers.Wav2Vec2ForXVector
( config )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Wav2Vec2 Model with an XVector feature extraction head on top for tasks like Speaker Verification.
Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, labels: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.XVectorOutput or tuple(torch.FloatTensor)
The Wav2Vec2ForXVector forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForXVector
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("anton-l/wav2vec2-base-superb-sv")
>>> model = Wav2Vec2ForXVector.from_pretrained("anton-l/wav2vec2-base-superb-sv")
>>>
>>> inputs = feature_extractor(
...     [d["array"] for d in dataset[:2]["audio"]], sampling_rate=sampling_rate, return_tensors="pt", padding=True
... )
>>> with torch.no_grad():
...     embeddings = model(**inputs).embeddings
>>> embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()
>>>
>>> cosine_sim = torch.nn.CosineSimilarity(dim=-1)
>>> similarity = cosine_sim(embeddings[0], embeddings[1])
>>> threshold = 0.7
>>> if similarity < threshold:
...     print("Speakers are not the same!")
>>> round(similarity.item(), 2)
0.98
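The verification step in the example above compares two speaker embeddings by cosine similarity against a threshold. A dependency-free sketch of that comparison, assuming plain Python lists as embeddings. The helper `same_speaker` is hypothetical, and 0.7 is simply the value used in the example, not a recommended setting:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(embedding_a, embedding_b, threshold=0.7):
    """Hypothetical decision rule: accept when similarity reaches the threshold."""
    return cosine_similarity(embedding_a, embedding_b) >= threshold


# Toy 2-dimensional embeddings (illustrative values only).
speaker_a_take_1 = [3.0, 4.0]
speaker_a_take_2 = [6.0, 8.0]  # same direction, different magnitude
speaker_b = [-4.0, 3.0]  # orthogonal direction

print(same_speaker(speaker_a_take_1, speaker_a_take_2))
print(same_speaker(speaker_a_take_1, speaker_b))
```

Cosine similarity ignores vector magnitude, which is why the two takes of different scale still compare as identical in direction.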
Wav2Vec2ForPreTraining
class transformers.Wav2Vec2ForPreTraining
( config: Wav2Vec2Config )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Wav2Vec2 Model with a quantizer and VQ head on top. Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving).
This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_values: typing.Optional[torch.Tensor], attention_mask: typing.Optional[torch.Tensor] = None, mask_time_indices: typing.Optional[torch.BoolTensor] = None, sampled_negative_indices: typing.Optional[torch.BoolTensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForPreTrainingOutput or tuple(torch.FloatTensor)
The Wav2Vec2ForPreTraining forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> import torch
>>> from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
>>> from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices, _sample_negative_indices
>>> from datasets import load_dataset
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
>>> model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> input_values = feature_extractor(ds[0]["audio"]["array"], return_tensors="pt").input_values
>>>
>>> batch_size, raw_sequence_length = input_values.shape
>>> sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length).item()
>>> mask_time_indices = _compute_mask_indices(
...     shape=(batch_size, sequence_length), mask_prob=0.2, mask_length=2
... )
>>> sampled_negative_indices = _sample_negative_indices(
...     features_shape=(batch_size, sequence_length),
...     num_negatives=model.config.num_negatives,
...     mask_time_indices=mask_time_indices,
... )
>>> mask_time_indices = torch.tensor(data=mask_time_indices, device=input_values.device, dtype=torch.long)
>>> sampled_negative_indices = torch.tensor(
...     data=sampled_negative_indices, device=input_values.device, dtype=torch.long
... )
>>> with torch.no_grad():
...     outputs = model(input_values, mask_time_indices=mask_time_indices)
>>>
>>> cosine_sim = torch.cosine_similarity(outputs.projected_states, outputs.projected_quantized_states, dim=-1)
>>>
>>> cosine_sim[mask_time_indices.to(torch.bool)].mean() > 0.5
tensor(True)
>>>
>>> model = model.train()
>>> loss = model(
...     input_values, mask_time_indices=mask_time_indices, sampled_negative_indices=sampled_negative_indices
... ).loss
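The example above relies on `model._get_feat_extract_output_lengths` to convert a raw sample count into the number of feature frames. A minimal sketch of that arithmetic, assuming the standard output-length formula for an unpadded 1D convolution and assuming kernel sizes (10, 3, 3, 3, 3, 2, 2) with strides (5, 2, 2, 2, 2, 2, 2). Check `model.config.conv_kernel` and `model.config.conv_stride` for the values of a given checkpoint:

```python
# Assumed convolutional feature encoder layout; verify against the checkpoint's config.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def feat_extract_output_length(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Number of frames left after a stack of unpadded 1D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        # Output length of a 1D convolution without padding or dilation.
        length = (length - kernel) // stride + 1
    return length


one_second = feat_extract_output_length(16_000)  # one second of 16 kHz audio
print(one_second)
```

Under these assumptions the total stride is 320 samples, so one second of 16 kHz audio yields roughly 50 frames, and masks must be shaped to that frame count rather than to the raw waveform length.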
TFWav2Vec2Model
class transformers.TFWav2Vec2Model
( *args, **kwargs )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare TFWav2Vec2 Model transformer outputting raw hidden-states without any specific head on top.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers accept two formats as input:
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should “just work” for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
- a single Tensor with input_values only and nothing else: model(input_values)
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_values, attention_mask]) or model([input_values, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_values": input_values, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
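To make the three "first positional argument" formats concrete, here is a small sketch of how such an argument could be normalized into named inputs. The function `gather_inputs` and the `INPUT_NAMES` order are hypothetical and only mirror the names used in the text above; this is not the library's input-processing code, and strings stand in for tensors:

```python
# Hypothetical argument order, mirroring the names used in the text above.
INPUT_NAMES = ("input_values", "attention_mask", "token_type_ids")


def gather_inputs(first_arg, input_names=INPUT_NAMES):
    """Map the first positional argument to a name -> value dict (illustrative only)."""
    if isinstance(first_arg, dict):
        # Dictionary format: keys must be known input names.
        unknown = set(first_arg) - set(input_names)
        if unknown:
            raise ValueError(f"Unexpected input names: {sorted(unknown)}")
        return dict(first_arg)
    if isinstance(first_arg, (list, tuple)):
        # List format: values are matched to names strictly by position.
        if len(first_arg) > len(input_names):
            raise ValueError("More inputs than known input names")
        return dict(zip(input_names, first_arg))
    # Anything else is treated as the single `input_values` tensor.
    return {input_names[0]: first_arg}


single = gather_inputs("waveform")
as_list = gather_inputs(["waveform", "mask"])
as_dict = gather_inputs({"input_values": "waveform", "token_type_ids": "types"})
print(single, as_list, as_dict)
```

The list format depends entirely on ordering, which is why the text stresses passing tensors in the order given in the docstring; the dictionary format avoids that pitfall.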
call
( input_values: tf.Tensor, attention_mask: tf.Tensor | None = None, token_type_ids: tf.Tensor | None = None, position_ids: tf.Tensor | None = None, head_mask: tf.Tensor | None = None, inputs_embeds: tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: bool = False ) → transformers.modeling_tf_outputs.TFBaseModelOutput or tuple(tf.Tensor)
The TFWav2Vec2Model forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, TFWav2Vec2Model
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = TFWav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> input_values = processor(ds["speech"][0], return_tensors="tf").input_values
>>> hidden_states = model(input_values).last_hidden_state
TFWav2Vec2ForSequenceClassification
class transformers.TFWav2Vec2ForSequenceClassification
( *args, **kwargs )
call
( input_values: tf.Tensor, attention_mask: tf.Tensor | None = None, output_attentions: bool | None = None, output_hidden_states: bool | None = None, return_dict: bool | None = None, labels: tf.Tensor | None = None, training: bool = False )
TFWav2Vec2ForCTC
class transformers.TFWav2Vec2ForCTC
( *args, **kwargs )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
TFWav2Vec2 Model with a language modeling head on top for Connectionist Temporal Classification (CTC).
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a tf.keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.
TensorFlow models and layers in transformers accept two formats as input:
- having all inputs as keyword arguments (like PyTorch models), or
- having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should “just work” for you - just pass your inputs and labels in any format that model.fit() supports! If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities you can use to gather all the input Tensors in the first positional argument:
- a single Tensor with input_values only and nothing else: model(input_values)
- a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: model([input_values, attention_mask]) or model([input_values, attention_mask, token_type_ids])
- a dictionary with one or several input Tensors associated to the input names given in the docstring: model({"input_values": input_values, "token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to worry about any of this, as you can just pass inputs like you would to any other Python function!
call
( input_values: tf.Tensor, attention_mask: tf.Tensor | None = None, token_type_ids: tf.Tensor | None = None, position_ids: tf.Tensor | None = None, head_mask: tf.Tensor | None = None, inputs_embeds: tf.Tensor | None = None, output_attentions: Optional[bool] = None, labels: tf.Tensor | None = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, training: Optional[bool] = False ) → transformers.modeling_tf_outputs.TFCausalLMOutput or tuple(tf.Tensor)
The TFWav2Vec2ForCTC forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> import tensorflow as tf
>>> from transformers import AutoProcessor, TFWav2Vec2ForCTC
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = TFWav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> input_values = processor(ds["speech"][0], return_tensors="tf").input_values
>>> logits = model(input_values).logits
>>> predicted_ids = tf.argmax(logits, axis=-1)
>>> transcription = processor.decode(predicted_ids[0])
>>>
>>> target_transcription = "A MAN SAID TO THE UNIVERSE SIR I EXIST"
>>>
>>> labels = processor(text=target_transcription, return_tensors="tf").input_ids
>>> loss = model(input_values, labels=labels).loss
FlaxWav2Vec2Model
class transformers.FlaxWav2Vec2Model
( config: Wav2Vec2Config, input_shape: typing.Tuple = (1, 1024), seed: int = 0, dtype: dtype = <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).
This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.
Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters.
If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().
The bare Wav2Vec2 Model transformer outputting raw hidden-states without any specific head on top. Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer to the Flax documentation for all matters related to general usage and behavior.
Finally, this model supports inherent JAX features such as:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization
The FlaxWav2Vec2PreTrainedModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, FlaxWav2Vec2Model
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-large-lv60")
>>> model = FlaxWav2Vec2Model.from_pretrained("facebook/wav2vec2-large-lv60")
>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> input_values = processor(
...     ds["speech"][0], sampling_rate=16_000, return_tensors="np"
... ).input_values
>>> hidden_states = model(input_values).last_hidden_state
FlaxWav2Vec2ForCTC
class transformers.FlaxWav2Vec2ForCTC
( config: Wav2Vec2Config, input_shape: typing.Tuple = (1, 1024), seed: int = 0, dtype: dtype = <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).
This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.
Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters.
If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().
Wav2Vec2 Model with a language modeling head on top for Connectionist Temporal Classification (CTC). Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer to the Flax documentation for all matters related to general usage and behavior.
Finally, this model supports inherent JAX features such as:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization
__call__
( input_values, attention_mask = None, mask_time_indices = None, params: dict = None, dropout_rng: PRNGKey = None, train: bool = False, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, freeze_feature_encoder: bool = False, return_dict: typing.Optional[bool] = None ) → transformers.modeling_flax_outputs.FlaxMaskedLMOutput or tuple(torch.FloatTensor)
The FlaxWav2Vec2PreTrainedModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> import jax.numpy as jnp
>>> from transformers import AutoProcessor, FlaxWav2Vec2ForCTC
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-large-960h-lv60")
>>> model = FlaxWav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60")
>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> input_values = processor(
...     ds["speech"][0], sampling_rate=16_000, return_tensors="np"
... ).input_values
>>> logits = model(input_values).logits
>>> predicted_ids = jnp.argmax(logits, axis=-1)
>>> transcription = processor.decode(predicted_ids[0])
>>>
FlaxWav2Vec2ForPreTraining
class transformers.FlaxWav2Vec2ForPreTraining
( config: Wav2Vec2Config, input_shape: typing.Tuple = (1, 1024), seed: int = 0, dtype: dtype = <class 'jax.numpy.float32'>, _do_init: bool = True, **kwargs )
Parameters
- config (Wav2Vec2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).
This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified all the computation will be performed with the given dtype.
Note that this only specifies the dtype of the computation and does not influence the dtype of model parameters.
If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().
Wav2Vec2 Model with a quantizer and VQ head on top. Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a Flax Linen flax.nn.Module subclass. Use it as a regular Flax Module and refer to the Flax documentation for all matters related to general usage and behavior.
Finally, this model supports inherent JAX features such as:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization
__call__
( input_values, attention_mask = None, mask_time_indices = None, gumbel_temperature: int = 1, params: dict = None, dropout_rng: PRNGKey = None, gumbel_rng: PRNGKey = None, train: bool = False, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, freeze_feature_encoder: bool = False, return_dict: typing.Optional[bool] = None ) → transformers.models.wav2vec2.modeling_flax_wav2vec2.FlaxWav2Vec2ForPreTrainingOutput or tuple(torch.FloatTensor)
The FlaxWav2Vec2ForPreTraining forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> import optax
>>> import numpy as np
>>> import jax.numpy as jnp
>>> from transformers import AutoFeatureExtractor, FlaxWav2Vec2ForPreTraining
>>> from transformers.models.wav2vec2.modeling_flax_wav2vec2 import _compute_mask_indices
>>> from datasets import load_dataset
>>> import soundfile as sf
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-large-lv60")
>>> model = FlaxWav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-large-lv60")
>>> def map_to_array(batch):
...     speech, _ = sf.read(batch["file"])
...     batch["speech"] = speech
...     return batch
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.map(map_to_array)
>>> input_values = feature_extractor(ds["speech"][0], return_tensors="np").input_values
>>>
>>> batch_size, raw_sequence_length = input_values.shape
>>> sequence_length = model._get_feat_extract_output_lengths(raw_sequence_length)
>>> mask_time_indices = _compute_mask_indices((batch_size, sequence_length), mask_prob=0.2, mask_length=2)
>>> outputs = model(input_values, mask_time_indices=mask_time_indices)
>>>
>>> cosine_sim = optax.cosine_similarity(outputs.projected_states, outputs.projected_quantized_states)
>>>
>>> assert np.asarray(cosine_sim)[mask_time_indices].mean() > 0.5
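The `_compute_mask_indices` call above selects spans of time steps to mask. A simplified sketch of the idea, assuming span starts are sampled uniformly without replacement and spans may overlap. The helper `sample_span_mask` is hypothetical and does not reproduce the library function's exact sampling or its handling of padding:

```python
import random


def sample_span_mask(sequence_length, mask_prob=0.2, mask_length=2, seed=0):
    """Return a list of booleans marking masked time steps (illustrative only)."""
    if mask_length < 1 or mask_length > sequence_length:
        raise ValueError("mask_length must be between 1 and sequence_length")
    rng = random.Random(seed)
    # Valid span starts are those where a full span still fits in the sequence.
    num_starts = sequence_length - mask_length + 1
    # Pick enough spans so that roughly mask_prob of all steps are covered.
    num_spans = max(1, int(mask_prob * sequence_length / mask_length))
    num_spans = min(num_spans, num_starts)
    starts = rng.sample(range(num_starts), num_spans)
    mask = [False] * sequence_length
    for start in starts:
        for offset in range(mask_length):
            mask[start + offset] = True
    return mask


mask = sample_span_mask(sequence_length=50)
print(sum(mask), "of", len(mask), "steps masked")
```

Because spans can overlap, the fraction of masked steps is at most `mask_prob` in this sketch rather than exactly equal to it.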