Main classes
DatasetInfo
class datasets.DatasetInfo
( description: str =
Information about a dataset.
DatasetInfo documents a dataset, including its name, version, and features. See the constructor arguments and properties for a full list.
Not all fields are known on construction and may be updated later.
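For instance, the populated fields can be inspected on a loaded dataset through its info attribute; a brief sketch:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.info.description
>>> ds.info.features
>>> ds.info.version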
from_directory
( dataset_info_dir: str, fs = 'deprecated', storage_options: Optional = None )
Parameters
- dataset_info_dir (str) — The directory containing the metadata file. This should be the root directory of a specific dataset version.
- fs (fsspec.spec.AbstractFileSystem, optional) — Instance of the remote filesystem used to download the files from. Deprecated in 2.9.0: fs was deprecated in version 2.9.0 and will be removed in 3.0.0. Please use storage_options instead, e.g. storage_options=fs.storage_options.
- storage_options (dict, optional) — Key/value pairs to be passed on to the file-system backend, if any. Added in 2.9.0.
Create DatasetInfo from the JSON file in dataset_info_dir.
This function updates all the dynamically generated fields (num_examples, hash, time of creation,…) of the DatasetInfo.
This will overwrite all previous metadata.
Example:
>>> from datasets import DatasetInfo
>>> ds_info = DatasetInfo.from_directory("/path/to/directory/")
write_to_directory
( dataset_info_dir, pretty_print = False, fs = 'deprecated', storage_options: Optional = None )
Parameters
- pretty_print (bool, defaults to False) — If True, the JSON will be pretty-printed with an indent level of 4.
- fs (fsspec.spec.AbstractFileSystem, optional) — Instance of the remote filesystem used to download the files from. Deprecated in 2.9.0: fs was deprecated in version 2.9.0 and will be removed in 3.0.0. Please use storage_options instead, e.g. storage_options=fs.storage_options.
- storage_options (dict, optional) — Key/value pairs to be passed on to the file-system backend, if any. Added in 2.9.0.
Write DatasetInfo and license (if present) as JSON files to dataset_info_dir.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.info.write_to_directory("/path/to/directory/")
Dataset
The base class Dataset implements a Dataset backed by an Apache Arrow table.
class datasets.Dataset
( arrow_table: Table, info: Optional = None, split: Optional = None, indices_table: Optional = None, fingerprint: Optional = None )
A Dataset backed by an Arrow table.
add_column
( name: str, column: Union, new_fingerprint: str )
Parameters
Add column to Dataset.
Added in 1.7
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> more_text = ds["text"]
>>> ds.add_column(name="text_2", column=more_text)
Dataset({
    features: ['text', 'label', 'text_2'],
    num_rows: 1066
})
add_item
( item: dict, new_fingerprint: str )
Parameters
Add item to Dataset.
Added in 1.7
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> new_review = {'label': 0, 'text': 'this movie is the absolute worst thing I have ever seen'}
>>> ds = ds.add_item(new_review)
>>> ds[-1]
{'label': 0, 'text': 'this movie is the absolute worst thing I have ever seen'}
from_file
( filename: str, info: Optional = None, split: Optional = None, indices_filename: Optional = None, in_memory: bool = False )
Parameters
- filename (str) — File name of the dataset.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
- indices_filename (str, optional) — File names of the indices.
- in_memory (bool, defaults to False) — Whether to copy the data in-memory.
Instantiate a Dataset backed by an Arrow table at filename.
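No example accompanies this method here; a minimal sketch, assuming you already have an Arrow file previously written by the library (the path below is illustrative):
>>> from datasets import Dataset
>>> ds = Dataset.from_file("path/to/dataset.arrow")
>>> ds.num_rows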
from_buffer
( buffer: Buffer, info: Optional = None, split: Optional = None, indices_buffer: Optional = None )
Parameters
- buffer (pyarrow.Buffer) — Arrow buffer.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
- indices_buffer (pyarrow.Buffer, optional) — Indices Arrow buffer.
Instantiate a Dataset backed by an Arrow buffer.
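A short sketch of round-tripping a small table through an in-memory Arrow IPC stream buffer (the pyarrow serialization steps are shown for illustration and are not part of this API):
>>> import pyarrow as pa
>>> from datasets import Dataset
>>> table = pa.table({"text": ["Good", "Bad"], "label": [0, 1]})
>>> sink = pa.BufferOutputStream()
>>> with pa.ipc.new_stream(sink, table.schema) as writer:
...     writer.write_table(table)
...
>>> ds = Dataset.from_buffer(sink.getvalue())
>>> ds.column_names
['text', 'label']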
from_pandas
( df: DataFrame, features: Optional = None, info: Optional = None, split: Optional = None, preserve_index: Optional = None )
Parameters
- df (pandas.DataFrame) — Dataframe that contains the dataset.
- features (Features, optional) — Dataset features.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
- preserve_index (bool, optional) — Whether to store the index as an additional column in the resulting Dataset. The default of None will store the index as a column, except for RangeIndex, which is stored as metadata only. Use preserve_index=True to force it to be stored as a column.
Convert pandas.DataFrame to a pyarrow.Table to create a Dataset.
The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the case of object, we need to guess the datatype by looking at the Python objects in this Series.
Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only contains None/nan objects, the type is set to null. This behavior can be avoided by constructing explicit features and passing them to this function.
Example:
>>> ds = Dataset.from_pandas(df)
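A slightly fuller sketch (with a toy DataFrame) showing how explicit features can be passed to control the resulting column types:
>>> import pandas as pd
>>> from datasets import Dataset, Features, Value, ClassLabel
>>> df = pd.DataFrame({"text": ["Good", "Bad"], "label": [1, 0]})
>>> features = Features({"text": Value("string"), "label": ClassLabel(names=["neg", "pos"])})
>>> ds = Dataset.from_pandas(df, features=features)  # 'label' is now a ClassLabel instead of an inferred int64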
from_dict
( mapping: dict, features: Optional = None, info: Optional = None, split: Optional = None )
Parameters
- mapping (Mapping) — Mapping of strings to Arrays or Python lists.
- features (Features, optional) — Dataset features.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
Convert dict to a pyarrow.Table to create a Dataset.
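No example accompanies this method here; a minimal sketch:
>>> from datasets import Dataset
>>> ds = Dataset.from_dict({"text": ["Good", "Bad"], "label": [0, 1]})
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 2
})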
from_generator
( generator: Callable, features: Optional = None, cache_dir: str = None, keep_in_memory: bool = False, gen_kwargs: Optional = None, num_proc: Optional = None, split: NamedSplit = NamedSplit('train'), **kwargs )
Create a Dataset from a generator.
Example:
>>> def gen():
...     yield {"text": "Good", "label": 0}
...     yield {"text": "Bad", "label": 1}
...
>>> ds = Dataset.from_generator(gen)

>>> def gen(shards):
...     for shard in shards:
...         with open(shard) as f:
...             for line in f:
...                 yield {"line": line}
...
>>> shards = [f"data{i}.txt" for i in range(32)]
>>> ds = Dataset.from_generator(gen, gen_kwargs={"shards": shards})
data
The Apache Arrow table backing the dataset.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.data
MemoryMappedTable
text: string
label: int64
----
text: [["compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .","the soundtrack alone is worth the price of admission .","rodriguez does a splendid job of racial profiling hollywood style--casting excellent latin actors of all ages--a trend long overdue .","beneath the film's obvious determination to shock at any cost lies considerable skill and determination , backed by sheer nerve .","bielinsky is a filmmaker of impressive talent .","so beautifully acted and directed , it's clear that washington most certainly has a new career ahead of him if he so chooses .","a visual spectacle full of stunning images and effects .","a gentle and engrossing character study .","it's enough to watch huppert scheming , with her small , intelligent eyes as steady as any noir villain , and to enjoy the perfectly pitched web of tension that chabrol spins .","an engrossing portrait of uncompromising artists trying to create something original against the backdrop of a corporate music industry that only seems to care about the bottom line .",...,"ultimately , jane learns her place as a girl , softens up and loses some of the intensity that made her an interesting character to begin with .","ah-nuld's action hero days might be over .","it's clear why deuces wild , which was shot two years ago , has been gathering dust on mgm's shelf .","feels like nothing quite so much as a middle-aged moviemaker's attempt to surround himself with beautiful , half-naked women .","when the precise nature of matthew's predicament finally comes into sharp focus , the revelation fails to justify the build-up .","this picture is murder by numbers , and as easy to be bored by as your abc's , despite a few whopping shootouts .","hilarious musical comedy though stymied by accents thick as mud .","if you are into splatter movies , then you will probably have a reasonably good time with the salton sea .","a dull , simple-minded and stereotypical tale of drugs , death and mind-numbing indifference on the inner-city streets .","the feature-length stretch . . . strains the show's concept ."]]
label: [[1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0]]
cache_files
The cache files containing the Apache Arrow table backing the dataset.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.cache_files
[{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-validation.arrow'}]
num_columns
Number of columns in the dataset.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.num_columns
2
num_rows
Number of rows in the dataset (same as Dataset.__len__()).
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.num_rows
1066
column_names
Names of the columns in the dataset.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.column_names
['text', 'label']
shape
Shape of the dataset (number of rows, number of columns).
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.shape
(1066, 2)
unique
( column: str ) → list
Parameters
- column (str) — Column name (list all the column names with column_names).
List of unique elements in the given column.
Return a list of the unique elements in a column.
This is implemented in the low-level backend and as such, very fast.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.unique('label')
[1, 0]
flatten
( new_fingerprint: Optional = None, max_depth = 16 ) → Dataset
Parameters
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
A copy of the dataset with flattened columns.
Flatten the table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("squad", split="train")
>>> ds.features
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}
>>> ds.flatten()
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
    num_rows: 87599
})
cast
( features: Features, batch_size: Optional = 1000, keep_in_memory: bool = False, load_from_cache_file: Optional = None, cache_file_name: Optional = None, writer_batch_size: Optional = 1000, num_proc: Optional = None ) → Dataset
Cast the dataset to a new set of features.
Example:
>>> from datasets import load_dataset, ClassLabel, Value
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
>>> new_features = ds.features.copy()
>>> new_features['label'] = ClassLabel(names=['bad', 'good'])
>>> new_features['text'] = Value('large_string')
>>> ds = ds.cast(new_features)
>>> ds.features
{'label': ClassLabel(num_classes=2, names=['bad', 'good'], id=None),
 'text': Value(dtype='large_string', id=None)}
cast_column
( column: str, feature: Union, new_fingerprint: Optional = None )
Parameters
- column (str) — Column name.
- feature (FeatureType) — Target feature.
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Cast column to feature for decoding.
Example:
>>> from datasets import load_dataset, ClassLabel
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
>>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good']))
>>> ds.features
{'label': ClassLabel(num_classes=2, names=['bad', 'good'], id=None),
 'text': Value(dtype='string', id=None)}
remove_columns
( column_names: Union, new_fingerprint: Optional = None ) → Dataset
Parameters
- column_names (Union[str, List[str]]) — Name of the column(s) to remove.
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
A copy of the dataset object without the columns to remove.
Remove one or several column(s) in the dataset and the features associated to them.
You can also remove a column using map() with remove_columns
but the present method doesn’t copy the data of the remaining columns and is thus faster.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.remove_columns('label')
Dataset({
    features: ['text'],
    num_rows: 1066
})
>>> ds.remove_columns(column_names=ds.column_names)
Dataset({
    features: [],
    num_rows: 0
})
rename_column
( original_column_name: str, new_column_name: str, new_fingerprint: Optional = None ) → Dataset
Parameters
- original_column_name (str) — Name of the column to rename.
- new_column_name (str) — New name for the column.
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
A copy of the dataset with a renamed column.
Rename a column in the dataset, and move the features associated to the original column under the new column name.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.rename_column('label', 'label_new')
Dataset({
    features: ['text', 'label_new'],
    num_rows: 1066
})
rename_columns
( column_mapping: Dict, new_fingerprint: Optional = None ) → Dataset
Parameters
- column_mapping (Dict[str, str]) — A mapping of columns to rename to their new names.
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
A copy of the dataset with renamed columns.
Rename several columns in the dataset, and move the features associated to the original columns under the new column names.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.rename_columns({'text': 'text_new', 'label': 'label_new'})
Dataset({
    features: ['text_new', 'label_new'],
    num_rows: 1066
})
select_columns
( column_names: Union, new_fingerprint: Optional = None ) → Dataset
Parameters
- column_names (Union[str, List[str]]) — Name of the column(s) to keep.
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
A copy of the dataset object which only consists of selected columns.
Select one or several column(s) in the dataset and the features associated to them.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.select_columns(['text'])
Dataset({
    features: ['text'],
    num_rows: 1066
})
class_encode_column
( column: str, include_nulls: bool = False )
Parameters
- column (str) — The name of the column to cast (list all the column names with column_names).
- include_nulls (bool, defaults to False) — Whether to include null values in the class labels. If True, the null values will be encoded as the "None" class label. Added in 1.14.2.
Casts the given column as ClassLabel and updates the table.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("boolq", split="validation")
>>> ds.features
{'answer': Value(dtype='bool', id=None),
 'passage': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None)}
>>> ds = ds.class_encode_column('answer')
>>> ds.features
{'answer': ClassLabel(num_classes=2, names=['False', 'True'], id=None),
 'passage': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None)}
__len__
Number of rows in the dataset.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.__len__
<bound method Dataset.__len__ of Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})>
__iter__
Iterate through the examples.
If a formatting is set with Dataset.set_format(), rows will be returned with the selected format.
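Iteration yields one example dict at a time; a quick sketch:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> for example in ds:
...     print(example["label"])
...     break
1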
iter
( batch_size: int, drop_last_batch: bool = False )
Parameters
- batch_size (int) — size of each batch to yield.
- drop_last_batch (bool, default False) — Whether a last batch smaller than the batch_size should be dropped.
Iterate through the batches of size batch_size.
If a formatting is set with Dataset.set_format(), rows will be returned with the selected format.
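A short sketch of batched iteration; each batch is a dict mapping column names to lists of values:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> for batch in ds.iter(batch_size=8, drop_last_batch=True):
...     print(len(batch["text"]))
...     break
8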
formatted_as
( type: Optional = None, columns: Optional = None, output_all_columns: bool = False, **format_kwargs )
Parameters
- type (str, optional) — Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
To be used in a with statement. Set __getitem__ return format (type and columns).
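A brief sketch of the context-manager usage; the previous format is restored when the with block exits:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> with ds.formatted_as(type="numpy", columns=["label"]):
...     print(type(ds["label"]))
...
<class 'numpy.ndarray'>
>>> type(ds["label"])  # plain python objects again outside the block
<class 'list'>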
set_format
( type: Optional = None, columns: Optional = None, output_all_columns: bool = False, **format_kwargs )
Parameters
- type (str, optional) — Either output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__. It’s also possible to use custom transforms for formatting using set_transform().
It is possible to call map() after calling set_format. Since map may add new columns, then the list of formatted columns gets updated. In this case, if you apply map on a dataset to add a new column, then this column will be formatted as:
new formatted columns = (all columns - previously unformatted columns)
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
>>> ds.set_format(type='numpy', columns=['text', 'label'])
>>> ds.format
{'type': 'numpy',
 'format_kwargs': {},
 'columns': ['text', 'label'],
 'output_all_columns': False}
set_transform
( transform: Optional, columns: Optional = None, output_all_columns: bool = False )
Parameters
- transform (Callable, optional) — User-defined formatting transform, replaces the format defined by set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.
- columns (List[str], optional) — Columns to format in the output. If specified, then the input batch of the transform only contains those columns.
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.
Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called. As with set_format(), this can be reset using reset_format().
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> def encode(batch):
...     return tokenizer(batch['text'], padding=True, truncation=True, return_tensors='pt')
>>> ds.set_transform(encode)
>>> ds[0]
{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1]),
 'input_ids': tensor([  101, 29353,  2135, 15102,  1996,  9428, 20868,  2890,  8663,  6895,
        20470,  2571,  3663,  2090,  4603,  3017,  3008,  1998,  2037, 24211,
         5637,  1998, 11690,  2336,  1012,   102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0])}
reset_format
Reset __getitem__ return format to python objects and all columns.
Same as self.set_format().
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
>>> ds.set_format(type='numpy', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
>>> ds.format
{'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': 'numpy'}
>>> ds.reset_format()
>>> ds.format
{'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': None}
with_format
( type: Optional = None, columns: Optional = None, output_all_columns: bool = False, **format_kwargs )
Parameters
- type (str, optional) — Either output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly. The format type (for example “numpy”) is used to format batches when using __getitem__.
It’s also possible to use custom transforms for formatting using with_transform().
Contrary to set_format(), with_format returns a new Dataset object.
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
>>> ds.format
{'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': None}
>>> ds = ds.with_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
>>> ds.format
{'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': 'tensorflow'}
with_transform
( transform: Optional, columns: Optional = None, output_all_columns: bool = False )
Parameters
- transform (Callable, optional) — User-defined formatting transform, replaces the format defined by set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.
- columns (List[str], optional) — Columns to format in the output. If specified, then the input batch of the transform only contains those columns.
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.
Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called.
As with set_format(), this can be reset using reset_format().
Contrary to set_transform(), with_transform returns a new Dataset object.
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> def encode(example):
...     return tokenizer(example["text"], padding=True, truncation=True, return_tensors='pt')
>>> ds = ds.with_transform(encode)
>>> ds[0]
{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1]),
 'input_ids': tensor([  101, 18027, 16310, 16001,  1103,  9321,   178, 11604,  7235,  6617,
         1742,  2165,  2820,  1206,  6588, 22572, 12937,  1811,  2153,  1105,
         1147, 12890, 19587,  6463,  1105, 15026,  1482,   119,   102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0])}
__getitem__
Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools).
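A quick sketch of the supported index types:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds[0]          # a single row, as a dict
>>> ds["label"]    # a whole column, as a list
>>> ds[:3]         # a slice of rows, as a dict of lists
>>> ds[[0, 2, 5]]  # an arbitrary selection of rows, as a dict of lists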
cleanup_cache_files
Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one.
Be careful when running this command that no other process is currently using other cache files.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.cleanup_cache_files()
10
map
( function: Optional = None, with_indices: bool = False, with_rank: bool = False, input_columns: Union = None, batched: bool = False, batch_size: Optional = 1000, drop_last_batch: bool = False, remove_columns: Union = None, keep_in_memory: bool = False, load_from_cache_file: Optional = None, cache_file_name: Optional = None, writer_batch_size: Optional = 1000, features: Optional = None, disable_nullable: bool = False, fn_kwargs: Optional = None, num_proc: Optional = None, suffix_template: str = '_{rank:05d}_of_{num_proc:05d}', new_fingerprint: Optional = None, desc: Optional = None )
Apply a function to all the examples in the table (individually or in batches) and update the table. If your function returns a column that already exists, then it overwrites it.
You can specify whether the function should be batched or not with the batched parameter:
- If batched is False, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. {"text": "Hello there !"}.
- If batched is True and batch_size is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is {"text": ["Hello there !"]}.
- If batched is True and batch_size is n > 1, then the function takes a batch of n examples as input and can return a batch with n examples, or with an arbitrary number of examples. Note that the last batch may have fewer than n examples. A batch is a dictionary, e.g. a batch of n examples is {"text": ["Hello there !"] * n}.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> def add_prefix(example):
...     example["text"] = "Review: " + example["text"]
...     return example
>>> ds = ds.map(add_prefix)
>>> ds[0:3]["text"]
['Review: compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .',
 'Review: the soundtrack alone is worth the price of admission .',
 'Review: rodriguez does a splendid job of racial profiling hollywood style--casting excellent latin actors of all ages--a trend long overdue .']
>>> # process a batch of examples (assumes a tokenizer has been defined, e.g. with transformers)
>>> ds = ds.map(lambda example: tokenizer(example["text"]), batched=True)
>>> # set the number of processes for multiprocessing
>>> ds = ds.map(add_prefix, num_proc=4)
filter
( function: Optional = None, with_indices: bool = False, with_rank: bool = False, input_columns: Union = None, batched: bool = False, batch_size: Optional = 1000, keep_in_memory: bool = False, load_from_cache_file: Optional = None, cache_file_name: Optional = None, writer_batch_size: Optional = 1000, fn_kwargs: Optional = None, num_proc: Optional = None, suffix_template: str = '_{rank:05d}_of_{num_proc:05d}', new_fingerprint: Optional = None, desc: Optional = None )
Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.filter(lambda x: x["label"] == 1)
Dataset({
    features: ['text', 'label'],
    num_rows: 533
})
select
( indices: Iterable, keep_in_memory: bool = False, indices_cache_file_name: Optional = None, writer_batch_size: Optional = 1000, new_fingerprint: Optional = None )
Parameters
- indices (range, list, iterable, ndarray or Series) — Range, list or 1D-array of integer indices for indexing. If the indices correspond to a contiguous range, the Arrow table is simply sliced. However, passing a list of indices that are not contiguous creates an indices mapping, which is much less efficient, but still faster than recreating an Arrow table made of the requested rows.
- keep_in_memory (bool, defaults to False) — Keep the indices mapping in memory instead of writing it to a cache file.
- indices_cache_file_name (str, optional, defaults to None) — Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name.
- writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
- new_fingerprint (str, optional, defaults to None) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Create a new dataset with rows selected following the list/array of indices.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.select(range(4))
Dataset({
    features: ['text', 'label'],
    num_rows: 4
})
sort
( column_names: Union, reverse: Union = False, kind = 'deprecated', null_placement: str = 'at_end', keep_in_memory: bool = False, load_from_cache_file: Optional = None, indices_cache_file_name: Optional = None, writer_batch_size: Optional = 1000, new_fingerprint: Optional = None )
Create a new dataset sorted according to a single or multiple columns.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset('rotten_tomatoes', split='validation')
>>> ds['label'][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> sorted_ds = ds.sort('label')
>>> sorted_ds['label'][:10]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> another_sorted_ds = ds.sort(['label', 'text'], reverse=[True, False])
>>> another_sorted_ds['label'][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
shuffle
( seed: Optional = None, generator: Optional = None, keep_in_memory: bool = False, load_from_cache_file: Optional = None, indices_cache_file_name: Optional = None, writer_batch_size: Optional = 1000, new_fingerprint: Optional = None )
Parameters
- seed (int, optional) — A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.
- generator (numpy.random.Generator, optional) — Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).
- keep_in_memory (bool, default False) — Keep the shuffled indices in memory instead of writing them to a cache file.
- load_from_cache_file (Optional[bool], defaults to True if caching is enabled) — If a cache file storing the shuffled indices can be identified, use it instead of recomputing.
- indices_cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the shuffled indices instead of the automatically generated cache file name.
- writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
- new_fingerprint (str, optional, defaults to None) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Create a new Dataset where the rows are shuffled.
Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).
Shuffling takes the list of indices [0:len(my_dataset)] and shuffles it to create an indices mapping. However, as soon as your Dataset has an indices mapping, the speed can become 10x slower. This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren’t reading contiguous chunks of data anymore. To restore the speed, you’d need to rewrite the entire dataset on your disk again using Dataset.flatten_indices(), which removes the indices mapping.
This may take a lot of time depending on the size of your dataset, though:
my_dataset[0]  # fast
my_dataset = my_dataset.shuffle(seed=42)
my_dataset[0]  # up to 10x slower
my_dataset = my_dataset.flatten_indices()  # rewrite the shuffled dataset on disk as contiguous chunks of data
my_dataset[0]  # fast again
In this case, we recommend switching to an IterableDataset and leveraging its fast approximate shuffling method IterableDataset.shuffle().
It only shuffles the shard order and adds a shuffle buffer to your dataset, which keeps the speed of your dataset optimal:
my_iterable_dataset = my_dataset.to_iterable_dataset(num_shards=128)
for example in my_iterable_dataset:  # fast
    pass

shuffled_iterable_dataset = my_iterable_dataset.shuffle(seed=42, buffer_size=100)
for example in shuffled_iterable_dataset:  # as fast as before
    pass
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds['label'][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> shuffled_ds = ds.shuffle(seed=42)
>>> shuffled_ds['label'][:10]
[1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
skip
( n: int )
Parameters
- n (int) — Number of elements to skip.
Create a new Dataset that skips the first n elements.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> list(ds.take(3))
[{'label': 1,
  'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'},
 {'label': 1,
  'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'},
 {'label': 1, 'text': 'effective but too-tepid biopic'}]
>>> ds = ds.skip(1)
>>> list(ds.take(3))
[{'label': 1,
  'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'},
 {'label': 1, 'text': 'effective but too-tepid biopic'},
 {'label': 1,
  'text': 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'}]
take
( n: int )
Parameters
- n (int) — Number of elements to take.
Create a new Dataset with only the first n elements.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> small_ds = ds.take(2)
>>> list(small_ds)
[{'label': 1,
  'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'},
 {'label': 1,
  'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .'}]
train_test_split
( test_size: Union = None, train_size: Union = None, shuffle: bool = True, stratify_by_column: Optional = None, seed: Optional = None, generator: Optional = None, keep_in_memory: bool = False, load_from_cache_file: Optional = None, train_indices_cache_file_name: Optional = None, test_indices_cache_file_name: Optional = None, writer_batch_size: Optional = 1000, train_new_fingerprint: Optional = None, test_new_fingerprint: Optional = None )
Return a dictionary (datasets.DatasetDict) with two random train and test subsets (train and test Dataset splits). Splits are created from the dataset according to test_size, train_size and shuffle.
This method is similar to scikit-learn train_test_split.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.train_test_split(test_size=0.2, shuffle=True)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 852
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 214
    })
})
>>> ds = ds.train_test_split(test_size=0.2, seed=42)
>>> ds = load_dataset("imdb", split="train")
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
>>> ds.train_test_split(test_size=0.2, stratify_by_column="label")
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})
shard
( num_shards: int, index: int, contiguous: bool = False, keep_in_memory: bool = False, indices_cache_file_name: Optional = None, writer_batch_size: Optional = 1000 )
Parameters
- num_shards (int) — How many shards to split the dataset into.
- index (int) — Which shard to select and return.
- contiguous (bool, defaults to False) — Whether to select contiguous blocks of indices for shards.
- keep_in_memory (bool, defaults to False) — Keep the dataset in memory instead of writing it to a cache file.
- indices_cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the indices of each shard instead of the automatically generated cache file name.
- writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
Return the index-nth shard from dataset split into num_shards pieces.
This shards deterministically. dset.shard(n, i) will contain all elements of dset whose index mod n = i.
dset.shard(n, i, contiguous=True) will instead split dset into contiguous chunks, so it can be easily concatenated back together after processing. If len(dset) % n == l, then the first l shards will have length (len(dset) // n) + 1, and the remaining shards will have length (len(dset) // n). datasets.concatenate_datasets([dset.shard(n, i, contiguous=True) for i in range(n)]) will return a dataset with the same order as the original.
Be sure to shard before using any randomizing operator (such as shuffle
). It is best if the shard operator is used early in the dataset pipeline.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})
>>> ds.shard(num_shards=2, index=0)
Dataset({
    features: ['text', 'label'],
    num_rows: 533
})
to_tf_dataset
( batch_size: Optional = None, columns: Union = None, shuffle: bool = False, collate_fn: Optional = None, drop_remainder: bool = False, collate_fn_args: Optional = None, label_cols: Union = None, prefetch: bool = True, num_workers: int = 0, num_test_batches: int = 20 )
Create a tf.data.Dataset from the underlying Dataset. This tf.data.Dataset will load and collate batches from the Dataset, and is suitable for passing to methods like model.fit() or model.predict(). The dataset will yield dicts for both inputs and labels unless the dict would contain only a single key, in which case a raw tf.Tensor is yielded instead.
Example:
>>> ds_train = ds["train"].to_tf_dataset(
...     columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )
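The snippet above assumes a data_collator has already been defined; a more self-contained sketch (using transformers' DefaultDataCollator here as an illustrative choice, not part of this API) might look like:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer, DefaultDataCollator
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
>>> data_collator = DefaultDataCollator(return_tensors="tf")  # pads and stacks batches as tf.Tensors
>>> tf_ds = ds.to_tf_dataset(
...     columns=["input_ids", "token_type_ids", "attention_mask"],
...     label_cols=["label"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )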
push_to_hub
( repo_id: str, config_name: str = 'default', set_default: Optional = None, split: Optional = None, data_dir: Optional = None, commit_message: Optional = None, commit_description: Optional = None, private: Optional = False, token: Optional = None, revision: Optional = None, branch = 'deprecated', create_pr: Optional = False, max_shard_size: Union = None, num_shards: Optional = None, embed_external_files: bool = True )
Pushes the dataset to the hub as a Parquet dataset. The dataset is pushed using HTTP requests and does not require git or git-lfs to be installed.
The resulting Parquet files are self-contained by default. If your dataset contains Image or Audio data, the Parquet files will store the bytes of your images or audio files. You can disable this by setting embed_external_files to False.
Example:
<span>>>> </span>dataset.push_to_hub(<span>"<organization>/<dataset_id>"</span>)
<span>>>> </span>dataset_dict.push_to_hub(<span>"<organization>/<dataset_id>"</span>, private=<span>True</span>)
<span>>>> </span>dataset.push_to_hub(<span>"<organization>/<dataset_id>"</span>, max_shard_size=<span>"1GB"</span>)
<span>>>> </span>dataset.push_to_hub(<span>"<organization>/<dataset_id>"</span>, num_shards=<span>1024</span>)
If your dataset has multiple splits (e.g. train/validation/test):
<span>>>> </span>train_dataset.push_to_hub(<span>"<organization>/<dataset_id>"</span>, split=<span>"train"</span>)
<span>>>> </span>val_dataset.push_to_hub(<span>"<organization>/<dataset_id>"</span>, split=<span>"validation"</span>)
<span>>>> </span>
<span>>>> </span>dataset = load_dataset(<span>"<organization>/<dataset_id>"</span>)
<span>>>> </span>train_dataset = dataset[<span>"train"</span>]
<span>>>> </span>val_dataset = dataset[<span>"validation"</span>]
If you want to add a new configuration (or subset) to a dataset (e.g. if the dataset has multiple tasks/versions/languages):
<span>>>> </span>english_dataset.push_to_hub(<span>"<organization>/<dataset_id>"</span>, <span>"en"</span>)
<span>>>> </span>french_dataset.push_to_hub(<span>"<organization>/<dataset_id>"</span>, <span>"fr"</span>)
<span>>>> </span>
<span>>>> </span>english_dataset = load_dataset(<span>"<organization>/<dataset_id>"</span>, <span>"en"</span>)
<span>>>> </span>french_dataset = load_dataset(<span>"<organization>/<dataset_id>"</span>, <span>"fr"</span>)
save_to_disk
( dataset_path: Unionfs = ‘deprecated’max_shard_size: Union = Nonenum_shards: Optional = Nonenum_proc: Optional = Nonestorage_options: Optional = None )
Saves a dataset to a dataset directory, or in a filesystem using any implementation of fsspec.spec.AbstractFileSystem
.
All the Image() and Audio() data are stored in the arrow files. If you want to store paths or urls, please use the Value(“string”) type.
Example:
<span>>>> </span>ds.save_to_disk(<span>"path/to/dataset/directory"</span>)
<span>>>> </span>ds.save_to_disk(<span>"path/to/dataset/directory"</span>, max_shard_size=<span>"1GB"</span>)
<span>>>> </span>ds.save_to_disk(<span>"path/to/dataset/directory"</span>, num_shards=<span>1024</span>)
load_from_disk
( dataset_path: Unionfs = ‘deprecated’keep_in_memory: Optional = Nonestorage_options: Optional = None ) → Dataset or DatasetDict
Loads a dataset that was previously saved using save_to_disk
from a dataset directory, or from a filesystem using any implementation of fsspec.spec.AbstractFileSystem
.
Example:
<span>>>> </span>ds = load_from_disk(<span>"path/to/dataset/directory"</span>)
flatten_indices
( keep_in_memory: bool = Falsecache_file_name: Optional = Nonewriter_batch_size: Optional = 1000features: Optional = Nonedisable_nullable: bool = Falsenum_proc: Optional = Nonenew_fingerprint: Optional = None )
Parameters
- keep_in_memory (
bool
, defaults toFalse
) — Keep the dataset in memory instead of writing it to a cache file. - cache_file_name (
str
, optional, defaultNone
) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. - writer_batch_size (
int
, defaults to1000
) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during processing and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map
. - features (
Optional[datasets.Features]
, defaults toNone
) — Use a specific Features to store the cache file instead of the automatically generated one. - disable_nullable (
bool
, defaults toFalse
) — Allow null values in the table. - num_proc (
int
, optional, defaultNone
) — Max number of processes when generating cache. Already cached shards are loaded sequentially - new_fingerprint (
str
, optional, defaults toNone
) — The new fingerprint of the dataset after transform. IfNone
, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments
Create and cache a new Dataset by flattening the indices mapping.
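A short sketch of a typical use: operations such as select() or shuffle() create an indices mapping on top of the original Arrow table, and flatten_indices() rewrites those rows contiguously (ds is assumed to be the validation split from the earlier examples):
>>> ds = ds.select(range(100))   # select() creates an indices mapping over the original Arrow table
>>> ds = ds.flatten_indices()    # writes the 100 selected rows as a new contiguous table and drops the mapping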
to_csv
( path_or_buf: Unionbatch_size: Optional = Nonenum_proc: Optional = Nonestorage_options: Optional = None**to_csv_kwargs ) → int
Exports the dataset to CSV.
Example:
<span>>>> </span>ds.to_csv(<span>"path/to/dataset/directory"</span>)
to_pandas
( batch_size: Optional = Nonebatched: bool = False )
Parameters
- batched (
bool
) — Set toTrue
to return a generator that yields the dataset as batches ofbatch_size
rows. Defaults toFalse
(returns the whole datasets once). - batch_size (
int
, optional) — The size (number of rows) of the batches ifbatched
isTrue
. Defaults todatasets.config.DEFAULT_MAX_BATCH_SIZE
.
Returns the dataset as a pandas.DataFrame
. Can also return a generator for large datasets.
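A brief sketch of both modes; the batched form is the usual choice for datasets that do not fit in memory:
>>> df = ds.to_pandas()                                   # the whole dataset as one DataFrame
>>> for batch_df in ds.to_pandas(batched=True, batch_size=1000):
...     pass                                              # each batch_df holds up to 1000 rows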
to_dict
( batch_size: Optional = Nonebatched = ‘deprecated’ )
Parameters
-
batched (
bool
) — Set toTrue
to return a generator that yields the dataset as batches ofbatch_size
rows. Defaults toFalse
(returns the whole datasets once).Deprecated in 2.11.0
Use
.iter(batch_size=batch_size)
followed by.to_dict()
on the individual batches instead. -
batch_size (
int
, optional) — The size (number of rows) of the batches ifbatched
isTrue
. Defaults todatasets.config.DEFAULT_MAX_BATCH_SIZE
.
Returns the dataset as a Python dict. Can also return a generator for large datasets.
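A minimal sketch of both the direct call and the iteration pattern recommended by the deprecation note above:
>>> d = ds.to_dict()
>>> list(d.keys())
['text', 'label']
>>> for batch in ds.iter(batch_size=1000):    # preferred for large datasets
...     pass                                   # each batch is a dict of up to 1000 values per column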
to_json
( path_or_buf: Unionbatch_size: Optional = Nonenum_proc: Optional = Nonestorage_options: Optional = None**to_json_kwargs ) → int
Export the dataset to JSON Lines or JSON.
The default output format is JSON Lines. To export to JSON instead, pass the lines=False
argument and the desired orient
.
Example:
<span>>>> </span>ds.to_json(<span>"path/to/dataset/directory/filename.jsonl"</span>)
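A hedged example of exporting nested JSON instead of JSON Lines; the extra keyword arguments are forwarded to the underlying pandas writer, so the accepted orient values follow pandas:
>>> ds.to_json("path/to/dataset/directory/filename.json", lines=False, orient="records")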
to_parquet
( path_or_buf: Unionbatch_size: Optional = Nonestorage_options: Optional = None**parquet_writer_kwargs ) → int
Parameters
-
path_or_buf (
PathLike
orFileOrBuffer
) — Either a path to a file (e.g.file.parquet
), a remote URI (e.g.hf://datasets/username/my_dataset_name/data.parquet
), or a BinaryIO, where the dataset will be saved to in the specified format. -
batch_size (
int
, optional) — Size of the batch to load in memory and write at once. Defaults todatasets.config.DEFAULT_MAX_BATCH_SIZE
. -
storage_options (
dict
, optional) — Key/value pairs to be passed on to the file-system backend, if any.Added in 2.19.0
-
**parquet_writer_kwargs (additional keyword arguments) — Parameters to pass to PyArrow’s
pyarrow.parquet.ParquetWriter
.
The number of characters or bytes written.
Exports the dataset to Parquet.
Example:
<span>>>> </span>ds.to_parquet(<span>"path/to/dataset/directory"</span>)
to_sql
( name: strcon: Unionbatch_size: Optional = None**sql_writer_kwargs ) → int
Exports the dataset to a SQL database.
Example:
<span>>>> </span>
<span>>>> </span>ds.to_sql(<span>"data"</span>, <span>"sqlite:///my_own_db.sql"</span>)
<span>>>> </span>
<span>>>> </span><span>import</span> sqlite3
<span>>>> </span>con = sqlite3.connect(<span>"my_own_db.sql"</span>)
<span>>>> </span><span>with</span> con:
<span>... </span> ds.to_sql(<span>"data"</span>, con)
to_iterable_dataset
( num_shards: Optional = 1 )
Parameters
- num_shards (
int
, default to1
) — Number of shards to define when instantiating the iterable dataset. This is especially useful for big datasets to be able to shuffle properly, and also to enable fast parallel loading using a PyTorch DataLoader or in distributed setups for example. Shards are defined using datasets.Dataset.shard(): it simply slices the data without writing anything on disk.
Get a datasets.IterableDataset from a map-style datasets.Dataset. This is equivalent to loading a dataset in streaming mode with datasets.load_dataset(), but much faster since the data is streamed from local files.
Contrary to map-style datasets, iterable datasets are lazy and can only be iterated over (e.g. using a for loop). Since they are read sequentially in training loops, iterable datasets are much faster than map-style datasets. All the transformations applied to iterable datasets like filtering or processing are done on-the-fly when you start iterating over the dataset.
Still, it is possible to shuffle an iterable dataset using datasets.IterableDataset.shuffle(). This is a fast approximate shuffling that works best if you have multiple shards and if you specify a buffer size that is big enough.
To get the best speed performance, make sure your dataset doesn’t have an indices mapping. If this is the case, the data are not read contiguously, which can be slow sometimes. You can use ds = ds.flatten_indices()
to write your dataset in contiguous chunks of data and have optimal speed before switching to an iterable dataset.
Example:
Basic usage:
<span>>>> </span>ids = ds.to_iterable_dataset()
<span>>>> </span><span>for</span> example <span>in</span> ids:
<span>... </span> <span>pass</span>
With lazy filtering and processing:
<span>>>> </span>ids = ds.to_iterable_dataset()
<span>>>> </span>ids = ids.<span>filter</span>(filter_fn).<span>map</span>(process_fn)
<span>>>> </span><span>for</span> example <span>in</span> ids:
<span>... </span> <span>pass</span>
With sharding to enable efficient shuffling:
<span>>>> </span>ids = ds.to_iterable_dataset(num_shards=<span>64</span>)
<span>>>> </span>ids = ids.shuffle(buffer_size=<span>10_000</span>)
<span>>>> </span><span>for</span> example <span>in</span> ids:
<span>... </span> <span>pass</span>
With a PyTorch DataLoader:
<span>>>> </span><span>import</span> torch
<span>>>> </span>ids = ds.to_iterable_dataset(num_shards=<span>64</span>)
<span>>>> </span>ids = ids.<span>filter</span>(filter_fn).<span>map</span>(process_fn)
<span>>>> </span>dataloader = torch.utils.data.DataLoader(ids, num_workers=<span>4</span>)
<span>>>> </span><span>for</span> example <span>in</span> ids:
<span>... </span> <span>pass</span>
With a PyTorch DataLoader and shuffling:
<span>>>> </span><span>import</span> torch
<span>>>> </span>ids = ds.to_iterable_dataset(num_shards=<span>64</span>)
<span>>>> </span>ids = ids.shuffle(buffer_size=<span>10_000</span>)
<span>>>> </span>dataloader = torch.utils.data.DataLoader(ids, num_workers=<span>4</span>)
<span>>>> </span><span>for</span> example <span>in</span> ids:
<span>... </span> <span>pass</span>
In a distributed setup like PyTorch DDP with a PyTorch DataLoader and shuffling
<span>>>> </span><span>from</span> datasets.distributed <span>import</span> split_dataset_by_node
<span>>>> </span>ids = ds.to_iterable_dataset(num_shards=<span>512</span>)
<span>>>> </span>ids = ids.shuffle(buffer_size=<span>10_000</span>)
<span>>>> </span>ids = split_dataset_by_node(ids, world_size=<span>8</span>, rank=<span>0</span>)
<span>>>> </span>dataloader = torch.utils.data.DataLoader(ids, num_workers=<span>4</span>)
<span>>>> </span><span>for</span> example <span>in</span> ids:
<span>... </span> <span>pass</span>
With shuffling and multiple epochs:
<span>>>> </span>ids = ds.to_iterable_dataset(num_shards=<span>64</span>)
<span>>>> </span>ids = ids.shuffle(buffer_size=<span>10_000</span>, seed=<span>42</span>)
<span>>>> </span><span>for</span> epoch <span>in</span> <span>range</span>(n_epochs):
<span>... </span> ids.set_epoch(epoch)
<span>... </span> <span>for</span> example <span>in</span> ids:
<span>... </span> <span>pass</span>
Feel free to also use `IterableDataset.set_epoch()` when using a PyTorch DataLoader or in distributed setups.
add_faiss_index
( column: strindex_name: Optional = Nonedevice: Optional = Nonestring_factory: Optional = Nonemetric_type: Optional = Nonecustom_index: Optional = Nonebatch_size: int = 1000train_size: Optional = Nonefaiss_verbose: bool = Falsedtype = <class ‘numpy.float32’> )
Add a dense index using Faiss for fast retrieval. By default the index is done over the vectors of the specified column. You can specify device
if you want to run it on GPU (device
must be the GPU index). You can find more information about Faiss here:
- For string factory
Example:
<span>>>> </span>ds = datasets.load_dataset(<span>'crime_and_punish'</span>, split=<span>'train'</span>)
<span>>>> </span>ds_with_embeddings = ds.<span>map</span>(<span>lambda</span> example: {<span>'embeddings'</span>: embed(example[<span>'line'</span>])})
<span>>>> </span>ds_with_embeddings.add_faiss_index(column=<span>'embeddings'</span>)
<span>>>> </span>
<span>>>> </span>scores, retrieved_examples = ds_with_embeddings.get_nearest_examples(<span>'embeddings'</span>, embed(<span>'my new query'</span>), k=<span>10</span>)
<span>>>> </span>
<span>>>> </span>ds_with_embeddings.save_faiss_index(<span>'embeddings'</span>, <span>'my_index.faiss'</span>)
<span>>>> </span>ds = datasets.load_dataset(<span>'crime_and_punish'</span>, split=<span>'train'</span>)
<span>>>> </span>
<span>>>> </span>ds.load_faiss_index(<span>'embeddings'</span>, <span>'my_index.faiss'</span>)
<span>>>> </span>
<span>>>> </span>scores, retrieved_examples = ds.get_nearest_examples(<span>'embeddings'</span>, embed(<span>'my new query'</span>), k=<span>10</span>)
add_faiss_index_from_external_arrays
( external_arrays: arrayindex_name: strdevice: Optional = Nonestring_factory: Optional = Nonemetric_type: Optional = Nonecustom_index: Optional = Nonebatch_size: int = 1000train_size: Optional = Nonefaiss_verbose: bool = Falsedtype = <class ‘numpy.float32’> )
Add a dense index using Faiss for fast retrieval. The index is created using the vectors of external_arrays
. You can specify device
if you want to run it on GPU (device
must be the GPU index). You can find more information about Faiss here:
- For string factory
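A minimal sketch, assuming the embeddings were computed outside of datasets (the 768-dimensional random vectors below are purely illustrative placeholders):
>>> import numpy as np
>>> external_embeddings = np.random.rand(len(ds), 768).astype("float32")   # hypothetical precomputed vectors
>>> ds.add_faiss_index_from_external_arrays(external_arrays=external_embeddings, index_name="embeddings")
>>> scores, retrieved_examples = ds.get_nearest_examples("embeddings", external_embeddings[0], k=10)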
save_faiss_index
( index_name: strfile: Unionstorage_options: Optional = None )
Parameters
-
index_name (
str
) — The index_name/identifier of the index. This is the index_name that is used to call.get_nearest
or.search
. -
file (
str
) — The path to the serialized faiss index on disk or remote URI (e.g."s3://my-bucket/index.faiss"
). -
storage_options (
dict
, optional) — Key/value pairs to be passed on to the file-system backend, if any.Added in 2.11.0
Save a FaissIndex on disk.
load_faiss_index
( index_name: strfile: Uniondevice: Union = Nonestorage_options: Optional = None )
Parameters
-
index_name (
str
) — The index_name/identifier of the index. This is the index_name that is used to call.get_nearest
or.search
. -
file (
str
) — The path to the serialized faiss index on disk or remote URI (e.g."s3://my-bucket/index.faiss"
). -
device (Optional
Union[int, List[int]]
) — If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU. -
storage_options (
dict
, optional) — Key/value pairs to be passed on to the file-system backend, if any.Added in 2.11.0
Load a FaissIndex from disk.
If you want to do additional configurations, you can have access to the faiss index object by doing .get_index(index_name).faiss_index
to make it fit your needs.
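For example, a hedged sketch of tweaking the underlying Faiss object after loading; nprobe is only meaningful for IVF-style indexes, so treat that attribute as illustrative:
>>> ds.load_faiss_index('embeddings', 'my_index.faiss')
>>> faiss_index = ds.get_index('embeddings').faiss_index
>>> faiss_index.nprobe = 10   # illustrative: tune whatever search-time parameters the Faiss index type exposes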
add_elasticsearch_index
( column: strindex_name: Optional = Nonehost: Optional = Noneport: Optional = Nonees_client: Optional = Nonees_index_name: Optional = Nonees_index_config: Optional = None )
Parameters
- column (
str
) — The column of the documents to add to the index. - index_name (
str
, optional) — Theindex_name
/identifier of the index. This is the index name that is used to call get_nearest_examples() or search(). By default it corresponds tocolumn
. - host (
str
, optional, defaults tolocalhost
) — Host of where ElasticSearch is running. - port (
str
, optional, defaults to9200
) — Port of where ElasticSearch is running. - es_client (
elasticsearch.Elasticsearch
, optional) — The elasticsearch client used to create the index if host and port areNone
. - es_index_name (
str
, optional) — The elasticsearch index name used to create the index. - es_index_config (
dict
, optional) — The configuration of the elasticsearch index. Default config is:
Add a text index using ElasticSearch for fast retrieval. This is done in-place.
Example:
<span>>>> </span>es_client = elasticsearch.Elasticsearch()
<span>>>> </span>ds = datasets.load_dataset(<span>'crime_and_punish'</span>, split=<span>'train'</span>)
<span>>>> </span>ds.add_elasticsearch_index(column=<span>'line'</span>, es_client=es_client, es_index_name=<span>"my_es_index"</span>)
<span>>>> </span>scores, retrieved_examples = ds.get_nearest_examples(<span>'line'</span>, <span>'my new query'</span>, k=<span>10</span>)
load_elasticsearch_index
( index_name: stres_index_name: strhost: Optional = Noneport: Optional = Nonees_client: Optional = Nonees_index_config: Optional = None )
Parameters
- index_name (
str
) — Theindex_name
/identifier of the index. This is the index name that is used to callget_nearest
orsearch
. - es_index_name (
str
) — The name of elasticsearch index to load. - host (
str
, optional, defaults tolocalhost
) — Host of where ElasticSearch is running. - port (
str
, optional, defaults to9200
) — Port of where ElasticSearch is running. - es_client (
elasticsearch.Elasticsearch
, optional) — The elasticsearch client used to create the index if host and port areNone
. - es_index_config (
dict
, optional) — The configuration of the elasticsearch index. Default config is:
Load an existing text index using ElasticSearch for fast retrieval.
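A short sketch, assuming an index named "my_es_index" already exists on the ElasticSearch cluster (for example one created earlier with add_elasticsearch_index):
>>> import elasticsearch
>>> es_client = elasticsearch.Elasticsearch()
>>> ds.load_elasticsearch_index("line", es_index_name="my_es_index", es_client=es_client)
>>> scores, retrieved_examples = ds.get_nearest_examples("line", "my new query", k=10)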
List the index_name/identifiers of all the attached indexes.
get_index
( index_name: str )
Parameters
Return the index object registered under the given index_name.
drop_index
( index_name: str )
Parameters
Drop the index with the specified index_name.
search
( index_name: strquery: Unionk: int = 10**kwargs ) → (scores, indices)
Parameters
- index_name (
str
) — The name/identifier of the index. - query (
Union[str, np.ndarray]
) — The query as a string ifindex_name
is a text index or as a numpy array ifindex_name
is a vector index. - k (
int
) — The number of examples to retrieve.
Returns
(scores, indices)
A tuple of (scores, indices)
where:
- scores (
List[List[float]]
): the retrieval scores from either FAISS (IndexFlatL2
by default) or ElasticSearch of the retrieved examples - indices (
List[List[int]]
): the indices of the retrieved examples
Find the indices of the nearest examples in the dataset to the query.
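A minimal sketch, reusing the ds_with_embeddings and embed names assumed in the add_faiss_index example above:
>>> scores, indices = ds_with_embeddings.search('embeddings', embed('my new query'), k=10)   # embed() is the user-provided embedding function assumed earlier
>>> retrieved = ds_with_embeddings[[int(i) for i in indices]]   # fetch the matching rows by index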
search_batch
( index_name: strqueries: Unionk: int = 10**kwargs ) → (total_scores, total_indices)
Parameters
- index_name (
str
) — Theindex_name
/identifier of the index. - queries (
Union[List[str], np.ndarray]
) — The queries as a list of strings ifindex_name
is a text index or as a numpy array ifindex_name
is a vector index. - k (
int
) — The number of examples to retrieve per query.
Returns
(total_scores, total_indices)
A tuple of (total_scores, total_indices)
where:
- total_scores (
List[List[float]]
): the retrieval scores from either FAISS (IndexFlatL2
by default) or ElasticSearch of the retrieved examples per query - total_indices (
List[List[int]]
): the indices of the retrieved examples per query
Find the indices of the nearest examples in the dataset for each query.
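A minimal batched sketch under the same assumptions (ds_with_embeddings and embed come from the earlier Faiss example):
>>> import numpy as np
>>> queries = np.stack([embed('first query'), embed('second query')])   # embed() is the assumed embedding function
>>> total_scores, total_indices = ds_with_embeddings.search_batch('embeddings', queries, k=10)
>>> len(total_indices)   # one list of indices per query
2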
get_nearest_examples
( index_name: strquery: Unionk: int = 10**kwargs ) → (scores, examples)
Parameters
- index_name (
str
) — The index_name/identifier of the index. - query (
Union[str, np.ndarray]
) — The query as a string ifindex_name
is a text index or as a numpy array ifindex_name
is a vector index. - k (
int
) — The number of examples to retrieve.
Returns
(scores, examples)
A tuple of (scores, examples)
where:
- scores (
List[float]
): the retrieval scores from either FAISS (IndexFlatL2
by default) or ElasticSearch of the retrieved examples - examples (
dict
): the retrieved examples
Find the nearest examples in the dataset to the query.
get_nearest_examples_batch
( index_name: strqueries: Unionk: int = 10**kwargs ) → (total_scores, total_examples)
Parameters
- index_name (
str
) — Theindex_name
/identifier of the index. - queries (
Union[List[str], np.ndarray]
) — The queries as a list of strings ifindex_name
is a text index or as a numpy array ifindex_name
is a vector index. - k (
int
) — The number of examples to retrieve per query.
Returns
(total_scores, total_examples)
A tuple of (total_scores, total_examples)
where:
- total_scores (
List[List[float]]
): the retrieval scores from either FAISS (IndexFlatL2
by default) or ElasticSearch of the retrieved examples per query - total_examples (
List[dict]
): the retrieved examples per query
Find the nearest examples in the dataset to the query.
DatasetInfo object containing all the metadata in the dataset.
NamedSplit object corresponding to a named dataset split.
from_csv
( path_or_paths: Unionsplit: Optional = Nonefeatures: Optional = Nonecache_dir: str = Nonekeep_in_memory: bool = Falsenum_proc: Optional = None**kwargs )
Parameters
-
path_or_paths (
path-like
or list ofpath-like
) — Path(s) of the CSV file(s). -
split (NamedSplit, optional) — Split name to be assigned to the dataset.
-
features (Features, optional) — Dataset features.
-
cache_dir (
str
, optional, defaults to"~/.cache/huggingface/datasets"
) — Directory to cache data. -
keep_in_memory (
bool
, defaults toFalse
) — Whether to copy the data in-memory. -
num_proc (
int
, optional, defaults toNone
) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default.Added in 2.8.0
-
**kwargs (additional keyword arguments) — Keyword arguments to be passed to
pandas.read_csv
.
Create Dataset from CSV file(s).
Example:
<span>>>> </span>ds = Dataset.from_csv(<span>'path/to/dataset.csv'</span>)
from_json
( path_or_paths: Unionsplit: Optional = Nonefeatures: Optional = Nonecache_dir: str = Nonekeep_in_memory: bool = Falsefield: Optional = Nonenum_proc: Optional = None**kwargs )
Parameters
-
path_or_paths (
path-like
or list ofpath-like
) — Path(s) of the JSON or JSON Lines file(s). -
split (NamedSplit, optional) — Split name to be assigned to the dataset.
-
features (Features, optional) — Dataset features.
-
cache_dir (
str
, optional, defaults to"~/.cache/huggingface/datasets"
) — Directory to cache data. -
keep_in_memory (
bool
, defaults toFalse
) — Whether to copy the data in-memory. -
field (
str
, optional) — Field name of the JSON file containing the dataset. -
num_proc (
int
, optional defaults toNone
) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default.Added in 2.8.0
-
**kwargs (additional keyword arguments) — Keyword arguments to be passed to
JsonConfig
.
Create Dataset from JSON or JSON Lines file(s).
Example:
<span>>>> </span>ds = Dataset.from_json(<span>'path/to/dataset.json'</span>)
from_parquet
( path_or_paths: Unionsplit: Optional = Nonefeatures: Optional = Nonecache_dir: str = Nonekeep_in_memory: bool = Falsecolumns: Optional = Nonenum_proc: Optional = None**kwargs )
Parameters
-
path_or_paths (
path-like
or list ofpath-like
) — Path(s) of the Parquet file(s). -
split (
NamedSplit
, optional) — Split name to be assigned to the dataset. -
cache_dir (
str
, optional, defaults to"~/.cache/huggingface/datasets"
) — Directory to cache data. -
keep_in_memory (
bool
, defaults toFalse
) — Whether to copy the data in-memory. -
columns (
List[str]
, optional) — If notNone
, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’. -
num_proc (
int
, optional, defaults toNone
) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default.Added in 2.8.0
-
**kwargs (additional keyword arguments) — Keyword arguments to be passed to
ParquetConfig
.
Create Dataset from Parquet file(s).
Example:
<span>>>> </span>ds = Dataset.from_parquet(<span>'path/to/dataset.parquet'</span>)
from_text
( path_or_paths: Unionsplit: Optional = Nonefeatures: Optional = Nonecache_dir: str = Nonekeep_in_memory: bool = Falsenum_proc: Optional = None**kwargs )
Parameters
-
path_or_paths (
path-like
or list ofpath-like
) — Path(s) of the text file(s). -
split (
NamedSplit
, optional) — Split name to be assigned to the dataset. -
cache_dir (
str
, optional, defaults to"~/.cache/huggingface/datasets"
) — Directory to cache data. -
keep_in_memory (
bool
, defaults toFalse
) — Whether to copy the data in-memory. -
num_proc (
int
, optional, defaults toNone
) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default.Added in 2.8.0
-
**kwargs (additional keyword arguments) — Keyword arguments to be passed to
TextConfig
.
Create Dataset from text file(s).
Example:
<span>>>> </span>ds = Dataset.from_text(<span>'path/to/dataset.txt'</span>)
from_sql
( sql: Unioncon: Unionfeatures: Optional = Nonecache_dir: str = Nonekeep_in_memory: bool = False**kwargs )
Parameters
- sql (
str
orsqlalchemy.sql.Selectable
) — SQL query to be executed or a table name. - con (
str
orsqlite3.Connection
orsqlalchemy.engine.Connection
orsqlalchemy.engine.Engine
) — A URI string used to instantiate a database connection or a SQLite3/SQLAlchemy connection object. - features (Features, optional) — Dataset features.
- cache_dir (
str
, optional, defaults to"~/.cache/huggingface/datasets"
) — Directory to cache data. - keep_in_memory (
bool
, defaults toFalse
) — Whether to copy the data in-memory. - **kwargs (additional keyword arguments) — Keyword arguments to be passed to
SqlConfig
.
Create Dataset from SQL query or database table.
Example:
<span>>>> </span>
<span>>>> </span>ds = Dataset.from_sql(<span>"test_data"</span>, <span>"postgres:///db_name"</span>)
<span>>>> </span>
<span>>>> </span>ds = Dataset.from_sql(<span>"SELECT sentence FROM test_data"</span>, <span>"postgres:///db_name"</span>)
<span>>>> </span>
<span>>>> </span><span>from</span> sqlalchemy <span>import</span> select, text
<span>>>> </span>stmt = select([text(<span>"sentence"</span>)]).select_from(text(<span>"test_data"</span>))
<span>>>> </span>ds = Dataset.from_sql(stmt, <span>"postgres:///db_name"</span>)
The returned dataset can only be cached if con
is specified as a URI string.
prepare_for_task
( task: Unionid: int = 0 )
Parameters
-
task (
Union[str, TaskTemplate]
) — The task to prepare the dataset for during training and evaluation. Ifstr
, supported tasks include:"text-classification"
"question-answering"
If
TaskTemplate
, must be one of the task templates indatasets.tasks
. -
id (
int
, defaults to0
) — The id required to unambiguously identify the task template when multiple task templates of the same type are supported.
Prepare a dataset for the given task by casting the dataset’s Features to standardized column names and types as detailed in datasets.tasks
.
Casts datasets.DatasetInfo.features
according to a task-specific schema. Intended for single-use only, so all task templates are removed from datasets.DatasetInfo.task_templates
after casting.
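A hedged sketch, assuming the loaded dataset ships a text-classification task template in its DatasetInfo (the dataset name below is illustrative):
>>> from datasets import load_dataset
>>> ds = load_dataset("emotion", split="train")          # hypothetical dataset exposing a text-classification template
>>> ds = ds.prepare_for_task("text-classification")      # columns are renamed/cast to the standardized schema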
align_labels_with_mapping
( label2id: Dictlabel_column: str )
Parameters
- label2id (
dict
) — The label name to ID mapping to align the dataset with. - label_column (
str
) — The column name of labels to align on.
Align the dataset’s label ID and label name mapping to match an input label2id
mapping. This is useful when you want to ensure that a model’s predicted labels are aligned with the dataset. The alignment is done using the lowercase label names.
Example:
<span>>>> </span>
<span>>>> </span>ds = load_dataset(<span>"glue"</span>, <span>"mnli"</span>, split=<span>"train"</span>)
<span>>>> </span>
<span>>>> </span>label2id = {<span>'CONTRADICTION'</span>: <span>0</span>, <span>'NEUTRAL'</span>: <span>1</span>, <span>'ENTAILMENT'</span>: <span>2</span>}
<span>>>> </span>ds_aligned = ds.align_labels_with_mapping(label2id, <span>"label"</span>)
datasets.concatenate_datasets
( dsets: Listinfo: Optional = Nonesplit: Optional = Noneaxis: int = 0 )
Parameters
-
dsets (
List[datasets.Dataset]
) — List of Datasets to concatenate. -
info (
DatasetInfo
, optional) — Dataset information, like description, citation, etc. -
axis (
{0, 1}
, defaults to0
) — Axis to concatenate over, where0
means over rows (vertically) and1
means over columns (horizontally).Added in 1.6.0
Converts a list of Dataset with the same schema into a single Dataset.
Example:
<span>>>> </span>ds3 = concatenate_datasets([ds1, ds2])
datasets.interleave_datasets
( datasets: Listprobabilities: Optional = Noneseed: Optional = Noneinfo: Optional = Nonesplit: Optional = Nonestopping_strategy: Literal = ‘first_exhausted’ ) → Dataset or IterableDataset
Interleave several datasets (sources) into a single dataset. The new dataset is constructed by alternating between the sources to get the examples.
You can use this function on a list of Dataset objects, or on a list of IterableDataset objects.
- If
probabilities
isNone
(default) the new dataset is constructed by cycling between each source to get the examples. - If
probabilities
is notNone
, the new dataset is constructed by getting examples from a random source at a time according to the provided probabilities.
The resulting dataset ends when one of the source datasets runs out of examples, except when stopping_strategy is "all_exhausted" (oversampling), in which case the resulting dataset ends when all datasets have run out of examples at least once.
Note for iterable datasets:
In a distributed setup or in PyTorch DataLoader workers, the stopping strategy is applied per process. Therefore the “first_exhausted” strategy on a sharded iterable dataset can generate fewer samples in total (up to 1 missing sample per subdataset per worker).
Example:
For regular datasets (map-style):
<span>>>> </span><span>from</span> datasets <span>import</span> Dataset, interleave_datasets
<span>>>> </span>d1 = Dataset.from_dict({<span>"a"</span>: [<span>0</span>, <span>1</span>, <span>2</span>]})
<span>>>> </span>d2 = Dataset.from_dict({<span>"a"</span>: [<span>10</span>, <span>11</span>, <span>12</span>]})
<span>>>> </span>d3 = Dataset.from_dict({<span>"a"</span>: [<span>20</span>, <span>21</span>, <span>22</span>]})
<span>>>> </span>dataset = interleave_datasets([d1, d2, d3], probabilities=[<span>0.7</span>, <span>0.2</span>, <span>0.1</span>], seed=<span>42</span>, stopping_strategy=<span>"all_exhausted"</span>)
<span>>>> </span>dataset[<span>"a"</span>]
[<span>10</span>, <span>0</span>, <span>11</span>, <span>1</span>, <span>2</span>, <span>20</span>, <span>12</span>, <span>10</span>, <span>0</span>, <span>1</span>, <span>2</span>, <span>21</span>, <span>0</span>, <span>11</span>, <span>1</span>, <span>2</span>, <span>0</span>, <span>1</span>, <span>12</span>, <span>2</span>, <span>10</span>, <span>0</span>, <span>22</span>]
<span>>>> </span>dataset = interleave_datasets([d1, d2, d3], probabilities=[<span>0.7</span>, <span>0.2</span>, <span>0.1</span>], seed=<span>42</span>)
<span>>>> </span>dataset[<span>"a"</span>]
[<span>10</span>, <span>0</span>, <span>11</span>, <span>1</span>, <span>2</span>]
<span>>>> </span>dataset = interleave_datasets([d1, d2, d3])
<span>>>> </span>dataset[<span>"a"</span>]
[<span>0</span>, <span>10</span>, <span>20</span>, <span>1</span>, <span>11</span>, <span>21</span>, <span>2</span>, <span>12</span>, <span>22</span>]
<span>>>> </span>dataset = interleave_datasets([d1, d2, d3], stopping_strategy=<span>"all_exhausted"</span>)
<span>>>> </span>dataset[<span>"a"</span>]
[<span>0</span>, <span>10</span>, <span>20</span>, <span>1</span>, <span>11</span>, <span>21</span>, <span>2</span>, <span>12</span>, <span>22</span>]
<span>>>> </span>d1 = Dataset.from_dict({<span>"a"</span>: [<span>0</span>, <span>1</span>, <span>2</span>]})
<span>>>> </span>d2 = Dataset.from_dict({<span>"a"</span>: [<span>10</span>, <span>11</span>, <span>12</span>, <span>13</span>]})
<span>>>> </span>d3 = Dataset.from_dict({<span>"a"</span>: [<span>20</span>, <span>21</span>, <span>22</span>, <span>23</span>, <span>24</span>]})
<span>>>> </span>dataset = interleave_datasets([d1, d2, d3])
<span>>>> </span>dataset[<span>"a"</span>]
[<span>0</span>, <span>10</span>, <span>20</span>, <span>1</span>, <span>11</span>, <span>21</span>, <span>2</span>, <span>12</span>, <span>22</span>]
<span>>>> </span>dataset = interleave_datasets([d1, d2, d3], stopping_strategy=<span>"all_exhausted"</span>)
<span>>>> </span>dataset[<span>"a"</span>]
[<span>0</span>, <span>10</span>, <span>20</span>, <span>1</span>, <span>11</span>, <span>21</span>, <span>2</span>, <span>12</span>, <span>22</span>, <span>0</span>, <span>13</span>, <span>23</span>, <span>1</span>, <span>10</span>, <span>24</span>]
<span>>>> </span>dataset = interleave_datasets([d1, d2, d3], probabilities=[<span>0.7</span>, <span>0.2</span>, <span>0.1</span>], seed=<span>42</span>)
<span>>>> </span>dataset[<span>"a"</span>]
[<span>10</span>, <span>0</span>, <span>11</span>, <span>1</span>, <span>2</span>]
<span>>>> </span>dataset = interleave_datasets([d1, d2, d3], probabilities=[<span>0.7</span>, <span>0.2</span>, <span>0.1</span>], seed=<span>42</span>, stopping_strategy=<span>"all_exhausted"</span>)
<span>>>> </span>dataset[<span>"a"</span>]
[<span>10</span>, <span>0</span>, <span>11</span>, <span>1</span>, <span>2</span>, <span>20</span>, <span>12</span>, <span>13</span>, ..., <span>0</span>, <span>1</span>, <span>2</span>, <span>0</span>, <span>24</span>]
For datasets in streaming mode (iterable):
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset, interleave_datasets
<span>>>> </span>d1 = load_dataset(<span>"oscar"</span>, <span>"unshuffled_deduplicated_en"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span>d2 = load_dataset(<span>"oscar"</span>, <span>"unshuffled_deduplicated_fr"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span>dataset = interleave_datasets([d1, d2])
<span>>>> </span>iterator = <span>iter</span>(dataset)
<span>>>> </span><span>next</span>(iterator)
{<span>'text'</span>: <span>'Mtendere Village was inspired by the vision...'</span>}
<span>>>> </span><span>next</span>(iterator)
{<span>'text'</span>: <span>"Média de débat d'idées, de culture..."</span>}
datasets.distributed.split_dataset_by_node
( dataset: DatasetTyperank: intworld_size: int ) → Dataset or IterableDataset
Parameters
- dataset (Dataset or IterableDataset) — The dataset to split by node.
- rank (
int
) — Rank of the current node. - world_size (
int
) — Total number of nodes.
The dataset to be used on the node at rank rank
.
Split a dataset for the node at rank rank
in a pool of nodes of size world_size
.
For map-style datasets:
Each node is assigned a chunk of data, e.g. rank 0 is given the first chunk of the dataset. To maximize data loading throughput, chunks are made of contiguous data on disk if possible.
For iterable datasets:
If the dataset has a number of shards that is a factor of world_size
(i.e. if dataset.n_shards % world_size == 0
), then the shards are evenly assigned across the nodes, which is the most optimized. Otherwise, each node keeps 1 example out of world_size
, skipping the other examples.
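A minimal sketch; in practice rank and world_size come from the distributed environment (e.g. torch.distributed), and the values below are placeholders:
>>> from datasets.distributed import split_dataset_by_node
>>> ds_on_this_node = split_dataset_by_node(ds, rank=0, world_size=8)   # rank/world_size are illustrative placeholders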
When applying transforms on a dataset, the data are stored in cache files. The caching mechanism makes it possible to reload an existing cache file if it’s already been computed.
Reloading a dataset is possible since the cache files are named using the dataset fingerprint, which is updated after each transform.
If disabled, the library will no longer reload cached dataset files when applying transforms to the datasets. More precisely, if caching is disabled:
- cache files are always recreated
- cache files are written to a temporary directory that is deleted when the session closes
- cache files are named using a random hash instead of the dataset fingerprint
- use save_to_disk() to save a transformed dataset or it will be deleted when the session closes
- caching doesn’t affect load_dataset(). If you want to regenerate a dataset from scratch you should use the
download_mode
parameter in load_dataset().
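A minimal sketch of toggling this behavior globally, using the caching helpers exposed at the top level of the library:
>>> from datasets import disable_caching, enable_caching, is_caching_enabled
>>> disable_caching()
>>> is_caching_enabled()
False
>>> enable_caching()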
DatasetDict
Dictionary with split names as keys (‘train’, ‘test’ for example), and Dataset
objects as values. It also has dataset transform methods like map or filter, to process all the splits at once.
A dictionary (dict of str: datasets.Dataset) with dataset transform methods (map, filter, etc.)
The Apache Arrow tables backing each split.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds.data
The cache files containing the Apache Arrow table backing each split.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds.cache_files
{<span>'test'</span>: [{<span>'filename'</span>: <span>'/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-test.arrow'</span>}],
<span>'train'</span>: [{<span>'filename'</span>: <span>'/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-train.arrow'</span>}],
<span>'validation'</span>: [{<span>'filename'</span>: <span>'/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-validation.arrow'</span>}]}
Number of columns in each split of the dataset.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds.num_columns
{<span>'test'</span>: <span>2</span>, <span>'train'</span>: <span>2</span>, <span>'validation'</span>: <span>2</span>}
Number of rows in each split of the dataset.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds.num_rows
{<span>'test'</span>: <span>1066</span>, <span>'train'</span>: <span>8530</span>, <span>'validation'</span>: <span>1066</span>}
Names of the columns in each split of the dataset.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds.column_names
{<span>'test'</span>: [<span>'text'</span>, <span>'label'</span>],
<span>'train'</span>: [<span>'text'</span>, <span>'label'</span>],
<span>'validation'</span>: [<span>'text'</span>, <span>'label'</span>]}
Shape of each split of the dataset (number of rows, number of columns).
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds.shape
{<span>'test'</span>: (<span>1066</span>, <span>2</span>), <span>'train'</span>: (<span>8530</span>, <span>2</span>), <span>'validation'</span>: (<span>1066</span>, <span>2</span>)}
unique
( column: str ) → Dict[str
, list
]
Parameters
- column (
str
) — column name (list all the column names with column_names)
Dictionary of unique elements in the given column.
Return a list of the unique elements in a column for each split.
This is implemented in the low-level backend and is, as such, very fast.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds.unique(<span>"label"</span>)
{<span>'test'</span>: [<span>1</span>, <span>0</span>], <span>'train'</span>: [<span>1</span>, <span>0</span>], <span>'validation'</span>: [<span>1</span>, <span>0</span>]}
Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one. Be careful when running this command that no other process is currently using other cache files.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds.cleanup_cache_files()
{<span>'test'</span>: <span>0</span>, <span>'train'</span>: <span>0</span>, <span>'validation'</span>: <span>0</span>}
map
( function: Optional = Nonewith_indices: bool = Falsewith_rank: bool = Falseinput_columns: Union = Nonebatched: bool = Falsebatch_size: Optional = 1000drop_last_batch: bool = Falseremove_columns: Union = Nonekeep_in_memory: bool = Falseload_from_cache_file: Optional = Nonecache_file_names: Optional = Nonewriter_batch_size: Optional = 1000features: Optional = Nonedisable_nullable: bool = Falsefn_kwargs: Optional = Nonenum_proc: Optional = Nonedesc: Optional = None )
Apply a function to all the elements in the table (individually or in batches) and update the table (if the function does update examples). The transformation is applied to all the datasets of the dataset dictionary.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span><span>def</span> <span>add_prefix</span>(<span>example</span>):
<span>... </span> example[<span>"text"</span>] = <span>"Review: "</span> + example[<span>"text"</span>]
<span>... </span> <span>return</span> example
<span>>>> </span>ds = ds.<span>map</span>(add_prefix)
<span>>>> </span>ds[<span>"train"</span>][<span>0</span>:<span>3</span>][<span>"text"</span>]
[<span>'Review: the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>,
<span>'Review: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson'</span>s expanded vision of j . r . r . tolkien<span>'s middle-earth .'</span>,
<span>'Review: effective but too-tepid biopic'</span>]
<span>>>> </span>ds = ds.<span>map</span>(<span>lambda</span> example: tokenizer(example[<span>"text"</span>]), batched=<span>True</span>)
<span>>>> </span>ds = ds.<span>map</span>(add_prefix, num_proc=<span>4</span>)
filter
( function: Optional = Nonewith_indices: bool = Falsewith_rank: bool = Falseinput_columns: Union = Nonebatched: bool = Falsebatch_size: Optional = 1000keep_in_memory: bool = Falseload_from_cache_file: Optional = Nonecache_file_names: Optional = Nonewriter_batch_size: Optional = 1000fn_kwargs: Optional = Nonenum_proc: Optional = Nonedesc: Optional = None )
Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes examples according to the filter function. The transformation is applied to all the datasets of the dataset dictionary.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds.<span>filter</span>(<span>lambda</span> x: x[<span>"label"</span>] == <span>1</span>)
DatasetDict({
train: Dataset({
features: [<span>'text'</span>, <span>'label'</span>],
num_rows: <span>4265</span>
})
validation: Dataset({
features: [<span>'text'</span>, <span>'label'</span>],
num_rows: <span>533</span>
})
test: Dataset({
features: [<span>'text'</span>, <span>'label'</span>],
num_rows: <span>533</span>
})
})
sort
( column_names: Unionreverse: Union = Falsekind = ‘deprecated’null_placement: str = ‘at_end’keep_in_memory: bool = Falseload_from_cache_file: Optional = Noneindices_cache_file_names: Optional = Nonewriter_batch_size: Optional = 1000 )
Create a new dataset sorted according to a single or multiple columns.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>'rotten_tomatoes'</span>)
<span>>>> </span>ds[<span>'train'</span>][<span>'label'</span>][:<span>10</span>]
[<span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>]
<span>>>> </span>sorted_ds = ds.sort(<span>'label'</span>)
<span>>>> </span>sorted_ds[<span>'train'</span>][<span>'label'</span>][:<span>10</span>]
[<span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>]
<span>>>> </span>another_sorted_ds = ds.sort([<span>'label'</span>, <span>'text'</span>], reverse=[<span>True</span>, <span>False</span>])
<span>>>> </span>another_sorted_ds[<span>'train'</span>][<span>'label'</span>][:<span>10</span>]
[<span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>]
shuffle
( seeds: Union = Noneseed: Optional = Nonegenerators: Optional = Nonekeep_in_memory: bool = Falseload_from_cache_file: Optional = Noneindices_cache_file_names: Optional = Nonewriter_batch_size: Optional = 1000 )
Create a new Dataset where the rows are shuffled.
The transformation is applied to all the datasets of the dataset dictionary.
Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds[<span>"train"</span>][<span>"label"</span>][:<span>10</span>]
[<span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>]
<span>>>> </span>shuffled_ds = ds.shuffle(seed=<span>42</span>)
<span>>>> </span>shuffled_ds[<span>"train"</span>][<span>"label"</span>][:<span>10</span>]
[<span>0</span>, <span>1</span>, <span>0</span>, <span>1</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>]
set_format
( type: Optional = Nonecolumns: Optional = Noneoutput_all_columns: bool = False**format_kwargs )
Parameters
- type (
str
, optional) — Output type selected in[None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']
.None
means__getitem__
returns python objects (default). - columns (
List[str]
, optional) — Columns to format in the output.None
means__getitem__
returns all columns (default). - output_all_columns (
bool
, defaults to False) — Keep un-formatted columns as well in the output (as python objects), - **format_kwargs (additional keyword arguments) — Keywords arguments passed to the convert function like
np.array
,torch.tensor
ortensorflow.ragged.constant
.
Set __getitem__
return format (type and columns). The format is set for every dataset in the dataset dictionary.
It is possible to call map
after calling set_format
. Since map
may add new columns, the list of formatted columns gets updated. In this case, if you apply map
on a dataset to add a new column, then this column will be formatted:
new formatted columns = (all columns - previously unformatted columns)
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span><span>from</span> transformers <span>import</span> AutoTokenizer
<span>>>> </span>tokenizer = AutoTokenizer.from_pretrained(<span>"bert-base-cased"</span>)
<span>>>> </span>ds = ds.<span>map</span>(<span>lambda</span> x: tokenizer(x[<span>"text"</span>], truncation=<span>True</span>, padding=<span>True</span>), batched=<span>True</span>)
<span>>>> </span>ds.set_format(<span>type</span>=<span>"numpy"</span>, columns=[<span>'input_ids'</span>, <span>'token_type_ids'</span>, <span>'attention_mask'</span>, <span>'label'</span>])
<span>>>> </span>ds[<span>"train"</span>].<span>format</span>
{<span>'columns'</span>: [<span>'input_ids'</span>, <span>'token_type_ids'</span>, <span>'attention_mask'</span>, <span>'label'</span>],
<span>'format_kwargs'</span>: {},
<span>'output_all_columns'</span>: <span>False</span>,
<span>'type'</span>: <span>'numpy'</span>}
Reset __getitem__
return format to python objects and all columns. The transformation is applied to all the datasets of the dataset dictionary.
Same as self.set_format()
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span><span>from</span> transformers <span>import</span> AutoTokenizer
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>tokenizer = AutoTokenizer.from_pretrained(<span>"bert-base-cased"</span>)
<span>>>> </span>ds = ds.<span>map</span>(<span>lambda</span> x: tokenizer(x[<span>"text"</span>], truncation=<span>True</span>, padding=<span>True</span>), batched=<span>True</span>)
<span>>>> </span>ds.set_format(<span>type</span>=<span>"numpy"</span>, columns=[<span>'input_ids'</span>, <span>'token_type_ids'</span>, <span>'attention_mask'</span>, <span>'label'</span>])
<span>>>> </span>ds[<span>"train"</span>].<span>format</span>
{<span>'columns'</span>: [<span>'input_ids'</span>, <span>'token_type_ids'</span>, <span>'attention_mask'</span>, <span>'label'</span>],
<span>'format_kwargs'</span>: {},
<span>'output_all_columns'</span>: <span>False</span>,
<span>'type'</span>: <span>'numpy'</span>}
<span>>>> </span>ds.reset_format()
<span>>>> </span>ds[<span>"train"</span>].<span>format</span>
{<span>'columns'</span>: [<span>'text'</span>, <span>'label'</span>, <span>'input_ids'</span>, <span>'token_type_ids'</span>, <span>'attention_mask'</span>],
<span>'format_kwargs'</span>: {},
<span>'output_all_columns'</span>: <span>False</span>,
<span>'type'</span>: <span>None</span>}
formatted_as
( type: Optional = Nonecolumns: Optional = Noneoutput_all_columns: bool = False**format_kwargs )
Parameters
- type (
str
, optional) — Output type selected in[None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']
.None
means__getitem__
returns python objects (default). - columns (
List[str]
, optional) — Columns to format in the output.None
means__getitem__
returns all columns (default). - output_all_columns (
bool
, defaults to False) — Keep un-formatted columns as well in the output (as python objects). - **format_kwargs (additional keyword arguments) — Keywords arguments passed to the convert function like
np.array
,torch.tensor
ortensorflow.ragged.constant
.
To be used in a with
statement. Set __getitem__
return format (type and columns). The transformation is applied to all the datasets of the dataset dictionary.
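A minimal sketch of the context-manager usage, assuming ds has been tokenized as in the set_format example above so the listed columns exist:
>>> with ds.formatted_as(type="numpy", columns=['input_ids', 'token_type_ids', 'attention_mask', 'label']):
...     batch = ds["train"][:8]     # columns come back as numpy arrays inside the block (assumes tokenized ds)
>>> # on exit, the previous format is restored for every split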
with_format
( type: Optional = Nonecolumns: Optional = Noneoutput_all_columns: bool = False**format_kwargs )
Parameters
- type (
str
, optional) — Output type selected in[None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']
.None
means__getitem__
returns python objects (default). - columns (
List[str]
, optional) — Columns to format in the output.None
means__getitem__
returns all columns (default). - output_all_columns (
bool
, defaults toFalse
) — Keep un-formatted columns as well in the output (as python objects). - **format_kwargs (additional keyword arguments) — Keywords arguments passed to the convert function like
np.array
,torch.tensor
ortensorflow.ragged.constant
.
Set __getitem__
return format (type and columns). The data formatting is applied on-the-fly. The format type
(for example “numpy”) is used to format batches when using __getitem__
. The format is set for every dataset in the dataset dictionary.
It’s also possible to use custom transforms for formatting using with_transform().
Contrary to set_format(), with_format
returns a new DatasetDict object with new Dataset objects.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span><span>from</span> transformers <span>import</span> AutoTokenizer
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>tokenizer = AutoTokenizer.from_pretrained(<span>"bert-base-cased"</span>)
<span>>>> </span>ds = ds.<span>map</span>(<span>lambda</span> x: tokenizer(x[<span>'text'</span>], truncation=<span>True</span>, padding=<span>True</span>), batched=<span>True</span>)
<span>>>> </span>ds[<span>"train"</span>].<span>format</span>
{<span>'columns'</span>: [<span>'text'</span>, <span>'label'</span>, <span>'input_ids'</span>, <span>'token_type_ids'</span>, <span>'attention_mask'</span>],
<span>'format_kwargs'</span>: {},
<span>'output_all_columns'</span>: <span>False</span>,
<span>'type'</span>: <span>None</span>}
<span>>>> </span>ds = ds.with_format(<span>type</span>=<span>'tensorflow'</span>, columns=[<span>'input_ids'</span>, <span>'token_type_ids'</span>, <span>'attention_mask'</span>, <span>'label'</span>])
<span>>>> </span>ds[<span>"train"</span>].<span>format</span>
{<span>'columns'</span>: [<span>'input_ids'</span>, <span>'token_type_ids'</span>, <span>'attention_mask'</span>, <span>'label'</span>],
<span>'format_kwargs'</span>: {},
<span>'output_all_columns'</span>: <span>False</span>,
<span>'type'</span>: <span>'tensorflow'</span>}
with_transform
( transform: Optionalcolumns: Optional = Noneoutput_all_columns: bool = False )
Parameters
- transform (
Callable
, optional) — User-defined formatting transform, replaces the format defined by set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in__getitem__
. - columns (
List[str]
, optional) — Columns to format in the output. If specified, then the input batch of the transform only contains those columns. - output_all_columns (
bool
, defaults to False) — Keep un-formatted columns as well in the output (as python objects). If set toTrue
, then the other un-formatted columns are kept with the output of the transform.
Set __getitem__
return format using this transform. The transform is applied on-the-fly on batches when __getitem__
is called. The transform is set for every dataset in the dataset dictionary.
As set_format(), this can be reset using reset_format().
Contrary to set_transform()
, with_transform
returns a new DatasetDict object with new Dataset objects.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span><span>from</span> transformers <span>import</span> AutoTokenizer
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>tokenizer = AutoTokenizer.from_pretrained(<span>"bert-base-cased"</span>)
<span>>>> </span><span>def</span> <span>encode</span>(<span>example</span>):
<span>... </span> <span>return</span> tokenizer(example[<span>'text'</span>], truncation=<span>True</span>, padding=<span>True</span>, return_tensors=<span>"pt"</span>)
<span>>>> </span>ds = ds.with_transform(encode)
<span>>>> </span>ds[<span>"train"</span>][<span>0</span>]
{<span>'attention_mask'</span>: tensor([<span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>,
<span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>,
<span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>, <span>1</span>]),
<span>'input_ids'</span>: tensor([ <span>101</span>, <span>1103</span>, <span>2067</span>, <span>1110</span>, <span>17348</span>, <span>1106</span>, <span>1129</span>, <span>1103</span>, <span>6880</span>, <span>1432</span>,
<span>112</span>, <span>188</span>, <span>1207</span>, <span>107</span>, <span>14255</span>, <span>1389</span>, <span>107</span>, <span>1105</span>, <span>1115</span>, <span>1119</span>,
<span>112</span>, <span>188</span>, <span>1280</span>, <span>1106</span>, <span>1294</span>, <span>170</span>, <span>24194</span>, <span>1256</span>, <span>3407</span>, <span>1190</span>,
<span>170</span>, <span>11791</span>, <span>5253</span>, <span>188</span>, <span>1732</span>, <span>7200</span>, <span>10947</span>, <span>12606</span>, <span>2895</span>, <span>117</span>,
<span>179</span>, <span>7766</span>, <span>118</span>, <span>172</span>, <span>15554</span>, <span>1181</span>, <span>3498</span>, <span>6961</span>, <span>3263</span>, <span>1137</span>,
<span>188</span>, <span>1566</span>, <span>7912</span>, <span>14516</span>, <span>6997</span>, <span>119</span>, <span>102</span>]),
<span>'token_type_ids'</span>: tensor([<span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>,
<span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>,
<span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>, <span>0</span>])}
flatten
Flatten the Apache Arrow Table of each split (nested features are flattened). Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"squad"</span>)
<span>>>> </span>ds[<span>"train"</span>].features
{<span>'answers'</span>: <span>Sequence</span>(feature={<span>'text'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>), <span>'answer_start'</span>: Value(dtype=<span>'int32'</span>, <span>id</span>=<span>None</span>)}, length=-<span>1</span>, <span>id</span>=<span>None</span>),
<span>'context'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'id'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'question'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'title'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
<span>>>> </span>ds.flatten()
DatasetDict({
train: Dataset({
features: [<span>'id'</span>, <span>'title'</span>, <span>'context'</span>, <span>'question'</span>, <span>'answers.text'</span>, <span>'answers.answer_start'</span>],
num_rows: <span>87599</span>
})
validation: Dataset({
features: [<span>'id'</span>, <span>'title'</span>, <span>'context'</span>, <span>'question'</span>, <span>'answers.text'</span>, <span>'answers.answer_start'</span>],
num_rows: <span>10570</span>
})
})
cast
( features: Features )
Parameters
- features (Features) — New features to cast the dataset to. The name and order of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g.
string
<-> ClassLabel
you should use map() to update the dataset.
Cast the dataset to a new set of features. The transformation is applied to all the datasets of the dataset dictionary.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds[<span>"train"</span>].features
{<span>'label'</span>: ClassLabel(num_classes=<span>2</span>, names=[<span>'neg'</span>, <span>'pos'</span>], <span>id</span>=<span>None</span>),
<span>'text'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
<span>>>> </span>new_features = ds[<span>"train"</span>].features.copy()
<span>>>> </span>new_features[<span>'label'</span>] = ClassLabel(names=[<span>'bad'</span>, <span>'good'</span>])
<span>>>> </span>new_features[<span>'text'</span>] = Value(<span>'large_string'</span>)
<span>>>> </span>ds = ds.cast(new_features)
<span>>>> </span>ds[<span>"train"</span>].features
{<span>'label'</span>: ClassLabel(num_classes=<span>2</span>, names=[<span>'bad'</span>, <span>'good'</span>], <span>id</span>=<span>None</span>),
<span>'text'</span>: Value(dtype=<span>'large_string'</span>, <span>id</span>=<span>None</span>)}
cast_column
( column: strfeature )
Parameters
Cast column to feature for decoding.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds[<span>"train"</span>].features
{<span>'label'</span>: ClassLabel(num_classes=<span>2</span>, names=[<span>'neg'</span>, <span>'pos'</span>], <span>id</span>=<span>None</span>),
<span>'text'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
<span>>>> </span>ds = ds.cast_column(<span>'label'</span>, ClassLabel(names=[<span>'bad'</span>, <span>'good'</span>]))
<span>>>> </span>ds[<span>"train"</span>].features
{<span>'label'</span>: ClassLabel(num_classes=<span>2</span>, names=[<span>'bad'</span>, <span>'good'</span>], <span>id</span>=<span>None</span>),
<span>'text'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
remove_columns
( column_names: Union ) → DatasetDict
Parameters
A copy of the dataset object without the columns to remove.
Remove one or several column(s) from each split in the dataset and the features associated to the column(s).
The transformation is applied to all the splits of the dataset dictionary.
You can also remove a column using map() with remove_columns
but the present method doesn’t copy the data of the remaining columns and is thus faster.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds = ds.remove_columns(<span>"label"</span>)
DatasetDict({
train: Dataset({
features: [<span>'text'</span>],
num_rows: <span>8530</span>
})
validation: Dataset({
features: [<span>'text'</span>],
num_rows: <span>1066</span>
})
test: Dataset({
features: [<span>'text'</span>],
num_rows: <span>1066</span>
})
})
rename_column
( original_column_name: strnew_column_name: str )
Parameters
- original_column_name (
str
) — Name of the column to rename. - new_column_name (
str
) — New name for the column.
Rename a column in the dataset and move the features associated to the original column under the new column name. The transformation is applied to all the datasets of the dataset dictionary.
You can also rename a column using map() with remove_columns
but the present method:
- takes care of moving the original features under the new column name.
- doesn’t copy the data to a new dataset and is thus much faster.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds = ds.rename_column(<span>"label"</span>, <span>"label_new"</span>)
DatasetDict({
train: Dataset({
features: [<span>'text'</span>, <span>'label_new'</span>],
num_rows: <span>8530</span>
})
validation: Dataset({
features: [<span>'text'</span>, <span>'label_new'</span>],
num_rows: <span>1066</span>
})
test: Dataset({
features: [<span>'text'</span>, <span>'label_new'</span>],
num_rows: <span>1066</span>
})
})
rename_columns
( column_mapping: Dict ) → DatasetDict
Parameters
A copy of the dataset with renamed columns.
Rename several columns in the dataset, and move the features associated to the original columns under the new column names. The transformation is applied to all the datasets of the dataset dictionary.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds.rename_columns({<span>'text'</span>: <span>'text_new'</span>, <span>'label'</span>: <span>'label_new'</span>})
DatasetDict({
train: Dataset({
features: [<span>'text_new'</span>, <span>'label_new'</span>],
num_rows: <span>8530</span>
})
validation: Dataset({
features: [<span>'text_new'</span>, <span>'label_new'</span>],
num_rows: <span>1066</span>
})
test: Dataset({
features: [<span>'text_new'</span>, <span>'label_new'</span>],
num_rows: <span>1066</span>
})
})
select_columns
( column_names: Union )
Parameters
Select one or several column(s) from each split in the dataset and the features associated to the column(s).
The transformation is applied to all the splits of the dataset dictionary.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>)
<span>>>> </span>ds.select_columns(<span>"text"</span>)
DatasetDict({
train: Dataset({
features: [<span>'text'</span>],
num_rows: <span>8530</span>
})
validation: Dataset({
features: [<span>'text'</span>],
num_rows: <span>1066</span>
})
test: Dataset({
features: [<span>'text'</span>],
num_rows: <span>1066</span>
})
})
class_encode_column
( column: strinclude_nulls: bool = False )
Parameters
-
include_nulls (
bool
, defaults toFalse
) — Whether to include null values in the class labels. IfTrue
, the null values will be encoded as the"None"
class label.Added in 1.14.2
Casts the given column as ClassLabel and updates the tables.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"boolq"</span>)
<span>>>> </span>ds[<span>"train"</span>].features
{<span>'answer'</span>: Value(dtype=<span>'bool'</span>, <span>id</span>=<span>None</span>),
<span>'passage'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'question'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
<span>>>> </span>ds = ds.class_encode_column(<span>"answer"</span>)
<span>>>> </span>ds[<span>"train"</span>].features
{<span>'answer'</span>: ClassLabel(num_classes=<span>2</span>, names=[<span>'False'</span>, <span>'True'</span>], <span>id</span>=<span>None</span>),
<span>'passage'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'question'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
push_to_hub
( repo_idconfig_name: str = ‘default’set_default: Optional = Nonedata_dir: Optional = Nonecommit_message: Optional = Nonecommit_description: Optional = Noneprivate: Optional = Falsetoken: Optional = Nonerevision: Optional = Nonebranch = ‘deprecated’create_pr: Optional = Falsemax_shard_size: Union = Nonenum_shards: Optional = Noneembed_external_files: bool = True )
Pushes the DatasetDict to the hub as a Parquet dataset. The DatasetDict is pushed using HTTP requests and does not require git or git-lfs to be installed.
Each dataset split will be pushed independently. The pushed dataset will keep the original split names.
The resulting Parquet files are self-contained by default: if your dataset contains Image or Audio data, the Parquet files will store the bytes of your images or audio files. You can disable this by setting embed_external_files
to False.
Example:
<span>>>> </span>dataset_dict.push_to_hub(<span>"<organization>/<dataset_id>"</span>)
<span>>>> </span>dataset_dict.push_to_hub(<span>"<organization>/<dataset_id>"</span>, private=<span>True</span>)
<span>>>> </span>dataset_dict.push_to_hub(<span>"<organization>/<dataset_id>"</span>, max_shard_size=<span>"1GB"</span>)
<span>>>> </span>dataset_dict.push_to_hub(<span>"<organization>/<dataset_id>"</span>, num_shards={<span>"train"</span>: <span>1024</span>, <span>"test"</span>: <span>8</span>})
If you want to add a new configuration (or subset) to a dataset (e.g. if the dataset has multiple tasks/versions/languages):
<span>>>> </span>english_dataset.push_to_hub(<span>"<organization>/<dataset_id>"</span>, <span>"en"</span>)
<span>>>> </span>french_dataset.push_to_hub(<span>"<organization>/<dataset_id>"</span>, <span>"fr"</span>)
<span>>>> </span>
<span>>>> </span>english_dataset = load_dataset(<span>"<organization>/<dataset_id>"</span>, <span>"en"</span>)
<span>>>> </span>french_dataset = load_dataset(<span>"<organization>/<dataset_id>"</span>, <span>"fr"</span>)
save_to_disk
( dataset_dict_path: Unionfs = ‘deprecated’max_shard_size: Union = Nonenum_shards: Optional = Nonenum_proc: Optional = Nonestorage_options: Optional = None )
Saves a dataset dict to a filesystem using fsspec.spec.AbstractFileSystem
.
All the Image() and Audio() data are stored in the arrow files. If you want to store paths or urls, please use the Value(“string”) type.
Example:
<span>>>> </span>dataset_dict.save_to_disk(<span>"path/to/dataset/directory"</span>)
<span>>>> </span>dataset_dict.save_to_disk(<span>"path/to/dataset/directory"</span>, max_shard_size=<span>"1GB"</span>)
<span>>>> </span>dataset_dict.save_to_disk(<span>"path/to/dataset/directory"</span>, num_shards={<span>"train"</span>: <span>1024</span>, <span>"test"</span>: <span>8</span>})
load_from_disk
( dataset_dict_path: Unionfs = ‘deprecated’keep_in_memory: Optional = Nonestorage_options: Optional = None )
Parameters
-
dataset_dict_path (
path-like
) — Path (e.g. "dataset/train"
) or remote URI (e.g. "s3://my-bucket/dataset/train"
) of the dataset dict directory where the dataset dict will be loaded from. -
fs (
fsspec.spec.AbstractFileSystem
, optional) — Instance of the remote filesystem where the dataset will be saved to.Deprecated in 2.8.0
fs
was deprecated in version 2.8.0 and will be removed in 3.0.0. Please usestorage_options
instead, e.g.storage_options=fs.storage_options
-
keep_in_memory (
bool
, defaults toNone
) — Whether to copy the dataset in-memory. IfNone
, the dataset will not be copied in-memory unless explicitly enabled by settingdatasets.config.IN_MEMORY_MAX_SIZE
to nonzero. See more details in the improve performance section. -
storage_options (
dict
, optional) — Key/value pairs to be passed on to the file-system backend, if any.Added in 2.8.0
Load a dataset that was previously saved using save_to_disk
from a filesystem using fsspec.spec.AbstractFileSystem
.
Example:
<span>>>> </span>ds = load_from_disk(<span>'path/to/dataset/directory'</span>)
from_csv
( path_or_paths: Dictfeatures: Optional = Nonecache_dir: str = Nonekeep_in_memory: bool = False**kwargs )
Parameters
- path_or_paths (
dict
of path-like) — Path(s) of the CSV file(s). - features (Features, optional) — Dataset features.
- cache_dir (str, optional, defaults to
"~/.cache/huggingface/datasets"
) — Directory to cache data. - keep_in_memory (
bool
, defaults toFalse
) — Whether to copy the data in-memory. - **kwargs (additional keyword arguments) — Keyword arguments to be passed to
pandas.read_csv
.
Create DatasetDict from CSV file(s).
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> DatasetDict
<span>>>> </span>ds = DatasetDict.from_csv({<span>'train'</span>: <span>'path/to/dataset.csv'</span>})
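Because path_or_paths is a dictionary keyed by split name, several splits can be built in one call; a short sketch with hypothetical file paths:
>>> from datasets import DatasetDict
>>> ds = DatasetDict.from_csv({"train": "path/to/train.csv", "test": "path/to/test.csv"})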
from_json
( path_or_paths: Dictfeatures: Optional = Nonecache_dir: str = Nonekeep_in_memory: bool = False**kwargs )
Parameters
- path_or_paths (
path-like
or list ofpath-like
) — Path(s) of the JSON Lines file(s). - features (Features, optional) — Dataset features.
- cache_dir (str, optional, defaults to
"~/.cache/huggingface/datasets"
) — Directory to cache data. - keep_in_memory (
bool
, defaults toFalse
) — Whether to copy the data in-memory. - **kwargs (additional keyword arguments) — Keyword arguments to be passed to
JsonConfig
.
Create DatasetDict from JSON Lines file(s).
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> DatasetDict
<span>>>> </span>ds = DatasetDict.from_json({<span>'train'</span>: <span>'path/to/dataset.json'</span>})
from_parquet
( path_or_paths: Dictfeatures: Optional = Nonecache_dir: str = Nonekeep_in_memory: bool = Falsecolumns: Optional = None**kwargs )
Parameters
- path_or_paths (
dict
of path-like) — Path(s) of the Parquet file(s). - features (Features, optional) — Dataset features.
- cache_dir (
str
, optional, defaults to"~/.cache/huggingface/datasets"
) — Directory to cache data. - keep_in_memory (
bool
, defaults toFalse
) — Whether to copy the data in-memory. - columns (
List[str]
, optional) — If notNone
, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. ‘a’ will select ‘a.b’, ‘a.c’, and ‘a.d.e’. - **kwargs (additional keyword arguments) — Keyword arguments to be passed to
ParquetConfig
.
Create DatasetDict from Parquet file(s).
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> DatasetDict
<span>>>> </span>ds = DatasetDict.from_parquet({<span>'train'</span>: <span>'path/to/dataset/parquet'</span>})
from_text
( path_or_paths: Dictfeatures: Optional = Nonecache_dir: str = Nonekeep_in_memory: bool = False**kwargs )
Parameters
- path_or_paths (
dict
of path-like) — Path(s) of the text file(s). - features (Features, optional) — Dataset features.
- cache_dir (
str
, optional, defaults to"~/.cache/huggingface/datasets"
) — Directory to cache data. - keep_in_memory (
bool
, defaults toFalse
) — Whether to copy the data in-memory. - **kwargs (additional keyword arguments) — Keyword arguments to be passed to
TextConfig
.
Create DatasetDict from text file(s).
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> DatasetDict
<span>>>> </span>ds = DatasetDict.from_text({<span>'train'</span>: <span>'path/to/dataset.txt'</span>})
prepare_for_task
( task: Unionid: int = 0 )
Parameters
-
task (
Union[str, TaskTemplate]
) — The task to prepare the dataset for during training and evaluation. Ifstr
, supported tasks include:"text-classification"
"question-answering"
If
TaskTemplate
, must be one of the task templates indatasets.tasks
. -
id (
int
, defaults to0
) — The id required to unambiguously identify the task template when multiple task templates of the same type are supported.
Prepare a dataset for the given task by casting the dataset’s Features to standardized column names and types as detailed in datasets.tasks
.
Casts datasets.DatasetInfo.features
according to a task-specific schema. Intended for single-use only, so all task templates are removed from datasets.DatasetInfo.task_templates
after casting.
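Example (a sketch assuming a dataset that ships a question-answering task template, such as SQuAD):
>>> from datasets import load_dataset
>>> ds = load_dataset("squad")
>>> ds = ds.prepare_for_task("question-answering")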
IterableDataset
The base class IterableDataset implements an iterable Dataset backed by python generators.
class datasets.IterableDataset
( ex_iterable: _BaseExamplesIterableinfo: Optional = Nonesplit: Optional = Noneformatting: Optional = Noneshuffling: Optional = Nonedistributed: Optional = Nonetoken_per_repo_id: Optional = Noneformat_type = ‘deprecated’ )
A Dataset backed by an iterable.
from_generator
( generator: Callablefeatures: Optional = Nonegen_kwargs: Optional = Nonesplit: NamedSplit = NamedSplit(‘train’) ) → IterableDataset
Parameters
-
generator (
Callable
) — A generator function thatyields
examples. -
gen_kwargs (dict, optional) — Keyword arguments to be passed to the generator callable. You can define a sharded iterable dataset by passing the list of shards in gen_kwargs. This can be used to improve shuffling and when iterating over the dataset with multiple workers. -
split (NamedSplit, defaults to
Split.TRAIN
) — Split name to be assigned to the dataset.Added in 2.21.0
Create an Iterable Dataset from a generator.
Example:
<span>>>> </span><span>def</span> <span>gen</span>():
<span>... </span> <span>yield</span> {<span>"text"</span>: <span>"Good"</span>, <span>"label"</span>: <span>0</span>}
<span>... </span> <span>yield</span> {<span>"text"</span>: <span>"Bad"</span>, <span>"label"</span>: <span>1</span>}
...
<span>>>> </span>ds = IterableDataset.from_generator(gen)
<span>>>> </span><span>def</span> <span>gen</span>(<span>shards</span>):
<span>... </span> <span>for</span> shard <span>in</span> shards:
<span>... </span> <span>with</span> <span>open</span>(shard) <span>as</span> f:
<span>... </span> <span>for</span> line <span>in</span> f:
<span>... </span> <span>yield</span> {<span>"line"</span>: line}
...
<span>>>> </span>shards = [<span>f"data<span>{i}</span>.txt"</span> <span>for</span> i <span>in</span> <span>range</span>(<span>32</span>)]
<span>>>> </span>ds = IterableDataset.from_generator(gen, gen_kwargs={<span>"shards"</span>: shards})
<span>>>> </span>ds = ds.shuffle(seed=<span>42</span>, buffer_size=<span>10_000</span>)
<span>>>> </span><span>from</span> torch.utils.data <span>import</span> DataLoader
<span>>>> </span>dataloader = DataLoader(ds.with_format(<span>"torch"</span>), num_workers=<span>4</span>)
remove_columns
( column_names: Union ) → IterableDataset
Parameters
A copy of the dataset object without the columns to remove.
Remove one or several column(s) in the dataset and the features associated to them. The removal is done on-the-fly on the examples when iterating over the dataset.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span><span>next</span>(<span>iter</span>(ds))
{<span>'text'</span>: <span>'the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>, <span>'label'</span>: <span>1</span>}
<span>>>> </span>ds = ds.remove_columns(<span>"label"</span>)
<span>>>> </span><span>next</span>(<span>iter</span>(ds))
{<span>'text'</span>: <span>'the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>}
select_columns
( column_names: Union ) → IterableDataset
Parameters
A copy of the dataset object with selected columns.
Select one or several column(s) in the dataset and the features associated to them. The selection is done on-the-fly on the examples when iterating over the dataset.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span><span>next</span>(<span>iter</span>(ds))
{<span>'text'</span>: <span>'the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>, <span>'label'</span>: <span>1</span>}
<span>>>> </span>ds = ds.select_columns(<span>"text"</span>)
<span>>>> </span><span>next</span>(<span>iter</span>(ds))
{<span>'text'</span>: <span>'the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>}
cast_column
( column: strfeature: Union ) → IterableDataset
Parameters
Cast column to feature for decoding.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset, Audio
<span>>>> </span>ds = load_dataset(<span>"PolyAI/minds14"</span>, name=<span>"en-US"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span>ds.features
{<span>'audio'</span>: Audio(sampling_rate=<span>8000</span>, mono=<span>True</span>, decode=<span>True</span>, <span>id</span>=<span>None</span>),
<span>'english_transcription'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'intent_class'</span>: ClassLabel(num_classes=<span>14</span>, names=[<span>'abroad'</span>, <span>'address'</span>, <span>'app_error'</span>, <span>'atm_limit'</span>, <span>'balance'</span>, <span>'business_loan'</span>, <span>'card_issues'</span>, <span>'cash_deposit'</span>, <span>'direct_debit'</span>, <span>'freeze'</span>, <span>'high_value_payment'</span>, <span>'joint_account'</span>, <span>'latest_transactions'</span>, <span>'pay_bill'</span>], <span>id</span>=<span>None</span>),
<span>'lang_id'</span>: ClassLabel(num_classes=<span>14</span>, names=[<span>'cs-CZ'</span>, <span>'de-DE'</span>, <span>'en-AU'</span>, <span>'en-GB'</span>, <span>'en-US'</span>, <span>'es-ES'</span>, <span>'fr-FR'</span>, <span>'it-IT'</span>, <span>'ko-KR'</span>, <span>'nl-NL'</span>, <span>'pl-PL'</span>, <span>'pt-PT'</span>, <span>'ru-RU'</span>, <span>'zh-CN'</span>], <span>id</span>=<span>None</span>),
<span>'path'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'transcription'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
<span>>>> </span>ds = ds.cast_column(<span>"audio"</span>, Audio(sampling_rate=<span>16000</span>))
<span>>>> </span>ds.features
{<span>'audio'</span>: Audio(sampling_rate=<span>16000</span>, mono=<span>True</span>, decode=<span>True</span>, <span>id</span>=<span>None</span>),
<span>'english_transcription'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'intent_class'</span>: ClassLabel(num_classes=<span>14</span>, names=[<span>'abroad'</span>, <span>'address'</span>, <span>'app_error'</span>, <span>'atm_limit'</span>, <span>'balance'</span>, <span>'business_loan'</span>, <span>'card_issues'</span>, <span>'cash_deposit'</span>, <span>'direct_debit'</span>, <span>'freeze'</span>, <span>'high_value_payment'</span>, <span>'joint_account'</span>, <span>'latest_transactions'</span>, <span>'pay_bill'</span>], <span>id</span>=<span>None</span>),
<span>'lang_id'</span>: ClassLabel(num_classes=<span>14</span>, names=[<span>'cs-CZ'</span>, <span>'de-DE'</span>, <span>'en-AU'</span>, <span>'en-GB'</span>, <span>'en-US'</span>, <span>'es-ES'</span>, <span>'fr-FR'</span>, <span>'it-IT'</span>, <span>'ko-KR'</span>, <span>'nl-NL'</span>, <span>'pl-PL'</span>, <span>'pt-PT'</span>, <span>'ru-RU'</span>, <span>'zh-CN'</span>], <span>id</span>=<span>None</span>),
<span>'path'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'transcription'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
cast
( features: Features ) → IterableDataset
Parameters
- features (Features) — New features to cast the dataset to. The name of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g.
string
<-> ClassLabel
you should use map() to update the Dataset.
A copy of the dataset with casted features.
Cast the dataset to a new set of features.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span>ds.features
{<span>'label'</span>: ClassLabel(num_classes=<span>2</span>, names=[<span>'neg'</span>, <span>'pos'</span>], <span>id</span>=<span>None</span>),
<span>'text'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
<span>>>> </span>new_features = ds.features.copy()
<span>>>> </span>new_features[<span>"label"</span>] = ClassLabel(names=[<span>"bad"</span>, <span>"good"</span>])
<span>>>> </span>new_features[<span>"text"</span>] = Value(<span>"large_string"</span>)
<span>>>> </span>ds = ds.cast(new_features)
<span>>>> </span>ds.features
{<span>'label'</span>: ClassLabel(num_classes=<span>2</span>, names=[<span>'bad'</span>, <span>'good'</span>], <span>id</span>=<span>None</span>),
<span>'text'</span>: Value(dtype=<span>'large_string'</span>, <span>id</span>=<span>None</span>)}
iter
( batch_size: intdrop_last_batch: bool = False )
Parameters
- batch_size (
int
) — size of each batch to yield. - drop_last_batch (
bool
, defaults to False) — Whether a last batch smaller than the batch_size should be dropped.
Iterate through the batches of size batch_size.
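For instance, a minimal sketch using the streaming rotten_tomatoes split used elsewhere on this page:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> for batch in ds.iter(batch_size=2):
...     print(len(batch["text"]))  # each batch is a dict of lists with up to batch_size values
...     break
2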
map
( function: Optional = Nonewith_indices: bool = Falseinput_columns: Union = Nonebatched: bool = Falsebatch_size: Optional = 1000drop_last_batch: bool = Falseremove_columns: Union = Nonefeatures: Optional = Nonefn_kwargs: Optional = None )
Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. If your function returns a column that already exists, then it overwrites it. The function is applied on-the-fly on the examples when iterating over the dataset.
You can specify whether the function should be batched or not with the batched
parameter:
- If batched is
False
, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g.{"text": "Hello there !"}
. - If batched is
True
andbatch_size
is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is {“text”: [“Hello there !”]}. - If batched is
True
andbatch_size
isn
> 1, then the function takes a batch ofn
examples as input and can return a batch withn
examples, or with an arbitrary number of examples. Note that the last batch may have less thann
examples. A batch is a dictionary, e.g. a batch ofn
examples is{"text": ["Hello there !"] * n}
.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span><span>def</span> <span>add_prefix</span>(<span>example</span>):
<span>... </span> example[<span>"text"</span>] = <span>"Review: "</span> + example[<span>"text"</span>]
<span>... </span> <span>return</span> example
<span>>>> </span>ds = ds.<span>map</span>(add_prefix)
<span>>>> </span><span>list</span>(ds.take(<span>3</span>))
[{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'Review: the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>},
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'Review: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson'</span>s expanded vision of j . r . r . tolkien<span>'s middle-earth .'</span>},
{<span>'label'</span>: <span>1</span>, <span>'text'</span>: <span>'Review: effective but too-tepid biopic'</span>}]
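A batched variant of the same kind of transformation, following the batched behaviour described above (a sketch; the tokenizer checkpoint is only an illustration):
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> # the function receives a batch (dict of lists) and returns a batch
>>> ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True, batch_size=16)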
rename_column
( original_column_name: strnew_column_name: str ) → IterableDataset
Parameters
- original_column_name (
str
) — Name of the column to rename. - new_column_name (
str
) — New name for the column.
A copy of the dataset with a renamed column.
Rename a column in the dataset, and move the features associated to the original column under the new column name.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span><span>next</span>(<span>iter</span>(ds))
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>}
<span>>>> </span>ds = ds.rename_column(<span>"text"</span>, <span>"movie_review"</span>)
<span>>>> </span><span>next</span>(<span>iter</span>(ds))
{<span>'label'</span>: <span>1</span>,
<span>'movie_review'</span>: <span>'the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>}
filter
( function: Optional = Nonewith_indices = Falseinput_columns: Union = Nonebatched: bool = Falsebatch_size: Optional = 1000fn_kwargs: Optional = None )
Apply a filter function to all the elements so that the dataset only includes examples according to the filter function. The filtering is done on-the-fly when iterating over the dataset.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span>ds = ds.<span>filter</span>(<span>lambda</span> x: x[<span>"label"</span>] == <span>0</span>)
<span>>>> </span><span>list</span>(ds.take(<span>3</span>))
[{<span>'label'</span>: <span>0</span>, <span>'text'</span>: <span>'simplistic , silly and tedious .'</span>},
{<span>'label'</span>: <span>0</span>,
<span>'text'</span>: <span>"it's so laddish and juvenile , only teenage boys could possibly find it funny ."</span>},
{<span>'label'</span>: <span>0</span>,
<span>'text'</span>: <span>'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .'</span>}]
shuffle
( seed = Nonegenerator: Optional = Nonebuffer_size: int = 1000 )
Parameters
- seed (
int
, optional, defaults toNone
) — Random seed that will be used to shuffle the dataset. It is used to sample from the shuffle buffer and also to shuffle the data shards. - generator (
numpy.random.Generator
, optional) — Numpy random Generator to use to compute the permutation of the dataset rows. Ifgenerator=None
(default), usesnp.random.default_rng
(the default BitGenerator (PCG64) of NumPy). - buffer_size (
int
, defaults to1000
) — Size of the buffer.
Randomly shuffles the elements of this dataset.
This dataset fills a buffer with buffer_size
elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.
For instance, if your dataset contains 10,000 elements but buffer_size
is set to 1000, then shuffle
will initially select a random element from only the first 1000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1000 element buffer.
If the dataset is made of several shards, it also does shuffle the order of the shards. However if the order has been fixed by using skip() or take() then the order of the shards is kept unchanged.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span><span>list</span>(ds.take(<span>3</span>))
[{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>},
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson'</span>s expanded vision of j . r . r . tolkien<span>'s middle-earth .'</span>},
{<span>'label'</span>: <span>1</span>, <span>'text'</span>: <span>'effective but too-tepid biopic'</span>}]
<span>>>> </span>shuffled_ds = ds.shuffle(seed=<span>42</span>)
<span>>>> </span><span>list</span>(shuffled_ds.take(<span>3</span>))
[{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>"a sports movie with action that's exciting on the field and a story you care about off it ."</span>},
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'at its best , the good girl is a refreshingly adult take on adultery . . .'</span>},
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>"sam jones became a very lucky filmmaker the day wilco got dropped from their record label , proving that one man's ruin may be another's fortune ."</span>}]
batch
( batch_size: intdrop_last_batch: bool = False )
Parameters
- batch_size (
int
) — The number of samples in each batch. - drop_last_batch (
bool
, defaults toFalse
) — Whether to drop the last incomplete batch.
Group samples from the dataset into batches.
Example:
<span>>>> </span>ds = load_dataset(<span>"some_dataset"</span>, streaming=<span>True</span>)
<span>>>> </span>batched_ds = ds.batch(batch_size=<span>32</span>)
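Each batch is a dictionary mapping column names to lists of batch_size values; for instance, with the streaming rotten_tomatoes split used elsewhere on this page (a sketch):
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> batch = next(iter(ds.batch(batch_size=2)))
>>> sorted(batch.keys())
['label', 'text']
>>> len(batch["text"])
2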
skip
( n: int )
Parameters
Create a new IterableDataset that skips the first n
elements.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span><span>list</span>(ds.take(<span>3</span>))
[{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>},
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson'</span>s expanded vision of j . r . r . tolkien<span>'s middle-earth .'</span>},
{<span>'label'</span>: <span>1</span>, <span>'text'</span>: <span>'effective but too-tepid biopic'</span>}]
<span>>>> </span>ds = ds.skip(<span>1</span>)
<span>>>> </span><span>list</span>(ds.take(<span>3</span>))
[{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson'</span>s expanded vision of j . r . r . tolkien<span>'s middle-earth .'</span>},
{<span>'label'</span>: <span>1</span>, <span>'text'</span>: <span>'effective but too-tepid biopic'</span>},
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'</span>}]
take
( n: int )
Parameters
Create a new IterableDataset with only the first n
elements.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, split=<span>"train"</span>, streaming=<span>True</span>)
<span>>>> </span>small_ds = ds.take(<span>2</span>)
<span>>>> </span><span>list</span>(small_ds)
[{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>},
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson'</span>s expanded vision of j . r . r . tolkien<span>'s middle-earth .'</span>}]
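skip() and take() can be combined to carve a small held-out set out of a stream without materializing it; a sketch:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> eval_ds = ds.take(500)   # first 500 examples for evaluation
>>> train_ds = ds.skip(500)  # the remaining examples for training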
load_state_dict
Load the state_dict of the dataset. The iteration will restart at the next example from when the state was saved.
Resuming returns exactly where the checkpoint was saved except in two cases:
- examples from shuffle buffers are lost when resuming and the buffers are refilled with new data
- combinations of
.with_format("arrow")
and batched.map()
may skip one batch.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> Dataset, concatenate_datasets
<span>>>> </span>ds = Dataset.from_dict({<span>"a"</span>: <span>range</span>(<span>6</span>)}).to_iterable_dataset(num_shards=<span>3</span>)
<span>>>> </span><span>for</span> idx, example <span>in</span> <span>enumerate</span>(ds):
<span>... </span> <span>print</span>(example)
<span>... </span> <span>if</span> idx == <span>2</span>:
<span>... </span> state_dict = ds.state_dict()
<span>... </span> <span>print</span>(<span>"checkpoint"</span>)
<span>... </span> <span>break</span>
<span>>>> </span>ds.load_state_dict(state_dict)
<span>>>> </span><span>print</span>(<span>f"restart from checkpoint"</span>)
<span>>>> </span><span>for</span> example <span>in</span> ds:
<span>... </span> <span>print</span>(example)
which returns:
{<span>'a'</span>: <span>0</span>}
{<span>'a'</span>: <span>1</span>}
{<span>'a'</span>: <span>2</span>}
<span>checkpoint</span>
<span>restart</span> <span>from</span> <span>checkpoint</span>
{<span>'a'</span>: <span>3</span>}
{<span>'a'</span>: <span>4</span>}
{<span>'a'</span>: <span>5</span>}
<span>>>> </span><span>from</span> torchdata.stateful_dataloader <span>import</span> StatefulDataLoader
<span>>>> </span>ds = load_dataset(<span>"deepmind/code_contests"</span>, streaming=<span>True</span>, split=<span>"train"</span>)
<span>>>> </span>dataloader = StatefulDataLoader(ds, batch_size=<span>32</span>, num_workers=<span>4</span>)
<span>>>> </span>
<span>>>> </span>state_dict = dataloader.state_dict()
<span>>>> </span>
<span>>>> </span>dataloader.load_state_dict(state_dict)
state_dict
Get the current state_dict of the dataset. It corresponds to the state at the latest example it yielded.
Resuming returns exactly where the checkpoint was saved except in two cases:
- examples from shuffle buffers are lost when resuming and the buffers are refilled with new data
- combinations of
.with_format("arrow")
and batched.map()
may skip one batch.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> Dataset, concatenate_datasets
<span>>>> </span>ds = Dataset.from_dict({<span>"a"</span>: <span>range</span>(<span>6</span>)}).to_iterable_dataset(num_shards=<span>3</span>)
<span>>>> </span><span>for</span> idx, example <span>in</span> <span>enumerate</span>(ds):
<span>... </span> <span>print</span>(example)
<span>... </span> <span>if</span> idx == <span>2</span>:
<span>... </span> state_dict = ds.state_dict()
<span>... </span> <span>print</span>(<span>"checkpoint"</span>)
<span>... </span> <span>break</span>
<span>>>> </span>ds.load_state_dict(state_dict)
<span>>>> </span><span>print</span>(<span>f"restart from checkpoint"</span>)
<span>>>> </span><span>for</span> example <span>in</span> ds:
<span>... </span> <span>print</span>(example)
which returns:
{<span>'a'</span>: <span>0</span>}
{<span>'a'</span>: <span>1</span>}
{<span>'a'</span>: <span>2</span>}
<span>checkpoint</span>
<span>restart</span> <span>from</span> <span>checkpoint</span>
{<span>'a'</span>: <span>3</span>}
{<span>'a'</span>: <span>4</span>}
{<span>'a'</span>: <span>5</span>}
<span>>>> </span><span>from</span> torchdata.stateful_dataloader <span>import</span> StatefulDataLoader
<span>>>> </span>ds = load_dataset(<span>"deepmind/code_contests"</span>, streaming=<span>True</span>, split=<span>"train"</span>)
<span>>>> </span>dataloader = StatefulDataLoader(ds, batch_size=<span>32</span>, num_workers=<span>4</span>)
<span>>>> </span>
<span>>>> </span>state_dict = dataloader.state_dict()
<span>>>> </span>
<span>>>> </span>dataloader.load_state_dict(state_dict)
DatasetInfo object containing all the metadata in the dataset.
NamedSplit object corresponding to a named dataset split.
IterableDatasetDict
Dictionary with split names as keys (‘train’, ‘test’ for example), and IterableDataset
objects as values.
map
( function: Optional = Nonewith_indices: bool = Falseinput_columns: Union = Nonebatched: bool = Falsebatch_size: int = 1000drop_last_batch: bool = Falseremove_columns: Union = Nonefn_kwargs: Optional = None )
Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. If your function returns a column that already exists, then it overwrites it. The function is applied on-the-fly on the examples when iterating over the dataset. The transformation is applied to all the datasets of the dataset dictionary.
You can specify whether the function should be batched or not with the batched
parameter:
- If batched is
False
, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g.{"text": "Hello there !"}
. - If batched is
True
andbatch_size
is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is{"text": ["Hello there !"]}
. - If batched is
True
andbatch_size
isn
> 1, then the function takes a batch ofn
examples as input and can return a batch withn
examples, or with an arbitrary number of examples. Note that the last batch may have less thann
examples. A batch is a dictionary, e.g. a batch ofn
examples is{"text": ["Hello there !"] * n}
.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, streaming=<span>True</span>)
<span>>>> </span><span>def</span> <span>add_prefix</span>(<span>example</span>):
<span>... </span> example[<span>"text"</span>] = <span>"Review: "</span> + example[<span>"text"</span>]
<span>... </span> <span>return</span> example
<span>>>> </span>ds = ds.<span>map</span>(add_prefix)
<span>>>> </span><span>next</span>(<span>iter</span>(ds[<span>"train"</span>]))
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'Review: the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>}
filter
( function: Optional = Nonewith_indices = Falseinput_columns: Union = Nonebatched: bool = Falsebatch_size: Optional = 1000fn_kwargs: Optional = None )
Apply a filter function to all the elements so that the dataset only includes examples according to the filter function. The filtering is done on-the-fly when iterating over the dataset. The filtering is applied to all the datasets of the dataset dictionary.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, streaming=<span>True</span>)
<span>>>> </span>ds = ds.<span>filter</span>(<span>lambda</span> x: x[<span>"label"</span>] == <span>0</span>)
<span>>>> </span><span>list</span>(ds[<span>"train"</span>].take(<span>3</span>))
[{<span>'label'</span>: <span>0</span>, <span>'text'</span>: <span>'Review: simplistic , silly and tedious .'</span>},
{<span>'label'</span>: <span>0</span>,
<span>'text'</span>: <span>"Review: it's so laddish and juvenile , only teenage boys could possibly find it funny ."</span>},
{<span>'label'</span>: <span>0</span>,
<span>'text'</span>: <span>'Review: exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .'</span>}]
shuffle
( seed = Nonegenerator: Optional = Nonebuffer_size: int = 1000 )
Parameters
- seed (
int
, optional, defaults toNone
) — Random seed that will be used to shuffle the dataset. It is used to sample from the shuffle buffer and also to shuffle the data shards. - generator (
numpy.random.Generator
, optional) — Numpy random Generator to use to compute the permutation of the dataset rows. Ifgenerator=None
(default), usesnp.random.default_rng
(the default BitGenerator (PCG64) of NumPy). - buffer_size (
int
, defaults to1000
) — Size of the buffer.
Randomly shuffles the elements of this dataset. The shuffling is applied to all the datasets of the dataset dictionary.
This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.
For instance, if your dataset contains 10,000 elements but buffer_size
is set to 1000, then shuffle
will initially select a random element from only the first 1000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1000 element buffer.
If the dataset is made of several shards, it also does shuffle
the order of the shards. However if the order has been fixed by using skip() or take() then the order of the shards is kept unchanged.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, streaming=<span>True</span>)
<span>>>> </span><span>list</span>(ds[<span>"train"</span>].take(<span>3</span>))
[{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>},
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson'</span>s expanded vision of j . r . r . tolkien<span>'s middle-earth .'</span>},
{<span>'label'</span>: <span>1</span>, <span>'text'</span>: <span>'effective but too-tepid biopic'</span>}]
<span>>>> </span>ds = ds.shuffle(seed=<span>42</span>)
<span>>>> </span><span>list</span>(ds[<span>"train"</span>].take(<span>3</span>))
[{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>"a sports movie with action that's exciting on the field and a story you care about off it ."</span>},
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>'at its best , the good girl is a refreshingly adult take on adultery . . .'</span>},
{<span>'label'</span>: <span>1</span>,
<span>'text'</span>: <span>"sam jones became a very lucky filmmaker the day wilco got dropped from their record label , proving that one man's ruin may be another's fortune ."</span>}]
with_format
( type: Optional = None )
Parameters
- type (
str
, optional, defaults toNone
) — If set to “torch”, the returned dataset will be a subclass oftorch.utils.data.IterableDataset
to be used in aDataLoader
.
Return a dataset with the specified format. This method only supports the “torch” format for now. The format is set for all the datasets of the dataset dictionary.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, streaming=<span>True</span>)
<span>>>> </span><span>from</span> transformers <span>import</span> AutoTokenizer
<span>>>> </span>tokenizer = AutoTokenizer.from_pretrained(<span>"bert-base-uncased"</span>)
<span>>>> </span><span>def</span> <span>encode</span>(<span>examples</span>):
<span>... </span> <span>return</span> tokenizer(examples[<span>"text"</span>], truncation=<span>True</span>, padding=<span>"max_length"</span>)
<span>>>> </span>ds = ds.<span>map</span>(encode, batched=<span>True</span>, remove_columns=[<span>"text"</span>])
<span>>>> </span>ds = ds.with_format(<span>"torch"</span>)
cast
( features: Features ) → IterableDatasetDict
Parameters
- features (
Features
) — New features to cast the dataset to. The name of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel, you should use map to update the Dataset.
A copy of the dataset with casted features.
Cast the dataset to a new set of features. The type casting is applied to all the datasets of the dataset dictionary.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, streaming=<span>True</span>)
<span>>>> </span>ds[<span>"train"</span>].features
{<span>'label'</span>: ClassLabel(num_classes=<span>2</span>, names=[<span>'neg'</span>, <span>'pos'</span>], <span>id</span>=<span>None</span>),
<span>'text'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
<span>>>> </span>new_features = ds[<span>"train"</span>].features.copy()
<span>>>> </span>new_features[<span>'label'</span>] = ClassLabel(names=[<span>'bad'</span>, <span>'good'</span>])
<span>>>> </span>new_features[<span>'text'</span>] = Value(<span>'large_string'</span>)
<span>>>> </span>ds = ds.cast(new_features)
<span>>>> </span>ds[<span>"train"</span>].features
{<span>'label'</span>: ClassLabel(num_classes=<span>2</span>, names=[<span>'bad'</span>, <span>'good'</span>], <span>id</span>=<span>None</span>),
<span>'text'</span>: Value(dtype=<span>'large_string'</span>, <span>id</span>=<span>None</span>)}
cast_column
( column: strfeature: Union )
Parameters
Cast column to feature for decoding. The type casting is applied to all the datasets of the dataset dictionary.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, streaming=<span>True</span>)
<span>>>> </span>ds[<span>"train"</span>].features
{<span>'label'</span>: ClassLabel(num_classes=<span>2</span>, names=[<span>'neg'</span>, <span>'pos'</span>], <span>id</span>=<span>None</span>),
<span>'text'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
<span>>>> </span>ds = ds.cast_column(<span>'label'</span>, ClassLabel(names=[<span>'bad'</span>, <span>'good'</span>]))
<span>>>> </span>ds[<span>"train"</span>].features
{<span>'label'</span>: ClassLabel(num_classes=<span>2</span>, names=[<span>'bad'</span>, <span>'good'</span>], <span>id</span>=<span>None</span>),
<span>'text'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
remove_columns
( column_names: Union ) → IterableDatasetDict
Parameters
A copy of the dataset object without the columns to remove.
Remove one or several column(s) in the dataset and the features associated to them. The removal is done on-the-fly on the examples when iterating over the dataset. The removal is applied to all the datasets of the dataset dictionary.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"rotten_tomatoes"</span>, streaming=<span>True</span>)
<span>>>> </span>ds = ds.remove_columns(<span>"label"</span>)
<span>>>> </span><span>next</span>(<span>iter</span>(ds[<span>"train"</span>]))
{<span>'text'</span>: <span>'the rock is destined to be the 21st century'</span>s new <span>" conan "</span> <span>and</span> that he<span>'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'</span>}
rename_column
( original_column_name: str, new_column_name: str ) → IterableDatasetDict
Parameters
- original_column_name (str) — Name of the column to rename.
- new_column_name (str) — New name for the column.
A copy of the dataset with a renamed column.
Rename a column in the dataset, and move the features associated to the original column under the new column name. The renaming is applied to all the datasets of the dataset dictionary.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> ds = ds.rename_column("text", "movie_review")
>>> next(iter(ds["train"]))
{'label': 1,
 'movie_review': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
rename_columns
( column_mapping: Dict ) → IterableDatasetDict
Parameters
A copy of the dataset with renamed columns.
Rename several columns in the dataset, and move the features associated to the original columns under the new column names. The renaming is applied to all the datasets of the dataset dictionary.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> ds = ds.rename_columns({"text": "movie_review", "label": "rating"})
>>> next(iter(ds["train"]))
{'movie_review': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'rating': 1}
select_columns
( column_names: Union ) → IterableDatasetDict
Parameters
A copy of the dataset object with only selected columns.
Select one or several column(s) in the dataset and the features associated to them. The selection is done on-the-fly on the examples when iterating over the dataset. The selection is applied to all the datasets of the dataset dictionary.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> ds = ds.select_columns("text")
>>> next(iter(ds["train"]))
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
Features
class datasets.Features
( *args, **kwargs )
A special dictionary that defines the internal structure of a dataset.
Instantiated with a dictionary of type dict[str, FieldType], where keys are the desired column names, and values are the type of that column.
FieldType can be one of the following:
- Value feature specifies a single data type value, e.g. int64 or string.
- ClassLabel feature specifies a predefined set of classes which can have labels associated to them and will be stored as integers in the dataset.
- Python dict specifies a composite feature containing a mapping of sub-fields to sub-features. It’s possible to have nested fields of nested fields in an arbitrary manner.
- Python list, LargeList or Sequence specifies a composite feature containing a sequence of sub-features, all of the same feature type. A Sequence with an internal dictionary feature will be automatically converted into a dictionary of lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be un-wanted in some cases. If you don’t want this behavior, you can use a Python list or a LargeList instead of the Sequence.
- Array2D, Array3D, Array4D or Array5D feature for multidimensional arrays.
- Audio feature to store the absolute path to an audio file or a dictionary with the relative path to an audio file (“path” key) and its bytes content (“bytes” key). This feature extracts the audio data.
- Image feature to store the absolute path to an image file, an np.ndarray object, a PIL.Image.Image object or a dictionary with the relative path to an image file (“path” key) and its bytes content (“bytes” key). This feature extracts the image data.
- Translation or TranslationVariableLanguages feature specific to Machine Translation.
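For illustration, here is a sketch of a Features definition that combines several of these field types (the column names are made up for the example):

>>> from datasets import Features, Value, ClassLabel, Sequence, Array2D
>>> features = Features({
...     "text": Value("string"),                         # scalar string column
...     "label": ClassLabel(names=["neg", "pos"]),       # integer class labels
...     "scores": Sequence(Value("float32")),            # variable-length list of floats
...     "matrix": Array2D(shape=(2, 2), dtype="int32"),  # fixed-shape 2D array
... })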
Make a deep copy of Features.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> copy_of_features = ds.features.copy()
>>> copy_of_features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
decode_batch
( batch: dict, token_per_repo_id: Optional = None )
Parameters
- batch (dict[str, list[Any]]) — Dataset batch data.
- token_per_repo_id (dict, optional) — To access and decode audio or image files from private repositories on the Hub, you can pass a dictionary repo_id (str) → token (bool or str).
Decode batch with custom feature decoding.
decode_column
( column: list, column_name: str )
Parameters
Decode column with custom feature decoding.
decode_example
( example: dict, token_per_repo_id: Optional = None )
Parameters
- example (dict[str, Any]) — Dataset row data.
- token_per_repo_id (dict, optional) — To access and decode audio or image files from private repositories on the Hub, you can pass a dictionary repo_id (str) -> token (bool or str).
Decode example with custom feature decoding.
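As a sketch of what custom feature decoding looks like, assuming Pillow is installed (the tiny in-memory PNG is fabricated for the example), an Image column can be decoded from raw bytes:

>>> import io
>>> import numpy as np
>>> import PIL.Image
>>> from datasets import Features, Image
>>> # Build a tiny PNG in memory so the example is self-contained
>>> buf = io.BytesIO()
>>> PIL.Image.fromarray(np.zeros((4, 4, 3), dtype=np.uint8)).save(buf, format="PNG")
>>> features = Features({"image": Image()})
>>> decoded = features.decode_example({"image": {"path": None, "bytes": buf.getvalue()}})
>>> decoded["image"]  # a PIL.Image.Image object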
encode_batch
( batch )
Parameters
Encode batch into a format for Arrow.
encode_column
( column, column_name: str )
Parameters
Encode column into a format for Arrow.
encode_example
( example )
Parameters
Encode example into a format for Arrow.
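A minimal sketch of encoding, assuming a ClassLabel column: string labels are converted to their integer ids so the row can be written to Arrow.

>>> from datasets import Features, Value, ClassLabel
>>> features = Features({"text": Value("string"), "label": ClassLabel(names=["neg", "pos"])})
>>> features.encode_example({"text": "great movie", "label": "pos"})
{'text': 'great movie', 'label': 1}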
Flatten the features. Every dictionary column is removed and is replaced by all the subfields it contains. The new fields are named by concatenating the name of the original column and the subfield name like this: <original>.<subfield>.
If a column contains nested dictionaries, then all the lower-level subfield names are also concatenated to form new columns: <original>.<subfield>.<subsubfield>, etc.
Example:
<span>>>> </span><span>from</span> datasets <span>import</span> load_dataset
<span>>>> </span>ds = load_dataset(<span>"squad"</span>, split=<span>"train"</span>)
<span>>>> </span>ds.features.flatten()
{<span>'answers.answer_start'</span>: <span>Sequence</span>(feature=Value(dtype=<span>'int32'</span>, <span>id</span>=<span>None</span>), length=-<span>1</span>, <span>id</span>=<span>None</span>),
<span>'answers.text'</span>: <span>Sequence</span>(feature=Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>), length=-<span>1</span>, <span>id</span>=<span>None</span>),
<span>'context'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'id'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'question'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>),
<span>'title'</span>: Value(dtype=<span>'string'</span>, <span>id</span>=<span>None</span>)}
from_arrow_schema
( pa_schema: Schema )
Parameters
Construct Features from Arrow Schema. It also checks the schema metadata for Hugging Face Datasets features. Non-nullable fields are not supported and are set to nullable.
Also, pa.dictionary is not supported and its underlying type is used instead. Therefore, datasets converts DictionaryArray objects to their actual values.
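A small sketch of the round-trip from a pyarrow schema (the column names here are made up):

>>> import pyarrow as pa
>>> from datasets import Features
>>> schema = pa.schema({"id": pa.int64(), "text": pa.string()})
>>> Features.from_arrow_schema(schema)
{'id': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None)}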
from_dict
( dic ) → Features
Parameters
Construct Features from a dict.
Regenerate the nested feature object from a deserialized dict. We use the _type key to infer the dataclass name of the feature FieldType.
It allows for a convenient constructor syntax to define features from deserialized JSON dictionaries. This function is used in particular when deserializing a DatasetInfo that was dumped to a JSON object. This acts as an analogue to Features.from_arrow_schema and handles the recursive field-by-field instantiation, but doesn’t require any mapping to/from pyarrow, except for the fact that it takes advantage of the mapping of pyarrow primitive dtypes that Value automatically performs.
Example:
>>> Features.from_dict({'_type': {'dtype': 'string', 'id': None, '_type': 'Value'}})
{'_type': Value(dtype='string', id=None)}
reorder_fields_as
( other: Features )
Parameters
Reorder Features fields to match the field order of other Features.
The order of the fields is important since it matters for the underlying Arrow data. Reordering the fields makes the underlying Arrow data types match.
Example:
>>> from datasets import Features, Sequence, Value
>>> f1 = Features({"root": Sequence({"a": Value("string"), "b": Value("string")})})
>>> f2 = Features({"root": {"b": Sequence(Value("string")), "a": Sequence(Value("string"))}})
>>> assert f1.type != f2.type
>>> f1.reorder_fields_as(f2)
{'root': Sequence(feature={'b': Value(dtype='string', id=None), 'a': Value(dtype='string', id=None)}, length=-1, id=None)}
>>> assert f1.reorder_fields_as(f2).type == f2.type
Scalar
class datasets.Value
( dtype: str, id: Optional = None )
Parameters
Scalar feature value of a particular data type.
The possible dtypes of Value are as follows:
- null
- bool
- int8
- int16
- int32
- int64
- uint8
- uint16
- uint32
- uint64
- float16
- float32 (alias float)
- float64 (alias double)
- time32[(s|ms)]
- time64[(us|ns)]
- timestamp[(s|ms|us|ns)]
- timestamp[(s|ms|us|ns), tz=(tzstring)]
- date32
- date64
- duration[(s|ms|us|ns)]
- decimal128(precision, scale)
- decimal256(precision, scale)
- binary
- large_binary
- string
- large_string
Example:
>>> from datasets import Features, Value
>>> features = Features({'stars': Value(dtype='int32')})
>>> features
{'stars': Value(dtype='int32', id=None)}
class datasets.ClassLabel
( num_classes: dataclasses.InitVar[typing.Optional[int]] = None, names: List = None, names_file: dataclasses.InitVar[typing.Optional[str]] = None, id: Optional = None )
Parameters
- num_classes (int, optional) — Number of classes. All labels must be < num_classes.
- names (list of str, optional) — String names for the integer classes. The order in which the names are provided is kept.
- names_file (str, optional) — Path to a file with names for the integer classes, one per line.
Feature type for integer class labels.
There are 3 ways to define a ClassLabel, which correspond to the 3 arguments:
- num_classes: Create 0 to (num_classes - 1) labels.
- names: List of label strings.
- names_file: File containing the list of labels.
Under the hood the labels are stored as integers. You can use negative integers to represent unknown/missing labels.
Example:
>>> from datasets import Features, ClassLabel
>>> features = Features({'label': ClassLabel(num_classes=3, names=['bad', 'ok', 'good'])})
>>> features
{'label': ClassLabel(num_classes=3, names=['bad', 'ok', 'good'], id=None)}
cast_storage
( storage: Union ) → pa.Int64Array
Parameters
Array in the ClassLabel arrow storage type.
Cast an Arrow array to the ClassLabel arrow storage type. The Arrow types that can be converted to the ClassLabel pyarrow storage type are:
- pa.string()
- pa.int()
int2str
Conversion integer ⇒ class name string.
Regarding unknown/missing labels: passing negative integers raises ValueError.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> ds.features["label"].int2str(0)
'neg'
str2int
Conversion class name string ⇒ integer.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train")
>>> ds.features["label"].str2int('neg')
0
Composite
class datasets.LargeList
( dtype: Any, id: Optional = None )
Parameters
Feature type for large list data composed of child feature data type.
It is backed by pyarrow.LargeListType, which is like pyarrow.ListType but with 64-bit rather than 32-bit offsets.
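For example, a sketch of a column of (possibly very long) lists of strings backed by 64-bit offsets, assuming LargeList is importable from the top-level datasets namespace as documented here:

>>> from datasets import Features, LargeList, Value
>>> features = Features({"tokens": LargeList(Value("string"))})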
class datasets.Sequence
( feature: Any, length: int = -1, id: Optional = None )
Parameters
- feature (FeatureType) — A list of features of a single type or a dictionary of types.
- length (int) — Length of the sequence.
Construct a list of features from a single type or a dict of types. Mostly here for compatibility with tfds.
Example:
>>> from datasets import Features, Sequence, Value, ClassLabel
>>> features = Features({'post': Sequence(feature={'text': Value(dtype='string'), 'upvotes': Value(dtype='int32'), 'label': ClassLabel(num_classes=2, names=['hot', 'cold'])})})
>>> features
{'post': Sequence(feature={'text': Value(dtype='string', id=None), 'upvotes': Value(dtype='int32', id=None), 'label': ClassLabel(num_classes=2, names=['hot', 'cold'], id=None)}, length=-1, id=None)}
Translation
class datasets.Translation
( languages: List, id: Optional = None )
Parameters
- languages (dict) — A dictionary for each example mapping string language codes to string translations.
Feature for translations with fixed languages per example. Here for compatibility with tfds.
Example:
>>> datasets.features.Translation(languages=['en', 'fr', 'de'])
>>> yield {
...         'en': 'the cat',
...         'fr': 'le chat',
...         'de': 'die katze'
... }
Flatten the Translation feature into a dictionary.
class datasets.TranslationVariableLanguages
( languages: Optional = None, num_languages: Optional = None, id: Optional = None ) → language or translation (variable-length 1D tf.Tensor of tf.string)
Parameters
- languages (dict) — A dictionary for each example mapping string language codes to one or more string translations. The languages present may vary from example to example.
Returns
language or translation (variable-length 1D tf.Tensor of tf.string)
Language codes sorted in ascending order or plain text translations, sorted to align with language codes.
Feature for translations with variable languages per example. Here for compatibility with tfds.
Example:
>>> datasets.features.TranslationVariableLanguages(languages=['en', 'fr', 'de'])
>>> yield {
...         'en': 'the cat',
...         'fr': ['le chat', 'la chatte,'],
...         'de': 'die katze'
... }
>>> {
...         'language': ['en', 'de', 'fr', 'fr'],
...         'translation': ['the cat', 'die katze', 'la chatte', 'le chat'],
... }
Flatten the TranslationVariableLanguages feature into a dictionary.
Arrays
class datasets.Array2D
( shape: tuple, dtype: str, id: Optional = None )
Parameters
Create a two-dimensional array.
Example:
>>> from datasets import Features, Array2D
>>> features = Features({'x': Array2D(shape=(1, 3), dtype='int32')})
class datasets.Array3D
( shape: tuple, dtype: str, id: Optional = None )
Parameters
Create a three-dimensional array.
Example:
>>> from datasets import Features, Array3D
>>> features = Features({'x': Array3D(shape=(1, 2, 3), dtype='int32')})
class datasets.Array4D
( shape: tuple, dtype: str, id: Optional = None )
Parameters
Create a four-dimensional array.
Example:
>>> from datasets import Features, Array4D
>>> features = Features({'x': Array4D(shape=(1, 2, 2, 3), dtype='int32')})
class datasets.Array5D
( shape: tuple, dtype: str, id: Optional = None )
Parameters
Create a five-dimensional array.
Example:
>>> from datasets import Features, Array5D
>>> features = Features({'x': Array5D(shape=(1, 2, 2, 3, 3), dtype='int32')})
Audio
class datasets.Audio
( sampling_rate: Optional = None, mono: bool = True, decode: bool = True, id: Optional = None )
Parameters
- sampling_rate (int, optional) — Target sampling rate. If None, the native sampling rate is used.
- mono (bool, defaults to True) — Whether to convert the audio signal to mono by averaging samples across channels.
- decode (bool, defaults to True) — Whether to decode the audio data. If False, returns the underlying dictionary in the format {"path": audio_path, "bytes": audio_bytes}.
Audio Feature to extract audio data from an audio file.
Input: The Audio feature accepts as input:
- A str: Absolute path to the audio file (i.e. random access is allowed).
- A dict with the keys:
  - path: String with relative path of the audio file to the archive file.
  - bytes: Bytes content of the audio file.
  This is useful for archived files with sequential access.
- A dict with the keys:
  - path: String with relative path of the audio file to the archive file.
  - array: Array containing the audio sample.
  - sampling_rate: Integer corresponding to the sampling rate of the audio sample.
  This is useful for archived files with sequential access.
Example:
>>> from datasets import load_dataset, Audio
>>> ds = load_dataset("PolyAI/minds14", name="en-US", split="train")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16000))
>>> ds[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
        3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
cast_storage
( storage: Union ) → pa.StructArray
Parameters
Array in the Audio arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()})
Cast an Arrow array to the Audio arrow storage type. The Arrow types that can be converted to the Audio pyarrow storage type are:
- pa.string() — it must contain the “path” data
- pa.binary() — it must contain the audio bytes
- pa.struct({"bytes": pa.binary()})
- pa.struct({"path": pa.string()})
- pa.struct({"bytes": pa.binary(), "path": pa.string()}) — order doesn’t matter
decode_example
( value: dict, token_per_repo_id: Optional = None ) → dict
Parameters
- value (dict) — A dictionary with keys:
  - path: String with relative audio file path.
  - bytes: Bytes of the audio file.
- token_per_repo_id (dict, optional) — To access and decode audio files from private repositories on the Hub, you can pass a dictionary repo_id (str) → token (bool or str).
Decode example audio file into audio data.
embed_storage
( storage: StructArray ) → pa.StructArray
Parameters
Array in the Audio arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()}).
Embed audio files into the Arrow array.
encode_example
( value: Union ) → dict
Parameters
Encode example into a format for Arrow.
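As a hedged sketch (it assumes the soundfile backend used by the Audio feature is available), encoding a raw array produces the {"bytes", "path"} storage format:

>>> import numpy as np
>>> from datasets import Audio
>>> audio = Audio(sampling_rate=16000)
>>> encoded = audio.encode_example({"array": np.zeros(16000, dtype=np.float32), "sampling_rate": 16000})
>>> sorted(encoded)  # WAV bytes under "bytes", no file path
['bytes', 'path']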
If in the decodable state, raise an error, otherwise flatten the feature into a dictionary.
Image
class datasets.Image
( mode: Optional = None, decode: bool = True, id: Optional = None )
Parameters
- mode (str, optional) — The mode to convert the image to. If None, the native mode of the image is used.
- decode (bool, defaults to True) — Whether to decode the image data. If False, returns the underlying dictionary in the format {"path": image_path, "bytes": image_bytes}.
Image Feature to read image data from an image file.
Input: The Image feature accepts as input:
- A str: Absolute path to the image file (i.e. random access is allowed).
- A dict with the keys:
  - path: String with relative path of the image file to the archive file.
  - bytes: Bytes of the image file.
  This is useful for archived files with sequential access.
- An np.ndarray: NumPy array representing an image.
- A PIL.Image.Image: PIL image object.
Examples:
>>> from datasets import load_dataset, Image
>>> ds = load_dataset("beans", split="train")
>>> ds.features["image"]
Image(decode=True, id=None)
>>> ds[0]["image"]
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x500 at 0x15E52E7F0>
>>> ds = ds.cast_column('image', Image(decode=False))
>>> ds[0]["image"]
{'bytes': None,
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/b0a21163f78769a2cf11f58dfc767fb458fc7cea5c05dccc0144a2c0f0bc1292/train/healthy/healthy_train.85.jpg'}
cast_storage
( storage: Union ) → pa.StructArray
Parameters
Array in the Image arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()}).
Cast an Arrow array to the Image arrow storage type. The Arrow types that can be converted to the Image pyarrow storage type are:
- pa.string() — it must contain the “path” data
- pa.binary() — it must contain the image bytes
- pa.struct({"bytes": pa.binary()})
- pa.struct({"path": pa.string()})
- pa.struct({"bytes": pa.binary(), "path": pa.string()}) — order doesn’t matter
- pa.list(*) — it must contain the image array data
decode_example
( value: dict, token_per_repo_id = None )
Parameters
- value (str or dict) — A string with the absolute image file path, or a dictionary with keys:
  - path: String with absolute or relative image file path.
  - bytes: The bytes of the image file.
- token_per_repo_id (dict, optional) — To access and decode image files from private repositories on the Hub, you can pass a dictionary repo_id (str) → token (bool or str).
Decode example image file into image data.
embed_storage
( storage: StructArray ) → pa.StructArray
Parameters
Array in the Image arrow storage type, that is pa.struct({"bytes": pa.binary(), "path": pa.string()}).
Embed image files into the Arrow array.
encode_example
( value: Union )
Parameters
Encode example into a format for Arrow.
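A minimal sketch, assuming Pillow is installed: a NumPy array is serialized to image bytes in the {"bytes", "path"} storage format.

>>> import numpy as np
>>> from datasets import Image
>>> encoded = Image().encode_example(np.zeros((4, 4, 3), dtype=np.uint8))
>>> sorted(encoded)  # image bytes under "bytes", no file path
['bytes', 'path']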
If in the decodable state, return the feature itself, otherwise flatten the feature into a dictionary.
Filesystems
class datasets.filesystems.S3FileSystem
( *args, **kwargs )
datasets.filesystems.S3FileSystem is a subclass of s3fs.S3FileSystem.
Users can use this class to access S3 as if it were a file system. It exposes a filesystem-like API (ls, cp, open, etc.) on top of S3 storage. Provide credentials either explicitly (key=, secret=) or with boto’s credential methods. See botocore documentation for more information. If no credentials are available, use anon=True.
Examples:
Listing files from public S3 bucket.
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True)
>>> s3.ls('public-datasets/imdb/train')
['dataset_info.json.json','dataset.arrow','state.json']
Listing files from private S3 bucket using aws_access_key_id and aws_secret_access_key.
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
>>> s3.ls('my-private-datasets/imdb/train')
['dataset_info.json.json','dataset.arrow','state.json']
Using S3FileSystem with botocore.session.Session and custom aws_profile.
>>> import botocore
>>> from datasets.filesystems import S3FileSystem
>>> s3_session = botocore.session.Session(profile_name='my_profile_name')
>>> s3 = S3FileSystem(session=s3_session)
Loading dataset from S3 using S3FileSystem and load_from_disk().
>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train', storage_options=s3.storage_options)
>>> print(len(dataset))
25000
Saving dataset to S3 using S3FileSystem and Dataset.save_to_disk().
>>> from datasets import load_dataset
>>> from datasets.filesystems import S3FileSystem
>>> dataset = load_dataset("imdb")
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
>>> dataset.save_to_disk('s3://my-private-datasets/imdb/train', storage_options=s3.storage_options)
datasets.filesystems.extract_path_from_uri
( dataset_path: str )
Parameters
- dataset_path (str) — Path (e.g. dataset/train) or remote URI (e.g. s3://my-bucket/dataset/train) of the dataset directory.
Preprocesses dataset_path and removes remote filesystem (e.g. removing s3://).
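A short sketch of the behaviour (the bucket name is made up):

>>> from datasets.filesystems import extract_path_from_uri
>>> extract_path_from_uri("s3://my-bucket/dataset/train")
'my-bucket/dataset/train'
>>> extract_path_from_uri("dataset/train")
'dataset/train'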
datasets.filesystems.is_remote_filesystem
( fs: AbstractFileSystem )
Parameters
- fs (fsspec.spec.AbstractFileSystem) — An abstract super-class for pythonic file-systems, e.g. fsspec.filesystem('file') or datasets.filesystems.S3FileSystem.
Checks if fs is a remote filesystem.
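A minimal sketch, assuming fsspec (and s3fs for the S3 case) is installed:

>>> import fsspec
>>> from datasets.filesystems import is_remote_filesystem
>>> is_remote_filesystem(fsspec.filesystem("file"))
False
>>> is_remote_filesystem(fsspec.filesystem("s3", anon=True))
True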
Fingerprint
class datasets.fingerprint.Hasher
Hasher that accepts Python objects as inputs.
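A minimal sketch of how the fingerprint hasher can be used (the exact hash value is omitted since it depends on the library version):

>>> from datasets.fingerprint import Hasher
>>> h = Hasher.hash({"text": "hello", "label": 1})  # deterministic hash of an arbitrary picklable object
>>> isinstance(h, str)
True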