Loading Datasets with Hugging Face datasets v3.2.0
Dataset Loading methods
- datasets.load_dataset - Load a dataset from the Hugging Face Hub, or a local dataset
  - Returns Dataset or DatasetDict:
    - if split is not None: the dataset requested
    - if split is None: a DatasetDict with each split
  - or IterableDataset or IterableDatasetDict if streaming=True:
    - if split is not None: the dataset requested
    - if split is None: a datasets.streaming.IterableDatasetDict with each split
  - Under the hood:
    - Load a dataset builder
    - Run the dataset builder: download (if not local), process, and cache the dataset in typed Arrow tables
    - Return a dataset built from the splits requested in split (default: all splits)
  - example: load_dataset('cornell-movie-review-data/rotten_tomatoes', split='train')
- datasets.load_dataset(..., streaming=True) will load an IterableDataset
  - example: load_dataset('cornell-movie-review-data/rotten_tomatoes', split='train', streaming=True)
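A minimal sketch of these return types and of the under-the-hood builder steps, reusing the rotten_tomatoes dataset from the examples above; the load_dataset_builder / download_and_prepare / as_dataset calls mirror what load_dataset does internally, and the printed values are indicative only:

```python
from datasets import load_dataset, load_dataset_builder

# split given -> a single Dataset
train_ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

# split=None (the default) -> a DatasetDict keyed by split name
ds_dict = load_dataset("cornell-movie-review-data/rotten_tomatoes")
print(list(ds_dict.keys()))  # ['train', 'validation', 'test']

# streaming=True -> an IterableDataset; examples are fetched lazily, nothing is cached as Arrow
streamed = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
print(next(iter(streamed)))  # first example as a dict, e.g. {'text': ..., 'label': ...}

# Roughly the under-the-hood steps: load a builder, download and prepare (cache typed Arrow tables),
# then build a Dataset from the requested split
builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()
train_again = builder.as_dataset(split="train")
```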
- datasets.load_from_disk - load a dataset previously saved with save_to_disk (see the round-trip sketch after this list)
- datasets.get_dataset_split_names - get the list of available splits for a particular config and dataset
  - example: datasets.get_dataset_split_names("facebook/voxpopuli", "en") returns ['train', 'validation', 'test']
  - Note: In the facebook/voxpopuli example, it is necessary to specify the dataset config
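Since datasets.load_from_disk is only named above, here is a minimal round-trip sketch; the local path is illustrative:

```python
from datasets import load_dataset, load_from_disk

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
ds.save_to_disk("./rotten_tomatoes_train")  # writes Arrow files + metadata to this directory

# Reload later without re-downloading or re-processing
reloaded = load_from_disk("./rotten_tomatoes_train")
assert reloaded.num_rows == ds.num_rows
```

The same pattern works for a whole DatasetDict (save_to_disk / load_from_disk on the dict of splits).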
- datasets.get_dataset_config_names - Get the list of available config names for a particular dataset
  - see examples with "facebook/voxpopuli" or "facebook/multilingual_librispeech"
>>> from datasets import get_dataset_config_names
>>> get_dataset_config_names("facebook/voxpopuli")
Using the latest cached version of the module from /Users/anilkeshwani/.cache/huggingface/modules/datasets_modules/datasets/facebook--voxpopuli/b5ff837284f0778eefe0f642734e142d8c3f574eba8c9c8a4b13602297f73604 (last modified on Mon Aug 26 13:48:56 2024) since it couldn't be found locally at facebook/voxpopuli, or remotely on the Hugging Face Hub.
['en', 'de', 'fr', 'es', 'pl', 'it', 'ro', 'hu', 'cs', 'nl', 'fi', 'hr', 'sk', 'sl', 'et', 'lt', 'en_accented', 'multilang']
>>> get_dataset_config_names("facebook/multilingual_librispeech")
Downloading readme: 100%|████████████████████████| 18.1k/18.1k [00:00<00:00, 40.8kB/s]
Resolving data files: 100%|████████████████████████| 48/48 [00:05<00:00, 8.13it/s]
['dutch', 'french', 'german', 'italian', 'polish', 'portuguese', 'spanish']
- datasets.get_dataset_infos - Get the meta information about a dataset, returned as a dict mapping config name to DatasetInfoDict
>>> datasets.get_dataset_infos('cornell-movie-review-data/rotten_tomatoes')
README.md: 7.46kB [00:00, 13.3MB/s]
{'default': DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=None, builder_name='parquet', dataset_name='rotten_tomatoes', config_name='default', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=1074810, num_examples=8530, shard_lengths=None, dataset_name=None), 'validation': SplitInfo(name='validation', num_bytes=134679, num_examples=1066, shard_lengths=None, dataset_name=None),
- Note: datasets.get_dataset_infos has bugs, e.g. datasets.get_dataset_infos("facebook/voxpopuli", repo_type="dataset") fails as below:
>>> import datasets
>>> datasets.get_dataset_infos("facebook/voxpopuli", repo_type="dataset")
README.md: 10.7kB [00:00, 14.4MB/s]
voxpopuli.py: 8.84kB [00:00, 14.0MB/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/scratch-artemis/anilkeshwani/miniconda3/envs/main/lib/python3.12/site-packages/datasets/inspect.py", line 94, in get_dataset_infos
config_name: get_dataset_config_info(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/scratch-artemis/anilkeshwani/miniconda3/envs/main/lib/python3.12/site-packages/datasets/inspect.py", line 277, in get_dataset_config_info
builder = load_dataset_builder(
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/scratch-artemis/anilkeshwani/miniconda3/envs/main/lib/python3.12/site-packages/datasets/load.py", line 1890, in load_dataset_builder
builder_instance: DatasetBuilder = builder_cls(
^^^^^^^^^^^^
File "/mnt/scratch-artemis/anilkeshwani/miniconda3/envs/main/lib/python3.12/site-packages/datasets/builder.py", line 342, in __init__
self.config, self.config_id = self._create_builder_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/scratch-artemis/anilkeshwani/miniconda3/envs/main/lib/python3.12/site-packages/datasets/builder.py", line 590, in _create_builder_config
raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.")
ValueError: BuilderConfig VoxpopuliConfig(name='en', version=1.3.0, data_dir=None, data_files=None, description=None) doesn't have a 'repo_type' key.
- huggingface_hub.list_repo_files - list files in a dataset repo without downloading / caching
from huggingface_hub import list_repo_files
# List all files in the dataset repo
files = list_repo_files("facebook/voxpopuli", repo_type="dataset")
for f in files:
print(f)
# Output (truncated)
# .gitattributes
# README.md
# data/cs/asr_dev.tsv
# data/cs/asr_test.tsv
# data/cs/asr_train.tsv
# data/cs/dev/dev_part_0.tar.gz
# data/cs/test/test_part_0.tar.gz
# data/cs/train/train_part_0.tar.gz
# data/cs/train/train_part_1.tar.gz
# ...
Hugging Face datasets 🤗 Resources
Conceptual guides:
Tutorials:
Excerpt from the huggingface_hub Environment variables documentation (v0.32.4)
HF_HUB_ENABLE_HF_TRANSFER
Set to True for faster uploads and downloads from the Hub using hf_transfer.
By default, huggingface_hub uses the Python-based requests.get and requests.post functions. Although these are reliable and versatile, they may not be the most efficient choice for machines with high bandwidth. hf_transfer is a Rust-based package developed to maximize the bandwidth used by dividing large files into smaller parts and transferring them simultaneously using multiple threads. This approach can potentially double the transfer speed. To use hf_transfer:
- Specify the hf_transfer extra when installing huggingface_hub (e.g. pip install huggingface_hub[hf_transfer]).
- Set HF_HUB_ENABLE_HF_TRANSFER=1 as an environment variable.
Please note that using hf_transfer comes with certain limitations. Since it is not purely Python-based, debugging errors may be challenging. Additionally, hf_transfer lacks several user-friendly features such as resumable downloads and proxies. These omissions are intentional to maintain the simplicity and speed of the Rust logic. Consequently, hf_transfer is not enabled by default in huggingface_hub.
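A sketch of enabling hf_transfer from Python rather than the shell; it assumes the hf_transfer extra is installed, and the hf_hub_download call and file are only illustrative. The flag is read from the environment when huggingface_hub is imported, so set it first:

```python
import os

# Assumes: pip install "huggingface_hub[hf_transfer]"
# Must be set before importing huggingface_hub, which reads the flag from the environment on import
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import hf_hub_download

# Illustrative download from a dataset repo used elsewhere on this page
path = hf_hub_download("facebook/voxpopuli", "README.md", repo_type="dataset")
print(path)
```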
hf_xet is an alternative to hf_transfer. It provides efficient file transfers through a chunk-based deduplication strategy, custom Xet storage (replacing Git LFS), and a seamless integration with huggingface_hub.
Read more about the package and enable with pip install "huggingface_hub[hf_xet]".
HF_XET_HIGH_PERFORMANCE
Set hf-xet to operate with increased settings to maximize network and disk resources on the machine. Enabling high performance mode will try to saturate the network bandwidth of this machine and utilize all CPU cores for parallel upload/download activity. Consider this analogous to setting HF_HUB_ENABLE_HF_TRANSFER=True when uploading / downloading using hf-xet to the Xet storage backend.
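Similarly, a hedged sketch of turning on high-performance mode for hf-xet; it assumes the hf_xet extra is installed and that the repo is served from Xet storage, and snapshot_download is just an example call:

```python
import os

# Assumes: pip install "huggingface_hub[hf_xet]"
# Set before importing huggingface_hub so the flag is picked up for subsequent transfers
os.environ["HF_XET_HIGH_PERFORMANCE"] = "1"

from huggingface_hub import snapshot_download

# Illustrative: download a whole (small) dataset repo snapshot
local_dir = snapshot_download("cornell-movie-review-data/rotten_tomatoes", repo_type="dataset")
print(local_dir)
```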