Loading Datasets with Hugging Face datasets v3.2.0
Dataset Loading methods
- datasets.load_dataset - Load a dataset from the Hugging Face Hub, or a local dataset
  - Returns Dataset or DatasetDict:
    - if split is not None: the dataset requested
    - if split is None: a DatasetDict with each split
  - or IterableDataset or IterableDatasetDict if streaming=True:
    - if split is not None: the dataset requested
    - if split is None: a datasets.streaming.IterableDatasetDict with each split
  - Under the hood:
    - Load a dataset builder
    - Run the dataset builder: download (if not local), process, and cache the dataset in typed Arrow tables
    - Return a dataset built from the splits requested in split (default: all splits)
  - example: load_dataset('cornell-movie-review-data/rotten_tomatoes', split='train')
- datasets.load_dataset(..., streaming=True) will load an IterableDataset
  - example: load_dataset('cornell-movie-review-data/rotten_tomatoes', split='train', streaming=True)
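A minimal sketch of these return types and of the under-the-hood builder steps, reusing the rotten_tomatoes dataset from the examples above; the load_dataset_builder / download_and_prepare / as_dataset calls mirror what load_dataset does internally, and the printed values are indicative only:

```python
from datasets import load_dataset, load_dataset_builder

# split given -> a single Dataset
train_ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

# split=None (the default) -> a DatasetDict keyed by split name
ds_dict = load_dataset("cornell-movie-review-data/rotten_tomatoes")
print(list(ds_dict.keys()))  # ['train', 'validation', 'test']

# streaming=True -> an IterableDataset; examples are fetched lazily, nothing is cached as Arrow
streamed = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
print(next(iter(streamed)))  # first example as a dict, e.g. {'text': ..., 'label': ...}

# Roughly the under-the-hood steps: load a builder, download and prepare (cache typed Arrow tables),
# then build a Dataset from the requested split
builder = load_dataset_builder("cornell-movie-review-data/rotten_tomatoes")
builder.download_and_prepare()
train_again = builder.as_dataset(split="train")
```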
- datasets.load_from_disk - load a dataset previously saved with save_to_disk (see the round-trip sketch after this list)
- datasets.get_dataset_split_names - get the list of available splits for a particular config and dataset
  - example: datasets.get_dataset_split_names("facebook/voxpopuli", "en") returns ['train', 'validation', 'test']
  - Note: In the facebook/voxpopuli example, it is necessary to specify the dataset config
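Since datasets.load_from_disk is only named above, here is a minimal round-trip sketch; the local path is illustrative:

```python
from datasets import load_dataset, load_from_disk

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
ds.save_to_disk("./rotten_tomatoes_train")  # writes Arrow files + metadata to this directory

# Reload later without re-downloading or re-processing
reloaded = load_from_disk("./rotten_tomatoes_train")
assert reloaded.num_rows == ds.num_rows
```

The same pattern works for a whole DatasetDict (save_to_disk / load_from_disk on the dict of splits).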
- datasets.get_dataset_config_names - Get the list of available config names for a particular dataset
  - see examples with "facebook/voxpopuli" or "facebook/multilingual_librispeech"
>>> from datasets import get_dataset_config_names
>>> get_dataset_config_names("facebook/voxpopuli")
Using the latest cached version of the module from /Users/anilkeshwani/.cache/huggingface/modules/datasets_modules/datasets/facebook--voxpopuli/b5ff837284f0778eefe0f642734e142d8c3f574eba8c9c8a4b13602297f73604 (last modified on Mon Aug 26 13:48:56 2024) since it couldn't be found locally at facebook/voxpopuli, or remotely on the Hugging Face Hub.
['en', 'de', 'fr', 'es', 'pl', 'it', 'ro', 'hu', 'cs', 'nl', 'fi', 'hr', 'sk', 'sl', 'et', 'lt', 'en_accented', 'multilang']
>>> get_dataset_config_names("facebook/multilingual_librispeech")
Downloading readme: 100%|████████████████████████| 18.1k/18.1k [00:00<00:00, 40.8kB/s]
Resolving data files: 100%|████████████████████████| 48/48 [00:05<00:00, 8.13it/s]
['dutch', 'french', 'german', 'italian', 'polish', 'portuguese', 'spanish']
- datasets.get_dataset_infos - Get the meta information about a dataset, returned as a dict mapping config name to DatasetInfoDict
>>> datasets.get_dataset_infos('cornell-movie-review-data/rotten_tomatoes')
README.md: 7.46kB [00:00, 13.3MB/s]
{'default': DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}, post_processed=None, supervised_keys=None, builder_name='parquet', dataset_name='rotten_tomatoes', config_name='default', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=1074810, num_examples=8530, shard_lengths=None, dataset_name=None), 'validation': SplitInfo(name='validation', num_bytes=134679, num_examples=1066, shard_lengths=None, dataset_name=None),
- Note: datasets.get_dataset_infos has bugs, e.g. datasets.get_dataset_infos("facebook/voxpopuli", repo_type="dataset") fails as below:
>>> import datasets
>>> datasets.get_dataset_infos("facebook/voxpopuli", repo_type="dataset")
README.md: 10.7kB [00:00, 14.4MB/s]
voxpopuli.py: 8.84kB [00:00, 14.0MB/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/scratch-artemis/anilkeshwani/miniconda3/envs/main/lib/python3.12/site-packages/datasets/inspect.py", line 94, in get_dataset_infos
config_name: get_dataset_config_info(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/scratch-artemis/anilkeshwani/miniconda3/envs/main/lib/python3.12/site-packages/datasets/inspect.py", line 277, in get_dataset_config_info
builder = load_dataset_builder(
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/scratch-artemis/anilkeshwani/miniconda3/envs/main/lib/python3.12/site-packages/datasets/load.py", line 1890, in load_dataset_builder
builder_instance: DatasetBuilder = builder_cls(
^^^^^^^^^^^^
File "/mnt/scratch-artemis/anilkeshwani/miniconda3/envs/main/lib/python3.12/site-packages/datasets/builder.py", line 342, in __init__
self.config, self.config_id = self._create_builder_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/scratch-artemis/anilkeshwani/miniconda3/envs/main/lib/python3.12/site-packages/datasets/builder.py", line 590, in _create_builder_config
raise ValueError(f"BuilderConfig {builder_config} doesn't have a '{key}' key.")
ValueError: BuilderConfig VoxpopuliConfig(name='en', version=1.3.0, data_dir=None, data_files=None, description=None) doesn't have a 'repo_type' key.
- huggingface_hub.list_repo_files - list files in a dataset repo without downloading / caching
from huggingface_hub import list_repo_files
# List all files in the dataset repo
files = list_repo_files("facebook/voxpopuli", repo_type="dataset")
for f in files:
print(f)
# Output (truncated)
# .gitattributes
# README.md
# data/cs/asr_dev.tsv
# data/cs/asr_test.tsv
# data/cs/asr_train.tsv
# data/cs/dev/dev_part_0.tar.gz
# data/cs/test/test_part_0.tar.gz
# data/cs/train/train_part_0.tar.gz
# data/cs/train/train_part_1.tar.gz
# ...
Hugging Face datasets 🤗 Resources
Conceptual guides:
Tutorials:
Excerpt from the huggingface_hub Environment variables documentation (v0.32.4)
HF_HUB_ENABLE_HF_TRANSFER
Set to True for faster uploads and downloads from the Hub using hf_transfer.
By default, huggingface_hub uses the Python-based requests.get and requests.post functions. Although these are reliable and versatile, they may not be the most efficient choice for machines with high bandwidth. hf_transfer is a Rust-based package developed to maximize the bandwidth used by dividing large files into smaller parts and transferring them simultaneously using multiple threads. This approach can potentially double the transfer speed. To use hf_transfer:
- Specify the hf_transfer extra when installing huggingface_hub (e.g. pip install huggingface_hub[hf_transfer]).
- Set HF_HUB_ENABLE_HF_TRANSFER=1 as an environment variable.
Please note that using hf_transfer comes with certain limitations. Since it is not purely Python-based, debugging errors may be challenging. Additionally, hf_transfer lacks several user-friendly features such as resumable downloads and proxies. These omissions are intentional to maintain the simplicity and speed of the Rust logic. Consequently, hf_transfer is not enabled by default in huggingface_hub.
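A sketch of enabling hf_transfer from Python rather than the shell; it assumes the hf_transfer extra is installed, and the hf_hub_download call and file are only illustrative. The flag is read from the environment when huggingface_hub is imported, so set it first:

```python
import os

# Assumes: pip install "huggingface_hub[hf_transfer]"
# Must be set before importing huggingface_hub, which reads the flag from the environment on import
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import hf_hub_download

# Illustrative download from a dataset repo used elsewhere on this page
path = hf_hub_download("facebook/voxpopuli", "README.md", repo_type="dataset")
print(path)
```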
hf_xet is an alternative to hf_transfer. It provides efficient file transfers through a chunk-based deduplication strategy, custom Xet storage (replacing Git LFS), and a seamless integration with huggingface_hub.
Read more about the package and enable with pip install "huggingface_hub[hf_xet]".
HF_XET_HIGH_PERFORMANCE
Set hf-xet to operate with increased settings to maximize network and disk resources on the machine. Enabling high performance mode will try to saturate the network bandwidth of this machine and utilize all CPU cores for parallel upload/download activity. Consider this analogous to setting HF_HUB_ENABLE_HF_TRANSFER=True when uploading / downloading using hf-xet to the Xet storage backend.
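Similarly, a hedged sketch of turning on high-performance mode for hf-xet; it assumes the hf_xet extra is installed and that the repo is served from Xet storage, and snapshot_download is just an example call:

```python
import os

# Assumes: pip install "huggingface_hub[hf_xet]"
# Set before importing huggingface_hub so the flag is picked up for subsequent transfers
os.environ["HF_XET_HIGH_PERFORMANCE"] = "1"

from huggingface_hub import snapshot_download

# Illustrative: download a whole (small) dataset repo snapshot
local_dir = snapshot_download("cornell-movie-review-data/rotten_tomatoes", repo_type="dataset")
print(local_dir)
```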