RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models
Excerpt
Over the last half year, we have been pleased to see that RedPajama-1T, which we released in March, has ignited the creation of many new language models. People from the community have downloaded this 5TB dataset more than 190,000 times and have been using it in creative ways! RedPajama-1T consists of 1 trillion high-quality English tokens, but it was only the first step. Today, with the release of RedPajama-V2, we are taking a further step towards the development of open datasets by releasing a massive, 30 trillion token web dataset. This is, to the best of our knowledge, the largest public dataset released specifically for LLM training. Even more excitingly, we include 40+ pre-computed quality annotations, allowing the community to further filter and weight the data. Specifically, this release includes:
See also SlimPajama
Notes
- We provide our best-effort implementations of the quality annotations used in C4, Gopher, the Pretrainer's Guide, RefinedWeb, and Data Selection for Language Models via Importance Resampling.
- Quality signals indicating how natural a given piece of text is. These include simple heuristic measures such as the number of sentences, the number of words, and the fraction of all-caps words, among others.
- Quality signals indicating how repetitive a given piece of text is. Here we follow the Gopher rules (Rae et al.) and compute the fraction of characters that appear in duplicated word n-grams, as well as the fraction of characters contained in the most frequent word n-gram appearing in the document.
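As a rough sketch of the heuristic signals above (word count, all-caps fraction, and a Gopher-style duplicated n-gram fraction), assuming word 2-grams as the unit and with illustrative signal names, not the exact RedPajama implementation:

```python
from collections import Counter

def heuristic_signals(text: str) -> dict:
    """Toy versions of the heuristic quality signals described above
    (illustrative names; not the exact RedPajama implementation)."""
    words = text.split()
    num_words = len(words)
    total_chars = sum(len(w) for w in words)

    # Naturalness heuristic: fraction of all-caps words.
    frac_all_caps = sum(w.isupper() for w in words) / max(num_words, 1)

    # Gopher-style repetition: fraction of characters covered by word
    # 2-grams that occur more than once in the document.
    bigrams = [(words[i], words[i + 1]) for i in range(num_words - 1)]
    counts = Counter(bigrams)
    covered = [False] * num_words
    for i, bg in enumerate(bigrams):
        if counts[bg] > 1:
            covered[i] = covered[i + 1] = True
    dup_chars = sum(len(w) for w, c in zip(words, covered) if c)

    return {
        "num_words": num_words,
        "frac_all_caps": frac_all_caps,
        "frac_dup_2gram_chars": dup_chars / max(total_chars, 1),
    }
```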
- Content-based quality signals take the content into account, such as the density of words appearing in a list of blocked words (similar to C4), or whether a document comes from a domain flagged as containing potentially harmful or otherwise offensive content.
- ML-based quality signals revolve around the idea of measuring how similar a given text is to a high-quality domain. Here we use fastText classifiers trained on various high-quality domains such as Wikipedia, as well as importance weights as proposed by Xie et al.
- Deduplication signals with pre-computed MinHash signatures (with 128 permutations), which can be used for fuzzy deduplication at different degrees.
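A minimal MinHash sketch, assuming lowercased word unigrams as the shingle unit (production pipelines typically shingle word n-grams; the seeded-SHA-1 hashing scheme here is illustrative, not the pipeline's actual one):

```python
import hashlib

NUM_PERM = 128  # matches the 128 permutations used for the signatures

def minhash_signature(text: str, num_perm: int = NUM_PERM) -> list[int]:
    """MinHash over lowercased word unigrams: for each of num_perm seeded
    hash functions, keep the minimum hash value over the token set."""
    tokens = set(text.lower().split())
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{tok}".encode()).digest()[:8], "big")
            for tok in tokens
        )
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Fuzzy deduplication then discards documents whose estimated similarity to an already-seen document exceeds a threshold; varying that threshold gives the "different degrees" of deduplication.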
- Only the head and middle partitions are kept (the tail is retained in case you want to use it). This "reduces the token count by 60%, the number of documents decreases disproportionately more by 71%, indicating that the tail documents are generally shorter." In other words, head+middle make up ~40% of the tokens but only 29% of the documents, so they are longer on average.
- The head+middle documents were further deduplicated using a Bloom filter, which reduces the dataset size by roughly 40%.
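A Bloom filter makes this kind of dedup pass possible with fixed memory, at the cost of a small false-positive rate. A generic sketch, not the pipeline's actual implementation or parameters:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; the sizes here are illustrative defaults."""
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def dedup(docs):
    """Yield each distinct document once (up to false positives)."""
    seen = BloomFilter()
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            yield doc
```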
- Dataset Structure:
- The core of the dataset is composed of the text documents, accompanied by the quality annotations and deduplication clusters. The structure largely follows the one defined by CCNet.
```json
{
  "id": "2018-43/0000/en_head.json.gz/0",
  "id_int": 7972430436813205988,
  "metadata": {
    "cc_segment": "crawl-data/...",
    "cc_net_source": "2018-43/0000/en_head.json.gz",
    "url": "...",
    "source_domain": "...",
    "language": "en",
    "snapshot_id": "2018-43"
  },
  "quality_signals": {
    "ccnet_original_length": [[0, 7033, 8711.0]],
    "...": "...",
    "rps_doc_stop_word_fraction": [[0, 7033, 0.45121107]],
    "rps_lines_num_words": [[0, 25, 2], ..., [6980, 7033, 10]]
  }
}
```
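Each quality signal is stored as a list of `[start, end, value]` spans over the document: document-level signals hold a single span, while line-level signals such as `rps_lines_num_words` hold one span per line. A small sketch of reading and filtering on these spans; the record values and the 0.3 threshold are made up for illustration:

```python
# A record in the (abbreviated) structure shown above; the field names
# follow the example, but the values here are made up for illustration.
record = {
    "id": "2018-43/0000/en_head.json.gz/0",
    "quality_signals": {
        "rps_doc_stop_word_fraction": [[0, 7033, 0.45121107]],
        "rps_lines_num_words": [[0, 25, 2], [6980, 7033, 10]],
    },
}

def doc_level(record: dict, signal: str) -> float:
    """Document-level signals hold a single [start, end, value] span;
    return its value."""
    spans = record["quality_signals"][signal]
    return spans[0][2]

def keep(record: dict, min_stop_word_fraction: float = 0.3) -> bool:
    # C4-style filter: drop documents with too few stop words
    # (the threshold is an assumption, not a RedPajama default).
    return doc_level(record, "rps_doc_stop_word_fraction") >= min_stop_word_fraction
```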
- Thank you to the OLMo team at AI2 and friends at OpenGPT-X for the insightful discussions about datasets and data quality! Thanks also to everyone who builds on the RedPajama dataset, including Cerebras for their SlimPajama efforts, and to the open-source AI community for the 500+ models built on RedPajama to date.
- We are grateful to the great team at EleutherAI for paving the path on open training datasets with The Pile and for open-sourcing code we use in training some of the RedPajama models.
Their handling of Common Crawl follows CCNet (Extracting High Quality Monolingual Datasets from Web Crawl Data), "an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages" (quoting from their abstract).