LAION2B Dataset
The LAION2B and projections
- About the LAION5B
- Subsets
- 768d clip embeddings (clip768)
- 32d PCA projections (pca32)
- 96d PCA projections (pca96)
- 1024-bit binary sketches (hamming)
- Gold standard list (computed with 32-bit floating point arithmetic, 100 nearest neighbors)
- Gold standard for public queries (computed with 64-bit IEEE floating point arithmetic, 1000 nearest neighbors)
- Gold standard for private queries (computed with 64-bit IEEE floating point arithmetic, 1000 nearest neighbors)
- Associated captions and image urls (tabular delimited files)
About the LAION5B
The LAION5B dataset is an openly available image collection that has been used for learning very large visual and language deep-neural models; for instance, the famed stable diffusion generative model used it as the training set. The collection equips each image with a URL handle, allowing people to showcase demonstrations easily.
A more detailed description can be found here:
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., ... & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.
The English subset, often called LAION2B, contains over 2 billion objects.
Subset of the challenge
The dataset is divided into parts containing close to 1M vectors each. We selected the first 112 parts (0000 to 0111); we used the first part to extract the public query set and the rest to extract the database. The subset uses approximately 160GB of space and its associated metadata 20GB (the first 112 parts). Embeddings are distributed as half-precision (16-bit) floating point vectors bundled in the NumPy-specific format `.npz`. They can be loaded on most platforms due to the format's popularity.
The challenge has three database subsets, plus the public query set:
- 10M subset: concatenation of parts 1-11.
- 30M subset: concatenation of parts 1-33.
- 100M subset: concatenation of parts 1-111.
- public queries: computed from part 0.
All parts should be concatenated in order, removing NSFW entries (marked in the metadata files).
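The concatenation rule above can be sketched in a few lines. This is only an illustration under stated assumptions: `SUBSET_PARTS`, `part_name`, and `concat_without_nsfw` are hypothetical names, and each part is assumed to be already loaded in memory as a pair of equally long sequences (vectors and a boolean "drop this row" flag derived from the metadata). How the NSFW mark is actually encoded in the metadata files is not specified here.

```python
# Parts used by each subset, as listed above (part 0 is reserved for the public queries).
SUBSET_PARTS = {
    "10M": range(1, 12),    # parts 1-11
    "30M": range(1, 34),    # parts 1-33
    "100M": range(1, 112),  # parts 1-111
}


def part_name(i):
    """Four-digit part label, e.g. 7 -> '0007' (parts are numbered 0000 to 0111)."""
    return f"{i:04d}"


def concat_without_nsfw(parts):
    """Concatenate parts in the given order, dropping flagged rows.

    `parts` is an iterable of (vectors, nsfw_flags) pairs of equal length.
    The boolean flag is an assumption made for this sketch.
    """
    kept = []
    for vectors, flags in parts:
        if len(vectors) != len(flags):
            raise ValueError("vectors and flags must have the same length")
        kept.extend(v for v, flagged in zip(vectors, flags) if not flagged)
    return kept
```

For example, `[part_name(i) for i in SUBSET_PARTS["10M"]]` yields the labels `0001` to `0011`, in the order in which the parts should be concatenated.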
- Note 1: You will get 768-dimensional 16-bit floating point vectors, which may be converted to a 32-bit format to get full speed on legacy hardware.
- Note 2: Our gold standards were computed using $\ell_2$-normalized vectors (i.e., unit norm) and $1-\cos(\cdot, \cdot)$ as the distance function.
- Note 3: Our gold-standard `.h5` files contain the 100 nearest neighbors of each query using two associated matrices, `knns` and `dists`, i.e., columns correspond to queries and rows to the nearest neighbors of each query.
  - The `knns` identifiers start indexing at 1.
  - The `dists` matrix contains raw distance values for each corresponding query and object, i.e., $1-\cos(\cdot, \cdot)$; please consider that this is not a proper metric distance. People relying on metric properties can use the angle instead with minor changes.
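As a rough illustration of the notes above, the following sketch normalizes a vector to unit norm, computes the $1-\cos(\cdot, \cdot)$ distance as one minus the dot product, converts such a value to an angle, and shifts 1-based `knns` identifiers to 0-based indices. The function names are hypothetical and plain Python lists are used to keep the example dependency-free; a real pipeline would more likely use array libraries.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_distance(u, v):
    """1 - cos(u, v); for unit-norm inputs this is one minus the dot product."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


def angle_distance(d):
    """Convert a 1 - cos value into the angle in radians.

    The angle between unit vectors satisfies the triangle inequality,
    unlike the raw 1 - cos value. Clamping guards against rounding error.
    """
    c = max(-1.0, min(1.0, 1.0 - d))
    return math.acos(c)


def to_zero_based(knn_ids):
    """Gold-standard identifiers start at 1; shift them for 0-based indexing."""
    return [i - 1 for i in knn_ids]
```

Since the angle is a monotone function of $1-\cos(\cdot, \cdot)$, switching between the two does not change the order of the nearest neighbors.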
Subsets
We provide access to different subsets of the dataset and also created three lower-dimensional projections that can be used. In particular, we computed two PCA projections using 32 and 96 dimensions and one more projection into binary sketches designed to work with the bit-level Hamming distance (using 1024 bits). Find below the URLs to download these bundles.
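For readers unfamiliar with bit-level Hamming distance, here is a minimal sketch for two 1024-bit sketches. It assumes each sketch is available as 128 raw bytes; the actual storage layout inside the `.h5` bundles is not described here, so the packing and the name `hamming` are assumptions made for illustration.

```python
SKETCH_BITS = 1024
SKETCH_BYTES = SKETCH_BITS // 8  # 128 bytes per sketch


def hamming(a, b):
    """Number of differing bits between two 1024-bit sketches given as bytes."""
    if len(a) != SKETCH_BYTES or len(b) != SKETCH_BYTES:
        raise ValueError("each sketch must be exactly 128 bytes")
    # XOR leaves a 1 wherever the sketches disagree; then count the ones.
    x = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return bin(x).count("1")
```

The distance ranges from 0 (identical sketches) to 1024 (every bit differs). Optimized implementations typically work on 64-bit words with hardware population-count instructions instead.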
768d clip embeddings (clip768)
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-clip768v2-n=100M.h5 | 100M subset | 147G | 9d8ee3347b1edf136b3ef38162ac05c3 |
laion2B-en-clip768v2-n=30M.h5 | 30M subset | 44G | 15a24d28d2304e14711e23baf7fe86a4 |
laion2B-en-clip768v2-n=10M.h5 | 10M subset | 15G | c05e4b1d2b2a0c7663ac9767753e25e1 |
laion2B-en-clip768v2-n=300K.h5 | 300K subset, for developing purposes | 440M | d238b4b037c32bae41e497f95dffa895 |
laion2B-en-clip768v2-n=100K.h5 | 100K subset, for developing purposes | 147M | daef38a64e3cd1c5233231f8be882a64 |
public-queries-10k-clip768v2.h5 | 10k public query set (original 768d embeddings) | 30M | 257b9eb3f7f25776e0d33b22451b7b32 |
private-queries-10k-clip768v2.h5 | 10k private query set (original 768d embeddings) | 30M | f8f3e61bd22d7d64234a0f587ead9fcf |
32d PCA projections (pca32)
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-pca32v2-n=100M.h5 | 100M subset | 13G | 02c5726ba41cbfd3320d75ad113ef008 |
laion2B-en-pca32v2-n=30M.h5 | 30M subset | 3.7G | cf34551e4a80689a155052de640874b1 |
laion2B-en-pca32v2-n=10M.h5 | 10M subset | 1.3G | 799dfd317976012a9b768aea123ce6b0 |
laion2B-en-pca32v2-n=300K.h5 | 300K subset, for developing purposes | 37M | aeffa3290eedd6063f138d5a81489128 |
laion2B-en-pca32v2-n=100K.h5 | 100K subset, for developing purposes | 13M | 45a6c4e3774430d6318f808b43053895 |
public-queries-10k-pca32v2.h5 | 10k public query set for 32d PCA projection | 1.3M | 8c0fa4fff523d6263a246f7553d2b92f |
private-queries-10k-pca32v2.h5 | 10k private query set for 32d PCA projection | 1.3M | 57dc078229325b6c161521512585738e |
96d PCA projections (pca96)
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-pca96v2-n=100M.h5 | 100M subset | 37G | 715c1f5bfa3da61eaf5e2e8735052043 |
laion2B-en-pca96v2-n=30M.h5 | 30M subset | 11G | 17b783ca3714b4b8084d93d59bac4611 |
laion2B-en-pca96v2-n=10M.h5 | 10M subset | 3.7G | 4f2520b152929bcd34fb3912d4db025e |
laion2B-en-pca96v2-n=300K.h5 | 300K subset, for developing purposes | 110M | 97faba380163a5ec2e1a441c3a6d21b6 |
laion2B-en-pca96v2-n=100K.h5 | 100K subset, for developing purposes | 37M | 73d464eccd6a6695d1f78f67bfbc7b46 |
public-queries-10k-pca96v2.h5 | 10k public query set for 96d PCA projection | 3.7M | f7d0b77f336f8f63803ddb59b4d4b8ed |
private-queries-10k-pca96v2.h5 | 10k private query set for 96d PCA projection | 3.7M | 301330e6d3963dd2db923fd4e858aa4e |
1024-bit binary sketches (hamming)
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-hammingv2-n=100M.h5 | 100M subset | 13G | 36030a46f0792d8c520b85a39ea64dfc |
laion2B-en-hammingv2-n=30M.h5 | 30M subset | 3.7G | 9f438fd469e21313684f191d375c63ed |
laion2B-en-hammingv2-n=10M.h5 | 10M subset | 1.3G | 13a28c054a351c2b2cdd8fd918b006ed |
laion2B-en-hammingv2-n=300K.h5 | 300K subset, for developing purposes | 37M | 03533c23fcc18c806cd42653e46fda89 |
laion2B-en-hammingv2-n=100K.h5 | 100K subset, for developing purposes | 13M | 0dcb6fc72284439f67debcb34080b282 |
public-queries-10k-hammingv2.h5 | 10k public query set for 1024-bit binary sketch projection | 1.3M | cd93f7bf61a436b5a45d0b3e1a002667 |
private-queries-10k-pca96v2.h5 | 10k private query set for 1024-bit binary sketch projection | 3.7M | 301330e6d3963dd2db923fd4e858aa4e |
Gold standard list (computed with 32-bit floating point arithmetic, 100 nearest neighbors)
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-public-gold-standard-v2-100M.h5 | 100M gold standard | 7.7M | 35de58992c6446c85c56e710b144c90c |
laion2B-en-public-gold-standard-v2-30M.h5 | 30M gold standard | 7.7M | 1726691372d2f62d7b0b97d8bf4f6189 |
laion2B-en-public-gold-standard-v2-10M.h5 | 10M gold standard | 7.7M | b68b17693253d95e1fc94c217af25e95 |
laion2B-en-public-gold-standard-v2-300K.h5 | 300K gold standard | 7.7M | 258654f2a34a1bdbfa031862b4e6cfae |
laion2B-en-public-gold-standard-v2-100K.h5 | 100K gold standard | 7.7M | fe39725772f487e4c86af68e18e87c88 |
Gold standard for public queries (computed with 64-bit IEEE floating point arithmetic, 1000 nearest neighbors)
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-public-gold-standard-v2-100M-F64-IEEE754.h5 | 100M gold standard | 77M | 59321e7e33b5469a5b435ff11305257f |
laion2B-en-public-gold-standard-v2-30M-F64-IEEE754.h5 | 30M gold standard | 77M | a445f32702aa43a176b56c54bf3f03f9 |
laion2B-en-public-gold-standard-v2-10M-F64-IEEE754.h5 | 10M gold standard | 77M | 45b05e4d60b8a66088b378ae7e0d278f |
laion2B-en-public-gold-standard-v2-300K-F64-IEEE754.h5 | 300K gold standard | 77M | 5d635f26630cced971358fd76f37c32e |
Gold standard for private queries (computed with 64-bit IEEE floating point arithmetic, 1000 nearest neighbors)
dataset | description | size | md5 |
---|---|---|---|
laion2B-en-private-gold-standard-v2-10M-F64-IEEE754.h5 | 10M private gold standard | 783K | f384beecb5dddcddca8efc00a7fcd911 |
laion2B-en-private-gold-standard-v2-30M-F64-IEEE754.h5 | 30M private gold standard | 783K | 3b43d7b1251bd1387419a245bec8ba55 |
laion2B-en-private-gold-standard-v2-100M-F64-IEEE754.h5 | 100M private gold standard | 783K | 0ab272fd7b0eee8beec378e67da85b65 |
Associated captions and image urls (tabular delimited files)
dataset | description | size | md5 |
---|---|---|---|
meta-10M.tsv | metadata for the 10M subset | 1.8G | a9abbe13fb19207fb240f74fc03e2476 |
meta-30M.tsv | metadata for the 30M subset | 5.2G | a3205400411f6b82c8748e1a187d87fb |
meta-100M.tsv | metadata for the 100M subset | 18G | 323d0cf4cf22ae6edbc71e18e8110100 |
For instance, you can download the 10M subset and the query set using the following commands from a typical Linux terminal:
curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/laion2B-en-clip768v2-n=10M.h5
curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/public-queries-10k-clip768v2.h5
People like demonstrations, and that is where the metadata comes in again. You can also download a subset of the associated metadata using the following command:
curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/meta-10M.tsv
Please review the simple Jupyter-based demo to see how it can be used.
The `-C -` flags can be added to the `curl` commands if you need to resume a broken download.
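Since the tables above list an md5 checksum for every bundle, a downloaded (or resumed) file can be checked before use. The following is a small sketch using Python's standard `hashlib`; the helper names `md5sum` and `verify` are hypothetical, and the file is read in chunks so that multi-gigabyte bundles do not need to fit in memory.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Hex md5 digest of a file, read in 1 MiB chunks by default."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """True if the file's md5 matches the value listed in the tables."""
    return md5sum(path) == expected_md5.strip().lower()
```

For example, after downloading the 10M metadata file, `verify("meta-10M.tsv", "a9abbe13fb19207fb240f74fc03e2476")` should return `True` if the download is complete and intact.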
The metadata for the 100K and 300K subsets does not correspond to the first 100K and 300K elements of the large subsets. More precisely, the 100K and 300K subsets include records with missing NSFW values, while the large subsets remove them.
Note that our projection models were trained with our 10M subset. Other approaches may vary the resulting quality.
Note: Projections will reduce the result's quality with respect to the original embeddings, but you can use these datasets to quickly prototype your solution and for hyperparameter optimization. Please email us if you are interested in the associated metadata (which can also be obtained as described in the rest of the document).
The original dataset can be downloaded and processed to get different subsets as described in the downloading and preprocessing LAION page. We encourage challenge participants to use the provided bundles for consistency reasons.