LAION2B Dataset

Excerpt

The LAION5B dataset is an openly available image collection that has been used to train very large vision and language deep neural models; for instance, the well-known Stable Diffusion generative model used it as its training set. Each image comes with its URL, which makes it easy to build demonstrations.

A more detailed description can be found here:

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., … & Jitsev, J. (2022). Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402.

The English subset, often called LAION2B, contains over 2 billion objects.

Subset of the challenge

The dataset is divided into parts of close to 1M vectors each. We selected the first 112 parts (0000 to 0111): the first part (0000) was used to extract the public query set and the remaining parts to extract the database. These 112 parts use approximately 160GB of space, plus 20GB for the associated metadata. Embeddings are distributed as half-precision (16-bit) floating point vectors bundled in the NumPy-specific format .npz, which can be loaded on most platforms thanks to the format's popularity.

The challenge has three database subsets and one public query set:

10M subset: concatenation of parts 1-11.
30M subset: concatenation of parts 1-33.
100M subset: concatenation of parts 1-111.
public queries: computed from part 0.

All parts should be concatenated in order, and NSFW entries (marked in the metadata files) should be removed.
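The concatenation rule above can be sketched as follows. This is a minimal illustration on toy arrays, not the organizers' preprocessing script: the function name `build_subset` and the boolean `nsfw_masks` are hypothetical stand-ins for whatever flags the metadata files actually provide.

```python
import numpy as np

def build_subset(parts, nsfw_masks):
    """Concatenate parts in order, dropping rows flagged as NSFW.

    parts      -- list of (n_i, d) arrays, already sorted by part number
    nsfw_masks -- list of boolean arrays of length n_i (True = NSFW);
                  a hypothetical stand-in for the flags in the metadata files
    """
    kept = [part[~mask] for part, mask in zip(parts, nsfw_masks)]
    return np.concatenate(kept, axis=0)

# Toy data: three "parts" of 4 vectors each, 2 dimensions, float16 like the originals.
parts = [np.full((4, 2), i, dtype=np.float16) for i in (1, 2, 3)]
masks = [np.array([False, True, False, False]) for _ in parts]
subset = build_subset(parts, masks)
print(subset.shape)  # (9, 2)
```

Filtering each part before concatenating keeps the surviving rows in their original order, so row i of the subset still lines up with row i of the identically filtered metadata.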

Note 1: You will get 768 dimensional 16-bit floating point vectors that may be changed to a 32-bit format to get full speed on legacy hardware.
Note 2: Our gold standards were computed using $L_{2}$-normalized vectors (i.e., unit norm) and $1 - \cos(\cdot, \cdot)$ as the distance function.
Note 3: Our gold-standard .h5 files contain the 100 nearest neighbors of each query in two associated matrices, knns and dists, where columns correspond to queries and rows to the nearest neighbors of each query.
- The knns identifiers start indexing at 1.
- The dists matrix contains the raw distance values for each corresponding query and object, i.e., $1 - \cos(\cdot, \cdot)$; please note that this is not a proper metric distance. People relying on metric properties can use the angle instead with minor changes.
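The conventions in these notes can be illustrated with a small NumPy sketch. It assumes the vectors are already in memory as arrays; the helper names (`normalize`, `cosine_distances`, `angles`) are made up for this example, and the toy identifiers only emulate the 1-based numbering of knns rather than reading a real gold-standard file. Here queries are laid out in rows, which is a choice of the sketch.

```python
import numpy as np

def normalize(X):
    """Convert to float32 and scale every row to unit L2 norm."""
    X = np.asarray(X, dtype=np.float32)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def cosine_distances(Q, X):
    """1 - cos for every (query, object) pair; rows must already be unit norm."""
    return 1.0 - Q @ X.T

def angles(dists):
    """Turn 1 - cos values into angles in radians (a proper metric on the sphere)."""
    return np.arccos(np.clip(1.0 - dists, -1.0, 1.0))

# Toy float16 data, converted to float32 during normalization.
db = normalize(np.array([[1, 0], [0, 2], [3, 3]], dtype=np.float16))
queries = normalize(np.array([[2, 0]], dtype=np.float16))

d = cosine_distances(queries, db)        # shape (1, 3): one row per query
theta = angles(d)                        # same shape, in radians

# Emulate the gold standard's 1-based identifiers, then shift to 0-based
# before using them as NumPy indices.
knns_one_based = np.argsort(d, axis=1) + 1
knns_zero_based = knns_one_based - 1
```

The clip inside `angles` guards against rounding errors that push the cosine slightly outside [-1, 1].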

Subsets

We provide access to different subsets of the dataset, along with three lower-dimensional projections of it. In particular, we computed two PCA projections, using 32 and 96 dimensions, and one projection into binary sketches designed to work with the bit-level Hamming distance (using 1024 bits). Find below the URLs to download these bundles.
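For the binary sketches, the bit-level Hamming distance could be computed along these lines. This is a sketch under an assumed layout, with each 1024-bit sketch packed as 128 uint8 values per row; the actual word size stored in the .h5 files may differ, and the function name `hamming_distances` is made up for this example.

```python
import numpy as np

def hamming_distances(query, sketches):
    """Number of differing bits between one sketch and each row of `sketches`.

    Both arguments hold 1024-bit sketches packed as 128 uint8 values per row
    (an assumed layout; adapt the view if the file stores wider words).
    """
    diff = np.bitwise_xor(sketches, query)          # differing bits become 1
    return np.unpackbits(diff, axis=1).sum(axis=1)  # count set bits per row

rng = np.random.default_rng(0)
sketches = rng.integers(0, 256, size=(5, 128), dtype=np.uint8)

query = sketches[2].copy()
query[0] ^= 0b00000111  # flip three bits of the first byte
dists = hamming_distances(query, sketches)
print(dists[2])  # 3
```

Unpacking bits is simple but memory hungry; on large subsets a popcount lookup table or a native popcount instruction would be the usual replacement.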

768d clip embeddings (clip768)

dataset	description	size	md5
laion2B-en-clip768v2-n=100M.h5	100M subset	147G	9d8ee3347b1edf136b3ef38162ac05c3
laion2B-en-clip768v2-n=30M.h5	30M subset	44G	15a24d28d2304e14711e23baf7fe86a4
laion2B-en-clip768v2-n=10M.h5	10M subset	15G	c05e4b1d2b2a0c7663ac9767753e25e1
laion2B-en-clip768v2-n=300K.h5	300K subset, for developing purposes	440M	d238b4b037c32bae41e497f95dffa895
laion2B-en-clip768v2-n=100K.h5	100K subset, for developing purposes	147M	daef38a64e3cd1c5233231f8be882a64
public-queries-10k-clip768v2.h5	10k public query set (original 768d embeddings)	30M	257b9eb3f7f25776e0d33b22451b7b32
private-queries-10k-clip768v2.h5	10k private query set (original 768d embeddings)	30M	f8f3e61bd22d7d64234a0f587ead9fcf

32d PCA projections (pca32)

dataset	description	size	md5
laion2B-en-pca32v2-n=100M.h5	100M subset	13G	02c5726ba41cbfd3320d75ad113ef008
laion2B-en-pca32v2-n=30M.h5	30M subset	3.7G	cf34551e4a80689a155052de640874b1
laion2B-en-pca32v2-n=10M.h5	10M subset	1.3G	799dfd317976012a9b768aea123ce6b0
laion2B-en-pca32v2-n=300K.h5	300K subset, for developing purposes	37M	aeffa3290eedd6063f138d5a81489128
laion2B-en-pca32v2-n=100K.h5	100K subset, for developing purposes	13M	45a6c4e3774430d6318f808b43053895
public-queries-10k-pca32v2.h5	10k public query set for 32d PCA projection	1.3M	8c0fa4fff523d6263a246f7553d2b92f
private-queries-10k-pca32v2.h5	10k private query set for 32d PCA projection	1.3M	57dc078229325b6c161521512585738e

96d PCA projections (pca96)

dataset	description	size	md5
laion2B-en-pca96v2-n=100M.h5	100M subset	37G	715c1f5bfa3da61eaf5e2e8735052043
laion2B-en-pca96v2-n=30M.h5	30M subset	11G	17b783ca3714b4b8084d93d59bac4611
laion2B-en-pca96v2-n=10M.h5	10M subset	3.7G	4f2520b152929bcd34fb3912d4db025e
laion2B-en-pca96v2-n=300K.h5	300K subset, for developing purposes	110M	97faba380163a5ec2e1a441c3a6d21b6
laion2B-en-pca96v2-n=100K.h5	100K subset, for developing purposes	37M	73d464eccd6a6695d1f78f67bfbc7b46
public-queries-10k-pca96v2.h5	10k public query set for 96d PCA projection	3.7M	f7d0b77f336f8f63803ddb59b4d4b8ed
private-queries-10k-pca96v2.h5	10k private query set for 96d PCA projection	3.7M	301330e6d3963dd2db923fd4e858aa4e

1024-bit binary sketches (hamming)

dataset	description	size	md5
laion2B-en-hammingv2-n=100M.h5	100M subset	13G	36030a46f0792d8c520b85a39ea64dfc
laion2B-en-hammingv2-n=30M.h5	30M subset	3.7G	9f438fd469e21313684f191d375c63ed
laion2B-en-hammingv2-n=10M.h5	10M subset	1.3G	13a28c054a351c2b2cdd8fd918b006ed
laion2B-en-hammingv2-n=300K.h5	300K subset, for developing purposes	37M	03533c23fcc18c806cd42653e46fda89
laion2B-en-hammingv2-n=100K.h5	100K subset, for developing purposes	13M	0dcb6fc72284439f67debcb34080b282
public-queries-10k-hammingv2.h5	10k public query set for 1024-bit binary sketch projection	1.3M	cd93f7bf61a436b5a45d0b3e1a002667
private-queries-10k-pca96v2.h5	10k private query set for 1024-bit binary sketch projection	3.7M	301330e6d3963dd2db923fd4e858aa4e

Gold standard list (computed with 32-bit floating point arithmetic, 100 nearest neighbors)

dataset	description	size	md5
laion2B-en-public-gold-standard-v2-100M.h5	100M gold standard	7.7M	35de58992c6446c85c56e710b144c90c
laion2B-en-public-gold-standard-v2-30M.h5	30M gold standard	7.7M	1726691372d2f62d7b0b97d8bf4f6189
laion2B-en-public-gold-standard-v2-10M.h5	10M gold standard	7.7M	b68b17693253d95e1fc94c217af25e95
laion2B-en-public-gold-standard-v2-300K.h5	300K gold standard	7.7M	258654f2a34a1bdbfa031862b4e6cfae
laion2B-en-public-gold-standard-v2-100K.h5	100K gold standard	7.7M	fe39725772f487e4c86af68e18e87c88

Gold standard for public queries (computed with 64-bit IEEE floating point arithmetic, 1000 nearest neighbors)

dataset	description	size	md5
laion2B-en-public-gold-standard-v2-100M-F64-IEEE754.h5	100M gold standard	77M	59321e7e33b5469a5b435ff11305257f
laion2B-en-public-gold-standard-v2-30M-F64-IEEE754.h5	30M gold standard	77M	a445f32702aa43a176b56c54bf3f03f9
laion2B-en-public-gold-standard-v2-10M-F64-IEEE754.h5	10M gold standard	77M	45b05e4d60b8a66088b378ae7e0d278f
laion2B-en-public-gold-standard-v2-300K-F64-IEEE754.h5	300K gold standard	77M	5d635f26630cced971358fd76f37c32e

Gold standard for private queries (computed with 64-bit IEEE floating point arithmetic, 1000 nearest neighbors)

dataset	description	size	md5
laion2B-en-private-gold-standard-v2-10M-F64-IEEE754.h5	10M private gold standard	783K	f384beecb5dddcddca8efc00a7fcd911
laion2B-en-private-gold-standard-v2-30M-F64-IEEE754.h5	30M private gold standard	783K	3b43d7b1251bd1387419a245bec8ba55
laion2B-en-private-gold-standard-v2-100M-F64-IEEE754.h5	100M private gold standard	783K	0ab272fd7b0eee8beec378e67da85b65

Associated captions and image urls (tabular delimited files)

dataset	description	size	md5
meta-10M.tsv	metadata for the 10M subset	1.8G	a9abbe13fb19207fb240f74fc03e2476
meta-30M.tsv	metadata for the 30M subset	5.2G	a3205400411f6b82c8748e1a187d87fb
meta-100M.tsv	metadata for the 100M subset	18G	323d0cf4cf22ae6edbc71e18e8110100

For instance, you can download the 10M subset and the query set using the following commands from a typical linux terminal:

curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/laion2B-en-clip768v2-n=10M.h5
curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/public-queries-10k-clip768v2.h5

People like demonstrations, and that is where the metadata comes in again. You can also download a subset of the associated metadata using the following command:

curl -O https://sisap-23-challenge.s3.amazonaws.com/SISAP23-Challenge/meta-10M.tsv

Please review the simple jupyter-based demo to see how it can be used.

The -C - flags can be added if you need to resume a broken download.
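Since the tables list an md5 checksum for every bundle, a downloaded file can be checked before use. The following is a minimal sketch using only the Python standard library; the helper names `md5sum` and `verify` are made up for this example.

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_md5):
    """True when the file's digest matches the value listed in the tables."""
    return md5sum(path) == expected_md5.lower()

# Example call, assuming the 10M bundle was downloaded to the current directory:
# verify("laion2B-en-clip768v2-n=10M.h5", "c05e4b1d2b2a0c7663ac9767753e25e1")
```

Reading in chunks keeps memory use constant, which matters for the larger bundles.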

The metadata for the 100K and 300K subsets does not correspond to the first 100K and 300K elements of the larger subsets. More precisely, the 100K and 300K subsets include records whose NSFW value is missing, while the larger subsets remove them.

Note that our projection models were trained on our 10M subset. Projections trained in other ways may yield different quality.

Note: Projections reduce the quality of the results relative to the original embeddings, but you can use these datasets to quickly prototype your solution and for hyperparameter optimization. Please email us if you are interested in the associated metadata (which can also be obtained as described in the rest of the document).

The original dataset can be downloaded and processed to get different subsets as described in the downloading and preprocessing LAION page. We encourage challenge participants to use the provided bundles for consistency reasons.
