The Stanford Natural Language Inference (SNLI) Corpus

Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is the task of determining the inference relation between two (short, ordered) texts: entailment, contradiction, or neutral (MacCartney and Manning 2008).

The Corpus

The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. We aim for it to serve both as a benchmark for evaluating representational systems for text, especially including those induced by representation-learning methods, and as a resource for developing NLP models of any kind.

The following paper introduces the corpus in detail. If you use the corpus in published work, please cite it:

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). [pdf] [bib]

Here are a few example pairs taken from the development portion of the corpus. Each is shown with the judgments of five Mechanical Turk workers and the resulting consensus judgment; a sketch of the consensus rule appears after the table.

Text | Judgments | Hypothesis
A man inspects the uniform of a figure in some East Asian country. | contradiction (C C C C C) | The man is sleeping.
An older and younger man smiling. | neutral (N N E N N) | Two men are smiling and laughing at the cats playing on the floor.
A black race car starts up in front of a crowd of people. | contradiction (C C C C C) | A man is driving down a lonely road.
A soccer game with multiple males playing. | entailment (E E E E E) | Some men are playing a sport.
A smiling costumed woman is holding an umbrella. | neutral (N N E C N) | A happy woman in a fairy costume holds an umbrella.
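The consensus rule is simple: a label becomes the gold label if at least three of the five annotators chose it; otherwise the pair receives the placeholder label '-'. A minimal sketch in Python:

```python
from collections import Counter

def consensus(annotator_labels):
    """Majority vote over the five annotator labels.

    Returns '-' when no label wins at least three of five votes,
    matching the corpus convention for such pairs.
    """
    label, count = Counter(annotator_labels).most_common(1)[0]
    return label if count >= 3 else "-"

# Example: the "smiling costumed woman" pair above.
print(consensus(["neutral", "neutral", "entailment", "contradiction", "neutral"]))
# -> neutral
```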

The corpus is distributed in both JSON Lines and tab-separated value formats, which are packaged together (with a readme) here:

Download: SNLI 1.0 (zip, ~100MB)

SNLI is archived at the NYU Faculty Digital Archive.
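Each line of a .jsonl file is one sentence pair. Here is a minimal sketch of loading the training split and dropping the pairs with no consensus gold label (whose gold_label field is '-'), using the field names documented in the bundled readme:

```python
import json

def load_snli(path):
    """Yield (premise, hypothesis, label) triples from an SNLI .jsonl file,
    skipping pairs with no consensus gold label ('-')."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            if ex["gold_label"] != "-":
                yield ex["sentence1"], ex["sentence2"], ex["gold_label"]

train = list(load_snli("snli_1.0/snli_1.0_train.jsonl"))
print(len(train), train[0])
```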

The Stanford Natural Language Inference Corpus by The Stanford NLP Group is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at http://shannon.cs.illinois.edu/DenotationGraph/.

The corpus includes content from the Flickr30k corpus (also released under an Attribution-ShareAlike license), which can be cited via the following paper:

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2:67–78.

About 4k sentences in the training set have captionIDs and pairIDs beginning with 'vg_'. These come from a pilot data collection effort that used data from the Visual Genome corpus, which was still under construction at the time of the release of SNLI. For more information on Visual Genome, see: https://visualgenome.org/.
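If this pilot subset matters for your use case (for example, because its image-caption domain differs from the Flickr30k captions), the pairs can be identified by their ID prefix; a sketch, assuming the training file from the download above:

```python
import json

def is_visual_genome(example):
    # Pilot pairs from the Visual Genome collection effort are marked
    # by a 'vg_' prefix on their pairID (and captionID) fields.
    return example["pairID"].startswith("vg_")

with open("snli_1.0/snli_1.0_train.jsonl", encoding="utf-8") as f:
    n_vg = sum(is_visual_genome(json.loads(line)) for line in f)
print(n_vg, "Visual Genome pilot pairs")
```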

The hard subset of the test set used in Gururangan et al. 2018 is available in JSONL format here.

Dataset Card

For key information relevant to anyone considering building applications on this data, see the dataset card created by the Hugging Face Datasets team.
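The corpus can also be pulled directly from the Hugging Face Hub. A minimal sketch; note that in this packaging, pairs with no consensus gold label carry the label -1 and should be filtered out:

```python
from datasets import load_dataset

snli = load_dataset("snli")  # splits: train / validation / test

# Drop pairs with no consensus gold label (encoded as -1).
train = snli["train"].filter(lambda ex: ex["label"] != -1)

label_names = snli["train"].features["label"].names
print(label_names)  # ['entailment', 'neutral', 'contradiction']
print(train[0]["premise"], "|", train[0]["hypothesis"])
```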

Published results

The following table reflects our informal attempt to catalog published 3-class classification results on the SNLI test set. We define sentence vector-based models as those that perform classification solely on the basis of a pair of fixed-size sentence representations computed independently of one another. Reported parameter counts do not include word embeddings. If you would like to add a paper that reports a number at or above the current state of the art, email Sam.
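To make that definition concrete, here is a minimal PyTorch sketch of a sentence vector-based model in this sense: each sentence is encoded independently into a fixed-size vector, and the classifier sees only a combination of the two vectors (here the common [u; v; |u−v|; u∗v] features used by several of the encoders below). The encoder choice and sizes are illustrative assumptions, not any particular paper's configuration.

```python
import torch
import torch.nn as nn

class SentenceVectorNLI(nn.Module):
    """Illustrative sentence vector-based NLI model: the two sentences are
    encoded independently; cross-sentence interaction happens only through
    the fixed-size vectors u and v."""

    def __init__(self, vocab_size=30000, embed_dim=300, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Classifier over the standard [u; v; |u-v|; u*v] combination.
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),  # entailment / neutral / contradiction
        )

    def encode(self, tokens):
        _, (h, _) = self.encoder(self.embed(tokens))
        return h[-1]  # final hidden state as the sentence vector

    def forward(self, premise, hypothesis):
        u, v = self.encode(premise), self.encode(hypothesis)
        pair = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
        return self.classifier(pair)  # 3-way logits

model = SentenceVectorNLI()
logits = model(torch.randint(0, 30000, (2, 12)), torch.randint(0, 30000, (2, 9)))
print(logits.shape)  # torch.Size([2, 3])
```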

Three-way classification

Publication | Model | Parameters | Train (% acc) | Test (% acc)

Feature-based models

Bowman et al. '15 | Unlexicalized features | – | 49.4 | 50.4
Bowman et al. '15 | + Unigram and bigram features | – | 99.7 | 78.2

Sentence vector-based models

Bowman et al. '15 | 100D LSTM encoders | 220k | 84.8 | 77.6
Bowman et al. '16 | 300D LSTM encoders | 3.0m | 83.9 | 80.6
Vendrov et al. '15 | 1024D GRU encoders w/ unsupervised 'skip-thoughts' pre-training | 15m | 98.8 | 81.4
Mou et al. '15 | 300D Tree-based CNN encoders | 3.5m | 83.3 | 82.1
Bowman et al. '16 | 300D SPINN-PI encoders | 3.7m | 89.2 | 83.2
Yang Liu et al. '16 | 600D (300+300) BiLSTM encoders | 2.0m | 86.4 | 83.3
Munkhdalai & Yu '16b | 300D NTI-SLSTM-LSTM encoders | 4.0m | 82.5 | 83.4
Yang Liu et al. '16 | 600D (300+300) BiLSTM encoders with intra-attention | 2.8m | 84.5 | 84.2
Conneau et al. '17 | 4096D BiLSTM with max-pooling | 40m | 85.6 | 84.5
Munkhdalai & Yu '16a | 300D NSE encoders | 3.0m | 86.2 | 84.6
Qian Chen et al. '17 | 600D (300+300) Deep Gated Attn. BiLSTM encoders (code) | 12m | 90.5 | 85.5
Tao Shen et al. '17 | 300D Directional self-attention network encoders (code) | 2.4m | 91.1 | 85.6
Jihun Choi et al. '17 | 300D Gumbel TreeLSTM encoders | 2.9m | 91.2 | 85.6
Nie and Bansal '17 | 300D Residual stacked encoders | 9.7m | 89.8 | 85.7
Anonymous '18 | 1200D REGMAPR (Base+Reg) | – | – | 85.9
Yi Tay et al. '18 | 300D CAFE (no cross-sentence attention) | 3.7m | 87.3 | 85.9
Jihun Choi et al. '17 | 600D Gumbel TreeLSTM encoders | 10m | 93.1 | 86.0
Nie and Bansal '17 | 600D Residual stacked encoders | 29m | 91.0 | 86.0
Tao Shen et al. '18 | 300D Reinforced Self-Attention Network | 3.1m | 92.6 | 86.3
Im and Cho '17 | Distance-based Self-Attention Network | 4.7m | 89.6 | 86.3
Seonhoon Kim et al. '18 | Densely-Connected Recurrent and Co-Attentive Network (encoder) | 5.6m | 91.4 | 86.5
Talman et al. '18 | 600D Hierarchical BiLSTM with Max Pooling (HBMP, code) | 22m | 89.9 | 86.6
Qian Chen et al. '18 | 600D BiLSTM with generalized pooling | 65m | 94.9 | 86.6
Kiela et al. '18 | 512D Dynamic Meta-Embeddings | 9m | 91.6 | 86.7
Deunsol Yoon et al. '18 | 600D Dynamic Self-Attention Model | 2.1m | 87.3 | 86.8
Deunsol Yoon et al. '18 | 2400D Multiple-Dynamic Self-Attention Model | 7.0m | 89.0 | 87.4

Other neural network models (usually with attention between text and hypothesis words)

Rocktäschel et al. '15 | 100D LSTMs w/ word-by-word attention | 250k | 85.3 | 83.5
Pengfei Liu et al. '16a | 100D DF-LSTM | 320k | 85.2 | 84.6
Yang Liu et al. '16 | 600D (300+300) BiLSTM encoders with intra-attention and symbolic preproc. | 2.8m | 85.9 | 85.0
Pengfei Liu et al. '16b | 50D stacked TC-LSTMs | 190k | 86.7 | 85.1
Munkhdalai & Yu '16a | 300D MMA-NSE encoders with attention | 3.2m | 86.9 | 85.4
Wang & Jiang '15 | 300D mLSTM word-by-word attention model | 1.9m | 92.0 | 86.1
Jianpeng Cheng et al. '16 | 300D LSTMN with deep attention fusion | 1.7m | 87.3 | 85.7
Jianpeng Cheng et al. '16 | 450D LSTMN with deep attention fusion | 3.4m | 88.5 | 86.3
Parikh et al. '16 | 200D decomposable attention model | 380k | 89.5 | 86.3
Parikh et al. '16 | 200D decomposable attention model with intra-sentence attention | 580k | 90.5 | 86.8
Munkhdalai & Yu '16b | 300D Full tree matching NTI-SLSTM-LSTM w/ global attention | 3.2m | 88.5 | 87.3
Zhiguo Wang et al. '17 | BiMPM | 1.6m | 90.9 | 87.5
Lei Sha et al. '16 | 300D re-read LSTM | 2.0m | 90.7 | 87.5
Yichen Gong et al. '17 | 448D Densely Interactive Inference Network (DIIN, code) | 4.4m | 91.2 | 88.0
McCann et al. '17 | Biattentive Classification Network + CoVe + Char | 22m | 88.5 | 88.1
Chuanqi Tan et al. '18 | 150D Multiway Attention Network | 14m | 94.5 | 88.3
Xiaodong Liu et al. '18 | Stochastic Answer Network | 3.5m | 93.3 | 88.5
Ghaeini et al. '18 | 450D DR-BiLSTM | 7.5m | 94.1 | 88.5
Yi Tay et al. '18 | 300D CAFE | 4.7m | 89.8 | 88.5
Qian Chen et al. '17 | KIM | 4.3m | 94.1 | 88.6
Qian Chen et al. '16 | 600D ESIM + 300D Syntactic TreeLSTM (code) | 7.7m | 93.5 | 88.6
Peters et al. '18 | ESIM + ELMo | 8.0m | 91.6 | 88.7
Boyuan Pan et al. '18 | 300D DMAN | 9.2m | 95.4 | 88.8
Zhiguo Wang et al. '17 | BiMPM Ensemble | 6.4m | 93.2 | 88.8
Yichen Gong et al. '17 | 448D Densely Interactive Inference Network (DIIN, code) Ensemble | 17m | 92.3 | 88.9
Seonhoon Kim et al. '18 | Densely-Connected Recurrent and Co-Attentive Network | 6.7m | 93.1 | 88.9
Qian Chen et al. '17 | KIM Ensemble | 43m | 93.6 | 89.1
Ghaeini et al. '18 | 450D DR-BiLSTM Ensemble | 45m | 94.8 | 89.3
Peters et al. '18 | ESIM + ELMo Ensemble | 40m | 92.1 | 89.3
Yi Tay et al. '18 | 300D CAFE Ensemble | 17.5m | 92.5 | 89.3
Chuanqi Tan et al. '18 | 150D Multiway Attention Network Ensemble | 58m | 95.5 | 89.4
Boyuan Pan et al. '18 | 300D DMAN Ensemble | 79m | 96.1 | 89.6
Radford et al. '18 | Fine-Tuned LM-Pretrained Transformer | 85m | 96.6 | 89.9
Seonhoon Kim et al. '18 | Densely-Connected Recurrent and Co-Attentive Network Ensemble | 53.3m | 95.0 | 90.1
Zhuosheng Zhang et al. '19a | SJRC (BERT-Large + SRL) | 308m | 95.7 | 91.3
Xiaodong Liu et al. '19 | MT-DNN | 330m | 97.2 | 91.6
Zhuosheng Zhang et al. '19b | SemBERT | 339m | 94.4 | 91.9
Pilault et al. '20 | CA-MTL | 340m | 92.6 | 92.1
Sun et al. '20 | RoBERTa-large + self-explaining layer | 355m+ | ? | 92.3
Wang et al. '21 | EFL (Entailment as Few-shot Learner) + RoBERTa-large | 355m | ? | 93.1
Related resources

  • A spell-checked version of the test and development sets. (Warning: Results on these sets are not directly comparable to results on the regular dev and test sets, and will not be listed here.)
  • The Multi-Genre NLI (MultiNLI or MNLI) Corpus. The corpus is in the same format as SNLI and is comparable in size, but it includes a more diverse range of text styles and topics, as well as an auxiliary test set for cross-genre transfer evaluation.
  • The FraCaS test suite for natural language inference, in XML format.
  • MedNLI: A Natural Language Inference Dataset For The Clinical Domain.
  • XNLI: A Cross-Lingual Natural Language Inference Evaluation Set.
  • e-SNLI: Explanation annotations over SNLI.

Contact Information

For any comments or questions, please email Sam, Gabor, and Chris.