Title: The Curious Case of Neural Text Degeneration
Authors: Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi
Published: 22nd April 2019 (Monday) @ 07:17:18
Link: http://arxiv.org/abs/1904.09751v2

Abstract

Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive. In this paper, we reveal surprising distributional differences between human text and machine text. In addition, we find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. Our findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.


Main idea: propose sampling from the top-p portion of the probability mass, expanding and contracting the candidate pool dynamically

  • GPT-2 versus human generation in Figure 2: the probabilities of human-chosen tokens are much more volatile, while beam search produces consistently high-probability tokens, yet the text itself is degenerate (repetitions)
  • Why is text produced by pure sampling so degenerate? In this work we show that the “unreliable tail” is to blame. This unreliable tail is composed of tens of thousands of candidate tokens with relatively low probability that are over-represented in the aggregate.
  • Nice realisation: temperature itself can also be used to control the long (“unreliable”) tail. You could set a very low temperature to suppress low-probability tokens in the tail (probably not effective, since you’d likely need to get close to 0 for it to work well, by which point you’re close to greedy decoding? Just a guess — see the temperature/top-k sketch after this list)
    • “Instead of relying on a fixed top-k, or using a temperature parameter to control the shape of the distribution without sufficiently suppressing the unreliable tail”
  • The paper uses Self-BLEU statistics (Zhu et al., 2018) from Texygen: A Benchmarking Platform for Text Generation Models
  • The HUSE evaluation demonstrates that Nucleus Sampling is the best overall decoding strategy. We include generated examples for qualitative analysis – see Figure 3 for a representative example, and further examples in the appendix
  • Low temperature sampling has also been used to partially alleviate the issues of top-k sampling discussed above, by shaping the distribution before top-k sampling (Radford et al., 2018; Fan et al., 2018). However, recent analysis has shown that, while lowering the temperature improves generation quality, it comes at the cost of decreasing diversity (Caccia et al., 2018; Hashimoto et al., 2019).
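To make the two baselines mentioned above concrete, here is a minimal sketch (not from the paper) of temperature scaling and fixed top-k truncation applied to a next-token distribution; the logits array is an invented stand-in for a language model's output.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def temperature_sample(logits, t=0.7, rng=np.random.default_rng(0)):
    """Sharpen (t < 1) or flatten (t > 1) the distribution before sampling.
    As t -> 0 this approaches greedy decoding, which is why very low
    temperatures suppress the tail but also kill diversity."""
    probs = softmax(logits / t)
    return rng.choice(len(probs), p=probs)

def top_k_sample(logits, k=40, rng=np.random.default_rng(0)):
    """Keep only the k highest-probability tokens, renormalize, and sample."""
    top = np.argsort(logits)[-k:]          # indices of the k largest logits
    probs = softmax(logits[top])
    return top[rng.choice(len(top), p=probs)]

# Toy next-token logits over a 10-token vocabulary (illustrative only).
logits = np.array([4.0, 3.5, 2.0, 1.0, 0.5, 0.0, -1.0, -2.0, -3.0, -4.0])
print(temperature_sample(logits, t=0.7), top_k_sample(logits, k=3))
```

Note how both knobs are static: temperature reshapes the whole distribution, and top-k always keeps exactly k candidates, regardless of how flat or peaked the distribution is at that step.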

A couple of key results:

  • Importantly, we argue that the optimal generation strategy should produce text which has a perplexity close to that of the gold text: Even though the model has the ability to generate text that has lower perplexity (higher probability), such text tends to have low diversity and get stuck in repetition loops, as shown in §5 and illustrated in Figure 4
  • We see that perplexity of text obtained from pure sampling is worse than the perplexity of the gold. This indicates that the model is confusing itself: sampling too many unlikely tokens and creating context that makes it difficult to recover the human distribution of text, as in Figure 1. Yet, setting the temperature lower creates diversity and repetition issues, as we shall see in §5. Even with our relatively fine-grained parameter sweep, Nucleus Sampling obtains closest perplexity to human text, as shown in Table 1
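As a refresher on what Table 1 is comparing, perplexity is just the exponentiated average negative log-likelihood per token. A minimal sketch, with invented per-token log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-probability per token.
    Much lower than human perplexity (e.g. beam search) signals
    over-confident, repetitive text; much higher (e.g. pure sampling)
    signals the model is wandering into its own unreliable tail."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs for a short generated continuation.
print(perplexity([-2.3, -0.7, -4.1, -1.2, -0.9]))
```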

Nucleus Sampling Defined:

We propose a new stochastic decoding method: Nucleus Sampling. The key idea is to use the shape of the probability distribution to determine the set of tokens to be sampled from. Given a distribution $P(x \mid x_{1:i-1})$, we define its top-$p$ vocabulary $V^{(p)} \subseteq V$ as the smallest set such that

$$\sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}) \geq p$$
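A minimal numpy sketch of that definition (not the authors' code): sort the distribution, keep the smallest prefix whose cumulative mass reaches p, renormalize within that set, and sample.

```python
import numpy as np

def nucleus_sample(probs, p=0.95, rng=np.random.default_rng(0)):
    """Sample from the top-p 'nucleus': the smallest set of tokens whose
    cumulative probability mass is >= p. The candidate pool therefore
    expands when the distribution is flat and contracts when it is peaked."""
    order = np.argsort(probs)[::-1]              # tokens sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # size of the smallest set reaching mass p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return nucleus[rng.choice(cutoff, p=nucleus_probs)]

# Toy next-token distribution over a 6-token vocabulary (illustrative only).
probs = np.array([0.45, 0.25, 0.15, 0.08, 0.05, 0.02])
print(nucleus_sample(probs, p=0.9))
```

With p=0.9 on this toy distribution the nucleus contains four tokens; on a more peaked distribution it could shrink to one or two, which is exactly the dynamic expansion and contraction described in the main idea above.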

Figure 2: The probability assigned to tokens generated by Beam Search and humans, given the same context. Note the increased variance that characterizes human text, in contrast with the endless repetition of text decoded by Beam Search.

Figure 6: Perplexities of generations from various decoding methods. Note that beam search has unnaturally low perplexities. A similar effect is seen using a temperature of 0.7 with top-k as in both Radford et al. (2019) and Fan et al. (2018). Sampling, Top-k, and Nucleus can all be calibrated to human perplexities, but the first two face coherency issues when their parameters are set this high.

  • the plot of conditional perplexity against temperature seems to cross the human-level PPL quite abruptly
  • isn’t top-k better in this respect, since perplexity varies more gradually with its parameter, including around where it crosses the human PPL threshold?