Came out of a chat with Ben Peters
Nucleus sampling: in the GPT-2 days, top-k/top-p sampling was proposed because models' next-token probability mass distributions had very long tails, so tokens with individually low probability would still often be sampled due to the high total probability mass spread over the tail. Top-k sampling selects the top k tokens by probability and samples only from those (i.e. sampling from a truncated, renormalized conditional distribution).
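A minimal sketch of top-k sampling (function name and the assumption of a 1-D tensor of next-token logits are mine, purely illustrative):

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int = 50) -> int:
    """Sample a token id from only the k most probable tokens.

    `logits` is a 1-D tensor of unnormalized next-token scores (vocab_size,).
    Keeping only the k largest logits and renormalizing gives the truncated
    conditional distribution described above; the long tail is cut off.
    """
    topk_logits, topk_indices = torch.topk(logits, k)
    probs = torch.softmax(topk_logits, dim=-1)          # renormalize over the top-k
    choice = torch.multinomial(probs, num_samples=1)    # stochastic draw
    return topk_indices[choice].item()
```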
Ari Holtzman proposed nucleus sampling in The Curious Case of Neural Text Degeneration, which is a means to "dynamically" set the cutoff based instead on a (the?) quantity of interest: the total probability of the candidate tokens (keep the smallest set of tokens whose cumulative probability exceeds a threshold p, renormalize, and sample from that set).
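A sketch of nucleus (top-p) sampling under the same assumptions as above (illustrative names, 1-D logits tensor):

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample from the smallest set of tokens whose cumulative probability
    exceeds p (the "nucleus"), after renormalizing over that set."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep every token whose preceding cumulative mass is still below p,
    # plus the boundary token that pushes the mass past p.
    cutoff = int((cumulative < p).sum().item()) + 1
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return sorted_indices[choice].item()
```

Unlike a fixed k, the size of the kept set here varies with how peaked the distribution is at each step, which is the "dynamic" part.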
Note this is (mostly) orthogonal to beam search. With no beam search (equivalently, a beam width of 1) there is a single text sequence being generated. Sampling still occurs in this context when using non-zero temperature, i.e. unless performing greedy generation, since sampling simply means drawing stochastically from the probability (mass) distribution over tokens at each decoding (generation) step. Note that sampling can be combined with beam search (the rationale for saying "mostly" orthogonal, since presumably sampling then impacts the dynamics of the beams).
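A sketch of a single decoding step in the no-beam-search case, showing where temperature fits and how temperature 0 collapses to greedy decoding (hypothetical helper, not any library's API):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """One decoding step for a single sequence (beam width 1).

    With temperature > 0 we draw stochastically from the softmax of the
    scaled logits; treating temperature == 0 as argmax gives greedy decoding.
    """
    if temperature == 0.0:
        return torch.argmax(logits).item()                  # greedy generation
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()   # stochastic sampling
```

Top-k or nucleus truncation would simply replace the plain softmax draw in the last two lines.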