Came out of a chat with Ben Peters
Nucleus sampling: in the GPT-2 days, top-k/top-p sampling was proposed because models' next-token probability mass distributions had very long tails, so tokens with individually low probability would still often be sampled due to the high total probability mass spread over the tail. Top-k sampling selects the top k tokens by probability and samples only from those (i.e. sampling from a truncated, renormalized conditional distribution).
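A minimal sketch of top-k sampling (function name and the assumption of a 1-D tensor of next-token logits are mine, purely illustrative):

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int = 50) -> int:
    """Sample a token id from only the k most probable tokens.

    `logits` is a 1-D tensor of unnormalized next-token scores (vocab_size,).
    Keeping only the k largest logits and renormalizing gives the truncated
    conditional distribution described above; the long tail is cut off.
    """
    topk_logits, topk_indices = torch.topk(logits, k)
    probs = torch.softmax(topk_logits, dim=-1)          # renormalize over the top-k
    choice = torch.multinomial(probs, num_samples=1)    # stochastic draw
    return topk_indices[choice].item()
```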
Ari Holtzman proposed nucleus sampling in The Curious Case of Neural Text Degeneration, which is a means to "dynamically" set the cutoff based instead on a (the?) quantity of interest: the total probability of the candidate tokens (keep the smallest set of tokens whose cumulative probability exceeds a threshold p, renormalize, and sample from that set).
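A sketch of nucleus (top-p) sampling under the same assumptions as above (illustrative names, 1-D logits tensor):

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample from the smallest set of tokens whose cumulative probability
    exceeds p (the "nucleus"), after renormalizing over that set."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep every token whose preceding cumulative mass is still below p,
    # plus the boundary token that pushes the mass past p.
    cutoff = int((cumulative < p).sum().item()) + 1
    nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(nucleus_probs, num_samples=1)
    return sorted_indices[choice].item()
```

Unlike a fixed k, the size of the kept set here varies with how peaked the distribution is at each step, which is the "dynamic" part.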
Note this is (mostly) orthogonal to beam search. With no beam search (equivalently, a beam width of 1) there is a single text sequence being generated. Sampling still occurs in this context when using non-zero temperature, i.e. unless performing greedy generation, since sampling simply means drawing stochastically from the probability (mass) distribution over tokens at each decoding (generation) step. Note that sampling can be combined with beam search (the rationale for saying "mostly" orthogonal, since presumably sampling then impacts the dynamics of the beams).
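A sketch of a single decoding step in the no-beam-search case, showing where temperature fits and how temperature 0 collapses to greedy decoding (hypothetical helper, not any library's API):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """One decoding step for a single sequence (beam width 1).

    With temperature > 0 we draw stochastically from the softmax of the
    scaled logits; treating temperature == 0 as argmax gives greedy decoding.
    """
    if temperature == 0.0:
        return torch.argmax(logits).item()                  # greedy generation
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()   # stochastic sampling
```

Top-k or nucleus truncation would simply replace the plain softmax draw in the last two lines.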