Andrej Karpathy on X: “Speculative execution for LLMs is an excellent inference-time optimization. It hinges on the following unintuitive observation: forwarding an LLM on a single input token takes about as much time as forwarding an LLM on K input tokens in a batch (for larger K than you might…”
Speculative execution for LLMs is an excellent inference-time optimization. It hinges on the following unintuitive observation: forwarding an LLM on a single input token takes about as much time as forwarding an LLM on K input tokens in a batch (for larger K than you might think). This unintuitive fact is because sampling is heavily memory bound: most of the “work” is not doing compute, it is reading in the weights of the transformer from VRAM into on-chip cache for processing. So if you’re going to do all that work of reading in all those weights, you might as well apply them to a whole batch of input vectors. I went into more detail in an earlier thread: x.com/karpathy/statu

The reason we can’t naively use this fact to sample in chunks of K tokens at a time is that every N-th token depends on what token we sampled at step N-1. There is a serial dependency, so the baseline implementation just goes one by one, left to right.

Now the clever idea is to use a small and cheap draft model to first generate a candidate sequence of K tokens - a “draft”. Then we feed all of these together through the big model in a batch. This is almost as fast as feeding in just one token, per the above. Then we go from left to right over the logits predicted by the big model and sample tokens. Any sample that agrees with the draft allows us to immediately skip forward to the next token. If there is a disagreement, we throw the rest of the draft away and eat the cost of doing some throwaway work (sampling the draft and forward-passing the big model on the later tokens).

The reason this works in practice is that most of the time the draft tokens get accepted: they are easy, so even a much smaller draft model gets them right. As these easy tokens get accepted, we skip through those parts in leaps. The hard tokens where the big model disagrees “fall back” to the original speed, but actually a bit slower because of all the extra work.

So TLDR: this one weird trick works because LLMs are memory bound at inference time in the “batch size 1” setting of sampling a single sequence of interest, which a large fraction of “local LLM” use cases fall into, and because most tokens are “easy”.

References:
arxiv.org/abs/2302.01318
arxiv.org/abs/1811.03115
arxiv.org/abs/2211.17192
Quoted post: Full F16 precision 34B Code Llama at >20 t/s on M2 Ultra
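For concreteness, here is a minimal sketch of the accept/reject loop described above, in Python. It follows the simplified greedy description (accept a draft token only when it exactly matches the big model’s pick); the referenced papers instead use a rejection-sampling rule that keeps the output distributed exactly as the big model would sample it. The `target_forward` / `draft_forward` callables and the toy bigram tables are hypothetical placeholders of my own, not anything from the thread, standing in for real causal LMs that return one next-token logit vector per position.

```python
import numpy as np

def speculative_decode(target_forward, draft_forward, prefix, k, n_new):
    """Greedy-acceptance speculative decoding sketch.

    Both model callables take a token list and return one logit vector per
    position, where logits[i] scores the token that follows tokens[:i+1].
    """
    tokens = list(prefix)
    goal = len(prefix) + n_new
    while len(tokens) < goal:
        # 1) Draft k tokens autoregressively with the cheap model (greedy here).
        draft, ctx = [], list(tokens)
        for _ in range(k):
            nxt = int(np.argmax(draft_forward(ctx)[-1]))
            draft.append(nxt)
            ctx.append(nxt)

        # 2) One batched forward of the big model over prefix + draft.
        #    Per the post, this costs roughly the same as a single-token forward.
        big_logits = target_forward(tokens + draft)

        # 3) Verify left to right: keep draft tokens while the big model agrees.
        accepted = 0
        for i, t in enumerate(draft):
            big_choice = int(np.argmax(big_logits[len(tokens) + i - 1]))
            if big_choice != t:
                # Disagreement: keep the accepted prefix, take the big model's
                # token, and discard the rest of the draft (the throwaway work).
                tokens.extend(draft[:accepted])
                tokens.append(big_choice)
                break
            accepted += 1
        else:
            # Whole draft accepted; the big model's logits at the last drafted
            # position give one extra token "for free".
            tokens.extend(draft)
            tokens.append(int(np.argmax(big_logits[-1])))
    return tokens


# Toy bigram "models" just to make the sketch runnable: the draft model is a
# noisy copy of the target, so it usually (but not always) agrees.
rng = np.random.default_rng(0)
V = 16
target_table = rng.normal(size=(V, V))
draft_table = target_table + 0.1 * rng.normal(size=(V, V))

def bigram_forward(table):
    return lambda toks: [table[t] for t in toks]

print(speculative_decode(bigram_forward(target_table), bigram_forward(draft_table),
                         prefix=[1, 2, 3], k=4, n_new=20))
```

A real implementation would reuse the KV cache and sample rather than argmax; the sketch only shows where the speedup comes from: one batched big-model forward verifies k drafted tokens at once.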
Since the first token (or first couple) is usually the hardest to get right, is there any work around letting the bigger model decode the first couple, then letting the smaller model go? Or is this unnecessary? It feels like a lot of the mistakes would come in those first few…
Hey hey — wasn’t expecting this, but thanks for highlighting our results! This was super fun to work on.
Would there be a noticeable difference in performance if I asked ChatGPT to write down a sequence of a thousand 1s versus a sequence of a thousand random numbers? For example: 1111111111111111… vs 47509216798781…
Speculative execution… transformers are CPUs now
branch prediction is back on the menu boys
Beautiful explanation. The human brain almost certainly does not optimize this way, since batch size is a uniquely AI parameter that we use to speed up training and now to speed up inference as well.
Here’s an idea that may be a more end-to-end extension. What if we train the LLM so each layer tries to predict the output token with a separate output head, such that at any time it can exit computing the remaining layers. Then you could compute the lower layer outputs…
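A rough sketch of what the per-layer-head idea in the reply above could look like (hypothetical, not from the thread or any cited paper): give each transformer layer its own output head and stop early once a head is confident enough. The module below assumes PyTorch, batch size 1 at inference, and a simple confidence threshold; training the extra heads (e.g. a cross-entropy loss per layer) is left out.

```python
import torch
import torch.nn as nn

class EarlyExitLM(nn.Module):
    """Toy decoder where every layer has its own LM head, so generation can
    stop at any depth once that depth's head is confident (hypothetical sketch)."""

    def __init__(self, vocab=1000, d=256, n_heads=4, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
             for _ in range(n_layers)])
        # One LM head per layer, so any depth can emit a next-token prediction.
        self.heads = nn.ModuleList([nn.Linear(d, vocab) for _ in range(n_layers)])

    @torch.no_grad()
    def next_token(self, tokens, exit_threshold=0.9):
        """tokens: LongTensor of shape (1, T). Returns (token_id, layers_used)."""
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.embed(tokens)
        for depth, (layer, head) in enumerate(zip(self.layers, self.heads), 1):
            x = layer(x, src_mask=mask)            # causal self-attention block
            probs = head(x[:, -1]).softmax(-1)     # this depth's next-token guess
            conf, tok = probs.max(-1)
            if conf.item() >= exit_threshold:      # confident enough: skip deeper layers
                return int(tok), depth
        return int(tok), depth                     # fell through: used the full stack

# Untrained weights give near-uniform heads, so this will typically run all layers,
# but it shows the control flow of exiting early.
model = EarlyExitLM()
token, depth = model.next_token(torch.randint(0, 1000, (1, 12)))
print(token, depth)
```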
This is brilliant and I feel like we are barely scratching the surface on these performance optimizations.
Saved! Here’s the compiled thread: mem.ai/p/kuh8zkfGk7fk