Andrej Karpathy @karpathy 2023-08-15

“How is LLaMa.cpp possible?”

great post by @finbarrtimbers

https://finbarr.ca/how-is-llama-cpp-possible/

llama.cpp surprised many people (myself included) with how quickly you can run large LLMs on small computers, e.g. 7B runs @ ~16 tok/s on a MacBook. Wait don’t you need supercomputers to work with LLMs?

TLDR at batch_size=1 (i.e. just generating a single stream of prediction on your computer), the inference is super duper memory-bound. The on-chip compute units are twiddling their thumbs while sucking model weights through a straw from DRAM. Every individual weight that is expensively loaded from DRAM onto the chip is only used for a single instant multiply to process each new input token. So the stat to look at is not FLOPS but the memory bandwidth.
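
A back-of-the-envelope sketch of that bound (illustrative assumptions: a 7B model quantized to ~4 bits is roughly 3.5 GB of weights, and a base M2 has ~100 GB/s of memory bandwidth):

```python
# Back-of-the-envelope: at batch_size=1, every generated token streams all model
# weights from DRAM once, so memory bandwidth sets an upper bound on tokens/sec.
# Illustrative assumptions below.
weight_bytes = 7e9 * 0.5   # 7B params at ~4 bits/param (q4 quantization) ≈ 3.5 GB
bandwidth = 100e9          # base M2: ~100 GB/s unified memory bandwidth

time_per_token = weight_bytes / bandwidth               # seconds to stream the weights once
print(f"upper bound: ~{1 / time_per_token:.0f} tok/s")  # ~29 tok/s; the thread quotes ~16 tok/s in practice
```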

Let’s take a look:

A100: 1935 GB/s memory bandwidth, 1248 TOPS

MacBook M2: 100 GB/s, 7 TFLOPS

The compute is ~200X but the memory bandwidth only ~20X. So the little M2 chip that could will only be about ~20X slower than a mighty A100. This is ~10X faster than you might naively expect just looking at ops.
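
Plugging in the figures above (treating TOPS and TFLOPS loosely as interchangeable "ops", as the comparison does):

```python
# Ratio of A100 to M2 on the two axes that matter, using the figures quoted above.
a100_bw_gbs, m2_bw_gbs = 1935, 100   # GB/s
a100_ops, m2_ops = 1248, 7           # TOPS vs. TFLOPS, treated loosely as "ops"

print(f"compute gap:   ~{a100_ops / m2_ops:.0f}x")        # ~178x, i.e. roughly 200X
print(f"bandwidth gap: ~{a100_bw_gbs / m2_bw_gbs:.0f}x")  # ~19x,  i.e. roughly 20X
```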

The situation becomes very different when you run inference at a very high batch size (e.g. ~160+), such as when you’re hosting an LLM engine simultaneously serving a lot of parallel requests. The same goes for training, where you aren’t forced to go serially token by token and can parallelize across both the batch and time dimensions, because the next-token targets (labels) are known. In these cases, once you load the weights into on-chip cache and pay that large fixed cost, you can re-use them across many input examples and reach ~50%+ utilization, actually making those FLOPS count.
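
A minimal roofline-style sketch of why batching helps: the weight-streaming cost is fixed per forward pass, while the matmul FLOPs grow with batch size, so throughput rises until the compute roof is hit. The numbers below are assumptions (fp16 weights and the A100's ~312 TFLOPS dense fp16 rate rather than the 1248 TOPS quoted above), but they put the crossover near the ~160 figure:

```python
# Roofline-style toy model: per forward pass we stream the weights once (fixed
# memory cost) and do ~2 * params * batch FLOPs (compute cost grows with batch).
# Throughput is limited by whichever takes longer. Numbers are assumptions.
params = 7e9
weight_bytes = params * 2   # fp16 weights ≈ 14 GB
bandwidth = 1935e9          # A100 HBM, bytes/s
flops = 312e12              # A100 dense fp16, FLOP/s (assumption; not the 1248 TOPS int8 figure)

def tokens_per_sec(batch: int) -> float:
    mem_time = weight_bytes / bandwidth        # time to stream the weights once
    compute_time = 2 * params * batch / flops  # matmul time for `batch` tokens
    return batch / max(mem_time, compute_time)

for b in (1, 16, 160, 512):
    print(f"batch {b:>3}: {tokens_per_sec(b):>9,.0f} tok/s")

# Crossover where compute_time overtakes mem_time:
# batch ≈ flops * weight_bytes / (2 * params * bandwidth) ≈ 161, i.e. the ~160 above.
```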

So TLDR why is LLM inference surprisingly fast on your MacBook? If all you want to do is batch 1 inference (i.e. a single “stream” of generation), only the memory bandwidth matters. And the memory bandwidth gap between chips is a lot smaller, and has been a lot harder to scale compared to flops.

supplemental figure

https://medium.com/riselab/ai-and-memory-wall-2cb4265cb0b8

[Image: supplemental figure from “AI and Memory Wall”]


Andrej Karpathy @karpathy 2023-08-16

Two notes I wanted to add:

  1. In addition to parallel inference and training, prompt encoding is also parallelizable even at batch_size=1, because the prompt tokens can be encoded by the LLM in parallel instead of decoded serially one by one. The token inputs into LLMs always have shape (B,T), batch by time. Parallel inference decoding is (high B, T=1), training is (high B, high T), and long prompts are (B=1, high T). So this workload can also become compute-bound (e.g. above 160 tokens) and the A100 would shine again. As your prompts get longer, your MacBook will fall farther behind the A100. (See the shape sketch after this list.)
  2. The M2 chips from Apple are actually quite an amazing lineup and come in much larger shapes and sizes. The M2 Pro, M2 Max have 200 and 400 GB/s (you can get these in a MacBook Pro!), and the M2 Ultra (in Mac Studio) has 800 GB/s. So the M2 Ultra is the smallest, prettiest, out of the box easiest, most powerful personal LLM node today. https://en.wikipedia.org/wiki/Apple_M2
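
A shape-only sketch of the (B, T) framing in note 1, with a hypothetical `llm` function standing in for any transformer forward pass:

```python
import torch

VOCAB = 32_000

def llm(tokens: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for a transformer forward pass: (B, T) ids -> (B, T, vocab) logits."""
    B, T = tokens.shape
    return torch.randn(B, T, VOCAB)

# Serial decoding at batch_size=1: one new token per step -> (B=1, T=1), memory-bound.
llm(torch.zeros(1, 1, dtype=torch.long))
# Batched serving: many parallel streams, one new token each -> (high B, T=1).
llm(torch.zeros(64, 1, dtype=torch.long))
# Training: labels are known, so all positions run in parallel -> (high B, high T).
llm(torch.zeros(64, 2048, dtype=torch.long))
# Prompt encoding ("prefill") at batch_size=1: the whole prompt in parallel -> (B=1, high T).
llm(torch.zeros(1, 2048, dtype=torch.long))
```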

[Image]


Daniel Gross @danielgross 2023-08-15

Like how those breakfast buffet chefs are rate-limited by the distance from the kitchen’s egg refrigerator (bandwidth), not the size of the burner (FLOPs)…


Elon Musk @elonmusk 2023-08-16

A100s already seem so quaint


Sayak Paul @RisingSayak 2023-08-16

> So the little M2 chip that could will only be about ~20X slower than a mighty A100. This is ~10X faster than you might naively expect just looking at ops.

Don’t quite follow. How is it faster? In what regard?


psyv @psyv282j9d 2023-08-15

I presume the A100 bandwidth number you mention is for the on-board GPU RAM. what is the bandwidth when the model is too big for GPU RAM and needs to be shuttled back-and-forth to system RAM?

I think this scenario is where Apple Silicon has the advantage. All system RAM is


Danielle Fong  @DanielleFong 2023-08-15

excellent post


Self @SelfInfinity 2023-08-15

I wonder if humanity will be truly compute bound or bandwidth bound 🤔


Whole Mars Catalog @WholeMarsBlog 2023-08-15

Exciting to think what future generations of apple silicon will be capable of as these use cases are becoming more mainstream


JimZ @moonares 2023-08-15

For any Intel Iris integrated GPU, it goes up to 10 tok/s for generation and 70 tok/s for prompt reading, which means about 15 seconds to analyze a 1000-token article.


Sualeh @sualehasif996 2023-08-16

you have to give it to Apple for having the foresight to put the CPU and GPU on the same die with a pretty amazing interconnect between them!


Jeremy Fox @JeremyDanielFox 2023-10-12

Andrej, I don’t understand why time has to be spent loading the model weights onto the chip. I mean I know it happens once, but once we do it, can’t we keep them on and then just keep using them for every decode step for every new request? Why do they have to keep being loaded?


Andrej Karpathy @karpathy 2023-10-12

Good question: there’s nowhere near enough space on chip. Memory on chip (SRAM, right next to the compute units) has ~1000X lower capacity than the HBM memory (VRAM), which is its own memory-dedicated chip nearby.

Figure from the FlashAttention paper to help re: the memory hierarchy:

[Image: memory hierarchy figure from the FlashAttention paper]
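
Rough scale of that gap, using the illustrative capacities from the FlashAttention memory-hierarchy figure (~20 MB of on-chip SRAM vs. 40 GB of HBM on an A100):

```python
# Rough capacity ratio behind the "~1000X": on-chip SRAM vs. HBM on an A100.
# Illustrative figures as in the FlashAttention memory-hierarchy diagram.
sram_bytes = 20e6           # ~20 MB of SRAM next to the compute units
hbm_bytes = 40e9            # 40 GB of HBM (the "VRAM")
weights_7b_fp16 = 7e9 * 2   # a 7B model in fp16 ≈ 14 GB

print(f"HBM / SRAM capacity: ~{hbm_bytes / sram_bytes:.0f}x")            # ~2000x, same order as the ~1000X above
print(f"7B fp16 weights vs SRAM: ~{weights_7b_fp16 / sram_bytes:.0f}x")  # ~700x too big to fit on chip
```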


buildx @buildxmake 2023-08-16

Makes one wonder if the expected demand for compute based on ChatGPT’s success is blown out of proportion.


Denis A. @den_run_ai 2023-08-16

Does an AMD Genoa-X CPU with 1 GB of L3 3D V-Cache at 6 TB/s theoretical bandwidth accelerate this memory-bound LLaMa.cpp?


Nasim Rahaman @nasim_rahaman 2023-08-16

Wait, isn’t the mem requirement for KV cache missing a factor of n_tokens?
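
For reference, the usual KV-cache estimate does grow linearly with the number of cached tokens; a sketch with LLaMA-7B-ish numbers (32 layers, d_model 4096, fp16), which are assumptions here:

```python
# Usual KV-cache estimate: 2 tensors (K and V) * n_layers * d_model * bytes * n_tokens.
# LLaMA-7B-ish numbers assumed below (32 layers, d_model=4096, fp16).
n_layers, d_model, bytes_per_elt = 32, 4096, 2
per_token = 2 * n_layers * d_model * bytes_per_elt   # ≈ 0.5 MB per cached token
n_tokens = 2048
print(f"KV cache @ {n_tokens} tokens: {per_token * n_tokens / 1e9:.2f} GB")  # ≈ 1.07 GB
```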


Russ @rmarcilhoo 2023-08-15

@effyyzhang @suchenzang

@hardmaru


Locally Optimal @dingoactual 2023-08-27

@memdotai mem it


Mem @memdotai 2023-08-27

Saved! Here’s the compiled thread: https://mem.ai/p/4ip5m7Mx1zvjtjHtyqup…

🪄 AI-generated summary:

“llama.cpp is a programming language that allows for large LLMs to be run on small computers, such as a MacBook, with impressive speed. This has surprised many people, as it was…