Suppose you have a bunch of GPUs. How many LLM forward passes can you do with them?
(A forward pass is one activation of the model’s parameters, which produces one token of output – roughly equivalent to one word.)
Here’s my attempt to understand this topic as a non-specialist. I’ve had it checked over by some technical advisors, but I don’t claim any special expertise. I wrote it because I haven’t been able to find an accessible explainer elsewhere. I appreciate corrections.
The most obvious approach – the one I usually see non-experts taking – is to look up how many FLOP per second your GPU can process, then how many FLOP it takes to run a forward pass, and then divide the two.
For example, Nvidia’s A100 GPU is listed at 312 teraflop per second (3.12e14) on its spec sheet (FP16 tensor), while a forward pass of GPT-4 requires about 5.6e11 FLOP.1 Dividing the two implies a single GPU can do about 560 forward passes per second.
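In code, the naive calculation looks like this (spec-sheet figures only, so treat it as an upper bound):

```python
# Naive spec-sheet estimate: divide peak FLOP per second by FLOP per forward pass.
A100_PEAK_FLOPS = 312e12             # A100 FP16 tensor throughput, FLOP per second
GPT4_FLOP_PER_FORWARD_PASS = 5.6e11  # FLOP to produce one output token

upper_bound = A100_PEAK_FLOPS / GPT4_FLOP_PER_FORWARD_PASS
print(f"~{upper_bound:.0f} forward passes per second")  # ~557, i.e. roughly 560
```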
But this turns out to be much too high.
Even if it were possible to achieve spec sheet FLOP in a real life application (it’s not), this wouldn’t be the relevant figure because, in practice, inference is limited more by memory than by FLOP:
Each forward pass also requires all the activated parameters to pass through the GPU’s memory. If 280 billion parameters are activated, and each parameter takes 16 bits = 2 bytes to encode, then 560 gigabytes must pass through memory.2
But the A100’s memory bandwidth is 2,000 gigabytes per second – only enough for about 4 forward passes per second.
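The same back-of-the-envelope calculation for the memory side (assuming one token at a time, i.e. batch size 1):

```python
# Memory-bandwidth estimate: every activated parameter must be read once per token.
ACTIVE_PARAMS = 280e9      # parameters activated per forward pass
BYTES_PER_PARAM = 2        # FP16 weights, 2 bytes each
A100_BANDWIDTH = 2000e9    # bytes per second

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM   # 560 GB
lower_bound = A100_BANDWIDTH / bytes_per_token
print(f"~{lower_bound:.1f} forward passes per second")  # ~3.6, i.e. roughly 4
```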
However, 4 forward passes per second is much too low.
In practice, forward passes are processed in batches – so the weights only need to pass through memory once per batch, rather than once per token – and models are parallelised across many GPUs, with many other optimisations applied. This allows real-world efficiency to be much higher than the memory bandwidth of an individual GPU would suggest.
So, the first FLOP-based method is an upper bound, the second memory-based method a lower bound, and the real world lies somewhere in between.
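To see why batching matters so much, here’s a toy, roofline-style sketch (my own simplification, not how real serving systems are modelled): it assumes the weights are read from memory once per generation step, and ignores KV-cache traffic, communication overhead and latency constraints.

```python
# Toy model of batching: weights are read from memory once per step,
# while FLOP scale with every token in the batch.
A100_PEAK_FLOPS = 312e12
A100_BANDWIDTH = 2000e9
WEIGHT_BYTES = 280e9 * 2      # active parameters x 2 bytes (FP16)
FLOP_PER_TOKEN = 5.6e11

def tokens_per_second(batch_size):
    compute_time = batch_size * FLOP_PER_TOKEN / A100_PEAK_FLOPS  # seconds per step
    memory_time = WEIGHT_BYTES / A100_BANDWIDTH                   # seconds per step
    return batch_size / max(compute_time, memory_time)

for batch in [1, 8, 64, 256]:
    print(batch, round(tokens_per_second(batch)))
# batch 1 -> ~4 tokens/s (memory bound); batch 256 -> ~557 tokens/s (FLOP bound)
```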
Figuring out where real world efficiency lies is tricky. Not only is it an area of active research, but it also depends on many factors, such as acceptable latency, context length, batch size, size of the model, etc.
The best estimate I’ve seen so far is from Semianalysis. In their article on GPT-4 architecture, they estimate that a cluster of 128 A100s can output 1 million tokens for $4.9 of compute (assuming fairly high utilisation and a context seqlen of 8k).
If A100s cost $1/hour on the cloud in 2023, running the cluster costs $128 per hour. This means the cluster must be producing 128/4.9 ≈ 26 million tokens (forward passes) per hour.
That’s about 60 forward passes per chip per second – about 10% of the theoretical max, but 15 times better than the lower bound.
(Interestingly, it’s significantly worse than the ~33% utilisation of FLOP that can be achieved in training, which means that even if a cluster delivered a certain number of FLOP during training, the same GPUs couldn’t deliver that many useful FLOP if applied to inference.)
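As a check on the arithmetic behind those figures (the $1/hour A100 price and the Semianalysis cost estimate are the assumptions above):

```python
# Implied real-world throughput from the Semianalysis cost estimate.
CLUSTER_GPUS = 128
COST_PER_GPU_HOUR = 1.0         # assumed A100 cloud price in 2023, $/hour
COST_PER_MILLION_TOKENS = 4.9   # Semianalysis estimate for GPT-4 inference

cluster_cost_per_hour = CLUSTER_GPUS * COST_PER_GPU_HOUR                  # $128/hour
tokens_per_hour = cluster_cost_per_hour / COST_PER_MILLION_TOKENS * 1e6   # ~26 million
per_gpu_per_second = tokens_per_hour / CLUSTER_GPUS / 3600                # ~57

print(f"~{tokens_per_hour / 1e6:.0f} million tokens per hour")
print(f"~{per_gpu_per_second:.0f} tokens per GPU per second")
print(f"~{per_gpu_per_second / 557:.0%} of the FLOP-based upper bound")
```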
What about more advanced GPUs?
In the same article, Semianalysis provides a similar figure for the H100. They also have a newer post analysing the inference performance of the newer Blackwell chips, which gives a rough sense of how the B200 compares to the H100.3 For the H200, I looked at some comparisons of inference performance with the H100 and guessed 1.7x.
From this, I can make the following table:
One interesting point is that inference throughput has increased about 20x in 5 years, compared to only an 8x increase in FLOP.
This seems to be at least partly because Nvidia has crammed a lot more memory bandwidth into the newest chips, and memory is still usually a bigger constraint on inference than FLOP per second. The extra memory also allows for larger batch sizes. More broadly, the recent generation of chips seems to have been optimised more for inference relative to training.
However, memory bandwidth has been improving more slowly than FLOP for years (the so-called ‘memory wall’), so while memory has been able to catch up recently, my best guess is that this is a temporary effect.
My figures are also based on FP16 and 16-bit encoding of model weights, but it seems like inference is switching to FP8 and 8-bit encoding, which could also roughly double how much inference can be done per GPU.
(FP16 means that 16 bits are used to encode each number in the computation. The more bits used, the more precisely a number can be encoded, reducing rounding errors. But it turns out that ML algorithms can often perform about as well with less precise encodings, and if fewer bits are used for each number, each calculation requires less compute and memory, so you can do more with the same hardware.)
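Here’s just the memory side of that effect, reusing the A100 numbers from earlier purely for illustration (the FLOP-based upper bound also rises on chips with native FP8 support):

```python
# How the memory-bound lower bound scales with weight precision.
A100_BANDWIDTH = 2000e9   # bytes per second
ACTIVE_PARAMS = 280e9     # parameters activated per forward pass

for name, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    lower_bound = A100_BANDWIDTH / (ACTIVE_PARAMS * bytes_per_param)
    print(name, round(lower_bound, 1), "forward passes per second")
# FP16 -> ~3.6, FP8 -> ~7.1: halving the bytes per weight roughly doubles the bound
```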
What about future models and algorithms?
As a first pass, FLOP and memory requirements per forward pass scale linearly with the number of model parameters activated per forward pass. So if a model is 10 times bigger, all else equal we’ll only be able to perform about one tenth as many forward passes.
In reality, there are many complications.
For example, if users really value long contexts and low latency, that makes it harder to batch, which pushes throughput towards the lower bound. If, however, most inference is short context and higher latency, throughput could be much closer to the theoretical max (but probably not above ~50%).
We should also probably expect chips, parallelisation and algorithms to become better optimised for inference over time, allowing throughput to get closer to the max, or to achieve more performance with less compute. These effects can be big.
One example is that new parallelisation techniques could be discovered, allowing inference to get closer to the upper bound of available FLOP.
Another example is that by using a mixture of experts structure, GPT-4 only needs to activate about one tenth of its parameters on a typical forward pass, so it only requires about a tenth of the compute suggested by its total number of parameters. Future models might be able to activate an even smaller fraction of parameters to achieve similar performance.
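Using the alleged GPT-4 numbers from the Semianalysis quote in the footnotes, the arithmetic looks roughly like this (the exact fraction depends on what you count as activated):

```python
# Active vs total parameters for GPT-4's alleged mixture-of-experts structure
# (numbers from the Semianalysis quote in the footnotes).
TOTAL_EXPERTS = 16
EXPERTS_PER_TOKEN = 2
PARAMS_PER_EXPERT = 111e9         # MLP parameters per expert
SHARED_ATTENTION_PARAMS = 55e9

total_params = TOTAL_EXPERTS * PARAMS_PER_EXPERT + SHARED_ATTENTION_PARAMS
active_params = EXPERTS_PER_TOKEN * PARAMS_PER_EXPERT + SHARED_ATTENTION_PARAMS

print(f"total: ~{total_params / 1e12:.1f} trillion, active: ~{active_params / 1e9:.0f} billion")
print(f"fraction activated per forward pass: {active_params / total_params:.0%}")
```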
Models can also be trained on more data with fewer parameters, which makes them cheaper to run.
As a concrete example, by using 10 times as many experts, a lot of data and some other architecture improvements, it seems like DeepSeek has been able to achieve performance approaching GPT-4 while only activating about a tenth as many parameters.
In the future, more ways to make inference more efficient will likely be invented.
As a final aside, Dylan Patel of Semianalysis claims that the computing requirements to run a forward pass will increase more slowly than linearly with model size. I’m not sure exactly why this is, but it could be because larger models open up more possibilities for optimisation.
What about input tokens?
Everything above has been just about the compute needed to produce one output token. But in reality, the compute required also depends on the number of tokens that are input into the model before producing the output.
I’ve heard conflicting accounts of the relationship between input tokens and compute, but a technical advisor told me that, as a rough rule of thumb for FLOP, you can simply add the input and output tokens together.
For example, if you input 1,000 tokens and get 50 output tokens, then you need about 1,050 times the FLOP required for a single forward pass.
That also lines up with this article by a16z (and would be consistent with the fees for using LLMs being linear in input tokens, though there are other reasons for this).
So, we can roughly say that the throughput numbers above apply per token, whether input or output, in most cases.
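As a sketch of that rule of thumb (a hypothetical helper, using the GPT-4 forward-pass estimate from above):

```python
# Rule of thumb: total FLOP ~ (input tokens + output tokens) x FLOP per forward pass.
FLOP_PER_FORWARD_PASS = 5.6e11   # GPT-4 estimate used above

def request_flop(input_tokens, output_tokens):
    return (input_tokens + output_tokens) * FLOP_PER_FORWARD_PASS

print(f"{request_flop(1000, 50):.2e} FLOP")  # 1,050 forward passes' worth, ~5.9e14
```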
However, one important difference is that input tokens can all be processed in parallel in a single pass, whereas output tokens have to be generated one at a time, each drawing on the KV cache. This makes it easier to get closer to the upper bound for input tokens than for output tokens. (This could be why input tokens are usually about half the price of output tokens.)
Another complication is that when producing output tokens, if the number of preceding input tokens (i.e. the context) is very large, the memory requirements can increase faster than linearly, pushing per-token performance towards the lower bound.
Summing up
We can look at the FLOP per second of future chips to get a rough upper bound on future ability to do inference, and memory bandwidth to get a lower bound. Then we can compare that to the size of future models.
Historically, our ability to use maximum available FLOP in inference has been worse than in training.
However, inference throughput has been getting closer to the upper bound recently, as chips have become better adapted to inference (especially through having more memory bandwidth), and parallelisation and batching techniques have improved. This trend could continue (up to a max of maybe around 50% of the upper bound) if we discover more parallelisation techniques. Or it could start to reverse due to the memory wall.
Algorithmic improvements have also allowed models to achieve the same performance while using much less compute, and that trend seems likely to continue.
Switching from FP16 to FP8 could also roughly double how much inference we can do with a given cluster of GPUs.
1. From Semianalysis, “GPT-4 architecture”, July 2023:
GPT-4 is more than 10x the size of GPT-3. We believe it has a total of ~1.8 trillion parameters across 120 layers versus the ~175 billion parameters of GPT-3…
Furthermore, OpenAI utilizes 16 experts within their model, each is about ~111B parameters for MLP. 2 of these experts are routed to per forward pass.
While the literature talks a lot about advanced routing algorithms for choosing which experts to route each token to, OpenAI’s is allegedly quite simple, for the current GPT-4 model.
Furthermore, there are roughly ~55B shared parameters for attention.
Each forward pass inference (generation of 1 token) only utilizes ~280B parameters and ~560 GFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 GFLOP that would be required per forward pass of a purely dense model.
2. One complication is that GPT-4 uses a mixture of experts structure, and so isn’t a dense model. The total number of parameters is ~8x larger than the number that get activated in a forward pass. However, these extra parameters also need to pass through memory in some situations, which would further decrease the lower bound. I’m ignoring this complication and treating GPT-4 as a dense model with 280 billion parameters.
Another complication is that if the context / KV cache is very long, then it can start to use more memory than the model weights, which would further decrease the lower bound.
3. Unfortunately there’s not a completely direct comparison. The new post doesn’t cover the A100, and the H100 analysis is for 32k input tokens rather than 8k. However, they do say “As such, in a large model like GPT-4 B200 brings ~4x to ~7x performance gains for GPT-4 inference, when quantization is set fairly, depending on the point in the interactivity curve chosen.”
See discussion of this post on Less Wrong: https://www.lesswrong.com/posts/g7H2sSGHAeYxCHzrz/how-much-ai-inference-can-we-do