Inference
KV Cache (Key-Value Cache)
In transformer models, the KV cache stores the Key (K) and Value (V) tensors for all previously processed tokens. During autoregressive text generation (one token at a time), the model only needs to compute the Query (Q), Key, and Value for the *new* token: the new Query is matched against the cached Keys of the past context, and the resulting attention weights are applied to the cached Values. This reduces the per-step cost of attention from O(N^2) (recomputing Keys and Values for the whole sequence) to O(N), making long-context inference practical, though the cache itself consumes significant VRAM.
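The mechanism can be sketched with a minimal single-head example in NumPy. The projection matrices, head dimension, and token inputs below are illustrative stand-ins, not any particular model's weights; the point is that each generation step computes Q, K, V only for the new token and reuses the cached K and V rows for everything before it.

```python
import numpy as np

d = 8  # head dimension (illustrative choice)
rng = np.random.default_rng(0)

# Hypothetical projection matrices standing in for a model's W_q, W_k, W_v.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one row per generated token

def attend_one_token(x_t):
    """Single-head attention for one new token using the KV cache.

    Q, K, V are computed only for the *new* token; past Keys and
    Values come from the cache, so each step costs O(N), not O(N^2).
    """
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K = np.stack(k_cache)            # (N, d): all Keys so far
    V = np.stack(v_cache)            # (N, d): all Values so far
    scores = K @ q / np.sqrt(d)      # (N,): new Query vs cached Keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()         # softmax over past positions
    return weights @ V               # (d,): attention output

# Simulate generating 5 tokens one at a time.
for _ in range(5):
    out = attend_one_token(rng.standard_normal(d))

print(len(k_cache))  # one cached K (and V) row per token
```

The VRAM cost is visible here too: the cache holds one K row and one V row per token per layer per head, which is why long contexts are memory-hungry even though each step is cheap.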