Essays · 8 min read
Pulling Apart the Inference Stack
By mid-2026 every serious inference framework has accepted that the two phases of LLM generation want different hardware: compute-bound prefill on one pool of GPUs, memory-bandwidth-bound decode on another, and the KV cache shipped between them over a fast fabric. It is the deepest reshaping of LLM serving since continuous batching, and it happened almost entirely without anyone outside the inference crowd noticing.
llm-inference · prefill-decode-disaggregation · vllm · kv-cache