- Essays · 8 min read
Pulling Apart the Inference Stack
By mid-2026 every serious inference framework has accepted that the two phases of generation want different hardware: compute-bound prefill on one pool of GPUs, memory-bandwidth-bound decode on another, with the KV cache shipped between them over a fast fabric. It is the deepest reshaping of LLM serving since continuous batching, and it happened almost entirely without anyone outside the inference crowd noticing.
llm-inference, prefill-decode-disaggregation, vllm, kv-cache
- Essays · 8 min read
Cheap hits, confident wrong answers
Prefix caching is a fact; semantic caching is a bet. One is free and lossless, the other can return a confident, well-formatted, wrong answer with an HTTP 200. Both are true in the same architecture diagram.
llm-inference, semantic-caching, finops, prefix-caching