Dispatches tag · llm-inference

Filed under llm-inference.

2 Dispatches carry this tag.

Essays · 22 June 2026 · 8 min read
Pulling Apart the Inference Stack
By mid-2026 every serious inference framework has accepted that the two phases of generation want different hardware: compute-bound prefill on one pool of GPUs, memory-bandwidth-bound decode on another, and the KV cache shipped between them over a fast fabric. It is the deepest reshaping of LLM serving since continuous batching, and it happened almost entirely without anyone outside the inference crowd noticing.
llm-inference · prefill-decode-disaggregation · vllm · kv-cache
Essays · 10 June 2026 · 8 min read
Cheap hits, confident wrong answers
Prefix caching is a fact; semantic caching is a bet. One is free and lossless; the other can return a confident, well-formatted, wrong answer with an HTTP 200. Both sit in the same architecture diagram.
llm-inference · semantic-caching · finops · prefix-caching
