the memory wall, the interconnect wall, and the one budget line that actually moved
Seven hundred billion dollars in annual capex, and the useful question is not the size of the number but which of three walls the number is actually paying down: memory bandwidth, accelerator interconnect, thermal-plus-power. Most boards approve the headline without asking.
The headline number from the Q1 2026 earnings calls was capital expenditure. Roughly seven hundred billion dollars across Microsoft, Meta, Amazon and Alphabet for the year, depending on which accounting line you trust, with quarterly figures of Alphabet at $35.7B, Amazon at $44.2B, Microsoft at $30.9B (fiscal Q3), and Meta at $20B — and Meta has since raised its full-year guide to $125–145B. The number gets argued over on cable. What gets argued over inside chief technology offices afterwards is not the size of the number. It is which constraint that number is actually paying down.
About two-thirds of Microsoft's quarterly capex went to short-lived assets — chiefly GPUs and CPUs. Read that twice. The accounting depreciation curve on those purchases is steep, and the return has to clear before the asset is worth swapping out. That is the math underneath every architecture choice in this market, and most of the public discussion skims past it on the way to a FLOPS headline.
hbm4, sold out and asymmetric
The thing not for sale is HBM4. SK Hynix has allocated its entire 2026 supply. Micron has said the same about 2025 and 2026. The split of NVIDIA's Rubin-class allocation is now understood to land at roughly two-thirds SK Hynix, with Samsung scrambling its way into the mid-20% band and Micron near 20%. The 12-Hi stacks shipped first. The 16-Hi stacks are the next chokepoint and the one that will price-discriminate which hyperscaler runs the densest racks.
Why does this matter to your build? Because a Rubin-class GPU's effective inference throughput is bounded much more by the bandwidth of its on-package memory than by its peak FP4 number. The headline for the Vera Rubin NVL72 rack — 3.6 exaflops of NVFP4 inference, 2.5 exaflops of training — is impressive arithmetic. What that arithmetic means in production depends on whether your model's parameter set and KV cache can be fed fast enough to keep the matrix-multiply units busy. A workload that is memory-bandwidth-bound at 30% MFU on Blackwell does not become un-bound by being placed on a denser chip. It becomes 30% MFU on a more expensive denser chip.
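A minimal roofline-style sketch of that argument, in Python. The chip figures are hypothetical round numbers, not vendor specifications, and the intensity formula is a common approximation for dense decode (about 2 FLOPs per parameter per sequence, every weight read once per step, KV-cache traffic ignored). The names Chip, decode_intensity and attainable_flops are illustrative, not from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    peak_flops: float  # FLOP/s (hypothetical figure)
    mem_bw: float      # bytes/s of on-package memory (hypothetical figure)


def decode_intensity(batch_size: int, bytes_per_param: float) -> float:
    """Approximate FLOPs per byte moved for one decode step of a dense model.

    Assumes ~2 FLOPs per parameter per sequence and one read of every
    weight per step; KV-cache reads are ignored to keep the sketch small.
    """
    return 2.0 * batch_size / bytes_per_param


def attainable_flops(chip: Chip, intensity: float) -> float:
    """Roofline bound: the lower of peak compute and bandwidth * intensity."""
    return min(chip.peak_flops, chip.mem_bw * intensity)


# Two made-up chips: same memory bandwidth, three times the peak compute.
CURRENT_GEN = Chip("current-gen", peak_flops=1.0e15, mem_bw=4.0e12)
DENSER = Chip("denser", peak_flops=3.0e15, mem_bw=4.0e12)

intensity = decode_intensity(batch_size=8, bytes_per_param=1.0)
for chip in (CURRENT_GEN, DENSER):
    flops = attainable_flops(chip, intensity)
    share = flops / chip.peak_flops
    print(f"{chip.name}: {flops:.2e} FLOP/s attainable ({share:.1%} of peak)")
```

Under these assumptions both chips deliver the same attainable FLOP/s at batch 8, because the bandwidth term, not the peak, sets the ceiling. Tripling peak compute changes nothing until the workload's arithmetic intensity is high enough to cross the ridge point.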
When NVIDIA says NVL72 delivers ten times the inference throughput per watt of Blackwell at one-tenth the cost per token (and they do say that, in the Rubin platform release), the discount you apply should be neither zero nor total. The number is calibrated against benchmarks NVIDIA chose. The practitioner version: if you are running mixture-of-experts at scale with aggressive KV sharing, you will see most of that uplift. If you are running dense models with long contexts, you will see substantially less.
This is also why the metric to watch on a new build is not MFU — model FLOPs utilisation — but MBU, memory bandwidth utilisation. A well-tuned dense training workload hits 40–50% MFU on a modern accelerator. A poorly-fed memory-bandwidth-bound inference workload can hit 25% MFU while running its HBM bus at 80% saturation, and the useful diagnostic is the second number, not the first. Most enterprise teams do not yet measure MBU. Vendor benchmarks rarely report it. If your inference unit economics have surprised you on the downside this year, that is the first instrument to install before approving the next hardware refresh.
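One way to put the two numbers side by side, as a rough sketch rather than a production instrument. It assumes the usual approximation of 2 FLOPs per parameter per generated token, and that each decode step reads all weights plus the active KV cache once. Every input below is made up for illustration, and the function names mfu and mbu are mine. A real measurement would read hardware counters through a profiler instead of estimating bytes moved.

```python
def mfu(tokens_per_s: float, n_params: float, peak_flops: float) -> float:
    """Model FLOPs utilisation: estimated useful FLOP/s over peak FLOP/s.

    Uses ~2 FLOPs per parameter per token (forward pass only; attention
    FLOPs are ignored, so this slightly understates the numerator).
    """
    return tokens_per_s * 2.0 * n_params / peak_flops


def mbu(tokens_per_s: float, batch_size: int, weight_bytes: float,
        kv_bytes_per_step: float, peak_mem_bw: float) -> float:
    """Memory bandwidth utilisation: estimated bytes/s moved over peak bytes/s.

    Assumes each decode step reads every weight and the whole active
    KV cache once, and emits one token per sequence in the batch.
    """
    steps_per_s = tokens_per_s / batch_size
    return steps_per_s * (weight_bytes + kv_bytes_per_step) / peak_mem_bw


# Illustrative inputs only: a 70B-parameter model at 1 byte per parameter.
observed = dict(tokens_per_s=1800.0, batch_size=56)
model = dict(n_params=70e9, weight_bytes=70e9, kv_bytes_per_step=30e9)
chip = dict(peak_flops=1.0e15, peak_mem_bw=4.0e12)

m_f = mfu(observed["tokens_per_s"], model["n_params"], chip["peak_flops"])
m_b = mbu(observed["tokens_per_s"], observed["batch_size"],
          model["weight_bytes"], model["kv_bytes_per_step"],
          chip["peak_mem_bw"])
print(f"MFU {m_f:.1%}, MBU {m_b:.1%}")
```

With these inputs the sketch lands at roughly 25% MFU and 80% MBU: the compute units look idle while the memory bus is close to saturated, and only the second figure shows where the ceiling is.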
the interconnect that is shipping versus the interconnect that has a spec
The other axis is connectivity, and this is where the second half of 2026 gets interesting. On 7 April the UALink Consortium published its 2.0 specification — backed by AMD, AWS, Cisco, HPE, Intel, Meta and Microsoft — supporting up to 1,024 accelerators per pod over 200G data-link and physical layers, explicitly aimed at providing an open substitute for NVIDIA's NVLink fabric. The version 1.0 silicon has not yet shipped. That order — spec 2.0 before silicon 1.0 — is the consortium's central tell. They are trying to standardise the interface before any single vendor's product hardens the customer's expectations.
UALink 2.0 adds in-network compute, reduced latency, and a clean separation between transport and protocol so the same silicon can address 200G today and 400G tomorrow. Tom's Hardware flags the 1,024-GPU scale-up and 200 GT/s bandwidth as the headline figures. Roughly comparable to NVLink 5. What is not comparable is the deployed base. Every Blackwell rack already on the floor is NVLink. Every Rubin rack arriving in the second half of 2026 will be NVLink 6. The AMD MI400-series with UALink 1.0 is still being qualified.
Here is the stake. NVLink will not be displaced inside the training plane on any timeline that matters to a 2026 or 2027 build. UALink will succeed first in the inference plane — places where cost discipline is harder and the elastic-supply assumption is more comfortable, like enterprise on-prem inference and tier-2 cloud — and only later negotiate its way back into training. I would not bet against UALink on a ten-year arc. I would not put it inside a Q3 procurement plan either.
the third wall is the wet one
The third constraint that actually shapes a 2026 build is thermal. Vera Rubin NVL72 is liquid-cooled by mandate. Microsoft has been retrofitting its Maia AI Accelerator fleet with direct-to-chip Sidekick cold plates inside existing colocation footprints. Dell'Oro projects the data centre liquid-cooling market at roughly $7B in manufacturer revenue by 2029, growing at a 31.7% CAGR through 2030. The operational delta is real: lower sustained temperatures have been associated with around 12% better AI training performance and roughly half the heat-related failure rate.
The capex line nobody puts on the headline slide is the facility retrofit. If your existing colocation hall was designed for air at five to eight kilowatts per rack, you are not running a Rubin NVL72 in it. You are either building net-new, paying a colo provider a substantial premium to refit, or — increasingly — accepting that the next eighteen months of inference workload will live somewhere other than your existing footprint. That is the quietest single driver of cloud-versus-colo decisions I see in advisory work this quarter.
the inference layer is where the cost math falls apart cleanly
One more piece, because it is where most enterprise inference bills get inflated by a factor of two to four without anyone noticing. The serving-engine layer — vLLM, SGLang, TensorRT-LLM — is now mature enough that the choice has measurable consequences. Recent H100 benchmarks show TensorRT-LLM running 15–30% higher throughput than vLLM at matched concurrency, with speculative decoding adding up to 3.6× on top. SGLang's RadixAttention pays off on workloads with heavy shared-prefix structure: agentic systems, retrieval pipelines, multi-step prompting. vLLM remains the right default for fast iteration and model flexibility.
The discipline question is whether your engineering team has the time to absorb the compile-step overhead and stability constraints of TensorRT-LLM in exchange for a 15–30% throughput uplift. A workload running at $40k a month in token spend yields a $6–12k monthly margin from that switch: one engineer's loaded cost paid back in a quarter. For a workload running at $4k a month, the calculus inverts. Most teams treat this as a one-time platform decision rather than a workload-by-workload one. They are wrong, but the mistake is cheaper than getting the hardware decision wrong.
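The payback arithmetic in that paragraph can be sketched in a few lines. The simple estimate multiplies spend by the uplift, which is how the $6–12k range is reached. A stricter estimate at fixed token volume divides cost by (1 + uplift), which gives a somewhat smaller saving. MIGRATION_COST is a hypothetical placeholder for the one-off engineering effort, not a figure from any benchmark.

```python
def saving_simple(monthly_spend: float, uplift: float) -> float:
    """Back-of-envelope saving: spend multiplied by the throughput uplift."""
    return monthly_spend * uplift


def saving_fixed_volume(monthly_spend: float, uplift: float) -> float:
    """Saving when token volume is held constant.

    Serving the same tokens on (1 + uplift) times the throughput costs
    spend / (1 + uplift), so the saving is spend * uplift / (1 + uplift).
    """
    return monthly_spend * uplift / (1.0 + uplift)


def payback_months(migration_cost: float, monthly_saving: float) -> float:
    """Months of saving needed to recover the one-off migration cost."""
    return migration_cost / monthly_saving


MIGRATION_COST = 30_000.0  # hypothetical one-off engineering cost, in dollars

for spend in (40_000.0, 4_000.0):
    for uplift in (0.15, 0.30):
        simple = saving_simple(spend, uplift)
        strict = saving_fixed_volume(spend, uplift)
        months = payback_months(MIGRATION_COST, simple)
        print(f"${spend:,.0f}/mo at {uplift:.0%}: "
              f"simple ${simple:,.0f}, fixed-volume ${strict:,.0f}, "
              f"payback {months:.1f} months")
```

On the assumed migration cost, the $40k workload pays back within a few months on the simple estimate, while the $4k workload takes years. That is the inversion in numbers, and it is the reason to run the calculation per workload.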
the macro envelope
All of this sits inside an electricity envelope that is no longer hypothetical. The IEA's latest projection puts global data-centre electricity consumption at 1,100 TWh in 2026 — an 18% upward revision from December 2025 — with AI-accelerated server demand growing at 30% annually in the base case. Roughly equivalent to Japan's national consumption.
And the grid is where the constraint stops being a curve and becomes a wall. A developer entering Dominion Energy's interconnection queue in Northern Virginia today is being quoted connection dates of 2033. Constrained primary markets are running five to seven years out. Ireland's CRU reopened the door in December 2025, but only for sites that bring behind-the-meter generation equal to 100% of their grid connection and match 80% of demand with onshore renewables, a substantially higher bar than 2022. Denmark has paused new approvals outright. The Dutch moratorium has loosened but not lifted. The marginal new gigawatt of inference capacity is increasingly being built in Texas, Arizona, the US Midwest, the Nordics outside Denmark, and the Gulf, and the geography of where your token gets served from will change accordingly.
HBM4 you can negotiate over. Interconnect you can swap on a five-year cycle. Liquid cooling you can pay for. Grid interconnects you cannot conjure on demand, and the bottleneck list in 2027 will look different from 2026 because of where the substation queues already sit today.
Three walls. Memory bandwidth, accelerator interconnect, thermal-plus-power. The capex headline number captures none of them precisely. The boards that approve the headline number without asking which wall they are funding against will spend the back half of 2026 explaining why their compute bill outgrew their inference revenue. That is the part nobody on a vendor stage will tell you.
Tarry Singh is the founder and CEO of Real AI, an enterprise AI advisory and deployment firm working with global enterprises on production agent systems, model risk, and AI sovereignty strategy. He also leads the Earthscan for Energy AI startup, and is a founding contributor to the EU-funded HCAIM and PANORAIMA programmes for responsible AI education across European universities. He writes at tarrysingh.com.