
The Decode Tax: Who Survives When Custom Silicon Demands Hyperscaler Scale

Nvidia spent $20B on Groq to fix the part of inference where GPUs waste 99.8% of their silicon. The distribution surface question is settled. The open question is who captures margin in the repricing.

The Decode Tax: 591x over-provisioned, 0.17% GPU utilization, $20B Groq IP license

TLDR

GPUs waste 99.8% of their silicon during the decode phase of LLM inference. That is a physics constraint, not an implementation detail - HBM bandwidth is roughly 45x slower than on-chip SRAM. Nvidia licensed Groq's IP for $20B to solve this problem. The only companies that have successfully monetized custom ASICs so far are hyperscalers with captive workloads (e.g., Google) and the design houses they hire (Broadcom: $73B backlog; Marvell: $75B pipeline). Everyone else - neoclouds carrying billions of dollars in debt, merchant ASIC startups without anchor tenants - is structurally locked out.
The ASIC vs GPU debate is over. ASICs won inference, and GPUs won training. The open question is who controls the inference fabric and captures the margin.
H100 GPU runs at 0.17% utilization during decode. 99.8% of silicon idle.

Special Research Edition | ~4,000 words | 6 charts | March 2026

  1. Distribution surfaces - why captive scale gates ASIC survival
  2. ASIC feasibility - inference vs training, CUDA's narrowing moat, supply chain
  3. Neoclouds - why they can't deploy custom silicon
  4. The Groq IP license - prefill/decode economics, roofline math, SRAM physics
  5. NVLink Fusion - Nvidia as fabric layer
  6. Who captures margin

Nvidia spent $20B on Groq to fix the part of inference where GPUs waste 99.8% of their silicon. That number sounds like hyperbole. It isn't. During the decode phase of LLM inference - the autoregressive, one-token-at-a-time part that dominates real-time serving - a modern GPU uses roughly 0.2% of its compute. The rest of the die just waits on memory bandwidth. I spent enough time at Google staring at TPU utilization dashboards to know what a workload mismatch looks like, and this one is hard to overstate. It falls directly out of the roofline model (sourced at the end): at batch size one, arithmetic intensity is 1 FLOP/byte against a ridge point of 591. The chip is 591x over-provisioned for the actual workload.
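The roofline arithmetic above can be reproduced in a few lines. This is a sketch, not a measurement: the peak-FLOPS and bandwidth figures are assumptions taken from public H100 SXM spec sheets (dense FP8 throughput, HBM3 bandwidth), and the ~1 FLOP/byte decode intensity assumes every weight byte is streamed once per token with one multiply-add per weight.

```python
# Roofline sketch for batch-1 LLM decode on an H100-class GPU.
# Spec numbers below are assumptions from public datasheets, not measurements.
PEAK_FLOPS = 1979e12   # dense FP8 throughput, FLOP/s (assumed H100 SXM figure)
PEAK_BW = 3.35e12      # HBM3 bandwidth, bytes/s (assumed H100 SXM figure)

# Ridge point: arithmetic intensity needed to become compute-bound.
ridge_point = PEAK_FLOPS / PEAK_BW            # ~591 FLOP/byte

# Batch-1 decode streams every weight once per token; at ~2 FLOPs per
# 8-bit weight read in a fused multiply-add, intensity is ~1 FLOP/byte.
arithmetic_intensity = 1.0

# Attainable throughput is capped by whichever roof you hit first.
attainable = min(PEAK_FLOPS, arithmetic_intensity * PEAK_BW)
utilization = attainable / PEAK_FLOPS         # ~0.17%

print(f"ridge point:    {ridge_point:.0f} FLOP/byte")
print(f"utilization:    {utilization:.2%}")
print(f"over-provision: {ridge_point / arithmetic_intensity:.0f}x")
```

The same model shows the standard escape hatch: batching N requests reuses each weight read N times, pushing intensity toward N FLOP/byte - so the GPU only becomes compute-bound near batch ~591, which real-time serving rarely sustains.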

The distribution surface question - who has the captive query volume to justify half a billion dollars in NRE for custom silicon - was settled years ago by Google, then confirmed by Amazon, Meta, and Microsoft. OpenAI's $10B Cerebras deal does not change the picture. Cerebras was 87% dependent on a single sovereign client before that contract showed up. It needed OpenAI's 300M weekly active users to have a viable business, which is the thesis working, not a counterexample.

The question that matters now is narrower and more immediate: inference is already the majority of AI compute spend, GPUs are structurally wrong for the decode half of it, and Nvidia just paid $20B to bolt on the architecture that fixes it. What does the repricing look like, and who ends up on the right side?

The Distribution Surface Thesis

Every custom AI ASIC that made it to production at scale has a massive captive workload behind it. Every attempt without one ended in an acquisition, a pivot, or a quiet shutdown. There are no exceptions to this pattern, though Cerebras comes closest to looking like one (more on that shortly).

Related Deep Dives