P1-C5 · Why GPU / HBM / Liquid Cooling / Nuclear Power
Core Takeaway
Every piece of hardware exists to relieve one LLM bottleneck; whichever bottleneck gets unblocked, the company behind it sees its stock move.
AI Industry Knowledge — History → Technology → Supply Chain → Business → Application → Geopolitics
P1-C5 (Part 1, Chapter 5). After this chapter, you can reverse-engineer why the entire hardware stack is designed this way from LLM working principles, without memorizing supply chain tickers.
1. The Problem: Why Can't You Train LLMs with CPUs?
You see hyperscalers spending $725B on GPUs, but never ask why they can't use cheaper CPUs. You see SK Hynix's stock soaring, but don't know how HBM differs from regular DRAM. You see Vertiv up 200% and think it's an air conditioning company.
How LLMs work (learned in C4) dictates every hardware requirement — you can derive the entire hardware stack from first principles, and then you won't need to memorize 60 supply chain tickers.
2. The Solution: LLM's 4 Core Requirements → 4 Hardware Categories
| LLM Needs | Physical Bottleneck | Solving Hardware | Key Players |
|---|---|---|---|
| Massive parallel matrix multiplication (training) | CPU serial execution is slow | GPU / ASIC | NVDA · AMD · Google TPU |
| Fast data feeding (don't let GPU wait) | DRAM bandwidth insufficient | HBM high-bandwidth memory | SK Hynix · Micron · Samsung |
| GPU-to-GPU communication (1000+ card clusters) | Standard Ethernet is slow | NVLink / InfiniBand / Optical modules | NVDA Mellanox · ANET · COHR |
| Cooling + stable high power | Air cooling can't handle 800W+ chips | Liquid cooling + Nuclear / Gas | VRT · CEG · VST · ETN |
Each link has a "physical bottleneck → solving hardware → key company". The hardware stack maps one-to-one with companies.
3. How It Works: 4 Bottlenecks Explained in Detail
3.1 GPU vs CPU — Parallel Matrix Multiplication
LLM training spends the overwhelming majority of its compute on matrix multiplication (a neural network's layers are essentially large matrices).
- CPU: 8-128 cores, each core handles complex tasks independently (like 100 PhDs)
- GPU: 10,000+ cores, each core does simple arithmetic (like 10,000 elementary students doing addition/subtraction)
- Matrix multiplication: 10,000 elementary students doing arithmetic is 100x faster than 100 PhDs
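To make the analogy concrete, here is a minimal Python sketch of a naive matrix multiplication. It is an illustration only: the names `matmul_naive` and `independent_cells` and the tiny example matrices are assumptions, not anything from a real training stack. The point is that each output cell depends only on one row of the first matrix and one column of the second, so all cells could be handed to separate simple workers at the same time.

```python
def matmul_naive(a, b):
    """Multiply two matrices given as lists of rows (illustrative, not optimized)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    # Every (i, j) cell is independent of every other cell:
    # a CPU core walks through them one by one, a GPU spreads them over many cores.
    for i in range(rows):
        for j in range(cols):
            out[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))
    return out


def independent_cells(rows, cols):
    """Number of output cells that could, in principle, be computed in parallel."""
    return rows * cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul_naive(a, b)          # [[19, 22], [43, 50]]
cells = independent_cells(4096, 4096)  # a hypothetical 4096 x 4096 layer
```

For the hypothetical 4096 × 4096 output, that is over 16 million independent cells, far more work items than a CPU has cores, which is why thousands of simple GPU cores win here.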
**NVDA H100**: 1 card with 80GB HBM, 700W power, $30K-40K. A training cluster has 1024-8192 cards.
**AMD MI300X / Google TPU / AWS Trainium**: Same concept, different implementations. The CUDA ecosystem (NVDA's 20-year moat) keeps NVDA at 80%+ of the training market.
3.2 HBM vs Regular Memory — Data Throughput
GPUs compute fast, but before computing, data must be read from memory into the GPU. With regular DRAM, bandwidth is insufficient → the GPU can sit idle up to roughly 80% of the time waiting for data, so expensive compute is wasted.
HBM (High Bandwidth Memory): 3D stacked memory placed next to the GPU die, with bandwidth roughly 10x that of DDR5.
- SK Hynix: Primary HBM3e supplier, NVDA uses 70%+ from SK Hynix
- Micron: Ramped in 2024, gaining share
- Samsung: Slow to qualify with NVDA (a triple whammy of technology, yield, and labor-strike problems), losing market share
→ HBM shortage is NVDA's shipment ceiling. Monitoring HBM capacity is monitoring NVDA's revenue ceiling.
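The data-feeding bottleneck described in this section can be sketched with one division: the time to read a block of memory once is its size divided by the memory bandwidth. The figures below are illustrative assumptions rather than datasheet values: about 3 TB/s stands in for HBM-class bandwidth, about 0.3 TB/s for a DDR5-class system (matching the roughly 10x gap above), and the 80 GB payload is the H100's memory capacity.

```python
def seconds_to_stream(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to read `num_bytes` once at a given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


# Assumed, illustrative figures (not measured values):
PAYLOAD_BYTES = 80e9     # one full pass over 80 GB of weights and activations
HBM_BANDWIDTH = 3.0e12   # ~3 TB/s, assumed HBM-class bandwidth
DDR_BANDWIDTH = 0.3e12   # ~0.3 TB/s, assumed DDR5-class bandwidth

hbm_time = seconds_to_stream(PAYLOAD_BYTES, HBM_BANDWIDTH)
ddr_time = seconds_to_stream(PAYLOAD_BYTES, DDR_BANDWIDTH)
slowdown = ddr_time / hbm_time  # how much longer the GPU waits without HBM

print(f"HBM: {hbm_time * 1000:.1f} ms, DDR5-class: {ddr_time * 1000:.1f} ms")
```

Under these assumptions every pass over memory takes about ten times longer without HBM, and the GPU's arithmetic units have nothing to do in the meantime.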
3.3 NVLink / InfiniBand / Optical Modules — GPU-to-GPU Communication
One LLM is too large for a single GPU → distributed across 1000+ GPUs. They need high-speed communication (gradient synchronization).
- NVLink: Between NVDA's own GPUs, 1.8TB/s (Blackwell)
- InfiniBand: Between servers across a cluster (NVDA's Mellanox acquisition, announced in 2019 and completed in 2020, secured this)
- Optical modules: Data center cabling, speeds from 400G → 800G → 1.6T → CPO (Co-Packaged Optics)
**COHR / LITE / AAOI**: Optical modules. NVDA invested $2B strategically in COHR / LITE to lock supply. ANET: Network switching (META's primary supplier, used for east-west fabric).
→ Optical module price increases = leading indicator of AI capex acceleration (as a cluster scales up, optical module demand grows faster than linearly, because larger clusters need more switch tiers and more links per GPU).
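To see why link speed matters for gradient synchronization, here is a rough sketch using the standard bandwidth-only estimate for a ring all-reduce, where each GPU sends and receives about 2 × (N − 1) / N times the gradient size. The function name, the 20 GB gradient size, and the idea of applying one link speed to the whole ring are simplifying assumptions; real clusters mix NVLink inside a server with slower links between servers, and latency is ignored here.

```python
def ring_allreduce_seconds(gradient_bytes, num_gpus, link_bytes_per_s):
    """Bandwidth-only estimate for one ring all-reduce (latency ignored)."""
    if num_gpus < 2:
        return 0.0  # nothing to synchronize with a single GPU
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * gradient_bytes
    return traffic_per_gpu / link_bytes_per_s


# Assumed, illustrative figures:
GRADIENT_BYTES = 20e9      # hypothetical: 10B parameters x 2 bytes each
NVLINK = 1.8e12            # 1.8 TB/s headline figure, taken at face value
ETHERNET_400G = 400e9 / 8  # a 400 Gbit/s link expressed in bytes per second

fast = ring_allreduce_seconds(GRADIENT_BYTES, 1024, NVLINK)
slow = ring_allreduce_seconds(GRADIENT_BYTES, 1024, ETHERNET_400G)

print(f"NVLink-class: {fast * 1000:.1f} ms, 400G link: {slow * 1000:.1f} ms per sync")
```

Because this synchronization repeats every training step, the gap between a fast and a slow interconnect compounds over millions of steps.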
3.4 Liquid Cooling + Nuclear Power — Cooling + Sustained High Power
H100: 700W. Blackwell B200: roughly 1000 to 1200W depending on configuration. Air cooling struggles at these power densities → liquid cooling becomes a must.
1 Stargate-scale data center ≈ 1 GW. That's roughly the output of one large nuclear reactor.
- VRT (Vertiv): Leader in liquid cooling + data center power infrastructure
- CEG (Constellation): MSFT's 20-year nuclear PPA (Three Mile Island restart)
- VST (Vistra): Natural gas + nuclear
- ETN (Eaton) / HUBB: Power distribution
- GEV (GE Vernova): Gas turbines (backup + peak)
→ Energy is the real bottleneck for 2026+. You can buy GPUs, but you can't buy electricity (building a nuclear plant takes 10 years). That's why CEG / VST / GEV stocks soared in 2024+.
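A quick way to feel the power constraint is to ask how many GPUs a 1 GW site can feed. The sketch below uses the 700 W and 1200 W per-chip figures from this section; the overhead factor of 2.0 (covering CPUs, networking, cooling, and power conversion) is a loud assumption chosen for round numbers, not a measured value, and `gpus_supported` is a hypothetical helper.

```python
def gpus_supported(site_watts, gpu_watts, overhead_factor):
    """GPUs a site can power when each GPU's draw is scaled by an overhead factor.

    overhead_factor is an assumed multiplier for everything that is not the GPU
    itself (host CPUs, networking, cooling, power conversion losses).
    """
    return int(site_watts // (gpu_watts * overhead_factor))


SITE_WATTS = 1e9  # a 1 GW site
OVERHEAD = 2.0    # assumption: total draw is twice the GPU draw

h100_count = gpus_supported(SITE_WATTS, 700, OVERHEAD)
b200_count = gpus_supported(SITE_WATTS, 1200, OVERHEAD)

print(f"1 GW feeds ~{h100_count:,} H100-class or ~{b200_count:,} B200-class GPUs")
```

Under these assumptions, moving to hotter chips cuts the number of GPUs a fixed power budget can support from about 714,000 to about 417,000, which is why power, not chip supply, sets the ceiling.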
4. vs C4 — What You Already Know
| Dimension | C4 Gives You | C5 Adds |
|---|---|---|
| LLM working principles | ✓ | Doesn't explain hardware |
| Hardware stack | ✗ | LLM → 4 bottlenecks → 4 hardware categories → key companies |
| Investment significance | Knows training vs inference compute | Knows which bottleneck unblocking moves which company's stock; monitoring HBM / optical modules / power is a leading indicator |
C4 = Software. C5 = Hardware + Physics. Without C5, you don't know the true physical logic behind each link in the 60-ticker supply chain.
5. Try It: Estimate GPT-4's Electricity Usage for One Training Run
Task (10 minutes):
GPT-4 training estimate (illustrative assumptions, not disclosed figures):
- 10,000 H100s, each 700W = 7 MW (peak, GPUs only)
- Train for 6 months = 4380 hours
- Compute utilization ~50% average
- Total electricity = 7 MW × 4380 h × 0.5 ≈ 15.3 GWh
Reference:
- 1 US household annual electricity ~10 MWh = 0.01 GWh
- 15.3 GWh ≈ 1530 household-years
But this is one run. GPT-4 was likely trained multiple times (experiments + failures + final run), so total electricity is estimated at ~50 GWh ≈ 5000 household-years.
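The estimate above can be reproduced in a few lines of Python, which also makes it easy to swap in your own assumptions. The function name `training_energy_gwh` is hypothetical, and the inputs are the same illustrative figures used in the task, not disclosed training details.

```python
def training_energy_gwh(num_gpus, watts_per_gpu, hours, utilization):
    """Electricity for one training run in GWh (GPUs only, no cooling overhead)."""
    peak_mw = num_gpus * watts_per_gpu / 1e6       # W -> MW
    return peak_mw * hours * utilization / 1e3     # MWh -> GWh


HOUSEHOLD_MWH_PER_YEAR = 10  # approximate US household annual usage

single_run_gwh = training_energy_gwh(
    num_gpus=10_000, watts_per_gpu=700, hours=4380, utilization=0.5
)
household_years = single_run_gwh * 1000 / HOUSEHOLD_MWH_PER_YEAR

print(f"{single_run_gwh:.2f} GWh = about {household_years:.0f} household-years")
```

Unrounded, this gives 15.33 GWh and about 1533 household-years, which the task rounds to 15.3 GWh and 1530. Try doubling the GPU count or adding a cooling overhead multiplier to see how quickly the total moves.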
Self-check (3 items met → proceed to P1-C6):
- You can explain **why SK Hynix's stock flies in sync with NVDA**
- You can explain why CEG (nuclear) surged 200%+ in 2024+
- You can predict which link will rally next from hardware bottlenecks: HBM4 (2026)? Liquid cooling penetration (2026-27)? 1.6T optical modules?
6. What's Next
You can now reverse-engineer the hardware stack from LLMs. Now map the hardware stack to specific companies — which role each of the 60 tickers plays, and what they depend on.
→ P1-C6 · Supply Chain 5 Roles + 60 Ticker Map: Upgrade the existing supply chain diagram; with C1-C5 as foundation, you're no longer learning in isolation.
7. Deep Dive (optional): CPO / NVLink vs InfiniBand / TPU Economics / Inference Hardware Divergence
CPO (Co-Packaged Optics), 2025+: Optical modules go from pluggable to packaged together with the switch chip. Power consumption drops about 50% and bandwidth doubles. But CPO yield is difficult and mass production is slow. Key players: TSM (packaging), AVGO (switch), Coherent (optical). → If CPO truly reaches mass production in 2026, the entire optical module paradigm shifts, reshuffling existing players.
NVLink vs InfiniBand vs Ethernet: NVDA pushes NVLink (between its own GPUs) + InfiniBand (between servers across a cluster). But the Ultra Ethernet Consortium (Cisco/Arista/Intel/AMD/MSFT and others) is jointly promoting standard Ethernet for AI fabric. Long term, NVDA's networking advantage may be diluted.
TPU Economics (Google internal): TPU v5p performance is comparable to H100, but Google uses it internally (the chips are not sold externally). This diverts 30-50% of Google's demand away from NVDA, but total compute demand is unchanged (Google consumes the same compute either way; it simply doesn't buy it from NVDA).
Inference Hardware Divergence: Training and inference hardware are likely to separate in the future.
- Training: Massive clusters (NVDA Blackwell dominates)
- Inference: Single card / edge / small chips (Groq / Cerebras / SambaNova / Apple NPU)
NVDA Blackwell is also optimized for inference, but competitors have an opening here.
HBM4 (2026), next generation: Watch SK Hynix's mass production timeline; bandwidth roughly doubles again. NVDA Rubin (2026 H2) uses HBM4. This is the starting point for the next HBM shortage cycle.