P1-C2 · Transformer Revolution + Scaling Laws
Core Takeaway
An 8-page paper changed a $1 trillion industry.
P1-C2 (Part 1, Chapter 2). After this chapter, you'll be able to explain why the 2017 Transformer paper was a true turning point, and why scaling laws gave hyperscalers the confidence to spend $725B/yr on capex.
1. The Problem: After 70 Years of "Breakthroughs", Why Is 2017 Really Different?
In C1, you saw 5 eras — 4 winters all followed the pattern: "tech breakthrough → capital influx → failure to deliver → retreat."
So why is this time (post-2017 Transformer) different? Not faith, but technical answers:
- 1957 Perceptron — breakthrough, but algorithmic bottleneck (single layer can't do XOR)
- 1986 Backpropagation — breakthrough, but compute bottleneck (no GPU)
- 1997 Deep Blue — breakthrough, but chess only (no generality)
- 2012 AlexNet — breakthrough, but perception only (image classification, no generation/reasoning)
- 2017 Transformer + scaling laws — first time: general + predictable scale + true generation
The next 5 sections explain what "predictable scale" means and why capital won't retreat this time (short term).
2. The Solution: Transformer + Scaling Laws (The Revolution Comes From Combining Both)
| Component | Paper | Key Discovery |
|---|---|---|
| Transformer | "Attention Is All You Need" (Vaswani et al., Google, 2017) | Drops the RNN's sequential processing; attention handles all positions in parallel, which makes training roughly 10x faster on GPUs |
| Scaling Laws | Kaplan et al. (OpenAI 2020), Chinchilla (Hoffmann et al., DeepMind 2022) | Loss falls as a power law in parameters, data, and compute: add resources, and performance improves predictably |
Transformer alone isn't enough: it's just a more efficient architecture. Scaling laws alone aren't enough: without Transformer, you can't scale up. Combined → first time you can "buy capability with money": hyperscalers saw that each step from $1B → $10B → $100B in capex translates into a predictable model improvement (predictable, not proportional: a power law has diminishing returns), so they dared to spend $725B/yr.
3. How It Works: Attention Intuition + Scaling Laws Power Law
3.1 Attention Intuition (vs RNN)
RNN Era: Reading a sentence is like a cassette tape — one word at a time, sequentially. Long sentences forget the beginning ("long-range dependency problem").
Transformer Attention: Reading a sentence is like looking at a map: you see all words at once and compute how each word relates to every other. Every position is processed in parallel, up to the model's context window.
RNN: word1 → word2 → word3 → ... → finally done (slow, forgetful)
Transformer: [word1, word2, word3, ...] attend simultaneously (fast, no forgetting)
Result: Training becomes roughly 10x+ faster, because the computation runs massively parallel on GPUs instead of one word at a time. This made "large models" possible.
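The contrast above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes queries, keys, and values are all the raw input vectors (real Transformers apply learned projections first), and `rnn_style`, `decay`, and the three toy "embeddings" are hypothetical names and values chosen for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Assumption: queries = keys = values = the inputs themselves.
    Learned projection matrices are omitted to keep the sketch short.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    # Each iteration reads only the inputs, never a previous iteration's
    # result, which is why a GPU can compute every position at once.
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * value[i] for w, value in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

def rnn_style(vectors, decay=0.5):
    """Toy recurrence: every step needs the previous state (sequential)."""
    state = [0.0] * len(vectors[0])
    for vec in vectors:
        state = [decay * s + (1 - decay) * x for s, x in zip(state, vec)]
    return state

# Hypothetical 2-dimensional "embeddings" for three words.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(words)
for row in weights:
    print([round(w, 3) for w in row])
print(rnn_style(words))
```

Each printed row shows how strongly one word attends to every word, including itself. The recurrence, by contrast, ends with a single state in which earlier words have been progressively diluted, which is the "forgetting" described above.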
3.2 Scaling Laws: Why Capital Dared to Invest This Time
The Kaplan 2020 paper showed empirically that model loss (a proxy for capability) follows a power law in parameters, data, and compute: add resources, and performance improves predictably.
GPT-2 (1.5B params) → writes fluent sentences
GPT-3 (175B) → zero-shot cross-task
GPT-4 (~1.7T est.) → cross-modal + complex reasoning
o1/o3 (reasoning models) → rivals expert humans on some math/code benchmarks
→ This is the first time in history "money can buy capability" — and you can predict how much capability for how much money.
This is the underlying logic behind hyperscaler $725B/yr capex: as long as scaling laws hold, whoever spends more on capex gets a stronger model, and wins the application race.
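To make "predictable" concrete, here is a small sketch of the parameter-count power law, L(N) = (N_c / N)^α_N. The constants are the approximate fitted values reported by Kaplan et al. for their own model family and dataset, so treat them as illustrative. Applying the curve to GPT-3 or GPT-4 scale is an extrapolation, and the function names are hypothetical.

```python
# Approximate fitted constants from Kaplan et al. (2020) for the law
# L(N) = (N_c / N) ** alpha_N. Illustrative only: they were fitted to
# that paper's models and data, not to GPT-3 or GPT-4.
ALPHA_N = 0.076
N_C = 8.8e13  # counted in non-embedding parameters

def predicted_loss(n_params):
    """Predicted test loss when data and compute are not the bottleneck."""
    return (N_C / n_params) ** ALPHA_N

def loss_ratio_per_10x():
    """Factor by which the predicted loss shrinks for every 10x in parameters."""
    return 10 ** (-ALPHA_N)

for n in (1.5e9, 1.75e11, 1.7e12):
    print(f"{n:.2e} params -> predicted loss {predicted_loss(n):.3f}")
print(f"each 10x in parameters multiplies loss by {loss_ratio_per_10x():.3f}")
```

Under these constants, every 10x in parameters multiplies the loss by about 0.84, the same factor at every scale. That is what makes the return on spend forecastable, and also why it is not proportional: each 10x buys the same relative improvement, not a 10x improvement.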
3.3 Chinchilla Correction (2022 DeepMind)
The Kaplan paper had a flaw: it overemphasized parameters and underemphasized data.
Chinchilla's finding: For a fixed compute budget, data must scale in proportion to parameters (roughly 20 training tokens per parameter) for optimal results. GPT-3's 175B parameters were trained on only about 300B tokens, so the model was "undertrained" for its size.
→ That's why from 2023 onward, everyone went crazy hoarding data (Reddit/Twitter/publisher licenses). Data has become a scarce resource.
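The Chinchilla arithmetic can be checked directly. The sketch below assumes the common rule of thumb of about 20 tokens per parameter (Chinchilla itself was 70B parameters trained on 1.4T tokens) and the standard approximation that training costs about 6 FLOPs per parameter per token. GPT-3's figures (175B parameters, about 300B tokens) are from its paper; the function names are hypothetical.

```python
TOKENS_PER_PARAM = 20        # rule of thumb implied by Chinchilla (assumption)
FLOPS_PER_PARAM_TOKEN = 6    # common approximation: training FLOPs ~ 6 * N * D

def compute_optimal_tokens(n_params):
    """Training tokens the rule of thumb recommends for a model of this size."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Approximate total training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

gpt3_params = 175e9
gpt3_tokens = 300e9  # approximate figure reported in the GPT-3 paper

recommended = compute_optimal_tokens(gpt3_params)
print(f"tokens per parameter actually used: {gpt3_tokens / gpt3_params:.2f}")
print(f"recommended tokens: {recommended:.2e}")
print(f"data shortfall factor: {recommended / gpt3_tokens:.1f}x")
print(f"approx. training compute: {training_flops(gpt3_params, gpt3_tokens):.2e} FLOPs")
```

On these assumptions GPT-3 saw fewer than 2 tokens per parameter where about 20 would have been compute-optimal, which is the gap that sent labs looking for much more data.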
4. vs What You Already Know from C1
| Dimension | C1 Gave You | C2 Adds |
|---|---|---|
| Time | 5 eras, 70-year timeline | Zoom in on the 2017 point |
| Explanation | "Why 4 winters happened" | Technical answer to "Why this time might be different" |
| Investment significance | Don't default to belief | Know that 5 conditions holding = no winter; scaling laws holding is the most critical one |
C1 = story. C2 = technical answer. Without C2, you don't know why this time might not be a winter — you can only have faith.
5. Try It: Estimate Scaling Jumps + Reasoning Models as a New Dimension
Task 1 (10 minutes):
GPT-2 → GPT-3: params 1.5B → 175B = 117x. Capability jump: write sentences → zero-shot cross-task
GPT-3 → GPT-4: params est. 175B → 1.7T = 10x. Capability jump: zero-shot → complex reasoning / cross-modal
Question: GPT-4 → GPT-5 (assume 17T) — what capability jump do you expect?
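If you want to check the multipliers in Task 1 yourself, a short script is enough. The parameter counts are the ones used in this chapter: the GPT-4 figure is an unconfirmed estimate and the GPT-5 figure is the task's assumption. `MODELS` and `scaling_jumps` are hypothetical names.

```python
import math

# Parameter counts as used in this chapter. The GPT-4 figure is an
# unconfirmed estimate; the GPT-5 figure is the task's assumption.
MODELS = [
    ("GPT-2", 1.5e9),
    ("GPT-3", 175e9),
    ("GPT-4 (est.)", 1.7e12),
    ("GPT-5 (assumed)", 17e12),
]

def scaling_jumps(models):
    """Return (from, to, ratio, orders_of_magnitude) for consecutive models."""
    jumps = []
    for (prev_name, prev_n), (next_name, next_n) in zip(models, models[1:]):
        ratio = next_n / prev_n
        jumps.append((prev_name, next_name, ratio, math.log10(ratio)))
    return jumps

for prev_name, next_name, ratio, orders in scaling_jumps(MODELS):
    print(f"{prev_name} -> {next_name}: {ratio:.1f}x ({orders:.2f} orders of magnitude)")
```

Seen on a log scale, GPT-2 → GPT-3 was a jump of about two orders of magnitude, while each later step is about one. Keep that in mind when you estimate the size of the next capability jump.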
Task 2 (5 minutes):
Read the first paragraph of the OpenAI o1 blog post. → Reasoning models use "test-time compute" (inference compute) to trade thinking time for capability. This is scaling law curve #2 — not just training can scale, inference can too.
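One simple way to see why extra inference compute can buy accuracy is majority voting over several independent samples. This is a swapped-in illustration and not a description of how o1 works internally. It assumes a question with a single right-or-wrong answer, independent samples, and a hypothetical per-sample accuracy of 0.6.

```python
from math import comb

def majority_vote_accuracy(p_single, n_samples):
    """Probability that a strict majority of n independent samples is correct.

    Assumes a binary right/wrong outcome per sample and an odd n_samples,
    so ties cannot occur.
    """
    needed = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_single ** k * (1 - p_single) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )

# Hypothetical per-sample accuracy; more samples means more inference compute.
for n in (1, 5, 25, 101):
    print(f"{n:>3} samples -> accuracy {majority_vote_accuracy(0.6, n):.3f}")
```

The model itself never changes in this sketch; only the amount of compute spent per question does. That is the sense in which inference is a second scaling curve.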
Self-check (3 items checked → proceed to P1-C3):
- You can explain in one sentence why Transformer is faster than RNN
- You can explain why the Chinchilla correction made data a scarce resource
- You can state that "reasoning model scaling" and "training scaling" are two independent curves
6. What's Next
Transformer + scaling laws made LLMs possible. But why did NVDA take the lead, not Intel / AMD / Google?
The 2017 paper was written by Google, GPUs were sold by NVDA, and Intel was still the chip king. Why, 9 years later, is NVDA worth $3T?
→ P1-C3 · Why NVDA Is Not Intel explains 20 years of CUDA + Jensen's platform strategy vs Intel's profit protection.
7. Deep Dive (optional): RLHF / Reasoning Models / Data Wall Risk
LLM scaling dimensions (including 4 + 5):
- Scaling dimension 1, parameters (Kaplan 2020): GPT-3, GPT-4
- Scaling dimension 2, data (Chinchilla 2022): everyone hoarding data
- Scaling dimension 3, post-training RLHF (Anthropic Constitutional AI + OpenAI InstructGPT): making models "obedient"
- Scaling dimension 4, inference compute (o1/o3): don't change the model, trade thinking time for capability
- Scaling dimension 5, agentic loop (Claude Code / browser use): models use tools themselves
Data wall risk (important 2025+): Human-quality text is estimated at ~40T tokens. GPT-4 training reportedly used ~13T. At Chinchilla ratios, GPT-6 would need ~100T+ tokens, which is more than exists.
→ Solutions: (a) synthetic data (b) video / multimodal (c) real-world robotics data.
→ AI winter wildcard: If synthetic data causes model quality degradation (model collapse), scaling dimension 2 breaks, and the investment thesis changes dramatically.
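The data-wall numbers above can be turned into a quick budget check. All inputs are this section's estimates (the ~40T stock, the reported ~13T for GPT-4, the ~100T+ requirement) plus the assumed rule of thumb of about 20 tokens per parameter. The function names are hypothetical.

```python
TOKENS_PER_PARAM = 20       # Chinchilla-style rule of thumb (assumption)
TEXT_STOCK_TOKENS = 40e12   # this section's estimate of human-quality text
GPT4_TOKENS = 13e12         # reported, unconfirmed figure quoted above
NEXT_GEN_TOKENS = 100e12    # this section's "~100T+" requirement

def largest_compute_optimal_model(available_tokens):
    """Largest parameter count the available data supports at the rule-of-thumb ratio."""
    return available_tokens / TOKENS_PER_PARAM

def shortfall(required_tokens, available_tokens):
    """Tokens still missing after all available data is used (never negative)."""
    return max(0.0, required_tokens - available_tokens)

print(f"largest compute-optimal model on the text stock: "
      f"{largest_compute_optimal_model(TEXT_STOCK_TOKENS):.1e} params")
print(f"text stock relative to GPT-4's reported data: "
      f"{TEXT_STOCK_TOKENS / GPT4_TOKENS:.1f}x")
print(f"shortfall for a 100T-token run: "
      f"{shortfall(NEXT_GEN_TOKENS, TEXT_STOCK_TOKENS):.1e} tokens")
```

On these estimates the text stock is only about 3x what GPT-4 reportedly used, and a 100T-token run would be about 60T tokens short. That gap is what synthetic, multimodal, or robotics data would have to fill.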