P1-C2 · Transformer Revolution + Scaling Laws

Core Takeaway

An 8-page paper changed a $1 trillion industry.

AI Industry Knowledge — History → Technology → Supply Chain → Business → Applications → Geopolitics

P1-C2 (Part 1, Chapter 2). After this chapter, you'll be able to explain why the 2017 Transformer paper was a true turning point, and why scaling laws gave hyperscalers the confidence to spend $725B/yr on capex.

1. The Problem: After 70 Years of "Breakthroughs", Why Is 2017 Really Different?

In C1, you saw 5 eras and 4 winters, and every winter followed the same pattern: "tech breakthrough → capital influx → failure to deliver → retreat."

So why is this time (post-2017 Transformer) different? Not faith, but technical answers:

1957 Perceptron — breakthrough, but algorithmic bottleneck (single layer can't do XOR)
1986 Backpropagation — breakthrough, but compute bottleneck (no GPU)
1997 Deep Blue — breakthrough, but chess only (no generality)
2012 AlexNet — breakthrough, but perception only (image classification, no generation/reasoning)
2017 Transformer + scaling laws — first time: general + predictable scale + true generation

The next 5 sections explain what "predictable scale" means and why capital won't retreat this time (short term).

2. The Solution: Transformer + Scaling Laws (The Revolution Comes From Combining Both)

Component	Paper	Key Discovery
Transformer	"Attention Is All You Need" (Vaswani et al., Google Brain 2017)	Abandons RNN sequential processing; parallel attention makes training roughly 10x faster on GPUs
Scaling Laws	Kaplan et al. (OpenAI 2020), Chinchilla (DeepMind 2022)	Loss falls as a power law in each of parameters, data, and compute; add resources and performance improves predictably

Transformer alone isn't enough: it's just a more efficient architecture. Scaling laws alone aren't enough: without Transformer, you can't scale up. Combined → first time you can "buy capability with money": hyperscalers saw that $1B → $10B → $100B of capex each translate into predictable model improvements (a power law, so every 10x buys a similar-sized step rather than 10x the capability), so they dared to spend $725B/yr.

3. How It Works: Attention Intuition + Scaling Laws Power Law

3.1 Attention Intuition (vs RNN)

RNN Era: Reading a sentence is like playing a cassette tape, one word at a time, in order. On long sentences the model tends to forget the beginning (the "long-range dependency problem").

Transformer Attention: Reading a sentence is like looking at a map. The model sees all words at once and computes how each word relates to every other word. All positions are processed in parallel, though the cost grows with the square of the sentence length.

RNN:      word1 → word2 → word3 → ... → finally done (slow, forgetful)
Transformer: [word1, word2, word3, ...] attend simultaneously (fast, every word directly visible)

Result: Training speed 10x+ (massively parallel on GPUs). This made "large models" possible.
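The "look at every word at once" idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's full architecture: it assumes a single attention head, uses the input vectors directly as queries, keys, and values (real Transformers apply learned projections first), and the three 2-dimensional "word" vectors are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = the inputs.

    Learned Q/K/V projections and multiple heads are omitted (assumption
    made to keep the sketch short).
    """
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    # Each pass of this loop is independent of the others, which is why
    # a GPU can run them all at once as one matrix multiplication.
    for query in vectors:
        weights = softmax([dot(query, key) / scale for key in vectors])
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for a 3-word sentence.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(words)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one word attends to every word in the sentence. An RNN, by contrast, would have to finish word1 before it could start word2, so the same work cannot be spread across GPU cores.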

3.2 Scaling Laws: Why Capital Dared to Invest This Time

The Kaplan 2020 paper showed empirically that model loss (the measurable proxy for capability) follows a power law in parameters, data, and compute: add resources, and performance improves predictably.

GPT-2 (1.5B params)  → writes fluent sentences
GPT-3 (175B disclosed) → zero-shot cross-task
GPT-4 (~1.7T external estimate; OpenAI did NOT disclose) ⚠️ → cross-modal + complex reasoning
o1/o3 (reasoning models; params / compute undisclosed) → math/code on par with strong human competitors on reported benchmarks

→ This is the first time in history "money can buy capability" — and you can predict how much capability for how much money.

This is the underlying logic behind hyperscaler $725B/yr capex: as long as scaling laws hold, whoever spends more on capex gets a stronger model, and wins the application race.
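To make "predictable" concrete, here is a small sketch of the parameter-count power law from this section. The functional form L(N) = (N_c / N)^alpha_N and the constants below are approximate values as commonly quoted from Kaplan et al. (2020); treat them as illustrative assumptions, and note that the formula only applies when data and compute are not the bottleneck. Extrapolating it to the largest sizes shown is also an assumption.

```python
# Approximate constants for the parameter-count scaling law
# (illustrative assumptions; see the lead-in above).
ALPHA_N = 0.076   # power-law exponent for parameter count
N_C = 8.8e13      # scale constant, in non-embedding parameters

def loss_from_params(n_params):
    """Predicted test loss (nats per token) for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

def loss_ratio(scale_up):
    """Factor by which predicted loss shrinks when parameters grow by scale_up."""
    return scale_up ** (-ALPHA_N)

for n_params in (1.5e9, 1.5e10, 1.5e11, 1.5e12):
    print(f"{n_params:.1e} params -> predicted loss {loss_from_params(n_params):.3f}")

print(f"each 10x in parameters multiplies loss by {loss_ratio(10):.3f}")
```

Under these constants, every 10x increase in parameters lowers predicted loss by roughly 16%. The gain per dollar shrinks, but it is forecastable in advance, which is the property the capex argument relies on.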

3.3 Chinchilla Correction (2022 DeepMind)

The Kaplan paper had a blind spot: its compute-optimal recipe overemphasized parameters and underemphasized data.

Chinchilla's finding: Data must scale proportionally with parameters for optimal results (roughly 20 training tokens per parameter). GPT-3's 175B parameters were trained on only about 300B tokens, so the model was "undertrained" for its size.

→ That's why from 2023 onward, everyone went crazy hoarding data (Reddit/Twitter/publisher licenses). Data has become a scarce resource.
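The Chinchilla argument in this section is mostly arithmetic, so a short sketch may help. It assumes the commonly cited rule of thumb of about 20 training tokens per parameter and the standard approximation that training compute is about 6 x parameters x tokens FLOPs; both are simplifications of the paper's fitted results. The GPT-3 token count is the roughly 300B figure reported in the GPT-3 paper.

```python
TOKENS_PER_PARAM = 20       # approximate Chinchilla rule of thumb (assumption)
FLOPS_PER_PARAM_TOKEN = 6   # common approximation: training FLOPs ~ 6 * N * D

def compute_optimal_tokens(n_params):
    """Training tokens a model of n_params 'should' see under the rule of thumb."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Rough training compute for n_params parameters and n_tokens tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

GPT3_PARAMS = 175e9
GPT3_TOKENS = 300e9   # training tokens reported in the GPT-3 paper

actual_ratio = GPT3_TOKENS / GPT3_PARAMS
optimal = compute_optimal_tokens(GPT3_PARAMS)

print(f"GPT-3 actual tokens per parameter: {actual_ratio:.1f}")
print(f"Rule-of-thumb tokens for 175B params: {optimal:.2e}")
print(f"Shortfall factor: {optimal / GPT3_TOKENS:.1f}x")
print(f"Approximate GPT-3 training compute: {training_flops(GPT3_PARAMS, GPT3_TOKENS):.2e} FLOPs")
```

By this estimate GPT-3 saw about 1.7 tokens per parameter against a target near 20, which is the sense in which it was short of data rather than short of parameters.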

4. vs What You Already Know from C1

Dimension	C1 Gave You	C2 Adds
Time	5 eras, 70-year timeline	Zoom in on the 2017 point
Explanation	"Why 4 winters happened"	Technical answer to "Why this time might be different"
Investment significance	Don't default to belief	Know that 5 conditions holding = no winter; scaling laws holding is the most critical one

C1 = story. C2 = technical answer. Without C2, you don't know why this time might not be a winter — you can only have faith.

5. Try It: Estimate Scaling Jumps + Reasoning Models as a New Dimension

Task 1 (10 minutes):

GPT-2 → GPT-3: params 1.5B → 175B ≈ 117x. Capability jump: writing fluent sentences → zero-shot cross-task
GPT-3 → GPT-4: params 175B (disclosed) → ~1.7T (**external estimate; OpenAI did NOT disclose** per [GPT-4 Tech Report](https://arxiv.org/abs/2303.08774)) ≈ 10x. Capability jump: zero-shot → complex reasoning / cross-modal

Question: GPT-4 → GPT-5 (assume 17T) — what capability jump do you expect?
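If you want to check the multipliers in Task 1 yourself, a few lines of Python are enough. Only the GPT-2 and GPT-3 counts are disclosed figures; the GPT-4 value is the external estimate quoted above and the GPT-5 value is the question's own assumption, so the last two rows are hypothetical.

```python
# Parameter counts used in Task 1. The last two entries are NOT disclosed
# figures: one is an external estimate, the other the question's assumption.
MODELS = [
    ("GPT-2", 1.5e9),
    ("GPT-3", 175e9),
    ("GPT-4 (external estimate)", 1.7e12),
    ("GPT-5 (assumed)", 17e12),
]

def parameter_jumps(models):
    """Return (from_name, to_name, multiplier) for each consecutive pair."""
    return [
        (prev_name, next_name, next_params / prev_params)
        for (prev_name, prev_params), (next_name, next_params)
        in zip(models, models[1:])
    ]

for prev_name, next_name, multiplier in parameter_jumps(MODELS):
    print(f"{prev_name} -> {next_name}: {multiplier:.1f}x parameters")
```

The code only reproduces the size multipliers; the capability jump you expect for the last row is still your call.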

Task 2 (5 minutes):

Read the first paragraph of the OpenAI o1 blog post. → Reasoning models use "test-time compute" (inference compute) to trade thinking time for capability. This is scaling law curve #2 — not just training can scale, inference can too.
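One way to build intuition for "trading thinking time for capability" is a toy majority-vote model. This is a generic test-time compute technique (sample several answers, keep the most common one), not a description of how o1 works internally, which OpenAI has not published. The sketch assumes each sample is independently correct with probability p_correct and that answers are simply right or wrong.

```python
from math import comb

def majority_vote_accuracy(p_correct, n_samples):
    """Probability that more than half of n_samples independent answers are correct.

    Toy assumptions: binary right/wrong answers, independent samples,
    and an odd n_samples so there are no ties.
    """
    needed = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )

# Same model (same p_correct), more inference compute (more samples).
for n_samples in (1, 3, 11, 51):
    accuracy = majority_vote_accuracy(0.6, n_samples)
    print(f"{n_samples:>2} samples -> accuracy {accuracy:.3f}")
```

With the model held fixed, accuracy rises as the number of samples grows, which is the sense in which inference compute forms a second scaling curve alongside training compute.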

Self-check (3 items checked → proceed to P1-C3):

You can explain in one sentence why Transformer is faster than RNN
You can explain why the Chinchilla correction made data a scarce resource
You can state that "reasoning model scaling" and "training scaling" are two independent curves

6. What's Next

Transformer + scaling laws made LLMs possible. But why did NVDA take the lead, not Intel / AMD / Google?

The 2017 paper was written by Google, GPUs were sold by NVDA, and Intel was still the chip king. Why, 9 years later, does NVDA have a ~$5.2T market cap (📅 as of 2026-05-22, SEC 10-Q FY27 Q1; numbers change, learn the methodology)?

→ P1-C3 · Why NVDA Is Not Intel explains 20 years of CUDA + Jensen's platform strategy vs Intel's profit protection.

7. Deep Dive (optional): RLHF / Reasoning Models / Data Wall Risk

LLM scaling dimensions (including the newer dimensions 4 + 5):

Scaling dimension 1, parameters (Kaplan 2020): GPT-3, GPT-4
Scaling dimension 2, data (Chinchilla 2022): everyone hoarding data
Scaling dimension 3, post-training RLHF (Anthropic Constitutional AI + OpenAI InstructGPT): making models "obedient"
Scaling dimension 4, inference compute (o1/o3): don't change the model, trade thinking time for capability
Scaling dimension 5, agentic loop (Claude Code / browser use): models use tools themselves

Data wall risk (important for 2025+): The effective stock of public human-generated text is ~300T tokens (90% CI 100T-1000T, per Epoch AI 2024), including web + books + papers + code. The ~40T token figure refers to a narrower curated high-quality subset, NOT the public-text ceiling. GPT-4 training reportedly used ~13T tokens (external estimate; OpenAI did not disclose). If a hypothetical GPT-6 needed ~100T+ tokens at Chinchilla ratios, even the full ~300T would provide only ~2-3x headroom, so the data wall still approaches within 5-8 years.

→ Solutions: (a) synthetic data (b) video / multimodal (c) real-world robotics data.
→ AI winter wildcard: If synthetic data causes model quality degradation (model collapse), scaling dimension 2 breaks, and the investment thesis changes dramatically.
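The data-wall headroom quoted in this section can be reproduced with simple division. All inputs below come from the text above: the ~300T stock and its 100T-1000T interval are the Epoch AI estimate as cited, ~13T is an external estimate for GPT-4, and ~100T is the chapter's hypothetical requirement for a future model, so the outputs are only as good as those assumptions.

```python
# All figures are in tokens and are taken from the text above.
STOCK_MEDIAN = 300e12                 # effective public text stock (as cited)
STOCK_LOW, STOCK_HIGH = 100e12, 1000e12   # 90% confidence interval (as cited)
GPT4_TOKENS = 13e12                   # external estimate, not disclosed by OpenAI
FUTURE_MODEL_TOKENS = 100e12          # the chapter's hypothetical requirement

def headroom(stock_tokens, demand_tokens):
    """How many times over the stock could cover a single training run."""
    return stock_tokens / demand_tokens

print(f"Stock vs GPT-4-scale run:   {headroom(STOCK_MEDIAN, GPT4_TOKENS):.0f}x")
print(f"Stock vs 100T-token run:    {headroom(STOCK_MEDIAN, FUTURE_MODEL_TOKENS):.0f}x")
print(f"Low / high end of interval: "
      f"{headroom(STOCK_LOW, FUTURE_MODEL_TOKENS):.0f}x / "
      f"{headroom(STOCK_HIGH, FUTURE_MODEL_TOKENS):.0f}x")
```

The wide interval matters: at the low end of the estimate a 100T-token run would already consume the entire stock, while at the high end there would be about 10x headroom.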

8. Further Reading (this chapter: Transformer + Scaling Laws)

All free sources, aligned with P5 0-paid policy

Classic papers / primary sources:

Vaswani et al. "Attention Is All You Need" (2017) — 8-page paper, the Transformer starting point
Kaplan et al. "Scaling Laws for Neural Language Models" (OpenAI 2020) — Scientific basis capital dared to bet on
Hoffmann et al. "Training Compute-Optimal Large Language Models" (the Chinchilla paper, DeepMind 2022): optimal data / parameter ratio; widely treated as the reference point for later frontier training runs
OpenAI "Learning to Reason with LLMs" (o1 blog post, 2024): official explanation of inference compute as a new scaling dimension

Wikipedia (3-10 min, full timelines + primary citations):

"Transformer (deep learning architecture)" — Architecture + subsequent evolution (GPT / BERT / T5)
"Attention (machine learning)" — Past and present of attention mechanisms
"Large language model" — Full LLM lineage + scaling curve references

Videos / public lectures (~1-3 hr each):

Andrej Karpathy "Let's build GPT from scratch" (2 hr, YouTube) — Hand-coded nano-GPT; watching this gives you a true grasp of Transformer
Andrej Karpathy "Intro to Large Language Models" (1 hr, YouTube): explains LLMs clearly, no math
3Blue1Brown "Attention in transformers" (~30 min) — Visualized attention intuition

Podcasts (1-3 hr each):

Lex Fridman #333 — Andrej Karpathy — 2.5 hr deep dive, Transformer / scaling / training intuition
Lex Fridman #367 — Sam Altman — GPT-4-era OpenAI perspective

Blogs / Lilian Weng (OpenAI applied research):

Lilian Weng "The Transformer Family" — Evolution of the entire Transformer family
Lilian Weng "Attention? Attention!" — Attention mechanism survey

Books (library):

Sebastian Raschka "Build a Large Language Model (From Scratch)" (2024) — Build an LLM line-by-line
Stephen Wolfram "What Is ChatGPT Doing... and Why Does It Work?" (2023) — Short, gives intuition for LLM internals

Pair with this chapter's self-check:

After Karpathy's 2 videos + Wikipedia "Transformer" + the Chinchilla paper abstract, you should be able to answer "what are scaling laws / why capital dared to bet" and "reasoning models vs. the scaling law's 2nd curve."