
P1-C2 · Transformer Revolution + Scaling Laws

Core Takeaway

An 8-page paper changed a $1 trillion industry.

AI Industry Knowledge — History → Technology → Supply Chain → Business → Applications → Geopolitics

P1-C2 (Part 1, Chapter 2). After this chapter, you'll be able to explain why the 2017 Transformer paper was a true turning point, and why scaling laws gave hyperscalers the confidence to spend $725B/yr on capex.


1. The Problem: 70 Years of "Breakthroughs" — Why Is 2017 Really Different?

In C1, you saw 5 eras — 4 winters all followed the pattern: "tech breakthrough → capital influx → failure to deliver → retreat."

So why is this time (post-2017 Transformer) different? Not faith, but technical answers:

  • 1957 Perceptron — breakthrough, but algorithmic bottleneck (single layer can't do XOR)
  • 1986 Backpropagation — breakthrough, but compute bottleneck (no GPU)
  • 1997 Deep Blue — breakthrough, but chess only (no generality)
  • 2012 AlexNet — breakthrough, but perception only (image classification, no generation/reasoning)
  • 2017 Transformer + scaling laws — first time: general + predictable scale + true generation

The next 5 sections explain what "predictable scale" means and why capital won't retreat this time (short term).


2. The Solution: Transformer + Scaling Laws — The Revolution Comes From Combining Both

Component | Paper | Key Discovery
Transformer | "Attention Is All You Need" (Vaswani et al., Google Brain, 2017) | Drops the RNN's sequential processing; attention runs in parallel, so training is roughly 10x faster
Scaling Laws | Kaplan et al. (OpenAI, 2020); Chinchilla (DeepMind, 2022) | Loss falls as a power law in parameters, data, and compute: add resources, performance improves predictably

Transformer alone isn't enough: it's just a more efficient architecture. Scaling laws alone aren't enough: without Transformer, you can't scale up. Combined → first time you can "buy capability with money": hyperscalers saw that $1B → $10B → $100B of capex each translated into predictable model improvements (predictable, not proportional: a power law means each extra dollar buys a little less), so they dared to spend $725B/yr.


3. How It Works: Attention Intuition + Scaling Laws Power Law

3.1 Attention Intuition (vs RNN)

RNN Era: Reading a sentence is like a cassette tape — one word at a time, sequentially. Long sentences forget the beginning ("long-range dependency problem").

Transformer Attention: Reading a sentence is like looking at a map: you see all the words at once and compute how each word relates to every other. Every position is processed in parallel, although the cost grows with the square of the sentence length.

RNN:         word1 → word2 → word3 → ... → finally done (slow, forgetful)
Transformer: [word1, word2, word3, ...] attend simultaneously (fast, every word sees every other word directly)

Result: Training runs roughly 10x+ faster because the work is massively parallel on GPUs. This made "large models" possible.
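The cassette-vs-map contrast above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes queries, keys, and values are all the raw input vectors (a real Transformer layer applies learned projections first), and the names `self_attention`, `rnn_like`, and the 2-d `words` vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = input.

    Assumption: learned projection matrices are omitted to keep the sketch short.
    """
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:  # each iteration is independent, so all can run in parallel
        weights = softmax([dot(query, key) / scale for key in vectors])
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(query))]
        outputs.append(mixed)
    return outputs

def rnn_like(vectors, decay=0.5):
    """Toy recurrence: every step needs the previous state, so steps run one by one."""
    state = [0.0] * len(vectors[0])
    for v in vectors:
        state = [decay * s + (1 - decay) * x for s, x in zip(state, v)]
    return state

words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 2-d word vectors
attended = self_attention(words)  # one output per word, each a weighted mix of all words
final_state = rnn_like(words)     # a single state in which early words have faded
```

The loop in `self_attention` has no dependency between iterations, which is what lets a GPU compute every position at once; the loop in `rnn_like` cannot be reordered, and the first word's contribution shrinks at every step.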

3.2 Scaling Laws — Why Capital Dared to Invest This Time

The Kaplan 2020 paper showed empirically that model loss follows a power law in parameters, data, and compute: add resources, and performance improves predictably.

GPT-2 (1.5B params)  → writes fluent sentences
GPT-3 (175B)         → zero-shot / few-shot cross-task
GPT-4 (~1.7T est.)   → cross-modal + complex reasoning
o1/o3 (reasoning models) → math/code rivals expert humans on some benchmarks

This is the first time in history "money can buy capability" — and you can predict how much capability for how much money.

This is the underlying logic behind hyperscaler $725B/yr capex: as long as scaling laws hold, whoever spends more on capex gets a stronger model, and wins the application race.
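To make "predictable" concrete, here is a rough sketch of the parameter-only power law from the Kaplan paper, L(N) = (N_c / N)^alpha_N. The constants are approximate values as reported there and should be treated as illustrative; the paper counts non-embedding parameters, so plugging in headline model sizes is only a ballpark. The function names are hypothetical.

```python
# Approximate constants from Kaplan et al. (2020), parameter-limited regime.
# Assumption: used here for illustration only, not as exact fitted values.
ALPHA_N = 0.076
N_C = 8.8e13

def loss_from_params(n_params):
    """Predicted loss L(N) = (N_c / N) ** alpha_N when data and compute are not limiting."""
    return (N_C / n_params) ** ALPHA_N

def loss_ratio(scale_factor):
    """Factor by which predicted loss is multiplied when parameters grow by scale_factor."""
    return scale_factor ** -ALPHA_N

for name, n in [("GPT-2", 1.5e9), ("GPT-3", 175e9)]:
    print(f"{name}: predicted loss {loss_from_params(n):.2f}")
print(f"10x more parameters multiplies loss by {loss_ratio(10):.3f}")
```

Under these constants, every 10x in parameters trims predicted loss by about 16%. The improvement is steady and forecastable, but each step costs ten times more than the last, which is why the capex numbers climb so fast.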

3.3 Chinchilla Correction (2022 DeepMind)

The Kaplan paper had a flaw: it overemphasized parameters and underemphasized data.

Chinchilla's finding: Data must scale proportionally with parameters for optimal results, at roughly 20 training tokens per parameter. GPT-3's 175B parameters were trained on only about 300B tokens, so the model was "undertrained."

→ That's why from 2023 onward, everyone went crazy hoarding data (Reddit/Twitter/publisher licenses). Data has become a scarce resource.
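The Chinchilla arithmetic can be checked in a few lines. This sketch assumes the commonly quoted rule of thumb of about 20 tokens per parameter and the standard approximation that training compute is about 6 × parameters × tokens; the GPT-3 token count is the roughly 300B figure from its paper. Function names are hypothetical.

```python
TOKENS_PER_PARAM = 20       # rule of thumb drawn from Chinchilla (assumption)
FLOPS_PER_PARAM_TOKEN = 6   # common approximation: training FLOPs ~ 6 * N * D

def optimal_tokens(n_params):
    """Compute-optimal training tokens for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Approximate total training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

gpt3_params = 175e9
gpt3_tokens = 300e9  # approximate figure from the GPT-3 paper

wanted = optimal_tokens(gpt3_params)   # tokens a compute-optimal 175B model would want
shortfall = wanted / gpt3_tokens       # how many times more data that is
print(f"Compute-optimal tokens for 175B params: {wanted:.2e}")
print(f"GPT-3 saw about 1/{shortfall:.0f} of that")
```

By this estimate a 175B-parameter model wants about 3.5T tokens, more than ten times what GPT-3 was trained on, which is the gap that sent labs looking for data.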


4. vs What You Already Know from C1

Dimension | C1 Gave You | C2 Adds
Time | 5 eras, 70-year timeline | Zoom in on the 2017 point
Explanation | "Why 4 winters happened" | Technical answer to "Why this time might be different"
Investment significance | Don't default to belief | Know that 5 conditions holding = no winter; scaling laws holding is the most critical one

C1 = story. C2 = technical answer. Without C2, you don't know why this time might not be a winter — you can only have faith.


5. Try It: Estimate Scaling Jumps + Reasoning Models as a New Dimension

Task 1 (10 minutes):

GPT-2 → GPT-3: params 1.5B → 175B = 117x. Capability jump: write sentences → zero-shot cross-task
GPT-3 → GPT-4: params est. 175B → 1.7T = 10x. Capability jump: zero-shot → complex reasoning / cross-modal

Question: GPT-4 → GPT-5 (assume 17T) — what capability jump do you expect?
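If you want to check the multipliers in Task 1 yourself, a short script does it. The parameter counts are the chapter's own figures (the GPT-4 number is an estimate and the GPT-5 number is the task's assumption); the helper name `jump` is hypothetical.

```python
import math

def jump(old_params, new_params):
    """Return (scale factor, orders of magnitude) between two model sizes."""
    factor = new_params / old_params
    return factor, math.log10(factor)

steps = [
    ("GPT-2 -> GPT-3", 1.5e9, 175e9),
    ("GPT-3 -> GPT-4 (est.)", 175e9, 1.7e12),
    ("GPT-4 -> GPT-5 (assumed)", 1.7e12, 17e12),
]

for label, old, new in steps:
    factor, orders = jump(old, new)
    print(f"{label}: {factor:.1f}x ({orders:.2f} orders of magnitude)")
```

Looking at orders of magnitude rather than raw multipliers is useful here, because a power law ties capability gains to the logarithm of the resources added.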

Task 2 (5 minutes):

Read the first paragraph of the OpenAI o1 blog post. → Reasoning models use "test-time compute" (inference compute) to trade thinking time for capability. This is scaling law curve #2 — not just training can scale, inference can too.
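One way to build intuition for "trading thinking time for capability" is majority voting over several sampled answers. This is a stand-in technique chosen for illustration, not a description of how o1 works internally. The sketch assumes each sample is independently correct with the same probability and that there are only two possible answers; `majority_vote_accuracy` is a hypothetical name.

```python
from math import comb

def majority_vote_accuracy(p_correct, n_samples):
    """Probability that more than half of n independent samples are correct.

    Assumptions: binary answers, independent samples, odd n_samples (no ties).
    """
    needed = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )

for n in (1, 5, 25, 101):
    print(f"{n:>3} samples -> accuracy {majority_vote_accuracy(0.6, n):.3f}")
```

The model never changes in this sketch; only the amount of inference compute does. That is the sense in which inference is a second curve next to training.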

Self-check (3 items checked → proceed to P1-C3):

  • You can explain in one sentence why Transformer is faster than RNN
  • You can explain why the Chinchilla correction made data a scarce resource
  • You can state that "reasoning model scaling" and "training scaling" are two independent curves

6. What's Next

Transformer + scaling laws made LLMs possible. But why did NVDA take the lead, not Intel / AMD / Google?

The 2017 paper was written by Google, GPUs were sold by NVDA, and Intel was still the chip king. Why, 9 years later, is NVDA worth $3T?

→ P1-C3 · Why NVDA Is Not Intel explains 20 years of CUDA + Jensen's platform strategy vs Intel's profit protection.


7. Deep Dive (optional): RLHF / Reasoning Models / Data Wall Risk

LLM scaling dimensions, including 4 + 5:

  • Scaling dimension 1: Parameters (Kaplan 2020): GPT-3, GPT-4
  • Scaling dimension 2: Data (Chinchilla 2022): everyone hoarding data
  • Scaling dimension 3: Post-training RLHF (Anthropic Constitutional AI + OpenAI InstructGPT): making models "obedient"
  • Scaling dimension 4: Inference compute (o1/o3): don't change the model, trade thinking time for capability
  • Scaling dimension 5: Agentic loop (Claude Code / browser use): models use tools themselves

Data wall risk (important 2025+): Human-quality text is estimated at ~40T tokens. GPT-4 training reportedly used ~13T. At Chinchilla ratios, GPT-6 would need ~100T+ tokens, which is more than exists.

→ Solutions: (a) synthetic data (b) video / multimodal (c) real-world robotics data.

→ AI winter wildcard: If synthetic data causes model quality degradation (model collapse), scaling dimension 2 breaks, and the investment thesis changes dramatically.
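The data wall estimate above can be turned into a back-of-the-envelope calculation. This sketch takes the chapter's ~40T-token stock estimate at face value and assumes the roughly 20 tokens-per-parameter Chinchilla rule of thumb; both numbers are rough, and the function names are hypothetical.

```python
TOKENS_PER_PARAM = 20      # Chinchilla-style rule of thumb (assumption)
TEXT_STOCK_TOKENS = 40e12  # the chapter's estimate of human-quality text

def largest_compute_optimal_model(stock_tokens):
    """Largest parameter count that the available text can train compute-optimally."""
    return stock_tokens / TOKENS_PER_PARAM

def token_shortfall(needed_tokens, stock_tokens):
    """Tokens missing if a training run needs more than the stock holds."""
    return max(0.0, needed_tokens - stock_tokens)

ceiling = largest_compute_optimal_model(TEXT_STOCK_TOKENS)
gap = token_shortfall(100e12, TEXT_STOCK_TOKENS)
print(f"Text alone supports roughly {ceiling:.1e} parameters")
print(f"A 100T-token run would be short by {gap:.1e} tokens")
```

Under these assumptions, human text tops out at around 2T compute-optimal parameters, and a 100T-token run would need about 60T tokens from other sources, which is the gap that synthetic, multimodal, and robotics data are meant to fill.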