Why your ML code runs 100x slower on CPU than GPU — and when TPU wins both
Snaplyze Digest
R&D · Intermediate · 3 min read · Mar 18, 2026 · Updated Apr 1, 2026


“Your ML training takes 10 hours on CPU, 30 minutes on GPU, 5 minutes on TPU — here's why architecture matters more than clock speed.”

In Short

The same matrix multiplication runs at completely different speeds on CPU, GPU, and TPU — not because of clock speed, but because of architecture. CPUs handle complex branching and system calls with low latency. GPUs spread work across thousands of cores that execute the same instruction simultaneously (SIMT). TPUs use systolic arrays designed purely for matrix math, with compiler-controlled dataflow and on-chip memory. Pick the wrong hardware and you waste 90% of your compute budget. Pick the right one and you ship faster for less money.
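You can see a small version of this effect without any accelerator at all. The sketch below (illustrative only, not from the original digest) runs the same matrix multiplication two ways on one CPU: a pure-Python triple loop that executes one scalar multiply-add at a time, and NumPy's `@`, which dispatches to a vectorized BLAS kernel that exploits SIMD units and cache-friendly blocking. Same math, very different hardware mapping:

```python
import numpy as np

def matmul_naive(a, b):
    """Pure-Python triple loop: one scalar multiply-add per step,
    mirroring how a single core walks through the work serially."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out

rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = rng.random((64, 64))

# NumPy's @ hands the same computation to an optimized, vectorized kernel.
fast = a @ b
slow = matmul_naive(a.tolist(), b.tolist())
assert np.allclose(fast, np.array(slow))  # identical results, wildly different speed
```

Time both with `timeit` and the vectorized path is typically orders of magnitude faster, which is the CPU-vs-GPU-vs-TPU story in miniature: the winner is whoever maps the work onto parallel execution units best, not whoever clocks highest.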

hardware · gpu · cpu · tpu · machine-learning
Why It Matters
The practical pain point this digest is really about.

You know that feeling when your ML training job takes forever, so you throw more CPU cores at it and... nothing changes? Or when you provision expensive GPUs but your inference latency still sucks? The problem isn't the hardware — it's matching your workload to the wrong architecture. CPUs excel at branching logic and low-latency decisions. GPUs crush parallel matrix operations. TPUs dominate when your workload fits their systolic array design. Before: you guess which hardware to use and hope. Now: you understand exactly why each architecture wins for specific workloads.

How It Works
The mechanism, architecture, or workflow behind it.

Think of it like organizing a kitchen. A CPU is one master chef who handles any recipe, makes complex decisions, and switches tasks instantly — but can only cook one dish at a time. A GPU is 10,000 line cooks who each do one simple task (chop this vegetable) in perfect sync — amazing for repetitive work, terrible for anything requiring judgment. A TPU is a custom assembly line built for one specific dish — it makes that dish faster than anyone, but can't cook anything else. The key insight: CPUs optimize for latency (get one thing done fast), GPUs optimize for throughput (get many things done eventually), and TPUs optimize for one specific throughput pattern (dense matrix multiplication).
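The TPU "custom assembly line" can be made concrete with a toy simulation of an output-stationary systolic array. This is a sketch of the general technique, not any vendor's actual design: each processing element (PE) owns one output cell, operands from A flow rightward and operands from B flow downward, and every PE fires one multiply-accumulate per cycle in lockstep.

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an n x n output-stationary
    systolic array computing C = A @ B (square matrices only)."""
    n = len(A)
    acc = [[0.0] * n for _ in range(n)]    # per-PE accumulator (C[i][j])
    a_reg = [[0.0] * n for _ in range(n)]  # operand flowing rightward
    b_reg = [[0.0] * n for _ in range(n)]  # operand flowing downward
    for t in range(3 * n - 2):             # pipeline fill + compute + drain
        # shift operands: right along rows, down along columns
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # inject skewed inputs at the array edges (row/col i delayed i cycles)
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0.0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0.0
        # every PE performs one multiply-accumulate, in lockstep
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

print(systolic_matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
# -> [[19.0, 22.0], [43.0, 50.0]]
```

Notice there is no control flow per element: data just marches through the grid on a fixed schedule the compiler can plan in advance. That rigidity is exactly why the same design that dominates dense matmul cannot "cook anything else."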

Key Takeaways
7 fast bullets that make the core value obvious.
  • CPU architecture — why YOU care: Few powerful cores with complex control logic handle branching, system calls, and interrupts. Your OS, database, and web server run here because they need flexibility. Use CPUs when your workload is branch-heavy, latency-sensitive, or hard to parallelize.
  • GPU architecture — why YOU care: Thousands of simple cores execute the same instruction across massive datasets (SIMT/SIMD). Streaming Multiprocessors (SMs) contain warp schedulers, CUDA cores, register files, and L1 caches. Use GPUs when the same operation runs over large, regular arrays of data.
  • TPU architecture — why YOU care: Systolic arrays flow data through a grid of processing units in a rhythmic pattern, with compiler-controlled dataflow and on-chip weight/activation buffers. Use TPUs when you're doing dense matrix multiplication at scale on stable, well-supported model architectures.
  • GPU memory hierarchy — why YOU care: L1 cache per SM, shared L2 cache, then high-bandwidth global memory with high latency. Understanding this explains why coalesced memory access matters — while some warps wait on memory, the scheduler keeps others executing, hiding latency.
  • Warp execution — why YOU care: GPUs execute 32 threads (a warp) in lockstep. Divergent branching within a warp kills performance because both paths must execute serially. This is why GPU code avoids data-dependent conditionals inside performance-critical kernels.
  • TPU limitations — why YOU care: TPUs struggle with dynamic computation graphs, custom operations, and model architectures that don't fit their systolic array design. They're fixed-function accelerators, not general-purpose programmable processors.
  • GPU versatility — why YOU care: The same GPU that trains your model runs inference, handles computer vision, processes scientific simulations, and renders graphics. When new techniques emerge, GPUs adapt. TPUs often require new compiler support or a new hardware generation to catch up.
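The warp-execution bullet above is easy to see with a toy SIMT model (a sketch of the concept, not real GPU behavior): all 32 lanes of a warp execute every instruction together, so a divergent branch is handled by running BOTH sides and masking off the lanes that didn't take each path.

```python
WARP_SIZE = 32

def run_warp(values, threshold):
    """Toy SIMT model of one warp executing `out = v*2 if v > threshold
    else v + 1`. Returns (results, number of branch sides executed)."""
    assert len(values) == WARP_SIZE
    taken = [v > threshold for v in values]  # per-lane predicate
    out = [0] * WARP_SIZE
    sides = 0
    if any(taken):       # "then" side runs if ANY lane takes it
        sides += 1
        out = [v * 2 if t else o for v, t, o in zip(values, taken, out)]
    if not all(taken):   # "else" side runs if ANY lane skips the branch
        sides += 1
        out = [o if t else v + 1 for v, t, o in zip(values, taken, out)]
    return out, sides

_, uniform_sides = run_warp(list(range(100, 132)), 50)  # all lanes agree
_, mixed_sides = run_warp(list(range(32)), 15)          # lanes split
print(uniform_sides, mixed_sides)  # -> 1 2: divergence doubles the work
```

When every lane agrees, only one side executes; when lanes split, the warp pays for both paths. That is why branchless, predicated, or data-sorted formulations show up everywhere in GPU kernels.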
Should You Care?
Audience fit, decision signal, and the original source in one place.

Who It Is For

If you're a developer or ML engineer making infrastructure decisions about where to run compute workloads — this is for you. Especially valuable if you've wondered why your GPU code isn't faster, or when to use cloud TPUs vs GPU instances. Also relevant for system architects designing ML pipelines. Not useful if you only run pre-packaged SaaS tools that abstract hardware away.

Worth Exploring?

Yes — this is foundational knowledge that affects every compute-intensive project. The mental model of latency vs throughput vs specialization will change how you think about infrastructure. The one caveat: this is architecture-level understanding, not a tutorial. You'll need to apply this knowledge to your specific stack (PyTorch, TensorFlow, CUDA, etc.). The insight about GPU programmability vs TPU specialization alone is worth the read.
