Why your ML code runs 100x slower on CPU than GPU — and when TPU wins both
Snaplyze Digest
R&D · Intermediate · 3 min read · Mar 18, 2026 · Updated Apr 1, 2026


“Your ML training takes 10 hours on CPU, 30 minutes on GPU, 5 minutes on TPU — here's why architecture matters more than clock speed.”

In Short

The same matrix multiplication runs at completely different speeds on CPU, GPU, and TPU — not because of clock speed, but because of architecture. CPUs handle complex branching and system calls with low latency. GPUs spread work across thousands of cores that execute the same instruction simultaneously (SIMT). TPUs use systolic arrays designed purely for matrix math, with compiler-controlled dataflow and on-chip memory. Pick the wrong hardware and you waste 90% of your compute budget. Pick the right one and you ship faster for less money.
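You can see a small version of this effect without any accelerator at all. The sketch below (illustrative only, not from the original digest) runs the same matrix multiplication two ways on one CPU: a pure-Python triple loop that executes one scalar multiply-add at a time, and NumPy's `@`, which dispatches to a vectorized BLAS kernel that exploits SIMD units and cache-friendly blocking. Same math, very different hardware mapping:

```python
import numpy as np

def matmul_naive(a, b):
    """Pure-Python triple loop: one scalar multiply-add per step,
    mirroring how a single core walks through the work serially."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out

rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = rng.random((64, 64))

# NumPy's @ hands the same computation to an optimized, vectorized kernel.
fast = a @ b
slow = matmul_naive(a.tolist(), b.tolist())
assert np.allclose(fast, np.array(slow))  # identical results, wildly different speed
```

Time both with `timeit` and the vectorized path is typically orders of magnitude faster, which is the CPU-vs-GPU-vs-TPU story in miniature: the winner is whoever maps the work onto parallel execution units best, not whoever clocks highest.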

hardware · gpu · cpu · tpu · machine-learning
Why It Matters
The practical pain point this digest is really about.

You know that feeling when your ML training job takes forever, so you throw more CPU cores at it and... nothing changes? Or when you provision expensive GPUs but your inference latency still sucks? The problem isn't the hardware — it's matching your workload to the wrong architecture. CPUs excel at branching logic and low-latency decisions. GPUs crush parallel matrix operations. TPUs dominate when your workload fits their systolic array design. Before: you guess which hardware to use and hope. Now: you understand exactly why each architecture wins for specific workloads.

How It Works
The mechanism, architecture, or workflow behind it.

Think of it like organizing a kitchen. A CPU is one master chef who handles any recipe, makes complex decisions, and switches tasks instantly — but can only cook one dish at a time. A GPU is 10,000 line cooks who each do one simple task (chop this vegetable) in perfect sync — amazing for repetitive work, terrible for anything requiring judgment. A TPU is a custom assembly line built for one specific dish — it makes that dish faster than anyone, but can't cook anything else. The key insight: CPUs optimize for latency (get one thing done fast), GPUs optimize for throughput (get many things done eventually), and TPUs optimize for one specific throughput pattern (dense matrix multiplication).
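The TPU "custom assembly line" can be made concrete with a toy simulation of an output-stationary systolic array. This is a sketch of the general technique, not any vendor's actual design: each processing element (PE) owns one output cell, operands from A flow rightward and operands from B flow downward, and every PE fires one multiply-accumulate per cycle in lockstep.

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an n x n output-stationary
    systolic array computing C = A @ B (square matrices only)."""
    n = len(A)
    acc = [[0.0] * n for _ in range(n)]    # per-PE accumulator (C[i][j])
    a_reg = [[0.0] * n for _ in range(n)]  # operand flowing rightward
    b_reg = [[0.0] * n for _ in range(n)]  # operand flowing downward
    for t in range(3 * n - 2):             # pipeline fill + compute + drain
        # shift operands: right along rows, down along columns
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # inject skewed inputs at the array edges (row/col i delayed i cycles)
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0.0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0.0
        # every PE performs one multiply-accumulate, in lockstep
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc

print(systolic_matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
# -> [[19.0, 22.0], [43.0, 50.0]]
```

Notice there is no control flow per element: data just marches through the grid on a fixed schedule the compiler can plan in advance. That rigidity is exactly why the same design that dominates dense matmul cannot "cook anything else."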

Key Takeaways
7 fast bullets that make the core value obvious.
  • CPU architecture — why YOU care: Few powerful cores with complex control logic handle branching, system calls, and interrupts. Your OS, database, and web server run here because they need flexibility. Use CPUs when your workload is branch-heavy, latency-sensitive, or hard to parallelize.
  • GPU architecture — why YOU care: Thousands of simple cores execute the same instruction across massive datasets (SIMT/SIMD). Streaming Multiprocessors (SMs) contain warp schedulers, CUDA cores, register files, and L1 caches. Use GPUs when the same operation runs over large, regular arrays of data.
  • TPU architecture — why YOU care: Systolic arrays flow data through a grid of processing units in a rhythmic pattern, with compiler-controlled dataflow and on-chip weight/activation buffers. Use TPUs when you're doing dense matrix multiplication at scale on stable, well-supported model architectures.
  • GPU memory hierarchy — why YOU care: L1 cache per SM, shared L2 cache, then high-bandwidth global memory with high latency. Understanding this explains why coalesced memory access matters — while some warps wait on memory, the scheduler keeps others executing, hiding latency.
  • Warp execution — why YOU care: GPUs execute 32 threads (a warp) in lockstep. Divergent branching within a warp kills performance because both paths must execute serially. This is why GPU code avoids data-dependent conditionals inside performance-critical kernels.
  • TPU limitations — why YOU care: TPUs struggle with dynamic computation graphs, custom operations, and model architectures that don't fit their systolic array design. They're fixed-function accelerators, not general-purpose programmable processors.
  • GPU versatility — why YOU care: The same GPU that trains your model runs inference, handles computer vision, processes scientific simulations, and renders graphics. When new techniques emerge, GPUs adapt. TPUs often require new compiler support or a new hardware generation to catch up.
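The warp-execution bullet above is easy to see with a toy SIMT model (a sketch of the concept, not real GPU behavior): all 32 lanes of a warp execute every instruction together, so a divergent branch is handled by running BOTH sides and masking off the lanes that didn't take each path.

```python
WARP_SIZE = 32

def run_warp(values, threshold):
    """Toy SIMT model of one warp executing `out = v*2 if v > threshold
    else v + 1`. Returns (results, number of branch sides executed)."""
    assert len(values) == WARP_SIZE
    taken = [v > threshold for v in values]  # per-lane predicate
    out = [0] * WARP_SIZE
    sides = 0
    if any(taken):       # "then" side runs if ANY lane takes it
        sides += 1
        out = [v * 2 if t else o for v, t, o in zip(values, taken, out)]
    if not all(taken):   # "else" side runs if ANY lane skips the branch
        sides += 1
        out = [o if t else v + 1 for v, t, o in zip(values, taken, out)]
    return out, sides

_, uniform_sides = run_warp(list(range(100, 132)), 50)  # all lanes agree
_, mixed_sides = run_warp(list(range(32)), 15)          # lanes split
print(uniform_sides, mixed_sides)  # -> 1 2: divergence doubles the work
```

When every lane agrees, only one side executes; when lanes split, the warp pays for both paths. That is why branchless, predicated, or data-sorted formulations show up everywhere in GPU kernels.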
Should You Care?
Audience fit, decision signal, and the original source in one place.

Who It Is For

If you're a developer or ML engineer making infrastructure decisions about where to run compute workloads — this is for you. Especially valuable if you've wondered why your GPU code isn't faster, or when to use cloud TPUs vs GPU instances. Also relevant for system architects designing ML pipelines. Not useful if you only run pre-packaged SaaS tools that abstract hardware away.

Worth Exploring?

Yes — this is foundational knowledge that affects every compute-intensive project. The mental model of latency vs throughput vs specialization will change how you think about infrastructure. The one caveat: this is architecture-level understanding, not a tutorial. You'll need to apply this knowledge to your specific stack (PyTorch, TensorFlow, CUDA, etc.). The insight about GPU programmability vs TPU specialization alone is worth the read.
