R&D · intermediate · 3 min read · Mar 18, 2026 · Updated Apr 1, 2026

Why your ML code runs 100x slower on CPU than GPU — and when TPU wins both

“Your ML training takes 10 hours on CPU, 30 minutes on GPU, 5 minutes on TPU — here's why architecture matters more than clock speed.”

Source · blog.bytebytego.com

“TPUs often struggle with dynamic computation graphs, custom operations, or model architectures that don't fit their systolic array design. GPUs handle these cases naturally because they're fundamentally programmable processors rather than fixed function accelerators. — r/NVDA_St...”

You know that feeling when your ML training job takes forever, so you throw more CPU cores at it and... nothing changes? Or when you provision expensive GPUs but your inference latency still sucks? The problem isn't the hardware itself; it's running your workload on the wrong architecture. CPUs excel at branching logic and low-latency decisions. GPUs crush parallel matrix operations. TPUs dominate when your workload fits their systolic array design. Before: you guess which hardware to use and hope. After: you understand why each architecture wins for specific workloads.

hardware · gpu · cpu · tpu · machine-learning · infrastructure · system-design

Think of it like organizing a kitchen. A CPU is one master chef who handles any recipe, makes complex decisions, and switches tasks instantly — but can only cook one dish at a time. A GPU is 10,000 line cooks who each do one simple task (chop this vegetable) in perfect sync — amazing for repetitive work, terrible for anything requiring judgment. A TPU is a custom assembly line built for one specific dish — it makes that dish faster than anyone, but can't cook anything else. The key insight: CPUs optimize for latency (get one thing done fast), GPUs optimize for throughput (get many things done eventually), and TPUs optimize for one specific throughput pattern (dense matrix multiplication).
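To put rough numbers on the latency versus throughput distinction, here is a minimal Python sketch of the kitchen analogy. The function name `total_time` and every figure in it (one chef at 1 second per dish, 10,000 line cooks at 20 seconds per dish) are made up for illustration; they are not measurements of any real CPU or GPU.

```python
import math

def total_time(n_items, workers, seconds_per_item):
    """Seconds to finish n_items when `workers` process items in parallel.

    Simplified model: work proceeds in rounds, each worker handles one
    item per round, and every round takes seconds_per_item.
    """
    rounds = math.ceil(n_items / workers)
    return rounds * seconds_per_item

# Hypothetical numbers, chosen only to show the crossover.
chef = {"workers": 1, "seconds_per_item": 1.0}             # low latency, no parallelism
line_cooks = {"workers": 10_000, "seconds_per_item": 20.0}  # high latency, wide parallelism

for n in (1, 100, 1_000_000):
    chef_t = total_time(n, **chef)
    cooks_t = total_time(n, **line_cooks)
    winner = "chef" if chef_t < cooks_t else "line cooks"
    print(f"{n} dishes: chef {chef_t}s, line cooks {cooks_t}s -> {winner}")
```

Under these assumed numbers the single chef wins for one dish (1 s against 20 s), while the line cooks win for a million dishes (2,000 s against 1,000,000 s). The slower-per-item worker pool only pays off once there is enough identical work to fill it.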

01
CPU architecture — why YOU care: Few powerful cores with complex control logic handle branching, system calls, and interrupts. Your OS, database, and web server run here because they need flexibility. Use CPUs when your code has lots of if...
02
GPU architecture — why YOU care: Thousands of simple cores execute the same instruction across massive datasets (SIMT/SIMD). Streaming Multiprocessors (SMs) contain warp schedulers, CUDA cores, register files, and L1 cache. Use GPUs when y...
03
TPU architecture — why YOU care: Systolic arrays flow data through a grid of processing units in a rhythmic pattern, with compiler-controlled dataflow and on-chip weight/activation buffers. Use TPUs when you're doing dense matrix multiplic...
04
GPU memory hierarchy — why YOU care: L1 cache per SM, shared L2 cache, then high-bandwidth global memory with high latency. Understanding this explains why coalesced memory access matters — while some threads wait on memory, thousands of o...
05
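Warp execution — why YOU care: NVIDIA GPUs execute threads in groups of 32 (a warp) in lockstep. When threads in the same warp take different branches, performance drops because the warp runs both paths one after the other, with the threads not on the current path sitting idle. This is why GPU code avoids data-dependent conditionals inside parallel sections.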
06
TPU limitations — why YOU care: TPUs struggle with dynamic computation graphs, custom operations, and model architectures that don't fit their systolic array design. They're domain-specific accelerators built around matrix multiplication, not general-purpose programmable processors. If your...
07
GPU versatility — why YOU care: The same GPU that trains your model runs inference, handles computer vision, processes scientific simulations, and renders graphics. When new techniques emerge, GPUs adapt. TPUs often require waiting for com...
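The warp execution point in item 05 can be sketched as a toy cost model in Python. This is a simplification under stated assumptions, not a description of how any particular GPU schedules work: the function name `warp_cycles` and the cycle counts (100 for the `if` path, 80 for the `else` path) are hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as stated in item 05

def warp_cycles(takes_branch, cost_if, cost_else):
    """Cycles for one warp to run an if/else under a simple lockstep model.

    takes_branch: list of WARP_SIZE booleans, one per thread.
    The whole warp pays for every path that at least one thread takes;
    threads not on the current path are masked off and do no useful work.
    """
    if len(takes_branch) != WARP_SIZE:
        raise ValueError("expected one flag per thread in the warp")
    any_if = any(takes_branch)
    any_else = not all(takes_branch)
    return (cost_if if any_if else 0) + (cost_else if any_else else 0)

# Hypothetical costs: 100 cycles for the if path, 80 for the else path.
uniform = [True] * WARP_SIZE                        # every thread takes the if path
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]  # threads alternate paths

print("uniform warp:  ", warp_cycles(uniform, 100, 80))
print("divergent warp:", warp_cycles(divergent, 100, 80))
```

In this model a warp whose threads all agree pays for one path (100 cycles), while a warp split between the two paths pays for both (180 cycles), even though each individual thread only needed one of them.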
Who it’s for

If you're a developer or ML engineer making infrastructure decisions about where to run compute workloads — this is for you. Especially valuable if you've wondered why your GPU code isn't faster, or when to use cloud TPUs vs GPU instances. Also relevant for system architects designing ML pipelines. Not useful if you only run pre-packaged SaaS tools that abstract hardware away.

Worth exploring

Yes — this is foundational knowledge that affects every compute-intensive project. The mental model of latency vs throughput vs specialization will change how you think about infrastructure. The one caveat: this is architecture-level understanding, not a tutorial. You'll need to apply this knowledge to your specific stack (PyTorch, TensorFlow, CUDA, etc.). The insight about GPU programmability vs TPU specialization alone is worth the read.
