R&D · Advanced · 3 min read · May 5, 2026

Tequila Fixes Ternary QAT: 10B Tokens, Not 4T

“BitNet needed 4 trillion training tokens to reach competitive accuracy. Tequila gets there with 10 billion by fixing one gradient starvation bug that silenced a large fraction of ternary model weights.”

Source · paperswithcode.com

“"We identify the core issue as deadzone trapping: a large number of weights are trapped at the deadzone boundary. This occurs because these weights receive only noisy, uninformative gradients, preventing stable escape from the deadzone and severely impeding model capacity and op...”

You want to run an LLM on a CPU—a phone chip, an edge server, or a laptop—without a GPU. Ternary quantization (weights of -1, 0, or +1) makes this possible, but before Tequila, getting a ternary model to match full-precision accuracy required 100 billion to 4 trillion training tokens, comparable to training a full model from scratch. The specific bottleneck: during training, weights near the quantization boundary receive only noisy, contradictory gradients from the Straight-Through Estimator and never escape, permanently reducing model capacity with no error message or other obvious symptom. You spend the compute budget, and a large fraction of the model's parameters learn nothing useful.

llm · quantization · ternary · edge-ai · research-paper · model-compression · iclr-2026

Ternary quantization maps every weight to -1, 0, or +1. Weights that fall near zero contribute nothing to the forward pass and receive noisy gradients from the Straight-Through Estimator during backpropagation—this is deadzone trapping. Tequila adds a tiny differentiable reactivation parameter λ (default 0.001) that routes a clean, input-independent gradient back to each trapped weight, bypassing the noisy STE path. Those trapped weights also get repurposed: their weighted sum is precomputed offline as a fixed bias C(W) per layer and fused into the kernel, adding less than 0.1% inference overhead. The final forward pass is Y = X·Ŵ·α + C(W), structurally identical to BitNet at inference—multiplication-free, CPU-native—but trained with far more stable gradient flow.
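
A minimal sketch of how the training-time mechanism could look, assuming a PyTorch-style linear layer; the class name, masking details, and exactly where λ enters are guesses based on the digest's description, not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinearWithReactivation(nn.Module):
    """Illustrative ternary QAT layer: Absmean ternarization with a
    Straight-Through Estimator, plus a λ-scaled bias path that gives
    deadzone weights a clean, input-independent gradient."""

    def __init__(self, in_features: int, out_features: int, lam: float = 1e-3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.lam = lam  # reactivation strength λ (digest: default 0.001)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        alpha = w.abs().mean()  # Absmean scale α, as in BitNet b1.58
        w_ternary = torch.clamp((w / (alpha + 1e-8)).round(), -1.0, 1.0)
        # Straight-Through Estimator: forward uses ternary weights,
        # backward routes gradients to the latent full-precision weights.
        w_q = w + (w_ternary - w).detach()
        y = F.linear(x, w_q * alpha)  # X · Ŵ · α
        # Reactivation path: weights ternarized to 0 (the deadzone) also feed
        # a λ-scaled per-output bias, so every dead weight receives the clean
        # gradient λ · ∂L/∂Y on top of the noisy STE signal.
        dead_mask = (w_ternary == 0).detach().float()
        reactivation_bias = self.lam * (w * dead_mask).sum(dim=1)
        return y + reactivation_bias
```

At inference time that same bias collapses into the fixed per-layer constant C(W) described above, so the deployed kernel stays multiplication-free.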

01
Deadzone reactivation via λ — you get a clean, input-agnostic gradient channel routed directly to every trapped weight during training, which is the specific fix that cuts required QAT training from 100B–4T tokens down to 10B tokens per pr...
02
Input-agnostic precomputed bias — the sum of dead-weight contributions gets computed offline once and fused into each layer's kernel, keeping your inference path multiplication-free with less than 0.1% runtime overhead (see the inference sketch after this list)
03
Plug-in module design — you drop Tequila into an existing ternary QAT pipeline using Absmean quantization (the same base as BitNet, Spectra, and BitCPM) without redesigning your training loop
04
3.0× CPU inference speedup — you retain the same hardware benefit as other ternary LLMs because weights stay ternary and use the lookup-table inference path from bitnet.cpp, benchmarked on Intel 8263C
05
AngelSlim v0.3.0 integration — Tencent ships Tequila in their open-source toolkit (released January 13, 2026, 925 GitHub stars) so you start from a maintained codebase rather than a research prototype
06
Mixed-gradient dual role — each dead weight participates in both the ternary multiply and the bias pathway simultaneously, giving a gradient that combines input-specific signal (x_i · ∂L/∂Y) with clean global signal (λ · ∂L/∂Y)
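
To make items 02 and 06 concrete, here is a rough sketch of the offline bias precompute and the resulting inference forward, under the same assumptions as the training sketch above. The helper names and tensor layouts are illustrative, and a real deployment would run the ternary matmul through bitnet.cpp's lookup-table kernels rather than a float matmul:

```python
import torch

def precompute_dead_bias(w: torch.Tensor, lam: float = 1e-3):
    """Hypothetical offline step: fold the λ-weighted sum of deadzone
    weights into the fixed per-output bias C(W)."""
    alpha = w.abs().mean()  # Absmean scale α
    w_ternary = torch.clamp((w / (alpha + 1e-8)).round(), -1.0, 1.0)
    dead_mask = (w_ternary == 0).float()
    c_w = lam * (w * dead_mask).sum(dim=1)  # fixed bias C(W), one value per output
    return w_ternary.to(torch.int8), alpha, c_w

def ternary_forward(x: torch.Tensor, w_ternary: torch.Tensor,
                    alpha: torch.Tensor, c_w: torch.Tensor) -> torch.Tensor:
    # Y = X · Ŵ · α + C(W): a {-1, 0, +1} matmul reduces to adds, subtracts,
    # and skips, and C(W) is a constant add, so a proper ternary kernel stays
    # multiplication-free (the float matmul here is only for readability).
    return x @ w_ternary.float().t() * alpha + c_w
```

During training the same two paths produce the mixed gradient from item 06 for each dead weight (an input-dependent STE term plus the clean λ · ∂L/∂Y term); at inference C(W) is just a constant add.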
Who it’s for

If you are an ML engineer building LLMs for CPU-based or edge deployment—phones, embedded devices, serverless inference—and you are already familiar with quantization-aware training, this paper's fix directly addresses the training inefficiency that makes ternary models expensive to produce. If you are hitting long training requirements with BitNet b1.58 or Spectra, Tequila's module is a direct replacement for the QAT optimizer step. It is not yet useful if you need a model larger than 3B parameters: all validated results are on LLaMA-3.2-1B and 3B only, and scaling behavior above that is untested.

Worth exploring

Worth studying now if you work on model compression for edge hardware—the gradient analysis is rigorous, ICLR 2026 acceptance is a meaningful quality bar, and the code is in a maintained toolkit (AngelSlim v0.3.0). Hold off on production use: validation only covers 1B and 3B LLaMA models, llama.cpp integration is in early discussion (two-person thread as of January 2026), and the input-agnostic bias approximation's accuracy cost is unquantified by the authors.
