R&D · Advanced · 3 min read · May 5, 2026

Tequila Fixes Ternary QAT: 10B Tokens, Not 4T

“BitNet needed 4 trillion training tokens to reach competitive accuracy. Tequila gets there with 10 billion by fixing one gradient starvation bug that silenced a large fraction of ternary model weights.”

Source · paperswithcode.com

“"We identify the core issue as deadzone trapping: a large number of weights are trapped at the deadzone boundary. This occurs because these weights receive only noisy, uninformative gradients, preventing stable escape from the deadzone and severely impeding model capacity and op...”

You want to run an LLM on a CPU—a phone chip, an edge server, or a laptop—without a GPU. Ternary quantization (weights of -1, 0, or +1) makes this possible, but before Tequila, getting a ternary model to match full-precision accuracy required 100 billion to 4 trillion training tokens, comparable to training a full model from scratch. The specific bottleneck: during training, weights near the quantization boundary receive only noisy, contradictory gradients from the Straight-Through Estimator and never escape, permanently reducing model capacity with no error message or other obvious symptom. You spend the compute budget, and a large fraction of the model's parameters learn nothing useful.

llm · quantization · ternary · edge-ai · research-paper · model-compression · iclr-2026

Ternary quantization maps every weight to -1, 0, or +1. Weights that fall near zero contribute nothing to the forward pass and receive noisy gradients from the Straight-Through Estimator during backpropagation—this is deadzone trapping. Tequila adds a tiny differentiable reactivation parameter λ (default 0.001) that routes a clean, input-independent gradient back to each trapped weight, bypassing the noisy STE path. Those trapped weights also get repurposed: their weighted sum is precomputed offline as a fixed bias C(W) per layer and fused into the kernel, adding less than 0.1% inference overhead. The final forward pass is Y = X·Ŵ·α + C(W), structurally identical to BitNet at inference—multiplication-free, CPU-native—but trained with far more stable gradient flow.
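
A minimal sketch of how the training-time mechanism could look, assuming a PyTorch-style linear layer; the class name, masking details, and exactly where λ enters are guesses based on the digest's description, not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinearWithReactivation(nn.Module):
    """Illustrative ternary QAT layer: Absmean ternarization with a
    Straight-Through Estimator, plus a λ-scaled bias path that gives
    deadzone weights a clean, input-independent gradient."""

    def __init__(self, in_features: int, out_features: int, lam: float = 1e-3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.lam = lam  # reactivation strength λ (digest: default 0.001)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        alpha = w.abs().mean()  # Absmean scale α, as in BitNet b1.58
        w_ternary = torch.clamp((w / (alpha + 1e-8)).round(), -1.0, 1.0)
        # Straight-Through Estimator: forward uses ternary weights,
        # backward routes gradients to the latent full-precision weights.
        w_q = w + (w_ternary - w).detach()
        y = F.linear(x, w_q * alpha)  # X · Ŵ · α
        # Reactivation path: weights ternarized to 0 (the deadzone) also feed
        # a λ-scaled per-output bias, so every dead weight receives the clean
        # gradient λ · ∂L/∂Y on top of the noisy STE signal.
        dead_mask = (w_ternary == 0).detach().float()
        reactivation_bias = self.lam * (w * dead_mask).sum(dim=1)
        return y + reactivation_bias
```

At inference time that same bias collapses into the fixed per-layer constant C(W) described above, so the deployed kernel stays multiplication-free.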

01
Deadzone reactivation via λ — you get a clean, input-agnostic gradient channel routed directly to every trapped weight during training, which is the specific fix that cuts required QAT training from 100B–4T tokens down to 10B tokens per pr...
02
Input-agnostic precomputed bias — the sum of dead-weight contributions gets computed offline once and fused into each layer's kernel, keeping your inference path multiplication-free with less than 0.1% runtime overhead (see the inference sketch after this list)
03
Plug-in module design — you drop Tequila into an existing ternary QAT pipeline using Absmean quantization (the same base as BitNet, Spectra, and BitCPM) without redesigning your training loop
04
3.0× CPU inference speedup — you retain the same hardware benefit as other ternary LLMs because weights stay ternary and use the lookup-table inference path from bitnet.cpp, benchmarked on Intel 8263C
05
AngelSlim v0.3.0 integration — Tencent ships Tequila in their open-source toolkit (released January 13, 2026, 925 GitHub stars) so you start from a maintained codebase rather than a research prototype
06
Mixed-gradient dual role — each dead weight participates in both the ternary multiply and the bias pathway simultaneously, giving a gradient that combines input-specific signal (x_i · ∂L/∂Y) with clean global signal (λ · ∂L/∂Y)
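
To make items 02 and 06 concrete, here is a rough sketch of the offline bias precompute and the resulting inference forward, under the same assumptions as the training sketch above. The helper names and tensor layouts are illustrative, and a real deployment would run the ternary matmul through bitnet.cpp's lookup-table kernels rather than a float matmul:

```python
import torch

def precompute_dead_bias(w: torch.Tensor, lam: float = 1e-3):
    """Hypothetical offline step: fold the λ-weighted sum of deadzone
    weights into the fixed per-output bias C(W)."""
    alpha = w.abs().mean()  # Absmean scale α
    w_ternary = torch.clamp((w / (alpha + 1e-8)).round(), -1.0, 1.0)
    dead_mask = (w_ternary == 0).float()
    c_w = lam * (w * dead_mask).sum(dim=1)  # fixed bias C(W), one value per output
    return w_ternary.to(torch.int8), alpha, c_w

def ternary_forward(x: torch.Tensor, w_ternary: torch.Tensor,
                    alpha: torch.Tensor, c_w: torch.Tensor) -> torch.Tensor:
    # Y = X · Ŵ · α + C(W): a {-1, 0, +1} matmul reduces to adds, subtracts,
    # and skips, and C(W) is a constant add, so a proper ternary kernel stays
    # multiplication-free (the float matmul here is only for readability).
    return x @ w_ternary.float().t() * alpha + c_w
```

During training the same two paths produce the mixed gradient from item 06 for each dead weight (an input-dependent STE term plus the clean λ · ∂L/∂Y term); at inference C(W) is just a constant add.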
Who it’s for

If you are an ML engineer building LLMs for CPU-based or edge deployment—phones, embedded devices, serverless inference—and you are already familiar with quantization-aware training, this paper's fix directly addresses the training inefficiency that makes ternary models expensive to produce. If you are hitting long training requirements with BitNet b1.58 or Spectra, Tequila's module is a direct replacement for the QAT optimizer step. It is not yet useful if you need a model larger than 3B parameters: all validated results are on LLaMA-3.2-1B and 3B only, and scaling behavior above that is untested.

Worth exploring

Worth studying now if you work on model compression for edge hardware—the gradient analysis is rigorous, ICLR 2026 acceptance is a meaningful quality bar, and the code is in a maintained toolkit (AngelSlim v0.3.0). Hold off on production use: validation only covers 1B and 3B LLaMA models, llama.cpp integration is in early discussion (two-person thread as of January 2026), and the input-agnostic bias approximation's accuracy cost is unquantified by the authors.
