You want to run an LLM on a CPU—a phone chip, an edge server, or a laptop—without a GPU. Ternary quantization (weights of -1, 0, or +1) makes this possible, but getting a ternary model to match full-precision accuracy required training on 100 billion to 4 trillion tokens before Tequila, comparable to training a full model from scratch. The specific bottleneck: during training, weights near the quantization boundary receive only noisy, contradictory gradients from the Straight-Through Estimator and never escape, permanently reducing model capacity with no error message or obvious sign. You spend the compute budget, and a large fraction of model parameters learn nothing useful.
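The deadzone mechanism above can be made concrete with a small sketch. This is a hedged illustration of BitNet b1.58-style absmean ternarization, not the paper's actual code: weights are scaled by their mean absolute value and rounded to {-1, 0, +1}, and any weight whose rounded value is 0 sits in the deadzone, contributing nothing to the forward pass.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Absmean ternarization sketch (assumption: BitNet b1.58-style
    scaling): divide by the mean absolute value, round, clip to
    {-1, 0, +1}. Returns the ternary weights and the scale alpha."""
    alpha = np.abs(w).mean() + eps          # per-tensor scale
    w_hat = np.clip(np.round(w / alpha), -1, 1)
    return w_hat, alpha

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=1000)          # toy weight tensor
w_hat, alpha = ternary_quantize(w)
deadzone = (w_hat == 0)                     # weights trapped at zero
print(f"{deadzone.mean():.0%} of weights quantize to 0")
```

Under a Gaussian weight distribution, a substantial fraction of weights lands inside the deadzone, which is exactly the population the paper argues receives only noisy STE gradients during training.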
Ternary quantization maps every weight to -1, 0, or +1. Weights that fall near zero contribute nothing to the forward pass and receive noisy gradients from the Straight-Through Estimator during backpropagation—this is deadzone trapping. Tequila adds a tiny differentiable reactivation parameter λ (default 0.001) that routes a clean, input-independent gradient back to each trapped weight, bypassing the noisy STE path. Those trapped weights also get repurposed: their weighted sum is precomputed offline as a fixed bias C(W) per layer and fused into the kernel, adding less than 0.1% inference overhead. The final forward pass is Y = X·Ŵ·α + C(W), structurally identical to BitNet at inference—multiplication-free, CPU-native—but trained with far more stable gradient flow.
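The forward pass Y = X·Ŵ·α + C(W) can be sketched as follows. This is a minimal illustration under one plausible reading of the summary, not the authors' implementation: the exact definition of C(W) and where λ enters may differ in the paper. Here, deadzone weights scaled by the reactivation parameter λ are summed per output unit into a fixed, input-independent bias vector that could be precomputed offline.

```python
import numpy as np

def tequila_forward(X, W, lam=1e-3):
    """Hedged sketch of Y = X @ W_hat * alpha + C(W).
    Assumption: C(W) is the per-output-unit sum of deadzone
    (trapped) weights, scaled by the reactivation parameter lam."""
    alpha = np.abs(W).mean()                        # per-tensor scale
    W_hat = np.clip(np.round(W / alpha), -1, 1)     # ternary weights
    trapped = (W_hat == 0)                          # deadzone mask
    # Input-independent bias: depends only on W, so it can be
    # precomputed offline and fused into the inference kernel.
    C = lam * np.where(trapped, W, 0.0).sum(axis=0)
    return X @ W_hat * alpha + C

X = np.ones((2, 8))
W = np.random.default_rng(1).normal(0, 0.02, size=(8, 4))
Y = tequila_forward(X, W)
print(Y.shape)  # (2, 4)
```

Because C(W) never touches X, it adds only a vector addition at inference, which is consistent with the summary's claim of under 0.1% overhead while the matmul itself stays multiplication-free.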
If you are an ML engineer building LLMs for CPU-based or edge deployment—phones, embedded devices, serverless inference—and you are already familiar with quantization-aware training, this paper's fix directly addresses the training inefficiency that makes ternary models expensive to produce. If you are hitting long training requirements with BitNet b1.58 or Spectra, Tequila's module is a direct replacement for the QAT optimizer step. Not yet useful if you need a model larger than 3B parameters: all validated results are on LLaMA-3.2-1B and 3B only, and scaling behavior above that is untested.
Worth studying now if you work on model compression for edge hardware: the gradient analysis is rigorous, ICLR 2026 acceptance is a meaningful quality bar, and the code ships in a maintained toolkit (AngelSlim v0.3.0). Hold off on production use: validation only covers 1B and 3B LLaMA models, llama.cpp integration is in early discussion (a two-person thread as of January 2026), and the authors do not quantify the accuracy cost of the input-agnostic bias approximation.