Kronos: Finance Foundation Model with a Data Leakage Bug Open
Snaplyze Digest
Tech Products advanced 2 min read Apr 13, 2026 Updated Apr 15, 2026


“16.5k stars, AAAI 2026 paper — but an open issue alleges its benchmark numbers come from a data leakage bug.”

In Short

Kronos pre-trains a Transformer on 12 billion K-line (candlestick) records from 45 global exchanges, making it the first open-source foundation model built specifically for financial candlestick data. It quantizes OHLCV price data into discrete tokens and predicts future candles autoregressively; the paper reports a 93% RankIC improvement over the top general-purpose time series foundation model (TSFM). However, an open GitHub issue (#227) alleges that the finetuning pipeline leaks future information through per-window normalization, and r/quant users report predictions that diverge wildly from the input range.

Tags: ai, finance, time-series, foundation-model, python
Why It Matters
The practical pain point this digest is really about.

You know that feeling when you try to apply a general-purpose time series model to financial data and it misses the noise patterns, the regime changes, and the cross-asset dynamics that make markets unique? Existing TSFMs like TimesFM and Chronos treat all time series the same — weather, server metrics, stock prices. Financial candlestick data has unique characteristics (OHLCV structure, high noise, non-stationarity) that general models handle poorly. Kronos targets this gap with a finance-specific tokenizer and pre-training on 12B+ K-line records.

How It Works
The mechanism, architecture, or workflow behind it.

Think of it like a language model that reads candlestick bars instead of words. Step 1: A hierarchical tokenizer converts each OHLCV bar (Open, High, Low, Close, Volume) into discrete tokens that preserve price dynamics and trading activity. Step 2: A decoder-only Transformer (open-sourced, 4.1M to 102.3M parameters) is pre-trained on the 12B+ tokenized K-line records with next-token prediction, the same objective GPT uses. At inference, you supply historical OHLCV data and a range of future timestamps; the model then generates forecast candles autoregressively, one step at a time, using temperature and nucleus (top-p) sampling so that repeated runs yield a distribution of outcomes rather than a single path.
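To make the sampling step concrete, here is a minimal sketch of temperature scaling plus nucleus (top-p) sampling inside a generic autoregressive loop. It is not Kronos's code: `sample_next_token`, `generate`, and the `next_logits` callback are hypothetical names, and the token ids stand in for whatever the real tokenizer emits.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample one token id using temperature scaling and nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    probs /= probs.sum()

    # Nucleus filtering: keep the smallest set of most-likely tokens
    # whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    # Renormalize over the kept tokens and draw one of them.
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

def generate(next_logits, context, steps, **sampling_kwargs):
    """Autoregressive loop: each sampled token is appended and fed back in.

    `next_logits` is a hypothetical callback standing in for the model:
    it takes the token list so far and returns logits for the next token.
    """
    tokens = list(context)
    for _ in range(steps):
        tokens.append(sample_next_token(next_logits(tokens), **sampling_kwargs))
    return tokens[len(context):]
```

Lower `top_p` or `temperature` values make the generated path more deterministic; higher values widen the spread of sampled futures.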

Key Takeaways
6 fast bullets that make the core value obvious.
  • Finance-specific tokenizer — converts continuous OHLCV into hierarchical discrete tokens that preserve price dynamics, unlike general TSFMs that flatten all data types into one representation
  • Zero-shot forecasting — generate predictions on any market without retraining, using probabilistic sampling (temperature + nucleus) for confidence intervals
  • Multi-task support — handles price forecasting, volatility prediction (9% lower MAE per paper), and synthetic K-line generation (22% better fidelity) from a single model
  • Qlib finetuning pipeline — adapt the model to your specific market or strategy using Microsoft Qlib for data prep, with multi-GPU training via torchrun
  • Multiple model sizes — choose between 4.1M (mini, 2048 context), 24.7M (small), and 102.3M (base) params depending on your compute budget and latency needs
  • Batch prediction — forecast multiple assets simultaneously via predict_batch with GPU parallelism
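The "probabilistic sampling for confidence intervals" point above can be illustrated with a short sketch: given many sampled forecast paths for one asset, take per-step percentiles to get an empirical band. The helper name `forecast_bands` and the random-walk stand-in data are assumptions made for illustration; how Kronos itself returns multiple samples is not specified in this digest.

```python
import numpy as np

def forecast_bands(sampled_paths, lower=5, upper=95):
    """Summarize sampled forecast paths into a median and an empirical band.

    sampled_paths: array-like of shape (n_samples, horizon).
    """
    paths = np.asarray(sampled_paths, dtype=float)
    return {
        "median": np.percentile(paths, 50, axis=0),
        "lower": np.percentile(paths, lower, axis=0),
        "upper": np.percentile(paths, upper, axis=0),
    }

# Stand-in for model output (an assumption, not real Kronos forecasts):
# 200 random-walk close-price paths over a 10-step horizon.
rng = np.random.default_rng(0)
fake_paths = 100 + np.cumsum(rng.normal(0, 1, size=(200, 10)), axis=1)

bands = forecast_bands(fake_paths)
print(bands["lower"].round(2))
print(bands["upper"].round(2))
```

These bands are empirical quantiles of the samples, so they are only as trustworthy as the model's sampled distribution; they are not calibrated confidence intervals on their own.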
Should You Care?
Audience fit, decision signal, and the original source in one place.

Who It Is For

If you're a quant researcher or ML engineer building price forecasting, volatility modeling, or synthetic data pipelines for financial markets, this is directly relevant. Also useful if you study tokenizer design for non-language domains. Not useful if you need cross-asset portfolio signals in a single forward pass, or if you want a production-ready trading system — the authors themselves call it...

Worth Exploring?

Worth exploring as a research artifact and educational reference for finance-specific tokenizer design. The AAAI 2026 acceptance gives it academic credibility. However, tread carefully: the data leakage allegation in issue #227 is unresolved, users report broken predictions in issue #229, the repo has no formal releases, no maintainer activity since January 2026, and 152 open issues. Treat it as experimental — study the tokenizer architecture and paper, but do not rely on its benchmark claims or use it for real trading until the leakage issue is resolved.
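For readers unfamiliar with how per-window normalization can leak the future, the following is a generic illustration, not a reproduction of the code discussed in issue #227. It assumes a window that holds both the lookback segment and the prediction target, and compares statistics computed over the whole window with statistics computed over the lookback only. The function names and toy prices are hypothetical.

```python
import numpy as np

def normalize_leaky(window, lookback):
    """Z-score with statistics from the WHOLE window, target segment included."""
    mean, std = window.mean(), window.std() + 1e-8
    z = (window - mean) / std
    return z[:lookback], z[lookback:]

def normalize_causal(window, lookback):
    """Z-score with statistics from the lookback segment only."""
    hist = window[:lookback]
    mean, std = hist.mean(), hist.std() + 1e-8
    z = (window - mean) / std
    return z[:lookback], z[lookback:]

# Same history, two different futures (toy numbers).
history = np.array([100.0, 101.0, 99.0, 100.0, 102.0])
future_up = np.array([110.0, 115.0])
future_down = np.array([90.0, 85.0])
lookback = len(history)

# Leaky: the model INPUT changes depending on what happens next.
x_up, _ = normalize_leaky(np.concatenate([history, future_up]), lookback)
x_down, _ = normalize_leaky(np.concatenate([history, future_down]), lookback)
leaky_inputs_differ = not np.allclose(x_up, x_down)

# Causal: identical history gives identical input, whatever the future holds.
c_up, _ = normalize_causal(np.concatenate([history, future_up]), lookback)
c_down, _ = normalize_causal(np.concatenate([history, future_down]), lookback)
causal_inputs_match = bool(np.allclose(c_up, c_down))

print(leaky_inputs_differ, causal_inputs_match)
```

If the normalized input shifts with the future, a model can pick up the direction of the target from the input scale alone, which inflates benchmark numbers without any real predictive skill.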
