R&D advanced 3 min read Jun 26, 2026
Public Preview Sign in free for the full digest →

Baidu's OCR Model: Trending on Paperwithcode

“Baidu's new OCR model extracts text with 86.81 accuracy — but scores 0.97 on formatting, exposing the exact architectural trade-off hiding inside the 'unlimited' claim.”

Baidu's OCR Model: Trending on Paperwithcode
Source · paperswithcode.co

You know that feeling when you feed a 40-page PDF to an LLM-based OCR tool and watch it crawl to a halt because the model's memory grows with every line it outputs? Standard transformer decoders store a key-value cache that expands linearly with the output sequence length — fine for a paragraph, disastrous for a 32-page insurance form. You either truncate the document, split it into chunks and stitch results manually, or accept degraded quality at the boundaries. None of those options work cleanly in a production document pipeline.

ocrdocument-parsingvision-language-modelresearch-paperopen-sourcebaiduattention-mechanism

R-SWA replaces the standard attention layer in the decoder with one that attends to two things only: the visual embeddings from the input image (the 'reference' tokens) and a fixed-size window of the 128 most recent output tokens (1024 for multi-page mode). Think of it like a typist who always looks at the original document and only glances back at the last paragraph they typed — rather than re-reading everything from page one on each keystroke. This caps the KV cache at a constant size regardless of how long the output grows, expressed as `L_m + min(n, T)` where L_m is reference token length and n is the window size. You pass a document image (or a PDF converted to images via PyMuPDF) to the model, and it generates text up to 32,768 tokens in one pass across two inference configs: 'gundam' (640px crop mode, lower memory) and 'base' (1024px, higher quality, multi-page only).

01
Constant KV cache via R-SWA — parsing a 30-page document uses the same memory as parsing 1 page, so you stop hitting OOM errors on long enterprise documents
02
Single 32K-token forward pass — you send one request and get the full document back, rather than chunking and stitching across multiple API calls
03
Two inference configs — 'gundam' (640px + crop) for lower VRAM usage, 'base' (1024px) for higher quality; pick based on your GPU budget
04
MIT license on code and weights — you can deploy this commercially without a licensing conversation
05
OpenAI-compatible SGLang API endpoint — you can point existing tooling at the SGLang server with minimal integration changes
06
PDF pipeline included — PyMuPDF converts PDFs to images before inference, so you pass a .pdf and get text out without preprocessing code
Who it’s for

If you're building a document intelligence pipeline — parsing insurance forms, financial filings, or government PDFs at scale — and the memory wall of standard LLM decoders is your current bottleneck, this is directly aimed at your problem. You need a CUDA 12.9 machine (no Apple Silicon support as of June 2026) and tolerance for a custom SGLang wheel install. This is not ready for you if you need reliable table and structured formatting output — the ParseBench formatting score of 0.97 makes that a hard no for structured document use cases.

Worth exploring

Worth exploring if your specific pain is memory blowup on long documents and you only need raw text extraction — the 86.81 text content score is legitimate. Skip it if you need tables, formatting, or structured output: a 0.97 formatting score means the model fails at that task. The repo has 5 commits and zero tagged releases as of June 26, 2026, with 20 open issues catalogued in the first 72 hours including broken Apple Silicon support — treat this as an early research preview, not a production dependency.

Developer playbook
Tech stack, code snippet, sentiment, alternatives.
PM playbook
Adoption angles, user fit, positioning.
CEO playbook
Traction signals, ROI, build vs buy.
Deep-dive insight
Full long-form analysis, no fluff.
Easy mode
Core idea, fast — when you need the gist.
Pro mode
Technical nuance, edge cases, tradeoffs.
Read the full digest
Go beyond the preview

Deep-dive insight, Easy and Pro modes, plus action playbooks — the full breakdown is one tap away.

Underrated tools. Unfiltered takes.

Read the full digest in the Snaplyze app for deep-dive insight, Easy and Pro modes, and the playbooks you can actually use.

Install Snaplyze →