Baidu's OCR Model: Trending on Paperwithcode

What problem does it solve

You know that feeling when you feed a 40-page PDF to an LLM-based OCR tool and watch it crawl to a halt because the model's memory grows with every line it outputs? Standard transformer decoders store a key-value cache that expands linearly with the output sequence length — fine for a paragraph, disastrous for a 32-page insurance form. You either truncate the document, split it into chunks and stitch results manually, or accept degraded quality at the boundaries. None of those options work cleanly in a production document pipeline.

ocrdocument-parsingvision-language-modelresearch-paperopen-sourcebaiduattention-mechanism

How it works

R-SWA replaces the standard attention layer in the decoder with one that attends to two things only: the visual embeddings from the input image (the 'reference' tokens) and a fixed-size window of the 128 most recent output tokens (1024 for multi-page mode). Think of it like a typist who always looks at the original document and only glances back at the last paragraph they typed — rather than re-reading everything from page one on each keystroke. This caps the KV cache at a constant size regardless of how long the output grows, expressed as `L_m + min(n, T)` where L_m is reference token length and n is the window size. You pass a document image (or a PDF converted to images via PyMuPDF) to the model, and it generates text up to 32,768 tokens in one pass across two inference configs: 'gundam' (640px crop mode, lower memory) and 'base' (1024px, higher quality, multi-page only).

Key takeaways

✦

01

Constant KV cache via R-SWA — parsing a 30-page document uses the same memory as parsing 1 page, so you stop hitting OOM errors on long enterprise documents

⟁

02

Single 32K-token forward pass — you send one request and get the full document back, rather than chunking and stitching across multiple API calls

⊕

03

Two inference configs — 'gundam' (640px + crop) for lower VRAM usage, 'base' (1024px) for higher quality; pick based on your GPU budget

◈

04

MIT license on code and weights — you can deploy this commercially without a licensing conversation

∞

05

OpenAI-compatible SGLang API endpoint — you can point existing tooling at the SGLang server with minimal integration changes

◎

06

PDF pipeline included — PyMuPDF converts PDFs to images before inference, so you pass a .pdf and get text out without preprocessing code

Should you care?

Who it’s for

If you're building a document intelligence pipeline — parsing insurance forms, financial filings, or government PDFs at scale — and the memory wall of standard LLM decoders is your current bottleneck, this is directly aimed at your problem. You need a CUDA 12.9 machine (no Apple Silicon support as of June 2026) and tolerance for a custom SGLang wheel install. This is not ready for you if you need reliable table and structured formatting output — the ParseBench formatting score of 0.97 makes that a hard no for structured document use cases.

Worth exploring

Worth exploring if your specific pain is memory blowup on long documents and you only need raw text extraction — the 86.81 text content score is legitimate. Skip it if you need tables, formatting, or structured output: a 0.97 formatting score means the model fails at that task. The repo has 5 commits and zero tagged releases as of June 26, 2026, with 20 open issues catalogued in the first 72 hours including broken Apple Silicon support — treat this as an early research preview, not a production dependency.

6 more sections · unlock free

Developer playbook

Tech stack, code snippet, sentiment, alternatives.

PM playbook

Adoption angles, user fit, positioning.

CEO playbook

Traction signals, ROI, build vs buy.

Deep-dive insight

Full long-form analysis, no fluff.

Easy mode

Core idea, fast — when you need the gist.

Pro mode

Technical nuance, edge cases, tradeoffs.

Sign in free — unlock all 6

Baidu's OCR Model: Trending on Paperwithcode

Underrated tools. Unfiltered takes.