You know that feeling when you feed a 40-page PDF to an LLM-based OCR tool and watch it slow to a crawl because the model's memory grows with every token it outputs? Standard transformer decoders keep a key-value (KV) cache that grows linearly with the output sequence length: fine for a paragraph, disastrous for a 32-page insurance form. You either truncate the document, split it into chunks and stitch the results together by hand, or accept degraded quality at the boundaries. None of those options works cleanly in a production document pipeline.
R-SWA replaces the standard attention layer in the decoder with one that attends to only two things: the visual embeddings of the input image (the 'reference' tokens) and a fixed-size window of the 128 most recent output tokens (1,024 in multi-page mode). Think of it like a typist who always looks at the original document and only glances back at the last paragraph they typed, rather than re-reading everything from page one on each keystroke. This caps the KV cache at a constant size regardless of how long the output grows: `L_m + min(n, T)` entries, where `L_m` is the number of reference tokens, `n` is the window size, and `T` is the number of tokens generated so far. You pass a document image (or a PDF converted to images via PyMuPDF) to the model, and it generates up to 32,768 tokens of text in a single pass. There are two inference configs: 'gundam' (640px crop mode, lower memory) and 'base' (1024px, higher quality, multi-page only).
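To make the cache arithmetic concrete, here is a minimal Python sketch of the `L_m + min(n, T)` bound next to a toy sliding-window buffer. It is an illustration only, not the project's implementation: the function and class names are made up for this example, and the reference length of 256 tokens is a hypothetical figure, not one reported for the model.

```python
from collections import deque


def kv_cache_entries(ref_len: int, window: int, generated: int) -> int:
    """Cache entries under the bound L_m + min(n, T).

    ref_len   -- L_m, number of reference (visual) tokens
    window    -- n, size of the sliding window over output tokens
    generated -- T, number of output tokens produced so far
    """
    return ref_len + min(window, generated)


def full_cache_entries(ref_len: int, generated: int) -> int:
    """Cache entries for a standard decoder that keeps every output token."""
    return ref_len + generated


class WindowedCache:
    """Toy stand-in for the cache: reference tokens are always kept,
    output tokens are kept only while they sit inside the window."""

    def __init__(self, ref_tokens, window: int):
        self.ref = list(ref_tokens)
        # deque(maxlen=...) silently drops the oldest item once it is full
        self.recent = deque(maxlen=window)

    def append(self, token) -> None:
        self.recent.append(token)

    def visible(self) -> list:
        """Everything the next decoding step would attend to."""
        return self.ref + list(self.recent)


REF_LEN = 256  # hypothetical L_m, chosen only for illustration
WINDOW = 128   # single-page window size quoted in the article

for generated in (0, 50, 128, 32_768):
    print(
        generated,
        kv_cache_entries(REF_LEN, WINDOW, generated),
        full_cache_entries(REF_LEN, generated),
    )
```

With these illustrative numbers, the windowed cache stops growing at 384 entries once 128 tokens have been generated, while the full cache reaches 33,024 entries at the 32,768-token limit.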
If you're building a document intelligence pipeline (parsing insurance forms, financial filings, or government PDFs at scale) and the memory wall of standard LLM decoders is your current bottleneck, this is aimed directly at your problem. You need a CUDA 12.9 machine (there is no Apple Silicon support as of June 2026) and some tolerance for a custom SGLang wheel install. It is not ready for you if you need reliable tables and structured formatting: the ParseBench formatting score of 0.97 makes that a hard no for structured document use cases.
Worth exploring if your specific pain is memory blowup on long documents and you only need raw text extraction: the 86.81 text content score is legitimate. Skip it if you need tables, formatting, or structured output, because a formatting score of 0.97 means the model fails at that task. The repo has 5 commits and zero tagged releases as of June 26, 2026, with 20 open issues catalogued in the first 72 hours, including broken Apple Silicon support. Treat this as an early research preview, not a production dependency.