R&D · Advanced · 3 min read · May 11, 2026

9B model beats Qwen3-Omni-30B on 6 of 7 omni tasks

“A 9B open-source model beats Alibaba's 30B on 6 of 7 real-time omni-modal tasks — and runs at 11GB VRAM on a single RTX 4090.”

Source · huggingface.co

“MiniCPM-o 4.5 is an early exploration of real-time full-duplex omni-modal interaction; long, dynamic real-world streaming robustness needs improvement.” — Paper Section 8 (Limitations), arXiv:2604.27393

You know that feeling when you ask an AI voice assistant something and it has to stop everything — stop watching, stop listening — just to reply? Current multimodal models alternate between perceiving and responding: the model listens, then pauses to generate an answer, then resumes listening. That cycle prevents interruption, real-time commentary on live video, and any response to things that change while the model is speaking. It's the architectural reason why voice AI still feels like a walkie-talkie rather than a phone call.

multimodal · llm · open-source · speech · full-duplex · edge-ai · research-paper

Omni-Flow divides time into 1-second chunks. In each chunk, the model receives a group of visual tokens (camera input), audio tokens (microphone input), and output tokens (what it's currently saying), all serialized as one sequence for standard next-token prediction. A Listen-Speak control token at the start of each group tells the model whether to stay quiet or generate speech. Ablations in the paper confirm that 1.0-second windows significantly outperform 0.2-second and 0.1-second windows, and that explicit boundary tokens beat implicit ones. A second technique, TAIL (Time-Aligned Interleaving), regulates how many text tokens are generated per window based on accumulated speech playback progress, preventing the lag where the model has internally moved on while the user is still hearing older audio.
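To make the mechanics concrete, here is a minimal sketch of the per-window serialization and TAIL-style pacing described above. The token spellings, window contents, and the exact pacing rule are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of Omni-Flow-style window serialization and TAIL-style text pacing.
# Control-token names and the pacing rule are assumptions for illustration.
from dataclasses import dataclass

LISTEN, SPEAK = "<listen>", "<speak>"  # Listen-Speak control tokens (assumed spellings)

@dataclass
class Window:
    visual: list[str]   # visual tokens for this 1-second window (camera)
    audio: list[str]    # audio tokens for this window (microphone)
    speak: bool         # whether the model generates speech in this window

def serialize(window: Window, output_tokens: list[str]) -> list[str]:
    """Flatten one 1-second window into a single sequence for standard
    next-token prediction: control token, then visual, audio, and output."""
    control = SPEAK if window.speak else LISTEN
    return [control, *window.visual, *window.audio, *output_tokens]

def tail_text_budget(text_emitted: int, speech_played_s: float,
                     text_tokens_per_s: float = 4.0) -> int:
    """TAIL-style pacing (assumed rule): cap new text tokens so generation
    never runs far ahead of the speech the user has actually heard."""
    allowed_so_far = int((speech_played_s + 1.0) * text_tokens_per_s)
    return max(0, allowed_so_far - text_emitted)

if __name__ == "__main__":
    w = Window(visual=["v1", "v2"], audio=["a1", "a2", "a3"], speak=True)
    print(serialize(w, output_tokens=["t1"]))
    # -> ['<speak>', 'v1', 'v2', 'a1', 'a2', 'a3', 't1']
    print(tail_text_budget(text_emitted=6, speech_played_s=1.0))  # -> 2
```

Because input and output tokens share one sequence, a barge-in simply shows up as new audio tokens in the next window, and the control token can flip from speak to listen without any mode switch.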

01
Full-duplex interaction via Omni-Flow — you can interrupt the model mid-sentence and it adjusts in the next 1-second window because input and output streams run in parallel, not in alternating phases
02
11GB VRAM at INT4 on a single RTX 4090 — 212.3 tokens/s with 0.58s first-token latency via vLLM, making the full omni-modal stack fit on a single consumer GPU (note: standard PyTorch BF16 requires ≥28GB VRAM); see the serving sketch after this list
03
Beats Qwen3-Omni-30B-A3B on 6 of 7 omni-modal benchmarks — Daily-Omni 80.2 vs 70.7, Video-Holmes 64.3 vs 50.4, JointAVBench 60.0 vs 53.1, AVUT-Human 78.6 vs 74.2, at 3× fewer active parameters
04
Document parsing that outscores Gemini 2.5 Flash — OmniDocBench EN score 0.109 vs 0.214 (lower is better), directly useful for extracting structured text from dense PDFs and multilingual scans
05
Speech quality leads on SeedTTS — ZH CER 0.86 and EN WER 2.38, both best among compared models; emotion control on Expresso scores 29.8 vs CosyVoice2's 17.9
06
Proactive behavior from the same framework — the model can issue reminders or commentary without an explicit user prompt, because the Omni-Flow design lets it decide to speak in any 1-second window based on what it currently observes
07
Apache-2.0 with full inference stack — weights, code, and the custom llama.cpp-omni C++ runtime are all public, runnable on macOS, Windows, and Linux without API fees
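For the single-GPU vLLM path in item 02, a minimal smoke test would look roughly like the sketch below. The repository id is a placeholder assumption; check the official model card for the published INT4 checkpoint name and for how audio/video inputs are passed.

```python
# Minimal single-GPU vLLM smoke test; repo id below is a hypothetical placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openbmb/MiniCPM-o-4_5-int4",  # assumed checkpoint name, verify on the model card
    trust_remote_code=True,              # the model ships custom omni-modal code
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Explain full-duplex interaction in one sentence."], params)
print(out[0].outputs[0].text)
```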
Who it’s for

If you're building voice agents, real-time transcription services, or video understanding systems and you need an open-source model without a cloud API dependency, MiniCPM-o 4.5 is the most capable self-hostable full-duplex option published as of May 2026. It's not useful yet if you need multilingual speech beyond English and Chinese, if FutureOmni-type tasks are central to your use case (Qwen3-Omni-30B wins there: 62.1 vs 56.1), or if your infrastructure can't provide the ≥28GB of VRAM the standard PyTorch BF16 path requires.

Worth exploring

The paper is 11 days old and the authors explicitly label it 'an early exploration' with documented speech instability in streaming mode. Explore it now if you're prototyping a real-time voice+vision system or researching full-duplex interaction design. Hold off on production use — full-duplex TAIL mode carries a measured 65% higher EN WER vs the non-real-time path (paper Table 10), and the custom llama.cpp-omni inference stack for edge deployment adds operational complexity most teams won't want in a critical system.
