“MiniCPM-o 4.5 is an early exploration of real-time full-duplex omni-modal interaction; long, dynamic real-world streaming robustness needs improvement. — Paper Section 8, Limitations, arXiv:2604.27393”
You know that feeling when you ask an AI voice assistant something and it has to stop everything — stop watching, stop listening — just to reply? Current multimodal models alternate between perceiving and responding: the model listens, then pauses to generate an answer, then resumes listening. That cycle prevents interruption, real-time commentary on live video, and any response to things that change while the model is speaking. It's the architectural reason why voice AI still feels like a walkie-talkie rather than a phone call.
Omni-Flow divides time into 1-second chunks. In each chunk, the model receives a group of visual tokens (camera input), audio tokens (microphone input), and output tokens (what it's currently saying), all serialized as one sequence for standard next-token prediction. A Listen-Speak control token at the start of each group tells the model whether to stay quiet or generate speech. Ablations in the paper show that 1.0-second windows significantly outperform 0.2-second and 0.1-second windows, and that explicit boundary tokens beat implicit ones. A second technique called TAIL (Time-Aligned Interleaving) regulates how many text tokens are generated per window based on accumulated speech playback progress, preventing the lag where the model has internally moved on while the user is still hearing older audio.
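To make the chunk layout concrete, here is a minimal Python sketch of how a 1-second chunk could be serialized and how a TAIL-style budget could pace text generation. The control-token names, helper functions, and rate constants below are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of Omni-Flow's per-second chunk layout and a TAIL-style text
# budget. Token names, helpers, and rates here are illustrative assumptions; the
# paper's actual tokenizer, chunk layout, and pacing rule may differ.

LISTEN = "<listen>"  # assumed control token: stay quiet for this chunk
SPEAK = "<speak>"    # assumed control token: generate speech for this chunk


def serialize_chunk(vision_tokens, audio_tokens, output_tokens, speaking):
    """Pack one 1-second chunk into a flat sequence for next-token prediction:
    control token first, then perception (vision + audio), then the model's own output."""
    control = SPEAK if speaking else LISTEN
    return [control, *vision_tokens, *audio_tokens, *output_tokens]


def serialize_stream(chunks):
    """Concatenate consecutive 1-second chunks into a single token sequence."""
    sequence = []
    for chunk in chunks:
        sequence.extend(
            serialize_chunk(chunk["vision"], chunk["audio"], chunk["output"], chunk["speak"])
        )
    return sequence


def tail_text_budget(text_tokens_emitted, speech_seconds_played,
                     tokens_per_speech_second=8, lookahead_tokens=16):
    """TAIL-style pacing: cap how many new text tokens the current window may add,
    so generation stays only slightly ahead of the audio the user has actually heard.
    Both rate constants are placeholder values, not numbers from the paper."""
    allowed_total = int(speech_seconds_played * tokens_per_speech_second) + lookahead_tokens
    return max(0, allowed_total - text_tokens_emitted)
```

The key property this sketch tries to capture is that perceiving and responding never block each other: every chunk carries both perception tokens and a (possibly empty) output slot, and the control token decides whether that slot is used.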
If you're building voice agents, real-time transcription services, or video understanding systems and you need an open-source model without a cloud API dependency, MiniCPM-o 4.5 is the most capable self-hostable full-duplex option published as of May 2026. It's not the right choice yet if you need multilingual speech beyond English and Chinese, if FutureOmni-type tasks are central to your use case (Qwen3-Omni-30B wins there: 62.1 vs 56.1), or if your infrastructure can't provide the ≥28 GB of VRAM the standard PyTorch path requires.
The paper is 11 days old and the authors explicitly label it 'an early exploration' with documented speech instability in streaming mode. Explore it now if you're prototyping a real-time voice+vision system or researching full-duplex interaction design. Hold off on production use: full-duplex TAIL mode carries a measured 65% higher English WER than the non-real-time path (paper Table 10), and the custom llama.cpp-omni inference stack required for edge deployment adds operational complexity most teams won't want in a critical system.