“MiniCPM-o 4.5 is an early exploration of real-time full-duplex omni-modal interaction; long, dynamic real-world streaming robustness needs improvement. — Paper Section 8, Limitations, arXiv:2604.27393”
You know that feeling when you ask an AI voice assistant something and it has to stop everything — stop watching, stop listening — just to reply? Current multimodal models alternate between perceiving and responding: the model listens, then pauses to generate an answer, then resumes listening. That cycle prevents interruption, real-time commentary on live video, and any response to things that change while the model is speaking. It's the architectural reason why voice AI still feels like a walkie-talkie rather than a phone call.
Omni-Flow divides time into 1-second chunks. In each chunk, the model receives a group of visual tokens (camera input), audio tokens (microphone input), and output tokens (what it's currently saying), all serialized as one sequence for standard next-token prediction. A Listen-Speak control token at the start of each group tells the model whether to stay quiet or generate speech. Ablations in the paper show that 1.0-second windows significantly outperform 0.2-second and 0.1-second windows, and that explicit boundary tokens beat implicit ones. A second technique called TAIL (Time-Aligned Interleaving) regulates how many text tokens are generated per window based on accumulated speech playback progress, preventing the lag where the model has internally moved on while the user is still hearing older audio.
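To make the chunk layout concrete, here is a minimal Python sketch of how a 1-second chunk could be serialized and how a TAIL-style budget could pace text generation. The control-token names, helper functions, and rate constants below are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of Omni-Flow's per-second chunk layout and a TAIL-style text
# budget. Token names, helpers, and rates here are illustrative assumptions; the
# paper's actual tokenizer, chunk layout, and pacing rule may differ.

LISTEN = "<listen>"  # assumed control token: stay quiet for this chunk
SPEAK = "<speak>"    # assumed control token: generate speech for this chunk


def serialize_chunk(vision_tokens, audio_tokens, output_tokens, speaking):
    """Pack one 1-second chunk into a flat sequence for next-token prediction:
    control token first, then perception (vision + audio), then the model's own output."""
    control = SPEAK if speaking else LISTEN
    return [control, *vision_tokens, *audio_tokens, *output_tokens]


def serialize_stream(chunks):
    """Concatenate consecutive 1-second chunks into a single token sequence."""
    sequence = []
    for chunk in chunks:
        sequence.extend(
            serialize_chunk(chunk["vision"], chunk["audio"], chunk["output"], chunk["speak"])
        )
    return sequence


def tail_text_budget(text_tokens_emitted, speech_seconds_played,
                     tokens_per_speech_second=8, lookahead_tokens=16):
    """TAIL-style pacing: cap how many new text tokens the current window may add,
    so generation stays only slightly ahead of the audio the user has actually heard.
    Both rate constants are placeholder values, not numbers from the paper."""
    allowed_total = int(speech_seconds_played * tokens_per_speech_second) + lookahead_tokens
    return max(0, allowed_total - text_tokens_emitted)
```

The key property this sketch tries to capture is that perceiving and responding never block each other: every chunk carries both perception tokens and a (possibly empty) output slot, and the control token decides whether that slot is used.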
If you're building voice agents, real-time transcription services, or video understanding systems and you need an open-source model without a cloud API dependency, MiniCPM-o 4.5 is the most capable self-hostable full-duplex option published as of May 2026. It's not the right choice yet if you need multilingual speech beyond English and Chinese, if FutureOmni-type tasks are central to your use case (Qwen3-Omni-30B wins there: 62.1 vs 56.1), or if your infrastructure can't provide the ≥28 GB of VRAM the standard PyTorch path requires.
The paper is 11 days old and the authors explicitly label it 'an early exploration' with documented speech instability in streaming mode. Explore it now if you're prototyping a real-time voice+vision system or researching full-duplex interaction design. Hold off on production use: full-duplex TAIL mode carries a measured 65% higher English WER than the non-real-time path (paper Table 10), and the custom llama.cpp-omni inference stack required for edge deployment adds operational complexity most teams won't want in a critical system.