"Would you consider hosting this dataset on https://huggingface.co/datasets as well?" (NielsRogge, GitHub issue #1)
You know that feeling when an audio AI only answers after you upload the whole clip? That works for a recording, but it breaks when sound keeps changing in real time. Existing streaming systems often focus on one task, like speech recognition or voice chat. Audio Interaction Model tries to give you one model that listens in chunks, decides whether the moment deserves a response, and then speaks only when needed.
Think of it like a receptionist who listens all day but only interrupts when a visitor needs help. Audio-Interaction takes 16 kHz audio in 0.4-second chunks, uses the Qwen2.5-Omni audio tower, and tracks a LISTENING state or a SPEAKING state. In LISTENING, it predicts either `KEEP_SILENCE` or `TEXT_BEGIN`; in SPEAKING, it generates text until `TEXT_END`. SoundFlow wraps this with streaming data construction, interaction-aware training, and FIFO asynchronous inference so encoding and decoding do not block each other.
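To make the chunking and the two-state loop concrete, here is a minimal sketch in Python. It is not the project's code: `split_into_chunks`, `run_turns`, `toy_decide`, and `toy_generate` are hypothetical names, and the energy-threshold decision is a stand-in for the model's learned `KEEP_SILENCE` / `TEXT_BEGIN` prediction. It also runs synchronously, so it leaves out the FIFO asynchronous inference described above.

```python
from enum import Enum

SAMPLE_RATE_HZ = 16_000  # from the text: 16 kHz audio
CHUNK_SECONDS = 0.4      # from the text: 0.4-second chunks
CHUNK_SAMPLES = round(SAMPLE_RATE_HZ * CHUNK_SECONDS)  # 6400 samples per chunk

KEEP_SILENCE = "KEEP_SILENCE"
TEXT_BEGIN = "TEXT_BEGIN"
TEXT_END = "TEXT_END"


class State(Enum):
    LISTENING = "LISTENING"
    SPEAKING = "SPEAKING"


def split_into_chunks(samples, chunk_samples=CHUNK_SAMPLES):
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(samples), chunk_samples):
        yield samples[start:start + chunk_samples]


def run_turns(chunks, decide, generate):
    """Walk the chunks and return (responses, state_trace).

    decide(chunk)   -> KEEP_SILENCE or TEXT_BEGIN (stand-in for the model)
    generate(chunk) -> iterable of text tokens, terminated by TEXT_END
    """
    responses = []
    trace = []
    for index, chunk in enumerate(chunks):
        # LISTENING: exactly one silence-or-speak decision per chunk.
        trace.append((index, State.LISTENING))
        if decide(chunk) == KEEP_SILENCE:
            continue
        # SPEAKING: collect text tokens until TEXT_END, then listen again.
        trace.append((index, State.SPEAKING))
        tokens = []
        for token in generate(chunk):
            if token == TEXT_END:
                break
            tokens.append(token)
        responses.append((index, " ".join(tokens)))
    return responses, trace


def toy_decide(chunk):
    """Hypothetical rule: speak only when the chunk contains a loud sample."""
    peak = max(abs(sample) for sample in chunk)
    return TEXT_BEGIN if peak > 0.5 else KEEP_SILENCE


def toy_generate(chunk):
    """Hypothetical generator: fixed tokens, then TEXT_END."""
    return iter(["loud", "sound", TEXT_END, "never-emitted"])


# Two silent chunks, one loud chunk, then a short silent tail of 100 samples.
audio = [0.0] * (2 * CHUNK_SAMPLES) + [1.0] * CHUNK_SAMPLES + [0.0] * 100
chunks = list(split_into_chunks(audio))
responses, trace = run_turns(chunks, toy_decide, toy_generate)
print(len(chunks), responses)
```

In this toy run only the third chunk (index 2) triggers a response; every other chunk stays in LISTENING. A real deployment would replace both toy functions with model calls and keep encoding new chunks while text is still being decoded.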
If you build voice agents, audio assistants, or multimodal agents, this gives you a concrete architecture for learned response timing. It is also useful if you study audio LLM evaluation and want benchmarks beyond clip-level question answering. It is not for you yet if you need complete training reproducibility or proven interruption/full-duplex behavior in the public demo.
Treat this as experimental research with runnable pieces, not a production-ready voice stack. The strongest reason to explore it is the learned silence/speak decision plus public weights; the strongest reason to wait is that the full dataset pipeline and full training configs are still marked "Coming". Use it for evaluation, prototypes, and architecture study before you trust it in a product.