“Microsoft built a voice AI that generates 90-minute podcasts, then pulled the code after 11 days—here's what's left.”
Microsoft built an open-source voice AI that generates 90-minute multi-speaker podcasts in one pass—then removed the TTS code 11 days later after discovering misuse. The remaining ASR model transcribes 60-minute audio files with speaker diarization, supports 50+ languages, and just landed in Hugging Face Transformers v5.3.0. It gives you structured output: who said what, when, with timestamps. In subjective quality tests of the generated speech, the 7B-parameter model beats ElevenLabs v3 and Gemini 2.5 Pro (MOS 3.76 vs 3.40 and 3.66).
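That "who said what, when" structure is easiest to see in code. This is an illustrative sketch, not the actual Transformers API: the `Segment` shape and `format_transcript` helper are assumptions about what a diarized, timestamped result looks like once you've pulled it out of the model.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # diarization label, e.g. "Speaker 1"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcribed words for this span

def format_transcript(segments: list[Segment]) -> list[str]:
    """Render diarized segments as readable lines, merging
    consecutive segments spoken by the same speaker."""
    merged: list[Segment] = []
    for seg in segments:
        if merged and merged[-1].speaker == seg.speaker:
            prev = merged[-1]
            merged[-1] = Segment(prev.speaker, prev.start, seg.end,
                                 prev.text + " " + seg.text)
        else:
            merged.append(seg)
    return [f"[{s.start:07.2f}-{s.end:07.2f}] {s.speaker}: {s.text}"
            for s in merged]
```

Because the model emits speaker labels and timestamps directly, this kind of post-processing is bookkeeping rather than inference: no separate diarization pass to align.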
You know that feeling when you need to transcribe a 45-minute meeting with 4 people talking over each other? Current tools either choke on long audio, can't tell speakers apart, or require you to slice audio into chunks—losing context and creating sync nightmares. Or when you want to generate a podcast from a script, but existing TTS sounds robotic, can't handle multiple speakers naturally, and tops out at 5 minutes before quality degrades. You end up with either expensive API calls or hours of manual post-processing.
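The "sync nightmare" above is concrete: when you slice audio into chunks, every chunk's timestamps are local to that chunk, so you have to re-offset them against the chunk's position in the original file. A minimal sketch of that stitching step (the function name and tuple layout are mine, for illustration):

```python
def stitch_chunks(chunk_results, chunk_starts):
    """Convert chunk-local timestamps to global ones by adding each
    chunk's start offset within the original recording -- the
    bookkeeping a single-pass long-audio model avoids entirely."""
    stitched = []
    for segments, offset in zip(chunk_results, chunk_starts):
        for speaker, start, end, text in segments:
            stitched.append((speaker, start + offset, end + offset, text))
    return stitched
```

And this only fixes timestamps; speaker labels assigned independently per chunk ("Speaker 1" in chunk 3 may be "Speaker 2" in chunk 4) need a separate re-identification pass, which is where most chunked pipelines fall apart.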
Think of it like this: instead of processing audio 50 times per second (like most models), VibeVoice processes it 7.5 times per second—like reading every 10th word instead of every word, but still understanding the sentence. This 3200× compression comes from a dual-tokenizer system: one tokenizer captures the acoustic fingerprint (what it sounds like), another captures semantic meaning (what's being said). You feed text + voice prompts into a Qwen2.5 LLM, which predicts what the audio should sound like. A diffusion model then generates the actual audio frame-by-frame, conditioned on the LLM's understanding. For ASR, it reverses this: audio in → tokenizer → LLM → structured text with speaker labels and timestamps out. The key insight: by compressing so aggressively while preserving fidelity, you can process 90 minutes of audio in a single 64K-token context window without chunking.
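The arithmetic behind those numbers is worth checking. Assuming a 24 kHz input sample rate (an assumption about the audio frontend; the 7.5 Hz frame rate, 3200× figure, and 64K context are from the digest itself), the compression ratio and context budget work out as follows:

```python
# Frame-rate arithmetic for VibeVoice's tokenizer.
# SAMPLE_RATE is an assumption; FRAME_RATE, the 3200x ratio,
# and the 64K context come from the digest.
SAMPLE_RATE = 24_000          # raw audio samples per second (assumed)
FRAME_RATE = 7.5              # acoustic tokens per second
CONTEXT = 64 * 1024           # LLM context window in tokens

compression = SAMPLE_RATE / FRAME_RATE        # 3200.0x samples-per-token
frames_90_min = 90 * 60 * FRAME_RATE          # 40,500 frames for 90 minutes

# 40,500 acoustic frames fit in a 64K window with headroom
# left over for the text script and voice prompts.
headroom = CONTEXT - frames_90_min
```

At a conventional ~50 Hz frame rate the same 90 minutes would need 270,000 frames, which is why chunking is unavoidable for other models but not here.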
If you're building voice-enabled apps, transcribing meetings or podcasts, or generating synthetic audio content, this gives you production-ready ASR with diarization built in. It's ideal for teams that need long-form transcription without stitching together chunked results. It's not useful if you need the TTS functionality—the code is gone. It's also not for you if you need overlapping speech modeling or non-English…
Yes for the ASR model—it's production-ready, integrated into Transformers, and beats or matches competitors on accuracy (1.29% WER vs Gemini's 1.73%). The 50+ language support and built-in diarization make it practical for real workloads. The finetuning code and vLLM support mean you can customize and scale it. Skip the TTS entirely—Microsoft removed the code in September 2025 after misuse, and it's not coming back. The realtime streaming TTS (0.5B) is still available but research-only with a 10-minute cap.
View original source

This page gives you the hook. The full Snaplyze digest goes deeper so you can move from curiosity to decision with less noise.
Read the full digest for the deeper breakdown, Easy Mode, Pro Mode, and practical next-step playbooks you can actually use.
Install Snaplyze