Microsoft's VibeVoice: 90-min AI podcasts

What problem does it solve

“VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft's guidi...”

You know that feeling when you need to transcribe a 45-minute meeting with 4 people talking over each other? Current tools either choke on long audio, can't tell speakers apart, or require you to slice the audio into chunks, which loses context and creates sync nightmares. The same goes for generating a podcast from a script: existing TTS sounds robotic, can't handle multiple speakers naturally, and tops out at around 5 minutes before quality degrades. You end up with either expensive API calls or hours of manual post-processing.

Tags: ai, voice, asr, speech-recognition, open-source, microsoft, python

How it works

Think of it like this: instead of processing audio 50 times per second (like most models), VibeVoice processes it 7.5 times per second. That is roughly like reading every seventh word instead of every word, while still understanding the sentence. This 3200× compression comes from a dual-tokenizer system: one tokenizer captures the acoustic fingerprint (what it sounds like), another captures semantic meaning (what's being said). For TTS, you feed text and voice prompts into a Qwen2.5 LLM, which predicts what the audio should sound like. A diffusion model then generates the actual audio frame by frame, conditioned on the LLM's understanding. For ASR, the flow is reversed: audio in → tokenizer → LLM → structured text with speaker labels and timestamps out. The key insight: by compressing so aggressively while preserving fidelity, the model can process 90 minutes of audio in a single 64K-token context window without chunking.
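The numbers in this section can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope calculation, not code from the project. It assumes a 24 kHz input sample rate (implied by 3200 × 7.5 Hz but not stated above) and reads "64K" as 65,536 tokens. The function names are made up for illustration.

```python
SAMPLE_RATE_HZ = 24_000      # assumption: implied by 3200 x 7.5 Hz, not stated in the text
FRAME_RATE_HZ = 7.5          # audio frames (tokens) per second, from the text
CONTEXT_TOKENS = 64 * 1024   # assumption: "64K" read as 65,536 tokens


def compression_ratio(sample_rate_hz: float, frame_rate_hz: float) -> float:
    """Raw audio samples represented by a single frame."""
    return sample_rate_hz / frame_rate_hz


def audio_tokens(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of audio frames needed for a clip of the given length."""
    return round(minutes * 60 * frame_rate_hz)


ratio = compression_ratio(SAMPLE_RATE_HZ, FRAME_RATE_HZ)
tokens_90_min = audio_tokens(90)
headroom = CONTEXT_TOKENS - tokens_90_min

print(f"samples per frame: {ratio:.0f}")
print(f"frames for 90 minutes: {tokens_90_min}")
print(f"tokens left for text and prompts: {headroom}")
```

Under these assumptions, 90 minutes of audio takes 40,500 frames, which leaves room in the window for the script and voice prompts. At 50 frames per second the same clip would need 270,000 frames and would not fit without chunking.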

Key takeaways

01  90-minute single-pass generation: you get hour-long multi-speaker audio in one coherent generation, with no stitching chunks together or managing context windows

02  4-speaker conversations with natural turn-taking: you specify who speaks when, and the model maintains consistent voices and handles back-and-forth dialogue without manual speaker switching

03  60-minute ASR with built-in diarization: you upload one long file and get structured output (Speaker A at 0:15, Speaker B at 2:30) with timestamps and content, with no separate diarization step

04  50+ language support for ASR: you process multilingual audio without switching models or configuring language detection

05  Custom hotwords for domain accuracy: you provide technical terms or names, and the model prioritizes them during transcription, reducing WER on specialized content

06  7.5 Hz ultra-low frame rate: you process 80× fewer tokens than Encodec, meaning faster inference and longer context windows on the same hardware

07  Transformers integration (v5.3.0+): you load the ASR model with two lines of code, with no custom dependencies or repo cloning
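To make the "structured output" in the diarization takeaway concrete, here is a minimal sketch of how such a transcript could be represented and rendered downstream. The `Segment` schema, its field names, and the helper functions are hypothetical. The model's actual output format is not described above and may differ.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized span of speech (hypothetical schema, not the model's real format)."""
    speaker: str
    start_s: float
    end_s: float
    text: str


def mmss(seconds: float) -> str:
    """Format a time offset in seconds as M:SS."""
    total = int(seconds)
    return f"{total // 60}:{total % 60:02d}"


def render(segments: list[Segment]) -> list[str]:
    """Turn segments into readable transcript lines, ordered by start time."""
    ordered = sorted(segments, key=lambda s: s.start_s)
    return [f"[{mmss(s.start_s)}] {s.speaker}: {s.text}" for s in ordered]


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Total seconds spoken per speaker."""
    totals: dict[str, float] = {}
    for s in segments:
        totals[s.speaker] = totals.get(s.speaker, 0.0) + (s.end_s - s.start_s)
    return totals


# Made-up sample data for illustration only.
segments = [
    Segment("Speaker B", 150.0, 162.5, "Let's move to the roadmap."),
    Segment("Speaker A", 15.0, 42.0, "Thanks for joining, everyone."),
]
lines = render(segments)
for line in lines:
    print(line)
```

Because speaker labels and timestamps arrive together, per-speaker statistics such as `talk_time(segments)` need no separate diarization pass.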

Should you care?

Who it’s for

If you're building voice-enabled apps, transcribing meetings or podcasts, or generating synthetic audio content, this gives you production-ready ASR with diarization built in. It is ideal for teams who need long-form transcription without stitching together chunked results. It is not useful if you need the long-form TTS functionality, because that code has been removed. It is also not for you if you need overlapping speech modeling or TTS in languages other than English and Chinese (ASR supports 50+ languages, but TTS covers only English and Chinese).

Worth exploring

Yes for the ASR model: it's production-ready, integrated into Transformers, and beats or matches competitors on accuracy (1.29% WER vs Gemini's 1.73%). The 50+ language support and built-in diarization make it practical for real workloads. The finetuning code and vLLM support mean you can customize and scale it. Skip the long-form TTS entirely: Microsoft removed the code in September 2025 after misuse, and it's not coming back. The realtime streaming TTS (0.5B) is still available, but it is research-only with a 10-minute cap.
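For readers unfamiliar with the metric behind the 1.29% and 1.73% figures, word error rate is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of reference words. The sketch below is a plain word-level edit distance. It is an illustration only, and the benchmark behind the quoted numbers may normalize text (casing, punctuation) differently.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
score = wer("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {score:.2%}")
```

A WER of 1.29% therefore means roughly 13 word errors per 1,000 reference words.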
