GitHub Repos intermediate 3 min read Apr 8, 2026 · Updated Apr 15, 2026

Microsoft's VibeVoice: 90-min AI podcasts

“Microsoft built a voice AI that generates 90-minute podcasts, then pulled the code after 11 days—here's what's left.”

Source · github.com

“VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft's guidi...”

You know that feeling when you need to transcribe a 45-minute meeting with 4 people talking over each other? Current tools either choke on long audio, can't tell speakers apart, or require you to slice audio into chunks—losing context and creating sync nightmares. Or when you want to generate a podcast from a script, but existing TTS sounds robotic, can't handle multiple speakers naturally, and tops out at 5 minutes before quality degrades. You end up with either expensive API calls or hours of manual post-processing.

ai · voice · asr · speech-recognition · open-source · microsoft · python

Think of it like this: instead of processing audio 50 times per second (like many models), VibeVoice processes it 7.5 times per second. That is like reading roughly every seventh word instead of every word, but still understanding the sentence. This 3200× compression relative to the raw waveform comes from a dual-tokenizer system: one tokenizer captures the acoustic fingerprint (what it sounds like), another captures semantic meaning (what's being said). For TTS, you feed text and voice prompts into a Qwen2.5 LLM, which predicts what the audio should sound like; a diffusion model then generates the actual audio frame by frame, conditioned on the LLM's understanding. For ASR, the flow reverses: audio in → tokenizer → LLM → structured text with speaker labels and timestamps out. The key insight: by compressing so aggressively while preserving fidelity, you can process 90 minutes of audio in a single 64K-token context window without chunking.
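The arithmetic behind those figures can be checked with a short sketch. It assumes a 24 kHz input sample rate (implied by 3200 × 7.5 Hz, not stated above), one token per audio frame, and a 64K window read as 65,536 tokens. The helper names are illustrative, and the real model also spends part of its context on text and speaker prompts.

```python
# Back-of-the-envelope token budget for a 7.5 Hz audio tokenizer.
# Assumptions (not stated in the article): 24 kHz input audio, one token
# per frame, and a 64K window read as 65,536 tokens.

FRAME_RATE_HZ = 7.5
SAMPLE_RATE_HZ = 24_000
CONTEXT_TOKENS = 64 * 1024


def compression_ratio(sample_rate_hz: float, frame_rate_hz: float) -> float:
    """Raw audio samples represented by a single frame."""
    return sample_rate_hz / frame_rate_hz


def audio_tokens(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of audio frames (tokens) needed for `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    print(compression_ratio(SAMPLE_RATE_HZ, FRAME_RATE_HZ))  # 3200.0
    print(audio_tokens(90))                                   # 40500
    print(audio_tokens(90, frame_rate_hz=50))                 # 270000
    print(audio_tokens(90) <= CONTEXT_TOKENS)                 # True
```

Under these assumptions, 90 minutes of audio needs about 40,500 frames at 7.5 Hz, which fits in the window with room left for text. At 50 frames per second the same audio would need 270,000, which is why higher-rate tokenizers force chunking.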

01. 90-minute single-pass generation: you get hour-long multi-speaker audio in one coherent generation, no stitching chunks together or managing context windows
02. 4-speaker conversations with natural turn-taking: you specify who speaks when, the model maintains consistent voices and handles back-and-forth dialogue without manual speaker switching
03. 60-minute ASR with built-in diarization: you upload one long file and get structured output (Speaker A at 0:15, Speaker B at 2:30) with timestamps and content, no separate diarization step
04. 50+ language support for ASR: you process multilingual audio without switching models or configuring language detection
05. Custom hotwords for domain accuracy: you provide technical terms or names, the model prioritizes them during transcription, reducing WER on specialized content
06. 7.5 Hz ultra-low frame rate: you process 80× fewer tokens than Encodec, meaning faster inference and longer context windows on the same hardware
07. Transformers integration (v5.3.0+): you load the ASR model with two lines of code, no custom dependencies or repo cloning
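
To make the "structured output" in item 03 concrete, here is a minimal sketch of how diarized segments might be represented and printed in downstream code. The `Segment` fields and the `fmt_time` and `format_transcript` helpers are hypothetical illustrations, not VibeVoice's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized span of speech (hypothetical schema)."""
    speaker: str
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds
    text: str


def fmt_time(seconds: float) -> str:
    """Render seconds as M:SS, e.g. 150 -> '2:30'."""
    total = int(seconds)
    return f"{total // 60}:{total % 60:02d}"


def format_transcript(segments: list[Segment]) -> str:
    """Sort segments by start time and render one line per segment."""
    ordered = sorted(segments, key=lambda s: s.start_s)
    return "\n".join(
        f"[{fmt_time(s.start_s)}] {s.speaker}: {s.text}" for s in ordered
    )


if __name__ == "__main__":
    demo = [
        Segment("Speaker B", 150.0, 162.5, "Let's move to the roadmap."),
        Segment("Speaker A", 15.0, 42.0, "Thanks for joining, everyone."),
    ]
    print(format_transcript(demo))
```

Keeping speaker, timestamps, and text in one record is what removes the separate diarization step: there is nothing to align after the fact.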
Who it’s for

If you're building voice-enabled apps, transcribing meetings or podcasts, or generating synthetic audio content, this gives you production-ready ASR with diarization built in. It is ideal for teams who need long-form transcription without stitching together chunked results. It is not useful if you need the long-form TTS functionality, because that code is gone. It is also not for you if you need overlapping-speech modeling or TTS in languages other than English and Chinese (ASR supports 50+ languages, but TTS only covers English and Chinese, and the TTS code is removed anyway).

Worth exploring

Yes for the ASR model: it's production-ready, integrated into Transformers, and beats or matches competitors on accuracy (1.29% WER vs Gemini's 1.73%). The 50+ language support and built-in diarization make it practical for real workloads. The finetuning code and vLLM support mean you can customize and scale it. Skip the long-form TTS: Microsoft removed the code in September 2025 after misuse, and it's not coming back. The realtime streaming TTS (0.5B) is still available, but it is research-only with a 10-minute cap.
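
For context on the WER figures above: word error rate is conventionally the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal sketch follows. It only lowercases and splits on whitespace, whereas real benchmarks apply fuller text normalization, so it would not reproduce the quoted numbers.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

A WER of 1.29% means roughly 13 word errors per 1,000 reference words, so small differences between systems depend heavily on the test set and on the normalization rules used.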
