GitHub Repos intermediate 2 min read Apr 7, 2026 · Updated Apr 15, 2026

Hackathon Project: Real-time voice+vision AI runs on your laptop

“Six months ago this needed an RTX 5090. Now it runs on your M3 Pro.”

Source · github.com

“The difference is that our app is much worse than ChatGPT, for real. If you're subscribed to ChatGPT, please just use that. We're making this for people that can't afford $20/mo.” (Fikri Karim, creator of Bule AI, September 2025)

You know that feeling when you want to build a voice AI feature but cloud APIs cost $20/month per user and send all your data to someone else's servers? Or when you see OpenAI's multimodal demos and think 'that's exactly what I need' but realize it requires their infrastructure? Running real-time voice+vision AI locally used to demand a desktop GPU that costs more than a used car. Parlor exists because its creator runs a free English-learning voice AI service and needed to eliminate server costs entirely.

Tags: ai, open-source, python, llm, voice-assistant, multimodal, on-device

Your browser captures microphone audio and camera frames, sending them over WebSocket to a local FastAPI server. The server feeds audio and JPEG images into Gemma 4 E2B (Google's 2.3B parameter multimodal model) via LiteRT-LM, which understands both speech and vision simultaneously. The model generates text responses, which Kokoro TTS converts to speech — streaming sentence-by-sentence back to your browser. Silero VAD in the browser detects when you're speaking so you don't need push-to-talk, and barge-in lets you interrupt the AI mid-sentence.
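The sentence-by-sentence streaming described above can be sketched as a small buffer that sits between the model's token stream and the TTS engine. This is a minimal illustration, not Parlor's actual code: the function name `stream_sentences` and the punctuation-based boundary rule are assumptions made for the example, and a real pipeline would hand each yielded sentence to Kokoro TTS rather than print it.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: ".", "!" or "?" followed by whitespace ends a sentence.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left when the model stops generating.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Hello", " there.", " How", " are", " you?", " Fine"]
print(list(stream_sentences(tokens)))  # ['Hello there.', 'How are you?', 'Fine']
```

Because a sentence is yielded as soon as its closing punctuation and the following space arrive, synthesis can start while the model is still generating the rest of the reply. The simple regex also splits after abbreviations such as "Dr.", so a production splitter would likely need a smarter rule.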

01. Hands-free voice detection: Silero VAD automatically knows when you're speaking, no button-pressing required
02. Barge-in support: interrupt the AI mid-sentence just like a real conversation
03. Sentence-level TTS streaming: audio starts playing before the full response finishes generating, cutting perceived latency
04. On-device multimodal: Gemma 4 E2B processes speech and vision together without cloud calls
05. Cross-platform TTS: uses MLX acceleration on Mac, ONNX on Linux for fast text-to-speech
06. Auto-model download: pulls the 2.6 GB Gemma model from HuggingFace on first run, no manual setup
07. Real-time vision: camera feeds JPEG frames directly to the model for live object recognition and discussion
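
Barge-in, as listed above, comes down to stopping playback once the voice detector reports that the user is talking. The following is a hedged sketch under simplifying assumptions, not the project's implementation: `speak_until_interrupted`, the `user_speaking` event, and the `play` callback are hypothetical names, the check happens only between sentences, and a real client would also cut off audio that is already playing.

```python
import asyncio
from typing import Awaitable, Callable, List


async def speak_until_interrupted(
    sentences: List[str],
    user_speaking: asyncio.Event,
    play: Callable[[str], Awaitable[None]],
) -> List[str]:
    """Play sentences in order, stopping once the user starts talking.

    Returns the sentences that were actually played.
    """
    spoken: List[str] = []
    for sentence in sentences:
        # Hypothetical signal: a VAD callback would set this event.
        if user_speaking.is_set():
            break
        await play(sentence)
        spoken.append(sentence)
    return spoken


async def demo() -> List[str]:
    user_speaking = asyncio.Event()

    async def fake_play(sentence: str) -> None:
        # Stand-in for audio playback; simulate the user cutting in
        # while the second sentence is being spoken.
        await asyncio.sleep(0)
        if sentence == "Second.":
            user_speaking.set()

    return await speak_until_interrupted(
        ["First.", "Second.", "Third."], user_speaking, fake_play
    )


print(asyncio.run(demo()))  # ['First.', 'Second.']
```

Checking a shared flag between units of playback keeps the logic easy to follow. Checking between shorter audio chunks instead of whole sentences would make the interruption feel more immediate.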

Who it’s for

If you're a developer curious about on-device AI who wants to see what's now possible on laptop hardware, this is your demo. Also relevant if you're building privacy-first applications or need zero-marginal-cost voice AI. Not for you if you need production-ready code (security issues exist), Windows support (LiteRT-LM doesn't support it), or agentic coding capabilities (creator explicitly says it can't do this).

Worth exploring

Yes, but strictly as an experiment. The project is 4 days old (April 3-6, 2026), carries a 'research preview' label, and has 3 open issues, including security vulnerabilities. What makes it worth your time: it proves real-time multimodal AI now runs on laptop-class hardware, and the 705 GitHub stars it collected within days show genuine developer interest. Try it to understand what's now possible locally, but don't build on it yet. Wait for security patches and broader platform support.
