Hackathon Project: Real-time voice+vision AI runs on your laptop

What problem does it solve

“The difference is that our app is much worse than ChatGPT, for real. If you're subscribed to ChatGPT, please just use that. We're making this for people that can't afford $20/mo.”
Fikri Karim, creator of Bule AI (September 2025)

You know the feeling: you want to build a voice AI feature, but cloud APIs cost $20/month per user and send all your data to someone else's servers. Or you see OpenAI's multimodal demos and think 'that's exactly what I need', then realize they depend on OpenAI's infrastructure. Running real-time voice+vision AI locally used to demand a desktop GPU that costs more than a used car. Parlor exists because its creator runs a free English-learning voice AI service and needed to eliminate server costs entirely.

ai · open-source · python · llm · voice-assistant · multimodal · on-device

How it works

Your browser captures microphone audio and camera frames and sends them over a WebSocket to a local FastAPI server. The server feeds the audio and JPEG images into Gemma 4 E2B (Google's 2.3B-parameter multimodal model) via LiteRT-LM, so a single model handles both speech and vision. The model generates a text response, which Kokoro TTS converts to speech and streams back to your browser sentence by sentence. Silero VAD runs in the browser and detects when you're speaking, so you don't need push-to-talk, and barge-in lets you interrupt the AI mid-sentence.
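To make the sentence-by-sentence streaming step concrete, here is a minimal Python sketch of how a stream of generated text fragments could be cut into sentences that are handed to TTS as soon as each one is complete. It is an illustration, not Parlor's actual code: the function name `stream_sentences` and the rule "a sentence ends at `.`, `!` or `?` followed by whitespace" are assumptions made for this example.

```python
import re
from typing import Iterable, Iterator

# Assumption for this sketch: a sentence ends at '.', '!' or '?'
# followed by whitespace. Abbreviations such as "Dr. Smith" would be
# split too early; a real system would likely need a smarter rule.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left once the model stops generating.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    fragments = ["Hel", "lo the", "re. How", " are", " you? I am", " fine"]
    for sentence in stream_sentences(fragments):
        # In a real pipeline this is where a TTS call would go.
        print(sentence)
```

Because each sentence is yielded the moment its terminator is followed by whitespace, speech synthesis for the first sentence can begin while the model is still producing the rest of the reply.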

Key takeaways

01 Hands-free voice detection: Silero VAD automatically detects when you're speaking, no button-pressing required

02 Barge-in support: interrupt the AI mid-sentence, just like in a real conversation

03 Sentence-level TTS streaming: audio starts playing before the full response finishes generating, cutting perceived latency

04 On-device multimodal: Gemma 4 E2B processes speech and vision together without cloud calls

05 Cross-platform TTS: uses MLX acceleration on Mac and ONNX on Linux for fast text-to-speech

06 Auto-model download: pulls the 2.6GB Gemma model from HuggingFace on first run, no manual setup

07 Real-time vision: the camera feeds JPEG frames directly to the model for live object recognition and discussion
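Barge-in is easiest to picture as task cancellation: playback runs as one task, and a "user started speaking" signal cancels it. The sketch below shows that idea with Python's `asyncio`. It is not taken from Parlor; `play_sentences`, `run_with_barge_in`, and `demo` are hypothetical names, playback is simulated with `asyncio.sleep`, and the VAD signal is simulated with a timer.

```python
import asyncio
from typing import List, Tuple


async def play_sentences(sentences: List[str], played: List[str],
                         seconds_each: float) -> None:
    """Hypothetical stand-in for audio playback: one sleep per sentence."""
    for sentence in sentences:
        await asyncio.sleep(seconds_each)  # simulated audio duration
        played.append(sentence)


async def run_with_barge_in(sentences: List[str],
                            speech_start: asyncio.Event,
                            seconds_each: float = 0.02
                            ) -> Tuple[List[str], bool]:
    """Play sentences until finished, or until speech_start is set."""
    played: List[str] = []
    playback = asyncio.create_task(
        play_sentences(sentences, played, seconds_each))
    listener = asyncio.create_task(speech_start.wait())
    # Wake up as soon as either playback ends or the user speaks.
    await asyncio.wait({playback, listener},
                       return_when=asyncio.FIRST_COMPLETED)
    interrupted = not playback.done()
    # Cancel whichever task is still pending, then let both settle.
    for task in (playback, listener):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, listener, return_exceptions=True)
    return played, interrupted


async def demo() -> Tuple[List[str], bool]:
    speech_start = asyncio.Event()
    # Simulated VAD: report "speech start" 50 ms into a ~1 s reply.
    asyncio.get_running_loop().call_later(0.05, speech_start.set)
    sentences = [f"Sentence {i}." for i in range(50)]
    return await run_with_barge_in(sentences, speech_start)


if __name__ == "__main__":
    played, interrupted = asyncio.run(demo())
    print(f"interrupted={interrupted}, sentences played={len(played)}")
```

In a browser-plus-server setup the signal would arrive as a message from the client-side VAD instead of a timer, but the cancellation pattern stays the same.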

Should you care?

Who it’s for

If you're a developer curious about on-device AI who wants to see what's now possible on laptop hardware, this is your demo. Also relevant if you're building privacy-first applications or need zero-marginal-cost voice AI. Not for you if you need production-ready code (security issues exist), Windows support (LiteRT-LM doesn't support it), or agentic coding capabilities (creator explicitly says it can't do this).

Worth exploring

Yes, but strictly as an experiment. The project is 4 days old (April 3-6, 2026), carries a 'research preview' label, and has 3 open issues, including security vulnerabilities. What makes it worth your time: it proves that real-time multimodal AI now runs on laptop-class hardware. The 705 GitHub stars it collected in those few days show genuine developer interest. Try it to understand what's now possible locally, but don't build on it yet. Wait for security patches and broader platform support.
