GitHub Repos · Advanced · 3 min read · May 9, 2026

Redis creator's 284B local LLM: 27 t/s on a MacBook

“50W on a MacBook. The same DeepSeek V4 Flash that costs $20k/month to run at decent throughput in the cloud.”

Source · github.com

“"current macOS versions have a bug in the virtual memory implementation that will crash the kernel if you try to run the CPU code. Remember? Software sucks." — antirez, README (https://raw.githubusercontent.com/antirez/ds4/main/README.md)”

You know the feeling: you want to run a genuinely capable AI model locally, but every option either requires 500 GB of RAM, forces you through five layers of Python frameworks, or delivers a generic quantization mix that trades away the parts of the model you actually care about. Running DeepSeek V4 Flash through llama.cpp or Ollama on Apple Silicon means accepting whatever quant strategy works acceptably across all models rather than one tuned for this one. You also hit the coding-agent trap: every new session resends a 25k-token system prompt, burning 4 minutes of prefill before any useful work starts. `ds4` targets exactly this: one model, one platform, no framework overhead, and a disk cache that skips repeated prefill after the first cold start.

local-llm · apple-silicon · metal · inference-engine · deepseek · c · open-source

You download an 81 GB model file (a custom 2-bit quantized GGUF from antirez's HuggingFace repo) and run a single `make`. The engine maps the model into memory and builds a Metal compute graph specifically tuned for DeepSeek V4 Flash's MoE architecture — only the routed expert layers get quantized to 2-bit, while attention, projections, and routing stay at full precision. When you run the server, it holds one live KV session in memory and hashes each incoming request's token sequence with SHA1. On a match, it loads a saved checkpoint from disk instead of re-running prefill from token zero — turning a 4-minute cold start into a sub-second resume. The OpenAI and Anthropic API endpoints translate client JSON tool schemas to DeepSeek's internal DSML format and map responses back, so existing agent clients connect without modification.
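The prefill-skipping trick is easy to picture in miniature. Below is a minimal Python sketch of the idea, not the project's actual C implementation; the directory and helper names are invented for illustration. The point is simply that the cache key is a SHA1 hash of the exact token sequence, and a hit means loading a checkpoint from disk instead of recomputing prefill.

```python
import hashlib
import os
import pickle

CACHE_DIR = "kv-cache"  # hypothetical location; the real engine picks its own path


def cache_key(token_ids: list[int]) -> str:
    # Key the cache on the exact token sequence, hashed with SHA1.
    raw = b"".join(t.to_bytes(4, "little") for t in token_ids)
    return hashlib.sha1(raw).hexdigest()


def prefill_or_resume(token_ids, run_prefill):
    """Load a saved KV checkpoint on a hash hit; otherwise run prefill and save one."""
    path = os.path.join(CACHE_DIR, cache_key(token_ids) + ".ckpt")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)        # sub-second resume
    kv_state = run_prefill(token_ids)    # the minutes-long cold prefill
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(kv_state, f)         # checkpoint survives server restarts
    return kv_state
```

Because the key is the full token sequence, an agent that resends the same 25k-token system prompt hits the cache from its second session onward, which is exactly the workflow the cold-start numbers above describe.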

01
Asymmetric 2-bit MoE quantization — the 284B model fits in 81 GB by quantizing only the routed expert layers (up/gate at IQ2_XXS, down at Q2_K), leaving attention and routing at full precision, so you preserve model quality where it matters most
02
Disk KV cache keyed by SHA1 of token IDs — coding agents that resend a 25k-token system prompt on every request skip the 4-minute cold prefill after the first session; the cache survives server restarts and session switches automatically
03
Native OpenAI + Anthropic dual-endpoint server — drop ds4-server behind any existing agent client without changing code; tool calls translate between client JSON schemas and DeepSeek's DSML format transparently (see the client sketch below this list)
04
DeepSeek-specific thinking mode controls — you toggle between nothink, thinking, and think-max per request without switching models; thinking section length scales with prompt complexity per README, using roughly 1/5 the tokens of comparable models
05
Official logit test vectors for correctness gating — benchmarks run against logprobs captured from the official DeepSeek API, so quantization or attention regressions surface before they produce wrong code in production
06
Pre-built integration configs for opencode, Pi, and Claude Code — copy-paste JSON blocks in the README wire each agent to ds4-server in under 5 minutes, with context limits and model aliases pre-configured
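For the dual-endpoint server in item 03, here is a hedged sketch of what "connect without modification" looks like from the client side: a stock OpenAI Python client pointed at a local ds4-server. The port, model alias, and API key below are placeholders for illustration, not values from the README; check the repo's integration configs for the real ones.

```python
from openai import OpenAI

# Point an unmodified OpenAI client at a locally running ds4-server.
# Base URL, port, and model name here are assumptions, not values from the README.
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="local-placeholder",  # whether ds4-server checks keys is not covered here
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",  # hypothetical alias; use whatever ds4-server advertises
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Explain what a KV cache stores during prefill."},
    ],
)
print(resp.choices[0].message.content)
```

An Anthropic-style client would hit the server's Anthropic endpoint the same way; in both cases the tool-call translation to DSML happens inside ds4-server, so the client never sees DeepSeek's native format.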
Who it’s for

If you own a Mac with 128 GB+ unified memory (M3 Max, M3 Ultra, M4 Max, or M5 class) and want to run a coding agent locally without sending code to cloud endpoints, ds4 is built for your exact setup. It also suits engineers studying purpose-built Metal compute graphs for MoE models — the source is single-file C with no framework overhead and detailed comments on kernel and quantization decisions. This is not useful yet if you are on Linux (build fails per issue #21), Windows, NVIDIA hardware (CUDA path yields only 12 t/s generation, non-viable), or any Mac with under 128 GB unified memory.

Worth exploring

Worth exploring now if you have 128 GB+ Apple Silicon hardware and want the most optimized local path for DeepSeek V4 Flash — the disk KV cache alone saves meaningful time in coding-agent workflows, and the project picked up 4,184 stars in its first 48 hours. Wait if you need Linux, NVIDIA, or multi-user inference: the project self-labels as alpha, the CPU path crashes the macOS kernel, tool use breaks at ~50k tokens per HN reports, and there is no GGUF conversion script, so you depend entirely on antirez's HuggingFace repo for model updates.

Developer playbook
Tech stack, code snippet, sentiment, alternatives.
PM playbook
Adoption angles, user fit, positioning.
CEO playbook
Traction signals, ROI, build vs buy.
Deep-dive insight
Full long-form analysis, no fluff.
Easy mode
Core idea, fast — when you need the gist.
Pro mode
Technical nuance, edge cases, tradeoffs.
Read the full digest
Go beyond the preview

Deep-dive insight, Easy and Pro modes, plus action playbooks — the full breakdown is one tap away.

Underrated tools. Unfiltered takes.

Read the full digest in the Snaplyze app for deep-dive insight, Easy and Pro modes, and the playbooks you can actually use.

Install Snaplyze →