“current macOS versions have a bug in the virtual memory implementation that will crash the kernel if you try to run the CPU code. Remember? Software sucks.” — antirez, README (https://raw.githubusercontent.com/antirez/ds4/main/README.md)
You know that feeling when you want to run a genuinely capable AI model locally, but every option either requires 500 GB of RAM, forces you through five layers of Python framework, or delivers a generic quantization mix that trades away the parts of the model you actually care about? Running DeepSeek V4 Flash through llama.cpp or Ollama on Apple Silicon means accepting whatever quant strategy works across all models rather than one tuned for this model specifically. You also hit the coding-agent trap: every new session resends a 25k-token system prompt, burning 4 minutes of prefill before any useful work starts. `ds4` targets exactly this: one model, one platform, no framework overhead, and a disk cache that skips repeated prefill after the first cold start.
You download an 81 GB model file (a custom 2-bit quantized GGUF from antirez's HuggingFace repo) and run a single `make`. The engine maps the model into memory and builds a Metal compute graph specifically tuned for DeepSeek V4 Flash's MoE architecture — only the routed expert layers get quantized to 2-bit, while attention, projections, and routing stay at full precision. When you run the server, it holds one live KV session in memory and hashes each incoming request's token sequence with SHA1. On a match, it loads a saved checkpoint from disk instead of re-running prefill from token zero — turning a 4-minute cold start into a sub-second resume. The OpenAI and Anthropic API endpoints translate client JSON tool schemas to DeepSeek's internal DSML format and map responses back, so existing agent clients connect without modification.
If you own a Mac with 128 GB+ unified memory (M3 Max, M3 Ultra, M4 Max, or M5 class) and want to run a coding agent locally without sending code to cloud endpoints, ds4 is built for your exact setup. It also suits engineers studying purpose-built Metal compute graphs for MoE models — the source is single-file C with no framework overhead and detailed comments on kernel and quantization decisions. This is not useful yet if you are on Linux (build fails per issue #21), Windows, NVIDIA hardware (CUDA path yields only 12 t/s generation, non-viable), or any Mac with under 128 GB unified memory.
Worth exploring now if you have 128 GB+ Apple Silicon hardware and want the most optimized local path for DeepSeek V4 Flash: the disk KV cache alone saves meaningful time in coding-agent workflows, and the project picked up 4,184 stars in its first 48 hours. Wait if you need Linux, NVIDIA, or multi-user inference: the project self-labels as alpha, the CPU path crashes the macOS kernel, tool use reportedly breaks around 50k tokens per HN reports, and there is no GGUF conversion script, so you depend entirely on antirez's HuggingFace repo for model updates.