Back to Home
Coding Agents

antirez/ds4: DeepSeek 4 Flash local inference engine for Metal

AI-assisted — drafted with AI, reviewed by editors

Oliver Schmidt

DevOps engineer covering AI agents for operations and deployment.

May 9, 20269 min read

# ds4.c: The Hyper-Focused Metal Inference Engine That's Making DeepSeek V4 Flash Sing on Your Mac 🚀 In a world drowning in generic model runners and sprawling AI frameworks, one project dares to as...

ds4.c: The Hyper-Focused Metal Inference Engine That's Making DeepSeek V4 Flash Sing on Your Mac 🚀

In a world drowning in generic model runners and sprawling AI frameworks, one project dares to ask: what if we just did one thing, and did it exceptionally well?

Enter ds4.c — a razor-sharp, Metal-native inference engine built exclusively for DeepSeek V4 Flash. Created by none other than Salvatore Sanfilippo (yes, that antirez — the creator of Redis), this project is a love letter to focused engineering, the Mac ecosystem, and the idea that a 284-billion-parameter model deserves a purpose-built engine rather than being crammed into a one-size-fits-all runtime.

With over 3,400 stars in just days of launch, the developer community is clearly resonating with the philosophy: stop building frameworks, start building experiences.


🎯 Why One Model Deserves Its Own Engine

The local AI inference landscape is rich. Projects like llama.cpp, vllm, and ollama serve broad model ecosystems admirably. So why would anyone build — and use — an engine locked to a single model?

The answer lies in what DeepSeek V4 Flash actually is.

"This project takes a deliberately narrow bet: one model at a time, official-vector validation, long-context tests, and enough agent integration to know if it really works."

After rigorous comparison with powerful smaller dense models, the team behind ds4.c found that DeepSeek V4 Flash isn't just another model — it's a paradigm shift for local inference:

Feature DeepSeek V4 Flash Typical Dense Models (27B–35B)
Active Parameters Fewer (MoE architecture) All parameters active
Thinking Mode Length Proportional to complexity (~1/5th of others) Often bloated, impractical
Context Window 1 million tokens Typically 128K–256K
Knowledge Depth 284B params — vast knowledge Noticeably thinner at edge topics
Writing Quality "Quasi-frontier" English & Italian Good but distinguishable
KV Cache Extremely compressed, disk-persistent RAM-bound, memory hungry
2-bit Quantization Works well with special quantization Usually quality degrades badly

This isn't incremental improvement. This is a fundamentally different cost-to-capability ratio.


🏗️ Architecture: Intentionally Narrow, Intentionally Brilliant

Unlike generic GGUF runners, ds4.c is purpose-built from the ground up:

  • DeepSeek V4 Flash-specific Metal graph executor — not a translation layer
  • Custom GGUF loader — only reads the specially crafted quantization files for this project
  • DS4-specific prompt rendering — tuned for DeepSeek's chat template
  • KV state management — with disk persistence as a first-class citizen
  • HTTP API server glue — ready for agent integration out of the box

The philosophy is crystalline:

"The KV cache is actually a first-class disk citizen."

Modern MacBooks ship with blazing-fast SSDs. DeepSeek V4 Flash features an incredibly compressed KV cache. The combination means you can persist your conversation state to disk and resume it — something that fundamentally changes how you think about local inference on personal hardware.


💡 The "Three Pillars" Vision

What sets this project apart isn't just the code — it's the completeness of the vision. The team believes local inference should be three things working together seamlessly:

  1. 🔧 Inference Engine — with a clean HTTP API
  2. 📦 Purpose-built GGUF — quantized specifically for the engine's assumptions
  3. 🧪 Validation & Testing — including coding agent integration and official logit comparison

This isn't "download a model, run it, hope for the best." This is end-to-end local inference done right — validated against the official implementation's logits at different context sizes, tested with coding agents, and verified for tool-calling reliability.


⚡ The Magic of Asymmetric 2-Bit Quantization

Here's where things get really interesting. The 2-bit quantizations shipped with ds4.c aren't your typical aggressive compression. They use a brilliantly asymmetric strategy:

Component Quantization Level Rationale
Routed MoE Experts (up/gate) IQ2_XXS Majority of model space — maximize compression
Routed MoE Experts (down) Q2_K Balanced quality/compression
Shared Experts Untouched Critical for quality preservation
Projections Untouched Core reasoning components
Routing Untouched Decision-making integrity

Because the routed MoE experts occupy the vast majority of the model's disk footprint, this strategy achieves massive compression while preserving the components that matter most. The result? A 284-billion-parameter model that runs on a MacBook with 128GB of RAM — and actually works under coding agents with reliable tool calling.


🍎 Metal-First: Embracing the Apple Silicon Future

This project is Metal-only by design. No CUDA. No Vulkan. No CPU fallback (well, technically there's one, but more on that in a moment).

The decision to go Metal-only isn't laziness — it's focus. By targeting a single GPU compute API, the team can:

  • Optimize Metal kernels specifically for Apple Silicon's unified memory architecture
  • Leverage the tight integration between macOS, Metal, and NVMe storage
  • Ship a smaller, more testable codebase
  • Validate every computation path rigorously

"The CPU path is only for correctness check, but warning: current macOS versions have a bug in the virtual memory implementation that will crash the kernel if you try to run the CPU code."

Yes, you read that right. There's a macOS kernel bug. The honesty is refreshing — and a reminder that even platform vendors ship broken software. The Metal path, however, works beautifully.


🤖 AI-Assisted Development: A Transparent Approach

One of the most fascinating aspects of ds4.c is its development methodology:

"This software is developed with strong assistance from GPT 5.5 and with humans leading the ideas, testing, and debugging."

The team is completely transparent about this. Humans drive the architecture, the vision, the testing, and the debugging. AI assists with code generation. This isn't hidden or shameful — it's stated plainly, with the caveat that if you're uncomfortable with AI-assisted code, this project isn't for you.

It's a pragmatic approach to building complex systems software in 2026, and the results speak for themselves.


🙏 Standing on the Shoulders of Giants

The project's relationship with llama.cpp and GGML deserves special mention. From the README:

"ds4.c does not link against GGML, but it exists thanks to the path opened by the llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won engineering knowledge developed there."

Borrowed and adapted elements include:

  • GGUF quantization layouts and tables
  • CPU quantization and dot-product logic
  • Certain Metal kernels
  • GGML authors' copyright notice retained in the LICENSE file

This is open-source at its best — building upon the collective wisdom of the community while pushing in a bold new direction. The MIT license ensures this knowledge continues to flow freely.


🚀 Quick Start

Ready to run DeepSeek V4 Flash on your Mac? Here's how to get started with ds4.c:

Prerequisites

  • A Mac with Apple Silicon (M-series chip)
  • 128 GB RAM minimum (for 2-bit quantization)
  • 256 GB RAM recommended (for 4-bit quantization)
  • macOS with Metal support

Installation & Model Download

Clone the repository and download your preferred quantization:

# Clone the repository
git clone https://github.com/antirez/ds4.git
cd ds4

# Build the project
make

# Download the 2-bit model (for 128 GB RAM machines)
./download_model.sh q2

# OR download the 4-bit model (for >= 256 GB RAM machines)
./download_model.sh q4

The download script fetches the specially crafted GGUF files from https://huggingface.co/antirez/deepseek-v4-gguf and stores them locally.

Running Inference

Once the model is downloaded, start the inference server with the HTTP API:

# Start the inference server
./ds4 serve --model ./models/deepseek-v4-flash-q2.gguf

The server exposes an HTTP API compatible with common agent frameworks, making integration with coding assistants and tool-calling pipelines straightforward.


🔮 Looking Ahead

The project is self-described as alpha quality code, but the roadmap hints at exciting possibilities:

  • Updated DeepSeek V4 Flash models — the team expects DeepSeek to release improved versions, and the engine is designed to evolve with them
  • Possible CUDA support — mentioned as a "perhaps," but nothing more
  • Continued validation — more context sizes, more agent integration, more edge-case testing

The constraint remains firm: local inference credible on high-end personal machines or Mac Studios, starting from 128GB of memory.


📊 How It Compares

Aspect ds4.c llama.cpp Ollama
Scope Single model (DeepSeek V4 Flash) General-purpose General-purpose
GPU Backend Metal only Metal, CUDA, Vulkan, CPU Multiple
KV Cache Disk-persistent, compressed RAM-based RAM-based
Quantization Custom asymmetric 2-bit Standard GGUF quantizations Standard GGUF quantizations
Validation Official logit comparison Community tested Community tested
Best For DeepSeek V4 Flash power users Broad model support Easy local deployment

This isn't about being "better" than llama.cpp — it's about being different. Where llama.cpp casts a wide net, ds4.c goes deep.


🏆 Verdict: A Masterclass in Focused Engineering

ds4.c is not for everyone, and it doesn't try to be. It's for the developer or researcher who has a Mac Studio with serious RAM, wants to run DeepSeek V4 Flash at its absolute best, and appreciates software that does one thing with extraordinary care.

The asymmetric quantization strategy alone is worth studying — it's a masterclass in understanding where quality lives in a model and protecting those components at all costs. The disk-persistent KV cache is a glimpse into the future of local inference. And the transparent development methodology sets a standard for how AI-assisted open-source projects should communicate with their communities.

Is it alpha software? Yes. Is there a kernel bug lurking in the CPU path? Apparently. Does it only run on Metal? Absolutely.

But within its deliberately narrow scope, ds4.c achieves something remarkable: it makes a 284-billion-parameter model feel finished on personal hardware. Not just runnable. Not just demo-able. Finished.

If you have the hardware and the curiosity, this project deserves your attention.


Star the repository: antirez/ds4 on GitHub 📦 License: MIT 🍎 Platform: macOS (Apple Silicon, Metal) 🧠 Model: DeepSeek V4 Flash

Keywords

ds4.cDeepSeek V4 FlashMetal inference engineantirezlocal AI inferenceApple SiliconGGUF2-bit quantizationMac AIdisk KV cacheopen source AIDeepSeek localmacOS inference engineMoE quantizationRedis creator AI project

Keep reading

More from DriftSeas on AI agents and the tools around them.