ds4.c: The Hyper-Focused Metal Inference Engine That's Making DeepSeek V4 Flash Sing on Your Mac 🚀

In a world drowning in generic model runners and sprawling AI frameworks, one project dares to ask: what if we just did one thing, and did it exceptionally well?

Enter ds4.c — a razor-sharp, Metal-native inference engine built exclusively for DeepSeek V4 Flash. Created by none other than Salvatore Sanfilippo (yes, that antirez — the creator of Redis), this project is a love letter to focused engineering, the Mac ecosystem, and the idea that a 284-billion-parameter model deserves a purpose-built engine rather than being crammed into a one-size-fits-all runtime.

With over 3,400 stars in just days of launch, the developer community is clearly resonating with the philosophy: stop building frameworks, start building experiences.

🎯 Why One Model Deserves Its Own Engine

The local AI inference landscape is rich. Projects like llama.cpp, vllm, and ollama serve broad model ecosystems admirably. So why would anyone build — and use — an engine locked to a single model?

The answer lies in what DeepSeek V4 Flash actually is.

"This project takes a deliberately narrow bet: one model at a time, official-vector validation, long-context tests, and enough agent integration to know if it really works."

After rigorous comparison with powerful smaller dense models, the team behind ds4.c found that DeepSeek V4 Flash isn't just another model — it's a paradigm shift for local inference:

Feature	DeepSeek V4 Flash	Typical Dense Models (27B–35B)
Active Parameters	Fewer (MoE architecture)	All parameters active
Thinking Mode Length	Proportional to complexity (~1/5th of others)	Often bloated, impractical
Context Window	1 million tokens	Typically 128K–256K
Knowledge Depth	284B params — vast knowledge	Noticeably thinner at edge topics
Writing Quality	"Quasi-frontier" English & Italian	Good but distinguishable
KV Cache	Extremely compressed, disk-persistent	RAM-bound, memory hungry
2-bit Quantization	Works well with special quantization	Usually quality degrades badly

This isn't incremental improvement. This is a fundamentally different cost-to-capability ratio.

🏗️ Architecture: Intentionally Narrow, Intentionally Brilliant

Unlike generic GGUF runners, ds4.c is purpose-built from the ground up:

DeepSeek V4 Flash-specific Metal graph executor — not a translation layer
Custom GGUF loader — only reads the specially crafted quantization files for this project
DS4-specific prompt rendering — tuned for DeepSeek's chat template
KV state management — with disk persistence as a first-class citizen
HTTP API server glue — ready for agent integration out of the box

The philosophy is crystalline:

"The KV cache is actually a first-class disk citizen."

Modern MacBooks ship with blazing-fast SSDs. DeepSeek V4 Flash features an incredibly compressed KV cache. The combination means you can persist your conversation state to disk and resume it — something that fundamentally changes how you think about local inference on personal hardware.

💡 The "Three Pillars" Vision

What sets this project apart isn't just the code — it's the completeness of the vision. The team believes local inference should be three things working together seamlessly:

🔧 Inference Engine — with a clean HTTP API
📦 Purpose-built GGUF — quantized specifically for the engine's assumptions
🧪 Validation & Testing — including coding agent integration and official logit comparison

This isn't "download a model, run it, hope for the best." This is end-to-end local inference done right — validated against the official implementation's logits at different context sizes, tested with coding agents, and verified for tool-calling reliability.

⚡ The Magic of Asymmetric 2-Bit Quantization

Here's where things get really interesting. The 2-bit quantizations shipped with ds4.c aren't your typical aggressive compression. They use a brilliantly asymmetric strategy:

Component	Quantization Level	Rationale
Routed MoE Experts (up/gate)	`IQ2_XXS`	Majority of model space — maximize compression
Routed MoE Experts (down)	`Q2_K`	Balanced quality/compression
Shared Experts	Untouched	Critical for quality preservation
Projections	Untouched	Core reasoning components
Routing	Untouched	Decision-making integrity

Because the routed MoE experts occupy the vast majority of the model's disk footprint, this strategy achieves massive compression while preserving the components that matter most. The result? A 284-billion-parameter model that runs on a MacBook with 128GB of RAM — and actually works under coding agents with reliable tool calling.

🍎 Metal-First: Embracing the Apple Silicon Future

This project is Metal-only by design. No CUDA. No Vulkan. No CPU fallback (well, technically there's one, but more on that in a moment).

The decision to go Metal-only isn't laziness — it's focus. By targeting a single GPU compute API, the team can:

Optimize Metal kernels specifically for Apple Silicon's unified memory architecture
Leverage the tight integration between macOS, Metal, and NVMe storage
Ship a smaller, more testable codebase
Validate every computation path rigorously

"The CPU path is only for correctness check, but warning: current macOS versions have a bug in the virtual memory implementation that will crash the kernel if you try to run the CPU code."

Yes, you read that right. There's a macOS kernel bug. The honesty is refreshing — and a reminder that even platform vendors ship broken software. The Metal path, however, works beautifully.

🤖 AI-Assisted Development: A Transparent Approach

One of the most fascinating aspects of ds4.c is its development methodology:

"This software is developed with strong assistance from GPT 5.5 and with humans leading the ideas, testing, and debugging."

The team is completely transparent about this. Humans drive the architecture, the vision, the testing, and the debugging. AI assists with code generation. This isn't hidden or shameful — it's stated plainly, with the caveat that if you're uncomfortable with AI-assisted code, this project isn't for you.

It's a pragmatic approach to building complex systems software in 2026, and the results speak for themselves.

🙏 Standing on the Shoulders of Giants

The project's relationship with llama.cpp and GGML deserves special mention. From the README:

"ds4.c does not link against GGML, but it exists thanks to the path opened by the llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won engineering knowledge developed there."

Borrowed and adapted elements include:

GGUF quantization layouts and tables
CPU quantization and dot-product logic
Certain Metal kernels
GGML authors' copyright notice retained in the LICENSE file

This is open-source at its best — building upon the collective wisdom of the community while pushing in a bold new direction. The MIT license ensures this knowledge continues to flow freely.

🚀 Quick Start

Ready to run DeepSeek V4 Flash on your Mac? Here's how to get started with ds4.c:

Prerequisites

A Mac with Apple Silicon (M-series chip)
128 GB RAM minimum (for 2-bit quantization)
256 GB RAM recommended (for 4-bit quantization)
macOS with Metal support

Installation & Model Download

Clone the repository and download your preferred quantization:

# Clone the repository
git clone https://github.com/antirez/ds4.git
cd ds4

# Build the project
make

# Download the 2-bit model (for 128 GB RAM machines)
./download_model.sh q2

# OR download the 4-bit model (for >= 256 GB RAM machines)
./download_model.sh q4

The download script fetches the specially crafted GGUF files from https://huggingface.co/antirez/deepseek-v4-gguf and stores them locally.

Running Inference

Once the model is downloaded, start the inference server with the HTTP API:

# Start the inference server
./ds4 serve --model ./models/deepseek-v4-flash-q2.gguf

The server exposes an HTTP API compatible with common agent frameworks, making integration with coding assistants and tool-calling pipelines straightforward.

🔮 Looking Ahead

The project is self-described as alpha quality code, but the roadmap hints at exciting possibilities:

Updated DeepSeek V4 Flash models — the team expects DeepSeek to release improved versions, and the engine is designed to evolve with them
Possible CUDA support — mentioned as a "perhaps," but nothing more
Continued validation — more context sizes, more agent integration, more edge-case testing

The constraint remains firm: local inference credible on high-end personal machines or Mac Studios, starting from 128GB of memory.

📊 How It Compares

Aspect	ds4.c	llama.cpp	Ollama
Scope	Single model (DeepSeek V4 Flash)	General-purpose	General-purpose
GPU Backend	Metal only	Metal, CUDA, Vulkan, CPU	Multiple
KV Cache	Disk-persistent, compressed	RAM-based	RAM-based
Quantization	Custom asymmetric 2-bit	Standard GGUF quantizations	Standard GGUF quantizations
Validation	Official logit comparison	Community tested	Community tested
Best For	DeepSeek V4 Flash power users	Broad model support	Easy local deployment

This isn't about being "better" than llama.cpp — it's about being different. Where llama.cpp casts a wide net, ds4.c goes deep.

🏆 Verdict: A Masterclass in Focused Engineering

ds4.c is not for everyone, and it doesn't try to be. It's for the developer or researcher who has a Mac Studio with serious RAM, wants to run DeepSeek V4 Flash at its absolute best, and appreciates software that does one thing with extraordinary care.

The asymmetric quantization strategy alone is worth studying — it's a masterclass in understanding where quality lives in a model and protecting those components at all costs. The disk-persistent KV cache is a glimpse into the future of local inference. And the transparent development methodology sets a standard for how AI-assisted open-source projects should communicate with their communities.

Is it alpha software? Yes. Is there a kernel bug lurking in the CPU path? Apparently. Does it only run on Metal? Absolutely.

But within its deliberately narrow scope, ds4.c achieves something remarkable: it makes a 284-billion-parameter model feel finished on personal hardware. Not just runnable. Not just demo-able. Finished.

If you have the hardware and the curiosity, this project deserves your attention.

⭐ Star the repository: antirez/ds4 on GitHub 📦 License: MIT 🍎 Platform: macOS (Apple Silicon, Metal) 🧠 Model: DeepSeek V4 Flash

antirez/ds4: DeepSeek 4 Flash local inference engine for Metal