Run DeepSeek V4 Flash Locally with llama.cpp on a Single GPU
What You're Setting Up
DeepSeek V4 Flash is the smaller model in the V4 series — 284B total parameters, 37B active per forward pass via sparse MoE, shipping with an FP4+FP8 mixed GGUF that comes in at ~146GB. The goal here is running it on a single GPU rather than a multi-GPU vLLM cluster.
The challenge: as of mid-2026, upstream llama.cpp does not fully support DeepSeek V4's architecture. The model uses a hybrid CSA+HCA attention mechanism and native FP4/FP8 sparse expert weights that require a community patch to load correctly. This guide uses a WIP branch (wip/deepseek-v4-support) that provides the necessary architecture support.
Requirements:
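- A GPU with ≥96GB VRAM (the RTX Pro 6000 Blackwell is the sweet spot: 96GB at lower cost than an H100)
- Or a cloud GPU pod (this guide uses RunPod, but Lambda and Vast.ai work similarly)
- Enough system RAM for the weights that do not fit in VRAM (the ~146GB GGUF exceeds 96GB, so roughly 50GB or more is offloaded to the CPU)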
- Linux environment (Ubuntu 22.04 recommended)
- Hugging Face account + access token
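Because the ~146GB GGUF is larger than 96GB of VRAM, part of the weights has to sit in system RAM. The sketch below works through that budget; the function name offload_gb is illustrative only, and the estimate ignores the KV cache and runtime buffers, so it should be read as a lower bound rather than a sizing rule.

```shell
# Rough estimate of how many GB of weights spill from VRAM into system RAM.
# Ignores KV cache, compute buffers, and OS overhead (assumption: lower bound only).
offload_gb() {
  model_gb=$1
  vram_gb=$2
  spill=$(( model_gb - vram_gb ))
  if [ "$spill" -lt 0 ]; then spill=0; fi
  echo "$spill"
}

offload_gb 146 96   # prints 50
```

A pod with noticeably more than 50GB of free system RAM leaves room for the context cache on top of the offloaded weights.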
Step 1: Provision the GPU Environment
If running on RunPod, create a new GPU pod with these settings:
| Setting | Value |
|---|---|
| GPU | RTX PRO 6000 (96GB VRAM) |
| Container Disk | 50 GB |
| Volume Disk | 300 GB |
| Exposed Port | 8910 |
| Template | Latest PyTorch |
| Environment Variable | HF_TOKEN = your HF token |
Port 8910 is where llama-server will expose the web UI. Set your HF_TOKEN as an environment variable on the pod — it's needed for authenticated HF downloads.
Once the pod is running, open JupyterLab and launch a terminal. Verify the GPU:
nvidia-smi
Expected output includes the GPU model, total VRAM, CUDA version, and driver version. If you see the RTX Pro 6000 with 96GB, the pod is configured correctly.
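For a scripted version of this check, nvidia-smi can print just the GPU name and total memory in CSV form. The helper below is a minimal sketch: vram_ok is a hypothetical name, and the 90000 MiB cutoff in the usage comment is an assumed threshold for a card marketed as 96GB, not a documented value.

```shell
# Returns 0 if the first GPU line on stdin reports at least min_mib MiB of memory.
# Expected input: lines like "GPU name, 97887", as produced by
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
vram_ok() {
  min_mib=$1
  awk -F', *' -v min="$min_mib" \
    'NR == 1 { ok = ($NF + 0 >= min + 0) } END { exit ok ? 0 : 1 }'
}

# Usage on the pod (threshold is an assumption):
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits \
#     | vram_ok 90000 && echo "VRAM OK"
```

Reading the last CSV field rather than the second keeps the check working even if a GPU name happens to contain a comma.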
Step 2: Install Build Dependencies
apt-get update
apt-get install -y \
pciutils \
build-essential \
cmake \
git \
curl \
wget \
libcurl4-openssl-dev \
tmux \
python3 \
python3-pip \
python3-venv
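These install CMake and the compiler toolchain (required to build llama.cpp), the libcurl development headers, Python, and general utilities. The CUDA toolkit itself is not in this list; the build relies on the one that ships with the pod's PyTorch template.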
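Before starting the build, it can help to confirm that the tools are actually on the PATH. The apt list above contains no CUDA package, so nvcc has to come from the pod image; including it in the check is an assumption about the "Latest PyTorch" template. missing_tools is a hypothetical helper name.

```shell
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# No output means everything was found.
# nvcc is listed on the assumption that the CUDA toolkit ships with the image.
missing_tools cmake git curl python3 nvcc
```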
Step 3: Build the Modified llama.cpp
Move to the workspace directory:
cd /workspace
Clone the WIP branch with DeepSeek V4 support:
git clone -b wip/deepseek-v4-support \
https://github.com/nisparks/llama.cpp.git \
llama.cpp-deepseek-v4
Configure the build with CUDA enabled:
cmake llama.cpp-deepseek-v4 \
-B llama.cpp-deepseek-v4/build \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release
Build the inference binaries:
cmake --build llama.cpp-deepseek-v4/build \
--config Release \
-j \
--clean-first \
--target llama-cli llama-server llama-gguf-split
Copy the compiled binaries to the project root:
cp llama.cpp-deepseek-v4/build/bin/llama-* llama.cpp-deepseek-v4/
Verify the build:
llama.cpp-deepseek-v4/llama-server --help
If the help menu prints successfully, the build is complete.
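The build step targets three binaries, so a slightly broader check is to confirm that all of them were copied. This is a small sketch; check_binaries is an illustrative name, and the directory in the usage comment is the one used in this guide.

```shell
# Report which of the three build targets are missing or not executable in a directory.
check_binaries() {
  dir=$1
  status=0
  for bin in llama-cli llama-server llama-gguf-split; do
    if [ ! -x "$dir/$bin" ]; then
      echo "missing: $dir/$bin"
      status=1
    fi
  done
  return $status
}

# Usage:
#   check_binaries /workspace/llama.cpp-deepseek-v4 && echo "all binaries present"
```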
Step 4: Download the Model
Install Hugging Face download tools:
pip install -U "huggingface_hub[hf_xet]" hf-xet hf_transfer
Enable the fast transfer backend:
export HF_HUB_ENABLE_HF_TRANSFER=1
Create the model directory:
mkdir -p /workspace/models/deepseek-v4-flash-fp4-fp8
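Since the file is about 146GB, it is worth checking free space on the volume before downloading. The helper below is a sketch using POSIX df output; has_space is a hypothetical name, and the 150 in the usage comment is an assumed safety margin, not a measured requirement.

```shell
# Returns 0 if the filesystem holding dir has at least need_gb GiB available.
has_space() {
  dir=$1
  need_gb=$2
  # df -P gives one line per filesystem; column 4 is available space in KiB with -k.
  free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$free_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}

# Usage:
#   has_space /workspace/models 150 || echo "not enough free space for the GGUF"
```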
Download the GGUF:
huggingface-cli download nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF \
DeepSeek-V4-Flash-FP4-FP8-native.gguf \
--local-dir /workspace/models/deepseek-v4-flash-fp4-fp8
With HF_HUB_ENABLE_HF_TRANSFER=1 and a valid HF_TOKEN, downloads can reach ~2GB/s on high-bandwidth cloud pods. The 146GB file takes 1–2 minutes at those speeds.
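The time estimate follows from simple division, which the sketch below makes explicit. eta_seconds is an illustrative name, it uses integer arithmetic only, and the 1GB/s case is an assumed slower link for comparison.

```shell
# Estimated transfer time in whole seconds for size_gb at rate_gb_per_s (integers only).
eta_seconds() {
  size_gb=$1
  rate_gb_per_s=$2
  echo $(( size_gb / rate_gb_per_s ))
}

eta_seconds 146 2   # prints 73, a little over one minute
eta_seconds 146 1   # prints 146, about two and a half minutes
```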
Verify after download:
ls -lh /workspace/models/deepseek-v4-flash-fp4-fp8
Expected:
total 146G
-rw-rw-rw- 1 root root 146G DeepSeek-V4-Flash-FP4-FP8-native.gguf
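A size listing does not catch a file that is really an error page or some other non-model payload. GGUF files begin with the four ASCII bytes GGUF, so a header check is a cheap complement. The function name is_gguf is illustrative.

```shell
# Returns 0 if the file starts with the 4-byte GGUF magic ("GGUF").
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Usage:
#   is_gguf /workspace/models/deepseek-v4-flash-fp4-fp8/DeepSeek-V4-Flash-FP4-FP8-native.gguf \
#     && echo "header looks like GGUF"
```

This only inspects the first four bytes; it does not validate the rest of the file or detect a truncated download.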
Step 5: Start llama-server
cd /workspace/llama.cpp-deepseek-v4
./llama-server \
--model /workspace/models/deepseek-v4-flash-fp4-fp8/DeepSeek-V4-Flash-FP4-FP8-native.gguf \
--alias "DeepSeek-V4-Flash" \
--host 0.0.0.0 \
--port 8910 \
--jinja \
--fit on \
--threads 16 \
--threads-batch 16 \
--ctx-size 32768 \
--batch-size 2048 \
--ubatch-size 512 \
--flash-attn on \
--temp 0.7 \
--top-p 0.95 \
--cont-batching \
--metrics \
--perf
Key flags explained:
- --fit on: automatically fits the model across available GPU and CPU memory using the optimal layer split
- --ctx-size 32768: 32K context window; increase to 65536 if VRAM headroom allows
- --flash-attn on: enables Flash Attention for faster inference and lower memory use at long contexts
- --cont-batching: continuous batching for serving multiple concurrent requests
- --metrics: exposes a /metrics endpoint for throughput monitoring
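The /metrics endpoint returns Prometheus text format, which is easy to filter from the shell. The sketch below pulls out a single value; metric_value is a hypothetical helper, and the metric name in the usage comment is the one upstream llama-server reports, so whether the WIP branch emits it unchanged is an assumption.

```shell
# Print the value of one metric from Prometheus text-format input on stdin.
# Comment lines ("# HELP ...", "# TYPE ...") are skipped because their first field is "#".
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Usage once the server is running:
#   curl -s http://localhost:8910/metrics | metric_value llamacpp:predicted_tokens_seconds
```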
The model takes 60–120 seconds to load. When ready, the terminal prints a line similar to the following (the exact wording varies between llama.cpp versions):
llama server listening at http://0.0.0.0:8910
On RunPod, navigate to the pod's Exposed Ports panel and click the link for port 8910 to open the llama.cpp Web UI in your browser.
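Rather than watching the terminal, a script can poll the server until it is ready. Upstream llama-server provides a /health endpoint that returns HTTP 200 once the model is loaded; that the WIP branch behaves the same way is an assumption. wait_for_health is a hypothetical helper name.

```shell
# Poll a health URL until it returns HTTP 200 or the attempts run out.
# Arguments: URL, number of attempts, seconds to sleep between attempts.
wait_for_health() {
  url=$1
  tries=$2
  delay=$3
  i=0
  while [ "$i" -lt "$tries" ]; do
    # curl prints 000 as the status code when the connection fails.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" 2>/dev/null || true)
    [ "$code" = "200" ] && return 0
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Usage (60 attempts, 5 seconds apart, so up to 5 minutes):
#   wait_for_health http://localhost:8910/health 60 5 && echo "server ready"
```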
Step 6: Test the Model
The Web UI exposes a chat interface. You can also use curl against the OpenAI-compatible endpoint:
curl http://localhost:8910/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "Solve: 5x + 12 = 47. Show your work."}],
"temperature": 0.7
}'
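The endpoint returns a JSON object in the OpenAI chat completion shape, with the reply under choices[0].message.content. A small sketch for printing only that text, using the python3 installed in Step 2; extract_reply is an illustrative name.

```shell
# Print the assistant's reply text from an OpenAI-style chat completion JSON on stdin.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage: append "| extract_reply" to the curl command above, adding -s to curl
# so the progress meter does not clutter the terminal.
```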
Benchmark observations (single RTX Pro 6000):
| Task | Performance | Notes |
|---|---|---|
| Structured math reasoning | Strong | Correct step-by-step algebra, proper variable substitution |
| Writing and explanation | Good | Clear prose; slightly generic conclusions |
| UI / HTML generation | Average | Functional output, weak visual design quality |
| Full Python project gen | Weak | Syntax errors, broken f-strings; needs post-generation debugging |
| Throughput (~32K ctx) | ~8–12 tok/s | On a single RTX Pro 6000; bottlenecked by model size |
The 8–12 tokens/second throughput on a 146GB GGUF with 96GB VRAM reflects significant CPU offloading — a portion of layers run on CPU rather than GPU. If throughput matters more than running the full model, consider 30B–70B alternatives like Qwen3-30B-A3B or a quantized Llama 3.3 70B.
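To put the throughput figure in practical terms, the sketch below converts it into wait time for a reply. gen_seconds is an illustrative name, and the 1000-token reply length is an assumed example, not a measurement from these tests.

```shell
# Whole seconds needed to generate n_tokens at a given tokens-per-second rate.
gen_seconds() {
  n_tokens=$1
  tok_per_s=$2
  echo $(( n_tokens / tok_per_s ))
}

gen_seconds 1000 8    # prints 125
gen_seconds 1000 12   # prints 83
```

At these rates a 1000-token answer takes roughly one and a half to two minutes, before counting prompt processing time.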
Practical Verdict
Running DeepSeek V4 Flash locally is possible but not seamless. The main friction points:
- No turnkey GGUF from established providers (Unsloth, bartowski) as of this writing
- The community branch (wip/deepseek-v4-support) is pre-merge; expect rough edges
- 8–12 tok/s on one 96GB GPU is slow for interactive use
If your goal is evaluating V4 Flash locally: this setup works. If your goal is production or developer-friendly local inference, Qwen3-32B or a Llama 3.3 70B INT4 quantization delivers better results with far less setup friction.
What to Read Next
- Self-Hosting DeepSeek V4 with vLLM — multi-GPU production deployment with proper tensor parallelism
- How to Run Qwen3 Locally with Ollama — easier local setup for a comparable reasoning model with active community support
- Deploy Models on RunPod — GPU cloud setup guide for any model