Run DeepSeek V4 Flash Locally with llama.cpp on a Single GPU

May 24, 2026 • guides

What You're Setting Up

DeepSeek V4 Flash is the smaller model in the V4 series: 284B total parameters, with 37B active per forward pass via sparse MoE. The FP4+FP8 mixed GGUF used in this guide comes in at ~146GB. The goal is to run it on a single GPU rather than a multi-GPU vLLM cluster.

The challenge: as of mid-2026, upstream llama.cpp does not fully support DeepSeek V4's architecture. The model uses a hybrid CSA+HCA attention mechanism and native FP4/FP8 sparse expert weights that require a community patch to load correctly. This guide uses a WIP branch (wip/deepseek-v4-support) that provides the necessary architecture support.

Requirements:

A GPU with ≥96GB VRAM (RTX Pro 6000 Blackwell is the sweet spot — 96GB at lower cost than H100)
Or a cloud GPU pod (this guide uses RunPod, but Lambda/Vast.ai work similarly)
System RAM for the weights that do not fit in VRAM (the GGUF is ~146GB, so at least ~50GB is offloaded from a 96GB card)
Linux environment (Ubuntu 22.04 recommended)
Hugging Face account + access token

Step 1: Provision the GPU Environment

If running on RunPod, create a new GPU pod with these settings:

Setting	Value
GPU	RTX PRO 6000 (96GB VRAM)
Container Disk	50 GB
Volume Disk	300 GB
Exposed Port	8910
Template	Latest PyTorch
Environment Variable	`HF_TOKEN` = your HF token

Port 8910 is where llama-server will expose the web UI. Set your HF_TOKEN as an environment variable on the pod — it's needed for authenticated HF downloads.

Once the pod is running, open JupyterLab and launch a terminal. Verify the GPU:

nvidia-smi

Expected output includes the GPU model, total VRAM, CUDA version, and driver version. If you see the RTX Pro 6000 with 96GB, the pod is configured correctly.
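
The same check can be scripted rather than read off the table by eye. The sketch below is one way to do it, assuming `nvidia-smi` is on the PATH; the 90000 MiB cutoff and the `vram_ok` helper are illustrative choices, not values from the pod documentation.

```shell
# Succeed if the reported VRAM (MiB, first argument) meets the
# threshold (MiB, second argument).
vram_ok() {
  [ "$1" -ge "$2" ]
}

# Assumption: an illustrative cutoff a little below a nominal 96GB card.
MIN_VRAM_MIB=90000

if command -v nvidia-smi >/dev/null 2>&1; then
  # With "nounits", memory.total is printed as a bare number in MiB.
  total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
  if vram_ok "$total" "$MIN_VRAM_MIB"; then
    echo "OK: ${total} MiB VRAM on GPU 0"
  else
    echo "Unexpected VRAM on GPU 0: ${total} MiB"
  fi
else
  echo "nvidia-smi not found; is the NVIDIA driver visible in this container?"
fi
```

Only the first GPU is inspected, which matches the single-GPU setup in this guide.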

Step 2: Install Build Dependencies

apt-get update

apt-get install -y \
  pciutils \
  build-essential \
  cmake \
  git \
  curl \
  wget \
  libcurl4-openssl-dev \
  tmux \
  python3 \
  python3-pip \
  python3-venv

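These install the compiler toolchain and CMake (required to build llama.cpp), Git, the libcurl development headers, Python, and general utilities. The CUDA toolkit is not in this list; the build in Step 3 assumes the pod image already provides it.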

Step 3: Build the Modified llama.cpp

Move to the workspace directory:

cd /workspace

Clone the WIP branch with DeepSeek V4 support:

git clone -b wip/deepseek-v4-support \
  https://github.com/nisparks/llama.cpp.git \
  llama.cpp-deepseek-v4

Configure the build with CUDA enabled:

cmake llama.cpp-deepseek-v4 \
  -B llama.cpp-deepseek-v4/build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release

Build the inference binaries:

cmake --build llama.cpp-deepseek-v4/build \
  --config Release \
  -j \
  --clean-first \
  --target llama-cli llama-server llama-gguf-split

Copy the compiled binaries to the project root:

cp llama.cpp-deepseek-v4/build/bin/llama-* llama.cpp-deepseek-v4/

Verify the build:

llama.cpp-deepseek-v4/llama-server --help

If the help menu prints successfully, the build is complete.

Step 4: Download the Model

Install Hugging Face download tools:

pip install -U "huggingface_hub[hf_xet]" hf-xet hf_transfer

Enable the fast transfer backend:

export HF_HUB_ENABLE_HF_TRANSFER=1

Create the model directory:

mkdir -p /workspace/models/deepseek-v4-flash-fp4-fp8
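
Before starting a 146GB download it is worth confirming the volume has room for it. The following is a minimal sketch: `free_gb` is a hypothetical helper, and it reports binary gigabytes (GiB), the same unit `ls -lh` uses.

```shell
# Print free space, in whole GiB, on the filesystem holding the given path.
free_gb() {
  # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
  # column 4 of the second line is the available space.
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

MODEL_GB=146   # file size quoted in this guide

# Only run the check where the guide's volume mount exists.
if [ -d /workspace ]; then
  avail=$(free_gb /workspace)
  if [ "$avail" -gt "$MODEL_GB" ]; then
    echo "OK: ${avail} GiB free on /workspace"
  else
    echo "Not enough space: ${avail} GiB free, need more than ${MODEL_GB} GiB"
  fi
fi
```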

Download the GGUF:

huggingface-cli download nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF \
  DeepSeek-V4-Flash-FP4-FP8-native.gguf \
  --local-dir /workspace/models/deepseek-v4-flash-fp4-fp8

With HF_HUB_ENABLE_HF_TRANSFER=1 and a valid HF_TOKEN, downloads can reach ~2GB/s on high-bandwidth cloud pods. The 146GB file takes 1–2 minutes at those speeds.
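
The time estimate is plain division, and the same arithmetic tells you what to expect on a slower link. A small sketch, where `eta_seconds` is a hypothetical helper and the 500 MB/s figure is an illustrative rate rather than a measurement:

```shell
# Estimated transfer time in whole seconds.
# $1 = file size in GB (decimal), $2 = sustained rate in MB/s.
eta_seconds() {
  echo $(( $1 * 1000 / $2 ))
}

eta_seconds 146 2000   # ~2 GB/s, the peak rate quoted above: prints 73
eta_seconds 146 500    # illustrative slower link: prints 292
```

At the quoted peak rate the file lands in a little over a minute; at a quarter of that rate it takes roughly five.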

Verify after download:

ls -lh /workspace/models/deepseek-v4-flash-fp4-fp8

Expected:

total 146G
-rw-rw-rw- 1 root root 146G DeepSeek-V4-Flash-FP4-FP8-native.gguf

Step 5: Start llama-server

cd /workspace/llama.cpp-deepseek-v4

./llama-server \
  --model /workspace/models/deepseek-v4-flash-fp4-fp8/DeepSeek-V4-Flash-FP4-FP8-native.gguf \
  --alias "DeepSeek-V4-Flash" \
  --host 0.0.0.0 \
  --port 8910 \
  --jinja \
  --fit on \
  --threads 16 \
  --threads-batch 16 \
  --ctx-size 32768 \
  --batch-size 2048 \
  --ubatch-size 512 \
  --flash-attn on \
  --temp 0.7 \
  --top-p 0.95 \
  --cont-batching \
  --metrics \
  --perf

Key flags explained:

--fit on — automatically fits the model across available GPU and CPU memory using the optimal layer split
--ctx-size 32768 — 32K context window; increase to 65536 if VRAM headroom allows
--flash-attn on — enables Flash Attention for faster inference and lower memory at long contexts
--cont-batching — continuous batching for serving multiple concurrent requests
--metrics — exposes a /metrics endpoint for throughput monitoring
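
The /metrics endpoint returns Prometheus text format, one `name value` pair per line, so a single value can be pulled out with awk. This is a sketch under one assumption: the metric name `llamacpp:predicted_tokens_seconds` is the one documented for upstream llama-server, and the WIP branch may or may not keep it. `metric_value` is a hypothetical helper.

```shell
# Print the value of one metric from Prometheus text-format input on stdin.
# $1 = exact metric name; comment lines (starting with '#') never match.
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Usage once the server is running (port taken from this guide):
#   curl -s http://localhost:8910/metrics \
#     | metric_value 'llamacpp:predicted_tokens_seconds'
```

If the command prints nothing, list the names the branch actually exposes with `curl -s http://localhost:8910/metrics | grep -v '^#'`.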

The model takes 60–120 seconds to load into GPU memory. When ready, the terminal prints:

llama server listening at http://0.0.0.0:8910

On RunPod, navigate to the pod's Exposed Ports panel and click the link for port 8910 to open the llama.cpp Web UI in your browser.
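
If you start the server from a script or inside tmux, you can wait for it to finish loading instead of watching the terminal. The sketch below polls the /health endpoint that upstream llama-server provides; whether the WIP branch keeps it unchanged is an assumption, and `wait_for_health` is a hypothetical helper.

```shell
# Poll a URL about once per second until it answers with HTTP success
# or roughly $2 attempts have passed.
# $1 = URL, $2 = number of attempts. Returns 0 when ready, 1 on timeout.
wait_for_health() {
  url=$1
  timeout=$2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -f makes curl fail on HTTP errors, such as a "still loading" response.
    if curl -sf --max-time 2 -o /dev/null "$url"; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Usage (port taken from this guide; 300 is an illustrative limit):
#   wait_for_health http://localhost:8910/health 300 && echo "server ready"
```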

Step 6: Test the Model

The Web UI exposes a chat interface. You can also use curl against the OpenAI-compatible endpoint:

curl http://localhost:8910/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Solve: 5x + 12 = 47. Show your work."}],
    "temperature": 0.7
  }'

Benchmark observations (single RTX Pro 6000):

Task	Performance	Notes
Structured math reasoning	Strong	Correct step-by-step algebra, proper variable substitution
Writing and explanation	Good	Clear prose; slightly generic conclusions
UI / HTML generation	Average	Functional output, weak visual design quality
Full Python project gen	Weak	Syntax errors, broken f-strings; needs post-generation debugging
Throughput (~32K ctx)	~8–12 tok/s	On a single RTX Pro 6000; bottlenecked by model size

The 8–12 tokens/second throughput reflects significant CPU offloading. The GGUF is 146GB and the GPU has 96GB of VRAM, so at least ~50GB of weights must stay in system memory and run on the CPU rather than the GPU. If throughput matters more than running the full model, consider 30B–70B alternatives like Qwen3-30B-A3B or a quantized Llama 3.3 70B.
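
The size of the offload follows directly from the file size and the VRAM figure. A minimal sketch of that arithmetic, with a check of installed RAM for comparison; `min_offload_gb` is a hypothetical helper, and the result is a lower bound because the KV cache and compute buffers also take VRAM.

```shell
# Minimum GB of weights that cannot fit in VRAM (0 if the model fits).
# $1 = model size in GB, $2 = VRAM in GB.
min_offload_gb() {
  if [ "$1" -gt "$2" ]; then
    echo $(( $1 - $2 ))
  else
    echo 0
  fi
}

min_offload_gb 146 96   # prints 50

# Compare with the RAM installed on this machine (Linux only).
if [ -r /proc/meminfo ]; then
  awk '/^MemTotal:/ { printf "System RAM: %d GiB\n", $2 / 1024 / 1024 }' /proc/meminfo
fi
```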

Practical Verdict

Running DeepSeek V4 Flash locally is possible but not seamless. The main friction points:

No turnkey GGUF from established providers (Unsloth, bartowski) as of this writing
The community branch (wip/deepseek-v4-support) is pre-merge; expect rough edges
8–12 tok/s on one 96GB GPU is slow for interactive use

If your goal is to evaluate V4 Flash locally, this setup works. If your goal is production or developer-friendly local inference, Qwen3-32B or a Llama 3.3 70B INT4 quantization delivers better results with far less setup friction.

