Run DeepSeek V4 Flash Locally with llama.cpp on a Single GPU

May 24, 2026 · guides

What You're Setting Up

DeepSeek V4 Flash is the smaller model in the V4 series — 284B total parameters, 37B active per forward pass via sparse MoE, shipping with an FP4+FP8 mixed GGUF that comes in at ~146GB. The goal here is running it on a single GPU rather than a multi-GPU vLLM cluster.

The challenge: as of mid-2026, upstream llama.cpp does not fully support DeepSeek V4's architecture. The model uses a hybrid CSA+HCA attention mechanism and native FP4/FP8 sparse expert weights that require a community patch to load correctly. This guide uses a WIP branch (wip/deepseek-v4-support) that provides the necessary architecture support.

Requirements:

  • A GPU with ≥96GB VRAM (RTX Pro 6000 Blackwell is the sweet spot: 96GB at lower cost than H100), plus enough system RAM for the part of the 146GB GGUF that does not fit in VRAM
  • Or a cloud GPU pod (this guide uses RunPod, but Lambda/Vast.ai work similarly)
  • Linux environment (Ubuntu 22.04 recommended)
  • Hugging Face account + access token

Step 1: Provision the GPU Environment

If running on RunPod, create a new GPU pod with these settings:

Setting | Value
GPU | RTX PRO 6000 (96GB VRAM)
Container Disk | 50 GB
Volume Disk | 300 GB
Exposed Port | 8910
Template | Latest PyTorch
Environment Variable | HF_TOKEN = your HF token

Port 8910 is where llama-server will expose the web UI. Set your HF_TOKEN as an environment variable on the pod — it's needed for authenticated HF downloads.

Once the pod is running, open JupyterLab and launch a terminal. Verify the GPU:

nvidia-smi

Expected output includes the GPU model, total VRAM, CUDA version, and driver version. If you see the RTX Pro 6000 with 96GB, the pod is configured correctly.
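The same check can be scripted rather than read off the screen. The sketch below is a minimal example: the helper name vram_ok and the 90000 MiB default threshold are assumptions made here (a card sold as 96GB may report somewhat less than 98304 MiB), while the --query-gpu and --format options are standard nvidia-smi flags.

```shell
# Hypothetical helper: succeed if a reported VRAM figure (in MiB) meets a minimum.
# The 90000 MiB default is an assumed threshold, not an official number.
vram_ok() {
  total_mib=$1
  min_mib=${2:-90000}
  [ "$total_mib" -ge "$min_mib" ]
}

# Query the first GPU, but only when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
  if vram_ok "$total"; then
    echo "OK: ${total} MiB VRAM"
  else
    echo "WARNING: only ${total} MiB VRAM reported"
  fi
fi
```

A warning here usually means the pod was created with a different GPU type than the one in the settings table.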


Step 2: Install Build Dependencies

apt-get update

apt-get install -y \
  pciutils \
  build-essential \
  cmake \
  git \
  curl \
  wget \
  libcurl4-openssl-dev \
  tmux \
  python3 \
  python3-pip \
  python3-venv

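These install the compiler toolchain and CMake (required to build llama.cpp), the libcurl development headers, Python, and general utilities. The CUDA toolkit is not part of this list; the build in Step 3 assumes the pod image already provides it.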
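Since none of the apt packages listed here provides the CUDA compiler, it may be worth confirming that nvcc is on PATH before starting the build. The following is a small sketch; missing_tools is a hypothetical helper name, and the assumption that nvcc ships with the pod image should be checked on your own pod.

```shell
# Print the name of each listed command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# nvcc is assumed to come from the pod image, not from the apt packages.
missing=$(missing_tools cmake git curl python3 nvcc)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All build tools found"
fi
```

If nvcc is reported missing, the CUDA-enabled configure step is likely to fail until a CUDA toolkit is installed or a CUDA development image is used.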


Step 3: Build the Modified llama.cpp

Move to the workspace directory:

cd /workspace

Clone the WIP branch with DeepSeek V4 support:

git clone -b wip/deepseek-v4-support \
  https://github.com/nisparks/llama.cpp.git \
  llama.cpp-deepseek-v4

Configure the build with CUDA enabled:

cmake llama.cpp-deepseek-v4 \
  -B llama.cpp-deepseek-v4/build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release

Build the inference binaries:

cmake --build llama.cpp-deepseek-v4/build \
  --config Release \
  -j \
  --clean-first \
  --target llama-cli llama-server llama-gguf-split

Copy the compiled binaries to the project root:

cp llama.cpp-deepseek-v4/build/bin/llama-* llama.cpp-deepseek-v4/

Verify the build:

llama.cpp-deepseek-v4/llama-server --help

If the help menu prints successfully, the build is complete.


Step 4: Download the Model

Install Hugging Face download tools:

pip install -U "huggingface_hub[hf_xet]" hf-xet hf_transfer

Enable the fast transfer backend:

export HF_HUB_ENABLE_HF_TRANSFER=1

Create the model directory:

mkdir -p /workspace/models/deepseek-v4-flash-fp4-fp8

Download the GGUF:

huggingface-cli download nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF \
  DeepSeek-V4-Flash-FP4-FP8-native.gguf \
  --local-dir /workspace/models/deepseek-v4-flash-fp4-fp8

With HF_HUB_ENABLE_HF_TRANSFER=1 and a valid HF_TOKEN, downloads can reach ~2GB/s on high-bandwidth cloud pods. The 146GB file takes 1–2 minutes at those speeds.
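The time estimate is simple division, and it changes quickly if your pod does not reach the peak rate. A minimal sketch of the arithmetic follows; eta_seconds is a hypothetical helper, and it assumes decimal units (1GB = 1000MB) and a constant transfer rate.

```shell
# Estimated transfer time in whole seconds, rounded up.
# Arguments: file size in GB, sustained rate in MB/s (both integers).
eta_seconds() {
  size_gb=$1
  rate_mbps=$2
  echo $(( (size_gb * 1000 + rate_mbps - 1) / rate_mbps ))
}

eta_seconds 146 2000   # 73 seconds at roughly 2GB/s
eta_seconds 146 500    # 292 seconds at 500MB/s
```

At a quarter of the peak rate the download still finishes in about five minutes, so the model fetch is rarely the slow part of this setup.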

Verify after download:

ls -lh /workspace/models/deepseek-v4-flash-fp4-fp8

Expected:

total 146G
-rw-rw-rw- 1 root root 146G DeepSeek-V4-Flash-FP4-FP8-native.gguf
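A file of the right size can still be wrong, for example an error page saved under the .gguf name. GGUF files begin with the four ASCII bytes "GGUF", so a quick header check is possible. This is a sketch only; is_gguf is a hypothetical helper, and it does not validate anything beyond the magic bytes.

```shell
# Succeed if the file begins with the 4-byte GGUF magic ("GGUF").
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

model=/workspace/models/deepseek-v4-flash-fp4-fp8/DeepSeek-V4-Flash-FP4-FP8-native.gguf
if is_gguf "$model"; then
  echo "GGUF header found"
else
  echo "Not a GGUF file (or file missing): $model"
fi
```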

Step 5: Start llama-server

cd /workspace/llama.cpp-deepseek-v4

./llama-server \
  --model /workspace/models/deepseek-v4-flash-fp4-fp8/DeepSeek-V4-Flash-FP4-FP8-native.gguf \
  --alias "DeepSeek-V4-Flash" \
  --host 0.0.0.0 \
  --port 8910 \
  --jinja \
  --fit on \
  --threads 16 \
  --threads-batch 16 \
  --ctx-size 32768 \
  --batch-size 2048 \
  --ubatch-size 512 \
  --flash-attn on \
  --temp 0.7 \
  --top-p 0.95 \
  --cont-batching \
  --metrics \
  --perf

Key flags explained:

  • --fit on — automatically fits the model across available GPU and CPU memory using the optimal layer split
  • --ctx-size 32768 — 32K context window; increase to 65536 if VRAM headroom allows
  • --flash-attn on — enables Flash Attention for faster inference and lower memory at long contexts
  • --cont-batching — continuous batching for serving multiple concurrent requests
  • --metrics — exposes a /metrics endpoint for throughput monitoring

The model takes 60–120 seconds to load. Because the 146GB GGUF exceeds the 96GB of VRAM, the weights end up split between GPU and system memory. When ready, the terminal prints:

llama server listening at http://0.0.0.0:8910

On RunPod, navigate to the pod's Exposed Ports panel and click the link for port 8910 to open the llama.cpp Web UI in your browser.
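Instead of watching the terminal during the load, a script can poll the server until it answers. Upstream llama-server exposes a /health endpoint that returns a non-2xx status while the model is loading; that this WIP branch behaves the same way is an assumption. The helper name wait_for_server and its defaults are hypothetical.

```shell
# Poll a URL until it returns HTTP 2xx, or give up after a number of attempts.
# Arguments: URL, max attempts (default 60), seconds between attempts (default 5).
wait_for_server() {
  url=$1
  tries=${2:-60}
  delay=${3:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503 (still loading).
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (run in a second terminal or tmux pane):
#   wait_for_server http://localhost:8910/health && echo "server ready"
```

With the defaults this waits up to about five minutes, which leaves margin over the load time quoted above.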


Step 6: Test the Model

The Web UI exposes a chat interface. You can also use curl against the OpenAI-compatible endpoint:

curl http://localhost:8910/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Solve: 5x + 12 = 47. Show your work."}],
    "temperature": 0.7
  }'
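To put a throughput number on a request like this one, divide the generated token count by the elapsed time. The sketch below assumes you read the token count from the usage.completion_tokens field of the response (the standard OpenAI-style field; whether this branch populates it is an assumption) and time the request yourself. tok_per_sec is a hypothetical helper.

```shell
# Tokens per second from a completion token count and elapsed milliseconds.
tok_per_sec() {
  awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", n * 1000 / ms }'
}

# Illustrative numbers, not a measurement: 300 tokens generated in 30 seconds.
tok_per_sec 300 30000   # prints 10.0
```

Note that wall-clock timing includes prompt processing, so short completions after long prompts will understate generation speed.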

Benchmark observations (single RTX Pro 6000):

Task | Performance | Notes
Structured math reasoning | Strong | Correct step-by-step algebra, proper variable substitution
Writing and explanation | Good | Clear prose; slightly generic conclusions
UI / HTML generation | Average | Functional output, weak visual design quality
Full Python project gen | Weak | Syntax errors, broken f-strings; needs post-generation debugging
Throughput (~32K ctx) | ~8–12 tok/s | On a single RTX Pro 6000; bottlenecked by model size

The 8–12 tokens/second throughput on a 146GB GGUF with 96GB VRAM reflects significant CPU offloading — a portion of layers run on CPU rather than GPU. If throughput matters more than running the full model, consider 30B–70B alternatives like Qwen3-30B-A3B or a quantized Llama 3.3 70B.
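The size of the offload follows directly from the two numbers in this guide. As a rough sketch (min_cpu_offload_gb is a hypothetical helper, and it ignores the KV cache and compute buffers, which also need VRAM and so push the real figure higher):

```shell
# Minimum GB of weights that cannot live in VRAM (0 if the model fits entirely).
min_cpu_offload_gb() {
  model_gb=$1
  vram_gb=$2
  if [ "$model_gb" -gt "$vram_gb" ]; then
    echo $((model_gb - vram_gb))
  else
    echo 0
  fi
}

min_cpu_offload_gb 146 96   # prints 50
```

So at least 50GB of the model is served from system memory, which is why throughput stays well below what a fully GPU-resident model would reach.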


Practical Verdict

Running DeepSeek V4 Flash locally is possible but not seamless. The main friction points:

  • No turnkey GGUF from established providers (Unsloth, bartowski) as of this writing
  • The community branch (wip/deepseek-v4-support) is pre-merge; expect rough edges
  • 8–12 tok/s on one 96GB GPU is slow for interactive use

If your goal is evaluating V4 Flash locally: this setup works. If your goal is production or developer-friendly local inference, Qwen3-32B or a Llama 3.3 70B INT4 quantization delivers better results with far less setup friction.

