Self-Hosting DeepSeek V4 with vLLM: Hardware Requirements and Deployment Guide

May 24, 2026 • guides

Should You Self-Host DeepSeek V4?
Hardware Requirements
V4-Flash
V4-Pro
Downloading Model Weights
vLLM Deployment Setup
Serve V4-Flash on 4× A100 80GB
Serve V4-Flash on 2× H200
Expert Parallelism and Tensor Parallelism
Quantization Options
Configuring the 1M Context Window
AWS Instance Selection
Break-Even Analysis: Self-Host vs DeepSeek API

Should You Self-Host DeepSeek V4?

DeepSeek V4 ships under the MIT license with open weights on Hugging Face: both V4-Flash (284B parameters, FP4+FP8 mixed, ~158GB) and V4-Pro (1.6T parameters, ~862GB). The license permits commercial use, fine-tuning, and modification; its only condition is that the copyright and license notice be retained.

Three legitimate reasons to self-host:

Data sovereignty — DeepSeek's hosted API routes through infrastructure outside your control. For regulated industries, defense-adjacent work, or GDPR-sensitive applications, keeping inference on your own hardware removes that dependency entirely.
Fine-tuning — The MIT license lets you modify and serve fine-tuned variants. The hosted API does not.
Consistent latency — Self-hosted inference has no quota limits or API rate fluctuations.

The reason not to self-host is usually cost. At DeepSeek's own API pricing ($0.14/M input, $0.28/M output), breaking even against a 1-year reserved p5.48xlarge running V4-Flash takes roughly 3–4 billion tokens/day of sustained throughput, and more at on-demand rates. Most teams never reach that volume. Run your own numbers before committing.

Hardware Requirements

V4-Flash

V4-Flash is the practical self-hosting target. The FP4+FP8 Instruct checkpoint weights are ~158GB. Add ~10GB for a full 1M-token KV cache (DeepSeek V4 uses only ~7% of V3.2's KV cache footprint) and a few GB of runtime overhead: total VRAM budget is roughly 170–175GB.

GPU Config	Total VRAM	Fit V4-Flash?	Notes
2× H200 141GB	282 GB	✅ Comfortable	Recommended; headroom for full 1M context
2× RTX Pro 6000 Blackwell	192 GB	✅ Comfortable	More affordable than H200; NVLink required
4× A100 80GB	320 GB	✅ Works	vLLM prefers power-of-2 TP; 2× A100 (160GB) is just under budget
2× A100 80GB	160 GB	⚠️ Tight	Only ~2GB left after the 158GB of weights, below the 5–7GB overhead estimate; may not load even with a much smaller max-model-len
4× RTX 4090	96 GB	❌ No	Only feasible with INT4 quantization (~80GB); expect quality degradation

The 4× A100 recommendation in many guides follows from how tensor parallelism works in vLLM: the tensor-parallel size must divide the model's attention head count evenly, which in practice means power-of-two GPU counts. 2× A100 gives 160GB, which is below the full-context budget, and 3× is generally not a valid split, so the next safe step is 4× (320GB). The extra headroom is a side effect, not a requirement.
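The fit column in the table above can be reproduced with a few lines of arithmetic. This is a rough sketch, not a sizing tool: the constants come from this guide, the helper names are illustrative, KV cache is assumed to scale linearly with context length, and pooled VRAM is treated as one block (ignoring per-GPU fragmentation).

```python
# Figures taken from this guide; the helper names are illustrative.
WEIGHTS_GB = 158        # V4-Flash FP4+FP8 checkpoint
KV_CACHE_1M_GB = 10     # approximate KV cache at full 1M context
OVERHEAD_GB = 7         # upper end of the 5-7GB runtime overhead estimate


def vram_budget_gb(context_tokens: int = 1_048_576) -> float:
    """Estimate total VRAM needed, scaling KV cache linearly with context."""
    kv_gb = KV_CACHE_1M_GB * context_tokens / 1_048_576
    return WEIGHTS_GB + kv_gb + OVERHEAD_GB


def fits(gpu_count: int, gb_per_gpu: int, context_tokens: int = 1_048_576) -> bool:
    """True if pooled VRAM covers the estimated budget."""
    return gpu_count * gb_per_gpu >= vram_budget_gb(context_tokens)


configs = {
    "2x H200 141GB": (2, 141),
    "2x RTX Pro 6000 Blackwell": (2, 96),
    "4x A100 80GB": (4, 80),
    "2x A100 80GB": (2, 80),
    "4x RTX 4090": (4, 24),
}

for name, (count, size) in configs.items():
    verdict = "fits" if fits(count, size) else "does not fit"
    print(f"{name}: {count * size} GB total, {verdict} at 1M context")
```

Swap in your own overhead estimate or a shorter context to see how close a given configuration sits to the line.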

V4-Pro

V4-Pro at ~862GB is a real cluster problem:

Config	Total VRAM	Notes
8× H200 141GB (single node)	1,128 GB	Minimum single-node config
DGX H200	1,128 GB	Purpose-built; NVLink fabric included
2× p5.48xlarge (16× H100 80GB)	1,280 GB	Multi-node; NVLink within each node, EFA networking between nodes

V4-Flash delivers 85–95% of V4-Pro quality at a fraction of the infrastructure cost. Choose V4-Pro only if you specifically need its stronger agentic coding and knowledge depth.

System RAM: 256GB+ for V4-Flash. Storage: 500GB NVMe minimum for V4-Flash; V4-Pro's ~862GB checkpoint needs more than 1TB of free space.

Downloading Model Weights

Both models live under the deepseek-ai organization on Hugging Face. Use the Instruct FP4+FP8 checkpoints for production:

pip install huggingface_hub

# V4-Flash (~158GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir ./deepseek-v4-flash

# V4-Pro (~862GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Pro \
  --local-dir ./deepseek-v4-pro

For faster transfers, enable hf_transfer:

pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1

With an HF token set and a fast network link, downloads can reach 1–2 GB/s.
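For planning, it helps to translate those rates into wall-clock time. The sketch below is simple division under stated assumptions: the sizes are the checkpoint sizes quoted in this guide, the 0.1 GB/s rate is an assumed slower link added for comparison, and retries and disk write speed are ignored.

```python
def transfer_minutes(size_gb: float, rate_gb_per_s: float) -> float:
    """Minutes to move size_gb at a sustained rate, ignoring retries and disk limits."""
    return size_gb / rate_gb_per_s / 60


for model, size_gb in {"V4-Flash": 158, "V4-Pro": 862}.items():
    for rate in (0.1, 1.0, 2.0):  # GB/s; 0.1 GB/s is an assumed slower link
        print(f"{model} at {rate} GB/s: {transfer_minutes(size_gb, rate):.1f} min")
```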

vLLM Deployment Setup

vLLM ≥0.8.0 is the recommended inference framework for DeepSeek V4. It supports MoE expert parallelism, the hybrid CSA+HCA attention architecture, and efficient KV cache management for long contexts.

pip install "vllm>=0.8.0"

Serve V4-Flash on 4× A100 80GB

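# --served-model-name sets the model name clients must send in API requests;
# without it, vLLM registers the literal --model value ("./deepseek-v4-flash").
python -m vllm.entrypoints.openai.api_server \
  --model ./deepseek-v4-flash \
  --served-model-name deepseek-v4-flash \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --trust-remote-code \
  --port 8000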

Serve V4-Flash on 2× H200

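# --served-model-name sets the model name clients must send in API requests;
# without it, vLLM registers the literal --model value ("./deepseek-v4-flash").
python -m vllm.entrypoints.openai.api_server \
  --model ./deepseek-v4-flash \
  --served-model-name deepseek-v4-flash \
  --tensor-parallel-size 2 \
  --max-model-len 1048576 \
  --trust-remote-code \
  --port 8000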
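The two launch commands differ only in tensor-parallel size and context length, so they can be generated from one small table of hardware profiles. The following is a sketch, not part of vLLM: build_serve_command and PROFILES are hypothetical names, and --served-model-name is included so that the name clients send matches the name the server registers.

```python
import shlex


def build_serve_command(model_path: str, tensor_parallel_size: int,
                        max_model_len: int, served_name: str,
                        port: int = 8000) -> list[str]:
    """Assemble the vLLM launch arguments used in this guide."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--served-model-name", served_name,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--trust-remote-code",
        "--port", str(port),
    ]


# The two hardware profiles from the sections above.
PROFILES = {
    "4xA100": {"tensor_parallel_size": 4, "max_model_len": 131072},
    "2xH200": {"tensor_parallel_size": 2, "max_model_len": 1048576},
}

cmd = build_serve_command("./deepseek-v4-flash",
                          served_name="deepseek-v4-flash",
                          **PROFILES["2xH200"])
print(shlex.join(cmd))
# To launch for real: subprocess.run(cmd, check=True)
```

Keeping the profiles in data rather than in copied shell snippets makes it harder for the two configurations to drift apart.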

vLLM exposes an OpenAI-compatible API. Point any OpenAI SDK client at it by changing base_url:

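from openai import OpenAI

# vLLM accepts any api_key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
    # Must match the name the server registered: the --served-model-name value,
    # or the literal --model argument ("./deepseek-v4-flash") if that flag is unset.
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain mixture-of-experts routing."}],
)
print(response.choices[0].message.content)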

Expert Parallelism and Tensor Parallelism

MoE models benefit from two distinct parallelism strategies, and understanding the difference determines how efficiently you use multi-GPU setups:

Tensor Parallelism (TP): Splits individual layer weight matrices across GPUs, so every GPU participates in every token. Needed when the model does not fit on one GPU. Set with --tensor-parallel-size. Communication-heavy: works best on GPUs connected by NVLink.

Expert Parallelism (EP): Places different MoE expert sub-networks on different GPUs and routes each token to the GPUs holding its selected experts. Because MoE activates only a small fraction of experts per token, EP spreads the experts without replicating them on every GPU. It typically has lower communication overhead than TP.

For V4-Flash on the 2× H200 or 4× A100 configurations above, tensor parallelism alone is sufficient. vLLM does not switch to EP on its own as GPU count grows; it is opt-in via the --enable-expert-parallel flag, which distributes whole experts across the GPUs instead of tensor-slicing them.

For V4-Pro on 8+ GPUs, combine TP and EP: keep your --tensor-parallel-size setting as the baseline, add --enable-expert-parallel, and compare throughput and GPU utilization between the two runs to find the sweet spot.
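The difference between the two strategies is easier to see on a toy example. The sketch below uses plain Python lists rather than a real framework: split_columns mimics a column-wise tensor-parallel split, and place_experts uses a round-robin placement that is an assumption for illustration, not vLLM's actual scheme. The routed expert IDs are likewise made up.

```python
def split_columns(weight, num_gpus):
    """Tensor parallelism: give each GPU a contiguous slice of every row's columns."""
    cols = len(weight[0])
    assert cols % num_gpus == 0, "column count must divide evenly across GPUs"
    step = cols // num_gpus
    return [[row[g * step:(g + 1) * step] for row in weight] for g in range(num_gpus)]


def matvec(x, weight):
    """Compute x @ weight for a vector x and a row-major matrix."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]


def place_experts(num_experts, num_gpus):
    """Expert parallelism: assign whole experts to GPUs round-robin (illustrative)."""
    return {e: e % num_gpus for e in range(num_experts)}


# TP: every GPU works on every token, then the partial outputs are concatenated.
x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
shards = split_columns(w, num_gpus=2)
tp_output = [value for shard in shards for value in matvec(x, shard)]
assert tp_output == matvec(x, w)  # sharded result matches the unsharded one

# EP: a token only touches the GPUs that own its routed experts.
placement = place_experts(num_experts=8, num_gpus=4)
routed = [1, 5]  # hypothetical top-2 experts chosen by the router
gpus_touched = sorted({placement[e] for e in routed})
print(tp_output, gpus_touched)
```

Under TP both shards do work for the single token; under EP the same token lands on one GPU out of four, which is where the lower communication cost comes from.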

Quantization Options

DeepSeek ships V4 in two formats on Hugging Face:

FP8 Mixed (Base checkpoints): Higher quality, larger footprint. Dense parameters in FP8.
FP4+FP8 Mixed (Instruct checkpoints): MoE expert weights in FP4, other parameters in FP8. Recommended format — balances quality and memory.

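Community quantizations (GGUF, AWQ, GPTQ) are available and are reported to compress V4-Flash to ~80GB, potentially fitting on 4× RTX 4090 (96GB). Note that 284B parameters at a true 4 bits per weight would occupy about 142GB, so a build that small implies an average well below 4 bits. The trade-off is measurable quality degradation on reasoning, math, and complex instruction-following tasks.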

Practical guidance: The official FP4+FP8 Instruct checkpoint is already aggressively quantized; at 16 bits per weight, 284B parameters would occupy roughly 568GB. For production inference, stick with the official format unless VRAM constraints leave no alternative. Reserve INT4 community quantizations for development or evaluation on consumer hardware.
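A quick way to sanity-check any quantized checkpoint size is to convert between bits per weight and gigabytes. This sketch assumes the 284B parameter count quoted in this guide and ignores quantization scales and metadata, so real files will be somewhat larger; the function names are illustrative.

```python
PARAMS = 284e9  # V4-Flash parameter count from this guide


def weights_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Weight storage in GB (10^9 bytes), ignoring scales and metadata."""
    return params * bits_per_weight / 8 / 1e9


def implied_bits(size_gb: float, params: float = PARAMS) -> float:
    """Average bits per weight implied by a checkpoint size."""
    return size_gb * 1e9 * 8 / params


for bits in (16, 8, 4):
    print(f"{bits}-bit: {weights_gb(bits):.0f} GB")
print(f"158 GB checkpoint: {implied_bits(158):.2f} bits/weight")
print(f"80 GB quantization: {implied_bits(80):.2f} bits/weight")
```

By this arithmetic the official 158GB checkpoint averages a little under 4.5 bits per weight, while an 80GB build would average closer to 2.25, which is worth keeping in mind when a community quantization is labeled INT4.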

Configuring the 1M Context Window

V4's hybrid CSA+HCA attention architecture dramatically reduces KV cache memory compared to V3.2 — roughly 7% of the footprint at equivalent context length. A full 1M-token context on V4-Flash consumes approximately:

158GB weights
~10GB KV cache at full 1M context
~5–7GB runtime overhead

Total: ~170–175GB

Set --max-model-len in vLLM to match your VRAM headroom:

# Start conservative on 4x A100 (320GB total, ~162GB headroom after weights)
--max-model-len 131072   # 128K

# Full 1M on 2x H200 (282GB total, ~124GB headroom after weights)
--max-model-len 1048576  # 1M

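Monitor memory under load and raise --max-model-len in steps until you are within 5–10GB of VRAM capacity. Keep in mind that vLLM preallocates its KV cache pool at startup (governed by --gpu-memory-utilization, 0.9 by default), so nvidia-smi shows a nearly constant figure during serving; the KV cache size reported in the startup log and any out-of-memory errors at launch are the more useful signals.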
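To get a starting point before measuring, the headroom figures can be turned into an approximate KV cache token capacity. This is a back-of-the-envelope sketch: it takes the ~10GB per 1M tokens figure from this guide, assumes it scales linearly, and uses a hypothetical 10GB safety reserve. The helper name is illustrative.

```python
KV_GB_PER_MILLION_TOKENS = 10.0  # from this guide; assumed to scale linearly


def kv_token_capacity(total_vram_gb: float, weights_gb: float = 158.0,
                      overhead_gb: float = 7.0, reserve_gb: float = 10.0) -> int:
    """Tokens of KV cache that fit after weights, overhead, and a safety reserve."""
    free_gb = total_vram_gb - weights_gb - overhead_gb - reserve_gb
    if free_gb <= 0:
        return 0
    return int(free_gb * 1_000_000 / KV_GB_PER_MILLION_TOKENS)


for name, vram in {"4x A100": 320, "2x H200": 282, "2x A100": 160}.items():
    print(f"{name}: about {kv_token_capacity(vram):,} KV-cache tokens")
```

The result is a pool shared by all in-flight requests, while --max-model-len caps a single sequence. A capacity well above 1M tokens therefore indicates room for several long requests at once, not a longer individual context.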

AWS Instance Selection

Instance	GPUs	Total VRAM	On-Demand ($/hr)	Use Case
p5.48xlarge	8× H100 80GB	640 GB	~$55	V4-Flash (comfortable)
p5e.48xlarge	8× H200 141GB	1,128 GB	~$40–50	V4-Pro (single node)
p5en.48xlarge	8× H200 141GB (200Gbps)	1,128 GB	~$63	V4-Pro + faster fabric
2× p5.48xlarge	16× H100 80GB	1,280 GB	~$110	V4-Pro (multi-node)

1-year reserved instances cut costs roughly 40%. Spot instances are viable for batch inference but not for latency-sensitive serving due to interruption risk.

For most teams evaluating self-hosting, a single p5.48xlarge running V4-Flash is the practical starting point.

Break-Even Analysis: Self-Host vs DeepSeek API

DeepSeek's API pricing (as of May 2026):

Input: $0.14 per million tokens
Output: $0.28 per million tokens
Blended 50/50 rate: ~$0.21 per million tokens

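At $0.21/M, the ~$790/day cost of a reserved p5.48xlarge ($55/hr × 24 hours, less the ~40% reserved discount) buys roughly 3.8 billion API tokens/day. Matching that in-house means sustaining about 44,000 tokens per second around the clock, which a single 8× H100 node running V4-Flash is very unlikely to do.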
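The break-even figure is easy to recompute for your own traffic mix and instance pricing. The sketch below uses the prices quoted in this guide; the function names are illustrative, and $55/hr with a 40% discount works out to about $792/day, which the text above rounds to $790.

```python
def blended_price_per_million(input_price: float, output_price: float,
                              input_share: float = 0.5) -> float:
    """API price per million tokens for a given input/output mix."""
    return input_price * input_share + output_price * (1 - input_share)


def break_even_tokens_per_day(hourly_cost: float, price_per_million: float,
                              discount: float = 0.0) -> float:
    """Daily token volume at which API spend equals the instance's daily cost."""
    daily_cost = hourly_cost * 24 * (1 - discount)
    return daily_cost / price_per_million * 1_000_000


price = blended_price_per_million(0.14, 0.28)           # $0.21 per million
reserved = break_even_tokens_per_day(55, price, 0.40)   # 1-year reserved p5.48xlarge
on_demand = break_even_tokens_per_day(55, price)        # same instance, on-demand

print(f"reserved:  {reserved / 1e9:.1f}B tokens/day, {reserved / 86_400:,.0f} tokens/s")
print(f"on-demand: {on_demand / 1e9:.1f}B tokens/day, {on_demand / 86_400:,.0f} tokens/s")
```

Adjust input_share if your workload is prompt-heavy: a higher input share lowers the blended API price and pushes the break-even volume up further.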

The cost argument almost never closes. Self-host when:

Data residency or regulatory compliance requires on-premises or VPC inference
You need custom fine-tuned weights the hosted API cannot serve
Your workload requires consistently low latency that API quotas cannot guarantee

For all other cases, the API is cheaper and requires no infrastructure maintenance.

