Self-Hosting DeepSeek V4 with vLLM: Hardware Requirements and Deployment Guide
In this article
- Should You Self-Host DeepSeek V4?
- Hardware Requirements
- V4-Flash
- V4-Pro
- Downloading Model Weights
- vLLM Deployment Setup
- Serve V4-Flash on 4× A100 80GB
- Serve V4-Flash on 2× H200
- Expert Parallelism and Tensor Parallelism
- Quantization Options
- Configuring the 1M Context Window
- AWS Instance Selection
- Break-Even Analysis: Self-Host vs DeepSeek API
- What to Read Next
Should You Self-Host DeepSeek V4?
DeepSeek V4 ships under the MIT license with open weights on Hugging Face: both V4-Flash (284B parameters, FP4+FP8 mixed, ~158GB) and V4-Pro (1.6T parameters, ~862GB). The license permits commercial use, fine-tuning, and modification; its only condition is that the copyright and license notice stay with redistributed copies.
Three legitimate reasons to self-host:
- Data sovereignty — DeepSeek's hosted API routes through infrastructure outside your control. For regulated industries, defense-adjacent work, or GDPR-sensitive applications, keeping inference on your own hardware removes that dependency entirely.
- Fine-tuning — The MIT license lets you modify and serve fine-tuned variants. The hosted API does not.
- Consistent latency — Self-hosted inference has no quota limits or API rate fluctuations.
The reason not to self-host is usually cost. At DeepSeek's own API pricing ($0.14/M input, $0.28/M output), break-even against a 1-year reserved AWS p5.48xlarge running V4-Flash requires roughly 3–4 billion tokens/day of sustained throughput; at on-demand rates the bar is higher still. Most teams never reach that volume. Run your own numbers before committing.
Hardware Requirements
V4-Flash
V4-Flash is the practical self-hosting target. The FP4+FP8 Instruct checkpoint weights are ~158GB. Add ~10GB for a full 1M-token KV cache (DeepSeek V4 uses only ~7% of V3.2's KV cache footprint) and a few GB of runtime overhead: total VRAM budget is roughly 170–175GB.
| GPU Config | Total VRAM | Fit V4-Flash? | Notes |
|---|---|---|---|
| 2× H200 141GB | 282 GB | ✅ Comfortable | Recommended; headroom for full 1M context |
| 2× RTX Pro 6000 Blackwell | 192 GB | ✅ Comfortable | More affordable than H200; PCIe only (no NVLink), so expect slower tensor-parallel communication |
| 4× A100 80GB | 320 GB | ✅ Works | vLLM prefers power-of-2 TP; 2× A100 (160GB) is just under budget |
| 2× A100 80GB | 160 GB | ⚠️ Tight | Below budget once KV cache is loaded at full context; reduce max-model-len |
| 4× RTX 4090 | 96 GB | ❌ No | Only feasible with INT4 quantization (~80GB); expect quality degradation |
The 4× A100 recommendation in many guides is a vLLM artifact: tensor parallelism works best with power-of-two GPU counts. 2× A100 gives 160GB, which is below the full-context budget of roughly 170–175GB, so the next safe power-of-two is 4× (320GB). The extra headroom is a side effect, not a requirement.
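The fit column in the table above can be reproduced with a few lines of arithmetic. The sketch below is a rough estimate under stated assumptions, not a sizing tool: it uses the figures quoted in this section, takes 7GB as the upper end of "a few GB" of runtime overhead, and assumes the KV cache scales linearly with context length. The names `GpuConfig`, `budget_gb`, and `headroom_gb` are hypothetical.

```python
from dataclasses import dataclass

# Figures quoted in this guide for the V4-Flash FP4+FP8 checkpoint.
WEIGHTS_GB = 158
KV_CACHE_1M_GB = 10        # KV cache at the full 1M-token context
RUNTIME_OVERHEAD_GB = 7    # assumption: upper end of "a few GB"
FULL_CONTEXT = 1_048_576


@dataclass
class GpuConfig:
    name: str
    gpu_count: int
    vram_per_gpu_gb: int

    @property
    def total_vram_gb(self) -> int:
        return self.gpu_count * self.vram_per_gpu_gb


def budget_gb(context_tokens: int = FULL_CONTEXT) -> float:
    """Weights + KV cache + overhead; KV is scaled linearly (an assumption)."""
    kv = KV_CACHE_1M_GB * context_tokens / FULL_CONTEXT
    return WEIGHTS_GB + kv + RUNTIME_OVERHEAD_GB


def headroom_gb(config: GpuConfig, context_tokens: int = FULL_CONTEXT) -> float:
    """Positive means the budget fits; negative means it does not."""
    return config.total_vram_gb - budget_gb(context_tokens)


CONFIGS = [
    GpuConfig("2x H200", 2, 141),
    GpuConfig("2x RTX Pro 6000 Blackwell", 2, 96),
    GpuConfig("4x A100 80GB", 4, 80),
    GpuConfig("2x A100 80GB", 2, 80),
    GpuConfig("4x RTX 4090", 4, 24),
]

for cfg in CONFIGS:
    spare = headroom_gb(cfg)
    verdict = "fits" if spare >= 0 else "does not fit"
    print(f"{cfg.name}: {cfg.total_vram_gb} GB total, {spare:+.0f} GB headroom, {verdict}")
```

Passing a smaller `context_tokens` value shows how much a reduced `--max-model-len` buys back on tight configurations such as 2× A100.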
V4-Pro
V4-Pro at ~862GB is a real cluster problem:
| Config | Total VRAM | Notes |
|---|---|---|
| 8× H200 141GB (single node) | 1,128 GB | Minimum single-node config |
| DGX H200 | 1,128 GB | Purpose-built; NVLink fabric included |
| 2× p5.48xlarge (16× H100 80GB) | 1,280 GB | Multi-node; NVLink within each node, EFA networking between nodes |
Unless you specifically need V4-Pro's superior agentic coding and knowledge depth over V4-Flash, V4-Flash delivers 85–95% of V4-Pro quality at a fraction of the infrastructure cost.
System RAM: 256GB+ for V4-Flash. Storage: 500GB NVMe minimum.
Downloading Model Weights
Both models live under the deepseek-ai organization on Hugging Face. Use the Instruct FP4+FP8 checkpoints for production:
pip install huggingface_hub
# V4-Flash (~158GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
--local-dir ./deepseek-v4-flash
# V4-Pro (~862GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Pro \
--local-dir ./deepseek-v4-pro
For faster transfers, enable hf_transfer:
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
With a fast network link and an HF token set, this can reach 1–2 GB/s.
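The same download can be scripted from Python, which is convenient inside provisioning jobs. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; the helper names `estimate_transfer_seconds` and `download_checkpoint` are hypothetical, and the time estimate is plain division over the sizes and rates quoted above.

```python
def estimate_transfer_seconds(size_gb: float, rate_gb_per_s: float) -> float:
    """Best-case transfer time at a sustained rate (ignores retries and disk speed)."""
    if rate_gb_per_s <= 0:
        raise ValueError("rate_gb_per_s must be positive")
    return size_gb / rate_gb_per_s


def download_checkpoint(repo_id: str, local_dir: str) -> str:
    """Download every file in a Hugging Face repo and return the local folder path.

    Set HF_HUB_ENABLE_HF_TRANSFER=1 in the environment before this runs if
    hf_transfer is installed; the import is deferred so the setting is picked up.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Rough planning numbers for the checkpoint sizes quoted in this guide.
for name, size_gb in [("V4-Flash", 158), ("V4-Pro", 862)]:
    fast = estimate_transfer_seconds(size_gb, 2.0)
    slow = estimate_transfer_seconds(size_gb, 1.0)
    print(f"{name}: {fast / 60:.1f} to {slow / 60:.1f} minutes at 1-2 GB/s")
```

A call such as `download_checkpoint("deepseek-ai/DeepSeek-V4-Flash", "./deepseek-v4-flash")` mirrors the CLI command shown earlier.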
vLLM Deployment Setup
vLLM ≥0.8.0 is the recommended inference framework for DeepSeek V4. It supports MoE expert parallelism, the hybrid CSA+HCA attention architecture, and efficient KV cache management for long contexts.
pip install "vllm>=0.8.0"
Serve V4-Flash on 4× A100 80GB
python -m vllm.entrypoints.openai.api_server \
--model ./deepseek-v4-flash \
--served-model-name deepseek-v4-flash \
--tensor-parallel-size 4 \
--max-model-len 131072 \
--trust-remote-code \
--port 8000
Serve V4-Flash on 2× H200
python -m vllm.entrypoints.openai.api_server \
--model ./deepseek-v4-flash \
--served-model-name deepseek-v4-flash \
--tensor-parallel-size 2 \
--max-model-len 1048576 \
--trust-remote-code \
--port 8000
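vLLM exposes an OpenAI-compatible API. Point any OpenAI SDK client at it by changing base_url. The model field must match the name the server registered, which is the --model value unless --served-model-name overrides it: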
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
model="deepseek-v4-flash",
messages=[{"role": "user", "content": "Explain mixture-of-experts routing."}],
)
print(response.choices[0].message.content)
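If a request comes back with a "model not found" error, the usual cause is a mismatch between the client's `model` field and the name the server registered. A minimal sketch for checking this, assuming the server is reachable at the base URL used above and answers the standard OpenAI-style `/v1/models` route; the helper names are hypothetical.

```python
import json
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(
    base_url: str = "http://localhost:8000/v1", timeout: float = 5.0
) -> list[str]:
    """Ask the running server which model names it will accept."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Calling `list_served_models()` against a running server returns the exact strings to pass as `model` in chat completion requests.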
Expert Parallelism and Tensor Parallelism
MoE models benefit from two distinct parallelism strategies, and understanding the difference determines how efficiently you use multi-GPU setups:
Tensor Parallelism (TP): Splits individual layer weight matrices across GPUs. Required when a single layer is too large for one GPU. Set with --tensor-parallel-size. Communication-heavy — works best on GPUs connected by NVLink.
Expert Parallelism (EP): Places different MoE experts on different GPUs and routes each token to the GPUs that hold its selected experts. Because MoE only activates a small fraction of experts per token, EP lets you spread experts without duplicating all parameters everywhere. Often lower communication overhead than TP, although token routing adds all-to-all traffic between GPUs.
For V4-Flash on 2× or 4× A100 or H200, tensor parallelism alone is sufficient. vLLM does not switch to EP on its own; pass --enable-expert-parallel to shard the MoE layers by expert instead of by tensor slice.
For V4-Pro on 8+ GPUs, combine TP and EP: set --tensor-parallel-size, add --enable-expert-parallel, and monitor GPU utilization to find the sweet spot.
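The difference between the two strategies is easiest to see in how weights map to devices. The toy sketch below is an illustration only: the round-robin placement, the expert count, and the matrix width are hypothetical choices and do not describe vLLM's internal layout or DeepSeek V4's real dimensions.

```python
def tensor_parallel_shards(num_columns: int, num_gpus: int) -> list[range]:
    """TP: every weight matrix is cut column-wise and each GPU holds one slice."""
    if num_columns % num_gpus:
        raise ValueError("num_columns must divide evenly across GPUs")
    width = num_columns // num_gpus
    return [range(gpu * width, (gpu + 1) * width) for gpu in range(num_gpus)]


def expert_parallel_placement(num_experts: int, num_gpus: int) -> dict[int, list[int]]:
    """EP: each expert lives whole on exactly one GPU (round-robin here)."""
    placement: dict[int, list[int]] = {gpu: [] for gpu in range(num_gpus)}
    for expert in range(num_experts):
        placement[expert % num_gpus].append(expert)
    return placement


def gpus_touched_by_token(active_experts: list[int], num_gpus: int) -> set[int]:
    """Under EP, a token only visits the GPUs hosting its selected experts."""
    return {expert % num_gpus for expert in active_experts}


# Toy numbers: 8 experts, 4 GPUs, a 4096-column weight matrix.
print(tensor_parallel_shards(4096, 4))
print(expert_parallel_placement(8, 4))
print(gpus_touched_by_token([1, 5], 4))
```

Under TP every GPU takes part in every token's matrix multiply, which is why interconnect bandwidth matters so much. Under EP a token that selects experts 1 and 5 in this toy layout only involves GPU 1.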
Quantization Options
DeepSeek ships V4 in two formats on Hugging Face:
- FP8 Mixed (Base checkpoints): Higher quality, larger footprint. Dense parameters in FP8.
- FP4+FP8 Mixed (Instruct checkpoints): MoE expert weights in FP4, other parameters in FP8. Recommended format — balances quality and memory.
Community quantizations (GGUF, AWQ, GPTQ) are available and can compress V4-Flash to ~80GB at INT4, potentially fitting on 4× RTX 4090. The trade-off is measurable quality degradation on reasoning, math, and complex instruction-following tasks.
Practical guidance: The official FP4+FP8 Instruct checkpoint is already aggressively quantized from the full FP32 baseline. For production inference, stick with the official format unless VRAM constraints leave no alternative. Reserve INT4 community quantizations for development or evaluation on consumer hardware.
Configuring the 1M Context Window
V4's hybrid CSA+HCA attention architecture dramatically reduces KV cache memory compared to V3.2 — roughly 7% of the footprint at equivalent context length. A full 1M-token context on V4-Flash consumes approximately:
- ~158GB weights
- ~10GB KV cache at full 1M context
- ~5–7GB runtime overhead
Total: ~170–175GB
Set --max-model-len in vLLM to match your VRAM headroom:
# Start conservative on 4x A100 (320GB total, ~162GB headroom after weights)
--max-model-len 131072 # 128K
# Full 1M on 2x H200 (282GB total, ~124GB headroom after weights)
--max-model-len 1048576 # 1M
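vLLM preallocates GPU memory up to --gpu-memory-utilization (default 0.9), so raw usage in nvidia-smi stays nearly flat once the server is up. Watch the KV cache usage that vLLM reports under load instead, and raise --max-model-len step by step while keeping 5–10GB of VRAM free.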
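A chosen `--max-model-len` also determines how many full-length requests can hold KV cache at the same time. The sketch below is a back-of-the-envelope estimate under stated assumptions: KV cache grows linearly with tokens at the ~10GB per 1M figure quoted above, runtime overhead is 7GB, and the 0.9 factor mirrors vLLM's default `--gpu-memory-utilization`. The function names are hypothetical.

```python
WEIGHTS_GB = 158
KV_GB_PER_1M_TOKENS = 10   # figure quoted in this guide
OVERHEAD_GB = 7            # assumption: upper end of the 5-7GB range
FULL_CONTEXT = 1_048_576


def kv_cache_gb(tokens: int) -> float:
    """KV cache for one sequence, assuming linear growth with length."""
    return KV_GB_PER_1M_TOKENS * tokens / FULL_CONTEXT


def concurrent_sequences(
    total_vram_gb: float,
    max_model_len: int,
    gpu_memory_utilization: float = 0.9,
) -> int:
    """How many sequences of max_model_len tokens fit in the leftover VRAM."""
    usable = total_vram_gb * gpu_memory_utilization - WEIGHTS_GB - OVERHEAD_GB
    if usable <= 0:
        return 0
    return int(usable // kv_cache_gb(max_model_len))


print("2x H200 at 1M context:", concurrent_sequences(282, 1_048_576))
print("4x A100 at 128K context:", concurrent_sequences(320, 131_072))
print("2x A100 at 1M context:", concurrent_sequences(160, 1_048_576))
```

Real capacity depends on block allocation, prefix caching, and actual request lengths, so treat the result as a starting point for load testing.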
AWS Instance Selection
| Instance | GPUs | Total VRAM | On-Demand ($/hr) | Use Case |
|---|---|---|---|---|
| p5.48xlarge | 8× H100 80GB | 640 GB | ~$55 | V4-Flash (comfortable) |
| p5e.48xlarge | 8× H200 141GB | 1,128 GB | ~$40–50 | V4-Pro (single node) |
| p5en.48xlarge | 8× H200 141GB (200Gbps) | 1,128 GB | ~$63 | V4-Pro + faster fabric |
| 2× p5.48xlarge | 16× H100 80GB | 1,280 GB | ~$110 | V4-Pro (multi-node) |
1-year reserved instances cut costs roughly 40%. Spot instances are viable for batch inference but not for latency-sensitive serving due to interruption risk.
For most teams evaluating self-hosting, a single p5.48xlarge running V4-Flash is the practical starting point.
Break-Even Analysis: Self-Host vs DeepSeek API
DeepSeek's API pricing (as of May 2026):
- Input: $0.14 per million tokens
- Output: $0.28 per million tokens
- Blended 50/50 rate: ~$0.21 per million tokens
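A reserved p5.48xlarge costs roughly $790/day (~$55/hr on-demand × 24 hours, less the ~40% reserved discount). At $0.21/M, that equates to roughly 3.8 billion tokens/day, or about 43,500 tokens per second around the clock. A single 8× H100 node running V4-Flash is unlikely to sustain that throughput.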
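The break-even arithmetic is simple enough to rerun with your own traffic mix and instance pricing. A minimal sketch using the prices quoted in this article; the function names are hypothetical, and the 40% reserved discount is the rough figure from the AWS section, not a quoted AWS rate.

```python
INPUT_USD_PER_M = 0.14    # DeepSeek API, input tokens
OUTPUT_USD_PER_M = 0.28   # DeepSeek API, output tokens
SECONDS_PER_DAY = 86_400


def blended_rate(input_share: float = 0.5) -> float:
    """API price per million tokens for a given input/output mix."""
    return INPUT_USD_PER_M * input_share + OUTPUT_USD_PER_M * (1 - input_share)


def reserved_daily_cost(on_demand_hourly: float, discount: float = 0.40) -> float:
    """Daily instance cost after a reserved-instance discount."""
    return on_demand_hourly * 24 * (1 - discount)


def break_even_tokens_per_day(daily_cost_usd: float, input_share: float = 0.5) -> float:
    """Tokens per day at which API spend would equal the instance cost."""
    return daily_cost_usd / blended_rate(input_share) * 1_000_000


daily = reserved_daily_cost(55)            # p5.48xlarge at ~$55/hr on-demand
tokens = break_even_tokens_per_day(daily)
print(f"Reserved cost: ${daily:,.0f}/day")
print(f"Break-even: {tokens / 1e9:.2f}B tokens/day")
print(f"Sustained rate needed: {tokens / SECONDS_PER_DAY:,.0f} tokens/s")
```

Shifting `input_share` toward input-heavy workloads lowers the blended API price and pushes the break-even volume even higher.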
The cost argument almost never closes. Self-host when:
- Data residency or regulatory compliance requires on-premises or VPC inference
- You need custom fine-tuned weights the hosted API cannot serve
- Your workload requires consistent low latency that API quotas cannot guarantee
For all other cases, the API is cheaper and requires no infrastructure maintenance.
What to Read Next
- Deploy Models on RunPod — rent GPU infrastructure by the hour for V4-Flash evaluation before committing to reserved instances
- Run DeepSeek V4 Flash Locally with llama.cpp — single-GPU local setup using a modified llama.cpp build and GGUF quantization
- Production-Grade RAG Architecture — connect your self-hosted V4 endpoint to a retrieval pipeline