Self-Hosting DeepSeek V4 with vLLM: Hardware Requirements and Deployment Guide

May 24, 2026 · guides

Should You Self-Host DeepSeek V4?

DeepSeek V4 ships under the MIT license with open weights on Hugging Face: both V4-Flash (284B parameters, FP4+FP8 mixed, ~158GB) and V4-Pro (1.6T parameters, ~862GB). The license permits commercial use, fine-tuning, and infrastructure modification; its only condition is that the copyright and license notice be retained.

Three legitimate reasons to self-host:

  1. Data sovereignty — DeepSeek's hosted API routes through infrastructure outside your control. For regulated industries, defense-adjacent work, or GDPR-sensitive applications, keeping inference on your own hardware removes that dependency entirely.
  2. Fine-tuning — The MIT license lets you modify and serve fine-tuned variants. The hosted API does not.
  3. Consistent latency — Self-hosted inference has no quota limits or API rate fluctuations.

The reason not to self-host is usually cost. At DeepSeek's own API pricing ($0.14/M input, $0.28/M output), breaking even against a 1-year reserved AWS p5.48xlarge running V4-Flash requires roughly 3–4 billion tokens/day of sustained throughput. Most teams never reach that volume. Run your own numbers before committing.


Hardware Requirements

V4-Flash

V4-Flash is the practical self-hosting target. The FP4+FP8 Instruct checkpoint weights are ~158GB. Add ~10GB for a full 1M-token KV cache (DeepSeek V4 uses only ~7% of V3.2's KV cache footprint) and a few GB of runtime overhead: total VRAM budget is roughly 170–175GB.

| GPU Config | Total VRAM | Fits V4-Flash? | Notes |
|---|---|---|---|
| 2× H200 141GB | 282 GB | ✅ Comfortable | Recommended; headroom for full 1M context |
| 2× RTX Pro 6000 Blackwell | 192 GB | ✅ Comfortable | More affordable than H200; PCIe only (these cards have no NVLink) |
| 4× A100 80GB | 320 GB | ✅ Works | vLLM prefers power-of-2 TP; 2× A100 (160GB) is just under budget |
| 2× A100 80GB | 160 GB | ⚠️ Tight | Below budget once KV cache is loaded at full context; reduce max-model-len |
| 4× RTX 4090 | 96 GB | ❌ No | Only feasible with aggressive community quantization (~80GB); expect quality degradation |

The 4× A100 recommendation in many guides is a vLLM artifact: the tensor-parallel size has to divide the model's attention heads evenly, which in practice means power-of-two GPU counts. 2× A100 gives 160GB, which is below the ~170–175GB full-context budget, so the next safe power of two is 4× (320GB). The extra headroom is a side effect, not a requirement.
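The power-of-two reasoning can be checked with a few lines of arithmetic. This is a minimal sketch: the function name is made up for illustration, and the 175GB budget is the upper end of the estimate given earlier in this section.

```python
def smallest_pow2_gpu_count(budget_gb: float, vram_per_gpu_gb: float) -> int:
    """Smallest power-of-two GPU count whose combined VRAM covers budget_gb."""
    count = 1
    while count * vram_per_gpu_gb < budget_gb:
        count *= 2
    return count


BUDGET_GB = 175  # upper end of this guide's V4-Flash full-context estimate

for name, vram_gb in [("A100 80GB", 80), ("H200 141GB", 141)]:
    n = smallest_pow2_gpu_count(BUDGET_GB, vram_gb)
    print(f"{name}: {n} GPUs, {n * vram_gb} GB total")
```

For A100 80GB this lands on 4 GPUs (320GB), and for H200 141GB on 2 GPUs (282GB), matching the table.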

V4-Pro

V4-Pro at ~862GB is a real cluster problem:

| Config | Total VRAM | Notes |
|---|---|---|
| 8× H200 141GB (single node) | 1,128 GB | Minimum single-node config |
| DGX H200 | 1,128 GB | Purpose-built; NVLink fabric included |
| 2× p5.48xlarge (16× H100 80GB) | 1,280 GB | Multi-node; requires NVLink within each node plus a fast inter-node fabric (EFA on AWS) |

Unless you specifically need V4-Pro's superior agentic coding and knowledge depth over V4-Flash, V4-Flash delivers 85–95% of V4-Pro quality at a fraction of the infrastructure cost.

System RAM: 256GB+ for V4-Flash. Storage: 500GB NVMe minimum.


Downloading Model Weights

Both models live under the deepseek-ai organization on Hugging Face. Use the Instruct FP4+FP8 checkpoints for production:

pip install huggingface_hub
# V4-Flash (~158GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir ./deepseek-v4-flash

# V4-Pro (~862GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Pro \
  --local-dir ./deepseek-v4-pro

For faster transfers, enable hf_transfer:

pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1

With an HF token set and enough network and disk bandwidth, this can reach 1–2 GB/s.
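To put those rates in perspective, a rough transfer-time estimate is just size divided by throughput. The sketch below uses the checkpoint sizes and the 1–2 GB/s range quoted in this guide; real downloads will vary with network and disk.

```python
def download_minutes(size_gb: float, rate_gb_per_s: float) -> float:
    """Idealized transfer time in minutes at a constant rate."""
    return size_gb / rate_gb_per_s / 60


for model, size_gb in [("V4-Flash", 158), ("V4-Pro", 862)]:
    fastest = download_minutes(size_gb, 2.0)  # 2 GB/s
    slowest = download_minutes(size_gb, 1.0)  # 1 GB/s
    print(f"{model}: about {fastest:.1f} to {slowest:.1f} minutes")
```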


vLLM Deployment Setup

vLLM ≥0.8.0 is the recommended inference framework for DeepSeek V4. It supports MoE expert parallelism, the hybrid CSA+HCA attention architecture, and efficient KV cache management for long contexts.

pip install "vllm>=0.8.0"

Serve V4-Flash on 4× A100 80GB

python -m vllm.entrypoints.openai.api_server \
  --model ./deepseek-v4-flash \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --trust-remote-code \
  --port 8000

Serve V4-Flash on 2× H200

python -m vllm.entrypoints.openai.api_server \
  --model ./deepseek-v4-flash \
  --tensor-parallel-size 2 \
  --max-model-len 1048576 \
  --trust-remote-code \
  --port 8000
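The two serve commands differ only in tensor-parallel size and context length, so they can be generated from one small helper. This is a sketch, not part of vLLM: `build_serve_command` and the profile names are hypothetical. The optional `served_model_name` argument maps to vLLM's `--served-model-name` flag, which sets the model name clients send in requests.

```python
import shlex


def build_serve_command(model_path, tensor_parallel_size, max_model_len,
                        port=8000, served_model_name=None):
    """Assemble the vLLM OpenAI-server command from this guide as an argv list."""
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--trust-remote-code",
        "--port", str(port),
    ]
    if served_model_name is not None:
        cmd += ["--served-model-name", served_model_name]
    return cmd


# Hypothetical profile names; the values are the two configurations in this guide.
PROFILES = {
    "4xA100": {"tensor_parallel_size": 4, "max_model_len": 131072},
    "2xH200": {"tensor_parallel_size": 2, "max_model_len": 1048576},
}

cmd = build_serve_command("./deepseek-v4-flash", **PROFILES["2xH200"])
print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```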

vLLM exposes an OpenAI-compatible API. Point any OpenAI SDK client at it by changing base_url:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
    model="./deepseek-v4-flash",  # must match the --model value unless --served-model-name sets an alias
    messages=[{"role": "user", "content": "Explain mixture-of-experts routing."}],
)
print(response.choices[0].message.content)

Expert Parallelism and Tensor Parallelism

MoE models benefit from two distinct parallelism strategies, and understanding the difference determines how efficiently you use multi-GPU setups:

Tensor Parallelism (TP): Splits individual layer weight matrices across GPUs. Required when a single layer is too large for one GPU. Set with --tensor-parallel-size. Communication-heavy — works best on GPUs connected by NVLink.

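Expert Parallelism (EP): Places different MoE experts on different GPUs and routes each token to the GPUs that hold its selected experts. Because only a small fraction of experts is active per token, EP spreads expert weights across devices instead of sharding every expert matrix. It replaces TP's per-layer all-reduce with an all-to-all token exchange, which is often cheaper for MoE layers but depends on the interconnect and on how evenly tokens are routed.

For V4-Flash on 2× or 4× A100 or H200, tensor parallelism alone is sufficient. vLLM does not switch to EP on its own: by default MoE layers are sharded with TP, and expert parallelism is turned on with --enable-expert-parallel.

For V4-Pro on 8+ GPUs, combine TP and EP: set --tensor-parallel-size to the GPU count, add --enable-expert-parallel, and compare throughput and GPU utilization against the TP-only run to find the sweet spot.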
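A toy model helps make the expert-parallel idea concrete. The sketch below is illustrative only: it is not vLLM's implementation, the expert and GPU counts are not V4's real configuration, and the function names are made up. It places experts on GPUs in contiguous blocks, then groups each (token, expert) pair by the GPU that owns the expert.

```python
from collections import defaultdict


def place_experts(num_experts, num_gpus):
    """Map expert id -> GPU id in contiguous blocks."""
    per_gpu = -(-num_experts // num_gpus)  # ceiling division
    return {expert: expert // per_gpu for expert in range(num_experts)}


def dispatch(token_experts, placement):
    """Group (token_id, expert_id) pairs by the GPU that holds each expert."""
    per_gpu = defaultdict(list)
    for token_id, experts in enumerate(token_experts):
        for expert_id in experts:
            per_gpu[placement[expert_id]].append((token_id, expert_id))
    return dict(per_gpu)


# Toy numbers: 8 experts on 4 GPUs, top-2 routing for 3 tokens.
placement = place_experts(num_experts=8, num_gpus=4)
routed = [[0, 5], [1, 2], [6, 7]]  # experts a router picked for each token

for gpu, work in sorted(dispatch(routed, placement).items()):
    print(f"GPU {gpu}: {work}")
```

Each GPU only runs the experts it owns, so no expert weights are duplicated. The cost is that token activations have to be exchanged between GPUs, and an uneven routing pattern (GPU 3 above gets both experts for one token) leaves some devices busier than others.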


Quantization Options

DeepSeek ships V4 in two formats on Hugging Face:

  • FP8 Mixed (Base checkpoints): Higher quality, larger footprint. Dense parameters in FP8.
  • FP4+FP8 Mixed (Instruct checkpoints): MoE expert weights in FP4, other parameters in FP8. Recommended format — balances quality and memory.

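Community quantizations (GGUF, AWQ, GPTQ) are available, and some target ~80GB so that V4-Flash can potentially fit on 4× RTX 4090. Note that 284B parameters at a flat 4 bits per weight is still about 142GB, so an ~80GB build implies an average of roughly 2–2.5 bits per weight rather than plain INT4. The trade-off is measurable quality degradation on reasoning, math, and complex instruction-following tasks.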

Practical guidance: The official FP4+FP8 Instruct checkpoint is already aggressively quantized; at ~158GB for 284B parameters it averages about 4.5 bits per weight. For production inference, stick with the official format unless VRAM constraints leave no alternative. Reserve lower-bit community quantizations for development or evaluation on consumer hardware.
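A quick way to sanity-check any quoted checkpoint size is parameters × bits ÷ 8. The sketch below uses the 284B parameter count from this guide and ignores quantization scales and metadata, so real files run slightly larger.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB, ignoring scales and metadata."""
    return params_billion * bits_per_weight / 8


def implied_bits(size_gb: float, params_billion: float) -> float:
    """Average bits per weight implied by a given checkpoint size."""
    return size_gb * 8 / params_billion


V4_FLASH_PARAMS_B = 284  # parameter count quoted in this guide

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_footprint_gb(V4_FLASH_PARAMS_B, bits):.0f} GB")

print(f"158 GB checkpoint: ~{implied_bits(158, V4_FLASH_PARAMS_B):.2f} bits/weight")
print(f"80 GB quantization: ~{implied_bits(80, V4_FLASH_PARAMS_B):.2f} bits/weight")
```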


Configuring the 1M Context Window

V4's hybrid CSA+HCA attention architecture dramatically reduces KV cache memory compared to V3.2: roughly 7% of the footprint at equivalent context length. Serving V4-Flash with a full 1M-token context needs approximately:

  • 158GB weights
  • ~10GB KV cache at full 1M context
  • ~5–7GB runtime overhead

Total: ~170–175GB
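The same budget can be estimated for shorter contexts. The sketch below assumes the KV cache grows linearly with context length from the ~10GB-at-1M figure above; that linearity is an assumption, not a measured property of V4's attention. It also covers a single sequence, so concurrent requests need additional cache on top.

```python
WEIGHTS_GB = 158                # FP4+FP8 Instruct checkpoint
KV_GB_AT_1M = 10                # KV cache at 1,048,576 tokens (figure from this guide)
OVERHEAD_GB = (5, 7)            # runtime overhead range
FULL_CONTEXT = 1_048_576


def vram_budget_gb(max_model_len: int):
    """(low, high) VRAM estimate in GB, assuming linear KV cache growth."""
    kv_gb = KV_GB_AT_1M * max_model_len / FULL_CONTEXT
    return (WEIGHTS_GB + kv_gb + OVERHEAD_GB[0],
            WEIGHTS_GB + kv_gb + OVERHEAD_GB[1])


for length in (131_072, FULL_CONTEXT):
    low, high = vram_budget_gb(length)
    print(f"max-model-len {length}: roughly {low:.0f} to {high:.0f} GB")
```

Under these assumptions, dropping from 1M to 128K context saves only about 9GB, because the weights dominate the budget.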

Set --max-model-len in vLLM to match your VRAM headroom:

# Start conservative on 4x A100 (320GB total, ~150GB headroom after weights)
--max-model-len 131072   # 128K

# Full 1M on 2x H200 (282GB total, ~110GB headroom after weights)
--max-model-len 1048576  # 1M

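vLLM preallocates KV cache space up to --gpu-memory-utilization (0.9 by default), so nvidia-smi shows memory near that limit regardless of load. Raise --max-model-len in steps instead: check the KV cache capacity vLLM reports at startup and its cache-usage metrics under load, and keep 5–10GB of VRAM unallocated as a safety margin.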


AWS Instance Selection

| Instance | GPUs | Total VRAM | On-Demand ($/hr) | Use Case |
|---|---|---|---|---|
| p5.48xlarge | 8× H100 80GB | 640 GB | ~$55 | V4-Flash (comfortable) |
| p5e.48xlarge | 8× H200 141GB | 1,128 GB | ~$40–50 | V4-Pro (single node) |
| p5en.48xlarge | 8× H200 141GB (200Gbps) | 1,128 GB | ~$63 | V4-Pro + faster fabric |
| 2× p5.48xlarge | 16× H100 80GB | 1,280 GB | ~$110 | V4-Pro (multi-node) |

1-year reserved instances cut costs roughly 40%. Spot instances are viable for batch inference but not for latency-sensitive serving due to interruption risk.

For most teams evaluating self-hosting, a single p5.48xlarge running V4-Flash is the practical starting point.


Break-Even Analysis: Self-Host vs DeepSeek API

DeepSeek's API pricing (as of May 2026):

  • Input: $0.14 per million tokens
  • Output: $0.28 per million tokens
  • Blended 50/50 rate: ~$0.21 per million tokens

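A 1-year reserved p5.48xlarge costs about $790/day (~$55/hr × 24 hours, less the ~40% reservation discount). At $0.21/M, that same $790 buys roughly 3.8 billion API tokens per day, or about 44,000 tokens per second around the clock. A single 8× H100 node running V4-Flash cannot realistically sustain that throughput.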
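The break-even figure can be reproduced from this guide's own numbers, which makes it easy to rerun with your own instance price or input/output mix. A minimal sketch; the constants are the approximate values quoted above, and the function names are illustrative.

```python
ON_DEMAND_PER_HOUR = 55.0   # p5.48xlarge, approximate on-demand rate
RESERVED_DISCOUNT = 0.40    # roughly 40% off for a 1-year reservation
INPUT_PER_M = 0.14          # API price per million input tokens
OUTPUT_PER_M = 0.28         # API price per million output tokens


def blended_rate(input_share: float = 0.5) -> float:
    """API price per million tokens for a given input/output mix."""
    return input_share * INPUT_PER_M + (1 - input_share) * OUTPUT_PER_M


def break_even_tokens_per_day(daily_cost: float, rate_per_m: float) -> float:
    """Tokens per day at which API spend equals the daily instance cost."""
    return daily_cost / rate_per_m * 1_000_000


daily_cost = ON_DEMAND_PER_HOUR * 24 * (1 - RESERVED_DISCOUNT)
tokens = break_even_tokens_per_day(daily_cost, blended_rate())

print(f"reserved daily cost: ${daily_cost:.0f}")
print(f"break-even: {tokens / 1e9:.2f}B tokens/day")
print(f"sustained rate needed: {tokens / 86_400:,.0f} tokens/s")
```

An output-heavy mix raises the blended API rate and lowers the break-even volume, so try `blended_rate(0.2)` if your workload generates far more than it reads.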

The cost argument almost never closes. Self-host when:

  • Data residency or regulatory compliance requires on-premises or VPC inference
  • You need custom fine-tuned weights the hosted API cannot serve
  • Your workload requires consistently low latency that API quotas cannot guarantee

For all other cases, the API is cheaper and requires no infrastructure maintenance.

