Run Claude Code Locally with Ollama: A Complete Setup Guide
Why Run an AI Coding Agent Locally?
Most terminal-based coding agents phone home to a cloud API. That works fine until you're on a flaky connection, need to keep proprietary code off external servers, or simply want to avoid per-token billing on long agentic sessions.
Running Claude Code against a local Ollama server eliminates all three problems. Your code stays on your machine, inference is fast (especially on a GPU), and the workflow is identical to the cloud-backed version. This guide walks through the complete setup — from installing the runtime to running a multi-step coding task.
Prerequisites
Before you start, make sure your system has:
Hardware:
- NVIDIA GPU with at least 16 GB VRAM (24 GB recommended for larger context windows)
- 16–32 GB system RAM
- ~25 GB free disk space
If you have no GPU, inference will fall back to CPU, which is noticeably slower but still functional.
Software:
- Linux or macOS (Windows users: use WSL2 with GPU passthrough)
- NVIDIA driver and CUDA Toolkit installed (version 13.1 or compatible)
- Node.js (required for the Claude Code installer)
Verify your GPU setup before proceeding:
nvidia-smi
You should see your GPU model, available VRAM, and the active CUDA version listed in the output.
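If you want to script this check, here is a minimal Python sketch. It assumes `nvidia-smi`'s CSV query mode (`--query-gpu=memory.total --format=csv,noheader`), which prints one line per GPU such as `24576 MiB`; the function itself just parses that line against the guide's 16 GB minimum.

```python
def has_enough_vram(nvidia_smi_line: str, required_gib: int = 16) -> bool:
    """Parse one output line of:
        nvidia-smi --query-gpu=memory.total --format=csv,noheader
    (e.g. "24576 MiB") and check it meets the required VRAM in GiB."""
    mib = int(nvidia_smi_line.strip().split()[0])  # leading number is MiB
    return mib / 1024 >= required_gib
```

For example, `has_enough_vram("24576 MiB")` is true (24 GiB), while an 8 GB card fails the check.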
Step 1: Install Ollama
Ollama is the local runtime that downloads, manages, and serves models. It also exposes an HTTP API that external tools — including Claude Code — can talk to.
On Linux, a single command handles the full installation:
curl -fsSL https://ollama.com/install.sh | sh
On macOS and Windows, download the installer from ollama.com and follow the on-screen instructions. Ollama runs as a background service and checks for updates automatically.
Once installed, confirm the version:
ollama -v
If you get a "command not found" error, the service may not have started yet. Launch it manually in one terminal:
ollama serve
Then run ollama -v in a second terminal. Once the version prints cleanly, Ollama is ready.
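You can also probe the server programmatically. A running Ollama instance answers on its default port (11434) with a plain-text banner on the root endpoint; the sketch below, using only the standard library, treats any connection failure as "not running":

```python
from urllib import request, error

def ollama_is_up(url: str = "http://localhost:11434/") -> bool:
    """Return True if an Ollama server answers at `url`.

    A running server replies to GET / with the plain-text banner
    "Ollama is running"; connection errors mean it is not up.
    """
    try:
        with request.urlopen(url, timeout=2) as resp:
            return b"Ollama is running" in resp.read()
    except (error.URLError, OSError):
        return False
```

This is handy in wrapper scripts that should start `ollama serve` only when nothing is already listening.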
Step 2: Pull and Test the Model
With Ollama running, download a local model. GLM 4.7 Flash is a strong choice for agentic coding — it's fast on GPU and supports a large context window.
ollama pull glm-4.7-flash
After the download completes, run a quick sanity check in interactive mode:
ollama run glm-4.7-flash
Type a short prompt and verify you get a sensible response. If you're on a GPU, the reply should come back in well under a second.
You can also verify the model responds over the local HTTP API, which is how Claude Code will communicate with it:
curl http://localhost:11434/api/chat -d '{
  "model": "glm-4.7-flash",
  "messages": [{"role": "user", "content": "Hello!"}]
}'
A JSON response confirms the API is live and the model is loaded.
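The same request is easy to issue from Python with only the standard library. This sketch mirrors the curl call above; setting `"stream": false` asks Ollama for a single JSON object instead of newline-delimited chunks, which keeps the parsing trivial:

```python
import json
from urllib import request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint

def build_chat_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object back, not a chunk stream
    }

def chat(model: str, prompt: str, url: str = OLLAMA_CHAT_URL) -> str:
    """Send one chat turn and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(model, prompt)).encode()
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        data = json.load(resp)
    return data["message"]["content"]

# Requires a running Ollama server:
# print(chat("glm-4.7-flash", "Hello!"))
```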
Step 3: Set an Appropriate Context Length
Claude Code and agentic workflows in general require a reasonable context window to function well. However, very large windows can cause two problems: inference slows dramatically, and the model sometimes enters repetitive thinking loops.
Testing across several sizes has shown 20,000 tokens to be a reliable sweet spot — large enough for multi-file coding tasks, without sacrificing generation speed.
Stop the running Ollama server with Ctrl + C, then restart it with the context length override:
OLLAMA_CONTEXT_LENGTH=20000 ollama serve
Confirm the setting is active in a new terminal:
ollama ps
The output should show the model running on your GPU with CONTEXT set to 20000:
NAME                   ID              SIZE    PROCESSOR    CONTEXT
glm-4.7-flash:latest   d1a8a26252f1    21 GB   100% GPU     20000
Step 4: Install Claude Code
Claude Code is a terminal-based coding agent built by Anthropic. You interact with it in natural language, and it handles writing, editing, refactoring, and executing code as part of multi-step workflows.
Install it with the official script:
curl -fsSL https://claude.ai/install.sh | bash
Once complete, the claude command will be available in your terminal.
Step 5: Connect Claude Code to Ollama
Navigate to your project directory:
mkdir my-local-project
cd my-local-project
The recommended way to launch Claude Code with Ollama is to use the built-in launch command, which automatically configures the API routing for you:
ollama launch claude --model glm-4.7-flash
Alternatively, you can configure the environment variables manually and then run claude directly:
# Linux / macOS
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
claude
# Windows (PowerShell)
$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_API_KEY = ""
claude
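If you launch Claude Code from a wrapper script rather than an interactive shell, the same three variables can be injected per-process. The helper below is a hypothetical convenience, not part of either tool; it copies the current environment, applies the overrides, and would be passed to `subprocess.run`:

```python
import os

def ollama_claude_env(base_url: str = "http://localhost:11434") -> dict:
    """Environment for pointing Claude Code at a local Ollama server.

    The auth token only needs a non-empty placeholder value, and the
    API key is blanked so a real cloud key is never picked up.
    """
    env = dict(os.environ)  # start from the current environment
    env.update({
        "ANTHROPIC_BASE_URL": base_url,
        "ANTHROPIC_AUTH_TOKEN": "ollama",
        "ANTHROPIC_API_KEY": "",
    })
    return env

# Launch with the overrides applied to this process only:
# import subprocess
# subprocess.run(["claude"], env=ollama_claude_env())
```

This keeps the overrides scoped to one invocation instead of polluting your shell profile.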
Once the Claude Code interface opens in your terminal, confirm it is pointing to your local model:
/model
If the output shows glm-4.7-flash, the setup is complete.
Step 6: Run Your First Agentic Task
With everything wired up, Claude Code will respond using your locally running model. Start with a simple greeting to confirm response speed; on a GPU, replies should be near-instantaneous.
For a more realistic test, ask Claude Code to build something complete:
"Build a command-line Snake game in Python."
Before it generates code, enable Planning Mode by pressing Shift + Tab twice. The model will outline its approach first — you can review the plan, ask for adjustments, and then tell it to proceed. Claude Code will create the required files and provide instructions for running the result.
Bonus: Use a Local GGUF File Directly
If you already have a GGUF model file downloaded and want to skip re-downloading via Ollama, you can register it manually with a Modelfile:
FROM ./glm-4.7-flash.gguf
PARAMETER temperature 0.8
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.0
Register it:
ollama create glm-4.7-flash-local -f Modelfile
Then run it like any other Ollama model:
ollama run glm-4.7-flash-local
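If you register local GGUF files often, generating the Modelfile from a script avoids copy-paste mistakes. This is a small sketch using the parameter values from the example above (the output path and defaults are illustrative, not anything Ollama requires):

```python
from pathlib import Path

def write_modelfile(gguf_path: str, out: str = "Modelfile",
                    temperature: float = 0.8, top_p: float = 0.95,
                    repeat_penalty: float = 1.0) -> str:
    """Write an Ollama Modelfile pointing at a local GGUF file.

    Returns the Modelfile text so callers can log or inspect it.
    """
    text = (
        f"FROM {gguf_path}\n"
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER top_p {top_p}\n"
        f"PARAMETER repeat_penalty {repeat_penalty}\n"
    )
    Path(out).write_text(text)
    return text

# Then register it as before:
#   ollama create glm-4.7-flash-local -f Modelfile
```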
Which Models Work Best?
Not every local model handles agentic workflows cleanly. Tool calling and multi-step planning put real demands on the model's instruction-following ability. As of early 2026, the community-tested options are:
| Model | Strengths |
|---|---|
| GLM 4.7 Flash | Very fast on GPU, 128k context, great tool calling |
| Qwen 2.5 Coder (32B/7B) | Best open-source coding reasoning overall |
| Codestral | Strong on Python and complex logic; heavier on VRAM |
Start with GLM 4.7 Flash if you want the fastest time-to-working-setup. Move to Qwen 2.5 Coder if you need stronger reasoning on complex refactoring tasks.
What You've Built
At the end of this setup you have a fully local, fully private coding agent that:
- Runs inference on your own hardware with no API calls leaving your machine
- Uses the same Claude Code interface and workflow patterns as the cloud version
- Can be pointed at any Ollama-compatible model with a single flag change
The total setup time, excluding model download, is around five minutes. If you're working in an environment with poor connectivity or handling sensitive codebases, this is a practical and reliable alternative to cloud-hosted agents.