Run Claude Code Locally with Ollama: A Complete Setup Guide
Why Run an AI Coding Agent Locally?
Most terminal-based coding agents phone home to a cloud API. That works fine until you're on a flaky connection, need to keep proprietary code off external servers, or simply want to avoid per-token billing on long agentic sessions.
Running Claude Code against a local Ollama server eliminates all three problems. Your code stays on your machine, inference is fast (especially on a GPU), and the workflow is identical to the cloud-backed version. This guide walks through the complete setup — from installing the runtime to running a multi-step coding task.
Prerequisites
Before you start, make sure your system has:
Hardware:
- NVIDIA GPU with at least 16 GB VRAM (24 GB recommended for larger context windows)
- 16–32 GB system RAM
- ~25 GB free disk space
If you have no GPU, inference will fall back to CPU, which is noticeably slower but still functional.
Software:
- Linux or macOS (Windows users: use WSL2 with GPU passthrough)
- NVIDIA driver and CUDA Toolkit installed (version 13.1 or compatible)
- Node.js (required for the Claude Code installer)
Verify your GPU setup before proceeding:
nvidia-smi
You should see your GPU model, available VRAM, and the active CUDA version listed in the output.
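If you want to script this check, here is a minimal Python sketch. It assumes `nvidia-smi`'s CSV query mode (`--query-gpu=memory.total --format=csv,noheader`), which prints one line per GPU such as `24576 MiB`; the function itself just parses that line against the guide's 16 GB minimum.

```python
def has_enough_vram(nvidia_smi_line: str, required_gib: int = 16) -> bool:
    """Parse one output line of:
        nvidia-smi --query-gpu=memory.total --format=csv,noheader
    (e.g. "24576 MiB") and check it meets the required VRAM in GiB."""
    mib = int(nvidia_smi_line.strip().split()[0])  # leading number is MiB
    return mib / 1024 >= required_gib
```

For example, `has_enough_vram("24576 MiB")` is true (24 GiB), while an 8 GB card fails the check.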
Step 1: Install Ollama
Ollama is the local runtime that downloads, manages, and serves models. It also exposes an HTTP API that external tools — including Claude Code — can talk to.
On Linux, a single command handles the full installation:
curl -fsSL https://ollama.com/install.sh | sh
On macOS and Windows, download the installer from ollama.com and follow the on-screen instructions. Ollama runs as a background service and checks for updates automatically.
Once installed, confirm the version:
ollama -v
If you get a "command not found" error, the service may not have started yet. Launch it manually in one terminal:
ollama serve
Then run ollama -v in a second terminal. Once the version prints cleanly, Ollama is ready.
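You can also probe the server programmatically. A running Ollama instance answers on its default port (11434) with a plain-text banner on the root endpoint; the sketch below, using only the standard library, treats any connection failure as "not running":

```python
from urllib import request, error

def ollama_is_up(url: str = "http://localhost:11434/") -> bool:
    """Return True if an Ollama server answers at `url`.

    A running server replies to GET / with the plain-text banner
    "Ollama is running"; connection errors mean it is not up.
    """
    try:
        with request.urlopen(url, timeout=2) as resp:
            return b"Ollama is running" in resp.read()
    except (error.URLError, OSError):
        return False
```

This is handy in wrapper scripts that should start `ollama serve` only when nothing is already listening.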
Step 2: Pull and Test the Model
With Ollama running, download a local model. GLM 4.7 Flash is a strong choice for agentic coding — it's fast on GPU and supports a large context window.
ollama pull glm-4.7-flash
After the download completes, run a quick sanity check in interactive mode:
ollama run glm-4.7-flash
Type a short prompt and verify you get a sensible response. If you're on a GPU, the reply should come back in well under a second.
You can also verify the model responds over the local HTTP API, which is how Claude Code will communicate with it:
curl http://localhost:11434/api/chat -d '{
  "model": "glm-4.7-flash",
  "messages": [{"role": "user", "content": "Hello!"}]
}'
A JSON response confirms the API is live and the model is loaded.
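The same request is easy to issue from Python with only the standard library. This sketch mirrors the curl call above; setting `"stream": false` asks Ollama for a single JSON object instead of newline-delimited chunks, which keeps the parsing trivial:

```python
import json
from urllib import request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint

def build_chat_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object back, not a chunk stream
    }

def chat(model: str, prompt: str, url: str = OLLAMA_CHAT_URL) -> str:
    """Send one chat turn and return the assistant's reply text."""
    body = json.dumps(build_chat_payload(model, prompt)).encode()
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        data = json.load(resp)
    return data["message"]["content"]

# Requires a running Ollama server:
# print(chat("glm-4.7-flash", "Hello!"))
```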
Step 3: Set an Appropriate Context Length
Claude Code and agentic workflows in general require a reasonable context window to function well. However, very large windows can cause two problems: inference slows dramatically, and the model sometimes enters repetitive thinking loops.
Testing across several sizes has shown 20,000 tokens to be a reliable sweet spot — large enough for multi-file coding tasks, without sacrificing generation speed.
Stop the running Ollama server with Ctrl + C, then restart it with the context length override:
OLLAMA_CONTEXT_LENGTH=20000 ollama serve
Confirm the setting is active in a new terminal:
ollama ps
The output should show the model running on your GPU with CONTEXT set to 20000:
NAME                   ID              SIZE    PROCESSOR    CONTEXT
glm-4.7-flash:latest   d1a8a26252f1    21 GB   100% GPU     20000
Step 4: Install Claude Code
Claude Code is a terminal-based coding agent built by Anthropic. You interact with it in natural language, and it handles writing, editing, refactoring, and executing code as part of multi-step workflows.
Install it with the official script:
curl -fsSL https://claude.ai/install.sh | bash
Once complete, the claude command will be available in your terminal.
Step 5: Connect Claude Code to Ollama
Navigate to your project directory:
mkdir my-local-project
cd my-local-project
The recommended way to launch Claude Code with Ollama is to use the built-in launch command, which automatically configures the API routing for you:
ollama launch claude --model glm-4.7-flash
Alternatively, you can configure the environment variables manually and then run claude directly:
# Linux / macOS
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
claude
# Windows (PowerShell)
$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_API_KEY = ""
claude
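If you launch Claude Code from a wrapper script rather than an interactive shell, the same three variables can be injected per-process. The helper below is a hypothetical convenience, not part of either tool; it copies the current environment, applies the overrides, and would be passed to `subprocess.run`:

```python
import os

def ollama_claude_env(base_url: str = "http://localhost:11434") -> dict:
    """Environment for pointing Claude Code at a local Ollama server.

    The auth token only needs a non-empty placeholder value, and the
    API key is blanked so a real cloud key is never picked up.
    """
    env = dict(os.environ)  # start from the current environment
    env.update({
        "ANTHROPIC_BASE_URL": base_url,
        "ANTHROPIC_AUTH_TOKEN": "ollama",
        "ANTHROPIC_API_KEY": "",
    })
    return env

# Launch with the overrides applied to this process only:
# import subprocess
# subprocess.run(["claude"], env=ollama_claude_env())
```

This keeps the overrides scoped to one invocation instead of polluting your shell profile.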
Once the Claude Code interface opens in your terminal, confirm it is pointing to your local model:
/model
If the output shows glm-4.7-flash, the setup is complete.
Step 6: Run Your First Agentic Task
With everything wired up, Claude Code will respond using your locally running model. Start with a simple greeting to confirm response speed; on a GPU, replies should be near-instantaneous.
For a more realistic test, ask Claude Code to build something complete:
"Build a command-line Snake game in Python."
Before it generates code, enable Planning Mode by pressing Shift + Tab twice. The model will outline its approach first — you can review the plan, ask for adjustments, and then tell it to proceed. Claude Code will create the required files and provide instructions for running the result.
Bonus: Use a Local GGUF File Directly
If you already have a GGUF model file downloaded and want to skip re-downloading via Ollama, you can register it manually with a Modelfile:
FROM ./glm-4.7-flash.gguf
PARAMETER temperature 0.8
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.0
Register it:
ollama create glm-4.7-flash-local -f Modelfile
Then run it like any other Ollama model:
ollama run glm-4.7-flash-local
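If you register local GGUF files often, generating the Modelfile from a script avoids copy-paste mistakes. This is a small sketch using the parameter values from the example above (the output path and defaults are illustrative, not anything Ollama requires):

```python
from pathlib import Path

def write_modelfile(gguf_path: str, out: str = "Modelfile",
                    temperature: float = 0.8, top_p: float = 0.95,
                    repeat_penalty: float = 1.0) -> str:
    """Write an Ollama Modelfile pointing at a local GGUF file.

    Returns the Modelfile text so callers can log or inspect it.
    """
    text = (
        f"FROM {gguf_path}\n"
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER top_p {top_p}\n"
        f"PARAMETER repeat_penalty {repeat_penalty}\n"
    )
    Path(out).write_text(text)
    return text

# Then register it as before:
#   ollama create glm-4.7-flash-local -f Modelfile
```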
Which Models Work Best?
Not every local model handles agentic workflows cleanly. Tool calling and multi-step planning put real demands on the model's instruction-following ability. As of early 2026, the community-tested options are:
| Model | Strengths |
|---|---|
| GLM 4.7 Flash | Very fast on GPU, 128k context, great tool calling |
| Qwen 2.5 Coder (32B/7B) | Best open-source coding reasoning overall |
| Codestral | Strong on Python and complex logic; heavier on VRAM |
Start with GLM 4.7 Flash if you want the fastest time-to-working-setup. Move to Qwen 2.5 Coder if you need stronger reasoning on complex refactoring tasks.
What You've Built
At the end of this setup you have a fully local, fully private coding agent that:
- Runs inference on your own hardware with no API calls leaving your machine
- Uses the same Claude Code interface and workflow patterns as the cloud version
- Can be pointed at any Ollama-compatible model with a single flag change
The total setup time, excluding model download, is around five minutes. If you're working in an environment with poor connectivity or handling sensitive codebases, this is a practical and reliable alternative to cloud-hosted agents.