Run Claude Code Locally with Ollama: A Complete Setup Guide

April 29, 2026

Why Run an AI Coding Agent Locally?

Most terminal-based coding agents phone home to a cloud API. That works fine until you're on a flaky connection, need to keep proprietary code off external servers, or simply want to avoid per-token billing on long agentic sessions.

Running Claude Code against a local Ollama server eliminates all three problems. Your code stays on your machine, inference is fast (especially on a GPU), and the workflow is identical to the cloud-backed version. This guide walks through the complete setup — from installing the runtime to running a multi-step coding task.


Prerequisites

Before you start, make sure your system has:

Hardware:

  • NVIDIA GPU with at least 16 GB VRAM (24 GB recommended for larger context windows)
  • 16–32 GB system RAM
  • ~25 GB free disk space

If you have no GPU, inference will fall back to CPU, which is noticeably slower but still functional.

Software:

  • Linux or macOS (Windows users: use WSL2 with GPU passthrough)
  • NVIDIA driver and CUDA Toolkit installed (version 13.1 or compatible)
  • Node.js (required for the Claude Code installer)

Verify your GPU setup before proceeding:

nvidia-smi

You should see your GPU model, available VRAM, and the active CUDA version listed in the output.
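
While you're at it, confirm Node.js is available, since the Claude Code installer depends on it (any recent LTS release should be fine):

node --version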


Step 1: Install Ollama

Ollama is the local runtime that downloads, manages, and serves models. It also exposes an HTTP API that external tools — including Claude Code — can talk to.

On Linux, a single command handles the full installation:

curl -fsSL https://ollama.com/install.sh | sh

On macOS and Windows, download the installer from ollama.com and follow the on-screen instructions. Ollama runs as a background service and checks for updates automatically.
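
On Linux, the script also registers Ollama as a systemd service. To confirm the background service is up (assuming the default unit name, ollama), you can ask systemd directly:

# check the service state
systemctl status ollama
# confirm it is set to start on boot
systemctl is-enabled ollama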

Once installed, confirm the version:

ollama -v

If the command isn't found at all, open a new terminal so your PATH picks up the fresh install. If the version prints but with a warning that it could not connect to a running Ollama instance, the server isn't up yet. Launch it manually in one terminal:

ollama serve

Then run ollama -v in a second terminal. Once the version prints cleanly, Ollama is ready.


Step 2: Pull and Test the Model

With Ollama running, download a local model. GLM 4.7 Flash is a strong choice for agentic coding — it's fast on GPU and supports a large context window.

ollama pull glm-4.7-flash
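
To confirm the download landed, list the models stored locally:

ollama list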

After the download completes, run a quick sanity check in interactive mode:

ollama run glm-4.7-flash

Type a short prompt and verify you get a sensible response. If you're on a GPU, the reply should come back in well under a second.
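
When you're done, exit the interactive session with the built-in command:

/bye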

You can also verify the model responds over the local HTTP API, which is how Claude Code will communicate with it:

curl http://localhost:11434/api/chat -d '{
  "model": "glm-4.7-flash",
  "messages": [{"role": "user", "content": "Hello!"}]
}'

A JSON response confirms the API is live and the model is loaded.
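
By default, /api/chat streams the reply as a sequence of JSON objects. For quick checks it can be easier to disable streaming and get a single response object back; Ollama's chat API accepts a stream field for this:

curl http://localhost:11434/api/chat -d '{
  "model": "glm-4.7-flash",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": false
}'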


Step 3: Set an Appropriate Context Length

Claude Code and agentic workflows in general require a reasonable context window to function well. However, very large windows can cause two problems: inference slows dramatically, and the model sometimes enters repetitive thinking loops.

Testing across several sizes has shown 20,000 tokens to be a reliable sweet spot — large enough for multi-file coding tasks, without sacrificing generation speed.

Stop the running Ollama server with Ctrl + C, then restart it with the context length override:

OLLAMA_CONTEXT_LENGTH=20000 ollama serve
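
The inline variable only applies to that one server process. If Ollama runs as a systemd service on Linux, a sketch for making the override persistent (assuming the default ollama unit) looks like this:

# open a drop-in override for the service
sudo systemctl edit ollama
# add these two lines in the editor, then save and exit:
#   [Service]
#   Environment="OLLAMA_CONTEXT_LENGTH=20000"
# reload the unit definitions and restart the service
sudo systemctl daemon-reload
sudo systemctl restart ollama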

Note that ollama ps only lists models that are currently loaded, so send the model a quick prompt first (for example with ollama run glm-4.7-flash), then check from a new terminal:

ollama ps

The output should show the model running on your GPU with CONTEXT set to 20000:

NAME                    ID              SIZE     PROCESSOR    CONTEXT
glm-4.7-flash:latest    d1a8a26252f1    21 GB    100% GPU     20000

Step 4: Install Claude Code

Claude Code is a terminal-based coding agent built by Anthropic. You interact with it in natural language, and it handles writing, editing, refactoring, and executing code as part of multi-step workflows.

Install it with the official script:

curl -fsSL https://claude.ai/install.sh | bash

Once complete, the claude command will be available in your terminal.
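
Confirm the binary is on your PATH before moving on:

claude --version
# optional: run Claude Code's built-in health check
claude doctor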


Step 5: Connect Claude Code to Ollama

Navigate to your project directory:

mkdir my-local-project
cd my-local-project

The recommended way to launch Claude Code with Ollama is to use the built-in launch command, which automatically configures the API routing for you:

ollama launch claude --model glm-4.7-flash

Alternatively, you can configure the environment variables manually and then run claude directly:

# Linux / macOS
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"
export ANTHROPIC_API_KEY=""
claude

# Windows (PowerShell)
$env:ANTHROPIC_BASE_URL = "http://localhost:11434"
$env:ANTHROPIC_AUTH_TOKEN = "ollama"
$env:ANTHROPIC_API_KEY = ""
claude
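
These exports only last for the current shell session. To avoid retyping them, you can append them to your shell profile (~/.bashrc is assumed here; adjust for your shell):

echo 'export ANTHROPIC_BASE_URL="http://localhost:11434"' >> ~/.bashrc
echo 'export ANTHROPIC_AUTH_TOKEN="ollama"' >> ~/.bashrc
echo 'export ANTHROPIC_API_KEY=""' >> ~/.bashrc
source ~/.bashrc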

Once the Claude Code interface opens in your terminal, confirm it is pointing to your local model:

/model

If the output shows glm-4.7-flash, the setup is complete.


Step 6: Run Your First Agentic Task

With everything wired up, Claude Code will respond using your locally running model. Start with a simple greeting to confirm response speed — on a GPU, replies should be near-instantaneous.

For a more realistic test, ask Claude Code to build something complete:

"Build a command-line Snake game in Python."

Before it generates code, enable Planning Mode by pressing Shift + Tab twice. The model will outline its approach first — you can review the plan, ask for adjustments, and then tell it to proceed. Claude Code will create the required files and provide instructions for running the result.


Bonus: Use a Local GGUF File Directly

If you already have a GGUF model file downloaded and want to skip re-downloading via Ollama, you can register it manually with a Modelfile:

FROM ./glm-4.7-flash.gguf

PARAMETER temperature 0.8
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.0

Register it:

ollama create glm-4.7-flash-local -f Modelfile

Then run it like any other Ollama model:

ollama run glm-4.7-flash-local
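
To double-check that the parameters from the Modelfile were picked up, inspect the registered model:

ollama show glm-4.7-flash-local
# print the Modelfile as Ollama stored it
ollama show glm-4.7-flash-local --modelfile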

Which Models Work Best?

Not every local model handles agentic workflows cleanly. Tool calling and multi-step planning put real demands on the model's instruction-following ability. As of early 2026, the community-tested options are:

| Model | Strengths |
|---|---|
| GLM 4.7 Flash | Very fast on GPU, 128k context, great tool calling |
| Qwen 2.5 Coder (32B/7B) | Best open-source coding reasoning overall |
| Codestral | Strong on Python and complex logic; heavier on VRAM |

Start with GLM 4.7 Flash if you want the fastest time-to-working-setup. Move to Qwen 2.5 Coder if you need stronger reasoning on complex refactoring tasks.
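
To try one of the alternatives, pull it and point Claude Code at it with the same --model flag. The tags below are the usual Ollama library names; check the library page for the exact tag and quantization you want:

ollama pull qwen2.5-coder:32b
ollama pull codestral
ollama launch claude --model qwen2.5-coder:32b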


What You've Built

At the end of this setup you have a fully local, fully private coding agent that:

  • Runs inference on your own hardware with no API calls leaving your machine
  • Uses the same Claude Code interface and workflow patterns as the cloud version
  • Can be pointed at any Ollama-compatible model with a single flag change

The total setup time, excluding model download, is around five minutes. If you're working in an environment with poor connectivity or handling sensitive codebases, this is a practical and reliable alternative to cloud-hosted agents.