How to Run Local LLMs Securely Using Ollama

April 10, 2026 · Guide
AI Mastery Architect, Lead Systems Engineer
RAG · CUDA · LLM Ops · Agentic Systems

While proprietary models like GPT-4o are incredibly powerful, they require sending every prompt over the internet to a third-party server. For enterprises handling PII, confidential trade secrets, or regulated healthcare data, that is often a hard dealbreaker.

The solution is to run open-weights models entirely on your own hardware. This guide covers the complete picture: installing Ollama, selecting the right model and quantization level for your hardware, customizing model behavior with a Modelfile, querying the built-in REST API, and integrating with Python and Node.js applications.


Prerequisites

Before you begin, confirm your environment meets the minimum requirements.

Hardware:

  • 8 GB RAM — runs 3B–4B models comfortably on CPU
  • 16 GB RAM — runs 7B–8B models on CPU (slow but functional)
  • 8 GB VRAM — runs 7B–8B models at full GPU speed (recommended minimum for a good experience)
  • 16–24 GB VRAM — runs 13B–32B models at GPU speed
  • ~10–25 GB of free disk space per model (depending on size and quantization)

Supported GPU backends:

  • NVIDIA (CUDA 11.3+)
  • AMD (ROCm 5.7+)
  • Apple Silicon (Metal via llama.cpp)

CPU-only inference works but is noticeably slower — expect 3–8 tokens per second on a modern desktop versus 60–120+ tokens per second with a GPU.

Software:

  • macOS 12+, Ubuntu 20.04+, or Windows 10/11
  • curl for the Linux installer or the desktop installer for macOS/Windows

Step 1: Install Ollama

Ollama handles model download, GGUF quantization management, and hardware acceleration configuration automatically — there is nothing to configure manually.

Linux (one-liner):

curl -fsSL https://ollama.com/install.sh | sh

macOS / Windows: Download the installer from ollama.com. On macOS, Ollama installs as a standard application and runs as a background menu-bar app. The Windows installer handles PATH registration automatically.

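After installation, confirm the CLI is installed and the server is running:

ollama --version
# ollama version is 0.x.y

If the shell reports "command not found", the binary is not on your PATH; open a new terminal or re-run the installer. If the output instead warns that it could not connect to a running Ollama instance, start the server manually in one terminal:

ollama serve

Then re-run the version check in a second terminal. Once it prints without warnings, the HTTP server is live on localhost:11434.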
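A script can run the same check before it sends any prompts. The sketch below assumes the default address used in this guide and queries Ollama's /api/version endpoint; the helper name `ollama_version` is hypothetical, not part of any Ollama library.

```python
import json
import urllib.request


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the server's version string, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors
        # (urllib's URLError subclasses it); ValueError covers a non-JSON
        # reply from some other service listening on the port.
        return None
```

Calling `ollama_version()` at application startup gives a clear failure mode: a `None` result means the server needs to be started with `ollama serve` before anything else will work.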


Step 2: Choose a Model and Quantization Level

Ollama models are distributed as GGUF files — a binary format that bundles weights and metadata into a single self-contained file. The quantization level you choose determines the trade-off between model quality and hardware requirements.

Quantization Quick Reference

| Quant | Bits per Weight | Quality vs. FP16 | Min VRAM (7B) | Best For |
|---|---|---|---|---|
| Q2_K | 2–3 bit | Noticeable degradation | ~3 GB | Extremely memory-constrained devices |
| Q4_K_M | 4 bit (medium) | ~1–2% perplexity loss | ~5 GB | Best balance of size and quality; recommended default |
| Q5_K_M | 5 bit (medium) | ~0.5% perplexity loss | ~6 GB | When you have the VRAM and want near-lossless quality |
| Q8_0 | 8 bit | Negligible loss (<0.1%) | ~9 GB | GPU servers with 24 GB+ VRAM |
| F16 | 16 bit (full) | Reference quality | ~16 GB | Benchmarking and fine-tune serving |
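The VRAM column can be approximated from first principles: weight memory is roughly parameter count times bits per weight divided by 8, plus some allowance for the KV cache and runtime buffers. The sketch below is a back-of-the-envelope estimate only. The effective bits-per-weight values and the flat 1 GB overhead are assumptions chosen for illustration, not figures published by Ollama.

```python
def estimate_model_gb(params_billion: float, bits_per_weight: float,
                      overhead_gb: float = 1.0) -> float:
    """Approximate memory: weights plus a flat allowance for KV cache and buffers."""
    # 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB simplifies to this:
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


# Assumed effective bits per weight. K-quants mix precisions across tensors,
# so the effective figure differs from the nominal bit width in the name.
ASSUMED_BITS = {"Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q8_0": 8.5, "F16": 16.0}

if __name__ == "__main__":
    for name, bits in ASSUMED_BITS.items():
        print(f"7B {name}: ~{estimate_model_gb(7, bits):.1f} GB")
```

Longer context windows grow the KV cache well beyond a flat allowance, so treat the result as a lower bound when you raise num_ctx.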

Model Recommendations by Use Case

| Model | Size | Strengths | Context Window |
|---|---|---|---|
| Llama 3.2 | 1B / 3B | General-purpose chat, fast on CPU | 128k |
| Qwen 2.5 Coder | 7B / 32B | Strong open-weight coding and reasoning | 128k |
| Mistral 7B | 7B | Instruction-following, summarization | 32k |
| Gemma 3 | 4B / 12B / 27B | Multilingual, multimodal (vision) | 128k |
| DeepSeek-R1 | 8B / 32B / 70B | Chain-of-thought reasoning, math | 64k |
| Phi-4 Mini | 3.8B | Fastest on CPU; surprisingly capable | 128k |

Match Your Hardware Before You Pull

Before running any ollama pull command, check how much RAM or VRAM your machine has — this determines the largest model you can run at full GPU speed.

| Your Hardware | Max Comfortable Fit | Recommended Pull | What to Expect |
|---|---|---|---|
| 8 GB RAM, no dedicated GPU | 3B–4B model (Q4) | ollama pull llama3.2:3b or phi4-mini | CPU inference, ~4–8 tok/s. Fully functional for everyday tasks. |
| 16 GB RAM, no dedicated GPU | 7B–8B model (Q4) | ollama pull llama3.1:8b or mistral | CPU inference, ~3–6 tok/s. Good quality, patient workflow. |
| 8 GB VRAM (e.g. RTX 3070/4060) | 7B–8B model (Q4_K_M) | ollama pull llama3.1:8b or qwen2.5-coder:7b | Full GPU, 40–80 tok/s. Excellent for development. |
| 16–24 GB VRAM (e.g. RTX 3090/4090) | 13B–32B model (Q4_K_M) | ollama pull qwen2.5-coder:32b or deepseek-r1:14b | Full GPU, 30–60 tok/s. Production-quality reasoning. |
| 40–80 GB VRAM (e.g. A100, H100, dual 4090) | 70B model (Q4_K_M or Q8) | ollama pull llama3.3:70b or deepseek-r1:70b | Full GPU, 20–40 tok/s. Near-frontier open-weight quality. |
| Apple Silicon (M1/M2/M3, unified memory) | Up to ~60–70% of total RAM | M1 8 GB → 3B; M2 16 GB → 7B; M3 Max 64 GB → 34B | Metal backend, 30–80 tok/s. Very efficient per-watt. |

[!CAUTION] If you pull a model that exceeds your available VRAM, Ollama will not crash — but it will silently split the model layers between your GPU and system RAM. This is called CPU offloading and causes a dramatic speed drop (sometimes 5–10× slower) that can make the model feel unresponsive. If generation is very slow (< 3 tokens/second), the model is almost certainly too large for your hardware. Pull a smaller quantization or a smaller model variant instead.

To check how much VRAM you have before pulling:

# NVIDIA
nvidia-smi --query-gpu=name,memory.total --format=csv

# macOS (shows unified memory)
system_profiler SPHardwareDataType | grep Memory

# Linux (CPU RAM)
free -h
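The NVIDIA check can also be scripted and mapped onto the size bands from the hardware table. The sketch below is one way to do that under stated assumptions: the helper names are hypothetical, the band thresholds simply restate the table, and nvidia-smi is called with its `csv,noheader,nounits` format so that memory is reported as a bare MiB number.

```python
import subprocess


def parse_gpu_csv(text: str) -> list[tuple[str, int]]:
    """Parse 'name, memory_in_MiB' lines into (name, MiB) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        name, _, mem = line.rpartition(",")
        gpus.append((name.strip(), int(mem.strip())))
    return gpus


def suggest_size(vram_mib: int) -> str:
    """Map total VRAM to the model-size bands in the hardware table."""
    gib = round(vram_mib / 1024)  # cards often report slightly under nominal size
    if gib >= 40:
        return "70B (Q4_K_M)"
    if gib >= 16:
        return "13B-32B (Q4_K_M)"
    if gib >= 8:
        return "7B-8B (Q4_K_M)"
    return "3B-4B (Q4), possibly with CPU offload"


def detect_gpus() -> list[tuple[str, int]]:
    """Run nvidia-smi; return an empty list when it is missing or fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(out)


if __name__ == "__main__":
    for name, mib in detect_gpus():
        print(f"{name}: {mib} MiB -> {suggest_size(mib)}")
```

On machines without an NVIDIA driver the script prints nothing, which is a reasonable signal to fall back to the CPU or Apple Silicon rows of the table.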

Step 3: Pull a Model

Pull Llama 3.2 (defaults to Q4_K_M quantization, ~2.0 GB):

ollama pull llama3.2

To pull a specific quantization variant explicitly:

# Pull the higher-quality Q8 variant of the 3B model
# (tag names vary per model; check the model's Tags page on ollama.com)
ollama pull llama3.2:3b-instruct-q8_0

# Pull a 70B model for a high-VRAM server
# Note: use llama3.3 here — the llama3.2 series only ships 1B and 3B
ollama pull llama3.3:70b

Ollama shows a progress bar during the download. Once complete, the weights are cached in ~/.ollama/models/ (or /usr/share/ollama/.ollama/models/ when the Linux installer runs Ollama as a systemd service) and reused on all subsequent runs.

List all locally available models at any time:

ollama list

Step 4: Run a Model

The fastest way to interact is the built-in REPL:

ollama run llama3.2

You will be dropped into an interactive terminal session. The model runs entirely on your local RAM/VRAM — no packet ever leaves your network interface.

>>> Summarize the key risks in our Q4 financial report.
[paste any confidential data here safely]

To exit the REPL: type /bye or press Ctrl + D.

Useful REPL Commands

| Command | Effect |
|---|---|
| /show info | Display model architecture and parameter count |
| /show modelfile | Print the active Modelfile configuration |
| /set parameter temperature 0.2 | Adjust temperature on the fly |
| /save my-session | Save the current session as a new model named my-session |

Step 5: Customize Behavior with a Modelfile

A Modelfile lets you create a persistent, named variant of any base model with a custom system prompt, temperature, and stop tokens — equivalent to a deployed "assistant persona."

Create a file called Modelfile:

FROM llama3.2

# Keep responses focused and technical
SYSTEM """
You are a senior backend engineer specializing in distributed systems.
Answer all questions with production-grade code examples.
Avoid hand-wavy explanations — be specific and precise.
"""

# Lower temperature = more deterministic output
PARAMETER temperature 0.1

# Stop generation if the model starts writing the next speaker turn itself
PARAMETER stop "Human:"
PARAMETER stop "User:"

# Increase context window for long code reviews
PARAMETER num_ctx 32768

Register and run your custom persona:

ollama create senior-engineer -f Modelfile
ollama run senior-engineer

Your custom model now appears in ollama list alongside the base models and can be used identically via the REST API.
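When several personas share one base model, it can be convenient to generate the Modelfile text from code rather than maintain each file by hand. The sketch below is a minimal illustration that uses only the FROM, SYSTEM, and PARAMETER instructions shown above; `render_modelfile` is a hypothetical helper, and it assumes the system prompt does not itself contain a triple-quote sequence.

```python
def render_modelfile(base: str, system: str, parameters: dict[str, object],
                     stops: tuple[str, ...] = ()) -> str:
    """Build Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}", "", f'SYSTEM """\n{system.strip()}\n"""', ""]
    lines += [f"PARAMETER {key} {value}" for key, value in parameters.items()]
    lines += [f'PARAMETER stop "{stop}"' for stop in stops]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        base="llama3.2",
        system="You are a senior backend engineer specializing in distributed systems.",
        parameters={"temperature": 0.1, "num_ctx": 32768},
        stops=("Human:", "User:"),
    )
    print(text)
```

Write the returned text to a file and register it with `ollama create <name> -f <path>` exactly as in the manual workflow.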


Step 6: Using the REST API

Ollama exposes a local HTTP server on port 11434 by default. This is the integration surface for all applications, scripts, and tools.

Generate (single-turn)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain the CAP theorem in two sentences.",
  "stream": false
}'

Response:

{
  "model": "llama3.2",
  "response": "The CAP theorem states that a distributed system can guarantee at most two of three properties simultaneously: Consistency, Availability, and Partition Tolerance. In practice, network partitions are unavoidable, so engineers must choose between consistency (CP systems like ZooKeeper) or availability (AP systems like Cassandra).",
  "done": true,
  "total_duration": 1420803000
}
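The same call can be made from Python with only the standard library. Alongside the fields shown in the abbreviated response, a completed reply also reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds), which is enough to measure throughput on your own hardware. The helper names below are hypothetical and the sketch assumes the default local address.

```python
import json
import urllib.request


def generate(prompt: str, model: str = "llama3.2",
             base_url: str = "http://localhost:11434") -> dict:
    """POST a non-streaming request to /api/generate and return the parsed reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def tokens_per_second(reply: dict) -> float:
    """Generation speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    duration_ns = reply.get("eval_duration", 0)
    if not duration_ns:
        return 0.0
    return reply.get("eval_count", 0) / (duration_ns / 1e9)
```

Comparing `tokens_per_second(generate("..."))` against the ranges in the hardware table is a quick way to spot unintended CPU offloading.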

Chat (multi-turn with history)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "system",    "content": "You are a Rust expert." },
    { "role": "user",      "content": "What is ownership?" },
    { "role": "assistant", "content": "Ownership is the memory management model in Rust..." },
    { "role": "user",      "content": "How does it relate to borrowing?" }
  ],
  "stream": false
}'
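The chat endpoint is stateless: the server does not remember earlier turns, so the client must resend the full message list on every request. The sketch below shows one way to manage that history. `ChatSession` and `post_chat` are hypothetical names, and the transport is injectable so the class can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable


def post_chat(payload: dict, base_url: str = "http://localhost:11434") -> dict:
    """POST a payload to /api/chat and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/api/chat", data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


class ChatSession:
    """Keeps the running message history and resends it with each turn."""

    def __init__(self, model: str, system: str,
                 send: Callable[[dict], dict] = post_chat):
        self.model = model
        self.send = send
        self.messages = [{"role": "system", "content": system}]

    def ask(self, text: str) -> str:
        self.messages.append({"role": "user", "content": text})
        reply = self.send(
            {"model": self.model, "messages": self.messages, "stream": False}
        )
        answer = reply["message"]["content"]
        # Store the assistant turn so the next request carries full context.
        self.messages.append({"role": "assistant", "content": answer})
        return answer
```

Because history grows with every turn, long sessions eventually exceed the context window; trimming or summarizing old turns is left to the caller.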

Streaming Responses

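Set "stream": true (or omit it; streaming is the default) to receive the response incrementally. The native /api/generate and /api/chat endpoints stream newline-delimited JSON rather than Server-Sent Events: each line is a JSON object containing a partial response and a done flag that flips to true on the final chunk: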

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a Python async HTTP client.",
  "stream": true
}'
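On the native endpoints each streamed chunk arrives as one JSON object per line, so a client only needs to read lines and decode them. The sketch below separates the parsing (`iter_tokens`) from the network call (`stream_generate`); both names are hypothetical and the default address is assumed.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a line-delimited JSON stream until done is true."""
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


def stream_generate(prompt: str, model: str = "llama3.2",
                    base_url: str = "http://localhost:11434") -> None:
    """Print tokens from /api/generate as they arrive."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The HTTP response object iterates line by line.
        for token in iter_tokens(resp):
            print(token, end="", flush=True)
    print()
```

For /api/chat the fragment lives under `message.content` instead of `response`, so the parser would need a small change there.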

Model Management Endpoints

# List all local models
curl http://localhost:11434/api/tags

# Check which models are currently loaded in memory
curl http://localhost:11434/api/ps

# Delete a model
curl -X DELETE http://localhost:11434/api/delete -d '{"model": "llama3.2"}'
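The listing endpoint returns JSON that is easy to summarize in a script, for example to audit which models and quantizations are installed on a shared server. The sketch below assumes the /api/tags response carries a `models` array whose entries include `name`, `size` (bytes), and `details.quantization_level`; the helper names are hypothetical.

```python
import json
import urllib.request


def summarize_models(payload: dict) -> list[str]:
    """Format each model entry as 'name  size  quantization'."""
    rows = []
    for entry in payload.get("models", []):
        details = entry.get("details", {})
        size_gb = entry.get("size", 0) / 1e9
        quant = details.get("quantization_level", "?")
        rows.append(f'{entry["name"]}  {size_gb:.1f} GB  {quant}')
    return rows


def list_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Fetch /api/tags from a running server and summarize it."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return summarize_models(json.load(resp))
```

`print("\n".join(list_models()))` gives roughly the same information as `ollama list`, but in a form other tooling can consume.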

Step 7: OpenAI-Compatible API Mode

Ollama implements the OpenAI Chat Completions API at /v1/chat/completions. This means any library or tool written for OpenAI — including the official Python and JS SDKs — works with Ollama with a two-line configuration change.

This is critical for integrating Ollama into existing applications without any code-level rewrites.

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a privacy-focused security auditor."},
        {"role": "user",   "content": "Review the following authentication code for vulnerabilities:\n\n[paste code]"},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)

Node.js / TypeScript (openai SDK)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1/",  // Ollama's OpenAI-compatible endpoint
  apiKey: "ollama",
});

async function reviewCode(code: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "qwen2.5-coder:7b",
    messages: [
      { role: "system",  content: "You are an expert TypeScript developer." },
      { role: "user",    content: `Refactor this for readability:\n\n${code}` },
    ],
    stream: false,
  });
  return response.choices[0].message.content ?? "";
}

LangChain Integration

from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOllama(
    model="llama3.2",
    base_url="http://localhost:11434",
    temperature=0.0,
)

messages = [
    SystemMessage(content="You are a data privacy expert."),
    HumanMessage(content="What are GDPR's key obligations for data processors?"),
]

response = llm.invoke(messages)
print(response.content)

Step 8: Multimodal Vision Models

Ollama also supports vision-capable models that accept images as part of the prompt — useful for document analysis, diagram understanding, and screenshot debugging.

Pull a vision model:

ollama pull gemma3:4b     # Google's Gemma 3 (4B, vision-enabled)
ollama pull llava:13b     # LLaVA — the original open vision model

Send an image via the API:

# Linux: use -w0 to suppress line wrapping
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Identify any security concerns in this system architecture diagram.",
  "images": ["'$(base64 -w0 architecture.png)'"]
}'

# macOS: BSD base64 has no -w flag; read the file with -i and strip any newlines with tr
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Identify any security concerns in this system architecture diagram.",
  "images": ["'$(base64 -i architecture.png | tr -d '\n')'"]
}'

Or via Python with the openai SDK (install first: pip install openai):

import base64
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)

with open("architecture.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemma3:4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",      "text": "What security risks do you see in this diagram?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
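The native /api/generate request from the curl examples can also be built in Python, which sidesteps shell quoting and the platform differences between base64 implementations. The sketch below only constructs the request body; `build_vision_payload` is a hypothetical helper, and sending it works the same way as any other POST to the endpoint.

```python
import base64
import json


def build_vision_payload(model: str, prompt: str, image_bytes: bytes) -> bytes:
    """Return a JSON request body for /api/generate with one inline image."""
    payload = {
        "model": model,
        "prompt": prompt,
        # b64encode never inserts line breaks, so no tr or -w0 step is needed.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return json.dumps(payload).encode()


if __name__ == "__main__":
    body = build_vision_payload("gemma3:4b", "Describe this image.", b"\x89PNG")
    print(len(body), "bytes")
```

Unlike the OpenAI-compatible route, the native payload takes the raw base64 string without a `data:image/png;base64,` prefix.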

Step 9: Performance Tuning

Environment Variables

Control Ollama's runtime behavior via environment variables before starting the server:

# Run on GPU 0 only (useful on multi-GPU machines)
CUDA_VISIBLE_DEVICES=0 ollama serve

# Increase context window globally (default is 4096)
OLLAMA_CONTEXT_LENGTH=32768 ollama serve

# Keep models loaded in VRAM indefinitely (avoids reload latency)
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Allow access from other machines on the network
OLLAMA_HOST=0.0.0.0:11434 ollama serve

Verify GPU Offload

After starting a model, check how it is split between GPU and CPU:

ollama ps
NAME                    ID              SIZE     PROCESSOR    CONTEXT
llama3.2:latest         a80c4f17acd5    2.0 GB   100% GPU     8192

If the PROCESSOR column shows 100% CPU or a split such as 48%/52% CPU/GPU, the model does not fit entirely in VRAM (or Ollama did not detect your GPU). Either use a smaller quantization (Q4 instead of Q8) or a smaller model variant.
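The same information is available over HTTP from /api/ps, which makes it possible to alert on partial offload in a monitoring script. The sketch below assumes each entry in the response reports `size` (total bytes) and `size_vram` (bytes resident on the GPU); the function names and the 99.9% threshold are illustrative choices.

```python
def gpu_fraction(entry: dict) -> float:
    """Share of a loaded model that resides in VRAM, from 0.0 to 1.0."""
    size = entry.get("size", 0)
    return entry.get("size_vram", 0) / size if size else 0.0


def offload_report(payload: dict) -> list[str]:
    """One status line per loaded model in an /api/ps response."""
    report = []
    for entry in payload.get("models", []):
        frac = gpu_fraction(entry)
        if frac >= 0.999:
            status = "fully on GPU"
        else:
            status = f"{frac:.0%} on GPU, rest on CPU"
        report.append(f'{entry["name"]}: {status}')
    return report


if __name__ == "__main__":
    sample = {"models": [{"name": "llama3.2:latest",
                          "size": 2_000_000_000, "size_vram": 2_000_000_000}]}
    print("\n".join(offload_report(sample)))
```

Feed it the parsed JSON from `curl http://localhost:11434/api/ps` on a live server in place of the sample dictionary.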

Concurrent Request Handling

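Ollama can serve several requests to the same loaded model in parallel. The default limit depends on the Ollama version and available memory (recent releases have used 1 or 4), and each parallel slot needs its own share of context memory. For high-throughput API use, set the concurrency limit explicitly: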

OLLAMA_NUM_PARALLEL=8 ollama serve
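On the client side, the server-side limit only helps if requests actually arrive concurrently. The sketch below fans prompts out over a thread pool; `run_parallel` and `send_prompt` are hypothetical helpers, and the worker count of 8 is chosen only to match the example setting above.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def send_prompt(prompt: str, model: str = "llama3.2",
                base_url: str = "http://localhost:11434") -> str:
    """Blocking, non-streaming call to /api/generate; returns the response text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


def run_parallel(prompts: list[str], send: Callable[[str], str] = send_prompt,
                 workers: int = 8) -> list[str]:
    """Send prompts concurrently; results keep the order of the input list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))
```

Requests beyond the server's parallel limit are queued rather than rejected, so setting `workers` higher than OLLAMA_NUM_PARALLEL adds latency but not errors in typical use.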

Step 10: Network Exposure and Security

By default, Ollama listens on 127.0.0.1:11434 — local-only, not accessible from other machines. This is the secure default for single-developer use.

To expose Ollama on your LAN (for shared team access or a home server):

OLLAMA_HOST=0.0.0.0:11434 ollama serve

[!WARNING] Do not expose Ollama directly to the public internet. The API has no built-in authentication. If you need external access, put Nginx or Caddy in front with HTTPS and HTTP Basic Auth, or use a VPN tunnel (WireGuard, Tailscale) to restrict access to trusted peers only.

For fine-grained access control in team environments, consider wrapping Ollama with Open WebUI — an open-source frontend that adds user accounts, model-level permissions, and an audit log.
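A deployment script can guard against accidental exposure by checking the bind address before the server starts. The sketch below classifies an OLLAMA_HOST value as loopback-only or not; `is_loopback_bind` is a hypothetical helper, it treats an unset value as Ollama's local-only default, and it conservatively reports any hostname it cannot parse as an IP address (other than localhost) as non-loopback.

```python
import ipaddress
import os
from urllib.parse import urlsplit


def is_loopback_bind(ollama_host: str) -> bool:
    """True if the OLLAMA_HOST value binds only to the local machine."""
    value = ollama_host.strip() or "127.0.0.1:11434"  # unset means the default
    if "://" not in value:
        value = "http://" + value  # give urlsplit a scheme so it finds the host
    host = urlsplit(value).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # unknown hostname or empty host: assume exposed


if __name__ == "__main__":
    configured = os.environ.get("OLLAMA_HOST", "")
    if not is_loopback_bind(configured):
        print("Warning: Ollama is reachable from other machines; "
              "put a proxy with authentication or a VPN in front of it.")
```

Running this check in CI or a container entrypoint catches a stray `0.0.0.0` setting before the unauthenticated API becomes reachable.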


What You've Built

At the end of this guide, you have a fully private AI backend with no per-token fees that:

  • Runs open-weight models with GPU acceleration on your own hardware
  • Exposes a REST API that any application can call with no external dependencies
  • Is compatible with the OpenAI SDK — requiring only a two-line base_url change in existing code
  • Can serve multimodal (vision + text) requests using the same interface
  • Keeps all prompts, context, and outputs off third-party servers permanently

Whether you are processing confidential financial records, reviewing proprietary source code, or building a privacy-first product, Ollama gives you state-of-the-art language model capabilities with full auditability and zero egress risk.
