What is Ollama and how does it work?

Ollama is an open-source runtime that packages large language model weights alongside a local inference server. It handles model downloading, GGUF quantization selection, GPU/CPU layer offloading, and exposes an OpenAI-compatible REST API at localhost:11434. Models run entirely on your hardware — no internet connection is required after the initial download.

What GPU do I need to run LLMs locally with Ollama?

Any NVIDIA GPU with 8GB+ VRAM can run 7B parameter models at usable speed. 16GB VRAM handles 13B–14B models comfortably. Apple Silicon Macs (M1/M2/M3) work well because VRAM is unified with system memory — an M2 Pro with 32GB can run 30B+ models. CPU-only inference works but is 10–20x slower.

Is Ollama free to use?

Yes. Ollama is fully open source (MIT license) and free to use. The models it runs are open-weight and free to download (Llama 3, Qwen3, Mistral, and others), though each ships under its own license, so check the terms before commercial use. There are no subscription fees, no API limits, and no usage tracking.

Can I use Ollama with Python or JavaScript?

Yes. Ollama provides official Python and JavaScript SDKs (pip install ollama / npm install ollama). It also exposes an OpenAI-compatible API, so any code using the OpenAI SDK works by changing the base_url to http://localhost:11434/v1 and the api_key to any placeholder string.

How to Run Local LLMs Securely Using Ollama

April 10, 2026 • Guide

AI Mastery Architect, Lead Systems Engineer

RAG, CUDA, LLM Ops, Agentic Systems

Prerequisites
Step 1: Install Ollama
Step 2: Choose a Model and Quantization Level
Quantization Quick Reference
Model Recommendations by Use Case
Match Your Hardware Before You Pull
Step 3: Pull a Model
Step 4: Run a Model
Useful REPL Commands
Step 5: Customize Behavior with a Modelfile
Step 6: Using the REST API
Generate (single-turn)
Chat (multi-turn with history)
Streaming Responses
Model Management Endpoints
Step 7: OpenAI-Compatible API Mode
Python (openai SDK)
Node.js / TypeScript (openai SDK)
LangChain Integration
Step 8: Multimodal Vision Models
Step 9: Performance Tuning
Environment Variables
Verify GPU Offload
Concurrent Request Handling
Step 10: Network Exposure and Security
What You've Built

While proprietary models like GPT-4o are incredibly powerful, they require sending every prompt over the internet to a third-party server. For enterprises handling PII, confidential trade secrets, or regulated healthcare data, that is often a hard dealbreaker.

The solution is to run open-weights models entirely on your own hardware. This guide covers the complete picture: installing Ollama, selecting the right model and quantization level for your hardware, customizing model behavior with a Modelfile, querying the built-in REST API, and integrating with Python and Node.js applications.

Prerequisites

Before you begin, confirm your environment meets the minimum requirements.

Hardware:

8 GB RAM — runs 3B–4B models comfortably on CPU
16 GB RAM — runs 7B–8B models on CPU (slow but functional)
8 GB VRAM — runs 7B–8B models at full GPU speed (recommended minimum for a good experience)
16–24 GB VRAM — runs 13B–32B models at GPU speed
~10–25 GB of free disk space per model (depending on size and quantization)

Supported GPU backends:

NVIDIA (CUDA 11.3+)
AMD (ROCm 5.7+)
Apple Silicon (Metal via llama.cpp)

CPU-only inference works but is noticeably slower — expect 3–8 tokens per second on a modern desktop versus 60–120+ tokens per second with a GPU.

Software:

macOS 12+, Ubuntu 20.04+, or Windows 10/11
curl for the Linux installer or the desktop installer for macOS/Windows

Step 1: Install Ollama

Ollama handles model download, GGUF quantization management, and hardware acceleration configuration automatically — there is nothing to configure manually.

Linux (one-liner):

curl -fsSL https://ollama.com/install.sh | sh

macOS / Windows: Download the installer from ollama.com. On macOS, Ollama installs as a standard application and runs as a background menu-bar app. The Windows installer handles PATH registration automatically.

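After installation, confirm the CLI is installed:

ollama --version
# ollama version is 0.x.y

A "command not found" error means the ollama binary is not on your PATH; open a new terminal or re-run the installer. If the command instead warns that it could not connect to a running Ollama instance, the background server is not running. Start it manually in one terminal: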

ollama serve

Then re-run the version check in a second terminal. Once it prints cleanly, the HTTP server is live on localhost:11434.
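The same liveness check can be scripted. The sketch below uses only the Python standard library and assumes the server answers on its version endpoint, /api/version; the function name ollama_version is made up for this example.

```python
import json
import urllib.request
from typing import Optional


def ollama_version(base_url: str = "http://localhost:11434",
                   timeout: float = 2.0) -> Optional[str]:
    """Return the server's version string, or None if nothing usable answers."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a non-JSON reply from some other service on the port.
        return None
    return payload.get("version") if isinstance(payload, dict) else None
```

Calling ollama_version() returns None while the server is down, which makes it usable as a readiness probe in a startup script.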

Step 2: Choose a Model and Quantization Level

Ollama models are distributed as GGUF files — a binary format that bundles weights and metadata into a single self-contained file. The quantization level you choose determines the trade-off between model quality and hardware requirements.

Quantization Quick Reference

Quant	Bits per Weight	Quality vs. FP16	Min VRAM (7B)	Best For
Q2_K	2–3 bit	Noticeable degradation	~3 GB	Extremely memory-constrained devices
Q4_K_M	4 bit (medium)	~1–2% perplexity loss	~5 GB	Best balance of size and quality — recommended default
Q5_K_M	5 bit (medium)	~0.5% perplexity loss	~6 GB	When you have the VRAM and want near-lossless quality
Q8_0	8 bit	Negligible loss (<0.1%)	~9 GB	GPU servers with 24 GB+ VRAM
F16	16 bit (full)	Reference quality	~16 GB	Benchmarking and fine-tune serving
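The VRAM column follows from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below is a rough estimator, not Ollama's own calculation. The flat 1 GB overhead for KV cache and runtime buffers is an assumed placeholder, and real K-quants use slightly more bits than their nominal width.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the quantized weights alone, in GB (1 GB = 10**9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.0) -> float:
    """Weights plus a flat allowance for KV cache and runtime buffers.

    overhead_gb is an assumed placeholder; real overhead grows with context length.
    """
    return estimate_weights_gb(params_billion, bits_per_weight) + overhead_gb


for label, bits in [("Q4 (nominal 4-bit)", 4), ("Q8 (nominal 8-bit)", 8), ("F16", 16)]:
    print(f"7B at {label}: ~{estimate_vram_gb(7, bits):.1f} GB")
```

For a 7B model this prints roughly 4.5, 8.0, and 15.0 GB, each within about 1 GB of the table's figures. Treat the result as a lower bound when you plan to use long contexts.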

Model Recommendations by Use Case

Model	Size	Strengths	Context Window
Llama 3.2	1B / 3B	General-purpose chat, fast on CPU	128k
Qwen 2.5 Coder	7B / 32B	Strong open-weight coding and reasoning	128k
Mistral 7B	7B	Instruction-following, summarization	32k
Gemma 3	4B / 12B / 27B	Multilingual, multimodal (vision)	128k
DeepSeek-R1	8B / 32B / 70B	Chain-of-thought reasoning, math	64k
Phi-4 Mini	3.8B	Fastest on CPU; surprisingly capable	128k

Match Your Hardware Before You Pull

Before running any ollama pull command, check how much RAM or VRAM your machine has — this determines the largest model you can run at full GPU speed.

Your Hardware	Max Comfortable Fit	Recommended Pull	What to Expect
8 GB RAM, no dedicated GPU	3B–4B model (Q4)	`ollama pull llama3.2:3b` or `phi4-mini`	CPU inference, ~4–8 tok/s. Fully functional for everyday tasks.
16 GB RAM, no dedicated GPU	7B–8B model (Q4)	`ollama pull llama3.1:8b` or `mistral`	CPU inference, ~3–6 tok/s. Good quality, patient workflow.
8 GB VRAM (e.g. RTX 3070/4060)	7B–8B model (Q4_K_M)	`ollama pull llama3.1:8b` or `qwen2.5-coder:7b`	Full GPU, 40–80 tok/s. Excellent for development.
16–24 GB VRAM (e.g. RTX 3090/4090)	13B–32B model (Q4_K_M)	`ollama pull qwen2.5-coder:32b` or `deepseek-r1:14b`	Full GPU, 30–60 tok/s. Production-quality reasoning.
40–80 GB VRAM (e.g. A100, H100, dual 4090)	70B model (Q4_K_M or Q8)	`ollama pull llama3.3:70b` or `deepseek-r1:70b`	Full GPU, 20–40 tok/s. Near-frontier open-weight quality.
Apple Silicon (M1/M2/M3, unified memory)	Up to ~60–70% of total RAM	M1 8 GB → 3B; M2 16 GB → 7B; M3 Max 64 GB → 34B	Metal backend, 30–80 tok/s. Very efficient per-watt.

Caution: If you pull a model that exceeds your available VRAM, Ollama will not crash, but it will silently split the model layers between your GPU and system RAM. This partial CPU offloading causes a dramatic speed drop (sometimes 5–10× slower) that can make the model feel unresponsive. If generation is very slow (< 3 tokens/second), the model is almost certainly too large for your hardware. Pull a smaller quantization or a smaller model variant instead.

To check how much VRAM you have before pulling:

# NVIDIA
nvidia-smi --query-gpu=name,memory.total --format=csv

# macOS (shows unified memory)
system_profiler SPHardwareDataType | grep Memory

# Linux (CPU RAM)
free -h
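If you want to automate the check, the nvidia-smi CSV output is easy to parse. The sketch below maps each GPU to the tiers from the hardware table in this guide; the function names are illustrative, and the tier for cards under 8 GB is an assumption borrowed from the 8 GB RAM row.

```python
import csv
import io


def parse_gpu_memory(csv_text: str) -> list[tuple[str, int]]:
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv` output.

    Returns (gpu_name, total_memory_mib) pairs, one per GPU.
    """
    rows = csv.reader(io.StringIO(csv_text.strip()), skipinitialspace=True)
    next(rows, None)  # skip the "name, memory.total [MiB]" header row
    gpus = []
    for row in rows:
        if len(row) < 2:
            continue
        name, memory = row[0].strip(), row[1].strip()
        gpus.append((name, int(memory.split()[0])))  # "8192 MiB" -> 8192
    return gpus


def suggest_model_size(vram_mib: int) -> str:
    """Map total VRAM to a model tier from the hardware table."""
    vram_gb = vram_mib / 1024
    if vram_gb >= 40:
        return "70B (Q4_K_M)"
    if vram_gb >= 16:
        return "13B-32B (Q4_K_M)"
    if vram_gb >= 8:
        return "7B-8B (Q4_K_M)"
    return "3B-4B (Q4)"  # assumed tier for cards below 8 GB
```

To use it on a live machine, capture the command output with subprocess.run(..., capture_output=True, text=True) and pass its stdout to parse_gpu_memory.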

Step 3: Pull a Model

Pull Llama 3.2 (defaults to Q4_K_M quantization, ~2.0 GB):

ollama pull llama3.2

To pull a specific quantization variant explicitly:

# Pull the higher-quality Q8 variant of the 3B model
# (exact tag names are listed on the model's Tags page at ollama.com/library)
ollama pull llama3.2:3b-instruct-q8_0

# Pull a 70B model for a high-VRAM server
# Note: use llama3.3 here — the llama3.2 series only ships 1B and 3B
ollama pull llama3.3:70b

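Ollama will show a progress bar during the download. Once complete, the weights are cached in ~/.ollama/models/ and reused on all subsequent runs. When Ollama runs as a Linux systemd service, the cache lives under the ollama service user's home directory instead, typically /usr/share/ollama/.ollama/models/.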

List all locally available models at any time:

ollama list

Step 4: Run a Model

The fastest way to interact is the built-in REPL:

ollama run llama3.2

You will be dropped into an interactive terminal session. Inference runs entirely in your local RAM/VRAM, so prompts and responses are never sent to an external service.

>>> Summarize the key risks in our Q4 financial report.
[paste any confidential data here safely]

To exit the REPL: type /bye or press Ctrl + D.

Useful REPL Commands

Command	Effect
`/show info`	Display model architecture and parameter count
`/show modelfile`	Print the active Modelfile configuration
`/set parameter temperature 0.2`	Adjust temperature on the fly
`/save my-session`	Save the current session as a new model named my-session

Step 5: Customize Behavior with a Modelfile

A Modelfile lets you create a persistent, named variant of any base model with a custom system prompt, temperature, and stop tokens — equivalent to a deployed "assistant persona."

Create a file called Modelfile:

FROM llama3.2

# Keep responses focused and technical
SYSTEM """
You are a senior backend engineer specializing in distributed systems.
Answer all questions with production-grade code examples.
Avoid hand-wavy explanations — be specific and precise.
"""

# Lower temperature = more deterministic output
PARAMETER temperature 0.1

# Stop generation if the model starts writing the next speaker turn itself
PARAMETER stop "Human:"
PARAMETER stop "User:"

# Increase context window for long code reviews
PARAMETER num_ctx 32768

Build the custom model from the Modelfile, then run it:

ollama create senior-engineer -f Modelfile
ollama run senior-engineer

Your custom model now appears in ollama list alongside the base models and can be used identically via the REST API.
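When you maintain several personas, generating Modelfiles from code keeps them consistent. The helper below is a hypothetical sketch, not part of Ollama. It emits only the FROM, SYSTEM, and PARAMETER instructions used above and assumes the system prompt contains no triple quotes.

```python
def render_modelfile(base: str, system: str,
                     parameters: list[tuple[str, object]]) -> str:
    """Build Modelfile text from a base model, a system prompt, and PARAMETER pairs."""
    lines = [f"FROM {base}", "", 'SYSTEM """', system.strip(), '"""', ""]
    for name, value in parameters:
        # Quote string values such as stop sequences; leave numbers bare.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"PARAMETER {name} {rendered}")
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    base="llama3.2",
    system="You are a senior backend engineer specializing in distributed systems.",
    parameters=[("temperature", 0.1), ("stop", "User:"), ("num_ctx", 32768)],
)
print(modelfile_text)
```

Write the returned text to a file named Modelfile and pass it to ollama create with the -f flag, as in the commands above.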

Step 6: Using the REST API

Ollama exposes a local HTTP server on port 11434 by default. This is the integration surface for all applications, scripts, and tools.

Generate (single-turn)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain the CAP theorem in two sentences.",
  "stream": false
}'

Response (abridged; the full reply also includes timing and token-count fields):

{
  "model": "llama3.2",
  "response": "The CAP theorem states that a distributed system can guarantee at most two of three properties simultaneously: Consistency, Availability, and Partition Tolerance. In practice, network partitions are unavoidable, so engineers must choose between consistency (CP systems like ZooKeeper) or availability (AP systems like Cassandra).",
  "done": true,
  "total_duration": 1420803000
}
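The same request can be made from Python without third-party packages. This is a minimal sketch that assumes the default address used in this guide; the helper names are illustrative. Note that Ollama reports durations such as total_duration in nanoseconds.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming POST to /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> dict:
    """Send the request and return the parsed JSON reply (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt), timeout=timeout) as resp:
        return json.load(resp)


def duration_seconds(reply: dict) -> float:
    """Convert the reply's total_duration from nanoseconds to seconds."""
    return reply["total_duration"] / 1e9
```

With the server running, generate("llama3.2", "Explain the CAP theorem in two sentences.")["response"] returns the generated text. For the sample reply above, duration_seconds gives about 1.42 seconds.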

Chat (multi-turn with history)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "system",    "content": "You are a Rust expert." },
    { "role": "user",      "content": "What is ownership?" },
    { "role": "assistant", "content": "Ownership is the memory management model in Rust..." },
    { "role": "user",      "content": "How does it relate to borrowing?" }
  ],
  "stream": false
}'
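The chat endpoint is stateless: the server does not remember earlier turns, so the client must resend the whole message list each time. The sketch below shows one way to keep that history. ChatSession and its send parameter are hypothetical names; send stands for any function that POSTs the payload to /api/chat and returns the parsed JSON reply.

```python
from typing import Callable

Message = dict[str, str]


class ChatSession:
    """Keeps the running message list that /api/chat expects on every call."""

    def __init__(self, model: str, system: str, send: Callable[[dict], dict]):
        self.model = model
        self.send = send  # injected transport: payload dict in, parsed reply dict out
        self.messages: list[Message] = [{"role": "system", "content": system}]

    def ask(self, text: str) -> str:
        self.messages.append({"role": "user", "content": text})
        payload = {"model": self.model, "messages": list(self.messages), "stream": False}
        reply = self.send(payload)["message"]["content"]
        # Store the assistant turn so the next request carries the full history.
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

Injecting the transport keeps the history logic testable without a running server. Long conversations eventually exceed the model's context window, so a production version would trim or summarize old turns.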

Streaming Responses

Set "stream": true (or omit it, since streaming is the default) to receive the response incrementally as newline-delimited JSON, one JSON object per line. Each object carries a fragment of the response text and a done flag that flips to true on the final object:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a Python async HTTP client.",
  "stream": true
}'
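On the client side, a streamed reply from the native endpoint can be consumed one line at a time, since each line is a standalone JSON object. The generator below is a minimal sketch that accepts any iterable of byte lines, such as the response object returned by urllib.request.urlopen; the name iter_stream_text is illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/generate response."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # final object: stop reading
```

Printing each fragment with print(piece, end="", flush=True) shows the text as it is generated, and "".join(...) over the generator reassembles the full response.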

Model Management Endpoints

# List all local models
curl http://localhost:11434/api/tags

# Check which models are currently loaded in memory
curl http://localhost:11434/api/ps

# Delete a model
curl -X DELETE http://localhost:11434/api/delete -d '{"model": "llama3.2"}'
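The /api/tags reply is a JSON object with a models array in which each entry has a name and a size in bytes. As a sketch, the helper below (an illustrative name) turns that payload into a list sorted by disk usage, which helps when deciding what to delete.

```python
def summarize_models(tags_payload: dict) -> list[tuple[str, float]]:
    """Return (name, size_in_GB) pairs from an /api/tags reply, largest first."""
    models = tags_payload.get("models", [])
    pairs = [(m["name"], round(m["size"] / 1e9, 1)) for m in models]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Fetch the payload with any HTTP client, parse it as JSON, and pass the resulting dict to summarize_models.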

Step 7: OpenAI-Compatible API Mode

Ollama implements the OpenAI Chat Completions API at /v1/chat/completions. This means any library or tool written for OpenAI — including the official Python and JS SDKs — works with Ollama with a two-line configuration change.

This is critical for integrating Ollama into existing applications without any code-level rewrites.

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a privacy-focused security auditor."},
        {"role": "user",   "content": "Review the following authentication code for vulnerabilities:\n\n[paste code]"},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)

Node.js / TypeScript (openai SDK)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1/",  // Ollama's OpenAI-compatible endpoint
  apiKey: "ollama",
});

async function reviewCode(code: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "qwen2.5-coder:7b",
    messages: [
      { role: "system",  content: "You are an expert TypeScript developer." },
      { role: "user",    content: `Refactor this for readability:\n\n${code}` },
    ],
    stream: false,
  });
  return response.choices[0].message.content ?? "";
}

LangChain Integration

from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOllama(
    model="llama3.2",
    base_url="http://localhost:11434",
    temperature=0.0,
)

messages = [
    SystemMessage(content="You are a data privacy expert."),
    HumanMessage(content="What are GDPR's key obligations for data processors?"),
]

response = llm.invoke(messages)
print(response.content)

Step 8: Multimodal Vision Models

Ollama also supports vision-capable models that accept images as part of the prompt — useful for document analysis, diagram understanding, and screenshot debugging.

Pull a vision model:

ollama pull gemma3:4b     # Google's Gemma 3 (4B, vision-enabled)
ollama pull llava:13b     # LLaVA — the original open vision model

Send an image via the API:

# Linux: use -w0 to suppress line wrapping
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Identify any security concerns in this system architecture diagram.",
  "images": ["'$(base64 -w0 architecture.png)'"]
}'

# macOS: pass the file with -i; tr strips any line breaks from the output
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Identify any security concerns in this system architecture diagram.",
  "images": ["'$(base64 -i architecture.png | tr -d '\n')'"]
}'

Or via Python with the openai SDK (install first: pip install openai):

import base64
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)

with open("architecture.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemma3:4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",      "text": "What security risks do you see in this diagram?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
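If you prefer the native API over the OpenAI-compatible one, images travel as base64 strings in an images list on the message. The sketch below builds such a payload for /api/chat in Python, which sidesteps the platform differences in the shell base64 tool. The function name is illustrative.

```python
import base64


def build_vision_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming /api/chat payload with one attached image."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")  # no line wrapping
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [image_b64]}],
        "stream": False,
    }
```

Read the file with open("architecture.png", "rb").read(), build the payload, and POST it as JSON to http://localhost:11434/api/chat.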

Step 9: Performance Tuning

Environment Variables

Control Ollama's runtime behavior via environment variables before starting the server:

# Run on GPU 0 only (useful on multi-GPU machines)
CUDA_VISIBLE_DEVICES=0 ollama serve

# Increase context window globally (the default depends on the Ollama version)
OLLAMA_CONTEXT_LENGTH=32768 ollama serve

# Keep models loaded in VRAM indefinitely (avoids reload latency)
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Allow access from other machines on the network
OLLAMA_HOST=0.0.0.0:11434 ollama serve

Verify GPU Offload

After starting a model, check how much of it is running on the GPU:

ollama ps

NAME                    ID              SIZE     PROCESSOR    CONTEXT
llama3.2:latest         a80c4f17acd5    2.0 GB   100% GPU     8192

If the PROCESSOR column shows CPU, or a split such as 40%/60% CPU/GPU, instead of 100% GPU, the model does not fit entirely in VRAM. Either use a smaller quantization (Q4 instead of Q8) or a smaller model variant.
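The same information is available programmatically from /api/ps, where each loaded model reports its total size and the portion held in VRAM (size and size_vram, both in bytes). The sketch below computes the GPU share from such a payload; the helper names are illustrative.

```python
def gpu_share(model_entry: dict) -> float:
    """Fraction of a loaded model resident in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def offload_report(ps_payload: dict) -> list[str]:
    """One human-readable line per loaded model in an /api/ps reply."""
    return [
        f"{entry['name']}: {gpu_share(entry):.0%} on GPU"
        for entry in ps_payload.get("models", [])
    ]
```

A monitoring script could alert whenever gpu_share drops below 1.0, which indicates that part of the model has spilled into system RAM.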

Concurrent Request Handling

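Ollama can serve several requests to the same loaded model in parallel. The default limit is chosen automatically from available memory and has varied between releases (typically 4 or 1 per model). Each parallel slot needs its own context memory, so raising the limit increases VRAM use. For high-throughput API use, set the concurrency limit explicitly: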

OLLAMA_NUM_PARALLEL=8 ollama serve
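On the client side, parallel slots only help if requests actually arrive concurrently. The sketch below fans prompts out over a thread pool. The send argument is a placeholder for any function that submits one prompt to Ollama and returns the text, and the default of 4 workers is an assumption that should match your server's limit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_parallel(prompts: list[str], send: Callable[[str], str],
                 max_workers: int = 4) -> list[str]:
    """Send prompts concurrently; results come back in the same order as prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

Threads suit this workload because each worker spends its time waiting on the HTTP response. Requests beyond the server's parallel limit are queued rather than run at once, so extra workers add no throughput.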

Step 10: Network Exposure and Security

By default, Ollama listens on 127.0.0.1:11434 — local-only, not accessible from other machines. This is the secure default for single-developer use.

To expose Ollama on your LAN (for shared team access or a home server):

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Warning: Do not expose Ollama directly to the public internet. The API has no built-in authentication. If you need external access, put Nginx or Caddy in front with HTTPS and HTTP Basic Auth, or use a VPN tunnel (WireGuard, Tailscale) to restrict access to trusted peers only.

For fine-grained access control in team environments, consider wrapping Ollama with Open WebUI — an open-source frontend that adds user accounts, model-level permissions, and an audit log.
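A deployment script can guard against accidental exposure by inspecting the OLLAMA_HOST value before starting the server. The check below is a sketch with an illustrative function name. It treats any value that is not a loopback address or localhost as potentially exposed.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_bind(ollama_host: str) -> bool:
    """True if an OLLAMA_HOST value binds only to the local machine."""
    value = ollama_host.strip()
    # urlsplit only fills in hostname when a scheme separator is present.
    if "://" not in value:
        value = "http://" + value
    hostname = urlsplit(value).hostname or ""
    if hostname == "localhost":
        return True
    try:
        return ipaddress.ip_address(hostname).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name): treat as potentially exposed.
        return False
```

For example, read the value with os.environ.get("OLLAMA_HOST", "127.0.0.1:11434") and refuse to start, or log a warning, when is_loopback_bind returns False and no reverse proxy or VPN is in place.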

What You've Built

At the end of this guide, you have a fully private, zero-cost AI backend that:

Runs open-weight models with GPU acceleration on your own hardware
Exposes a REST API that any application can call with no external dependencies
Is compatible with the OpenAI SDK — requiring only a two-line base_url change in existing code
Can serve multimodal (vision + text) requests using the same interface
Keeps all prompts, context, and outputs off third-party servers permanently

Whether you are processing confidential financial records, reviewing proprietary source code, or building a privacy-first product, Ollama gives you state-of-the-art language model capabilities with full auditability and zero egress risk.


