How to Run Local LLMs Securely Using Ollama
In this article
- Prerequisites
- Step 1: Install Ollama
- Step 2: Choose a Model and Quantization Level
- Quantization Quick Reference
- Model Recommendations by Use Case
- Match Your Hardware Before You Pull
- Step 3: Pull a Model
- Step 4: Run a Model
- Useful REPL Commands
- Step 5: Customize Behavior with a Modelfile
- Step 6: Using the REST API
- Generate (single-turn)
- Chat (multi-turn with history)
- Streaming Responses
- Model Management Endpoints
- Step 7: OpenAI-Compatible API Mode
- Python (openai SDK)
- Node.js / TypeScript (openai SDK)
- LangChain Integration
- Step 8: Multimodal Vision Models
- Step 9: Performance Tuning
- Environment Variables
- Verify GPU Offload
- Concurrent Request Handling
- Step 10: Network Exposure and Security
- What You've Built
While proprietary models like GPT-4o are incredibly powerful, they require sending every prompt over the internet to a third-party server. For enterprises handling PII, confidential trade secrets, or regulated healthcare data, that is often a hard dealbreaker.
The solution is to run open-weights models entirely on your own hardware. This guide covers the complete picture: installing Ollama, selecting the right model and quantization level for your hardware, customizing model behavior with a Modelfile, querying the built-in REST API, and integrating with Python and Node.js applications.
Prerequisites
Before you begin, confirm your environment meets the minimum requirements.
Hardware:
- 8 GB RAM — runs 3B–4B models comfortably on CPU
- 16 GB RAM — runs 7B–8B models on CPU (slow but functional)
- 8 GB VRAM — runs 7B–8B models at full GPU speed (recommended minimum for a good experience)
- 16–24 GB VRAM — runs 13B–32B models at GPU speed
- ~10–25 GB of free disk space per model (depending on size and quantization)
Supported GPU backends:
- NVIDIA (CUDA 11.3+)
- AMD (ROCm 5.7+)
- Apple Silicon (Metal via llama.cpp)
CPU-only inference works but is noticeably slower: expect roughly 3–8 tokens per second on a modern desktop, versus tens to 100+ tokens per second with a GPU, depending on the model and the card.
Software:
- macOS 12+, Ubuntu 20.04+, or Windows 10/11
- curl for the Linux installer, or the desktop installer for macOS/Windows
Step 1: Install Ollama
Ollama handles model download, GGUF quantization management, and hardware acceleration configuration automatically — there is nothing to configure manually.
Linux (one-liner):
curl -fsSL https://ollama.com/install.sh | sh
macOS / Windows: Download the installer from ollama.com. The macOS download installs Ollama as a background menu-bar application. The Windows installer handles PATH registration automatically.
After installation, confirm Ollama is running:
ollama --version
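# ollama version is 0.x.y (the exact number depends on your install)
A "command not found" error means the ollama binary is not on your PATH: open a new terminal or rerun the installer. If the command instead warns that it could not connect to a running Ollama instance, start the server manually in one terminal: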
ollama serve
Then re-run the version check in a second terminal. Once it prints cleanly, the HTTP server is live on localhost:11434.
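The same check can be scripted, which is handy in setup scripts or CI. The sketch below assumes the server answers on the default address and uses the /api/version endpoint; the function name and timeout are illustrative choices, not part of Ollama.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> Optional[str]:
    """Return the server's version string, or None if nothing usable answers."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            reply = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not running".
        return None
    return reply.get("version") if isinstance(reply, dict) else None
```

Calling ollama_version() returns a string when the server is up and None otherwise, so a script can decide whether to launch ollama serve first.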
Step 2: Choose a Model and Quantization Level
Ollama models are distributed as GGUF files — a binary format that bundles weights and metadata into a single self-contained file. The quantization level you choose determines the trade-off between model quality and hardware requirements.
Quantization Quick Reference
| Quant | Bits per Weight | Quality vs. FP16 | Min VRAM (7B) | Best For |
|---|---|---|---|---|
| Q2_K | 2–3 bit | Noticeable degradation | ~3 GB | Extremely memory-constrained devices |
| Q4_K_M | 4 bit (medium) | ~1–2% perplexity loss | ~5 GB | Best balance of size and quality — recommended default |
| Q5_K_M | 5 bit (medium) | ~0.5% perplexity loss | ~6 GB | When you have the VRAM and want near-lossless quality |
| Q8_0 | 8 bit | Negligible loss (<0.1%) | ~9 GB | GPU servers with 24 GB+ VRAM |
| F16 | 16 bit (full) | Reference quality | ~16 GB | Benchmarking and fine-tune serving |
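The VRAM column can be sanity-checked with simple arithmetic: weights take roughly parameters × bits per weight ÷ 8 bytes, plus some room for the KV cache and runtime buffers. The sketch below is a rough estimator only; the 1.2 overhead multiplier is an assumption chosen for illustration, and K-quants store scaling data alongside the weights, so their effective bits per weight run somewhat above the nominal figure.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimate_total_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Weights plus a flat allowance for KV cache and runtime buffers.

    The default overhead of 1.2 is an illustrative assumption, not a measured value.
    """
    return estimate_weights_gb(params_billion, bits_per_weight) * overhead
```

For a 7B model at 16 bits, estimate_weights_gb(7, 16) gives 14 GB of weights, which is why F16 needs a much larger card than a 4-bit quantization of the same model.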
Model Recommendations by Use Case
| Model | Size | Strengths | Context Window |
|---|---|---|---|
| Llama 3.2 | 1B / 3B | General-purpose chat, fast on CPU | 128k |
| Qwen 2.5 Coder | 7B / 32B | Best open-source coding & reasoning | 128k |
| Mistral 7B | 7B | Instruction-following, summarization | 32k |
| Gemma 3 | 4B / 12B / 27B | Multilingual, multimodal (vision) | 128k |
| DeepSeek-R1 | 8B / 32B / 70B | Chain-of-thought reasoning, math | 64k |
| Phi-4 Mini | 3.8B | Fastest on CPU; surprisingly capable | 16k |
Match Your Hardware Before You Pull
Before running any ollama pull command, check how much RAM or VRAM your machine has — this determines the largest model you can run at full GPU speed.
| Your Hardware | Max Comfortable Fit | Recommended Pull | What to Expect |
|---|---|---|---|
| 8 GB RAM, no dedicated GPU | 3B–4B model (Q4) | ollama pull llama3.2:3b or phi4-mini | CPU inference, ~4–8 tok/s. Fully functional for everyday tasks. |
| 16 GB RAM, no dedicated GPU | 7B–8B model (Q4) | ollama pull llama3.1:8b or mistral | CPU inference, ~3–6 tok/s. Good quality, patient workflow. |
| 8 GB VRAM (e.g. RTX 3070/4060) | 7B–8B model (Q4_K_M) | ollama pull llama3.1:8b or qwen2.5-coder:7b | Full GPU, 40–80 tok/s. Excellent for development. |
| 16–24 GB VRAM (e.g. RTX 3090/4090) | 13B–32B model (Q4_K_M) | ollama pull qwen2.5-coder:32b or deepseek-r1:14b | Full GPU, 30–60 tok/s. Production-quality reasoning. |
| 40–80 GB VRAM (e.g. A100, H100, dual 4090) | 70B model (Q4_K_M or Q8) | ollama pull llama3.3:70b or deepseek-r1:70b | Full GPU, 20–40 tok/s. Near-frontier open-weight quality. |
| Apple Silicon (M1/M2/M3, unified memory) | Up to ~60–70% of total RAM | M1 8 GB → 3B; M2 16 GB → 7B; M3 Max 64 GB → 34B | Metal backend, 30–80 tok/s. Very efficient per-watt. |
[!CAUTION] If you pull a model that exceeds your available VRAM, Ollama will not crash — but it will silently split the model layers between your GPU and system RAM. This is called CPU offloading and causes a dramatic speed drop (sometimes 5–10× slower) that can make the model feel unresponsive. If generation is very slow (< 3 tokens/second), the model is almost certainly too large for your hardware. Pull a smaller quantization or a smaller model variant instead.
To check how much VRAM you have before pulling:
# NVIDIA
nvidia-smi --query-gpu=name,memory.total --format=csv
# macOS (shows unified memory)
system_profiler SPHardwareDataType | grep Memory
# Linux (CPU RAM)
free -h
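If you want to make the hardware check part of a script, the CSV output of the nvidia-smi command above is easy to parse. This is a minimal sketch that assumes the usual two-column layout (a header row, then lines such as "NVIDIA GeForce RTX 3070, 8192 MiB"); the function name is illustrative.

```python
import csv
import io


def parse_gpu_memory(csv_text: str) -> list[tuple[str, int]]:
    """Parse `nvidia-smi --query-gpu=name,memory.total --format=csv` output.

    Returns (gpu_name, total_memory_mib) pairs, skipping the header row.
    """
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    gpus = []
    for index, row in enumerate(rows):
        if index == 0 or len(row) < 2:
            continue  # header line or blank line
        name, memory = row[0].strip(), row[1].strip()
        gpus.append((name, int(memory.split()[0])))  # "8192 MiB" -> 8192
    return gpus
```

Feeding it the captured command output (for example via subprocess.run) gives one tuple per GPU, which you can compare against the table above before pulling.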
Step 3: Pull a Model
Pull Llama 3.2 (defaults to Q4_K_M quantization, ~2.0 GB):
ollama pull llama3.2
To pull a specific quantization variant explicitly:
# Pull the higher-quality Q8 variant of the 3B model
ollama pull llama3.2:3b-instruct-q8_0
# Pull a 70B model for a high-VRAM server
# Note: use llama3.3 here — the llama3.2 series only ships 1B and 3B
ollama pull llama3.3:70b
Ollama will show a progress bar during the download. Once complete, the weights are cached in ~/.ollama/models/ and reused on all subsequent runs.
List all locally available models at any time:
ollama list
Step 4: Run a Model
The fastest way to interact is the built-in REPL:
ollama run llama3.2
You will be dropped into an interactive terminal session. The model runs entirely on your local RAM/VRAM — no packet ever leaves your network interface.
>>> Summarize the key risks in our Q4 financial report.
[paste any confidential data here safely]
To exit the REPL: type /bye or press Ctrl + D.
Useful REPL Commands
| Command | Effect |
|---|---|
| /show info | Display model architecture and parameter count |
| /show modelfile | Print the active Modelfile configuration |
| /set parameter temperature 0.2 | Adjust temperature on the fly |
| /save my-session | Save the current session as a new model named my-session |
Step 5: Customize Behavior with a Modelfile
A Modelfile lets you create a persistent, named variant of any base model with a custom system prompt, temperature, and stop tokens — equivalent to a deployed "assistant persona."
Create a file called Modelfile:
FROM llama3.2
# Keep responses focused and technical
SYSTEM """
You are a senior backend engineer specializing in distributed systems.
Answer all questions with production-grade code examples.
Avoid hand-wavy explanations — be specific and precise.
"""
# Lower temperature = more deterministic output
PARAMETER temperature 0.1
# Stop generation if the model starts writing the next speaker turn
PARAMETER stop "Human:"
PARAMETER stop "User:"
# Increase context window for long code reviews
PARAMETER num_ctx 32768
Register and run your custom persona:
ollama create senior-engineer -f Modelfile
ollama run senior-engineer
Your custom model now appears in ollama list alongside the base models and can be used identically via the REST API.
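If you maintain several personas, generating the Modelfile text from code keeps them consistent. The helper below is a hypothetical convenience, not part of Ollama; it only emits the FROM, SYSTEM, and PARAMETER directives used above.

```python
def render_modelfile(base: str, system: str, parameters: dict[str, object], stops: list[str]) -> str:
    """Build Modelfile text from a base model, system prompt, parameters, and stop strings."""
    lines = [f"FROM {base}", f'SYSTEM """\n{system.strip()}\n"""']
    lines += [f"PARAMETER {key} {value}" for key, value in parameters.items()]
    lines += [f'PARAMETER stop "{stop}"' for stop in stops]
    return "\n".join(lines) + "\n"
```

Write the returned string to a file and register it with ollama create, exactly as with a hand-written Modelfile.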
Step 6: Using the REST API
Ollama exposes a local HTTP server on port 11434 by default. This is the integration surface for all applications, scripts, and tools.
Generate (single-turn)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Explain the CAP theorem in two sentences.",
"stream": false
}'
Response:
{
"model": "llama3.2",
"response": "The CAP theorem states that a distributed system can guarantee at most two of three properties simultaneously: Consistency, Availability, and Partition Tolerance. In practice, network partitions are unavoidable, so engineers must choose between consistency (CP systems like ZooKeeper) or availability (AP systems like Cassandra).",
"done": true,
"total_duration": 1420803000
}
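The same call from Python needs only the standard library. The sketch below also derives generation speed from the eval_count and eval_duration fields that Ollama includes in the full reply (durations are reported in nanoseconds); the function names and the 300-second timeout are illustrative assumptions.

```python
import json
import urllib.request


def generate(prompt: str, model: str = "llama3.2", base_url: str = "http://localhost:11434") -> dict:
    """POST a single-turn prompt to /api/generate and return the parsed JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.load(resp)


def tokens_per_second(reply: dict) -> float:
    """Generation speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)
```

Calling tokens_per_second(generate("Explain the CAP theorem in two sentences.")) gives a quick way to tell whether a model is running at GPU speed on your machine.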
Chat (multi-turn with history)
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{ "role": "system", "content": "You are a Rust expert." },
{ "role": "user", "content": "What is ownership?" },
{ "role": "assistant", "content": "Ownership is Rust memory management model..." },
{ "role": "user", "content": "How does it relate to borrowing?" }
],
"stream": false
}'
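The server keeps no conversation state between calls, so the client has to resend the full history each time. A small wrapper can manage that; the class below is a hypothetical sketch using only the standard library, not an official client.

```python
import json
import urllib.request


class ChatSession:
    """Keeps the message history that /api/chat expects on every call."""

    def __init__(self, model: str, system: str = "", base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url
        self.messages = [{"role": "system", "content": system}] if system else []

    def payload(self, user_text: str) -> dict:
        """Request body: all prior turns plus the new user message."""
        turns = self.messages + [{"role": "user", "content": user_text}]
        return {"model": self.model, "messages": turns, "stream": False}

    def ask(self, user_text: str) -> str:
        """Send the turn, record both sides in the history, and return the reply text."""
        body = json.dumps(self.payload(user_text)).encode()
        request = urllib.request.Request(
            f"{self.base_url}/api/chat",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=300) as resp:
            reply = json.load(resp)["message"]
        self.messages.append({"role": "user", "content": user_text})
        self.messages.append(reply)
        return reply["content"]
```

Each call to ask() grows the history by two messages, so long sessions eventually need trimming to stay inside the model's context window.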
Streaming Responses
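Set "stream": true (or omit it, since streaming is the default) to receive the response incrementally as newline-delimited JSON: one JSON object per line, rather than Server-Sent Events. Each object carries a partial response fragment and a done flag that flips to true on the final chunk: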
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Write a Python async HTTP client.",
"stream": true
}'
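Consuming the stream from Python comes down to reading the body line by line and decoding each line as JSON. The generator below is a minimal sketch that assumes one JSON object per line with response and done keys, as the native /api/generate endpoint emits; the function name is illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/generate body, stopping at done."""
    for raw in lines:
        if not raw.strip():
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break
```

An open urllib response object iterates over lines, so it can be passed straight in: for piece in iter_stream_text(resp): print(piece, end="", flush=True).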
Model Management Endpoints
# List all local models
curl http://localhost:11434/api/tags
# Check which models are currently loaded in memory
curl http://localhost:11434/api/ps
# Delete a model
curl -X DELETE http://localhost:11434/api/delete -d '{"model": "llama3.2"}'
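These endpoints return plain JSON, so a disk-usage report takes only a few lines. The sketch below assumes each entry in the /api/tags reply carries name and size (in bytes) fields; the helper names are illustrative.

```python
import json
import urllib.request


def fetch_json(path: str, base_url: str = "http://localhost:11434") -> dict:
    """GET a management endpoint such as /api/tags and parse the JSON reply."""
    with urllib.request.urlopen(f"{base_url}{path}", timeout=10) as resp:
        return json.load(resp)


def summarize_models(tags_reply: dict) -> list[tuple[str, float]]:
    """Return (name, size_in_gb) for each model in a /api/tags reply, largest first."""
    models = tags_reply.get("models", [])
    pairs = [(m["name"], round(m["size"] / 1e9, 1)) for m in models]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

summarize_models(fetch_json("/api/tags")) lists the largest models first, which helps when deciding what to delete.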
Step 7: OpenAI-Compatible API Mode
Ollama implements the OpenAI Chat Completions API at /v1/chat/completions. This means any library or tool written for OpenAI — including the official Python and JS SDKs — works with Ollama with a two-line configuration change.
This is critical for integrating Ollama into existing applications without any code-level rewrites.
Python (openai SDK)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",  # trailing slash is required
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a privacy-focused security auditor."},
        {"role": "user", "content": "Review the following authentication code for vulnerabilities:\n\n[paste code]"},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)
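One way to keep the switch between a hosted API and Ollama out of application code is to read the two settings from the environment. The variable names LLM_BASE_URL and LLM_API_KEY below are hypothetical, chosen only for this sketch.

```python
import os
from typing import Mapping


def client_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Pick OpenAI-SDK constructor arguments from the environment.

    LLM_BASE_URL and LLM_API_KEY are hypothetical names used for illustration.
    With neither set, the settings point at a local Ollama server.
    """
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1/"),
        "api_key": env.get("LLM_API_KEY", "ollama"),  # placeholder; Ollama ignores it
    }
```

The client is then built with OpenAI(**client_settings()), and the same code runs against either backend depending on how the process is launched.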
Node.js / TypeScript (openai SDK)
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1/", // trailing slash required
  apiKey: "ollama",
});

async function reviewCode(code: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "qwen2.5-coder:7b",
    messages: [
      { role: "system", content: "You are an expert TypeScript developer." },
      { role: "user", content: `Refactor this for readability:\n\n${code}` },
    ],
    stream: false,
  });
  return response.choices[0].message.content ?? "";
}
LangChain Integration
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOllama(
    model="llama3.2",
    base_url="http://localhost:11434",
    temperature=0.0,
)

messages = [
    SystemMessage(content="You are a data privacy expert."),
    HumanMessage(content="What are GDPR's key obligations for data processors?"),
]

response = llm.invoke(messages)
print(response.content)
Step 8: Multimodal Vision Models
Ollama also supports vision-capable models that accept images as part of the prompt — useful for document analysis, diagram understanding, and screenshot debugging.
Pull a vision model:
ollama pull gemma3:4b # Google's Gemma 3 (4B, vision-enabled)
ollama pull llava:13b # LLaVA — the original open vision model
Send an image via the API:
# Linux: GNU base64 wraps output at 76 characters by default; -w0 disables wrapping
curl http://localhost:11434/api/generate -d '{
"model": "gemma3:4b",
"prompt": "Identify any security concerns in this system architecture diagram.",
"images": ["'$(base64 -w0 architecture.png)'"]
}'
# macOS: BSD base64 takes the input file via -i and has no -w flag; tr strips any newlines as a safeguard
curl http://localhost:11434/api/generate -d '{
"model": "gemma3:4b",
"prompt": "Identify any security concerns in this system architecture diagram.",
"images": ["'$(base64 -i architecture.png | tr -d '\n')'"]
}'
Or via Python with the openai SDK (install first: pip install openai):
import base64

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)

# Read the image and encode it as a single-line base64 string
with open("architecture.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gemma3:4b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What security risks do you see in this diagram?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
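The native /api/generate endpoint can also be driven from Python, which sidesteps the platform differences in the base64 command. The sketch below only builds the request body; the function name is illustrative, and the images field takes raw base64 strings without a data: prefix.

```python
import base64
import json


def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> bytes:
    """JSON body for /api/generate with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")  # single line, no wrapping
    body = {"model": model, "prompt": prompt, "images": [encoded], "stream": False}
    return json.dumps(body).encode()
```

POST the returned bytes to http://localhost:11434/api/generate with a Content-Type of application/json, for example via urllib.request.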
Step 9: Performance Tuning
Environment Variables
Control Ollama's runtime behavior via environment variables before starting the server:
# Run on GPU 0 only (useful on multi-GPU machines)
CUDA_VISIBLE_DEVICES=0 ollama serve
# Increase context window globally (the default varies by Ollama version; commonly 4096)
OLLAMA_CONTEXT_LENGTH=32768 ollama serve
# Keep models loaded in VRAM indefinitely (avoids reload latency)
OLLAMA_KEEP_ALIVE=-1 ollama serve
# Allow access from other machines on the network
OLLAMA_HOST=0.0.0.0:11434 ollama serve
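Context length and keep-alive can also be set per request instead of server-wide, through the options.num_ctx and keep_alive fields of the native API. The helper below is a hypothetical convenience that adds them to an existing request body without mutating it.

```python
from typing import Optional, Union


def with_runtime_options(
    payload: dict,
    num_ctx: Optional[int] = None,
    keep_alive: Union[str, int, None] = None,
) -> dict:
    """Return a copy of an /api/generate or /api/chat body with per-request overrides."""
    result = dict(payload)
    if num_ctx is not None:
        result["options"] = {**payload.get("options", {}), "num_ctx": num_ctx}
    if keep_alive is not None:
        result["keep_alive"] = keep_alive  # e.g. "10m", or -1 to stay loaded
    return result
```

This keeps the server defaults modest while letting a single long-document request ask for a larger window.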
Verify GPU Offload
After starting a model, check how much of it is running on the GPU:
ollama ps
NAME ID SIZE PROCESSOR CONTEXT
llama3.2:latest a80c4f17acd5 2.0 GB 100% GPU 8192
If the PROCESSOR column shows CPU, or a split such as 40%/60% CPU/GPU, the model does not fit entirely in VRAM. Either use a smaller quantization (Q4 instead of Q8) or a smaller model variant.
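The same information is available programmatically from /api/ps, where each loaded model reports its total size and the portion held in VRAM. A minimal sketch, assuming the size and size_vram fields (both in bytes); the function name is illustrative.

```python
def gpu_share(entry: dict) -> float:
    """Fraction of a loaded model resident in VRAM, from one /api/ps entry."""
    total = entry.get("size", 0)
    return entry.get("size_vram", 0) / total if total else 0.0
```

A value of 1.0 means the model is fully on the GPU; anything lower indicates that part of it is being served from system RAM.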
Concurrent Request Handling
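Ollama can serve several requests to the same model in parallel. The limit is set by OLLAMA_NUM_PARALLEL; when it is unset, Ollama picks a default (4 or 1 in many releases, depending on available memory). For high-throughput API use, raise it explicitly, keeping in mind that context memory grows with the number of parallel slots: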
OLLAMA_NUM_PARALLEL=8 ollama serve
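On the client side, a thread pool is a simple way to keep those parallel slots busy. The sketch below takes any function that sends one prompt and returns one reply, so it makes no assumptions about which client library you use; the names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_parallel(prompts: list[str], ask: Callable[[str], str], max_workers: int = 4) -> list[str]:
    """Send prompts concurrently and return replies in the same order as the prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, prompts))
```

Setting max_workers above the server's parallel limit does no harm, since the extra requests simply queue on the server.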
Step 10: Network Exposure and Security
By default, Ollama listens on 127.0.0.1:11434 — local-only, not accessible from other machines. This is the secure default for single-developer use.
To expose Ollama on your LAN (for shared team access or a home server):
OLLAMA_HOST=0.0.0.0:11434 ollama serve
[!WARNING] Do not expose Ollama directly to the public internet. The API has no built-in authentication. If you need external access, put Nginx or Caddy in front with HTTPS and HTTP Basic Auth, or use a VPN tunnel (WireGuard, Tailscale) to restrict access to trusted peers only.
For fine-grained access control in team environments, consider wrapping Ollama with Open WebUI — an open-source frontend that adds user accounts, model-level permissions, and an audit log.
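A deployment script can guard against accidental exposure by checking the OLLAMA_HOST value before starting the server. The function below is a hypothetical sketch: it handles plain host:port values and bare addresses, and it deliberately treats anything it cannot parse (including bracketed IPv6 with a port) as exposed.

```python
import ipaddress


def is_loopback_bind(ollama_host: str) -> bool:
    """True if an OLLAMA_HOST value like "127.0.0.1:11434" binds to loopback only."""
    value = ollama_host.split("://", 1)[-1]  # drop an optional scheme
    # Exactly one colon means host:port; otherwise treat the whole value as the host.
    host = value.rsplit(":", 1)[0] if value.count(":") == 1 else value
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # unknown hostname or unparsed form: assume it may be reachable
```

A launcher could refuse to start, or require an explicit flag, whenever is_loopback_bind(os.environ.get("OLLAMA_HOST", "127.0.0.1")) is False.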
What You've Built
At the end of this guide, you have a fully private AI backend with no per-token fees that:
- Runs open-weight models with GPU acceleration on your own hardware
- Exposes a REST API that any application can call with no external dependencies
- Is compatible with the OpenAI SDK, requiring only a two-line base_url change in existing code
- Can serve multimodal (vision + text) requests using the same interface
- Keeps all prompts, context, and outputs off third-party servers permanently
Whether you are processing confidential financial records, reviewing proprietary source code, or building a privacy-first product, Ollama gives you state-of-the-art language model capabilities with full auditability and zero egress risk.