How to Run Qwen3 Locally with Ollama: Setup, API, and a Gradio App
In this article
- Why Qwen3?
- Prerequisites
- Step 1: Install Ollama
- Step 2: Choose Your Qwen3 Variant
- Step 3: Reasoning Modes — /think and /no_think
- Step 4: REST API Usage
- Step 5: Python SDK
- Step 6: Build a Reasoning + Multilingual Gradio App
- reasoning_tab.py
- multilingual_tab.py
- Launch both tabs:
- Using the OpenAI-Compatible Endpoint
- Performance Notes
- What to Read Next
Why Qwen3?
Qwen3 is Alibaba's open-weight model family, covering models from 0.6B to 235B parameters. It ships with hybrid reasoning: you can toggle deep chain-of-thought on or off per request without loading a different model. On the benchmarks Alibaba published, its math, code, and multilingual results are competitive with DeepSeek-R1 and o3-mini. The full model range ships under an Apache 2.0 license.
The practical appeal for local deployment is the MoE architecture at the high end. Qwen3-30B-A3B activates only about 3 billion of its 30B parameters per token during inference, so it generates at speeds closer to those of a much smaller dense model. All 30B weights still have to be loaded, so size your RAM or VRAM for the full model rather than for the active parameters.
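To put rough numbers on that trade-off, here is a minimal sketch of the arithmetic. It assumes about 4 bits per weight, which is an assumption about the quantization rather than something stated above, and it ignores the KV cache and runtime overhead, so treat the result as a lower bound.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def active_fraction(active_params_billion: float, total_params_billion: float) -> float:
    """Share of the parameters that take part in each token's forward pass."""
    return active_params_billion / total_params_billion


# Qwen3-30B-A3B: 30B parameters in total, about 3B active per token.
# bits_per_weight=4 is an assumed quantization level.
print(weight_memory_gb(30, 4))   # memory follows the TOTAL parameter count
print(active_fraction(3, 30))    # compute per token follows the ACTIVE count
```

The first figure is what has to fit in RAM or VRAM; the second explains why generation speed is closer to that of a small dense model.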
Prerequisites
- macOS, Linux, or Windows (WSL2 recommended on Windows)
- RAM: 8GB minimum for 0.6B–1.7B; 16–32GB for 8B; 64GB+ for 32B+
- GPU (optional): Ollama auto-detects NVIDIA/AMD/Apple Silicon and offloads layers; CPU-only works but runs slowly on larger variants
Step 1: Install Ollama
Download from ollama.com/download and follow the platform installer. After installation, verify:
ollama --version
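Expected output (your version number will differ; Qwen3 needs a recent Ollama release, so update if pulling the model reports that a newer version is required):
ollama version is 0.6.x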
On Linux, Ollama installs as a systemd service and starts automatically. On macOS, it runs as a menu bar app. On Windows, you can install inside WSL2 or use the native installer; both can use the GPU when suitable drivers are present.
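If you want to script the version check, the sketch below parses the version string and compares it with a minimum. The function names are hypothetical, and the minimum shown, (0, 6, 6), is an assumption about when Qwen3 support landed; confirm it against the Ollama release notes.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def parse_ollama_version(output: str) -> Optional[Version]:
    """Extract (major, minor, patch) from `ollama --version` output, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def is_new_enough(output: str, minimum: Version) -> bool:
    """True when the reported version is at least `minimum`."""
    version = parse_ollama_version(output)
    return version is not None and version >= minimum


MIN_FOR_QWEN3 = (0, 6, 6)  # assumed minimum, not taken from this article

# In a real script the string would come from:
#   subprocess.run(["ollama", "--version"], capture_output=True, text=True).stdout
print(is_new_enough("ollama version is 0.5.7", MIN_FOR_QWEN3))
print(is_new_enough("ollama version is 0.6.8", MIN_FOR_QWEN3))
```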
Step 2: Choose Your Qwen3 Variant
Ollama hosts the full Qwen3 family. Select based on available VRAM or RAM:
| Model | Run Command | VRAM / RAM | Best For |
|---|---|---|---|
| Qwen3-0.6B | ollama run qwen3:0.6b | ~1.5 GB | Edge devices, quick experiments |
| Qwen3-1.7B | ollama run qwen3:1.7b | ~2.5 GB | Chatbots, low-latency assistants |
| Qwen3-4B | ollama run qwen3:4b | ~4 GB | General tasks on consumer hardware |
| Qwen3-8B | ollama run qwen3:8b | ~8 GB | Multilingual, moderate reasoning |
| Qwen3-14B | ollama run qwen3:14b | ~14 GB | Complex reasoning, content work |
| Qwen3-32B | ollama run qwen3:32b | ~32 GB | Strong reasoning, large context |
| Qwen3-30B-A3B (MoE) | ollama run qwen3:30b-a3b | ~20 GB (all weights; ~3B active per token) | Efficient coding, fast on GPU |
| Qwen3-235B-A22B (MoE) | ollama run qwen3:235b-a22b | ~140 GB (all weights; ~22B active per token) | Enterprise-scale, best quality |
Start with qwen3:8b if you're unsure — it's the default when you run ollama run qwen3 without a tag.
Pull and run:
ollama run qwen3:8b
The first run downloads the model weights. Subsequent runs start from cache. Once the >>> prompt appears, the model is ready.
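The table in this step can also be turned into a small lookup. The sketch below picks the largest variant whose listed memory figure fits a given budget; the figures are copied from the table, while the function name and the 2 GB headroom default are hypothetical choices for illustration.

```python
from typing import Optional

# (tag, approximate memory in GB) from the table, ordered by memory need.
VARIANTS = [
    ("qwen3:0.6b", 1.5),
    ("qwen3:1.7b", 2.5),
    ("qwen3:4b", 4.0),
    ("qwen3:8b", 8.0),
    ("qwen3:14b", 14.0),
    ("qwen3:30b-a3b", 20.0),
    ("qwen3:32b", 32.0),
    ("qwen3:235b-a22b", 140.0),
]


def largest_fitting(memory_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the most memory-hungry tag that still fits, or None."""
    budget = memory_gb - headroom_gb  # leave room for the OS and context cache
    fitting = [tag for tag, need in VARIANTS if need <= budget]
    return fitting[-1] if fitting else None


print(largest_fitting(24))  # e.g. a 24 GB GPU
print(largest_fitting(8))   # e.g. an 8 GB laptop
```

Treat the result as a starting point; long contexts need more memory than the table's figures suggest.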
Step 3: Reasoning Modes — /think and /no_think
Qwen3's main differentiator is switchable reasoning. Appending /think to a prompt triggers full chain-of-thought reasoning, with the working shown in a <think> block in the output. Appending /no_think suppresses it for faster, direct answers.
Deep reasoning (for math, code, analysis):
>>> Prove that √2 is irrational. /think
The model emits a <think> section working through the proof before the final answer.
Fast response (for factual lookup, drafting):
>>> What is the capital of Argentina? /no_think
No think block — the answer arrives immediately.
Omitting both tags uses the model's default. Qwen3 models ship with thinking mode enabled, so an untagged prompt to Qwen3-8B produces a <think> block unless you add /no_think.
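When prompts come from code rather than the terminal, it helps to add the switch in one place. This is a minimal sketch with a hypothetical helper name; it only manipulates strings and assumes the switch is placed at the end of the prompt, as in the examples above.

```python
from typing import Optional

SWITCHES = ("/think", "/no_think")


def tag_prompt(prompt: str, think: Optional[bool] = None) -> str:
    """Append /think or /no_think, replacing any switch already at the end.

    think=None leaves the prompt untagged so the model's default applies.
    """
    text = prompt.strip()
    # Drop a trailing switch the caller already typed so we never send two.
    for switch in SWITCHES:
        if text.endswith(switch):
            text = text[: -len(switch)].rstrip()
    if think is None:
        return text
    return f"{text} {'/think' if think else '/no_think'}"


print(tag_prompt("Prove that √2 is irrational.", think=True))
print(tag_prompt("What is the capital of Argentina? /think", think=False))
```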
Step 4: REST API Usage
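If Ollama is not already running (the desktop app and the Linux systemd service start it for you), start the server in a separate terminal, where it stays in the foreground: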
ollama serve
The server listens on localhost:11434. Send requests using curl:
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:8b",
  "messages": [
    { "role": "user", "content": "Explain entropy in thermodynamics. /think" }
  ],
  "stream": false
}'
Response shape:
{
  "model": "qwen3:8b",
  "message": {
    "role": "assistant",
    "content": "<think>...</think>\n\nEntropy is a measure of..."
  },
  "done": true
}
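The same request can be made from Python with only the standard library. The sketch below mirrors the curl call; the function names are hypothetical, and `send` assumes a server is listening on localhost:11434, so it is defined but not called here.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt: str, model: str = "qwen3:8b",
                       url: str = OLLAMA_CHAT_URL) -> urllib.request.Request:
    """Build the POST request that the curl example sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 300.0) -> str:
    """Send the request and return the assistant's message content."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["message"]["content"]


request = build_chat_request("Explain entropy in thermodynamics. /think")
print(request.get_method(), request.full_url)
# With a running server: print(send(request))
```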
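Set "stream": true to receive the reply as it is generated, which is useful for building a real-time UI. In that mode the response body is a sequence of JSON objects, one per line, each carrying the next piece of the message.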
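A streamed response can be reassembled by reading it line by line. The sketch below assumes each line is a JSON object shaped like the non-streaming response, with a fragment of text in `message.content` and `done` set to true on the last line; the sample lines are hand-written for illustration, not captured output.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        event = json.loads(line)
        piece = event.get("message", {}).get("content", "")
        if piece:
            yield piece
        if event.get("done"):
            break


# Hand-written sample in the assumed shape.
sample = [
    b'{"message": {"role": "assistant", "content": "Entropy "}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "measures disorder."}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print("".join(iter_content(sample)))
```

An HTTP response object from `urllib.request.urlopen` can be iterated line by line, so it can be passed to `iter_content` directly.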
Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions, which you can use if you already work with an OpenAI SDK client.
Step 5: Python SDK
pip install ollama
Basic inference:
import ollama

response = ollama.chat(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "Summarize transformer self-attention in 3 sentences. /think"}
    ],
)
print(response["message"]["content"])
The output includes the <think> block followed by the answer. To strip the thinking section programmatically:
import re
content = response["message"]["content"]
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
print(answer)
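If you want to keep the reasoning instead of discarding it, for example to show it in a collapsible panel, the same regular expression can return both parts. This is a small sketch with a hypothetical function name; it assumes the reasoning arrives inline in <think> tags as described above.

```python
import re
from typing import Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from a response that may contain <think> blocks."""
    reasoning = "\n".join(block.strip() for block in THINK_BLOCK.findall(content))
    answer = THINK_BLOCK.sub("", content).strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>2 divides 10 evenly.</think>\n\nYes, 10 is even.")
print("Reasoning:", reasoning)
print("Answer:", answer)
```

An empty or missing block yields an empty reasoning string, so callers can treat both cases the same way.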
For streaming with Python:
stream = ollama.chat(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Write a Python quicksort. /no_think"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
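The regular expression shown earlier needs the whole response, so it cannot hide the reasoning while tokens are still arriving. One possible approach is a small stateful filter that drops text between the tags even when a tag is split across chunks. The class below is a hypothetical helper, not part of the ollama package.

```python
class ThinkFilter:
    """Pass streamed text through while dropping <think>...</think> spans."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self) -> None:
        self._buffer = ""
        self._inside = False

    @staticmethod
    def _partial_suffix(text: str, tag: str) -> int:
        """Length of the longest proper prefix of `tag` that ends `text`."""
        for size in range(min(len(tag) - 1, len(text)), 0, -1):
            if text.endswith(tag[:size]):
                return size
        return 0

    def feed(self, piece: str) -> str:
        """Add one chunk and return the text that is safe to display."""
        self._buffer += piece
        visible = []
        while True:
            tag = self.CLOSE if self._inside else self.OPEN
            index = self._buffer.find(tag)
            if index != -1:
                if not self._inside:
                    visible.append(self._buffer[:index])
                self._buffer = self._buffer[index + len(tag):]
                self._inside = not self._inside
                continue
            # No complete tag: hold back a tail that might be the start of one.
            cut = len(self._buffer) - self._partial_suffix(self._buffer, tag)
            if not self._inside:
                visible.append(self._buffer[:cut])
            self._buffer = self._buffer[cut:]
            return "".join(visible)

    def flush(self) -> str:
        """Return any held-back visible text once the stream has ended."""
        rest = "" if self._inside else self._buffer
        self._buffer = ""
        return rest


# Simulated chunks; with a live stream, feed chunk["message"]["content"] instead.
chunks = ["<thi", "nk>secret</th", "ink>\n\nHello", " world"]
think_filter = ThinkFilter()
shown = "".join(think_filter.feed(c) for c in chunks) + think_filter.flush()
print(shown.strip())
```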
Step 6: Build a Reasoning + Multilingual Gradio App
Install dependencies:
pip install gradio ollama
reasoning_tab.py
import gradio as gr
import ollama


def run_qwen3(prompt: str, mode: str) -> str:
    # Append the reasoning switch chosen in the UI to the prompt.
    tagged = f"{prompt} /{mode}"
    response = ollama.chat(
        model="qwen3:8b",
        messages=[{"role": "user", "content": tagged}],
    )
    return response["message"]["content"]


reasoning_ui = gr.Interface(
    fn=run_qwen3,
    inputs=[
        gr.Textbox(label="Prompt", lines=3),
        gr.Radio(["think", "no_think"], label="Reasoning Mode", value="think"),
    ],
    outputs=gr.Textbox(label="Response", lines=10),
    title="Qwen3 Reasoning Demo",
    description="Toggle chain-of-thought reasoning on or off per request.",
)
multilingual_tab.py
import gradio as gr
import ollama

LANGUAGES = ["English", "French", "Spanish", "German", "Hindi", "Chinese", "Arabic", "Japanese"]


def translate(prompt: str, lang: str) -> str:
    # English passes the prompt through unchanged; other languages wrap it
    # in a translate-and-respond instruction.
    if lang == "English":
        message = prompt
    else:
        message = f"Translate the following to {lang}, then respond in {lang}: {prompt}"
    response = ollama.chat(
        model="qwen3:8b",
        messages=[{"role": "user", "content": message}],
    )
    return response["message"]["content"]


multilingual_ui = gr.Interface(
    fn=translate,
    inputs=[
        gr.Textbox(label="Input", lines=3),
        gr.Dropdown(LANGUAGES, label="Target Language", value="English"),
    ],
    outputs=gr.Textbox(label="Output", lines=8),
    title="Qwen3 Multilingual",
    description="Process or translate text using Qwen3's support for 100+ languages.",
)
Launch both tabs (save this file as app.py, next to the two modules above):
import gradio as gr

from reasoning_tab import reasoning_ui
from multilingual_tab import multilingual_ui

demo = gr.TabbedInterface(
    [reasoning_ui, multilingual_ui],
    tab_names=["Reasoning Mode", "Multilingual"],
)

if __name__ == "__main__":
    demo.launch(debug=True)
Run with:
python app.py
The app opens at http://127.0.0.1:7860 in your browser.
Using the OpenAI-Compatible Endpoint
If you have existing code targeting the OpenAI API, you can point it at Ollama with zero SDK changes:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # arbitrary, Ollama ignores it
)
completion = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "What are the MoE routing trade-offs in Qwen3? /think"}
    ],
)
print(completion.choices[0].message.content)
This makes swapping between Ollama and OpenAI purely a config change.
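One way to keep that switch in configuration is to read the backend from environment variables. In the sketch below, `LLM_BACKEND` and `LLM_MODEL` are hypothetical variable names chosen for illustration; only the Ollama base URL and placeholder key come from the example above.

```python
import os
from typing import Dict, Mapping


def client_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return base_url, api_key and model for the selected backend."""
    backend = env.get("LLM_BACKEND", "ollama")  # hypothetical variable name
    if backend == "ollama":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # arbitrary, Ollama ignores it
            "model": env.get("LLM_MODEL", "qwen3:8b"),
        }
    if backend == "openai":
        return {
            "base_url": "https://api.openai.com/v1",
            "api_key": env["OPENAI_API_KEY"],
            "model": env["LLM_MODEL"],  # must be set explicitly for this backend
        }
    raise ValueError(f"Unknown LLM_BACKEND: {backend!r}")


settings = client_settings({"LLM_BACKEND": "ollama"})
print(settings["base_url"], settings["model"])
# Usage with the SDK:
#   client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
#   client.chat.completions.create(model=settings["model"], messages=[...])
```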
Performance Notes
On an NVIDIA RTX 4090 (24GB VRAM), Qwen3-8B runs at roughly 80–100 tokens/second with /no_think. The 30B-A3B MoE model reaches ~40 tokens/second on the same card; it stays usable at that size because only about 3B of its 30B parameters are active per forward pass.
On Apple M3 Pro (18GB unified memory), Qwen3-8B runs comfortably at 20–30 tokens/second.
CPU-only inference is usable with Qwen3-4B (~5–8 tokens/second on a modern 8-core machine) but too slow for interactive use with larger models.
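To translate these throughput figures into waiting time, divide the response length by the rate. The sketch below uses the lower bound of each range quoted above; the 500-token response length is a hypothetical figure, and thinking mode adds its reasoning tokens on top.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Generation time for a response of `tokens` tokens at a steady rate."""
    return tokens / tokens_per_second


ANSWER_TOKENS = 500  # hypothetical response length

# Lower bounds of the ranges quoted in this section.
RATES = {
    "RTX 4090, Qwen3-8B": 80,
    "M3 Pro, Qwen3-8B": 20,
    "8-core CPU, Qwen3-4B": 5,
}

for setup, rate in RATES.items():
    print(f"{setup}: about {seconds_for(ANSWER_TOKENS, rate):.0f} s")
```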
What to Read Next
- Run Claude Code Locally with Ollama — use Qwen3 as the backing model for a local AI coding agent
- Local RAG Tutorial: LangChain, Ollama & ChromaDB — build a private retrieval-augmented pipeline on top of Ollama
- Building AI Agents with Local SLMs — multi-step tool-calling agents using Ollama-hosted models