How to Run Qwen3 Locally with Ollama: Setup, API, and a Gradio App

May 24, 2026 | guides

Why Qwen3?

Qwen3 is Alibaba's most capable open-weight family, covering models from 0.6B to 235B parameters. It ships with hybrid reasoning: you can toggle deep chain-of-thought on or off per request without loading a different model. Reported benchmark results on math, code, and multilingual tasks are competitive with DeepSeek-R1 and o3-mini. The full model range ships under an Apache 2.0 license.

The practical appeal for local deployment is the MoE architecture at the high end. Qwen3-30B-A3B activates only about 3 billion of its 30 billion parameters per token, so it generates at speeds comparable to a much smaller dense model. All 30B weights still have to be loaded, so memory requirements follow the total parameter count rather than the active one.
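The MoE trade-off can be put in rough numbers: weight memory scales with total parameters, while per-token compute scales with active parameters. The sketch below is back-of-the-envelope arithmetic only. The 4.5 bits-per-weight default is an assumption standing in for a typical 4-bit quantization, and the estimate ignores the KV cache and runtime overhead.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.5 is an assumed average for a 4-bit quantization.
    """
    total_bytes = total_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def active_fraction(active_params_billion: float, total_params_billion: float) -> float:
    """Share of the weights that take part in each token's forward pass."""
    return active_params_billion / total_params_billion


if __name__ == "__main__":
    # Qwen3-30B-A3B: all 30B weights are stored, about 3B are used per token.
    print(f"weights: ~{weight_memory_gb(30):.0f} GB")
    print(f"active per token: {active_fraction(3, 30):.0%}")
```

Under these assumptions the 30B MoE model needs roughly 17 GB for its weights while touching about a tenth of them per token, which is why it behaves like a large model in memory and a small one in speed.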


Prerequisites

  • macOS, Linux, or Windows (WSL2 recommended on Windows)
  • RAM: 8GB minimum for 0.6B–1.7B; 16–32GB for 8B; 64GB+ for 32B+
  • GPU (optional): Ollama auto-detects NVIDIA/AMD/Apple Silicon and offloads layers; CPU-only works but runs slowly on larger variants

Step 1: Install Ollama

Download from ollama.com/download and follow the platform installer. After installation, verify:

ollama --version

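Expected output (your version number will differ; Qwen3 needs a recent Ollama release, so update if the pull or load fails):

ollama version is 0.x.y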

On Linux, Ollama installs as a systemd service and starts automatically. On macOS, it runs as a menu bar app. On Windows, the native installer includes GPU support; WSL2 also works if you prefer a Linux environment.
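To confirm that the background server is running, and not just that the CLI is installed, you can query the server's /api/version endpoint. This is a minimal sketch using only the standard library. It assumes the default address localhost:11434, and the helper names are illustrative.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; change it if you set OLLAMA_HOST


def parse_version(body: bytes) -> str:
    """Extract the version string from the JSON body of /api/version."""
    return json.loads(body)["version"]


def server_version(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the running server's version, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    version = server_version()
    print(f"Ollama {version} is running" if version else "Ollama server not reachable")
```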


Step 2: Choose Your Qwen3 Variant

Ollama hosts the full Qwen3 family. Select based on available VRAM or RAM:

| Model | Run Command | VRAM / RAM | Best For |
|---|---|---|---|
| Qwen3-0.6B | ollama run qwen3:0.6b | ~1.5 GB | Edge devices, quick experiments |
| Qwen3-1.7B | ollama run qwen3:1.7b | ~2.5 GB | Chatbots, low-latency assistants |
| Qwen3-4B | ollama run qwen3:4b | ~4 GB | General tasks on consumer hardware |
| Qwen3-8B | ollama run qwen3:8b | ~8 GB | Multilingual, moderate reasoning |
| Qwen3-14B | ollama run qwen3:14b | ~14 GB | Complex reasoning, content work |
| Qwen3-32B | ollama run qwen3:32b | ~32 GB | Strong reasoning, large context |
| Qwen3-30B-A3B (MoE) | ollama run qwen3:30b-a3b | ~20 GB | Efficient coding, fast on GPU |
| Qwen3-235B-A22B (MoE) | ollama run qwen3:235b-a22b | ~140 GB | Enterprise-scale, best quality |

Start with qwen3:8b if you're unsure — it's the default when you run ollama run qwen3 without a tag.

Pull and run:

ollama run qwen3:8b

The first run downloads the model weights. Subsequent runs start from cache. Once the >>> prompt appears, the model is ready.
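If you want to script the choice, the memory figures from the table in this step can drive a small lookup. This is a sketch, not an official sizing tool: the numbers are the approximate values listed above, and the 2 GB headroom for the OS and context cache is an assumption you should tune.

```python
# Approximate memory needs in GB, taken from the table in this step,
# ordered from smallest to largest footprint.
VARIANTS = [
    ("qwen3:0.6b", 1.5),
    ("qwen3:1.7b", 2.5),
    ("qwen3:4b", 4),
    ("qwen3:8b", 8),
    ("qwen3:14b", 14),
    ("qwen3:30b-a3b", 20),
    ("qwen3:32b", 32),
    ("qwen3:235b-a22b", 140),
]


def pick_variant(available_gb: float, headroom_gb: float = 2.0):
    """Return the largest tag that fits, or None if nothing does.

    headroom_gb is an assumed reserve for the OS and the context cache.
    """
    budget = available_gb - headroom_gb
    fitting = [tag for tag, need_gb in VARIANTS if need_gb <= budget]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for memory_gb in (8, 16, 24):
        print(memory_gb, "GB ->", pick_variant(memory_gb))
```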


Step 3: Reasoning Modes — /think and /no_think

Qwen3's main differentiator is controllable reasoning depth. Appending /think to a prompt triggers full chain-of-thought reasoning, shown in a visible <think> block in the output. Appending /no_think suppresses it for faster, direct answers.

Deep reasoning (for math, code, analysis):

>>> Prove that √2 is irrational. /think

The model emits a <think> section working through the proof before the final answer.

Fast response (for factual lookup, drafting):

>>> What is the capital of Argentina? /no_think

No think block — the answer arrives immediately.

Omitting both tags uses the model's default, which for Qwen3-8B is thinking mode.
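When prompts are built in code, a small helper keeps the tag handling in one place. The function name and the three-state `thinking` argument are illustrative choices, not part of any Ollama or Qwen API.

```python
from typing import Optional


def with_reasoning_tag(prompt: str, thinking: Optional[bool] = None) -> str:
    """Append /think or /no_think to a prompt.

    thinking=None leaves the prompt untouched, so the model's default applies.
    """
    if thinking is None:
        return prompt
    tag = "/think" if thinking else "/no_think"
    return f"{prompt.rstrip()} {tag}"


if __name__ == "__main__":
    print(with_reasoning_tag("Prove that √2 is irrational.", thinking=True))
    print(with_reasoning_tag("What is the capital of Argentina?", thinking=False))
```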


Step 4: REST API Usage

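If Ollama is not already running (the desktop app and the Linux systemd service start it automatically), start the server in a separate terminal: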

ollama serve

The server listens on localhost:11434. Send requests using curl:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:8b",
  "messages": [
    { "role": "user", "content": "Explain entropy in thermodynamics. /think" }
  ],
  "stream": false
}'

Response shape (abridged; timing and metadata fields omitted):

{
  "model": "qwen3:8b",
  "message": {
    "role": "assistant",
    "content": "<think>...</think>\n\nEntropy is a measure of..."
  },
  "done": true
}

Set "stream": true (also the default when the field is omitted) to receive the response incrementally, which is useful for building a real-time UI.
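When streaming, /api/chat returns newline-delimited JSON: each line is one object carrying a fragment of the message, and the last one has "done": true. The sketch below reassembles those fragments using only the standard library. The `collect_stream` name is illustrative, and the commented usage assumes a server on the default port.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the content fragments of a streamed /api/chat response."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines between objects
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the reply
    return "".join(parts)


# Usage against a live server (not executed here):
#   import urllib.request
#   payload = json.dumps({
#       "model": "qwen3:8b",
#       "messages": [{"role": "user", "content": "Hello /no_think"}],
#       "stream": True,
#   }).encode()
#   req = urllib.request.Request(
#       "http://localhost:11434/api/chat",
#       data=payload,
#       headers={"Content-Type": "application/json"},
#   )
#   with urllib.request.urlopen(req) as resp:
#       print(collect_stream(resp))  # an HTTP response iterates line by line
```

For a live UI you would render each fragment as it arrives instead of joining them at the end; the parsing is the same.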

Ollama also exposes an OpenAI-compatible API alongside its native one. If you're using an OpenAI SDK client, point it at the /v1/chat/completions endpoint.


Step 5: Python SDK

pip install ollama

Basic inference:

import ollama

response = ollama.chat(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "Summarize transformer self-attention in 3 sentences. /think"}
    ]
)
print(response["message"]["content"])

The output includes the <think> block followed by the answer. To strip the thinking section programmatically:

import re

content = response["message"]["content"]
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
print(answer)
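If you want to keep the reasoning rather than discard it, for example to show it in a collapsible panel, the same pattern can return both parts. A minimal sketch, assuming at most one <think> block per response; `split_reasoning` is an illustrative name.

```python
import re
from typing import Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)


def split_reasoning(content: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no <think> block exists."""
    match = THINK_BLOCK.search(content)
    reasoning = match.group(1).strip() if match else ""
    answer = THINK_BLOCK.sub("", content).strip()
    return reasoning, answer


if __name__ == "__main__":
    reasoning, answer = split_reasoning("<think>\nCheck the units.\n</think>\n\nEntropy is...")
    print("reasoning:", reasoning)
    print("answer:", answer)
```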

For streaming with Python:

stream = ollama.chat(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Write a Python quicksort. /no_think"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
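The regex approach needs the full response, so it does not help while streaming. One way to hide the reasoning in a stream is to buffer text until the closing </think> tag has passed and only then start emitting. The generator below is a sketch of that idea: it works on plain strings, so it can be fed from any chunk source, and it assumes the think block, if present, comes first in the response.

```python
from typing import Iterable, Iterator

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def answer_only(pieces: Iterable[str]) -> Iterator[str]:
    """Yield streamed text with a leading <think>...</think> block removed."""
    buffer = ""
    passthrough = False
    for piece in pieces:
        if passthrough:
            yield piece
            continue
        buffer += piece
        if CLOSE_TAG in buffer:
            # Reasoning is over: emit whatever followed the closing tag.
            passthrough = True
            tail = buffer.split(CLOSE_TAG, 1)[1].lstrip()
            if tail:
                yield tail
        else:
            # Keep buffering while the text could still be an opening tag
            # (tags may be split across chunks); otherwise there is no block.
            start = buffer.lstrip()[:len(OPEN_TAG)]
            if start and not OPEN_TAG.startswith(start):
                passthrough = True
                yield buffer


# Usage with the stream from the example above (not executed here):
#   texts = (chunk["message"]["content"] for chunk in stream)
#   for text in answer_only(texts):
#       print(text, end="", flush=True)
```

The cost of this design is latency: nothing is printed until the reasoning has finished, which can take a while in /think mode.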

Step 6: Build a Reasoning + Multilingual Gradio App

Install dependencies:

pip install gradio ollama

reasoning_tab.py

import gradio as gr
import ollama

def run_qwen3(prompt: str, mode: str) -> str:
    tagged = f"{prompt} /{mode}"
    response = ollama.chat(
        model="qwen3:8b",
        messages=[{"role": "user", "content": tagged}],
    )
    return response["message"]["content"]

reasoning_ui = gr.Interface(
    fn=run_qwen3,
    inputs=[
        gr.Textbox(label="Prompt", lines=3),
        gr.Radio(["think", "no_think"], label="Reasoning Mode", value="think"),
    ],
    outputs=gr.Textbox(label="Response", lines=10),
    title="Qwen3 Reasoning Demo",
    description="Toggle chain-of-thought reasoning on or off per request.",
)
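In /think mode a response can take a while, and the interface above shows nothing until it is complete. Gradio accepts generator functions, so one option is to yield the text accumulated so far after each chunk. The sketch below keeps the Ollama call behind a parameter so the logic can be exercised without a server. `stream_reply` and `ollama_stream` are illustrative names, and the wiring in the comments is an assumption about how you would connect it.

```python
from typing import Callable, Iterable, Iterator


def stream_reply(
    chat_stream: Callable[[str], Iterable[dict]],
    prompt: str,
    mode: str,
) -> Iterator[str]:
    """Yield the response accumulated so far after every streamed chunk."""
    tagged = f"{prompt} /{mode}"
    partial = ""
    for chunk in chat_stream(tagged):
        partial += chunk["message"]["content"]
        yield partial  # Gradio replaces the output box with each yielded value


# Wiring it to Ollama and Gradio (assumed, not executed here):
#   def ollama_stream(text):
#       return ollama.chat(
#           model="qwen3:8b",
#           messages=[{"role": "user", "content": text}],
#           stream=True,
#       )
#
#   def run_qwen3_streaming(prompt, mode):
#       yield from stream_reply(ollama_stream, prompt, mode)
#
#   # then pass fn=run_qwen3_streaming to gr.Interface
```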

multilingual_tab.py

import gradio as gr
import ollama

LANGUAGES = ["English", "French", "Spanish", "German", "Hindi", "Chinese", "Arabic", "Japanese"]

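def translate(prompt: str, lang: str) -> str:
    """Send the prompt as-is for English; otherwise ask for a translated reply."""
    if lang == "English":
        # No translation needed: pass the prompt through unchanged.
        message = prompt
    else:
        # Ask the model to translate the input and answer in the target language.
        message = f"Translate the following to {lang}, then respond in {lang}: {prompt}"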

    response = ollama.chat(
        model="qwen3:8b",
        messages=[{"role": "user", "content": message}],
    )
    return response["message"]["content"]

multilingual_ui = gr.Interface(
    fn=translate,
    inputs=[
        gr.Textbox(label="Input", lines=3),
        gr.Dropdown(LANGUAGES, label="Target Language", value="English"),
    ],
    outputs=gr.Textbox(label="Output", lines=8),
    title="Qwen3 Multilingual",
    description="Process or translate text using Qwen3's support for 100+ languages.",
)

Launch both tabs from app.py:

import gradio as gr
from reasoning_tab import reasoning_ui
from multilingual_tab import multilingual_ui

demo = gr.TabbedInterface(
    [reasoning_ui, multilingual_ui],
    tab_names=["Reasoning Mode", "Multilingual"],
)

if __name__ == "__main__":
    demo.launch(debug=True)

Run with:

python app.py

The app is served at http://127.0.0.1:7860; open that URL in your browser.


Using the OpenAI-Compatible Endpoint

If you have existing code targeting the OpenAI API, you can point it at Ollama with zero SDK changes:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # arbitrary, Ollama ignores it
)

completion = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "What are the MoE routing trade-offs in Qwen3? /think"}
    ],
)
print(completion.choices[0].message.content)

This makes swapping between Ollama and OpenAI purely a config change.
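One way to make that config change explicit is to resolve the connection settings from the environment. This is a sketch only: LLM_BACKEND and OPENAI_MODEL are hypothetical variable names chosen for illustration, and the hosted model name is left for you to supply.

```python
import os
from typing import Mapping


def client_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Pick base URL, API key, and model from the environment."""
    backend = env.get("LLM_BACKEND", "ollama")  # hypothetical variable name
    if backend == "ollama":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # placeholder; Ollama ignores it
            "model": "qwen3:8b",
        }
    return {
        "base_url": "https://api.openai.com/v1",
        "api_key": env.get("OPENAI_API_KEY", ""),
        "model": env.get("OPENAI_MODEL", ""),  # hypothetical; set to the model you use
    }


# Usage (assumes the openai package is installed):
#   settings = client_settings()
#   client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
#   client.chat.completions.create(model=settings["model"], messages=[...])
```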


Performance Notes

On an NVIDIA RTX 4090 (24GB VRAM), Qwen3-8B runs at roughly 80–100 tokens/second in /no_think mode. The 30B-A3B MoE model reaches ~40 tokens/second within the same VRAM budget, which is practical for a 30B model only because active parameters per forward pass stay near 3B.

On Apple M3 Pro (18GB unified memory), Qwen3-8B runs comfortably at 20–30 tokens/second.

CPU-only inference is usable with Qwen3-4B (~5–8 tokens/second on a modern 8-core machine) but too slow for interactive use with larger models.
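Throughput varies with quantization, context length, and drivers, so it is worth measuring on your own machine. A completed, non-streaming response from Ollama's native API includes eval_count (tokens generated) and eval_duration (in nanoseconds), which is enough to compute generation speed. The helper below is a minimal sketch that takes the response as a plain dict.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / (duration_ns / 1e9)


if __name__ == "__main__":
    # Example values only: 200 tokens generated in 2.5 seconds.
    sample = {"eval_count": 200, "eval_duration": 2_500_000_000}
    print(f"{tokens_per_second(sample):.1f} tokens/second")
```

This measures generation only. Prompt processing is reported separately, and in /think mode the reasoning tokens count toward eval_count even if you strip them from the displayed answer.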

