How do I run Qwen3 locally?

Install Ollama from ollama.com, then run 'ollama run qwen3' to download and start the default 8B model. For specific sizes use 'ollama run qwen3:14b' or 'ollama run qwen3:30b-a3b' (the efficient MoE variant). Models are ready to query immediately after download with no additional configuration.

What is the difference between Qwen3 /think and /no_think modes?

Qwen3 supports two reasoning modes. Appending /think to your prompt enables chain-of-thought reasoning inside <think> tags before the final answer, which is more accurate for math and coding but slower. Appending /no_think skips the reasoning step for a roughly 2-3x faster direct response. In the API, recent Ollama releases accept a top-level 'think' field set to true or false on the request as the equivalent switch.

Which Qwen3 model should I download?

For laptops (8-16GB RAM): qwen3:8b is the best starting point. For a workstation with 16GB+ VRAM: qwen3:14b, or the qwen3:30b-a3b MoE variant if you have roughly 20GB available (only 3B active parameters per token, so it generates faster than 14B at comparable quality). qwen3:32b is the highest-quality dense option for consumer GPUs; whether it fits on a single 24GB RTX 4090 depends on the quantization you pull.

Does Qwen3 work for non-English languages?

Yes. Qwen3 supports 119 languages and dialects, and it performs strongly against comparably sized models on multilingual benchmarks covering Chinese, Japanese, Korean, Arabic, French, and Spanish. It is a strong default for local applications that need non-English language support.

How to Run Qwen3 Locally with Ollama: Setup, API, and a Gradio App

May 24, 2026 • guides

Why Qwen3?
Prerequisites
Step 1: Install Ollama
Step 2: Choose Your Qwen3 Variant
Step 3: Reasoning Modes — /think and /no_think
Step 4: REST API Usage
Step 5: Python SDK
Step 6: Build a Reasoning + Multilingual Gradio App
reasoning_tab.py
multilingual_tab.py
Launch both tabs:
Using the OpenAI-Compatible Endpoint
Performance Notes
What to Read Next

Why Qwen3?

Qwen3 is Alibaba's most capable open-weight family, covering models from 0.6B to 235B parameters. It ships with hybrid reasoning: toggle deep chain-of-thought on or off per-request without loading a different model. Benchmark performance on math, code, and multilingual tasks rivals DeepSeek-R1 and o3-mini. The full model range ships under an Apache 2.0 license.

The practical appeal for local deployment is the MoE architecture at the high end. Qwen3-30B-A3B activates only 3 billion parameters per token during inference despite having 30B total, so it generates at speeds comparable to a much smaller dense model. All 30B parameters still have to be loaded, so memory requirements follow the total size rather than the active size.
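To make the total-versus-active distinction concrete, here is a rough back-of-the-envelope sketch. It assumes 4-bit quantized weights and ignores the KV cache and runtime overhead, so treat the numbers as lower bounds rather than measurements; the helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return total_params_billion * bytes_per_weight


# Memory follows TOTAL parameters; per-token compute follows ACTIVE parameters.
moe_total_b, moe_active_b = 30.0, 3.0
print(f"30B-A3B weights at 4-bit: ~{weight_memory_gb(moe_total_b):.0f} GB")
print(f"A dense 3B model at 4-bit: ~{weight_memory_gb(moe_active_b):.1f} GB")
```

Under these assumptions the MoE model needs about ten times the memory of a dense 3B model, even though each token touches a similar number of parameters.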

Prerequisites

macOS, Linux, or Windows (WSL2 recommended on Windows)
RAM: 8GB minimum for 0.6B–1.7B; 16–32GB for 8B; 64GB+ for 32B+
GPU (optional): Ollama auto-detects NVIDIA/AMD/Apple Silicon and offloads layers; CPU-only works but runs slowly on larger variants

Step 1: Install Ollama

Download from ollama.com/download and follow the platform installer. After installation, verify:

ollama --version

Expected output (the exact version number depends on the release you installed):

ollama version is 0.x.x

On Linux, Ollama installs as a systemd service and starts automatically. On macOS, it runs as a menu bar app. On Windows, the native installer runs Ollama in the background; install inside WSL2 instead if you prefer a Linux environment.
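If you would rather confirm from a script that the background server is reachable, a minimal sketch using only the standard library could look like this. It assumes the default address `http://localhost:11434` and the server's `/api/version` endpoint; the function name `ollama_version` is just for illustration.

```python
import json
import urllib.error
import urllib.request


def ollama_version(base_url: str = "http://localhost:11434", timeout: float = 2.0):
    """Return the server's version string, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON
        return None
```

Calling `ollama_version()` returns `None` when nothing is listening, which is a quick hint that the service or desktop app has not started.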

Step 2: Choose Your Qwen3 Variant

Ollama hosts the full Qwen3 family. Select based on available VRAM or RAM:

Model	Run Command	VRAM / RAM	Best For
Qwen3-0.6B	`ollama run qwen3:0.6b`	~1.5 GB	Edge devices, quick experiments
Qwen3-1.7B	`ollama run qwen3:1.7b`	~2.5 GB	Chatbots, low-latency assistants
Qwen3-4B	`ollama run qwen3:4b`	~4 GB	General tasks on consumer hardware
Qwen3-8B	`ollama run qwen3:8b`	~8 GB	Multilingual, moderate reasoning
Qwen3-14B	`ollama run qwen3:14b`	~14 GB	Complex reasoning, content work
Qwen3-32B	`ollama run qwen3:32b`	~32 GB	Strong reasoning, large context
Qwen3-30B-A3B (MoE)	`ollama run qwen3:30b-a3b`	~20 GB	Efficient coding, fast on GPU
Qwen3-235B-A22B (MoE)	`ollama run qwen3:235b-a22b`	~140 GB	Enterprise-scale, best quality

Start with qwen3:8b if you're unsure — it's the default when you run ollama run qwen3 without a tag.
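The table can also be turned into a small lookup. The sketch below copies the approximate memory figures from the table as-is and picks the largest variant that fits a given budget; the `pick_variant` helper and the 1 GB headroom default are assumptions made for this example, not part of Ollama.

```python
# (tag, approximate memory in GB) copied from the table above
QWEN3_VARIANTS = [
    ("qwen3:0.6b", 1.5),
    ("qwen3:1.7b", 2.5),
    ("qwen3:4b", 4.0),
    ("qwen3:8b", 8.0),
    ("qwen3:14b", 14.0),
    ("qwen3:30b-a3b", 20.0),
    ("qwen3:32b", 32.0),
    ("qwen3:235b-a22b", 140.0),
]


def pick_variant(available_gb: float, headroom_gb: float = 1.0):
    """Return the tag of the largest listed variant that fits, or None if none does."""
    budget = available_gb - headroom_gb  # leave room for context and other processes
    fitting = [(tag, gb) for tag, gb in QWEN3_VARIANTS if gb <= budget]
    if not fitting:
        return None
    return max(fitting, key=lambda item: item[1])[0]


print(pick_variant(24))
print(pick_variant(10))
```

With the table's figures, a 24 GB budget selects `qwen3:30b-a3b` and a 10 GB budget selects `qwen3:8b`.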

Pull and run:

ollama run qwen3:8b

The first run downloads the model weights. Subsequent runs start from cache. Once the >>> prompt appears, the model is ready.

Step 3: Reasoning Modes — `/think` and `/no_think`

Qwen3's biggest differentiator is controllable reasoning depth. Appending /think to any prompt triggers full chain-of-thought reasoning with visible <think> blocks in the output. Appending /no_think suppresses it for faster, direct answers.

Deep reasoning (for math, code, analysis):

>>> Prove that √2 is irrational. /think

The model emits a <think> section working through the proof before the final answer.

Fast response (for factual lookup, drafting):

>>> What is the capital of Argentina? /no_think

No think block — the answer arrives immediately.

Omitting both tags uses the model's default. For the hybrid Qwen3 models, including Qwen3-8B, that default is thinking mode, so add /no_think whenever you want the fast path.
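When prompts come from code rather than the interactive prompt, it can help to apply the switch in one place. The following is a minimal sketch; `tag_prompt` is a hypothetical helper, and it only handles a switch that sits at the very end of the prompt.

```python
def tag_prompt(prompt: str, think: bool) -> str:
    """Append Qwen3's soft switch, replacing any switch already at the end."""
    stripped = prompt.rstrip()
    for tag in ("/no_think", "/think"):
        if stripped.endswith(tag):
            # Drop the existing switch so the prompt never carries two of them
            stripped = stripped[: -len(tag)].rstrip()
            break
    return f"{stripped} {'/think' if think else '/no_think'}"


print(tag_prompt("Prove that √2 is irrational.", think=True))
print(tag_prompt("What is the capital of Argentina? /think", think=False))
```

Centralizing the tag this way avoids prompts that accidentally end in both switches.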

Step 4: REST API Usage

If Ollama is not already running as a service or desktop app, start the server in a separate terminal:

ollama serve

The server listens on localhost:11434. Send requests using curl:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:8b",
  "messages": [
    { "role": "user", "content": "Explain entropy in thermodynamics. /think" }
  ],
  "stream": false
}'

Response shape:

{
  "model": "qwen3:8b",
  "message": {
    "role": "assistant",
    "content": "<think>...</think>\n\nEntropy is a measure of..."
  },
  "done": true
}

Set "stream": true to get token-by-token streaming, useful for building real-time UI.
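With streaming enabled, the chat endpoint sends one JSON object per line, each carrying a fragment of the message, with the final object marked `"done": true`. A minimal sketch of reassembling those fragments follows; `join_stream` is a made-up name, and the sample lines are illustrative rather than captured output.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate message content from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Illustrative chunks shaped like the streaming response
sample = [
    b'{"message": {"role": "assistant", "content": "Entropy "}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "is..."}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print(join_stream(sample))
```

In a real client you would pass the open HTTP response object instead of `sample`, since iterating over it yields the body line by line; a UI would render each fragment as it arrives rather than joining at the end.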

Ollama also exposes an OpenAI-compatible API. If you're using an OpenAI SDK client, point it at the /v1/chat/completions endpoint instead.

Step 5: Python SDK

pip install ollama

Basic inference:

import ollama

response = ollama.chat(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "Summarize transformer self-attention in 3 sentences. /think"}
    ]
)
print(response["message"]["content"])

The output includes the <think> block followed by the answer. To strip the thinking section programmatically:

import re

content = response["message"]["content"]
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
print(answer)
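If you want to keep the reasoning (for logging or a collapsible UI panel) instead of discarding it, a small variation can return both parts. This is a sketch that assumes at most one `<think>` block appears inline in the content; `split_thinking` is a hypothetical helper name.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)


def split_thinking(content: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no <think> block exists."""
    match = THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the answer
    answer = (content[: match.start()] + content[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>Recall the definition.</think>\n\nEntropy is a measure of...")
print(reasoning)
print(answer)
```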

For streaming with Python:

stream = ollama.chat(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Write a Python quicksort. /no_think"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)

Step 6: Build a Reasoning + Multilingual Gradio App

Install dependencies:

pip install gradio ollama

reasoning_tab.py

import gradio as gr
import ollama

def run_qwen3(prompt: str, mode: str) -> str:
    tagged = f"{prompt} /{mode}"
    response = ollama.chat(
        model="qwen3:8b",
        messages=[{"role": "user", "content": tagged}],
    )
    return response["message"]["content"]

reasoning_ui = gr.Interface(
    fn=run_qwen3,
    inputs=[
        gr.Textbox(label="Prompt", lines=3),
        gr.Radio(["think", "no_think"], label="Reasoning Mode", value="think"),
    ],
    outputs=gr.Textbox(label="Response", lines=10),
    title="Qwen3 Reasoning Demo",
    description="Toggle chain-of-thought reasoning on or off per request.",
)

multilingual_tab.py

import gradio as gr
import ollama

LANGUAGES = ["English", "French", "Spanish", "German", "Hindi", "Chinese", "Arabic", "Japanese"]

def translate(prompt: str, lang: str) -> str:
    if lang == "English":
        message = prompt
    else:
        message = f"Translate the following to {lang}, then respond in {lang}: {prompt}"

    response = ollama.chat(
        model="qwen3:8b",
        messages=[{"role": "user", "content": message}],
    )
    return response["message"]["content"]

multilingual_ui = gr.Interface(
    fn=translate,
    inputs=[
        gr.Textbox(label="Input", lines=3),
        gr.Dropdown(LANGUAGES, label="Target Language", value="English"),
    ],
    outputs=gr.Textbox(label="Output", lines=8),
    title="Qwen3 Multilingual",
    description="Process or translate text using Qwen3's 119-language support.",
)

Launch both tabs (save as app.py):

import gradio as gr
from reasoning_tab import reasoning_ui
from multilingual_tab import multilingual_ui

demo = gr.TabbedInterface(
    [reasoning_ui, multilingual_ui],
    tab_names=["Reasoning Mode", "Multilingual"],
)

if __name__ == "__main__":
    demo.launch(debug=True)

Run with:

python app.py

The app opens at http://127.0.0.1:7860 in your browser.

Using the OpenAI-Compatible Endpoint

If you have existing code targeting the OpenAI API, you can point it at Ollama with zero SDK changes:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # arbitrary, Ollama ignores it
)

completion = client.chat.completions.create(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "What are the MoE routing trade-offs in Qwen3? /think"}
    ],
)
print(completion.choices[0].message.content)

This makes swapping between Ollama and OpenAI purely a config change.
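One way to keep that switch in configuration is a tiny helper that returns the constructor arguments for each backend. This is only a sketch: `client_kwargs` and the backend labels are invented for the example, and it relies on the OpenAI SDK reading `OPENAI_API_KEY` from the environment when no key is passed.

```python
def client_kwargs(backend: str) -> dict:
    """Return keyword arguments for OpenAI(...) for the chosen backend."""
    if backend == "ollama":
        # Local Ollama server; the key is a placeholder that Ollama ignores
        return {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}
    if backend == "openai":
        # Empty: the SDK falls back to its default URL and OPENAI_API_KEY
        return {}
    raise ValueError(f"unknown backend: {backend!r}")


print(client_kwargs("ollama"))
```

The client is then built with `OpenAI(**client_kwargs(backend))`, where `backend` might come from an environment variable or a settings file.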

Performance Notes

On an NVIDIA RTX 4090 (24GB VRAM), Qwen3-8B runs at roughly 80–100 tokens/second in /no_think mode. The 30B-A3B MoE model reaches ~40 tokens/second on the same card because active parameters per forward pass stay near 3B, not 30B.

On Apple M3 Pro (18GB unified memory), Qwen3-8B runs comfortably at 20–30 tokens/second.

CPU-only inference is usable with Qwen3-4B (~5–8 tokens/second on a modern 8-core machine) but too slow for interactive use with larger models.
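To check these figures on your own hardware, you can compute throughput from the timing fields Ollama includes in a completed (non-streaming) response: `eval_count` is the number of generated tokens and `eval_duration` is the generation time in nanoseconds. The sketch below uses a made-up response dict rather than a real measurement.

```python
def tokens_per_second(response: dict) -> float:
    """Generation throughput from a completed Ollama response."""
    seconds = response["eval_duration"] / 1e9  # eval_duration is in nanoseconds
    return response["eval_count"] / seconds


# Illustrative values only, not measured output
example = {"eval_count": 200, "eval_duration": 2_500_000_000}
print(f"{tokens_per_second(example):.1f} tokens/second")
```

Thinking tokens count toward `eval_count`, so compare runs in the same reasoning mode.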

Related Guides

Self-Hosted LLM Guide 2026: Run AI Locally for Privacy & Savings
guides • Shan • 2026-07-03
Tags: llm, self-hosted, ollama, hardware, privacy
Complete 2026 guide to running LLMs locally for privacy and cost savings. Set up Ollama, llama.cpp, and vLLM on your hardware.

Build a Local LLM Zero-Shot Classifier You Can Actually Deploy
guides • Shan • 2026-06-07
Tags: Zero-Shot Classification, Local LLM, Ollama, NLP, Production AI
Learn how to run zero-shot text classification on a local model with Ollama, enforce strict JSON outputs, and add confidence-aware routing for production triage.

The Complete Developer Guide to Running LLMs Locally: From Ollama to Production
guides • architect • 2026-05-25
Tags: Local LLMs, Ollama, llama.cpp, RAG, Docker, GGUF, LLM Engineering
Everything you need to run LLMs on your own hardware in 2026: VRAM sizing, model formats, an 8-tool comparison table, a full local RAG pipeline, and Docker production deployment with GPU passthrough and Nginx auth.
