Local RAG Tutorial: LangChain, Ollama & ChromaDB with Ragas

May 13, 2026 · guides

What You'll Build

In this article, we’ll cover setup, chunking & embeddings, retrieval tricks, a minimal app (Python + Streamlit), a Node.js variant, and quality evaluation with Ragas — all running locally with Ollama and ChromaDB.

  • A local chat-with-your-docs app (PDFs, web pages, CSVs) that:
    • Indexes content in ChromaDB with strong Ollama embeddings.
    • Retrieves highly relevant chunks (with LangChain retrievers).
    • Answers with a local LLM via Ollama.
    • Validates quality using Ragas metrics.

Local RAG Stack: Private, Secure, and Offline


Core Components — Why These Tools?

  • LangChain orchestrates the pipeline (load → split → embed → store → retrieve → generate) and provides utilities like RecursiveCharacterTextSplitter, RetrievalQA, and MultiQueryRetriever for better recall.
  • Ollama runs open models locally (e.g., Llama 3.2, Qwen 2.5), keeps data private, and also serves embedding models such as mxbai-embed-large or nomic-embed-text.
  • ChromaDB is a lightweight vector store that persists to disk (or runs as a server) and integrates cleanly with LangChain in both Python and Node.
  • Ragas adds measurable evaluation — faithfulness, answer correctness, answer relevance, context precision/recall, and semantic similarity — so you know when your RAG is actually working.

End‑to‑End Architecture

  1. Ingest: PDF/Web/CSV loaders → unified document set.
  2. Split: Chunk text (size & overlap tuned to your corpus).
  3. Embed: Local embedding model via Ollama.
  4. Index: Store vectors + metadata in ChromaDB (persisted).
  5. Retrieve: Similarity (or MMR/MultiQuery) retriever.
  6. Generate: Local LLM answers grounded in retrieved context.
  7. Evaluate: Run Ragas over Q/A + contexts to quantify quality.
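
Condensed into plain LangChain calls, steps 1–6 look roughly like this (a minimal sketch; "report.pdf" and the question are placeholder inputs, and the Streamlit app below adds caching and a chat UI):

# pipeline_sketch.py — steps 1–6 in ~15 lines
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_ollama import ChatOllama, OllamaEmbeddings

docs = PyPDFLoader("report.pdf").load()                          # 1. Ingest
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)                                          # 2. Split
vs = Chroma.from_documents(                                      # 3–4. Embed + index
    chunks,
    OllamaEmbeddings(model="mxbai-embed-large"),
    persist_directory="./chroma_db",
)
question = "What does the document say about Q3 revenue?"
hits = vs.as_retriever(search_kwargs={"k": 4}).invoke(question)  # 5. Retrieve
context = "\n\n".join(d.page_content for d in hits)
reply = ChatOllama(model="llama3.2").invoke(                     # 6. Generate
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)
print(reply.content)

Step 7, evaluation, is covered with Ragas further below.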

Setup Quickstart (Python)

  1. Install libraries:
    pip install streamlit langchain langchain-ollama langchain-community chromadb pypdf python-dotenv ragas datasets
    
  2. Install & Run Ollama: Download from ollama.com and pull your models:
    ollama pull llama3.2
    ollama pull mxbai-embed-large
    
  3. Chroma Server (Optional): You can use local persistence or run the server in Docker:
    docker run -p 8000:8000 chromadb/chroma
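
Before building the app, a quick smoke test confirms Ollama is serving both models (a minimal sketch; the second line prints the embedding dimension):

# smoke_test.py — verify the chat and embedding models respond
from langchain_ollama import ChatOllama, OllamaEmbeddings

print(ChatOllama(model="llama3.2").invoke("Reply with one word: ready?").content)
print(len(OllamaEmbeddings(model="mxbai-embed-large").embed_query("hello")))  # embedding dimension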
    

Reference Implementation: Python + Streamlit

The single-file app below consolidates common patterns from current LangChain tutorials.

# app.py
import os
import tempfile
import streamlit as st
from dotenv import load_dotenv
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA

load_dotenv()

st.set_page_config(page_title="Local PDF RAG", layout="wide")
st.title("Chat with your PDF (Local RAG)")

# --- Sidebar: model + file ---
model_name = st.sidebar.selectbox("Model", ["llama3.2"], index=0)
uploaded = st.sidebar.file_uploader("Upload a PDF", type=["pdf"])

# --- Helpers (cached) ---
@st.cache_resource
def get_llm(name):
    # temperature=0 keeps answers deterministic and grounded in the retrieved context
    return ChatOllama(model=name, temperature=0)

@st.cache_resource
def get_embed():
    return OllamaEmbeddings(model="mxbai-embed-large")

def load_pdf(file):
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(file.read())
        path = tmp.name
    docs = PyPDFLoader(path).load()
    os.unlink(path)
    return docs

def split_docs(docs, sz=1000, ov=200):
    return RecursiveCharacterTextSplitter(chunk_size=sz, chunk_overlap=ov).split_documents(docs)

@st.cache_resource
def index_texts(_texts, _emb, key):
    # Leading underscores tell Streamlit not to hash the unhashable Document
    # and embedding objects; the cache is keyed on the uploaded file's name.
    persist_dir = os.path.join(tempfile.gettempdir(), "chroma_db")
    return Chroma.from_documents(_texts, _emb, persist_directory=persist_dir)

# --- Build the vector store ---
if uploaded:
    docs = load_pdf(uploaded)
    texts = split_docs(docs)
    embeddings = get_embed()
    vector_store = index_texts(texts, embeddings, uploaded.name)
    retriever = vector_store.as_retriever()
    qa = RetrievalQA.from_chain_type(
        llm=get_llm(model_name), 
        chain_type="stuff",
        retriever=retriever
    )

    # --- Chat UI ---
    if "messages" not in st.session_state:
        st.session_state["messages"] = []
    
    for role, msg in st.session_state["messages"]:
        with st.chat_message(role):
            st.write(msg)
            
    if prompt := st.chat_input("Ask about the PDF"):
        st.session_state["messages"].append(("user", prompt))
        with st.chat_message("user"):
            st.write(prompt)
            
        with st.chat_message("assistant"):
            try:
                answer = qa.invoke({"query": prompt})["result"]
                st.write(answer)
                st.session_state["messages"].append(("assistant", answer))
            except Exception as e:
                st.error(f"Error: {e}")
else:
    st.info("Upload a PDF to begin.")
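
To try it, save the file as app.py and launch it with Streamlit:

streamlit run app.py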

Node.js Variant: Code-Aware Local RAG

For engineering workflows like converting JavaScript to TypeScript, you can use a similar stack in Node.js.

// index.js (Node.js snippet)
// npm install @langchain/ollama @langchain/community @langchain/core chromadb
const { Ollama, OllamaEmbeddings } = require("@langchain/ollama");
const { Chroma } = require("@langchain/community/vectorstores/chroma");
const { Document } = require("@langchain/core/documents");

async function run() {
  // Stand-in documents; in practice, load and split your source files first
  const docs = [
    new Document({ pageContent: "function add(a, b) { return a + b; }" }),
  ];

  const embeddings = new OllamaEmbeddings({ model: "mxbai-embed-large" });
  // Assumes the Chroma server from the setup step is running on port 8000
  const vectorStore = await Chroma.fromDocuments(docs, embeddings, {
    collectionName: "code-examples",
    url: "http://localhost:8000",
  });

  const model = new Ollama({ model: "qwen2.5-coder:0.5b" });
  // ... proceed to query and generate
}

run().catch(console.error);

Comparison: Local RAG vs. Cloud RAG

Aspect         | Local RAG (Our Build)             | Cloud RAG (OpenAI/Anthropic)
Data Privacy   | 100% private (on-prem)            | Subject to provider terms
Cost           | No per-token cost                 | Pay per token
Latency        | No network round trip (localhost) | Network-dependent
Customization  | Full control (models, chunking)   | Limited to provider APIs
Offline Use    | Supported                         | Not supported

Evaluating with Ragas

Why evaluate? Retrieval and prompts often "look right" but produce subtly wrong answers. Ragas quantifies this with five critical dimensions:

  1. Faithfulness: Does the answer stay true to the retrieved context?
  2. Answer Correctness: Is the answer factually accurate?
  3. Answer Relevancy: Does the answer directly address the user's prompt?
  4. Context Precision: Are the retrieved chunks truly relevant?
  5. Context Recall: Did we retrieve all necessary information?

Pro Tip: If local evaluation times out, you can use a hosted LLM (like GPT-4o) as the evaluator while keeping your actual application and data 100% local.
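
An all-local evaluation run can look like the sketch below. It uses the Dataset-based Ragas API with its LangChain wrappers; exact module paths and column names have shifted between Ragas versions, and the single evaluation row is invented for illustration:

# eval_ragas.py — all-local Ragas run; swap llm= for a hosted model per the tip above
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_ollama import ChatOllama, OllamaEmbeddings

# One hand-made row: question, the pipeline's answer, retrieved chunks, reference answer
data = {
    "question": ["What grew 12% in Q3?"],
    "answer": ["Q3 revenue grew 12% year over year."],
    "contexts": [["Q3 revenue increased 12% YoY, driven by subscriptions."]],
    "ground_truth": ["Revenue grew 12% in Q3."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=LangchainLLMWrapper(ChatOllama(model="llama3.2")),
    embeddings=LangchainEmbeddingsWrapper(OllamaEmbeddings(model="mxbai-embed-large")),
)
print(result)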


Production Checklist

  • Chunking: Start around chunk_size=750 / chunk_overlap=200 (characters) for prose, or 500/100 for granular data, then tune by spot-checking what the retriever returns.
  • Retrieval: Use MultiQueryRetriever when single-query recall is thin (see the sketch after this list).
  • Persistence: Use a stable persist_directory or the Chroma Docker server to avoid re-indexing.
  • Models: Llama 3.2 is excellent for general chat; use Qwen 2.5 Coder for development tasks.
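
For the retrieval item above, MultiQueryRetriever drops into the Streamlit app with one change (a minimal sketch reusing vector_store and get_llm from app.py): the LLM rewrites each question into several variants and the retriever merges the unique hits.

# In app.py, replace vector_store.as_retriever() with:
from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
    llm=get_llm(model_name),
)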