Why should I use metadata filtering before scoring?

Filtering candidate documents by metadata before computing cosine similarity avoids spending matrix multiplication on rows you would discard anyway. It does not replace a `min_score` threshold: the filter narrows the candidates, and the threshold then drops weak matches among those that remain.

Why normalize embeddings?

When every embedding has an L2 norm of exactly 1.0, the denominator of the standard cosine similarity formula equals 1, so the division step disappears. The similarity score reduces to the dot product.

Building Context-Aware Search in Python with LLM Embeddings and Metadata

May 23, 2026 • guides


AI Mastery Architect, Lead Systems Engineer

RAG, CUDA, LLM Ops, Agentic Systems

Introduction

Keyword search fails when users phrase a query differently from the terminology stored in the database. An engineer searching for "login keeps failing" should still find a pivotal incident titled "OAuth2 token refresh race condition." Semantic search closes this gap.

Semantic search converts text into dense numerical representations called vector embeddings. Because similar concepts map to vectors that lie close together, an exact lexical match is no longer required. Meaning alone is not enough, though: restricting results by team, date, or priority requires metadata filtering on top of the vector comparison.

This guide shows how to build a context-aware search pipeline: generating embeddings locally, building a metadata-aware index, ranking by cosine similarity, and persisting the index to disk. Check out our Local RAG Configurator to test out embedding permutations.

What You Will Build

You will build a search engine over a set of technical support tickets that:

Computes 384-dimensional embeddings entirely locally using an open-source model.
Pre-filters tickets on metadata attributes before similarity scoring.
Ranks the pre-filtered tickets by cosine similarity.
Saves the index to local files so embeddings are not recomputed on every restart.

Prerequisites: Python 3.8+ and familiarity with dictionaries and NumPy.

First, make sure dependencies are available:

pip install sentence-transformers numpy

Understanding How Semantic Search Works

Under the hood, a sentence embedding model converts text into a fixed-length vector of floating-point numbers. The model places semantically related phrases close together in this high-dimensional space.

To measure how closely two vectors align, we use cosine similarity, which is the cosine of the angle between them:

$$ \text{Cosine Similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} $$

When both vectors are normalized to a length of exactly 1.0, the denominator equals 1 and the formula reduces to a plain dot product: $\mathbf{A} \cdot \mathbf{B}$. Scores range from -1 (opposite directions) to 1 (identical direction). As a rough guide for typical retrieval use cases, unrelated pairs tend to score around 0.1–0.25 and useful matches around 0.6 or higher, though these ranges vary with the model and the data.
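The reduction to a dot product can be checked numerically. The sketch below uses two small hand-picked vectors rather than real embeddings; the function name `cosine_similarity` and the variable names are illustrative only.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Full formula: dot product divided by the product of the L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Two arbitrary illustrative vectors (not real embeddings).
a = np.array([3.0, 4.0, 0.0])
b = np.array([4.0, 3.0, 0.0])

# Scale each vector to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

full = cosine_similarity(a, b)             # with the division
shortcut = float(np.dot(a_unit, b_unit))   # plain dot product on unit vectors

print(f"full: {full:.4f}  shortcut: {shortcut:.4f}")
```

Both values agree up to floating-point rounding, which is why normalizing once at encoding time is enough.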

Metadata covers what embeddings cannot. An embedding captures what a text means, but it does not represent attributes such as the author or the issue priority. Combining the two handles queries that depend on both meaning and facts about the document.

Setting Up the Dataset

Let us build a set of support tickets from the infrastructure, backend, and front-end teams, each carrying a team, status, priority, and creation date.

from datetime import date
 
tickets = [
    {"id": "T-101", "team": "infrastructure", "status": "open",     "priority": "high",
     "created": date(2025, 11, 3),
     "text": "Kubernetes pod keeps crashing with OOMKilled — memory limits on the ML inference container are set too low for the model it loads at runtime."},
    {"id": "T-102", "team": "infrastructure", "status": "open",     "priority": "high",
     "created": date(2025, 11, 8),
     "text": "Nginx ingress returning 502 after rotating TLS certificate. Chain is valid per openssl verify but the backend handshake fails immediately."},
    {"id": "T-103", "team": "infrastructure", "status": "resolved", "priority": "medium",
     "created": date(2025, 10, 14),
     "text": "Terraform state file locked in S3 — a team member force-applied a plan without releasing the DynamoDB lock first."},
    # ...
    {"id": "T-401", "team": "infrastructure", "status": "open",     "priority": "medium",
     "created": date(2025, 11, 11),
     "text": "CI pipeline fails on ARM64 runners — base Docker image has no ARM variant, exec format error at build stage."},
    {"id": "T-402", "team": "infrastructure", "status": "resolved", "priority": "high",
     "created": date(2025, 10, 9),
     "text": "VPN gateway latency spikes at peak hours — BGP route flapping between two peers causing intermittent packet loss across the private subnet."},
]

As a quick sanity check, count the tickets by status:

open_ct     = sum(1 for t in tickets if t["status"] == "open")
resolved_ct = sum(1 for t in tickets if t["status"] == "resolved")
print(f"{len(tickets)} tickets | {open_ct} open | {resolved_ct} resolved")

For the full dataset, this prints:

20 tickets | 14 open | 6 resolved

Step 1: Generating Embeddings

We'll use all-MiniLM-L6-v2, a compact sentence embedding model that maps each text to a 384-dimensional vector. It runs locally with no external API calls: after the first download from Hugging Face, it works entirely offline.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
texts      = [t["text"] for t in tickets]
embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)

print(f"Shape: {embeddings.shape}  |  norm[0]: {np.linalg.norm(embeddings[0]):.4f}")

Setting normalize_embeddings=True scales every embedding to an L2 norm of 1.0. With all vectors on the unit hypersphere, query-to-document similarity becomes a single matrix multiplication with no per-pair division, provided the query is encoded the same way.
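To make the matrix-multiplication step concrete, here is a minimal sketch that ranks documents against one query. It uses random stand-in vectors instead of calling `model.encode`, so the names `doc_vecs` and `query_vec` and the noise scale are assumptions for illustration, not part of the guide's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for model.encode(...): 6 random "documents" with 384 dimensions.
doc_vecs = rng.normal(size=(6, 384))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # L2-normalize rows

# Stand-in query: a slightly perturbed copy of document 2, so it should rank first.
query_vec = doc_vecs[2] + 0.01 * rng.normal(size=384)
query_vec /= np.linalg.norm(query_vec)

# One matrix-vector product yields every cosine score at once.
scores = doc_vecs @ query_vec              # (6, 384) @ (384,) -> (6,)

top_k = 3
ranked = np.argsort(scores)[::-1][:top_k]  # indices of the highest scores first

for i in ranked:
    print(f"doc {i}: {scores[i]:.3f}")
```

With real embeddings the only change would be where `doc_vecs` and `query_vec` come from; the scoring and ranking lines stay the same.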

Step 2: Building the Index

The index ties the embedding matrix to the ticket metadata. Rather than scoring every ticket and filtering afterwards, it applies the metadata filters first and scores only the tickets that pass, which reduces the work per query.

class ContextAwareIndex:
    def __init__(self, embeddings: np.ndarray, documents: list):
        self.embeddings = embeddings   # (N, D), L2-normalized
        self.documents  = documents

    def search(
        self,
        query: str,
        top_k: int       = 5,
        team: str        = None,
        status: str      = None,
        # ... logic truncated for brevity
    ):
        pass

Filtering before scoring avoids computing similarity for tickets that the metadata constraints would exclude anyway.
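Since the class above is truncated, the following is one possible way to fill in a pre-filtering search, offered as a sketch rather than the guide's actual implementation. The class name `FilteredIndex`, the injected `embed_query` callable, the strict "created before" comparison, and the `min_score` default are all assumptions; the `priority` and `before` parameters are inferred from the queries in Step 3.

```python
from datetime import date
from typing import Callable, List, Optional, Tuple

import numpy as np


class FilteredIndex:
    """Hypothetical pre-filtering index; not the guide's truncated class."""

    def __init__(self, embeddings: np.ndarray, documents: List[dict],
                 embed_query: Callable[[str], np.ndarray]):
        self.embeddings = embeddings    # (N, D), rows assumed L2-normalized
        self.documents = documents
        self.embed_query = embed_query  # assumed to return a unit-length (D,) vector

    def search(self, query: str, top_k: int = 5,
               team: Optional[str] = None, status: Optional[str] = None,
               priority: Optional[str] = None, before: Optional[date] = None,
               min_score: float = 0.0) -> List[Tuple[dict, float]]:
        # 1. Metadata pre-filter: keep rows that match every filter that was given.
        keep = [
            i for i, doc in enumerate(self.documents)
            if (team is None or doc["team"] == team)
            and (status is None or doc["status"] == status)
            and (priority is None or doc["priority"] == priority)
            and (before is None or doc["created"] < before)
        ]
        if not keep:
            return []

        # 2. Score only the surviving rows; dot product equals cosine on unit vectors.
        scores = self.embeddings[keep] @ self.embed_query(query)

        # 3. Rank descending, truncate to top_k, and apply the score floor.
        order = np.argsort(scores)[::-1][:top_k]
        return [(self.documents[keep[j]], float(scores[j]))
                for j in order if scores[j] >= min_score]
```

With sentence-transformers, `embed_query` could be something like `lambda q: model.encode(q, normalize_embeddings=True)`, and an instance built as `index = FilteredIndex(embeddings, tickets, embed_query)` would accept the calls shown in Step 3. Injecting the encoder keeps the class testable without loading a model.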

Step 3: Running Queries

With no filters set, the search ranks every ticket by semantic similarity alone:

results = index.search("authentication token expiry and session management", top_k=4)

Combining semantic similarity with metadata constraints gives analysts what they need during triage. Here is the same query restricted to open tickets created before November 10, 2025:

results = index.search(
    "authentication token expiry and session management",
    top_k=4,
    status="open",
    before=date(2025, 11, 10),
)

SRE work often cuts across team boundaries. Diagnosing resource starvation, for instance, means viewing infrastructure issues alongside backend memory warnings. Leaving the team filter unset searches every team at once:

results = index.search(
    "resource exhaustion and memory pressure under load",
    top_k=2,
    status="open",
    priority="high",
)

Step 4: Persisting the Index

Recomputing every embedding each time the application starts is a needless startup cost. Saving the embedding matrix and the metadata to disk lets the index reload without re-encoding.

import json

# Write the embedding matrix and ticket metadata to disk
np.save("ticket_embeddings.npy", embeddings)

with open("ticket_metadata.json", "w") as f:
    json.dump(
        [{**t, "created": t["created"].isoformat()} for t in tickets],
        f, indent=2,
    )

The binary .npy file stores the embedding matrix, and the JSON file stores the ticket metadata, with dates serialized as ISO 8601 strings that must be parsed back into date objects on load. On restart, the index reloads from these two files without re-encoding any tickets. The model is still needed to embed each incoming query.
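The save step has a matching load step. The sketch below shows a full round trip under the same file names used above; the helper names `save_index` and `load_index`, the row-count check, and the stand-in ticket are assumptions added for illustration.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

import numpy as np


def save_index(folder: Path, embeddings: np.ndarray, tickets: list) -> None:
    """Write vectors as .npy and metadata as JSON with ISO-formatted dates."""
    np.save(folder / "ticket_embeddings.npy", embeddings)
    with open(folder / "ticket_metadata.json", "w") as f:
        json.dump([{**t, "created": t["created"].isoformat()} for t in tickets], f)


def load_index(folder: Path):
    """Read both files back and turn the ISO strings into date objects again."""
    embeddings = np.load(folder / "ticket_embeddings.npy")
    with open(folder / "ticket_metadata.json") as f:
        tickets = [{**t, "created": date.fromisoformat(t["created"])}
                   for t in json.load(f)]
    # Row i of the matrix must describe ticket i, so the counts have to agree.
    if len(tickets) != embeddings.shape[0]:
        raise ValueError("embedding rows and metadata entries are out of sync")
    return embeddings, tickets


# Round trip with stand-in data (random vectors, one made-up ticket).
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    demo_vecs = np.random.default_rng(0).normal(size=(1, 384)).astype(np.float32)
    demo_tickets = [{"id": "T-000", "status": "open", "created": date(2025, 11, 3)}]
    save_index(folder, demo_vecs, demo_tickets)
    loaded_vecs, loaded_tickets = load_index(folder)

print(loaded_vecs.shape, loaded_tickets[0]["created"])
```

Checking that the two files agree in length guards against pairing a stale embedding matrix with newer metadata, since the index relies on row order alone to link them.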

Summary

Combining dense vector search with metadata filters yields retrieval that respects both what a document means and which attributes it carries. Check out our LLM inference models page to find open-source LLMs that can further refine your retrieval outputs.


