Building Context-Aware Search in Python with LLM Embeddings and Metadata
Introduction
Keyword search fails whenever users phrase a query differently from the terminology stored in the database. An engineer searching for "login keeps failing" should still find a relevant incident titled "OAuth2 token refresh race condition." Semantic search addresses this mismatch.
Semantic search converts text into dense numerical vectors called embeddings. Because texts with similar meaning map to vectors that lie close together, the wording no longer has to match exactly. Meaning alone is not enough, though: real queries also need hard constraints, such as filtering by team, date, or severity.
This guide shows how to build a context-aware search pipeline: generating embeddings locally, building a metadata-aware index, ranking by cosine similarity, and persisting the index to disk. Check out our Local RAG Configurator to test out embedding permutations.
What You Will Build
You will assemble a search engine over a set of technical support tickets that does the following:
- Computes 384-dimensional embeddings entirely locally using an open-source model.
- Pre-filters tickets on metadata attributes before similarity scoring.
- Ranks the pre-filtered tickets by cosine similarity.
- Saves the embeddings and metadata to local files so they are not recomputed on every restart.
Prerequisites: Python 3.8+ and familiarity with dictionaries and NumPy.
First, install the dependencies:
pip install sentence-transformers numpy
Understanding How Semantic Search Works
Behind the scenes, a sentence embedding model converts a piece of text into a fixed-length vector of floating-point numbers. The model is trained so that texts with related meanings land in the same region of this high-dimensional space.
To measure how closely two vectors align, we use cosine similarity, which is the cosine of the angle between them:
$$ \text{Cosine Similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} $$
When both vectors are normalized to a length of exactly 1.0, the denominator equals 1 and the formula reduces to a plain dot product: $\mathbf{A} \cdot \mathbf{B}$. The score ranges from -1 (opposite direction) to 1 (same direction). As a rough guide for typical retrieval use cases, unrelated pairs tend to score around 0.1–0.25 and strong matches around 0.6 or greater, though the exact ranges depend on the model and the data.
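The reduction to a dot product can be checked numerically. The following is a minimal sketch using toy 3-dimensional vectors (illustrative values, not real embeddings); the helper name `cosine_similarity` is introduced here only for the example.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Full formula: dot product divided by the product of the L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors chosen for easy arithmetic (both have L2 norm 3)
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 1.0, 2.0])

# Scale each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

full = cosine_similarity(a, b)             # uses the division
shortcut = float(np.dot(a_unit, b_unit))   # dot product only

print(f"full formula: {full:.4f} | dot of unit vectors: {shortcut:.4f}")
```

Both values come out to 8/9, since the dot product is 8 and each norm is 3.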
Metadata constraints bridge the gap between semantic similarity and factual attributes. Embeddings capture meaning, but they do not reliably represent properties such as who authored a ticket or what priority it carries. Combining the two handles queries that neither can answer alone.
Setting Up the Dataset
Let us construct a set of support tickets from the infrastructure, backend, and frontend teams. Each ticket carries metadata: team, status, priority, and creation date.
from datetime import date
tickets = [
{"id": "T-101", "team": "infrastructure", "status": "open", "priority": "high",
"created": date(2025, 11, 3),
"text": "Kubernetes pod keeps crashing with OOMKilled — memory limits on the ML inference container are set too low for the model it loads at runtime."},
{"id": "T-102", "team": "infrastructure", "status": "open", "priority": "high",
"created": date(2025, 11, 8),
"text": "Nginx ingress returning 502 after rotating TLS certificate. Chain is valid per openssl verify but the backend handshake fails immediately."},
{"id": "T-103", "team": "infrastructure", "status": "resolved", "priority": "medium",
"created": date(2025, 10, 14),
"text": "Terraform state file locked in S3 — a team member force-applied a plan without releasing the DynamoDB lock first."},
# ... (remaining tickets omitted here; the full dataset has 20)
{"id": "T-401", "team": "infrastructure", "status": "open", "priority": "medium",
"created": date(2025, 11, 11),
"text": "CI pipeline fails on ARM64 runners — base Docker image has no ARM variant, exec format error at build stage."},
{"id": "T-402", "team": "infrastructure", "status": "resolved", "priority": "high",
"created": date(2025, 10, 9),
"text": "VPN gateway latency spikes at peak hours — BGP route flapping between two peers causing intermittent packet loss across the private subnet."},
]
A quick count by status confirms the dataset loaded as expected:
open_ct = sum(1 for t in tickets if t["status"] == "open")
resolved_ct = sum(1 for t in tickets if t["status"] == "resolved")
print(f"{len(tickets)} tickets | {open_ct} open | {resolved_ct} resolved")
With the full 20-ticket dataset, this prints:
20 tickets | 14 open | 6 resolved
Step 1: Generating Embeddings
We'll use all-MiniLM-L6-v2, a compact sentence embedding model that maps each sentence to a 384-dimensional vector. It runs locally and needs no external API. After the initial download from Hugging Face, it works entirely offline.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [t["text"] for t in tickets]
embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)
print(f"Shape: {embeddings.shape} | norm[0]: {np.linalg.norm(embeddings[0]):.4f}")
Setting normalize_embeddings=True scales every embedding to an L2 norm of 1.0. With all vectors on the unit hypersphere, cosine similarity against a query reduces to a single matrix-vector product, with no per-pair division by the norms.
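To illustrate the matrix-vector scoring, here is a small sketch that uses random unit vectors in place of real ticket embeddings, so it runs without downloading the model. The names `doc_matrix` and `query_vec` are stand-ins invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the ticket embeddings: 6 random rows, L2-normalized
doc_matrix = rng.normal(size=(6, 384))
doc_matrix /= np.linalg.norm(doc_matrix, axis=1, keepdims=True)

# Stand-in query: a lightly perturbed copy of row 2, so row 2 should rank first
query_vec = doc_matrix[2] + 0.01 * rng.normal(size=384)
query_vec /= np.linalg.norm(query_vec)

# One matrix-vector product gives the cosine similarity to every row
scores = doc_matrix @ query_vec              # shape (6,)

top_k = 3
top_idx = np.argsort(scores)[::-1][:top_k]   # indices of the highest scores

for i in top_idx:
    print(f"row {i}: {scores[i]:.3f}")
```

With real data, `doc_matrix` would be the `embeddings` array and `query_vec` the normalized embedding of the query text.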
Step 2: Building the Index
The index connects semantic scoring to metadata lookup. Rather than scoring every ticket and filtering afterwards, it applies the metadata filters first and scores only the tickets that pass, which avoids wasted similarity computations.
class ContextAwareIndex:
    def __init__(self, embeddings: np.ndarray, documents: list):
        self.embeddings = embeddings  # (N, D), L2-normalized
        self.documents = documents

    def search(
        self,
        query: str,
        top_k: int = 5,
        team: str = None,
        status: str = None,
        # ... logic truncated for brevity
    ):
        pass
Filtering out ineligible documents before computing similarity avoids scoring tickets that would be discarded anyway.
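Since the search logic above is truncated, the following is one possible sketch of pre-filtering followed by cosine ranking, not the article's actual implementation. The class name `FilteredIndex`, the `encode_fn` argument, and the return format are assumptions made for illustration. The `priority` and `before` filters mirror the keyword arguments used in the queries in this guide, and treating `before` as a strict "created earlier than" comparison is also an assumption. A toy encoder and hand-made 2-dimensional unit vectors keep the example runnable without the model.

```python
from datetime import date
import numpy as np

class FilteredIndex:
    """Hypothetical sketch: filter on metadata first, then rank by dot product."""

    def __init__(self, embeddings, documents, encode_fn):
        self.embeddings = embeddings    # (N, D), assumed L2-normalized
        self.documents = documents
        self.encode_fn = encode_fn      # maps a query string to a unit vector

    def search(self, query, top_k=5, team=None, status=None,
               priority=None, before=None):
        # 1. Keep only the row indices whose metadata passes every active filter
        keep = [
            i for i, d in enumerate(self.documents)
            if (team is None or d["team"] == team)
            and (status is None or d["status"] == status)
            and (priority is None or d["priority"] == priority)
            and (before is None or d["created"] < before)
        ]
        if not keep:
            return []
        # 2. Score only the surviving rows
        scores = self.embeddings[keep] @ self.encode_fn(query)
        # 3. Return the top_k (document, score) pairs, best first
        order = np.argsort(scores)[::-1][:top_k]
        return [(self.documents[keep[j]], float(scores[j])) for j in order]

# Toy data: three documents with hand-made unit vectors
docs = [
    {"id": "A", "team": "backend", "status": "open",
     "priority": "high", "created": date(2025, 11, 1)},
    {"id": "B", "team": "backend", "status": "resolved",
     "priority": "high", "created": date(2025, 11, 2)},
    {"id": "C", "team": "infrastructure", "status": "open",
     "priority": "low", "created": date(2025, 11, 12)},
]
vectors = np.array([[0.6, 0.8], [1.0, 0.0], [0.8, 0.6]])

def toy_encode(query: str) -> np.ndarray:
    # Toy encoder: ignores the text and returns a fixed unit vector
    return np.array([1.0, 0.0])

toy_index = FilteredIndex(vectors, docs, toy_encode)
open_hits = toy_index.search("any text", status="open")
print([(d["id"], round(s, 2)) for d, s in open_hits])
```

Document B has the highest raw similarity, but it is resolved, so the `status="open"` filter removes it before scoring and the result contains C followed by A. In the setting of this guide, `encode_fn` could be `lambda q: model.encode(q, normalize_embeddings=True)`.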
Step 3: Running Queries
With an index built as index = ContextAwareIndex(embeddings, tickets) and no filters set, results are ranked purely by semantic similarity:
results = index.search("authentication token expiry and session management", top_k=4)
Combining semantic similarity with strict metadata filters is what analysts need during triage. The following query is restricted to open tickets created before November 10:
results = index.search(
"authentication token expiry and session management",
top_k=4,
status="open",
before=date(2025, 11, 10),
)
A common situation in SRE work is searching across team boundaries. Diagnosing resource starvation, for instance, means seeing infrastructure issues alongside backend memory warnings, so this query sets no team filter:
results = index.search(
"resource exhaustion and memory pressure under load",
top_k=2,
status="open",
priority="high",
)
Step 4: Persisting the Index
Recomputing every embedding at application startup is a needless bottleneck. Saving the embedding matrix and metadata to disk lets the application reload them instead.
import json

# Write the embedding matrix and ticket metadata to disk
np.save("ticket_embeddings.npy", embeddings)
with open("ticket_metadata.json", "w") as f:
    json.dump(
        [{**t, "created": t["created"].isoformat()} for t in tickets],
        f, indent=2,
    )
The binary .npy file stores the embedding matrix, while the .json file stores the ticket metadata with dates serialized as ISO strings, which need to be parsed back into date objects on load. On restart, the application only reads these two files and skips re-embedding the tickets. The model is still needed to embed each incoming query.
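Loading is the mirror image of saving. The sketch below runs a full save-and-reload round trip on tiny stand-in data written to a temporary directory; the ticket IDs and the 2x2 matrix are placeholders, and the file names match those used above.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

import numpy as np

# Tiny stand-in data (placeholders, not the real tickets)
embeddings = np.eye(2, dtype=np.float32)
tickets = [
    {"id": "T-1", "status": "open", "created": date(2025, 11, 3)},
    {"id": "T-2", "status": "resolved", "created": date(2025, 10, 14)},
]

workdir = Path(tempfile.mkdtemp())
emb_path = workdir / "ticket_embeddings.npy"
meta_path = workdir / "ticket_metadata.json"

# Save: dates become ISO strings because JSON has no date type
np.save(emb_path, embeddings)
with open(meta_path, "w") as f:
    json.dump([{**t, "created": t["created"].isoformat()} for t in tickets], f, indent=2)

# Load: read the matrix back and parse the ISO strings into date objects
loaded_embeddings = np.load(emb_path)
with open(meta_path) as f:
    loaded_tickets = [
        {**t, "created": date.fromisoformat(t["created"])} for t in json.load(f)
    ]

print(loaded_embeddings.shape, loaded_tickets[0]["created"])
```

Row order matters here: the i-th row of the loaded matrix must still correspond to the i-th ticket in the metadata list, so the two files should always be written together.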
Summary
Combining embedding-based similarity with metadata pre-filtering gives you search that understands meaning while honoring hard constraints such as team, status, and date. Check out our LLM inference models page to find open-source LLMs that can further refine your retrieval results.