Building a Production RAG Pipeline with Bedrock and OpenSearch Serverless
The first RAG pipeline I built in anger was a Saturday afternoon affair: a LangChain notebook, a FAISS index sitting on local disk, and an embedding loop. But as soon as that demo hits the real world, the questions change. How do you handle 10,000 documents? How do you refresh the index without rebuilding from scratch? Who owns the IAM policies? And finally, what is the cost floor?
Amazon Bedrock Knowledge Bases is the enterprise answer to these questions. It takes the "small distributed system" of RAG—the chunking, the embedding pipeline, the vector store provisioning, and the sync logic—and folds them into a managed service.

The Vector Backend Decision Matrix
The default vector backend for Bedrock is OpenSearch Serverless (OSS). It is a fine default, but it is also the most expensive, and understanding the OCU floor matters before you sign your team up for the bill.
| Vector Backend | Cost Floor | Latency | Best For |
|---|---|---|---|
| OpenSearch Serverless | ~$345/mo (2 OCU min) | Sub-100ms | High traffic, Hybrid Search, standard AWS RAG. |
| S3 Vectors | Pay-per-request | 100ms - 1s | Spiky traffic, indices up to 2 billion vectors. |
| Aurora PostgreSQL | Instance price | Variable | Small datasets, SQL-familiar access patterns. |
| Pinecone / MongoDB | SaaS pricing | Variable | Existing platform investment outside of AWS. |
[!CAUTION] Cost Floor Warning: OpenSearch Serverless (OSS) requires a minimum of 2 OCUs (1 for indexing, 1 for search). This means your cost floor starts at ~$345/month regardless of your usage. For smaller workloads, consider S3 Vectors or Aurora to avoid this fixed overhead.
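
The ~$345 figure is simple arithmetic. A quick sketch, assuming an on-demand rate of $0.24 per OCU-hour and a 720-hour month (both are assumptions, not figures from this article; confirm the current rate for your region):

```python
# Assumed values: confirm the OCU-hour price for your region before budgeting.
OCU_HOURLY_USD = 0.24   # assumption: on-demand price per OCU-hour
HOURS_PER_MONTH = 720   # assumption: 30-day month


def monthly_floor(min_ocus: int, hourly_usd: float = OCU_HOURLY_USD) -> float:
    """Fixed monthly cost of keeping `min_ocus` OCUs running around the clock."""
    return round(min_ocus * hourly_usd * HOURS_PER_MONTH, 2)


floor = monthly_floor(2)  # 1 indexing OCU + 1 search OCU
print(f"Cost floor: ${floor:.2f}/month")
```

The bill is the same whether the collection serves one query a day or one a second, which is why the floor matters more for small workloads than for busy ones.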
Security & IAM: The Tripartite Trust Model
The mental model has three parts. A trust policy lets Bedrock assume the service role, a permissions policy lets that role do the work, and a Data Access Policy on the OSS collection explicitly grants the role permission to touch the collection, because IAM alone is not sufficient.
[!IMPORTANT] Data Access Policy vs. IAM: You must configure an OpenSearch Serverless Data Access Policy in addition to IAM. Without this, Bedrock will return 403 Forbidden even if its IAM role has aoss:APIAccessAll.
1. Trust Policy
Lets Bedrock assume the role.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
2. Permissions Policy
Lets the role read from S3, invoke the embedding model, and write to OSS.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-kb-docs", "arn:aws:s3:::my-kb-docs/*"]
    },
    {
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "arn:aws:bedrock:*::foundation-model/amazon.titan-embed-text-v2:0"
    },
    {
      "Effect": "Allow",
      "Action": "aoss:APIAccessAll",
      "Resource": "arn:aws:aoss:*:*:collection/*"
    }
  ]
}
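
The third leg, the Data Access Policy, lives on the OpenSearch Serverless side rather than in IAM. Below is a minimal sketch of what such a policy might look like for the service role above. The collection name `company-docs` and the policy name `bedrock-kb-access` are hypothetical, and the permission list is an assumption about what a Knowledge Base needs; trim it to fit your own setup.

```python
import json


def build_data_access_policy(collection_name: str, role_arn: str) -> str:
    """Build an OSS data access policy document granting `role_arn` access."""
    policy = [
        {
            "Description": "Bedrock Knowledge Base access",
            "Rules": [
                {
                    "ResourceType": "collection",
                    "Resource": [f"collection/{collection_name}"],
                    "Permission": [
                        "aoss:DescribeCollectionItems",
                        "aoss:CreateCollectionItems",
                        "aoss:UpdateCollectionItems",
                    ],
                },
                {
                    "ResourceType": "index",
                    "Resource": [f"index/{collection_name}/*"],
                    "Permission": [
                        "aoss:CreateIndex",
                        "aoss:DescribeIndex",
                        "aoss:UpdateIndex",
                        "aoss:ReadDocument",
                        "aoss:WriteDocument",
                    ],
                },
            ],
            "Principal": [role_arn],
        }
    ]
    return json.dumps(policy)


policy_json = build_data_access_policy(
    "company-docs",  # hypothetical collection name
    "arn:aws:iam::123456789012:role/BedrockKBRole",
)

# Applying it requires AWS credentials, so the call is left commented out:
# import boto3
# aoss = boto3.client("opensearchserverless", region_name="us-east-1")
# aoss.create_access_policy(name="bedrock-kb-access", type="data", policy=policy_json)
```

Note that the policy addresses the collection by name, not by the ID that appears in the collection ARN.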
Advanced Chunking Strategies
The two knobs that meaningfully affect retrieval quality are the chunking strategy and the embedding model. The chunking choice in particular is more consequential than the documentation suggests.
- Fixed-Size: 300-token slices. Predictable, but splits tables and code blocks.
- Hierarchical: Retrieves on small child chunks but returns the 1500-token parent to the model. Best for technical docs.
- Semantic: Uses an embedding model to detect topic shifts. Highest quality for narrative content, but slowest to compute.
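
To make the hierarchical option concrete, here is a toy sketch of the parent/child relationship. It is an illustration only, not Bedrock's implementation: it treats list items as tokens, ignores overlap, and the function name `hierarchical_chunks` is made up for this example.

```python
def hierarchical_chunks(tokens, parent_size=1500, child_size=300):
    """Split tokens into parents, then split each parent into children.

    Each child records the index of its parent, so a match on a small
    child can be swapped for the larger parent before prompting the model.
    """
    parents = [tokens[i:i + parent_size] for i in range(0, len(tokens), parent_size)]
    children = []
    for parent_idx, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append({"parent": parent_idx, "tokens": parent[j:j + child_size]})
    return parents, children


tokens = [f"tok{i}" for i in range(3000)]  # stand-in for a tokenized document
parents, children = hierarchical_chunks(tokens)

# Retrieval matches a small child; the model receives the child's parent.
hit = children[7]
context_for_model = parents[hit["parent"]]
```

The small child keeps the embedding focused on one idea, while the parent gives the model enough surrounding text to answer without guessing.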
Implementation: The Boto3 SDK Path
Here is the exact code path to stand up a Knowledge Base with Hierarchical Chunking.
import time

import boto3

bedrock_agent = boto3.client('bedrock-agent', region_name='us-east-1')

# Step 1: Create the Knowledge Base
kb = bedrock_agent.create_knowledge_base(
    name='company-docs-kb',
    description='Internal policy and engineering docs',
    roleArn='arn:aws:iam::123456789012:role/BedrockKBRole',
    knowledgeBaseConfiguration={
        'type': 'VECTOR',
        'vectorKnowledgeBaseConfiguration': {
            'embeddingModelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0'
        }
    },
    storageConfiguration={
        'type': 'OPENSEARCH_SERVERLESS',
        'opensearchServerlessConfiguration': {
            'collectionArn': 'arn:aws:aoss:us-east-1:123456789012:collection/abc123',
            'vectorIndexName': 'company-docs-index',
            'fieldMapping': {
                'vectorField': 'embedding',
                'textField': 'text',
                'metadataField': 'metadata'
            }
        }
    }
)
kb_id = kb['knowledgeBase']['knowledgeBaseId']

# Step 2: Attach S3 Data Source with Hierarchical Chunking
ds = bedrock_agent.create_data_source(
    knowledgeBaseId=kb_id,
    name='company-docs-s3',
    dataSourceConfiguration={
        'type': 'S3',
        's3Configuration': {
            'bucketArn': 'arn:aws:s3:::my-kb-docs'
        }
    },
    vectorIngestionConfiguration={
        'chunkingConfiguration': {
            'chunkingStrategy': 'HIERARCHICAL',
            'hierarchicalChunkingConfiguration': {
                # Parent level first, then child level
                'levelConfigurations': [
                    {'maxTokens': 1500},
                    {'maxTokens': 300}
                ],
                'overlapTokens': 60
            }
        }
    }
)
ds_id = ds['dataSource']['dataSourceId']

# Step 3: Kick off the first ingestion job
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId=kb_id,
    dataSourceId=ds_id
)
job_id = job['ingestionJob']['ingestionJobId']

# Step 4: (Expert Path) Poll for completion and check statistics
while True:
    status = bedrock_agent.get_ingestion_job(
        knowledgeBaseId=kb_id,
        dataSourceId=ds_id,
        ingestionJobId=job_id
    )['ingestionJob']
    print(f"Status: {status['status']}")
    if status['status'] in ['COMPLETE', 'FAILED', 'STOPPED']:
        stats = status['statistics']
        # "Scanned" counts every document seen, not only those indexed
        print(f"Scanned: {stats['numberOfDocumentsScanned']}")
        print(f"Newly indexed: {stats['numberOfNewDocumentsIndexed']}")
        print(f"Failed: {stats['numberOfDocumentsFailed']}")
        break
    time.sleep(30)
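
Printing the statistics is a start, but a pipeline should fail loudly when documents are dropped. Here is a small sketch of a check you might run on the final job payload. The helper name `ingestion_problems` and the sample payload are hypothetical; the field names follow the `ingestionJob` dict polled above.

```python
def ingestion_problems(job: dict) -> list:
    """Return human-readable problems found in an ingestion job description."""
    problems = []
    if job.get("status") != "COMPLETE":
        problems.append(f"job ended with status {job.get('status')}")
    failed = job.get("statistics", {}).get("numberOfDocumentsFailed", 0)
    if failed:
        problems.append(f"{failed} document(s) failed to ingest")
    # failureReasons is a list of strings when the service reports any
    problems.extend(job.get("failureReasons", []))
    return problems


# Hypothetical payload: the job "completed", yet three documents were dropped.
sample_job = {
    "status": "COMPLETE",
    "statistics": {"numberOfDocumentsScanned": 120, "numberOfDocumentsFailed": 3},
    "failureReasons": [],
}
problems = ingestion_problems(sample_job)
if problems:
    print("Ingestion needs attention:", "; ".join(problems))
```

A `COMPLETE` status with a non-zero failure count is the case worth alerting on, since nothing else in the job outcome signals the gap.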
Querying the Knowledge Base
Using the retrieve_and_generate API to get grounded answers with citations.
import boto3

runtime = boto3.client('bedrock-agent-runtime', region_name='us-east-1')

response = runtime.retrieve_and_generate(
    input={'text': 'What is our policy on remote work for engineering?'},
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': kb_id,
            'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0',
            'retrievalConfiguration': {
                'vectorSearchConfiguration': {
                    'numberOfResults': 5,
                    'overrideSearchType': 'HYBRID'  # Essential for keyword+vector
                }
            }
        }
    }
)
print(response['output']['text'])
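
The answer text is only half of the response; the citations are what make it auditable. Below is a sketch of pulling the source documents out of a response. The helper name `cited_sources` and the sample response are hypothetical, and the sketch assumes an S3-backed data source, so references with other location types are skipped.

```python
def cited_sources(response: dict) -> list:
    """Collect the distinct S3 URIs referenced by a retrieve_and_generate response."""
    uris = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in uris:
                uris.append(uri)
    return uris


# Hypothetical response, trimmed to the fields this helper reads.
sample_response = {
    "output": {"text": "Engineering may work remotely up to three days a week."},
    "citations": [
        {"retrievedReferences": [
            {"location": {"s3Location": {"uri": "s3://my-kb-docs/hr/remote-work.pdf"}}},
            {"location": {"s3Location": {"uri": "s3://my-kb-docs/eng/handbook.md"}}},
        ]},
        {"retrievedReferences": [
            {"location": {"s3Location": {"uri": "s3://my-kb-docs/hr/remote-work.pdf"}}},
        ]},
    ],
}
sources = cited_sources(sample_response)
```

An answer that comes back with an empty source list is worth treating as ungrounded and handling separately in the application.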
Production Checklist
[!IMPORTANT] Deployment Readiness
- Sync Failures: Always monitor the statistics block in get_ingestion_job. Corrupted PDFs will fail silently, leaving gaps in your index.
- Metadata Filtering: Use .metadata.json sidecar files in S3. This is mandatory for multi-audience KBs to prevent "vibes-based" disambiguation.
- Model Migrations: You cannot swap embedding models in an existing KB. You must create a new KB, re-ingest, and cut over at the application layer.
- Cost Monitoring: A single misconfigured retry loop can burn $1,000 in an afternoon. Use Budget Alarms.
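
For the metadata filtering item, here is a minimal sketch of the two halves: the sidecar file that sits next to each document, and the filter applied at query time. The `audience` attribute and the helper names are hypothetical, chosen only for illustration.

```python
import json


def sidecar_key(document_key: str) -> str:
    """S3 key of the sidecar for a document: same key plus '.metadata.json'."""
    return f"{document_key}.metadata.json"


def sidecar_body(attributes: dict) -> str:
    """JSON body of the sidecar file, wrapping the attributes for ingestion."""
    return json.dumps({"metadataAttributes": attributes})


def audience_filter(audience: str) -> dict:
    """Retrieval filter that keeps only chunks tagged for one audience."""
    return {"equals": {"key": "audience", "value": audience}}


key = sidecar_key("hr/remote-work.pdf")
body = sidecar_body({"audience": "engineering", "year": 2024})

# Passed as the 'retrievalConfiguration' of a query (the filter is the new part):
retrieval_config = {
    "vectorSearchConfiguration": {
        "numberOfResults": 5,
        "filter": audience_filter("engineering"),
    }
}
```

Upload the sidecar to the same bucket and prefix as its document and re-sync. The filter then restricts retrieval before the model sees anything.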
The 2026 Roadmap
The focus is shifting to the edges. S3 Vectors changed the economics for large RAG deployments overnight. AgentCore is increasingly the choice for systems that need to take actions, while Bedrock Data Automation has become the best way to parse complex PDFs with tables and figures.
For multi-modal workloads, Amazon Nova Multimodal Embeddings V1 (3072 dimensions) is the new standard, enabling RAG over product catalogs and manuals where diagrams matter as much as text.
This architecture is unglamorous and well-documented—the only kind that survives the shift from demo to system.