How to Build a Custom RAG System with Pinecone and Next.js

April 12, 2026 · Guides

Why You Need This Stack

Building a Retrieval-Augmented Generation (RAG) system lets your AI chatbot ground its answers in your proprietary company documents and cite them, dramatically reducing hallucinations. A proven tech stack for building this in 2026 rests on three pillars:

  1. Next.js (App Router): The scalable full-stack framework that serves both the UI and the API routes.
  2. OpenAI text-embedding-3-small: The embedding model that converts your text into vectors.
  3. Pinecone: The vector database that stores and searches those vectors at low latency.

In this guide, we will walk through the core logic of building this exact architecture.

Step 1: Document Processing & Embeddings

You cannot feed a 100-page PDF directly into an LLM effectively. You must break it down.

First, you parse the document into small, overlapping chunks (e.g., roughly 500 words per chunk, with some overlap so no idea is cut off mid-thought). Once you have an array of chunks, you run each one through an embedding model, which encodes the chunk's "meaning" as an array of floating-point numbers (a vector).
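
Chunking can be as simple as sliding a fixed-size window of words across the document. Here is a minimal sketch; the chunkText helper and its 50-word overlap are illustrative choices rather than a specific library's API, and production pipelines usually chunk by tokens instead of words.

function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = Math.max(1, chunkSize - overlap); // guard against overlap >= chunkSize
  const chunks: string[] = [];

  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
  }

  return chunks;
}

With an array of chunks in hand, each one goes through the embedding model: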

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });

  return response.data[0].embedding; // This is your [0.123, -0.456, ...] vector
}
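
The embeddings endpoint also accepts an array of strings, so you can embed a whole batch of chunks in a single request (within the API's per-request limits) and get the vectors back in the same order. A small sketch, reusing the openai client above; the generateEmbeddings name is just illustrative.

async function generateEmbeddings(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts, // one embedding comes back per input string, in order
  });

  return response.data.map((item) => item.embedding);
}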

Step 2: Pushing to Pinecone

Once you have a vector for each chunk, you need to store it in a database purpose-built for similarity search over vectors: Pinecone.

import { Pinecone } from '@pinecone-database/pinecone';

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pc.index('my-company-data');

// Example: embed one chunk, then push it to the database
const chunk = "The company vacation policy allows 15 days of PTO.";
const generatedVector = await generateEmbedding(chunk);

await index.upsert([{
  id: "doc_1_chunk_1",
  values: generatedVector,
  metadata: {
    text: chunk,
    source: "employee_handbook.pdf"
  }
}]);
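
This assumes the my-company-data index already exists and that its dimension matches the embedding model (text-embedding-3-small returns 1536-dimensional vectors by default). If you need to create the index in code, a sketch with a recent version of the Pinecone SDK and a serverless index might look like the following; the cloud and region values are placeholders.

// One-time setup, reusing the pc client above.
// The index dimension must match the embedding model's output size.
await pc.createIndex({
  name: 'my-company-data',
  dimension: 1536,
  metric: 'cosine',
  spec: {
    serverless: { cloud: 'aws', region: 'us-east-1' },
  },
});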

Step 3: Handling the User Request

When a user visits your Next.js application and asks a question, your backend API route must perform the crucial RAG step: semantic search.

Instead of sending the user's question straight to the LLM, you first convert the question into an embedding vector and ask Pinecone to find the closest matching vectors in the database.

// Inside your Next.js App Router API route (e.g., app/api/chat/route.ts)

// 1. Embed the user's question (hard-coded here for illustration)
const userQuestion = "How much PTO do I get?";
const questionVector = await generateEmbedding(userQuestion);

// 2. Search Pinecone for the 3 most relevant chunks
const searchResults = await index.query({
  vector: questionVector,
  topK: 3,
  includeMetadata: true
});

// 3. Extract the raw text from the Pinecone results
const contextText = searchResults.matches
  .map(match => match.metadata?.text)
  .filter(Boolean)
  .join('\n\n');

Step 4: Structuring the Final Prompt

Now that you have the most relevant chunks from your company database (contextText), you inject them into the system prompt and send the combined prompt to the LLM for the final generation.

const systemPrompt = `
You are a helpful company assistant. 
Answer the user's question using ONLY the context provided below. 
If the context does not contain the answer, reply "I do not have that information."

CONTEXT:
${contextText}
`;

const chatCompletion = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: userQuestion }
  ]
});

return Response.json({ answer: chatCompletion.choices[0].message.content });
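
Putting the whole request path together, a complete route handler might look like the sketch below. The { question } request body shape is an assumption; adapt it to whatever your frontend actually sends.

// app/api/chat/route.ts
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pc.index("my-company-data");

export async function POST(req: Request) {
  const { question } = await req.json();

  // 1. Embed the user's question
  const embeddingResponse = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question,
  });
  const questionVector = embeddingResponse.data[0].embedding;

  // 2. Retrieve the most relevant chunks from Pinecone
  const searchResults = await index.query({
    vector: questionVector,
    topK: 3,
    includeMetadata: true,
  });
  const contextText = searchResults.matches
    .map((match) => match.metadata?.text)
    .filter(Boolean)
    .join("\n\n");

  // 3. Generate an answer constrained to the retrieved context
  const systemPrompt = `
You are a helpful company assistant.
Answer the user's question using ONLY the context provided below.
If the context does not contain the answer, reply "I do not have that information."

CONTEXT:
${contextText}
`;

  const chatCompletion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: question },
    ],
  });

  return Response.json({ answer: chatCompletion.choices[0].message.content });
}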

Conclusion

By following this architecture, the LLM rarely has room to guess: it effectively performs a reading-comprehension task over the exact paragraphs Pinecone retrieved. This Next.js -> Embeddings -> Pinecone -> LLM loop is a fundamental pattern in modern AI engineering.