Retrieval‑Augmented Generation (RAG) lets your product answer with your knowledge - docs, tickets, changelogs, spreadsheets - rather than relying on the model’s latent memory. This article explores how RAG works and how to implement it in a SaaS context with performance, cost, and safety in mind. We’ll build a typed flow you can run in Next.js or Node, with practical guidance for ingestion, retrieval, reranking, context compression, prompting, citations, and evaluation.
If you’re just starting with GPT in web apps, first see our integration walkthrough: Integrate OpenAI into Next.js. Want a full chatbot with React and Express? Jump to Build an AI chatbot with React + Node. For site‑wide metadata and CTR wins, consult our SEO checklist.
TL;DR
- Ingest docs → chunk with metadata → embed → store vectors.
- At query time: query rewrite → hybrid retrieval (semantic + keyword) → rerank → compress → generate with citations.
- Enforce permissions at retrieval time; add evaluations and logs.
- Control costs with token budgets, caching, and small fast models where possible.
Conceptual Architecture
User Query
│
▼
Query Rewrite / Expansion (optional)
│
├────► Keyword Retrieval (BM25)
│
├────► Vector Retrieval (Embeddings)
│
▼
Merge + Rerank (cross-encoder or heuristic)
│
▼
Context Compression (extractive or abstractive)
│
▼
Prompt Construction (instructions + compressed context + schema)
│
▼
LLM Generation → Answer + Citations
For a high‑level intro and additional patterns, compare with our focused primer: RAG production guide.
Data Ingestion (Quality First)
High‑quality ingestion is the single most important lever for RAG quality.
- Normalize sources: HTML, MD, PDF, tickets, changelogs.
- Preserve structure: titles, headings, breadcrumbs, anchors.
- Chunk semantically (200–800 tokens) with stable IDs and small overlaps.
- Keep metadata: source URL, section, product area, version, permissions.
- Deduplicate near‑duplicates; track timestamps for freshness boosts.
// lib/ingest.ts
import { createHash } from "crypto";
export type RawDoc = {
id?: string;
title: string;
url?: string;
content: string;
updatedAt?: string;
access?: string; // e.g., org/team/role tags
};
export type VectorItem = {
id: string;
vector: number[];
metadata: Record<string, string>;
};
export type EmbeddingFn = (inputs: string[]) => Promise<number[][]>;
export type VectorStore = { upsert: (items: VectorItem[]) => Promise<void> };
const CHUNK_SIZE = 800; // words (a rough proxy for tokens; the splitter below is word-based)
const CHUNK_OVERLAP = 120;
export const chunkText = (text: string, size = CHUNK_SIZE, overlap = CHUNK_OVERLAP): string[] => {
const words = text.split(/\s+/);
const chunks: string[] = [];
let start = 0;
while (start < words.length) {
const end = Math.min(start + size, words.length);
chunks.push(words.slice(start, end).join(" "));
if (end === words.length) break;
start = Math.max(0, end - overlap);
}
return chunks;
};
export const ingestDocs = async ({ docs, embed, store }: { docs: RawDoc[]; embed: EmbeddingFn; store: VectorStore }) => {
const items: VectorItem[] = [];
for (const doc of docs) {
// Hash once per document, not once per chunk.
const baseId =
doc.id ?? createHash("sha1").update(`${doc.title}:${doc.url ?? ""}`).digest("hex");
const chunks = chunkText(doc.content);
const vectors = await embed(chunks);
for (let i = 0; i < chunks.length; i++) {
const id = `${baseId}#${i}`;
items.push({
id,
vector: vectors[i],
metadata: {
title: doc.title,
url: doc.url ?? "",
updatedAt: doc.updatedAt ?? "",
chunkIndex: String(i),
access: doc.access ?? "public",
},
});
}
}
await store.upsert(items);
};
Retrieval: Hybrid Beats Either Alone
Semantic vectors are great for paraphrases; keyword (BM25) excels at rare terms and exact IDs. Combining both typically wins.
// lib/retrieve.ts
export type RetrievedChunk = { id: string; text: string; source: string; score: number };
export type SearchFilter = { mustMatch?: Record<string, string>; recencyBoostDays?: number };
export type Retriever = {
vectorSearch: (queryVector: number[], k: number, filter?: SearchFilter) => Promise<RetrievedChunk[]>;
keywordSearch: (queryText: string, k: number, filter?: SearchFilter) => Promise<RetrievedChunk[]>;
};
export type EmbedOne = (input: string) => Promise<number[]>;
export const hybridRetrieve = async ({
query,
embed,
retriever,
k = 12,
filter,
}: {
query: string;
embed: EmbedOne;
retriever: Retriever;
k?: number;
filter?: SearchFilter;
}) => {
const qVec = await embed(query);
const [vecResults, kwResults] = await Promise.all([
retriever.vectorSearch(qVec, k, filter),
retriever.keywordSearch(query, Math.ceil(k / 2), filter),
]);
const combined = [...vecResults, ...kwResults];
const bestById = new Map<string, RetrievedChunk>();
for (const c of combined) {
const prev = bestById.get(c.id);
if (!prev || c.score > prev.score) bestById.set(c.id, c);
}
return Array.from(bestById.values())
.sort((a, b) => b.score - a.score)
.slice(0, k);
};
export const simpleRerank = async (query: string, chunks: RetrievedChunk[]): Promise<RetrievedChunk[]> => {
const qTerms = new Set(query.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
const scored = chunks.map((c) => {
const terms = c.text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
const overlap = terms.reduce((acc, t) => acc + (qTerms.has(t) ? 1 : 0), 0);
return { ...c, score: c.score + overlap * 0.01 };
});
return scored.sort((a, b) => b.score - a.score);
};
Context Compression (Fit More Signal)
You rarely want to pass full chunks as‑is. Compress to extract only the relevant parts while preserving citations.
// lib/compress.ts
import type { RetrievedChunk } from "./retrieve";
export const naiveCompress = async (q: string, chunks: RetrievedChunk[]): Promise<string> => {
return chunks
.slice(0, 6)
.map((c) => `[CITATION:${c.id}] ${c.text}`)
.join("\n\n");
};
For higher precision, use extractive summarization or LLM selectors that preserve spans with IDs.
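As a sketch of the extractive route, here is a selector that keeps only the highest-overlap sentences under a word budget while preserving citation IDs. The sentence splitter and the budget default are illustrative assumptions, not a production tokenizer:

```typescript
type Chunk = { id: string; text: string };

// Score each sentence by query-term overlap and keep the best ones
// until a rough word budget is exhausted, preserving citation IDs.
export const extractiveCompress = (
  query: string,
  chunks: Chunk[],
  budgetWords = 300
): string => {
  const qTerms = new Set(query.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
  const scored = chunks.flatMap((c) =>
    c.text
      .split(/(?<=[.!?])\s+/) // naive sentence split on terminal punctuation
      .map((sentence) => {
        const terms = sentence.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
        const overlap = terms.filter((t) => qTerms.has(t)).length;
        return { id: c.id, sentence, score: overlap };
      })
  );
  scored.sort((a, b) => b.score - a.score);
  const kept: { id: string; sentence: string }[] = [];
  let words = 0;
  for (const s of scored) {
    const w = s.sentence.split(/\s+/).length;
    if (words + w > budgetWords) continue; // skip sentences that blow the budget
    kept.push(s);
    words += w;
  }
  return kept.map((s) => `[CITATION:${s.id}] ${s.sentence}`).join("\n");
};
```

An LLM-based selector can replace the overlap score, but the contract stays the same: spans in, spans with IDs out.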
Prompting and Structured Outputs
Prefer structured outputs validated with zod when your UI expects fields.
// lib/prompt.ts
import { z } from "zod";
export const AnswerSchema = z.object({
answer: z.string(),
citations: z.array(z.object({ id: z.string(), source: z.string().optional() })), // may be empty when the model refuses
confidence: z.number().min(0).max(1).optional(),
});
export const buildSchemaPrompt = (context: string, question: string) => {
// zod schemas don't stringify usefully; describe the expected shape explicitly.
const schemaDescription = `{"answer": string, "citations": [{"id": string, "source"?: string}], "confidence"?: number}`;
return `You are a strict JSON generator. Use ONLY the provided context. If insufficient, set answer to "I don't know" and return empty citations.\n\nContext:\n${context}\n\nQuestion: ${question}\n\nReturn JSON matching this schema:\n${schemaDescription}`;
};
export const extractFirstJson = (raw: string) => {
const m = raw.match(/\{[\s\S]*\}/);
return m ? m[0] : "{}";
};
End‑to‑End Answer Function
// lib/answer.ts
import type { RetrievedChunk } from "./retrieve";
import { hybridRetrieve, simpleRerank } from "./retrieve";
import { naiveCompress } from "./compress";
import { AnswerSchema, buildSchemaPrompt, extractFirstJson } from "./prompt";
type Generate = (prompt: string) => Promise<string>;
type EmbedOne = (input: string) => Promise<number[]>;
type Retriever = Parameters<typeof hybridRetrieve>[0]["retriever"];
export const answerWithRag = async ({
query,
embed,
retriever,
generate,
}: {
query: string;
embed: EmbedOne;
retriever: Retriever;
generate: Generate;
}) => {
const initial = await hybridRetrieve({ query, embed, retriever, k: 12 });
if (initial.length === 0) {
return { answer: "I don’t have enough information to answer that.", citations: [] };
}
const ranked = await simpleRerank(query, initial);
const context = await naiveCompress(query, ranked);
const prompt = buildSchemaPrompt(context, query);
const raw = await generate(prompt);
const parsed = AnswerSchema.parse(JSON.parse(extractFirstJson(raw)));
return parsed;
};
Next.js API Route Wiring (App Router)
// app/api/rag/route.ts
import { NextRequest, NextResponse } from "next/server";
import { answerWithRag } from "@/lib/answer";
export const POST = async (req: NextRequest) => {
const { query } = (await req.json()) as { query?: string };
if (!query || !query.trim()) return NextResponse.json({ error: "Missing query" }, { status: 400 });
// Inject your concrete implementations
const embedOne = async (_: string) => new Array(768).fill(0);
const retriever = {
vectorSearch: async () => [],
keywordSearch: async () => [],
} as const;
const generate = async (_prompt: string) => `{"answer":"Echo","citations":[{"id":"doc#1"}],"confidence":0.5}`;
const result = await answerWithRag({ query, embed: embedOne, retriever, generate });
return NextResponse.json(result);
};
See our end‑to‑end Next.js integration guide for OpenAI auth, streaming, and client UI: Integrate OpenAI into Next.js. For a full React + Express scaffold, read AI chatbot with React + Node.
Permissioning and Security
- Enforce permissions during retrieval (document‑level ACLs) rather than in prompts.
- Never embed secrets; redact PII where required.
- Keep an audit trail: query, retrieved docs, and returned citations.
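A minimal sketch of retrieval-time enforcement, reusing the `mustMatch`-style filter from the retrieval types and the per-chunk `access` tag from ingestion. The `User` shape here is a hypothetical stand-in for your auth context:

```typescript
type User = { orgId: string; roles: string[] };
type SearchFilter = { mustMatch?: Record<string, string> };

// Build a retrieval-time filter from the caller's identity so that
// only documents tagged for their org are ever candidates.
export const aclFilter = (user: User): SearchFilter => ({
  mustMatch: { org: user.orgId },
});

// Defense in depth: drop any chunk whose access tag the user lacks,
// even if the store's filter should already have excluded it.
export const enforceAccess = <T extends { access: string }>(
  chunks: T[],
  user: User
): T[] =>
  chunks.filter((c) => c.access === "public" || user.roles.includes(c.access));
```

The key property: filtering happens before the prompt is built, so a prompt-injection attack cannot widen access.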
Evaluations That Matter
- Groundedness: sample claims and verify support from retrieved text.
- Retrieval quality: Recall@K, MRR, nDCG; log queries with no answer.
- Answer quality: task‑specific rubrics; maintain a golden set.
- Latency/cost: P95 latency, tokens per step, cache hit rate.
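Retrieval metrics like Recall@K and MRR are simple to compute over a logged golden set of (query, relevant-IDs) pairs; a minimal sketch:

```typescript
// Recall@K: fraction of relevant IDs that appear in the top-K results.
export const recallAtK = (
  retrieved: string[],
  relevant: Set<string>,
  k: number
): number => {
  if (relevant.size === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  let hits = 0;
  for (const id of relevant) if (topK.has(id)) hits++;
  return hits / relevant.size;
};

// MRR: reciprocal rank of the first relevant result, averaged over queries.
export const mrr = (
  runs: { retrieved: string[]; relevant: Set<string> }[]
): number => {
  const reciprocalRanks = runs.map(({ retrieved, relevant }) => {
    const idx = retrieved.findIndex((id) => relevant.has(id));
    return idx === -1 ? 0 : 1 / (idx + 1);
  });
  return reciprocalRanks.reduce((a, b) => a + b, 0) / Math.max(1, reciprocalRanks.length);
};
```

Run these on every retrieval change (new chunking, new embedding model) before shipping; groundedness still needs human or LLM-judge sampling on top.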
Cost and Latency Controls
- Cache query rewrites and retrieval results.
- Use small rerankers; cap candidate counts at each stage.
- Compress aggressively before generation; prefer compact prompts.
- Batch ingestion and upserts; version content to avoid full re‑index.
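As one concrete example of the caching point, a small in-memory TTL cache keyed by normalized query text. This is a single-process sketch; in a multi-instance deployment you would likely back it with Redis:

```typescript
type CacheEntry<T> = { value: T; expiresAt: number };

// In-memory TTL cache keyed by normalized query text, so trivial
// variants ("How do I reset?" vs "how do i reset?") hit the same entry.
export class QueryCache<T> {
  private map = new Map<string, CacheEntry<T>>();
  constructor(private ttlMs = 5 * 60_000) {}

  private key(q: string): string {
    return q.trim().toLowerCase().replace(/\s+/g, " ");
  }

  get(q: string, now = Date.now()): T | undefined {
    const k = this.key(q);
    const entry = this.map.get(k);
    if (!entry) return undefined;
    if (entry.expiresAt < now) {
      this.map.delete(k); // lazy eviction of expired entries
      return undefined;
    }
    return entry.value;
  }

  set(q: string, value: T, now = Date.now()): void {
    this.map.set(this.key(q), { value, expiresAt: now + this.ttlMs });
  }
}
```

Cache rewritten queries and retrieval results, not final answers, unless your content changes slowly enough to tolerate stale generations.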
SaaS‑Specific Patterns
- Multi‑tenant indexes: partition by org; enforce org filters in queries.
- Freshness: add recency boosts or decay; prioritize latest versions.
- Safety: refusal rules when context is insufficient; ask clarifying questions.
- Observability: per‑tenant dashboards; error budgets for latency/cost.
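The freshness boost above can be sketched as an exponential decay over the `updatedAt` timestamp carried in the ingestion metadata. The half-life and weight here are illustrative knobs to tune per corpus:

```typescript
// Exponential recency decay: a chunk updated today gets the full boost;
// one updated `halfLifeDays` ago contributes half of it.
export const recencyBoost = (
  score: number,
  updatedAt: string,
  now: Date = new Date(),
  halfLifeDays = 90,
  weight = 0.2
): number => {
  const updated = new Date(updatedAt).getTime();
  if (Number.isNaN(updated)) return score; // missing/invalid date → no boost
  const ageDays = Math.max(0, (now.getTime() - updated) / 86_400_000);
  const decay = Math.pow(0.5, ageDays / halfLifeDays);
  return score + weight * decay;
};
```

Apply it after merge/rerank so freshness breaks ties rather than overriding relevance.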
Putting It Together in a Minimal UI
Pair the API with a chat UI that requires citations and shows sources inline. For reference UI patterns and streaming plumbing, review AI chatbot with React + Node.
Common Pitfalls
- Over‑ or under‑chunking; evaluate chunk strategies on real queries.
- Only vector search → misses exact IDs and rare terms; add keyword.
- Skipping evaluations; ship with groundedness checks and review queues.
- No observability; can’t debug retrieval vs generation failures.
Implementation Checklist
- Define use‑cases, failure modes, and refusal behavior.
- Ingest: parsers, chunking, metadata, dedupe, versions.
- Retrieval: hybrid search + filters; set budgets (N→K→M tokens).
- Rerank + compress: keep citations; prefer extractive compression.
- Prompting + schema validation; return citations and confidence.
- Metrics and dashboards; a golden set; alerting.
- Pilot in a narrow scope; add manual fallback routes.
- Iterate on bad cases; refine chunking, embeddings, prompts.
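The N→K→M budget idea from the checklist can be sketched as a stage trimmer: retrieve N candidates upstream, keep the top K after reranking, then admit chunks until an approximate token budget is spent. The tokens-per-word ratio is a rough assumption; use a real tokenizer in production:

```typescript
export type Budgets = { rerankK: number; contextTokens: number };

// Trim a ranked candidate list to the top K, then admit chunks in
// score order until the approximate token budget is exhausted.
export const applyBudgets = <T extends { text: string; score: number }>(
  ranked: T[],
  budgets: Budgets
): T[] => {
  const topK = ranked.slice(0, budgets.rerankK);
  const kept: T[] = [];
  let tokens = 0;
  for (const c of topK) {
    const approx = Math.ceil(c.text.split(/\s+/).length * 1.3); // rough tokens-per-word
    if (tokens + approx > budgets.contextTokens) break;
    kept.push(c);
    tokens += approx;
  }
  return kept;
};
```

Fixing these numbers per route makes cost and latency predictable and gives evaluations a stable baseline.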
Where to Go Next
- Build your Next.js integration end‑to‑end: Integrate OpenAI into Next.js.
- Deploy the full stack and harden runtime: Deploy Next.js on a VPS.
- Improve site discoverability: Next.js SEO best practices.
Conclusion
RAG turns LLMs into grounded assistants that reason with your organization’s knowledge. Treat it as an information‑retrieval system first and a generation system second: invest in ingestion quality, retrieval relevance, and evaluations. With hybrid search, tight budgets, structured outputs, and clear refusal rules, you can deliver accurate, auditable answers that users trust - and you can do it within SaaS constraints like tenancy, permissions, and cost. For full code wiring and client UX, explore our Next.js integration and AI chatbot scaffold.
