
Techniques for Reducing Hallucinations in LLM Based Applications

A practical, engineering focused guide to diagnosing and reducing hallucinations in LLM based apps using prompt design, retrieval, constraints, evaluation, and architecture patterns.


Hallucinations are the LLM version of that confident friend who will happily explain anything, regardless of whether they actually know what they are talking about. In a demo this can look impressive. In production it can mean fabricated citations, invented API responses, and angry users. You cannot fully eliminate hallucinations in non trivial tasks, but you can constrain, detect, and reduce them to the point where your system behaves predictably and your support inbox stays quiet.

In this guide, we will look at concrete techniques for reducing hallucinations in LLM based applications. The target reader is a mid level engineer who has already shipped an LLM feature or a small RAG prototype, and now needs to make it production ready. We will talk about prompt patterns, retrieval design, architecture choices, and evaluation strategies, and connect them to real world systems such as Document QA with Next.js and LangChain and RAG for SaaS Products.

TL;DR

  • You cannot fully remove hallucinations, but you can bound them - by narrowing tasks, grounding answers in data, and refusing to guess.
  • Good retrieval is more important than clever prompts - design your RAG pipeline as carefully as your UI.
  • Constrain outputs - make the model choose among options or emit strict JSON instead of free form prose where possible.
  • Add self checks and verification layers - ask models to critique or validate each other, and cross check against tools or rules.
  • Continuously evaluate with real data - hallucination control is an ongoing process, not a one time prompt tuning session.

The rest of this article will walk through these ideas with practical examples and tradeoffs.

Step zero - get specific about what hallucination means in your product

Hallucination is a fuzzy term. Before you can fix it, you need a crisp definition of what counts as a hallucination for your app. That varies widely:

  • In a document QA system, hallucination often means stating something that is not supported by the provided documents.
  • In a code assistant, it might mean suggesting APIs or functions that do not exist, or code that fails basic compilation.
  • In a customer support bot, hallucination can be any answer that contradicts your policy documents or invents policies.

If you are building something like the systems in Retrieval Augmented Generation (RAG) Guide or Document QA for SaaS, a good starting definition is:

A response is a hallucination if it asserts a specific fact that is not entailed by the retrieved context or the outputs of upstream tools.

This definition encourages you to focus on two main levers:

  1. What context you give the model.
  2. How you instruct it to treat that context.

Once you are clear on your app specific definition, you can start instrumenting and optimizing against it.
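To make that concrete, here is a minimal sketch of what instrumenting against an app specific definition might look like in TypeScript. Everything here (`HallucinationLabel`, `logForReview`, `hallucinationRate`) is hypothetical, just one way to start measuring:

```typescript
// Hypothetical labels matching an app specific hallucination definition.
type HallucinationLabel =
  | "supported" // every claim is entailed by the retrieved context
  | "unsupported" // asserts a fact the context does not contain
  | "contradicted"; // directly conflicts with the context

type ReviewRecord = {
  question: string;
  answer: string;
  contextIds: string[];
  label: HallucinationLabel;
};

// Collect labeled examples so hallucination rate can be tracked over time.
const reviewLog: ReviewRecord[] = [];

export const logForReview = (record: ReviewRecord): void => {
  reviewLog.push(record);
};

export const hallucinationRate = (): number => {
  if (reviewLog.length === 0) return 0;
  const bad = reviewLog.filter((r) => r.label !== "supported").length;
  return bad / reviewLog.length;
};
```

Even 20 to 50 hand labeled records like this give you a baseline number to improve against.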

Technique 1 - Narrow the task and clarify the contract

Many hallucinations are not mysterious model failures. They are the result of vague goals. Suppose you ask:

Explain our billing model and pricing tiers to this user.

The model will happily make something up if it is not sure. If you instead say:

Based only on the billing policy document below, answer the user's question. If the answer is not in the document, say that you do not know and suggest contacting support.

You have changed the contract. The model is no longer trying to be a creative expert. It is an extractor and summarizer with permission to say "I do not know".

Here is a prompt skeleton you can adapt:

You are helping a user with questions about ACME Corp's billing.

You are given:
1. The user's question.
2. Relevant policy snippets from ACME's internal documentation.

Your job:
- Use only the provided snippets to answer.
- If the answer is not clearly present, say you do not know and recommend contacting support.

Rules:
- Do not invent policies, fees, or dates.
- Do not guess numbers or legal language.
- Quote or paraphrase snippets, and point to which snippet supports your answer.

This pattern does not eliminate hallucinations, but it significantly reduces them. The more you can make tasks look like "classify", "select", or "summarize based on explicit evidence", the less room the model has to improvise.
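The skeleton can also be assembled programmatically. This sketch (the `buildGroundedPrompt` helper is hypothetical) numbers each snippet so the model has something concrete to point at when it cites its evidence:

```typescript
type Snippet = { id: string; text: string };

// Build a grounded prompt from the skeleton above. Snippets are numbered
// so the model can reference the one that supports its answer.
export const buildGroundedPrompt = (
  question: string,
  snippets: Snippet[],
): string => {
  const snippetBlock = snippets
    .map((s, i) => `[${i + 1}] (${s.id}) ${s.text}`)
    .join("\n");

  return [
    "You are helping a user with questions about ACME Corp's billing.",
    "Use only the provided snippets to answer.",
    "If the answer is not clearly present, say you do not know and recommend contacting support.",
    "Do not invent policies, fees, or dates.",
    "",
    "Snippets:",
    snippetBlock,
    "",
    `Question: ${question}`,
  ].join("\n");
};
```

Centralizing prompt assembly like this also makes the contract easy to version and test.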

Related posts such as How Text Becomes Model Input and Context Windows and LLM Prompt Optimization are worth reading if you want a deeper understanding of how instructions interact with tokens and attention.

Technique 2 - Design retrieval as a first class system

If your app uses RAG, retrieval quality is the foundation. A beautifully crafted prompt will not save you if the model sees the wrong documents. Common retrieval issues include:

  • Using a single vector search over everything, which can return semantically similar but irrelevant chunks.
  • Storing long, unstructured documents without chunking or metadata, so the model sees a wall of text.
  • Skipping basic filters like tenant id or document type, which leads to cross tenant leakage or mixing internal and public docs.

Instead, treat your RAG pipeline with the same attention you would give a search feature:

  1. Chunk documents with structure in mind
    Use headings, sections, and semantic boundaries instead of naive fixed size chunking.
  2. Store rich metadata
    Tenant, document type, creation date, language, and access level.
  3. Use hybrid search
    Combine dense vector search with keyword or BM25 to reduce "vibes only" matches.
  4. Apply hard filters before ranking
    Tenant, environment, and permission filters should run before similarity scoring.

A simple retrieval function might look like this in TypeScript pseudo code:

type RetrievalParams = {
  tenantId: string;
  query: string;
  topK: number;
};

export const retrieveContext = async ({
  tenantId,
  query,
  topK,
}: RetrievalParams) => {
  // Fetch extra candidates so the reranker has something to work with,
  // and apply hard tenant and publication filters before similarity scoring.
  const results = await vectorStore.search({
    query,
    topK: topK * 3,
    filter: {
      tenantId,
      published: true,
    },
  });

  // Blend keyword signal (for example BM25) with the dense scores,
  // then keep only the best topK chunks.
  const reranked = rerankWithKeywordSignal(query, results);
  return reranked.slice(0, topK);
};

For a deeper dive into retrieval patterns and vector stores, pair this article with Vector Databases and Semantic Search and RAG for SaaS.

Technique 3 - Make the model show its work

One of the most effective anti hallucination patterns is to force the model to tie claims to specific evidence. You are not just asking "What is the answer?", you are also asking:

Which snippet supports this answer, and how?

This can take several forms:

  • Ask for citations with snippet ids and ranges.
  • Ask the model to quote the lines it used.
  • Use a two step pattern: first select relevant snippets, then answer based on those.

A simple output schema:

{
  "answer": "string",
  "citations": [
    {
      "snippetId": "string",
      "explanation": "string"
    }
  ],
  "unknown": boolean
}

You then validate server side:

  • If unknown is true but citations exist, treat that as suspicious.
  • If claims reference snippets that clearly do not mention the asserted fact, lower your trust score.
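Those server side checks can be sketched as a small validator. The name `validateOutput` and the exact checks are illustrative, not a complete solution:

```typescript
type Citation = { snippetId: string; explanation: string };
type ModelOutput = { answer: string; citations: Citation[]; unknown: boolean };

// Hypothetical validation of the schema above. Returns a list of problems;
// an empty list means the output passed the basic consistency checks.
export const validateOutput = (
  output: ModelOutput,
  snippets: Map<string, string>,
): string[] => {
  const problems: string[] = [];

  // "I do not know" answers should not come with citations.
  if (output.unknown && output.citations.length > 0) {
    problems.push("unknown is true but citations are present");
  }

  // A definite answer with no citations is suspicious too.
  if (!output.unknown && output.citations.length === 0) {
    problems.push("answer has no supporting citations");
  }

  // Citations must point at snippets that were actually sent to the model.
  for (const c of output.citations) {
    if (!snippets.has(c.snippetId)) {
      problems.push(`citation references unknown snippet ${c.snippetId}`);
    }
  }

  return problems;
};
```

Checking semantic support (does the snippet really entail the claim) is harder and usually needs a second model pass, but these structural checks catch a surprising number of failures cheaply.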

Combined with a strict JSON output format - a technique also useful in AI Summarized Dashboards - this pattern both reduces hallucinations and makes them easier to detect when they still happen.

Technique 4 - Constrain outputs with schemas and options

Free form natural language is where models have the most freedom to wander. Whenever you can, replace "write anything you like" with structured or constrained outputs:

  • Classification - ask the model to choose one label from a small set.
  • JSON schemas - define the shape of the output and parse it strictly.
  • Option selection - show a list of possible responses and have the model pick.

Example: instead of asking a support bot to decide arbitrarily how confident it feels, give it choices:

You must choose one of the following confidence levels for your answer:
- "high"   -  answer is clearly supported by the context.
- "medium" -  context is somewhat related but not exact.
- "low"    -  context does not contain a clear answer.

Return a JSON object:
{
  "answer": string,
  "confidence": "high" | "medium" | "low"
}

On the backend, you then:

  • Refuse to surface answers with confidence = "low" without human review.
  • Log all "medium" answers for later evaluation.
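A minimal backend gate over that JSON might look like the following; `gateAnswer`, `evaluationLog`, and the `GateResult` shape are hypothetical:

```typescript
type Confidence = "high" | "medium" | "low";
type SupportAnswer = { answer: string; confidence: Confidence };

type GateResult =
  | { kind: "show"; answer: string }
  | { kind: "review"; answer: string }; // held for human review

// Medium confidence answers are logged for later evaluation.
export const evaluationLog: SupportAnswer[] = [];

// Gate by confidence: show "high" directly, show "medium" but log it,
// and never surface "low" without human review.
export const gateAnswer = (result: SupportAnswer): GateResult => {
  if (result.confidence === "low") {
    return { kind: "review", answer: result.answer };
  }
  if (result.confidence === "medium") {
    evaluationLog.push(result);
  }
  return { kind: "show", answer: result.answer };
};
```

The important part is that the policy lives in ordinary backend code you can test, not inside the prompt.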

This style of constraint shows up across many production LLM systems, from internal tooling to public facing assistants. It is the same mindset as validating API inputs and outputs carefully in a microservice architecture, something you may already practice if you have followed the guidance in API Versioning and Backward Compatibility.

Technique 5 - Add self checks and adversarial prompts

LLMs can help you detect their own mistakes. That sounds paradoxical, but in practice it works surprisingly well if you design the interaction carefully.

Patterns you can use:

  • Self critique - ask the same model to review its own answer against the context and flag unsupported claims.
  • Cross examination - have model A propose an answer and model B try to poke holes in it.
  • Consistency checks - query the model with logically equivalent prompts and see if the answers agree.

Here is a simple self critique pass:

You are given:
- A user question.
- A proposed answer written by another assistant.
- The context snippets that answer is supposed to be based on.

Your job is to check whether the answer is fully supported by the context.

Rules:
- If the answer includes any claim that is not clearly supported by the context, you must mark it as "unsupported".
- If the answer contradicts the context, you must mark it as "contradicted".

Respond in JSON:
{
  "status": "supported" | "unsupported" | "contradicted",
  "problems": string[]
}

You can run this as a background validation step even if you choose to show the original answer immediately. Over time, this feedback becomes part of your evaluation dataset and helps you tune prompts, retrieval, and model selection.
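Here is one way to wire that critique in as a background step, with the model call injected so the orchestration stays testable. The function names are hypothetical and the sketch shows only the control flow, not the LLM client:

```typescript
type CritiqueStatus = "supported" | "unsupported" | "contradicted";
type Critique = { status: CritiqueStatus; problems: string[] };

// The model call is injected so the orchestration can be tested with a stub.
type CritiqueFn = (
  question: string,
  answer: string,
  context: string[],
) => Promise<Critique>;

// Return the original answer immediately and record the critique verdict
// whenever it arrives. Failures in the critique path never block the answer.
export const answerWithBackgroundCritique = async (
  question: string,
  answer: string,
  context: string[],
  critique: CritiqueFn,
  record: (c: Critique) => void,
): Promise<string> => {
  critique(question, answer, context)
    .then(record)
    .catch(() =>
      record({ status: "unsupported", problems: ["critique call failed"] }),
    );

  return answer;
};
```

The recorded verdicts are exactly the kind of data that feeds the evaluation loop described later.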

This is analogous to using automated tests and linters to catch regressions in code, a theme explored from a different angle in AI Maintain Code Quality and Reduce Bugs.

Technique 6 - Incorporate tools and external verifiers

Some hallucinations are best handled by taking the decision away from the model entirely and delegating it to deterministic tools:

  • For numeric questions, call a calculator or database instead of asking the model to compute.
  • For knowledge lookups, hit a search API or internal knowledge base and give the model those results as context.
  • For structured tasks, use schema validation libraries to reject nonsense.

For example, in a Next.js API route you might:

type Question = {
  type: "numerical" | "factual" | "freeform";
  text: string;
};

export const answerQuestion = async (question: Question) => {
  // Numeric questions go to a deterministic tool instead of the model.
  if (question.type === "numerical") {
    return await answerWithCalculator(question.text);
  }

  // Factual questions are grounded in search results before generation.
  if (question.type === "factual") {
    const searchResults = await webSearch(question.text);
    return await answerWithLLMAndSearch(question.text, searchResults);
  }

  // Only free form questions fall through to the unassisted model.
  return await answerWithLLM(question.text);
};

The model still plays a role, but now it is downstream of a tool that provides grounding. This kind of tool augmented architecture is central to agentic systems, a topic you can explore further in Agentic Workflows for Developer Automation and AI Reshaping the Software Development Lifecycle.

Technique 7 - Evaluate with real traffic and curated datasets

You cannot meaningfully reduce hallucinations without measurement. That means:

  • Building evaluation datasets that reflect real user questions, not just synthetic edge cases.
  • Labeling answers as correct, partially correct, unsupported, or harmful.
  • Tracking metrics over time as you change prompts, models, or retrieval.

Useful signals include:

  • User feedback buttons - thumbs up or down on answers, plus optional free text.
  • Support tickets - cluster tickets that mention the assistant and look for patterns.
  • Shadow testing - run new prompts or models in parallel on historical queries and compare outputs.

You can wire this into your deployment pipeline. For example:

  1. Collect a weekly batch of real queries and their ground truth or human judgments.
  2. Run your current prompt plus a candidate variant.
  3. Use an evaluation model to judge which variant did better.
  4. Only roll out changes that improve or maintain quality.

This practice sits nicely alongside broader platform level evaluation, such as the kinds of experiments discussed in AI Coding Assistants - Benefits, Risks, and Adoption and Ethics of AI Generated Code in Production.

Technique 8 - Choose models and settings with hallucinations in mind

Not all models hallucinate in the same way. Some are more verbose, some are more cautious, some are tuned for open ended creativity. When selecting models and config values, consider:

  • Temperature - higher values increase randomness and creativity, which usually increases hallucinations.
  • Top p and top k - similar effect, with different tradeoffs.
  • Model family and training data - models fine tuned for instruction following and grounded tasks often behave better than general chat tuned ones.

If your task is sensitive to correctness, your default stance should be:

  • Keep temperature low.
  • Prefer models marketed for "tools", "assistants", or "enterprise" use.
  • Be skeptical of models tuned heavily for role play or creative writing.

This does not mean you cannot use creative settings anywhere. You might allow higher temperature in brainstorming features while keeping core workflows tightly constrained. The key is to be explicit about where hallucinations are acceptable and where they are not, and to align your model choices with that boundary.
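One way to make that boundary explicit is a per feature configuration table. The values and names below are illustrative, not recommendations for any particular provider:

```typescript
// Illustrative per feature sampling settings. Exact parameter names depend
// on your provider's API; the point is that the boundary between
// "correctness critical" and "creative" lives in config, not in ad hoc calls.
type SamplingConfig = { temperature: number; topP: number };

const featureConfig: Record<string, SamplingConfig> = {
  billingAnswers: { temperature: 0.0, topP: 1.0 }, // correctness critical
  docSummaries: { temperature: 0.2, topP: 0.9 },
  brainstorming: { temperature: 0.9, topP: 0.95 }, // hallucination tolerant
};

// Unknown features fall back to the most conservative settings.
export const samplingFor = (feature: string): SamplingConfig =>
  featureConfig[feature] ?? { temperature: 0.0, topP: 1.0 };
```

Defaulting unknown features to the conservative end means a new code path has to opt in to creativity rather than inherit it by accident.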

Architecture patterns that reduce blast radius

Finally, step back and look at the architecture of your LLM integration. Even with all the techniques above, you will occasionally get incorrect outputs. Good system design limits the damage:

  • Keep LLMs out of critical control loops - do not let a model directly trigger irreversible actions such as charges or data deletion without human or rule based checks.
  • Treat LLM outputs as suggestions, not facts - especially in code generation, configuration, or policy.
  • Use layered decision making - model suggests, rules and tools approve or deny.

For example, in an AI assisted deployment pipeline, the model might propose a rollout plan, but your existing safeguards around canary deployments, health checks, and automatic rollbacks still make the final call. This mindset lines up with the broader themes in AI Automation - Pros and Cons and Manager's Guide to Safe AI Adoption.

Actionable next steps

To make this concrete, here is a short roadmap you can follow over the next couple of weeks:

  1. Define hallucination for your app
    Write down what counts as a hallucination in your context and collect 20 to 50 real examples.
  2. Harden your prompts and outputs
    Add explicit instructions about using only provided context, refusing to guess, and returning structured JSON with citations or confidence.
  3. Invest in retrieval quality
    Review your chunking, metadata, and search filters if you use RAG, and run small experiments with hybrid search or reranking.
  4. Add a self check or verification step
    Implement at least one simple self critique or tool based validation layer in a non critical path.
  5. Set up lightweight evaluation
    Start logging user feedback, sampling conversations, and periodically running offline evaluations against a fixed dataset.

You will not eliminate hallucinations entirely, and that is fine. The goal is to move from "the model sometimes says strange things and we hope users do not notice" to "we understand when and why hallucinations occur, we bound their impact, and we improve them systematically over time". With that mindset, LLMs go from temperamental demo machines to reliable components in your production architecture.