
Error Handling Patterns in Distributed Systems - Practical Examples

A pragmatic guide to error handling in distributed systems with real-world patterns, code snippets, architecture diagrams, and advice drawn from production outages.

If you are used to building single service web apps, error handling probably feels simple: try, catch, log, maybe show a friendly toast. Once you move into distributed systems, that mental model collapses. Calls jump between services, networks misbehave, messages arrive twice or not at all, and logs scatter across multiple machines. Suddenly "just wrap it in a try catch" is not enough.

In this guide we will walk through practical error handling patterns for distributed systems. The goal is not academic purity. The goal is to give you tools you can actually use on a system with multiple services, message queues, and real traffic. We will use concrete examples, code snippets, and simple text based architecture diagrams to tie everything together, so you can apply these ideas the next time an incident wakes you up at three in the morning.

Along the way, we will connect related topics like Practical Guide to Implementing Clean Architecture in Full-Stack Projects, Best Practices for API Versioning and Backward Compatibility, and How AI Helps Maintain Code Quality and Reduce Bugs so you can explore the broader reliability story beyond error handling alone.

Why error handling is harder in distributed systems

In a monolith, most failures are local. A database query fails, an in process call throws, or a user passes invalid data. You see the stack trace in one place, and that is often enough to debug and fix. You might even get away with swallowing some exceptions because everything happens in one runtime.

In a distributed system you no longer have that luxury:

  • Failures are partial. A request might succeed in one service and fail in another.
  • Failures are intermittent. A call that fails now might work on the next retry.
  • Failures are invisible. Timeouts can look like nothing happened at all unless you instrument correctly.
  • Failures cascade. One slow dependency can stall threads, exhaust connection pools, and take unrelated services down with it.

Because of this, error handling stops being a local implementation detail and becomes an architectural concern. Many of the choices you make here will also connect to topics like Database Indexing Strategies Every Backend Developer Should Know and React Performance and Bundle Size Optimization in 2025, since a slow or overloaded system is usually one unlucky outage away from bigger reliability problems.

There are three core constraints you should keep in the back of your mind as we walk through the patterns.

  1. Networks are unreliable. Every remote call can fail, be delayed, or be duplicated.
  2. State is shared. Multiple services can touch the same logical entity at different times.
  3. Observability is essential. You cannot fix what you cannot see, especially once you cross process and machine boundaries.

The patterns below are essentially different ways of making these constraints explicit and manageable.

Core principles for error handling in distributed systems

Before we touch specific patterns, it is worth aligning on a few principles. These will show up repeatedly and they are useful to keep in mind when you are doing system design or reviewing pull requests.

Fail fast, fail loud, but never fail silently

In a distributed environment, a quiet failure is worse than a noisy one. If a service swallows an exception and returns a generic success response, you lose any chance of recovery or diagnosis. When the user reports a bug, you will have no trail to follow.

  • Fail fast by applying aggressive, explicit timeouts to network calls.
  • Fail loud by surfacing meaningful error signals to callers, not just generic 500s.
  • Avoid swallowing errors. If you catch, you should transform, log, or compensate.

You will see this principle again when we talk about timeouts, retries, and circuit breakers.

Idempotency as a first class concept

Many error handling strategies rely on retries. Retries are only safe if the operation is idempotent or if you can detect duplicates and handle them explicitly.

For example, "create an order and charge a card" is not intrinsically idempotent. If a network timeout happens after the card is charged but before the client sees the response, a retry could charge the card twice. Idempotency keys solve this by turning that sequence into a logical single operation keyed by a client provided token.
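The mechanics can be sketched in a few lines. The in-memory Map and the chargeCard stub below are hypothetical stand-ins for illustration; a real implementation would persist keys in a database with a unique constraint so that duplicates are rejected across process restarts:

```typescript
// Minimal idempotency key sketch. The Map and chargeCard stub are
// illustrative stand-ins, not a production store.
type ChargeResult = { paymentId: string };

const processedRequests = new Map<string, ChargeResult>();

let nextPaymentId = 0;

// Simulated charge; stands in for a real payment provider call.
const chargeCard = async (): Promise<ChargeResult> => {
  nextPaymentId += 1;
  return { paymentId: `pay_${nextPaymentId}` };
};

export const chargeWithIdempotencyKey = async (
  idempotencyKey: string
): Promise<ChargeResult> => {
  // If we have already seen this key, return the stored result
  // instead of charging the card a second time.
  const existing = processedRequests.get(idempotencyKey);
  if (existing) {
    return existing;
  }

  const result = await chargeCard();
  processedRequests.set(idempotencyKey, result);
  return result;
};
```

With this in place, a client that times out can safely retry with the same key: the second call returns the stored result instead of producing a second charge.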

If this sounds familiar, it is because payment providers have been hammering on this concept for years. You can see a production ready take on it in Stripe API Versioning Explained and similar APIs that use idempotency keys and stable error codes as part of their public contracts.

Timeouts and bounded retries

Without timeouts, failing dependencies can tie up resources and propagate slowness through your system. Without limits on retries, you can create your own denial of service during a partial outage as all services keep retrying each other.

The pattern is simple:

  • Every outbound call has a timeout.
  • Retries are bounded and use exponential backoff with jitter.
  • You treat retries as part of normal operation, not as a special case.

We will look at concrete TypeScript code for this in a later section.

Standard contracts for errors

If each service returns errors in a different format, debugging across boundaries becomes painful. You want a consistent error envelope that carries at least:

  • A machine readable error code.
  • A stable human readable message.
  • A correlation or trace ID.
  • Optional details for logs and dashboards.

This also makes it easier to expose safe error messages to clients or to your analytics tooling. If you later build something similar to AI-Summarized Dashboards: From Walls of Charts to Actionable Narratives, having structured error data will save you a lot of time.

Pattern 1: Correlation IDs and request scoped context

The first pattern you should introduce in any distributed system is correlation IDs. These are unique identifiers attached to a request as it enters your boundary, then propagated to every downstream service and log line.

Why correlation IDs matter

Without correlation IDs, logs from different services are just a pile of events. With correlation IDs, you can reconstruct the journey of a single user request through the system.

When a user says, "My checkout failed at 10:31 UTC," you can search by the correlation ID associated with their request and see everything that happened, including retries, fallbacks, and partial failures. It turns an incident from a murder mystery into a fairly boring detective report, which is exactly what you want.

Simple architecture diagram

Here is a minimal flow diagram using correlation IDs:

[Client]
   |
   v
[API Gateway] -- assigns correlationId --> [Order Service]
                                               |
                                               v
                                      [Payment Service]
                                               |
                                               v
                                      [Inventory Service]

The correlation ID is created at the API gateway, placed into a header like X-Correlation-Id, then passed to every service. Each service logs it automatically and returns it in responses so clients can include it in bug reports.

Example correlation middleware in TypeScript

This example uses an Express style API, but the idea works in any framework:

// correlation.ts
import { randomUUID } from "crypto";
import type { Request, Response, NextFunction } from "express";

const CORRELATION_HEADER = "x-correlation-id";

export const correlationMiddleware = (
  req: Request,
  res: Response,
  next: NextFunction
): void => {
  const existingId = req.header(CORRELATION_HEADER);
  const correlationId = existingId || randomUUID();

  (req as any).correlationId = correlationId;
  res.setHeader(CORRELATION_HEADER, correlationId);

  next();
};

export const withCorrelation = (message: string, req: Request): string => {
  const correlationId = (req as any).correlationId;
  return `[correlationId=${correlationId}] ${message}`;
};

A handler that uses it consistently:

// orderRoutes.ts
import type { Request, Response } from "express";
import { withCorrelation } from "./correlation";

export const createOrderHandler = async (req: Request, res: Response): Promise<void> => {
  try {
    // Business logic that can still throw
    res.status(201).json({ ok: true });
  } catch (error) {
    console.error(withCorrelation("Failed to create order", req), { error });
    res.status(500).json({
      errorCode: "ORDER_CREATION_FAILED",
      message: "We could not create your order. Please try again.",
      correlationId: (req as any).correlationId,
    });
  }
};

Once this pattern is in place, you can make it part of your clean architecture boundary, just like you would structure interfaces and adapters in Practical Guide to Implementing Clean Architecture in Full-Stack Projects.

Pattern 2: Standardized error envelopes between services

In distributed systems, you want services to speak a common error language. This does not need to be over engineered. A simple JSON shape with a stable errorCode is usually enough to unlock consistent handling and logging.

Designing an error envelope

Here is a practical example of an error envelope:

{
  "errorCode": "PAYMENT_DECLINED",
  "message": "The payment provider declined the transaction.",
  "correlationId": "df0a1d6b-6b34-4f0b-a8e5-5cce366ed6a3",
  "details": {
    "provider": "stripe",
    "reason": "insufficient_funds"
  }
}

For internal service to service communication, you may include more detail. For external clients, you can map to a safer, less verbose version that hides sensitive internal data while keeping the error code stable.
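One way to do that mapping is a small translation function at the gateway. The shapes and the message allow-list below are assumptions for the sake of the sketch; the point is that the errorCode and correlationId stay stable while internal details are stripped:

```typescript
// Sketch of mapping an internal error envelope to a client-safe version.
// The types and the allow-list of public messages are illustrative.
type InternalError = {
  errorCode: string;
  message: string;
  correlationId: string;
  details?: Record<string, unknown>;
};

type PublicError = {
  errorCode: string;
  message: string;
  correlationId: string;
};

// Hypothetical allow-list of messages that are safe to show externally.
const publicMessages: Record<string, string> = {
  PAYMENT_DECLINED: "Your payment was declined.",
};

export const toPublicError = (internal: InternalError): PublicError => ({
  errorCode: internal.errorCode,
  message: publicMessages[internal.errorCode] ?? "Something went wrong.",
  correlationId: internal.correlationId,
});
```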

TypeScript representation and helpers

export type ErrorCode =
  | "VALIDATION_ERROR"
  | "RESOURCE_NOT_FOUND"
  | "PAYMENT_DECLINED"
  | "INTERNAL_ERROR";

export type ServiceError = {
  errorCode: ErrorCode;
  message: string;
  correlationId: string;
  details?: Record<string, unknown>;
};

export class ServiceErrorException extends Error {
  public readonly errorCode: ErrorCode;
  public readonly details?: Record<string, unknown>;

  constructor(params: {
    errorCode: ErrorCode;
    message: string;
    details?: Record<string, unknown>;
  }) {
    super(params.message);
    this.name = "ServiceErrorException";
    this.errorCode = params.errorCode;
    this.details = params.details;
  }
}

Then plug this into a single error handling middleware:

import type { Request, Response, NextFunction } from "express";
import { ServiceErrorException, type ServiceError, type ErrorCode } from "./errors";

const mapErrorCodeToStatus = (errorCode: ErrorCode): number => {
  switch (errorCode) {
    case "VALIDATION_ERROR":
      return 400;
    case "RESOURCE_NOT_FOUND":
      return 404;
    case "PAYMENT_DECLINED":
      return 402;
    default:
      return 500;
  }
};

export const errorMiddleware = (
  err: unknown,
  req: Request,
  res: Response,
  _next: NextFunction
): void => {
  const correlationId = (req as any).correlationId ?? "unknown";

  if (err instanceof ServiceErrorException) {
    const payload: ServiceError = {
      errorCode: err.errorCode,
      message: err.message,
      correlationId,
      details: err.details,
    };

    console.error("[service-error]", { correlationId, payload });
    res.status(mapErrorCodeToStatus(err.errorCode)).json(payload);
    return;
  }

  const payload: ServiceError = {
    errorCode: "INTERNAL_ERROR",
    message: "Unexpected server error",
    correlationId,
  };

  console.error("[unhandled-error]", { correlationId, err });
  res.status(500).json(payload);
};

With this in place, both other services and your front end have a predictable way to respond to errors. For example, your React app can map PAYMENT_DECLINED to a specific UI flow, and your API gateway can transform internal errors into stable public responses the way you might do when following patterns from Best Practices for API Versioning and Backward Compatibility.

Pattern 3: Timeouts, retries, and circuit breakers

Timeouts, retries, and circuit breakers form the classic triad of resilience patterns. They interact closely, so it is useful to think of them as one combined mechanism instead of three unrelated features.

Architecture diagram for a single dependency

[Order Service] --(HTTP call)--> [Payment Service]
      |                                  |
      |                         own database and logs
      |
      +-- timeout settings
      |
      +-- retry policy
      |
      +-- circuit breaker state

If the payment service becomes slow or unhealthy, the order service should not blindly keep sending traffic. Instead, it should:

  1. Detect failures or timeouts.
  2. Retry short lived issues with backoff.
  3. Open a circuit if the failure rate passes a threshold.
  4. Optionally fall back to a degraded behavior.

Simple retry with backoff in TypeScript

You do not need a giant framework to get started. A small utility goes a long way:

type RetryOptions = {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
};

const defaultRetryOptions: RetryOptions = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 1_000,
};

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

const calculateBackoff = (attempt: number, options: RetryOptions): number => {
  const exponential = options.baseDelayMs * Math.pow(2, attempt);
  const jitter = Math.random() * options.baseDelayMs;
  return Math.min(exponential + jitter, options.maxDelayMs);
};

export const withRetries = async <T>(
  operation: () => Promise<T>,
  options: Partial<RetryOptions> = {}
): Promise<T> => {
  const config = { ...defaultRetryOptions, ...options };

  let lastError: unknown;

  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === config.maxAttempts - 1) {
        break;
      }
      const delay = calculateBackoff(attempt, config);
      await sleep(delay);
    }
  }

  throw lastError;
};

Using it for an HTTP call with a timeout and standardized error reporting:

export const callPaymentService = async (
  payload: unknown,
  correlationId: string
): Promise<unknown> => {
  return withRetries(async () => {
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), 1_500);

    try {
      const response = await fetch("https://payment.internal/charge", {
        method: "POST",
        body: JSON.stringify(payload),
        headers: {
          "Content-Type": "application/json",
          "X-Correlation-Id": correlationId,
        },
        signal: controller.signal,
      });

      if (!response.ok) {
        throw new Error(`Payment service error status=${response.status}`);
      }

      return response.json();
    } finally {
      clearTimeout(timeout);
    }
  });
};

For production you will likely adopt a library that also tracks sliding windows of failures and implements a full circuit breaker, but it is helpful to understand the basic ingredients before you start flipping configuration flags.
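To make those ingredients concrete, here is a minimal circuit breaker sketch under simplified assumptions: it counts consecutive failures, opens after a threshold, and fails fast until a cooldown elapses. Real libraries add sliding windows and an explicit half-open state with trial probes; this only shows the basic state machine:

```typescript
// Minimal circuit breaker sketch: opens after `failureThreshold`
// consecutive failures, fails fast during `cooldownMs`, then lets a
// trial request through. Simplified relative to production libraries.
type BreakerState = "closed" | "open";

export class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly cooldownMs: number
  ) {}

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        // Fail fast: do not send traffic to an unhealthy dependency.
        throw new Error("Circuit is open, failing fast");
      }
      // Cooldown elapsed: allow a trial request through.
      this.state = "closed";
    }

    try {
      const result = await operation();
      this.consecutiveFailures = 0;
      return result;
    } catch (error) {
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw error;
    }
  }
}
```

You would typically wrap the retry helper from above inside `breaker.call(...)` so that a dependency that keeps failing stops being retried at all.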

Pattern 4: Sagas and compensating transactions

Distributed systems cannot rely on traditional ACID transactions across multiple services. Two phase commit exists, but it is complex, fragile, and rare in internet facing systems. Instead, we model long running workflows explicitly as sagas.

High level saga example

Imagine a checkout flow that performs three operations:

  1. Create an order record.
  2. Charge the customer.
  3. Reserve inventory.

If step three fails after steps one and two have succeeded, you need to:

  • Refund the payment.
  • Cancel the order.

This is the saga pattern in action: a sequence of steps and compensations. Each step has a corresponding compensating action that logically undoes its effects.

Saga orchestrator diagram

[Order Orchestrator]
       |
       +--> [Order Service]        (create order)
       |
       +--> [Payment Service]      (charge)
       |
       +--> [Inventory Service]    (reserve)

The orchestrator is responsible for deciding which compensating actions to trigger in response to errors. In some architectures this orchestrator is a dedicated service or workflow engine; in others it is a specific use case inside one of your core services. Either way, the important part is that the flow and its error handling are explicit.

Minimal saga implementation in TypeScript

type SagaStepContext = {
  orderId?: string;
  paymentId?: string;
  reservationId?: string;
};

type SagaStep = {
  name: string;
  execute: (ctx: SagaStepContext) => Promise<SagaStepContext>;
  compensate: (ctx: SagaStepContext) => Promise<void>;
};

export const runSaga = async (steps: SagaStep[]): Promise<SagaStepContext> => {
  const executedSteps: SagaStep[] = [];
  let context: SagaStepContext = {};

  try {
    for (const step of steps) {
      context = await step.execute(context);
      executedSteps.push(step);
    }
    return context;
  } catch (error) {
    for (const step of executedSteps.reverse()) {
      try {
        await step.compensate(context);
      } catch (compError) {
        console.error("Failed to compensate step", {
          step: step.name,
          compError,
        });
      }
    }
    throw error;
  }
};

Concrete steps for an order saga might look like this:

const createOrderStep: SagaStep = {
  name: "create-order",
  execute: async (ctx) => {
    const orderId = await createOrderRecord();
    return { ...ctx, orderId };
  },
  compensate: async (ctx) => {
    if (ctx.orderId) {
      await cancelOrderRecord(ctx.orderId);
    }
  },
};

const chargeCustomerStep: SagaStep = {
  name: "charge-customer",
  execute: async (ctx) => {
    if (!ctx.orderId) {
      throw new Error("Missing orderId");
    }
    const paymentId = await chargePaymentProvider(ctx.orderId);
    return { ...ctx, paymentId };
  },
  compensate: async (ctx) => {
    if (ctx.paymentId) {
      await refundPayment(ctx.paymentId);
    }
  },
};

This pattern fits naturally with the kind of layered design described in Practical Guide to Implementing Clean Architecture in Full-Stack Projects, where the saga sits in an application layer and calls out to ports implemented by infrastructure adapters.

Pattern 5: Dead letter queues and poison message handling

In message driven architectures, some messages will never be processable by a given consumer. Maybe the payload is malformed, maybe a downstream entity was deleted, or maybe your code has a bug in a specific edge case. Without a strategy, these poison messages will:

  • Be retried indefinitely.
  • Block other messages depending on queue semantics.
  • Hide real issues until systems saturate.

Dead letter queues are the usual solution.

Dead letter queue architecture

[Producer] -> [Main Queue] -> [Consumer Service]
                               |
                        after N failed attempts
                               v
                        [Dead Letter Queue]

Each time the consumer fails to process a message, the message is retried up to a configured limit. Once that limit is reached, the broker automatically moves it to the dead letter queue. From there, you can analyze and remediate without burning compute on endless retries.
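Many brokers perform this move for you via configuration, but when yours does not, the consumer-side logic looks roughly like the sketch below. The queue arrays and message shape are hypothetical stand-ins for broker operations:

```typescript
// Consumer-side dead lettering sketch. The arrays stand in for broker
// requeue and DLQ operations; the message shape is illustrative.
type Message = { id: string; body: unknown; attempts: number };

const MAX_ATTEMPTS = 3;

export const deadLetters: Message[] = [];
export const requeued: Message[] = [];

export const handleMessage = async (
  message: Message,
  process: (body: unknown) => Promise<void>
): Promise<void> => {
  try {
    await process(message.body);
  } catch (error) {
    const attempts = message.attempts + 1;
    if (attempts >= MAX_ATTEMPTS) {
      // Give up: park the message for human triage instead of
      // retrying forever.
      deadLetters.push({ ...message, attempts });
      return;
    }
    // Put the message back with an incremented attempt counter.
    requeued.push({ ...message, attempts });
  }
};
```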

Treating DLQs as triage, not trash

The dead letter queue is not a bin. It is a triage backlog. A practical approach looks like this:

  1. Monitor DLQ volume and rate with metrics.
  2. Expose DLQ metrics on dashboards with clear alerts.
  3. Periodically inspect DLQ messages and group them by failure cause.
  4. Decide whether to:
    • Fix a bug and reprocess messages.
    • Manually correct data and replay selected items.
    • Permanently discard with a clear explanation and, if needed, user communication.

If you are building observability or product analytics experiences similar to AI-Summarized Dashboards: From Walls of Charts to Actionable Narratives, DLQ metrics deserve prominent placement. They provide early warning for subtle data quality issues that might not surface as obvious outages.

Pattern 6: Observability as part of error handling

Many teams treat logging, metrics, and tracing as nice to have. In distributed systems they are part of error handling itself. When something fails, there are three questions you always end up asking:

  1. What exactly failed?
  2. Where in the architecture did it fail?
  3. Why did it fail?

Correlation IDs, error codes, and standardized envelopes help answer what and where. To answer why, you need good observability.

Structured logs instead of pretty logs

Plain text logs are fine for local development. For real systems, prefer structured logs that your tools can parse:

console.error(
  JSON.stringify({
    level: "error",
    event: "order_creation_failed",
    correlationId,
    errorCode: "ORDER_CREATION_FAILED",
    details: {
      userId,
      cartSize,
    },
  })
);

This makes it much easier to aggregate and filter logs and to build higher level automation. It also paves the way for using techniques like those in How AI Helps Maintain Code Quality and Reduce Bugs where consistent signals make AI far more reliable.

Metrics as first class signals

You should track at least:

  • Error rate per endpoint and dependency.
  • Latency distributions (p50, p95, p99).
  • Saturation metrics such as queue depth, thread pool usage, or active connections.

When an error pattern changes, metrics are often the earliest warning sign. A small but sharp increase in 500 rate or queue latency on a single service can tell you where to look long before users file tickets.
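As a sketch of the idea, per-endpoint error rate can be tracked with simple in-memory counters; a real system would export these to a metrics backend such as Prometheus rather than keep them in process memory:

```typescript
// In-memory error rate sketch; illustrative only, real systems export
// counters to a metrics backend instead of holding them in process.
type Counters = { total: number; errors: number };

const counters = new Map<string, Counters>();

export const recordRequest = (endpoint: string, isError: boolean): void => {
  const current = counters.get(endpoint) ?? { total: 0, errors: 0 };
  current.total += 1;
  if (isError) current.errors += 1;
  counters.set(endpoint, current);
};

export const errorRate = (endpoint: string): number => {
  const current = counters.get(endpoint);
  if (!current || current.total === 0) return 0;
  return current.errors / current.total;
};
```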

Tracing for full request stories

Distributed tracing ties correlation IDs to timing graphs across services. When an incident happens, you can see that:

  • The user clicked "checkout".
  • The API gateway called the order service.
  • The order service called payment and inventory.
  • The payment call was retried twice and the third attempt finally succeeded.

This is also where you can connect application level behavior to infrastructure level details, such as whether a 500 was caused by a query plan regression like the ones discussed in Database Indexing Strategies Every Backend Developer Should Know.
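The underlying data model is simple enough to sketch: each unit of work records a span with its trace ID and start and end timestamps, and the timing graph is reconstructed from those spans. Production systems would use OpenTelemetry or similar rather than this hand-rolled version; the `traced` helper and span shape below are assumptions for illustration:

```typescript
// Toy tracing sketch: records spans under a shared traceId so a timing
// graph can be reconstructed. Illustrative; use OpenTelemetry in practice.
type Span = {
  traceId: string;
  name: string;
  startMs: number;
  endMs: number;
};

export const spans: Span[] = [];

export const traced = async <T>(
  traceId: string,
  name: string,
  operation: () => Promise<T>
): Promise<T> => {
  const startMs = Date.now();
  try {
    return await operation();
  } finally {
    // Record the span even when the operation throws, so failed calls
    // still show up in the trace.
    spans.push({ traceId, name, startMs, endMs: Date.now() });
  }
};
```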

Pattern 7: Graceful degradation and fallbacks

The most mature distributed systems do not simply fail or succeed. They degrade gracefully when dependencies are slow or unavailable. From a user perspective this feels like "it mostly works" even during partial outages instead of "the entire product is on fire."

Examples:

  • If the recommendation service is down, the home page still loads, but without personalized recommendations.
  • If the analytics pipeline is backlogged, user actions are buffered locally and shipped later, similar to patterns in offline capable apps.
  • If an AI based feature is unavailable, the UI falls back to a simpler manual workflow, as you might design in an assistant described in How AI Helps Maintain Code Quality and Reduce Bugs.

Simple fallback example in TypeScript

Imagine a call to a personalization service:

type Recommendation = { productId: string; score: number };

export const getRecommendations = async (
  userId: string
): Promise<Recommendation[]> => {
  try {
    const response = await fetch(
      `https://recommendation.internal/users/${userId}/recommendations`,
      { method: "GET" }
    );
    if (!response.ok) {
      throw new Error(`Status: ${response.status}`);
    }
    return response.json();
  } catch (error) {
    console.warn("Recommendation service unavailable, using fallback", {
      userId,
      error,
    });
    return getTrendingFallback();
  }
};

Here the error is not only handled but turned into a degraded experience that still serves the user. The important part is that fallbacks are explicit, observable with metrics, and do not silently mask long term issues. You still want alerts and investigations when a dependency is down, but the user impact is softened.

Putting the patterns together: a practical rollout plan

If you are looking at an existing codebase and wondering where to start, use an incremental approach instead of trying to refactor everything at once. Distributed systems reward small, careful improvements more than heroic rewrites.

  1. Introduce correlation IDs and structured logs
    • Add correlation ID creation at the edge of your system.
    • Propagate IDs through internal calls.
    • Update logging to include IDs and structured payloads.
  2. Define a shared error envelope
    • Agree on error codes across core services.
    • Implement consistent mapping to HTTP status codes.
    • Update clients and API gateways to use error codes for UX behavior.
  3. Wrap outbound calls with timeouts and retries
    • Set reasonable defaults for timeouts based on your SLOs.
    • Add exponential backoff and jitter using a small utility.
    • Measure the impact on latency and error rates before and after.
  4. Create dead letter queues for all long lived queues
    • Configure maximum retry counts and DLQ destinations.
    • Set up monitoring on DLQ depth and growth rates.
    • Document operational procedures for DLQ triage and replay.
  5. Introduce sagas for multi service workflows
    • Start with your most critical flows, such as payments or provisioning.
    • Model steps and compensations explicitly.
    • Decide whether to use orchestration (central driver) or choreography (event driven).
  6. Invest in tracing and metrics
    • Add tracing libraries to each service.
    • Standardize tags such as service, endpoint, and correlationId.
    • Build dashboards for top level flows that tie back to error rates.

As you do this, lean on architectural guidance from related pieces like Practical Guide to Implementing Clean Architecture in Full-Stack Projects and operational topics in Best Practices for API Versioning and Backward Compatibility. Clean boundaries and evolutionary APIs make it much easier to retrofit error handling patterns without breaking clients.

Conclusion: Building systems that fail well

Error handling in distributed systems is not about avoiding failure. Failure is baked into the environment. Networks will flake, dependencies will go down, partial deploys will happen on Fridays even if everyone swears they will not. What you can control is how predictable and recoverable those failures are.

The goal is to create systems that fail well:

  • You know when something broke.
  • You understand where and why it broke.
  • You can recover or degrade gracefully.
  • You can improve the system based on each incident instead of just patching symptoms.

Correlation IDs, standardized error envelopes, timeouts with retries, sagas, dead letter queues, fallbacks, and strong observability are not competing ideas. They are pieces of the same reliability puzzle. Combined with sound architectural choices from Practical Guide to Implementing Clean Architecture in Full-Stack Projects, performance work such as React Performance and Bundle Size Optimization in 2025, and AI assisted quality practices from How AI Helps Maintain Code Quality and Reduce Bugs, they help you move from "this system usually works" to "this system behaves predictably, even under stress."

Actionable takeaways

  • Add correlation ID middleware this week so that every request gets a traceable identifier across services.
  • Standardize your error shape using a small ServiceError model and shared errorCode values that all teams agree on.
  • Wrap outbound calls with timeouts and a small retry helper that uses exponential backoff and jitter.
  • Configure dead letter queues for all message brokers, monitor their depth, and set up a regular triage process.
  • Model your first saga for a critical business flow, defining both forward steps and compensating actions explicitly.
  • Invest in observability by moving to structured logs, tracking error and latency metrics, and enabling distributed tracing for your top endpoints.