Webhooks are the connective tissue of modern SaaS. Billing providers tell you when invoices fail, auth systems notify you about user lifecycle changes, and AI vendors stream completion events back to your product. When notifications are delayed or dropped, customers see inconsistent state and finance teams lose trust. Reliable delivery means treating incoming notifications as a first-class integration surface, with durability, correctness, and auditability baked in.
We lean on Background Jobs: Queue Patterns for Web Apps for durability, apply the safeguards from Error Handling Patterns in Distributed Systems so retries do not make things worse, reuse the observability discipline in Designing a High Quality Logging Pipeline with Attention to Cost and Structure, and evolve schemas with the restraint in API Versioning and Backward Compatibility.
TL;DR
- Durable queue in front of your webhook processor, persistent retries with backoff, and idempotent handlers that commit once.
- Strong request signing with rotated secrets plus replay protection on timestamps and nonces.
- Delivery state machine with explicit transitions and a poison queue for messages that exceed safe retry limits.
- Observability baked in from the first line: structured logs, metrics, and traceable delivery ids to support incident response.
- Versioned payloads, contract tests, and gradual rollout to avoid breaking consumers while you evolve fields.
What reliable webhooks look like
A reliable webhook platform guarantees four things: messages are durably stored before processing, each notification is delivered at least once without duplication side effects, payloads are authentic and unchanged, and failures are observable with a clear recovery path. Without those, you end up with zombie deliveries, double charges, or silent data drift.
Treat the webhook handler as a specialized integration boundary, not just another route. The mental model mirrors the staging and escalation design in Error Handling Patterns in Distributed Systems, except here your retry budget must tolerate someone else's outages as well as your own.
Reference architecture
Here is a lightweight architecture that scales from a small Next.js backend to a multi-service environment:
Inbound HTTPS Endpoint
|
| Validate signature, schema, and headers
v
Durable Queue (SQS, RabbitMQ, Redis stream)
|
| Delivery tasks with deliveryId and attempts
v
Webhook Processor (worker pool)
|
| Idempotent handler, writes to DB, emits events
v
Success log + metrics
|
| or
v
Retry scheduler with backoff
|
| Max attempts -> Dead letter queue
v
Alerting and dashboards

Two design choices matter most. First, accept and enqueue quickly instead of doing work in the HTTP thread, just like the email and invoicing flows in Background Jobs: Queue Patterns for Web Apps. Second, make processing idempotent so retries cannot corrupt state.
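The accept-and-enqueue step can be sketched as a thin handler that validates headers, assigns a delivery id, and hands off to the queue. This is a minimal sketch: the handler shape is framework-agnostic, and `enqueue` is a hypothetical stand-in for your durable queue client; signature verification is omitted for brevity.

```typescript
import crypto from "crypto";

type EnqueueFn = (task: { deliveryId: string; rawBody: string }) => Promise<void>;

export const makeWebhookEndpoint =
  (enqueue: EnqueueFn) =>
  async (rawBody: string, headers: Record<string, string>) => {
    // Reject obviously malformed requests before touching the queue.
    if (!headers["x-signature"] || !headers["x-timestamp"]) {
      return { status: 400 as const };
    }
    // Assign our own deliveryId so every later log line can reference it.
    const deliveryId = crypto.randomUUID();
    await enqueue({ deliveryId, rawBody });
    // Acknowledge immediately; all real work happens in the worker pool.
    return { status: 202 as const, deliveryId };
  };
```

Returning 202 rather than 200 signals "accepted for processing", which matches the deferred-work semantics; some providers only check for any 2xx, so either works.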
Idempotency first
Assume the sender will retry aggressively whenever it does not see a 2xx response. The only safe way to handle that is to ensure your handler can run multiple times without side effects. The simplest technique is to store an idempotency key derived from the sender's unique event id plus your endpoint path.
import { db } from "./db";

type WebhookEvent = {
  id: string; // provider event id
  type: string;
  payload: unknown;
};

export const handleInvoiceEvent = async (event: WebhookEvent) => {
  // Idempotency key scopes the provider event id to this handler.
  const key = `invoice:${event.id}`;
  const existing = await db
    .selectFrom("webhook_receipts")
    .select(["idempotency_key", "status"])
    .where("idempotency_key", "=", key)
    .executeTakeFirst();
  if (existing?.status === "completed") {
    // A previous attempt finished; skip without touching domain tables.
    return { deduped: true };
  }
  // Receipt insert and domain update commit together, so a retry either
  // sees a completed receipt or re-runs the whole unit of work.
  await db.transaction().execute(async (trx) => {
    await trx
      .insertInto("webhook_receipts")
      .values({ idempotency_key: key, status: "processing" })
      .onConflict((oc) => oc.doNothing())
      .execute();
    await trx
      .updateTable("invoices")
      .set({ status: "failed" })
      .where("external_event_id", "=", event.id)
      .execute();
    await trx
      .updateTable("webhook_receipts")
      .set({ status: "completed" })
      .where("idempotency_key", "=", key)
      .execute();
  });
  return { deduped: false };
};

Commit the receipt and domain update in the same transaction so retries cannot partially apply changes, just as Error Handling Patterns in Distributed Systems recommends. Store the idempotency key and status so later retries short-circuit without touching domain tables.
Request signing and replay protection
Authenticating webhook payloads protects against spoofing and accidental cross environment sends. A common pattern is HMAC signing with a shared secret per endpoint. Verify the signature and a recent timestamp, and include a nonce to block replays.
import crypto from "crypto";

const SECRET = process.env.WEBHOOK_SECRET ?? "";
const ALLOWED_DRIFT_MS = 5 * 60 * 1000;

export const verifySignature = (
  rawBody: string,
  providedSignature: string,
  providedTimestamp: string,
  nonce: string
) => {
  const timestampMs = Number(providedTimestamp);
  if (!Number.isFinite(timestampMs)) {
    throw new Error("invalid timestamp");
  }
  if (Math.abs(Date.now() - timestampMs) > ALLOWED_DRIFT_MS) {
    throw new Error("timestamp too old");
  }
  // Sign timestamp, nonce, and raw body together so none can be swapped.
  const baseString = `${timestampMs}.${nonce}.${rawBody}`;
  const computed = crypto
    .createHmac("sha256", SECRET)
    .update(baseString)
    .digest("hex");
  const computedBuf = Buffer.from(computed);
  const providedBuf = Buffer.from(providedSignature);
  // timingSafeEqual throws on mismatched lengths, so compare lengths first.
  if (
    computedBuf.length !== providedBuf.length ||
    !crypto.timingSafeEqual(computedBuf, providedBuf)
  ) {
    throw new Error("signature mismatch");
  }
};

Require a unique nonce per request and store it for a short TTL to prevent replay. Rotate secrets regularly and roll endpoints when a secret is exposed. Record verification failures with the structured-logging discipline from Designing a High Quality Logging Pipeline with Attention to Cost and Structure so you keep forensic detail while storage costs stay predictable.
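For the nonce check, a small store with a TTL is enough. This is an in-memory sketch suitable for a single process; behind multiple nodes you would likely use a shared store such as Redis with SET NX EX. `registerNonce` returns false when the nonce was already seen within the TTL.

```typescript
// In-memory nonce store: maps each accepted nonce to its expiry time.
// Single-process sketch only; not shared across workers or restarts.
const NONCE_TTL_MS = 5 * 60 * 1000;
const seenNonces = new Map<string, number>();

export const registerNonce = (nonce: string, now = Date.now()): boolean => {
  // Evict expired entries so the map stays bounded over time.
  for (const [key, expiry] of seenNonces) {
    if (expiry <= now) seenNonces.delete(key);
  }
  if (seenNonces.has(nonce)) {
    return false; // replay: this nonce was already accepted
  }
  seenNonces.set(nonce, now + NONCE_TTL_MS);
  return true;
};
```

The TTL should comfortably exceed the allowed timestamp drift, so a request that passes the timestamp check cannot be replayed after its nonce has been evicted.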
Delivery state machine
Model delivery attempts explicitly so you can reason about behavior under stress. A simple state machine is pending -> processing -> succeeded | failed -> dead letter. Store attempt counts, last error, and last attempt timestamp.
type DeliveryState = "pending" | "processing" | "succeeded" | "failed" | "dead_letter";

type DeliveryRecord = {
  deliveryId: string;
  state: DeliveryState;
  attempts: number;
  nextAttemptAt: Date;
  lastError?: string;
};

Workers should atomically claim a pending delivery, mark it processing, execute the handler, then mark it succeeded or failed. When attempts exceed your budget, move the record to a dead letter queue for manual review. That gives you the observability hooks you need later when you follow the incident review habits from Designing a High Quality Logging Pipeline with Attention to Cost and Structure.
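The transitions themselves can be encoded as data so illegal moves fail fast. This is a minimal in-memory sketch of the state machine, independent of any particular store; the state type is re-declared so the snippet stands alone.

```typescript
type State = "pending" | "processing" | "succeeded" | "failed" | "dead_letter";

// Legal moves only; anything else indicates a bug or a lost update.
const TRANSITIONS: Record<State, State[]> = {
  pending: ["processing"],
  processing: ["succeeded", "failed"],
  failed: ["pending", "dead_letter"], // retry re-queues; exhausted budget dead-letters
  succeeded: [],
  dead_letter: [],
};

export const transition = (from: State, to: State): State => {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
};
```

In SQL the claim step is typically a conditional update along the lines of setting state to processing only where the row is still pending, so two workers can never claim the same delivery.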
Backoff and retry strategy
Use exponential backoff with jitter so that bursts of upstream failures do not hammer the provider. A common schedule is 1s, 5s, 30s, 5m, 30m, 2h, 12h. Persist the schedule with each delivery so that restarts do not reset attempts.
const BACKOFF_MS = [1000, 5000, 30000, 300000, 1800000, 7200000, 43200000];

const nextBackoff = (attempts: number) => {
  // Cap the index so attempts past the schedule reuse the longest delay.
  const index = Math.min(attempts, BACKOFF_MS.length - 1);
  const base = BACKOFF_MS[index];
  // Up to 25% jitter spreads retries so failures do not resynchronize.
  const jitter = Math.floor(Math.random() * 0.25 * base);
  return base + jitter;
};

Persisting the retry schedule aligns with the resilience mindset in Background Jobs: Queue Patterns for Web Apps, where jobs survive process restarts. Pair it with circuit breakers from Error Handling Patterns in Distributed Systems so you can pause processing when a provider is returning consistent 5xx responses.
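A circuit breaker for this purpose can be very small. The sketch below trips after a run of consecutive failures and re-allows a probe after a cooldown; a production breaker would add half-open bookkeeping so a single probe failure re-opens immediately. The class name and thresholds are illustrative.

```typescript
// Minimal consecutive-failure breaker: open after `threshold` failures
// in a row, closed again once `cooldownMs` has elapsed.
export class SimpleBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(private threshold = 5, private cooldownMs = 60_000) {}

  canAttempt(now = Date.now()): boolean {
    if (this.openedAt === null) return true;
    if (now - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: reset and allow a probe attempt.
      this.openedAt = null;
      this.failures = 0;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(now = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}
```

Workers check `canAttempt` before claiming a delivery; while the breaker is open, deliveries simply stay pending and pick up again after the cooldown.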
Observability and cost
Every delivery should carry a deliveryId, provider eventId, and traceId. Emit structured logs on enqueue, claim, attempt, success, and failure, and capture latency plus response codes. Reuse the JSON schema and field naming approach from Designing a High Quality Logging Pipeline with Attention to Cost and Structure to keep queries fast and storage predictable.
Track success rate, P99 end-to-end latency, retry distribution, and dead letter volume. Add traces if you already instrumented your services, and propagate the same trace id across the enqueue and worker paths. That will save hours during incidents and pairs nicely with the structured error enrichment recommended in Error Handling Patterns in Distributed Systems.
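To make the log contract concrete, here is a sketch of one possible event shape. The field names are assumptions modeled on common JSON logging conventions, not a fixed schema; adapt them to whatever your pipeline already indexes.

```typescript
// One JSON object per line so log pipelines can index fields directly.
type DeliveryLog = {
  ts: string;
  level: "info" | "error";
  event: "enqueue" | "claim" | "attempt" | "success" | "failure";
  deliveryId: string;
  providerEventId: string;
  traceId: string;
  attempt?: number;
  latencyMs?: number;
  responseCode?: number;
};

export const logDelivery = (entry: Omit<DeliveryLog, "ts">): string => {
  const line = JSON.stringify({ ts: new Date().toISOString(), ...entry });
  console.log(line);
  return line;
};
```

Because `deliveryId`, `providerEventId`, and `traceId` appear on every event, a single query can reconstruct the full lifecycle of any delivery during an incident.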
Versioning payloads
Providers evolve fields, and your consumers will need to handle both old and new shapes. Add a webhook_version header and include a schemaVersion field in the body. Validate against a schema per version, and keep at least two versions live during migrations.
Roll out changes gradually using the same mindset as API Versioning and Backward Compatibility. Start by adding new optional fields, then mark old fields deprecated, and only remove them once your consumers confirm readiness. Document versions in your developer portal and set up contract tests.
Contract testing example with Zod:
import { z } from "zod";

const schemas = {
  "1": z.object({ schemaVersion: z.literal("1"), eventId: z.string(), type: z.string() }),
  "2": z.object({
    schemaVersion: z.literal("2"),
    eventId: z.string(),
    type: z.string(),
    payload: z.object({ customerId: z.string(), status: z.string(), planTier: z.string().optional() }),
  }),
};

export const parseWebhook = (body: unknown) => {
  const version = (body as { schemaVersion?: string }).schemaVersion ?? "1";
  const schema = schemas[version as keyof typeof schemas];
  if (!schema) {
    // Parsing an unknown version with an old schema would misreport the
    // failure; reject it explicitly instead.
    throw new Error(`unsupported schemaVersion: ${version}`);
  }
  return schema.parse(body);
};

Keep schemas validated in CI and notify downstream teams before switching defaults. The discipline you apply here mirrors the rollout patterns in API Versioning and Backward Compatibility.
Testing and chaos drills
Test the entire pipeline, not just the handler. Simulate retries, signature failures, and poison message paths. Add fixtures that mimic provider payloads and verify idempotency keys behave as expected. Run load tests to make sure your queue and worker pool keep up.
Schedule a small chaos drill that blocks outbound network for a few minutes. Confirm retries back off, dashboards light up, and dead letters are populated. This proactive exercise follows the resilience checks in Background Jobs: Queue Patterns for Web Apps and the recovery stages outlined in Error Handling Patterns in Distributed Systems.
Security hardening
Restrict source IP ranges when providers support it, and place webhook endpoints on a separate subdomain with stricter rate limits. Encrypt payloads that contain PII. Alert on signature failures and keep a runbook for rapid secret rotation.
Operations playbook
Incidents will happen. Prepare a playbook so on-call engineers have crisp steps:
- Check dashboards for spikes in failures or latency.
- Inspect the dead letter queue for repeating patterns and pause processing if retries are amplifying an upstream outage.
- Verify upstream status pages and your own dependency monitors.
- Backfill deliveries by replaying from the queue, then file a short incident summary with links to logs and metrics.
Postmortems should capture delivery ids, timeline, and remediation items. Use structured reports similar to the log centric reviews in Designing a High Quality Logging Pipeline with Attention to Cost and Structure so you can trend incident types over time.
Actionable next steps
If you want to harden your webhook handling over the next sprint, follow this checklist:
- Add signature verification with timestamp and nonce checks to every endpoint.
- Introduce a durable queue in front of your handler and move business logic into a worker.
- Implement idempotency receipts stored transactionally with domain updates.
- Persist a delivery state machine with backoff and a dead letter queue.
- Instrument enqueue, claim, attempt, success, and failure with structured logs and metrics.
- Version your payload schema and add contract tests for at least two live versions.
- Schedule a chaos drill to simulate upstream outages and verify your dashboards and alerts.
Do these and your webhooks will survive provider outages, restarts, and customer growth without corrupting data or flooding your support queue. The result is a quiet, predictable integration surface that behaves like the rest of your core infrastructure rather than an afterthought.
