Most teams start with request IDs, then stop just before tracing starts paying off. You log a correlation ID, pass it in headers, and feel covered until a real incident spans API routes, background workers, and third-party calls that fail in different ways over thirty minutes. At that point, logs alone are necessary but not sufficient.
If you already apply the structure from Designing a High Quality Logging Pipeline with Attention to Cost and Structure and the resilience mindset from Error Handling Patterns in Distributed Systems, distributed tracing is the next practical step. Tracing connects each hop into one timeline so on-call engineers can answer what happened, where time was spent, and which retry or fallback changed user outcomes.
This guide focuses on production TypeScript systems with Next.js and worker services. We will extend asynchronous workflows from Patterns for Background Jobs and Queues in Web Apps, harden inbound and outbound event paths similar to Building Reliable Webhook Delivery: Idempotency, Signatures, and Retries That Survive Incidents, and keep instrumentation boundaries aligned with Practical Guide to Implementing Clean Architecture in Full-Stack Projects.
TL;DR
- Treat tracing as an operational feature, not a debugging add-on.
- Initialize OpenTelemetry once at process startup, before app code imports HTTP or database clients.
- Propagate context explicitly over HTTP, queues, and scheduled jobs so async work stays in the same trace.
- Use stable span names and low-cardinality attributes that are queryable during incidents.
- Combine head-based sampling for cost control with targeted always-on sampling for high-risk flows.
- Roll tracing out by service boundary and critical workflow, then enforce it in code review.
Why request IDs alone stop short
Request IDs answer one question: "Which logs belong together?" Traces answer three harder questions: "Which hop was slow?", "Which dependency retried?", and "Which fallback actually returned the response?" In modern systems those questions decide whether incidents last ten minutes or two hours.
A single checkout can pass through an App Router endpoint, a billing adapter, an outbox write, a queue worker, and a webhook callback later. Request IDs can correlate logs, but they do not preserve parent-child timing relationships automatically. That is why teams under pressure still end up stitching raw log lines manually.
Tracing closes this gap by creating spans for each unit of work and preserving relationships between them. You can still keep your correlation IDs from Error Handling Patterns in Distributed Systems, but now each service contributes timing and status data to one trace graph instead of one flat list.
The tracing model that matters in production
You do not need every OpenTelemetry concept to ship value. You need a small model applied consistently:
- trace: the full story for one logical operation.
- span: one operation inside that story.
- attributes: indexed metadata used for filtering.
- events: timestamped details attached to a span.
- links: references to related spans when parent-child is not enough.
Use this model to represent real system boundaries:
```text
Trace: checkout.submit
  Span: http POST /api/checkout
    Span: db insert orders
    Span: outbox publish payment.requested
  Span: worker process payment.requested
    Span: http POST billing-provider/charge
  Span: webhook invoice.failed handler
```

This timeline mirrors how your platform actually behaves, including async hops. It also aligns with the asynchronous decomposition patterns in Patterns for Background Jobs and Queues in Web Apps, where work is intentionally split across queues and workers.
Reference architecture for Next.js plus workers
Keep the architecture explicit before writing instrumentation code:
```text
Client
  |
  v
Next.js Route Handler
  |
  +--> PostgreSQL (domain write + outbox)
  |
  +--> Response to client
  |
  v
Outbox Relay -> Queue -> Worker Service -> External Provider
  |
  v
Webhook Endpoint
```

OpenTelemetry should cover each boundary:
- Inbound HTTP request span in Next.js.
- Child spans for database and outbound HTTP calls.
- Context propagation into outbox payloads and queue messages.
- Worker span that continues the same trace.
- Webhook ingestion span linked to prior outbound delivery attempts.
That last point is critical for event ecosystems. If your platform already depends on the delivery discipline in Building Reliable Webhook Delivery: Idempotency, Signatures, and Retries That Survive Incidents, tracing gives you the missing end-to-end visibility for retries and delayed callbacks.
Step 1: Initialize OpenTelemetry early and once
Instrumentation has to load before your framework and libraries create clients. If you initialize after imports, you get partial traces and false confidence.
```typescript
// src/observability/tracing.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { resourceFromAttributes } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME, ATTR_DEPLOYMENT_ENVIRONMENT } from "@opentelemetry/semantic-conventions";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

const createSdk = (): NodeSDK => {
  return new NodeSDK({
    resource: resourceFromAttributes({
      [ATTR_SERVICE_NAME]: process.env.OTEL_SERVICE_NAME ?? "web-api",
      [ATTR_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV ?? "development",
      "service.version": process.env.APP_VERSION ?? "unknown",
    }),
    traceExporter: new OTLPTraceExporter({
      url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT,
    }),
    sampler: new ParentBasedSampler({
      root: new TraceIdRatioBasedSampler(Number(process.env.OTEL_SAMPLING_RATE ?? "0.1")),
    }),
    instrumentations: [getNodeAutoInstrumentations()],
  });
};

const sdk = createSdk();
void sdk.start();
```

For Next.js, load this in your startup hook or instrumentation entrypoint. For worker services, load it before queue consumers start polling.
Step 2: Propagate context across HTTP calls
Auto instrumentation covers many libraries, but teams still break traces with custom fetch wrappers. Make propagation explicit at the adapter boundary and keep it testable.
```typescript
// src/infrastructure/http/httpClient.ts
import { context, propagation, trace } from "@opentelemetry/api";

type JsonValue = Record<string, unknown> | unknown[];

export const postJson = async (params: {
  url: string;
  body: JsonValue;
  headers?: Record<string, string>;
}): Promise<Response> => {
  const span = trace.getTracer("http-client").startSpan("http.post_json");
  try {
    return await context.with(trace.setSpan(context.active(), span), async () => {
      const outboundHeaders: Record<string, string> = {
        "content-type": "application/json",
        ...(params.headers ?? {}),
      };
      propagation.inject(context.active(), outboundHeaders);
      const response = await fetch(params.url, {
        method: "POST",
        headers: outboundHeaders,
        body: JSON.stringify(params.body),
      });
      span.setAttribute("http.status_code", response.status);
      return response;
    });
  } catch (error) {
    span.recordException(error as Error);
    span.setAttribute("error", true);
    throw error;
  } finally {
    span.end();
  }
};
```

This keeps failure details consistent with the error contracts you already standardize in Error Handling Patterns in Distributed Systems, while preserving distributed trace context for downstream services.
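It helps to know what `propagation.inject` actually writes: with the default OpenTelemetry propagator, it is a W3C Trace Context `traceparent` header. To make the wire format concrete, here is an illustrative parser (`parseTraceparent` is a hypothetical helper for tests and debugging, not part of any OpenTelemetry package):

```typescript
// A W3C traceparent header has four dash-separated fields:
//   version "00" - 32-hex trace-id - 16-hex span-id - 2-hex flags.
type TraceparentParts = {
  traceId: string;
  spanId: string;
  sampled: boolean;
};

export const parseTraceparent = (header: string): TraceparentParts | null => {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (match === null) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero trace or span IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
};
```

A helper like this is handy in integration tests that assert outbound requests carry valid context.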
Step 3: Propagate context through queues and workers
HTTP propagation is only half the story. Most trace gaps happen in async pipelines. For queue-backed systems, store propagation headers with each message so workers can extract and continue the trace.
```typescript
// src/queues/enqueue.ts
import { context, propagation } from "@opentelemetry/api";

type QueueMessage<TPayload> = {
  type: string;
  payload: TPayload;
  traceContext: Record<string, string>;
};

export const buildMessage = <TPayload>(params: {
  type: string;
  payload: TPayload;
}): QueueMessage<TPayload> => {
  const traceContext: Record<string, string> = {};
  propagation.inject(context.active(), traceContext);
  return { type: params.type, payload: params.payload, traceContext };
};
```

```typescript
// src/workers/processMessage.ts
import { context, propagation, trace } from "@opentelemetry/api";

type QueueMessage<TPayload> = {
  type: string;
  payload: TPayload;
  traceContext: Record<string, string>;
};

export const processMessage = async <TPayload>(message: QueueMessage<TPayload>): Promise<void> => {
  const extracted = propagation.extract(context.active(), message.traceContext);
  const tracer = trace.getTracer("worker");
  await context.with(extracted, async () => {
    const span = tracer.startSpan(`worker.${message.type}`);
    try {
      // domain logic
    } catch (error) {
      span.recordException(error as Error);
      span.setAttribute("error", true);
      throw error;
    } finally {
      span.end();
    }
  });
};
```

This is the same async boundary where reliability patterns from Patterns for Background Jobs and Queues in Web Apps and replay-safe processing from Building Reliable Webhook Delivery: Idempotency, Signatures, and Retries That Survive Incidents already matter. Tracing simply makes those behaviors visible and measurable.
Step 4: Design span names and attributes for queryability
Low quality traces usually come from poor naming and high-cardinality attributes. If your team cannot filter traces quickly during incidents, instrumentation exists but observability does not.
Use these rules:
- Span names are operation-oriented, not user-specific.
- Attributes use stable keys and bounded value sets.
- IDs that explode cardinality stay in logs, not indexed attributes.
- Every critical span has status and error information.
Recommended attribute schema:
```text
service.name
deployment.environment
http.method
http.route
http.status_code
db.system
db.operation
messaging.system
messaging.destination
tenant.id (if multi-tenant and bounded)
```

This schema complements structured log fields from Designing a High Quality Logging Pipeline with Attention to Cost and Structure. Logs keep rich payload detail, while traces keep timeline and bounded dimensions for fast filtering.
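One way to enforce this schema at the adapter boundary is a small allowlist filter applied before attributes reach a span. The key names follow the schema above; the helper itself and its 64-character bound are illustrative assumptions:

```typescript
// Illustrative guard: only allowlisted, bounded attribute keys reach spans;
// everything else belongs in logs.
const ALLOWED_SPAN_ATTRIBUTES = new Set([
  "service.name",
  "deployment.environment",
  "http.method",
  "http.route",
  "http.status_code",
  "db.system",
  "db.operation",
  "messaging.system",
  "messaging.destination",
  "tenant.id",
]);

export const filterSpanAttributes = (
  candidate: Record<string, string | number | boolean>,
): Record<string, string | number | boolean> => {
  const safe: Record<string, string | number | boolean> = {};
  for (const [key, value] of Object.entries(candidate)) {
    if (!ALLOWED_SPAN_ATTRIBUTES.has(key)) continue; // drop unknown keys
    // Reject long string values; unbounded strings usually mean raw IDs or payloads.
    if (typeof value === "string" && value.length > 64) continue;
    safe[key] = value;
  }
  return safe;
};
```

Running every manual `setAttribute` call through a filter like this turns the naming convention from a code-review hope into an enforced invariant.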
Step 5: Add business spans and links for retries and fan-out
Auto spans for HTTP and DB are useful but not enough for business debugging. Add manual spans around domain operations where outcomes matter to users and revenue.
```typescript
import { context, trace, SpanStatusCode } from "@opentelemetry/api";

type ChargeInput = { orderId: string; customerId: string; amountCents: number };

export const chargeOrder = async (input: ChargeInput): Promise<void> => {
  const tracer = trace.getTracer("billing-usecase");
  const span = tracer.startSpan("billing.charge_order", {
    attributes: {
      "order.id": input.orderId,
      "customer.id": input.customerId,
      "payment.amount_cents": input.amountCents,
    },
  });
  try {
    await context.with(trace.setSpan(context.active(), span), async () => {
      // call provider and persist result
    });
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: "charge failed" });
    throw error;
  } finally {
    span.end();
  }
};
```

For retries, emit span events like retry_scheduled and retry_exhausted. For fan-out operations, use span links so separate traces can still be related during root-cause analysis.
Sampling and cost control without blind spots
Tracing cost can climb fast at high throughput. The answer is not to disable tracing. The answer is deliberate sampling policy plus retention tiers, the same cost discipline outlined in Designing a High Quality Logging Pipeline with Attention to Cost and Structure.
A practical baseline:
- 100% sampling for checkout, billing, auth, and provisioning flows.
- 10% sampling for standard CRUD endpoints.
- 1% sampling for health checks and noisy background scans.
- Temporary 100% overrides during active incidents.
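The baseline above can be encoded as a small rate chooser that feeds whatever root sampler you configure. The route prefixes here are examples, not a complete inventory of your endpoints:

```typescript
// Illustrative head-sampling policy: map a route to a sampling ratio.
const ALWAYS_ON_PREFIXES = ["/api/checkout", "/api/billing", "/api/auth", "/api/provision"];
const NOISE_PREFIXES = ["/healthz", "/api/internal/scan"];

export const samplingRateForRoute = (route: string): number => {
  if (ALWAYS_ON_PREFIXES.some((prefix) => route.startsWith(prefix))) return 1.0; // critical flows
  if (NOISE_PREFIXES.some((prefix) => route.startsWith(prefix))) return 0.01; // health checks, scans
  return 0.1; // standard CRUD baseline
};
```

Keeping the policy in one pure function makes the incident-time "temporary 100% override" a one-line, reviewable change.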
Collector policy example:
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline-random
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Tail sampling lets you keep all error traces and unusually slow traces while still controlling volume for normal traffic.
Privacy, tenancy, and compliance guardrails
Tracing can accidentally become a data leak if you treat spans like debug dumps. Build explicit guardrails before broad rollout:
- Never store raw request bodies, access tokens, or payment metadata in span attributes.
- Hash or tokenize sensitive IDs if traces are visible outside core platform teams.
- Restrict query access by environment and role.
- Apply short retention for high-sensitivity services.
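The "hash or tokenize sensitive IDs" guardrail can be sketched as a keyed one-way digest: the same raw ID always maps to the same token, so operators can still filter by it, but the raw value never reaches the trace backend. The `tokenizeId` helper and the idea of a dedicated telemetry hashing secret are assumptions for illustration:

```typescript
import { createHmac } from "node:crypto";

// Illustrative tokenizer: a keyed HMAC so raw customer or tenant IDs never
// appear in span attributes. Truncating to 16 hex chars keeps attributes
// compact; the key should be a deployment secret, not hard-coded.
export const tokenizeId = (rawId: string, key: string): string =>
  createHmac("sha256", key).update(rawId).digest("hex").slice(0, 16);
```

Usage would look like `span.setAttribute("customer.id", tokenizeId(input.customerId, telemetryKey))`, where `telemetryKey` comes from your secret store. Using an HMAC rather than a bare hash prevents anyone with trace access from confirming a guessed ID by hashing it themselves.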
For multi-tenant platforms, attach tenant context carefully and keep values bounded. This gives operators enough context to triage noisy neighbors without introducing unbounded cardinality or exposing sensitive tenant details in every trace view. If your tenancy controls follow strict boundaries, keep them consistent with the isolation and observability discipline in Practical Guide to Implementing Clean Architecture in Full-Stack Projects so instrumentation ownership does not drift between layers.
A practical pattern is split observability:
- Traces keep operational metadata and timing.
- Logs keep rich payload details under stricter access controls.
- Audit events track who queried or exported telemetry data.
This structure aligns with the cost and schema discipline in Designing a High Quality Logging Pipeline with Attention to Cost and Structure, where not every field belongs in every telemetry channel.
Incident workflow: traces that lower MTTR
During incidents, traces should support a repeatable process:
- Start from the user-facing failure and identify one trace ID.
- Inspect the span tree to find the first failing or slow boundary.
- Pivot to logs using span IDs for rich payload details.
- Confirm whether retries, circuit breakers, or fallbacks changed outcomes.
- Capture the trace in postmortem artifacts.
This operational loop works best when it mirrors your existing runbooks from Error Handling Patterns in Distributed Systems and webhook replay playbooks from Building Reliable Webhook Delivery: Idempotency, Signatures, and Retries That Survive Incidents.
A concrete example: if invoice webhooks are delayed, traces can show whether delay comes from queue depth, worker concurrency, external provider latency, or repeated signature validation failures. You can then apply the right fix instead of tuning the wrong component.
Keep instrumentation aligned with architecture boundaries
Tracing gets messy when every layer creates spans arbitrarily. Keep ownership clear:
- Interface layer: inbound request spans and response status.
- Application layer: business operation spans.
- Infrastructure adapters: DB, HTTP, cache, and queue spans.
- Domain layer: no tracing API calls.
This layering keeps observability concerns consistent with Practical Guide to Implementing Clean Architecture in Full-Stack Projects, where boundaries stay explicit and testable. You can unit test domain rules without telemetry mocks, and integration-test adapters where instrumentation actually lives.
Testing and rollout strategy
Rollout fails when teams instrument everything at once and verify nothing. Use phased delivery:
- Pick one critical workflow end to end.
- Add startup initialization and basic HTTP spans.
- Add queue propagation for one worker path.
- Validate trace continuity in staging with synthetic traffic.
- Add alerting for missing trace context rates.
- Expand service by service.
Verification checks that should pass before production:
- 95%+ of incoming requests create root spans.
- Critical workflows preserve trace context through worker hops.
- Error traces include exception events and status codes.
- Trace IDs can be correlated with structured logs.
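The missing-context checks above are easy to automate. A sketch of a verification helper that samples queue messages (shaped like the QueueMessage type from Step 3) and computes the fraction that lost trace context; the helper name and the `traceparent` key check are illustrative:

```typescript
// Illustrative check for "critical workflows preserve trace context":
// given sampled messages, how many are missing a usable traceparent?
type SampledMessage = { traceContext: Record<string, string> };

export const missingContextRate = (messages: SampledMessage[]): number => {
  if (messages.length === 0) return 0;
  const missing = messages.filter(
    (m) => !m.traceContext || typeof m.traceContext.traceparent !== "string",
  ).length;
  return missing / messages.length;
};
```

Wiring this into a staging smoke test or a periodic job gives you the "alerting for missing trace context rates" step as a number you can threshold, rather than something discovered during an outage.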
When this becomes standard practice, include instrumentation requirements in design docs for new adapters and use cases, the same way you already enforce layering from Practical Guide to Implementing Clean Architecture in Full-Stack Projects.
A practical 30-day rollout plan
If your team is busy and observability work often gets postponed, commit to milestones with clear outcomes:
Week 1
- Ship SDK bootstrap in one API service.
- Verify traces appear in staging with service metadata and route names.
- Add a short runbook for finding one failed request by trace ID.
Week 2
- Propagate context across one queue and one worker path.
- Add span events for retries and dead-letter transitions.
- Validate that retry-heavy paths can be diagnosed without manual log stitching.
Week 3
- Apply sampling policies for critical versus non-critical routes.
- Add dashboards for trace error rate, missing-context rate, and slow-span percentiles.
- Run one synthetic failure drill and record MTTR improvements.
Week 4
- Expand instrumentation to one more domain workflow.
- Add trace standards to code review checklists and architecture docs.
- Lock retention and data access policy with security and platform teams.
By the end of month one, you should be able to trace at least one mission-critical flow from HTTP entry through asynchronous processing and webhook callbacks, with enough consistency to support postmortems and release gates.
Common mistakes that quietly break trace quality
- Initializing SDK after app imports, which causes partial instrumentation.
- Recording user emails or raw payloads as indexed attributes, which creates privacy and cardinality issues.
- Naming spans with dynamic IDs, which destroys aggregation.
- Forgetting propagation in queue payloads, which splits traces at async boundaries.
- Sampling too aggressively on error-prone services, which hides incidents.
- Treating tracing UI as the source of truth and ignoring logs for payload-level context.
Most of these failures show up weeks later during the first major outage. Catch them early with automated checks and explicit conventions.
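The dynamic-ID mistake in particular is cheap to guard against at the instrumentation boundary. A sketch of a span-name normalizer that collapses IDs into placeholders before `startSpan` is called; the two regex patterns cover UUIDs and numeric path segments and would need extending for your own ID formats:

```typescript
// Illustrative normalizer: collapse identifiers in span names so traces
// aggregate under a stable name instead of one name per entity.
export const normalizeSpanName = (name: string): string =>
  name
    // UUIDs anywhere in the name
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "{id}")
    // numeric path segments like /orders/12345
    .replace(/\/\d+(?=\/|$)/g, "/{id}");
```

Running every manual span name through a function like this, and asserting on it in unit tests, catches the aggregation-destroying names in review instead of in your trace backend's bill.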
Actionable next steps
If you want production-ready tracing in the next sprint, execute this checklist:
- Bootstrap OpenTelemetry at process start for your web API and one worker service.
- Instrument one high-value workflow end to end, including queue propagation.
- Define and document span naming plus attribute conventions.
- Configure sampling so all errors and critical flows are always retained.
- Add runbook steps that pivot from traces to logs and back.
- Require trace propagation in every new async integration and webhook consumer.
Distributed tracing is not another dashboard for your stack. It is the connective tissue between architecture, reliability, and incident response. When traces, logs, and error contracts reinforce each other, your team spends less time guessing and more time fixing the right boundary first.
