
Designing a High Quality Logging Pipeline with Attention to Cost and Structure

A practical engineering guide to building production-grade logging pipelines that balance observability needs with cost efficiency, covering architecture patterns, structured logging best practices, and real-world trade-offs for mid-level developers.

Every production system needs logging. Yet somehow, logging is simultaneously the most obvious requirement and the most misunderstood aspect of building scalable applications. Junior developers treat logs like printf debugging that accidentally made it to production. Senior developers know that logs are the primary diagnostic tool when things go wrong at 3 AM. The difference between a logging system that saves you in an outage and one that drowns you in noise comes down to thoughtful design decisions about structure, volume, cost, and query patterns.

A high quality logging pipeline is not about capturing everything. It is about capturing the right things, in the right format, at the right time, and making that data accessible when you need it without bankrupting your infrastructure budget. This becomes especially tricky at scale. What works when you are processing 100 requests per second falls apart at 10,000. What makes sense when you have three engineers looking at logs breaks down when you have thirty. And what was affordable when you were venture funded becomes unsustainable when you need to show a path to profitability.

If you are a mid-level developer who has shipped features to production and watched logs accumulate, but never really designed a logging strategy from scratch, this guide is for you. We will walk through the architecture of a production logging pipeline, explore structured logging patterns, discuss cost trade-offs, and look at practical code examples in TypeScript and Node.js that you can adapt to your stack. For broader context on maintaining system health, see our guide on Error Handling Patterns in Distributed Systems. If you are wrestling with performance issues, start with Optimize React Apps Performance.

TL;DR

  • Structured logging is non-negotiable at any scale beyond toy projects. JSON is your friend.
  • Log levels matter, but most teams use them wrong. INFO is not a dumping ground.
  • Cost scales with volume, so instrument smartly and sample aggressively in high-throughput paths.
  • Searchability is everything. Design your log structure around the queries you will actually run during incidents.
  • Correlation IDs are the thread that ties distributed traces together. Generate them early and pass them everywhere.

What makes a logging pipeline high quality

Think of your logging pipeline as a diagnostic nervous system for your application. When something goes wrong, logs are how you reconstruct what happened. But unlike a literal nervous system, you get to design exactly what signals to capture and how to organize them. A high quality pipeline balances four competing concerns: completeness, cost, performance, and usability.

Completeness means capturing enough context to diagnose problems without having to deploy new instrumentation and wait for the issue to recur. This sounds simple until you realize that the information you need during an incident is never the information you thought to log beforehand. The trick is to log liberally at decision points, state transitions, and boundaries between systems, but not so liberally that you overwhelm your storage or your engineers.

Cost is the constraint that most tutorials ignore. Logging vendors charge per gigabyte ingested, per gigabyte stored, and per query. At a few requests per second, this is negligible. At scale, logging can be one of your top five infrastructure costs. The difference between a well-designed pipeline and a poorly designed one can be thousands of dollars per month, or tens of thousands at serious scale.

Performance means your logging layer cannot be the bottleneck. If writing logs blocks your request handling, you have failed. Logs should be async, batched, and fast enough that you can instrument hot paths without worrying about latency. This is easier said than done when you are dealing with structured data, serialization overhead, and network calls to a remote logging service.

Usability is about making logs searchable when you are debugging an incident at 2 AM with half your team asleep and customers complaining. This means consistent structure, meaningful field names, and query patterns that do not require remembering magic incantations. If your team cannot answer "show me all failed payment processing requests for customer X in the last hour" in under 30 seconds, your logging pipeline is not usable enough.

These four concerns pull in different directions. Completeness and usability push you toward logging more. Cost and performance push you toward logging less. Great logging design is finding the equilibrium where you capture what matters, discard what does not, and structure everything so that queries are fast and cheap.

The anatomy of a production logging pipeline

Before we dive into how to structure individual log entries, let us zoom out and look at the components of a full logging pipeline. At a high level, you have four layers: generation, collection, storage, and querying.

The generation layer is your application code. This is where log entries are created, typically through a logging library or framework. In Node.js, this might be Winston, Pino, or Bunyan. In Python, it is the standard logging module. In Go, it is zap or logrus. The key responsibility here is to emit structured log entries with enough context and to do so without blocking the request handling thread.

The collection layer aggregates logs from multiple sources and forwards them to storage. In a distributed system, this might be a sidecar agent like Fluent Bit or Vector running alongside your application containers, or it might be a centralized collector like Logstash or Fluentd. The collection layer handles batching, buffering, and retries so that your application does not have to. It also handles pre-processing like adding metadata, filtering out noise, or sampling high-volume streams.

The storage layer is where logs live long-term. This could be a managed service like Datadog, Grafana Loki, or AWS CloudWatch, or it could be self-hosted infrastructure like Elasticsearch, ClickHouse, or S3 with Athena on top. The storage layer needs to handle high write throughput, compress data efficiently, and support fast queries on recent data while keeping older data accessible but cheap.

The querying layer is how you interact with logs. This might be a web UI like Kibana or Grafana, a command line tool like kubectl logs, or an API that lets you programmatically retrieve logs for analysis. The querying layer needs to be fast enough for interactive debugging, expressive enough to filter and aggregate on arbitrary fields, and resilient enough that it does not fall over during incidents when everyone is querying logs simultaneously.

Here is a simplified architecture for a Next.js application deployed on Kubernetes:

Next.js App (Pod)
   └──► Application logs to stdout (JSON)
          │
          ▼
Fluent Bit (Sidecar)
   ├──► Parses JSON, adds k8s metadata
   └──► Samples high-volume streams
          │
          ▼
Grafana Loki / Datadog / CloudWatch
   ├──► Indexes on select fields
   └──► Compresses and stores
          │
          ▼
Grafana UI / Datadog UI
   └──► Query logs by traceId, userId, etc.

This separation of concerns means your application code stays simple. You just emit structured logs to stdout. The collection layer handles everything else. This also makes it easy to swap out storage backends without changing application code, which is valuable when you are comparing costs between vendors or migrating to a new observability stack.

For insights on managing infrastructure complexity, see Deploy Next.js on VPS and Edge Functions and Serverless in 2025.

Structured logging: why JSON is your friend

The single most important decision you will make about your logging pipeline is to use structured logging from day one. Structured logging means emitting logs as JSON objects with typed fields, not as freeform strings that you have to parse later with regex. This is not a matter of preference or style. It is a technical requirement for building a queryable logging system.

Consider the difference between these two log entries:

Unstructured: "User john@example.com failed login from IP 192.168.1.1"

Structured:
{
  "timestamp": "2025-12-28T14:32:10.543Z",
  "level": "warn",
  "message": "Login failed",
  "userId": "user_abc123",
  "email": "john@example.com",
  "ip": "192.168.1.1",
  "reason": "invalid_password"
}

With the unstructured log, if you want to find all failed logins from a specific IP, you have to write a regex that parses the string. This is slow, error-prone, and breaks the moment someone changes the log message format. With the structured log, you just query for ip="192.168.1.1" AND level="warn" AND message="Login failed". Modern log storage systems can index on these fields, making queries orders of magnitude faster.
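To make this concrete, here is a minimal sketch of the kind of filtering a log backend performs over structured output, written in plain TypeScript (the field names match the example above; the helper names are illustrative):

```typescript
// Structured logs are typically newline-delimited JSON: one object per line.
type ParsedLog = Record<string, unknown>;

const parseLogLines = (raw: string): ParsedLog[] =>
  raw
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as ParsedLog);

// The equivalent of: ip="192.168.1.1" AND level="warn" AND message="Login failed"
const findFailedLogins = (logs: ParsedLog[], ip: string): ParsedLog[] =>
  logs.filter(
    (log) =>
      log.ip === ip && log.level === "warn" && log.message === "Login failed"
  );

const raw = [
  '{"level":"warn","message":"Login failed","ip":"192.168.1.1","email":"john@example.com"}',
  '{"level":"info","message":"Login succeeded","ip":"10.0.0.5","email":"jane@example.com"}',
].join("\n");

const matches = findFailedLogins(parseLogLines(raw), "192.168.1.1");
// One match, found by field equality, with no regex and no string parsing.
```

A real backend indexes these fields ahead of time instead of scanning, but the query model is the same: exact matches on typed fields, not pattern matching on prose.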

Structured logging also makes it easy to add context. Want to know which API endpoint triggered this log? Add a path field. Want to track this request across services? Add a traceId field. Want to group logs by deployment version? Add a version field. With unstructured logs, adding fields means changing string templates and updating parsers. With structured logs, you just add key-value pairs.

Here is a practical logger setup in TypeScript using Pino, one of the fastest JSON loggers for Node.js:

import pino from "pino";

/**
 * Creates a production-ready logger with sensible defaults.
 * Logs are emitted as JSON to stdout for collection by log aggregators.
 */
export const createLogger = (serviceName: string) => {
  return pino({
    name: serviceName,
    level: process.env.LOG_LEVEL || "info",
    formatters: {
      level: (label) => {
        return { level: label };
      },
    },
    timestamp: pino.stdTimeFunctions.isoTime,
    base: {
      service: serviceName,
      version: process.env.APP_VERSION || "unknown",
      environment: process.env.NODE_ENV || "development",
    },
  });
};

const logger = createLogger("api-service");

// Usage in application code
export const handleLogin = async (req: Request) => {
  const { email, password } = await req.json();
  const traceId = req.headers.get("x-trace-id") || crypto.randomUUID();

  logger.info({
    traceId,
    email,
    action: "login_attempt",
  });

  try {
    const user = await authenticateUser(email, password);
    
    logger.info({
      traceId,
      userId: user.id,
      email,
      action: "login_success",
    });

    return Response.json({ success: true });
  } catch (error) {
    logger.warn({
      traceId,
      email,
      action: "login_failed",
      reason: error instanceof Error ? error.message : "unknown",
    });

    return Response.json({ success: false }, { status: 401 });
  }
};

Notice how every log entry includes a traceId. This is critical for distributed systems. When a request flows through multiple services, the trace ID is the thread that lets you stitch together logs from different services to understand the full lifecycle of the request. Generate the trace ID at the edge of your system (in your API gateway or the first service that receives the request) and pass it through headers to downstream services.

For patterns on handling distributed request flows, see Error Handling Patterns in Distributed Systems.

Log levels: using them correctly

Most logging libraries support multiple log levels: DEBUG, INFO, WARN, ERROR, and sometimes FATAL or TRACE. In theory, these levels let you control verbosity. In practice, most teams use them inconsistently, leading to logs that are either too noisy or not informative enough.

Here is a practical framework for thinking about log levels. DEBUG is for information that is only useful when you are actively debugging a specific issue. This includes intermediate values in complex calculations, loop iterations, or cache hit rates. In production, you typically filter out DEBUG logs to save on volume and cost.

INFO is for significant events in the normal flow of the application. A user logged in. A payment was processed. A cron job completed. These are not errors, but they are important enough that you want a record of them happening. INFO logs should be rare enough that you can skim them during an incident and get a sense of what the system was doing.

WARN is for unexpected situations that are not errors but might indicate a problem. A retry succeeded after an initial failure. A feature flag defaulted to a fallback value. A deprecated API was called. WARN logs are candidates for alerting, but not critical enough to page someone immediately.

ERROR is for actual failures. A request returned a 500. A database query timed out. A file could not be opened. ERROR logs should always include enough context to understand what went wrong and how to reproduce it. These are the logs you will be searching through during incidents.

FATAL is for unrecoverable errors that cause the application to shut down. Out of memory. Database connection pool exhausted. Configuration file missing. FATAL logs should always trigger alerts, because they mean your service is down.

The mistake most teams make is logging everything at INFO or DEBUG. This leads to massive log volume and high costs. A better approach is to log sparingly at INFO, generously at WARN and ERROR, and almost never at DEBUG in production. Use feature flags or dynamic log levels to enable DEBUG logging for specific users or sessions when you need to debug a production issue.

Here is how to implement dynamic log levels in a Next.js API:

import pino from "pino";

const baseLogger = createLogger("api-service"); // from the earlier Pino setup

/**
 * Creates a child logger with a dynamic log level based on request headers.
 * Allows enabling debug logs for specific requests without changing global config.
 * Note: in recent Pino versions the level belongs in the child logger's options
 * argument; putting it in the bindings would just add a "level" field to the
 * output without changing verbosity.
 */
export const getRequestLogger = (req: Request) => {
  const debugEnabled = req.headers.get("x-enable-debug") === "true";

  const bindings = {
    traceId: req.headers.get("x-trace-id") || crypto.randomUUID(),
  };

  return debugEnabled
    ? baseLogger.child(bindings, { level: "debug" })
    : baseLogger.child(bindings);
};

// Usage in route handlers
export const POST = async (req: Request) => {
  const logger = getRequestLogger(req);

  logger.debug("Parsing request body");
  const body = await req.json();

  logger.info({ action: "request_received", path: req.url });

  // ... rest of handler
};

Now you can enable debug logging for a specific request by passing x-enable-debug: true in the headers, without flooding your logs with debug output from all requests.

Controlling costs: sampling and retention strategies

Logging costs money. At small scale, this is negligible. At large scale, it can be one of your biggest infrastructure expenses. The math is brutal: if you process 1,000 requests per second and each request generates 5 log entries at 500 bytes each, you are producing 2.5 MB per second, or 216 GB per day. At typical vendor pricing of $0.10 per GB ingested and $0.02 per GB stored per month, that is $21.60 per day in ingestion costs alone, or about $650 per month. And that is just for one service at modest scale.
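The arithmetic above is worth encoding so you can plug in your own traffic and pricing numbers. A back-of-envelope sketch (the rates here are the illustrative figures from this section, not any specific vendor's pricing):

```typescript
// Back-of-envelope logging cost model. All prices are illustrative.
const estimateDailyLogCost = (params: {
  requestsPerSecond: number;
  entriesPerRequest: number;
  bytesPerEntry: number;
  ingestPricePerGb: number;
}) => {
  const bytesPerSecond =
    params.requestsPerSecond * params.entriesPerRequest * params.bytesPerEntry;
  const gbPerDay = (bytesPerSecond * 86_400) / 1e9; // decimal GB
  const ingestCostPerDay = gbPerDay * params.ingestPricePerGb;
  return { gbPerDay, ingestCostPerDay };
};

const { gbPerDay, ingestCostPerDay } = estimateDailyLogCost({
  requestsPerSecond: 1_000,
  entriesPerRequest: 5,
  bytesPerEntry: 500,
  ingestPricePerGb: 0.1,
});
// 216 GB/day and about $21.60/day in ingestion, matching the numbers above.
```

Running this with your actual request rates is a fast way to sanity-check a vendor quote before committing.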

The solution is not to stop logging. It is to be selective about what you log and for how long you keep it. There are four levers you can pull: sampling, filtering, compression, and retention.

Sampling means only logging a fraction of events. For high-volume, low-value logs like successful health check requests or cache hits, you might sample at 1% or 0.1%. For critical events like payment processing or authentication, you log everything. Modern logging libraries support probabilistic sampling based on log level or custom logic.

Here is a sampling middleware for Next.js:

import { NextRequest, NextResponse } from "next/server";

const HEALTH_CHECK_PATHS = ["/health", "/ping", "/readiness"];
const SAMPLE_RATE = 0.01; // 1% sampling for health checks

export const middleware = (req: NextRequest) => {
  const isHealthCheck = HEALTH_CHECK_PATHS.some((path) =>
    req.nextUrl.pathname.startsWith(path)
  );

  if (isHealthCheck && Math.random() > SAMPLE_RATE) {
    // Skip logging for ~99% of health checks. Note that mutating req.headers
    // directly does not reach route handlers; forward a modified copy instead.
    const requestHeaders = new Headers(req.headers);
    requestHeaders.set("x-skip-logs", "true");

    return NextResponse.next({ request: { headers: requestHeaders } });
  }

  return NextResponse.next();
};

// In your logger utility
export const shouldSkipLogs = (req: Request): boolean => {
  return req.headers.get("x-skip-logs") === "true";
};

Filtering means dropping logs that you know you will never query. If you have a microservice that logs every cache access and you have never once needed to debug cache behavior, stop logging it. Be ruthless about cutting logs that do not provide value. Review your most expensive log streams quarterly and ask whether they are worth the cost.
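Filtering can live in the collection layer, but you can also drop known-noise entries before they ever leave the process. A minimal sketch of an emit-time filter (the deny list and field names are illustrative, not a standard):

```typescript
type FilterableEntry = { level: string; action?: string; [key: string]: unknown };

// Actions we have decided are never worth paying to ingest.
const DROPPED_ACTIONS = new Set(["cache_hit", "cache_miss"]);

const shouldEmit = (entry: FilterableEntry): boolean => {
  // Never drop warnings or errors, regardless of action.
  if (["warn", "error", "fatal"].includes(entry.level)) {
    return true;
  }
  return !(entry.action && DROPPED_ACTIONS.has(entry.action));
};

const emit = (entry: FilterableEntry) => {
  if (shouldEmit(entry)) {
    process.stdout.write(JSON.stringify(entry) + "\n");
  }
};
```

The important design choice is the level guard: a deny list should only ever suppress routine logs, never the WARN and ERROR entries you will need during an incident.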

Compression is mostly handled by your storage layer, but you can help by keeping log entries small. Avoid logging large payloads or deeply nested objects. If you need to log a request body, truncate it. If you need to log an array, log the length instead of the contents. Every byte you do not log is a byte you do not pay to store.
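Keeping entries small is easy to enforce with a helper applied at the logging call site. A sketch of the truncation rules just described (the length limit and marker string are illustrative):

```typescript
const MAX_STRING_LENGTH = 256;

// Shrinks a value for logging: long strings are truncated,
// arrays are replaced by their length.
const compactForLog = (value: unknown): unknown => {
  if (typeof value === "string" && value.length > MAX_STRING_LENGTH) {
    return value.slice(0, MAX_STRING_LENGTH) + "...[truncated]";
  }
  if (Array.isArray(value)) {
    return { arrayLength: value.length };
  }
  return value;
};

const compactFields = (fields: Record<string, unknown>): Record<string, unknown> =>
  Object.fromEntries(
    Object.entries(fields).map(([key, value]) => [key, compactForLog(value)])
  );

// Usage: logger.info(compactFields({ body: requestBody, items: order.items }));
```

If you later find you genuinely need a full payload, log it behind the same dynamic debug mechanism shown earlier rather than paying for it on every request.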

Retention is about how long you keep logs. Most incidents are debugged within a few days. Keeping full-fidelity logs for more than 30 days is usually overkill. A common pattern is hot storage for 7 days, warm storage for 30 days, and cold storage or deletion after 90 days. Hot storage is fast and expensive, warm storage is slower and cheaper, cold storage is archival and very cheap.

Here is a retention policy for CloudWatch Logs using the AWS CDK:

import * as logs from "aws-cdk-lib/aws-logs";
import * as s3 from "aws-cdk-lib/aws-s3";
import * as cdk from "aws-cdk-lib";

export const createLogGroup = (scope: cdk.Stack, serviceName: string) => {
  return new logs.LogGroup(scope, `${serviceName}-logs`, {
    logGroupName: `/aws/ecs/${serviceName}`,
    retention: logs.RetentionDays.ONE_WEEK, // Hot storage: 7 days
    removalPolicy: cdk.RemovalPolicy.DESTROY,
  });
};

// For long-term storage, export to S3 with lifecycle policies
export const createLogArchive = (scope: cdk.Stack) => {
  const bucket = new s3.Bucket(scope, "log-archive", {
    lifecycleRules: [
      {
        transitions: [
          {
            storageClass: s3.StorageClass.GLACIER,
            transitionAfter: cdk.Duration.days(90), // Move to cold storage after 90 days
          },
        ],
        expiration: cdk.Duration.days(365), // Delete after 1 year
      },
    ],
  });

  return bucket;
};

This gives you fast access to recent logs while keeping long-term costs down. For patterns on cost-effective infrastructure, see Deploy Next.js on VPS.

Query patterns: designing logs for searchability

Logs are only useful if you can find what you need during an incident. This means designing your log structure around the queries you will actually run. The most common query patterns are: find logs for a specific user, find logs for a specific request, find errors in a specific service, and find logs matching a specific event.

To support these queries, you need consistent field names across all services. Establish a logging schema that defines required and optional fields. Required fields might include timestamp, level, service, traceId, and message. Optional fields depend on the context, but common ones are userId, sessionId, requestId, path, method, statusCode, latencyMs, and error.

Here is a TypeScript type definition for a standard log entry:

type LogLevel = "debug" | "info" | "warn" | "error" | "fatal";

type BaseLogEntry = {
  timestamp: string;
  level: LogLevel;
  service: string;
  environment: string;
  version: string;
  message: string;
  traceId: string;
};

type HttpLogEntry = BaseLogEntry & {
  type: "http";
  method: string;
  path: string;
  statusCode: number;
  latencyMs: number;
  userId?: string;
  ip?: string;
  userAgent?: string;
};

type DatabaseLogEntry = BaseLogEntry & {
  type: "database";
  query: string;
  durationMs: number;
  rowsAffected?: number;
  error?: string;
};

type LogEntry = HttpLogEntry | DatabaseLogEntry;

/**
 * Helper function to create HTTP log entries with consistent structure.
 */
export const createHttpLog = (params: {
  level: LogLevel;
  message: string;
  traceId: string;
  method: string;
  path: string;
  statusCode: number;
  latencyMs: number;
  userId?: string;
  ip?: string;
}): HttpLogEntry => {
  return {
    timestamp: new Date().toISOString(),
    service: process.env.SERVICE_NAME || "unknown",
    environment: process.env.NODE_ENV || "development",
    version: process.env.APP_VERSION || "unknown",
    type: "http",
    ...params,
  };
};

With this structure, you can write queries like "find all HTTP logs where statusCode >= 500 and service = api-service and timestamp > now() - 1h". This is fast because your log storage can index on these fields.

Another critical pattern is tagging logs with business context. If you are building an e-commerce platform, tag logs with orderId and customerId. If you are building a SaaS product, tag with tenantId and subscriptionTier. This lets you answer questions like "did customer X experience errors today" without having to correlate logs across multiple systems.
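The same child-logger pattern used for trace IDs works for business context: bind the fields once at the start of a flow and every subsequent entry carries them. A self-contained sketch of the idea (the orderId and tenantId field names are examples, not a fixed schema; real code would use Pino's child() the same way):

```typescript
type Fields = Record<string, unknown>;

// A minimal structured logger: every call merges the bound context
// into the emitted entry, so business fields appear on all related logs.
const makeLogger = (context: Fields, sink: (entry: Fields) => void) => ({
  child: (extra: Fields) => makeLogger({ ...context, ...extra }, sink),
  info: (fields: Fields) =>
    sink({
      timestamp: new Date().toISOString(),
      level: "info",
      ...context,
      ...fields,
    }),
});

const entries: Fields[] = [];
const root = makeLogger({ service: "checkout" }, (entry) => entries.push(entry));

// Bind business context once at the start of the flow...
const orderLogger = root.child({ orderId: "order_123", tenantId: "tenant_9" });

// ...and every entry carries it automatically.
orderLogger.info({ action: "payment_authorized" });
orderLogger.info({ action: "inventory_reserved" });
```

Because the context rides along on every entry, "did tenant_9 see errors today" becomes a single field filter instead of a cross-system correlation exercise.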

For context on building systems that scale, see Database Indexing Strategies for Backend Developers and Clean Architecture for Fullstack Applications.

Correlation IDs and distributed tracing

In a distributed system, a single user request might touch five or ten services. Without a way to correlate logs across those services, debugging becomes nearly impossible. This is where correlation IDs and distributed tracing come in.

A correlation ID, also called a trace ID or request ID, is a unique identifier generated at the edge of your system and passed through every service that handles the request. Each service logs the trace ID with every log entry, making it trivial to find all logs related to a specific request.

The pattern is simple: generate a UUID when a request enters your system, store it in a request header, and extract it in every service. Here is how to implement this in a Next.js middleware:

import { NextRequest, NextResponse } from "next/server";

const TRACE_ID_HEADER = "x-trace-id";

export const middleware = (req: NextRequest) => {
  let traceId = req.headers.get(TRACE_ID_HEADER);

  if (!traceId) {
    traceId = crypto.randomUUID();
  }

  const requestHeaders = new Headers(req.headers);
  requestHeaders.set(TRACE_ID_HEADER, traceId);

  const response = NextResponse.next({
    request: {
      headers: requestHeaders,
    },
  });

  response.headers.set(TRACE_ID_HEADER, traceId);

  return response;
};

Then in your route handlers and utilities, extract the trace ID and include it in all logs:

export const getTraceId = (req: Request): string => {
  return req.headers.get(TRACE_ID_HEADER) || "unknown";
};

export const POST = async (req: Request) => {
  const traceId = getTraceId(req);
  const logger = baseLogger.child({ traceId });

  logger.info({ action: "payment_started" });

  const result = await processPayment(req, traceId);

  logger.info({ action: "payment_completed", result });

  return Response.json(result);
};

const processPayment = async (req: Request, traceId: string) => {
  // When calling downstream services, pass the trace ID
  const response = await fetch("https://payment-service/charge", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      [TRACE_ID_HEADER]: traceId,
    },
    body: JSON.stringify({}),
  });

  return response.json();
};

Now when you investigate a failed payment, you can search for traceId="abc123" and see every log entry from every service involved in processing that request. This is invaluable for debugging timeouts, race conditions, and other distributed system failures.

For advanced use cases, consider integrating with a distributed tracing system like OpenTelemetry, Jaeger, or Zipkin. These tools go beyond correlation IDs and capture timing information for each span of work, giving you a visual timeline of where time was spent during a request.

Actionable takeaways

If you take nothing else from this article, remember these three things:

  1. Switch to structured logging today. Even if you do nothing else, emitting logs as JSON instead of strings will make your future self grateful. Use a library like Pino or Winston, define a schema for your log entries, and enforce it in code review.

  2. Implement trace IDs this week. Add middleware to generate or extract a trace ID from request headers, pass it to all logging calls, and propagate it to downstream services. This single change will save you hours of debugging distributed failures.

  3. Review your logging costs monthly. Set up billing alerts, identify your most expensive log streams, and ask whether they are worth the cost. Implement sampling for high-volume low-value logs, and set aggressive retention policies for logs you rarely query.

Building a high quality logging pipeline is not glamorous work, but it is foundational to running reliable systems at scale. You do not need exotic tools or a massive platform team. You just need structured logs, consistent field names, correlation IDs, and a thoughtful approach to cost and retention. Start small, measure what you are spending, and iterate based on what queries you actually run during incidents. Over time, your logging pipeline will evolve from a cost center into a diagnostic superpower that lets you understand and fix production issues faster than your competitors.