Observability with OpenTelemetry: Monitoring Microservices in Production

Implement complete observability in your microservices with OpenTelemetry, Prometheus, and Grafana. Learn to configure distributed traces, custom metrics, and log correlation for production debugging.

This is the fourth article in our microservices series. If you haven't read the previous articles, check out the microservices guide, API Gateway with Kong, and messaging with RabbitMQ.

Why Observability?

Debugging a distributed system is much harder than debugging a monolith. A single request passes through multiple services, each with its own logs, metrics, and state. Without proper observability, finding the root cause of a problem is like looking for a needle in a haystack.

The Three Pillars of Observability

┌───────────────────────────────────────────────────────────┐
│                       OBSERVABILITY                       │
├───────────────────┬───────────────────┬───────────────────┤
│      TRACES       │     METRICS       │       LOGS        │
│                   │                   │                   │
│  ┌─────────────┐  │  ┌─────────────┐  │  ┌─────────────┐  │
│  │ Distributed │  │  │  Counters   │  │  │  Structured │  │
│  │  Request    │  │  │  Histograms │  │  │    JSON     │  │
│  │   Latency   │  │  │   Gauges    │  │  │   Context   │  │
│  │   Errors    │  │  │  Percentiles│  │  │   TraceID   │  │
│  └─────────────┘  │  └─────────────┘  │  └─────────────┘  │
│                   │                   │                   │
│  "What happened   │  "How is the      │  "Why did it      │
│  in this request?"│  system behaving?"│   happen?"        │
│                   │                   │                   │
└───────────────────┴───────────────────┴───────────────────┘

OpenTelemetry: The Industry Standard

OpenTelemetry (OTel) is a CNCF project that provides APIs, SDKs, and tools to collect telemetry (traces, metrics, and logs) in a standardized and vendor-neutral way.

OpenTelemetry Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                         APPLICATION                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐     │
│  │   Auto-instr.   │  │  Manual-instr.  │  │    Baggage      │     │
│  │  (HTTP, gRPC)   │  │   (Custom)      │  │   (Context)     │     │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘     │
│           │                    │                    │               │
│           └────────────────────┼────────────────────┘               │
│                                ▼                                    │
│                    ┌─────────────────────┐                         │
│                    │   OTel SDK          │                         │
│                    │  ┌───────────────┐  │                         │
│                    │  │   Processor   │  │                         │
│                    │  │   Sampler     │  │                         │
│                    │  │   Exporter    │  │                         │
│                    │  └───────────────┘  │                         │
│                    └──────────┬──────────┘                         │
└───────────────────────────────┼─────────────────────────────────────┘
                                │
                                ▼
                    ┌─────────────────────┐
                    │   OTel Collector    │
                    │  ┌───────────────┐  │
                    │  │   Receivers   │──┼──► OTLP, Jaeger, Zipkin
                    │  │   Processors  │──┼──► Batch, Filter, Transform
                    │  │   Exporters   │──┼──► Jaeger, Prometheus, Loki
                    │  └───────────────┘  │
                    └──────────┬──────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
      ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
      │   Jaeger    │  │ Prometheus  │  │    Loki     │
      │   (Traces)  │  │  (Metrics)  │  │   (Logs)    │
      └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
             │                │                │
             └────────────────┼────────────────┘
                              ▼
                      ┌─────────────┐
                      │   Grafana   │
                      │ (Dashboard) │
                      └─────────────┘
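
The Collector pipeline above lists a Batch processor between receivers and exporters. To make the idea concrete, here is a minimal sketch of size-based batching. `BatchBuffer`, `maxBatchSize`, and `exportBatch` are hypothetical names chosen for this illustration; the real Collector is written in Go and also flushes on a timer.

```typescript
// Hypothetical in-memory batcher: illustrates the idea only, not the Collector's code.
class BatchBuffer<T> {
  private pending: T[] = [];
  private readonly maxBatchSize: number;
  private readonly exportBatch: (batch: T[]) => void;

  constructor(maxBatchSize: number, exportBatch: (batch: T[]) => void) {
    this.maxBatchSize = maxBatchSize;
    this.exportBatch = exportBatch;
  }

  // Buffer one item and flush automatically once the batch is full.
  add(item: T): void {
    this.pending.push(item);
    if (this.pending.length >= this.maxBatchSize) {
      this.flush();
    }
  }

  // Send whatever is buffered (a real processor would also do this on a timeout).
  flush(): void {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.exportBatch(batch);
  }
}

// Usage: five items with a batch size of two produce three export calls.
const exportedBatches: string[][] = [];
const buffer = new BatchBuffer<string>(2, (batch) => {
  exportedBatches.push(batch);
});
for (const name of ['a', 'b', 'c', 'd', 'e']) buffer.add(name);
buffer.flush(); // flush the final partial batch
```

Batching trades a little latency for far fewer network calls, which is why it usually sits in front of every exporter.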

Project Structure

observability-service/
├── src/
│   ├── instrumentation/
│   │   ├── index.ts              # Main OTel setup
│   │   ├── tracing.ts            # Trace configuration
│   │   ├── metrics.ts            # Metrics configuration
│   │   └── logging.ts            # Log configuration
│   ├── middleware/
│   │   ├── request-context.ts    # Request context
│   │   ├── metrics.middleware.ts # HTTP metrics
│   │   └── logging.middleware.ts # Structured logs
│   ├── utils/
│   │   ├── trace-context.ts      # Trace utilities
│   │   ├── custom-metrics.ts     # Custom metrics
│   │   └── log-formatter.ts      # Log formatting
│   ├── exporters/
│   │   ├── jaeger.ts             # Jaeger exporter
│   │   ├── prometheus.ts         # Prometheus exporter
│   │   └── loki.ts               # Loki exporter
│   └── app.ts
├── docker/
│   ├── otel-collector-config.yaml
│   ├── prometheus.yml
│   ├── loki-config.yaml
│   └── grafana/
│       └── dashboards/
│           └── microservices.json
├── docker-compose.observability.yml
└── package.json

OpenTelemetry SDK Configuration

Installation

bash
# Core OpenTelemetry
npm install @opentelemetry/api @opentelemetry/sdk-node

# Auto-instrumentation
npm install @opentelemetry/auto-instrumentations-node

# Exporters
npm install @opentelemetry/exporter-trace-otlp-http
npm install @opentelemetry/exporter-metrics-otlp-http
npm install @opentelemetry/exporter-logs-otlp-http

# Resources and semantics
npm install @opentelemetry/resources
npm install @opentelemetry/semantic-conventions

# Imported directly by the setup and logging code in this article
npm install @opentelemetry/sdk-metrics @opentelemetry/sdk-logs @opentelemetry/api-logs

Main Setup

typescript
// src/instrumentation/index.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-http';
import { Resource } from '@opentelemetry/resources';
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
  SEMRESATTRS_DEPLOYMENT_ENVIRONMENT,
} from '@opentelemetry/semantic-conventions';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';

// Configure diagnostics for debugging
if (process.env.OTEL_DEBUG === 'true') {
  diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);
}

// Resource configuration (identifies the service)
const resource = new Resource({
  [SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown-service',
  [SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
  [SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
  'service.instance.id': process.env.HOSTNAME || 'local',
  'service.namespace': 'microservices',
});

// Base OTLP/HTTP endpoint. Without a fallback, an unset variable would
// produce URLs such as "undefined/v1/traces".
const otlpEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318';

// Exporter configuration
const traceExporter = new OTLPTraceExporter({
  url: `${otlpEndpoint}/v1/traces`,
  headers: {
    'x-api-key': process.env.OTEL_API_KEY || '',
  },
});

const metricExporter = new OTLPMetricExporter({
  url: `${otlpEndpoint}/v1/metrics`,
});

const logExporter = new OTLPLogExporter({
  url: `${otlpEndpoint}/v1/logs`,
});

// SDK configuration
const sdk = new NodeSDK({
  resource,
  traceExporter,
  metricReader: new PeriodicExportingMetricReader({
    exporter: metricExporter,
    exportIntervalMillis: 15000, // Export metrics every 15s
  }),
  logRecordProcessor: new BatchLogRecordProcessor(logExporter),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Specific configuration per instrumentation
      '@opentelemetry/instrumentation-http': {
        requestHook: (span, request) => {
          // Only incoming requests (IncomingMessage) expose `headers`;
          // an outgoing ClientRequest does not
          if ('headers' in request) {
            const requestId = request.headers['x-request-id'];
            if (requestId) span.setAttribute('http.request.id', requestId);
          }
        },
        responseHook: (span, response) => {
          // Only responses to outgoing calls (IncomingMessage) expose `headers`
          if ('headers' in response) {
            const length = response.headers['content-length'];
            if (length !== undefined) {
              span.setAttribute('http.response.content_length', Number(length));
            }
          }
        },
        ignoreIncomingRequestHook: (request) => {
          // Ignore health checks
          return request.url === '/health' || request.url === '/ready';
        },
      },
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-pg': {
        enhancedDatabaseReporting: true,
      },
      '@opentelemetry/instrumentation-redis': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-amqplib': {
        enabled: true, // RabbitMQ
      },
    }),
  ],
});

// Initialization
export async function initTelemetry(): Promise<void> {
  try {
    await sdk.start();
    console.log('OpenTelemetry initialized successfully');

    // Graceful shutdown: flush pending telemetry before the process exits
    process.on('SIGTERM', async () => {
      try {
        await sdk.shutdown();
        console.log('OpenTelemetry shut down successfully');
      } catch (error) {
        console.error('Error shutting down OpenTelemetry', error);
      }
    });
  } catch (error) {
    console.error('Error initializing OpenTelemetry', error);
    throw error;
  }
}

export { sdk };

Application Entry Point

typescript
// src/index.ts
import { initTelemetry } from './instrumentation';

// IMPORTANT: Initialize telemetry first!
async function bootstrap() {
  await initTelemetry();

  // Now import the rest of the application, so that auto-instrumentation
  // can patch modules such as http and express before they are loaded
  const { createApp } = await import('./app');
  const app = await createApp();

  const port = process.env.PORT || 3000;
  app.listen(port, () => {
    console.log(`Server running on port ${port}`);
  });
}

bootstrap().catch(console.error);

Distributed Tracing

Distributed tracing lets you follow a single request as it crosses service boundaries: every service records spans that share the same trace ID.

Fundamental Concepts

┌─────────────────────────────────────────────────────────────────┐
│                          TRACE                                  │
│  TraceID: abc123                                                 │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ SPAN: API Gateway (Root Span)                              │ │
│  │ SpanID: span-1, ParentID: null                             │ │
│  │ Duration: 250ms                                             │ │
│  │ ┌────────────────────────────────────────────────────────┐ │ │
│  │ │ SPAN: User Service                                     │ │ │
│  │ │ SpanID: span-2, ParentID: span-1                       │ │ │
│  │ │ Duration: 50ms                                          │ │ │
│  │ └────────────────────────────────────────────────────────┘ │ │
│  │ ┌────────────────────────────────────────────────────────┐ │ │
│  │ │ SPAN: Order Service                                    │ │ │
│  │ │ SpanID: span-3, ParentID: span-1                       │ │ │
│  │ │ Duration: 150ms                                         │ │ │
│  │ │ ┌────────────────────────────────────────────────────┐ │ │ │
│  │ │ │ SPAN: Database Query                               │ │ │ │
│  │ │ │ SpanID: span-4, ParentID: span-3                   │ │ │ │
│  │ │ │ Duration: 45ms                                      │ │ │ │
│  │ │ └────────────────────────────────────────────────────┘ │ │ │
│  │ │ ┌────────────────────────────────────────────────────┐ │ │ │
│  │ │ │ SPAN: RabbitMQ Publish                             │ │ │ │
│  │ │ │ SpanID: span-5, ParentID: span-3                   │ │ │ │
│  │ │ │ Duration: 10ms                                      │ │ │ │
│  │ │ └────────────────────────────────────────────────────┘ │ │ │
│  │ └────────────────────────────────────────────────────────┘ │ │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
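
The parent/child structure in the diagram is just data: each span carries its own ID and its parent's ID. As a small illustration, the sketch below models the five spans from the diagram and computes each span's self time (its duration minus the time spent in direct children). `SpanRecord` and `selfTimes` are hypothetical names for this example, and the calculation assumes children run sequentially without overlapping.

```typescript
interface SpanRecord {
  spanId: string;
  parentId: string | null;
  name: string;
  durationMs: number;
}

// The spans from the diagram.
const spans: SpanRecord[] = [
  { spanId: 'span-1', parentId: null, name: 'API Gateway', durationMs: 250 },
  { spanId: 'span-2', parentId: 'span-1', name: 'User Service', durationMs: 50 },
  { spanId: 'span-3', parentId: 'span-1', name: 'Order Service', durationMs: 150 },
  { spanId: 'span-4', parentId: 'span-3', name: 'Database Query', durationMs: 45 },
  { spanId: 'span-5', parentId: 'span-3', name: 'RabbitMQ Publish', durationMs: 10 },
];

// Self time = span duration minus the duration of its direct children.
// Assumption: children do not overlap in time.
function selfTimes(all: SpanRecord[]): Map<string, number> {
  const result = new Map<string, number>();
  for (const span of all) result.set(span.spanId, span.durationMs);
  for (const span of all) {
    if (span.parentId !== null && result.has(span.parentId)) {
      result.set(span.parentId, (result.get(span.parentId) ?? 0) - span.durationMs);
    }
  }
  return result;
}

const selfTimeBySpan = selfTimes(spans);
// span-1: 250 - (50 + 150) = 50 ms spent in the gateway itself
// span-3: 150 - (45 + 10)  = 95 ms spent in the Order Service itself
```

Self time is often the quickest way to see which service actually owns the latency, since a slow root span may simply be waiting on its children.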

Manual Span Instrumentation

typescript
// src/utils/trace-context.ts
import { trace, SpanStatusCode, context, propagation } from '@opentelemetry/api';
import type { Span, SpanOptions, Context } from '@opentelemetry/api';

const tracer = trace.getTracer('microservice-tracer', '1.0.0');

// Decorator for automatic tracing
export function Traced(
  spanName?: string,
  options?: SpanOptions
): MethodDecorator {
  return function (
    target: any,
    propertyKey: string | symbol,
    descriptor: PropertyDescriptor
  ) {
    const originalMethod = descriptor.value;
    const name = spanName || `${target.constructor.name}.${String(propertyKey)}`;

    descriptor.value = async function (...args: any[]) {
      return tracer.startActiveSpan(name, options || {}, async (span: Span) => {
        try {
          // Add parameters as attributes (careful with sensitive data!)
          span.setAttribute('method.arguments.count', args.length);

          const result = await originalMethod.apply(this, args);

          span.setStatus({ code: SpanStatusCode.OK });
          return result;
        } catch (error) {
          span.setStatus({
            code: SpanStatusCode.ERROR,
            message: error instanceof Error ? error.message : 'Unknown error',
          });
          span.recordException(error as Error);
          throw error;
        } finally {
          // Always end the span, on success and on failure
          span.end();
        }
      });
    };

    return descriptor;
  };
}

// Create span manually
export function createSpan<T>(
  name: string,
  fn: (span: Span) => Promise<T>,
  options?: SpanOptions
): Promise<T> {
  return tracer.startActiveSpan(name, options || {}, async (span) => {
    try {
      const result = await fn(span);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : 'Unknown error',
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Extract/inject context for propagation
export function extractContext(headers: Record<string, string>): Context {
  return propagation.extract(context.active(), headers);
}

export function injectContext(headers: Record<string, string>): void {
  propagation.inject(context.active(), headers);
}

// Add events to a span
export function addSpanEvent(
  eventName: string,
  attributes?: Record<string, string | number | boolean>
): void {
  const span = trace.getActiveSpan();
  if (span) {
    span.addEvent(eventName, attributes);
  }
}

// Get current trace ID
export function getCurrentTraceId(): string | undefined {
  const span = trace.getActiveSpan();
  return span?.spanContext().traceId;
}

// Get current span ID
export function getCurrentSpanId(): string | undefined {
  const span = trace.getActiveSpan();
  return span?.spanContext().spanId;
}

Usage in Services

typescript
// src/services/order.service.ts
import { Traced, createSpan, addSpanEvent } from '../utils/trace-context';
import { SpanKind } from '@opentelemetry/api';

// CreateOrderDTO, Order, orderRepository, messagePublisher, validateStock, and
// validatePayment are assumed to be defined elsewhere in the service.
export class OrderService {
  @Traced('OrderService.createOrder', { kind: SpanKind.INTERNAL })
  async createOrder(orderData: CreateOrderDTO): Promise<Order> {
    addSpanEvent('order.validation.started');

    // Validation
    await this.validateOrder(orderData);
    addSpanEvent('order.validation.completed');

    // Create child span for specific operation
    const order = await createSpan('order.save', async (span) => {
      span.setAttribute('order.items.count', orderData.items.length);
      span.setAttribute('order.total', orderData.total);

      const savedOrder = await this.orderRepository.save(orderData);

      span.setAttribute('order.id', savedOrder.id);
      return savedOrder;
    });

    // Publish event
    await this.publishOrderCreated(order);

    return order;
  }

  @Traced('OrderService.validateOrder')
  private async validateOrder(orderData: CreateOrderDTO): Promise<void> {
    // Validation with automatic spans
    await this.validateStock(orderData.items);
    await this.validatePayment(orderData.paymentMethod);
  }

  private async publishOrderCreated(order: Order): Promise<void> {
    // Span for messaging
    await createSpan(
      'rabbitmq.publish.order_created',
      async (span) => {
        span.setAttribute('messaging.system', 'rabbitmq');
        span.setAttribute('messaging.destination', 'orders.created');
        span.setAttribute('messaging.message_id', order.id);

        await this.messagePublisher.publish('orders.created', {
          orderId: order.id,
          timestamp: new Date().toISOString(),
        });
      },
      { kind: SpanKind.PRODUCER }
    );
  }
}

Context Propagation Between Services

typescript
// src/middleware/request-context.ts
import { Request, Response, NextFunction } from 'express';
import { context, propagation, trace } from '@opentelemetry/api';
import { v4 as uuidv4 } from 'uuid';

export interface RequestContext {
  traceId: string;
  spanId: string;
  requestId: string;
  userId?: string;
  correlationId: string;
}

declare global {
  namespace Express {
    interface Request {
      context: RequestContext;
    }
  }
}

export function requestContextMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  // With HTTP auto-instrumentation, the server span is already active here.
  // Extract from the headers only when no span is active: extracting
  // unconditionally would replace the server span with the remote parent,
  // and the attributes set below would be lost.
  const requestContext = trace.getActiveSpan()
    ? context.active()
    : propagation.extract(context.active(), req.headers);

  context.with(requestContext, () => {
    const span = trace.getActiveSpan();
    const spanContext = span?.spanContext();

    // Create request context
    req.context = {
      traceId: spanContext?.traceId || uuidv4().replace(/-/g, ''),
      spanId: spanContext?.spanId || uuidv4().replace(/-/g, '').substring(0, 16),
      requestId: (req.headers['x-request-id'] as string) || uuidv4(),
      userId: req.headers['x-user-id'] as string,
      correlationId: (req.headers['x-correlation-id'] as string) || uuidv4(),
    };

    // Add response headers for debugging
    res.setHeader('x-trace-id', req.context.traceId);
    res.setHeader('x-request-id', req.context.requestId);

    // Add attributes to current span
    if (span) {
      span.setAttribute('request.id', req.context.requestId);
      span.setAttribute('correlation.id', req.context.correlationId);
      if (req.context.userId) {
        span.setAttribute('user.id', req.context.userId);
      }
    }

    next();
  });
}

// Helper to propagate context in HTTP calls
export function getTracingHeaders(): Record<string, string> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  return headers;
}
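
Under the hood, `propagation.inject` and `propagation.extract` read and write HTTP headers. With the W3C Trace Context format, the header is `traceparent`, laid out as `version-traceId-parentSpanId-flags`. The sketch below parses and formats that header by hand, purely to show what travels between services. `parseTraceParent` and `formatTraceParent` are hypothetical helpers that handle only the version `00` layout; in real code, leave this to the OpenTelemetry propagator.

```typescript
interface TraceParent {
  version: string;
  traceId: string;      // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean;     // lowest bit of the flags byte
}

const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a well-formed traceparent value.
function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid in the W3C format.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) {
    return null;
  }
  return {
    version,
    traceId,
    parentSpanId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

// Inverse of parseTraceParent; only the sampled flag is preserved.
function formatTraceParent(tp: TraceParent): string {
  return `${tp.version}-${tp.traceId}-${tp.parentSpanId}-${tp.sampled ? '01' : '00'}`;
}

// Usage: round-trip a sample header value.
const sampleHeader = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const parsedHeader = parseTraceParent(sampleHeader);
```

Seeing the raw header is handy when debugging a broken trace: if a proxy or gateway strips `traceparent`, the downstream service starts a new trace and the chain is cut.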

HTTP Client with Automatic Propagation

typescript
// src/utils/http-client.ts
import axios, { AxiosInstance } from 'axios';
import { getCurrentTraceId } from './trace-context';
// getTracingHeaders is defined in the request-context middleware module
import { getTracingHeaders } from '../middleware/request-context';

export function createTracedHttpClient(baseURL: string): AxiosInstance {
  const client = axios.create({ baseURL });

  // Interceptor to add tracing headers
  client.interceptors.request.use((config) => {
    // In axios 1.x, config.headers is an AxiosHeaders instance
    for (const [name, value] of Object.entries(getTracingHeaders())) {
      config.headers.set(name, value);
    }

    // Only send x-trace-id when there is an active trace
    const traceId = getCurrentTraceId();
    if (traceId) {
      config.headers.set('x-trace-id', traceId);
    }

    return config;
  });

  // Interceptor for error logging
  client.interceptors.response.use(
    (response) => response,
    (error) => {
      const traceId = getCurrentTraceId();
      console.error(`HTTP Error [trace: ${traceId}]:`, {
        url: error.config?.url,
        method: error.config?.method,
        status: error.response?.status,
        message: error.message,
      });
      throw error;
    }
  );

  return client;
}

Custom Metrics

Metric Types

typescript
// src/instrumentation/metrics.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('microservice-metrics', '1.0.0');

// Counter - values that only increase
export const httpRequestsTotal = meter.createCounter('http_requests_total', {
  description: 'Total number of HTTP requests',
  unit: '1',
});

// UpDownCounter - values that can increase or decrease
export const activeConnections = meter.createUpDownCounter('active_connections', {
  description: 'Number of active connections',
  unit: '1',
});

// Histogram - value distribution
export const httpRequestDuration = meter.createHistogram('http_request_duration_seconds', {
  description: 'Duration of HTTP requests in seconds',
  unit: 's',
  advice: {
    explicitBucketBoundaries: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  },
});

// Observable Gauge - current value that is observed
export const memoryUsage = meter.createObservableGauge('process_memory_bytes', {
  description: 'Process memory usage in bytes',
  unit: 'By',
});

memoryUsage.addCallback((result) => {
  const usage = process.memoryUsage();
  result.observe(usage.heapUsed, { type: 'heap_used' });
  result.observe(usage.heapTotal, { type: 'heap_total' });
  result.observe(usage.rss, { type: 'rss' });
  result.observe(usage.external, { type: 'external' });
});

// Observable Counter - a monotonic total that is observed
export const cpuUsage = meter.createObservableCounter('process_cpu_seconds_total', {
  description: 'Total CPU time spent in seconds',
  unit: 's',
});

cpuUsage.addCallback((result) => {
  // An observable counter must report the cumulative total, not a delta.
  // process.cpuUsage() without arguments returns microseconds since process start.
  const usage = process.cpuUsage();
  result.observe((usage.user + usage.system) / 1e6);
});
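
To see what the `explicitBucketBoundaries` above actually do, here is a small sketch that assigns sample durations to buckets by hand. It assumes upper-inclusive buckets, so bucket `i` covers values greater than the previous boundary and up to and including boundary `i`, with one extra overflow bucket at the end. `bucketIndex` and `bucketCounts` are hypothetical helpers, not SDK functions.

```typescript
// Same boundaries as the http_request_duration_seconds histogram.
const boundaries = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

// Index of the first boundary that is >= value; values above the last
// boundary land in the overflow bucket (index bounds.length).
function bucketIndex(value: number, bounds: number[]): number {
  const index = bounds.findIndex((upper) => value <= upper);
  return index === -1 ? bounds.length : index;
}

// Count how many values fall into each bucket.
function bucketCounts(values: number[], bounds: number[]): number[] {
  const counts = new Array<number>(bounds.length + 1).fill(0);
  for (const value of values) {
    counts[bucketIndex(value, bounds)] += 1;
  }
  return counts;
}

// Usage: five request durations in seconds.
const durations = [0.003, 0.02, 0.02, 0.3, 12];
const histogramCounts = bucketCounts(durations, boundaries);
// histogramCounts is [1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 1]
```

The 12-second request ends up in the overflow bucket, which is a hint that the largest boundary should sit above your realistic worst-case latency. Otherwise high percentiles cannot be estimated accurately.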

Business Metrics

typescript
// src/utils/business-metrics.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('business-metrics', '1.0.0');

// Order metrics
export const ordersCreated = meter.createCounter('orders_created_total', {
  description: 'Total orders created',
});

export const orderValue = meter.createHistogram('order_value_dollars', {
  description: 'Order value distribution',
  unit: 'USD',
  advice: {
    explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
  },
});

export const orderProcessingTime = meter.createHistogram('order_processing_duration_seconds', {
  description: 'Time to process an order',
  unit: 's',
});

// User metrics
export const activeUsers = meter.createUpDownCounter('active_users', {
  description: 'Number of currently active users',
});

export const userRegistrations = meter.createCounter('user_registrations_total', {
  description: 'Total user registrations',
});

// Stock metrics
export const stockLevel = meter.createObservableGauge('stock_level', {
  description: 'Current stock level by product',
});

// Payment metrics
export const paymentAttempts = meter.createCounter('payment_attempts_total', {
  description: 'Total payment attempts',
});

export const paymentAmount = meter.createHistogram('payment_amount_dollars', {
  description: 'Payment amount distribution',
  unit: 'USD',
});

// Helper to record order metrics
export function recordOrderMetrics(order: {
  id: string;
  total: number;
  items: number;
  processingTimeMs: number;
  paymentMethod: string;
  region: string;
}) {
  // Keep labels low-cardinality: never use order.id as a label
  const labels = {
    payment_method: order.paymentMethod,
    region: order.region,
  };

  ordersCreated.add(1, labels);
  orderValue.record(order.total, labels);
  orderProcessingTime.record(order.processingTimeMs / 1000, labels);
}

HTTP Metrics Middleware

typescript
// src/middleware/metrics.middleware.ts
import { Request, Response, NextFunction } from 'express';
import { httpRequestsTotal, httpRequestDuration, activeConnections } from '../instrumentation/metrics';

export function metricsMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  const startTime = process.hrtime.bigint();

  // Increment active connections
  activeConnections.add(1);

  // When response finishes
  res.on('finish', () => {
    const endTime = process.hrtime.bigint();
    const durationSeconds = Number(endTime - startTime) / 1e9;

    // Build labels here rather than up front: req.route is only populated
    // once Express has matched a route, so reading it earlier would always
    // fall back to the raw (high-cardinality) path
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      host: req.hostname,
      status_code: res.statusCode.toString(),
      status_class: `${Math.floor(res.statusCode / 100)}xx`,
    };

    // Record metrics
    httpRequestsTotal.add(1, labels);
    httpRequestDuration.record(durationSeconds, labels);
    activeConnections.add(-1);
  });

  // Error/timeout case: the connection closed before the response ended
  res.on('close', () => {
    if (!res.writableEnded) {
      activeConnections.add(-1);
    }
  });

  next();
}
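
The middleware falls back to `req.path` when no Express route matched (for example on 404s), and raw paths often contain IDs. One possible way to keep the `route` label bounded is to replace ID-like segments with placeholders before using the path as a label. `normalizePath` below is a hypothetical helper sketched for this article, and its patterns (numeric segments and UUIDs) are assumptions you would adapt to your own URL scheme.

```typescript
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

// Replace ID-like path segments with fixed placeholders and drop the query string.
function normalizePath(path: string): string {
  const withoutQuery = path.split('?')[0];
  return withoutQuery
    .split('/')
    .map((segment) => {
      if (NUMERIC_SEGMENT.test(segment)) return ':id';
      if (UUID_SEGMENT.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

// Usage: thousands of distinct order URLs collapse into one label value.
const normalizedOrderPath = normalizePath('/orders/12345');
const normalizedUserPath = normalizePath('/users/123e4567-e89b-12d3-a456-426614174000/orders?page=2');
```

In the middleware, this would be used as `route: req.route?.path || normalizePath(req.path)`.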

Structured Logs

Logger Configuration

typescript
// src/instrumentation/logging.ts
import { logs, SeverityNumber } from '@opentelemetry/api-logs';
import { trace } from '@opentelemetry/api';
import pino from 'pino';

const logger = logs.getLogger('microservice-logger', '1.0.0');

type LogLevel = 'trace' | 'debug' | 'info' | 'warn' | 'error' | 'fatal';

// OpenTelemetry severity levels
const severityMap: Record<LogLevel, SeverityNumber> = {
  trace: SeverityNumber.TRACE,
  debug: SeverityNumber.DEBUG,
  info: SeverityNumber.INFO,
  warn: SeverityNumber.WARN,
  error: SeverityNumber.ERROR,
  fatal: SeverityNumber.FATAL,
};

export interface LogContext {
  [key: string]: unknown;
}

// Explicit return type, so the recursive child() definition type-checks
export interface Logger {
  trace(message: string, ctx?: LogContext): void;
  debug(message: string, ctx?: LogContext): void;
  info(message: string, ctx?: LogContext): void;
  warn(message: string, ctx?: LogContext): void;
  error(message: string, ctx?: LogContext): void;
  fatal(message: string, ctx?: LogContext): void;
  child(bindings: Record<string, unknown>): Logger;
}

export function createLogger(serviceName: string): Logger {
  // Pino for local/console logs
  const pinoLogger = pino({
    level: process.env.LOG_LEVEL || 'info',
    formatters: {
      level: (label) => ({ level: label }),
      bindings: () => ({}),
    },
    timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
    base: {
      service: serviceName,
      environment: process.env.NODE_ENV,
    },
  });

  function write(level: LogLevel, message: string, ctx?: LogContext): void {
    // Log to console via Pino
    pinoLogger[level]({ ...ctx }, message);

    // Log to OpenTelemetry, attaching the active trace and span IDs
    const span = trace.getActiveSpan();
    const spanContext = span?.spanContext();

    logger.emit({
      severityNumber: severityMap[level],
      severityText: level.toUpperCase(),
      body: message,
      attributes: {
        'service.name': serviceName,
        'log.level': level,
        ...(spanContext && {
          'trace_id': spanContext.traceId,
          'span_id': spanContext.spanId,
        }),
        ...flattenObject(ctx || {}),
      },
    });
  }

  return {
    trace: (message, ctx) => write('trace', message, ctx),
    debug: (message, ctx) => write('debug', message, ctx),
    info: (message, ctx) => write('info', message, ctx),
    warn: (message, ctx) => write('warn', message, ctx),
    error: (message, ctx) => write('error', message, ctx),
    fatal: (message, ctx) => write('fatal', message, ctx),
    child: (bindings) => createChildLogger(serviceName, bindings),
  };
}

function createChildLogger(serviceName: string, bindings: Record<string, unknown>): Logger {
  const parentLogger = createLogger(serviceName);

  // Every call merges the child's bindings into the log context
  return {
    trace: (message, ctx) => parentLogger.trace(message, { ...bindings, ...ctx }),
    debug: (message, ctx) => parentLogger.debug(message, { ...bindings, ...ctx }),
    info: (message, ctx) => parentLogger.info(message, { ...bindings, ...ctx }),
    warn: (message, ctx) => parentLogger.warn(message, { ...bindings, ...ctx }),
    error: (message, ctx) => parentLogger.error(message, { ...bindings, ...ctx }),
    fatal: (message, ctx) => parentLogger.fatal(message, { ...bindings, ...ctx }),
    child: (newBindings) =>
      createChildLogger(serviceName, { ...bindings, ...newBindings }),
  };
}

// Flatten nested objects for attributes
function flattenObject(
  obj: Record<string, unknown>,
  prefix = ''
): Record<string, string | number | boolean> {
  const result: Record<string, string | number | boolean> = {};

  for (const [key, value] of Object.entries(obj)) {
    const newKey = prefix ? `${prefix}.${key}` : key;

    if (value && typeof value === 'object' && !Array.isArray(value)) {
      Object.assign(result, flattenObject(value as Record<string, unknown>, newKey));
    } else if (typeof value === 'string' || typeof value === 'number' || typeof value === 'boolean') {
      result[newKey] = value;
    } else if (value !== undefined && value !== null) {
      result[newKey] = String(value);
    }
  }

  return result;
}

export const log = createLogger(process.env.SERVICE_NAME || 'unknown-service');

Complete Observability Stack

Docker Compose

yaml
# docker-compose.observability.yml
version: '3.8'

services:
  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./docker/otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Prometheus metrics exposed by the collector
      - "8889:8889"   # Prometheus exporter metrics
      - "13133:13133" # Health check
      - "55679:55679" # zPages
    depends_on:
      - jaeger
      - prometheus
      - loki
    networks:
      - observability

  # Jaeger - Distributed Tracing
  jaeger:
    image: jaegertracing/all-in-one:1.52
    container_name: jaeger
    ports:
      - "16686:16686" # UI
      - "14268:14268" # HTTP collector
      - "14250:14250" # gRPC collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks:
      - observability

  # Prometheus - Metrics
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    volumes:
      - ./docker/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    networks:
      - observability

  # Loki - Log Aggregation
  loki:
    image: grafana/loki:2.9.2
    container_name: loki
    ports:
      - "3100:3100"
    networks:
      - observability

  # Grafana - Visualization
  grafana:
    image: grafana/grafana:10.2.2
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    ports:
      - "3001:3000"
    depends_on:
      - prometheus
      - loki
      - jaeger
    networks:
      - observability

networks:
  observability:
    driver: bridge

Production Checklist

Instrumentation

  • OpenTelemetry SDK configured before other imports
  • Auto-instrumentation enabled for HTTP, database, messaging
  • Custom spans for critical business operations
  • Relevant attributes added to spans
  • Errors captured and recorded correctly

Metrics

  • RED metrics (rate, errors, duration) for all endpoints
  • Business metrics defined
  • Histograms with appropriate buckets
  • Consistent labels across services
  • Label cardinality controlled
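
RED stands for rate, errors, and duration. In production you would normally derive these from the counters and histograms with queries in Prometheus, but the definitions are easy to sketch directly. The example below is an illustration only: `RequestSample` and `summarizeRed` are hypothetical names, counting only 5xx responses as errors is an assumption, and the percentile uses the simple nearest-rank method.

```typescript
interface RequestSample {
  durationSeconds: number;
  statusCode: number;
}

interface RedSummary {
  ratePerSecond: number; // Rate: requests per second over the window
  errorRatio: number;    // Errors: share of requests that failed
  p95Seconds: number;    // Duration: 95th percentile latency
}

function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  if (samples.length === 0) {
    return { ratePerSecond: 0, errorRatio: 0, p95Seconds: 0 };
  }
  // Assumption: only server errors (5xx) count against the service.
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  const sorted = samples.map((s) => s.durationSeconds).sort((a, b) => a - b);
  // Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
  const rank = Math.ceil(0.95 * sorted.length);
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorRatio: errors / samples.length,
    p95Seconds: sorted[rank - 1],
  };
}

// Usage: ten requests observed in a five-second window, one of them a 500.
const redSamples: RequestSample[] = [];
for (let i = 1; i <= 10; i++) {
  redSamples.push({ durationSeconds: i / 10, statusCode: i === 10 ? 500 : 200 });
}
const redSummary = summarizeRed(redSamples, 5);
```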

Logs

  • Structured format (JSON)
  • Correlation with trace ID
  • Appropriate log levels
  • Sensitive data masked
  • Rotation and retention configured
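
For the "sensitive data masked" item, one option is to scrub the log context before it reaches any logger. The sketch below is a minimal, hypothetical `redact` helper: the key list is an assumption and should match whatever your services actually log. Pino also ships its own `redact` option, which may be a better fit if all logs go through Pino.

```typescript
// Assumed list of key fragments that mark a value as sensitive.
const SENSITIVE_KEYS = ['password', 'token', 'authorization', 'secret', 'card_number'];

// Return a deep copy of the value with sensitive fields replaced by a marker.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === 'object') {
    const output: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      const isSensitive = SENSITIVE_KEYS.some((fragment) =>
        key.toLowerCase().includes(fragment)
      );
      output[key] = isSensitive ? '[REDACTED]' : redact(inner);
    }
    return output;
  }
  return value; // primitives pass through unchanged
}

// Usage: nested objects and arrays are handled recursively.
const redactedContext = redact({
  user: { email: 'a@b.c', password: 'hunter2' },
  accessToken: 'abc123',
  items: [{ secret_key: 's3cr3t', qty: 1 }],
});
```

Because the helper returns a copy, the original object stays intact for the business logic that still needs the real values.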

Alerts

  • SLOs defined and monitored
  • Alerts for critical metrics
  • Runbooks for each alert
  • Escalation configured
  • Alert tests performed
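
An SLO becomes actionable once it is expressed as an error budget: the number of failures you can afford in a window before the objective is breached. A minimal sketch of the arithmetic, with hypothetical names (`errorBudget`, `ErrorBudget`) and made-up example numbers:

```typescript
interface ErrorBudget {
  allowedFailures: number;   // failures the SLO tolerates in the window
  remainingFailures: number; // budget left; negative means the SLO is breached
  consumedFraction: number;  // 0 = untouched, 1 = fully spent
}

function errorBudget(
  sloTarget: number,      // e.g. 0.999 for "99.9% of requests succeed"
  totalRequests: number,
  failedRequests: number
): ErrorBudget {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError('sloTarget must be strictly between 0 and 1');
  }
  const allowedFailures = (1 - sloTarget) * totalRequests;
  return {
    allowedFailures,
    remainingFailures: allowedFailures - failedRequests,
    consumedFraction: allowedFailures === 0 ? 0 : failedRequests / allowedFailures,
  };
}

// Usage: a 99.9% SLO over one million requests tolerates about 1000 failures;
// 250 failures so far means roughly a quarter of the budget is spent.
const budget = errorBudget(0.999, 1000000, 250);
```

Alerting on how fast `consumedFraction` grows, rather than on individual errors, is what keeps alerts tied to user-visible symptoms.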

Infrastructure

  • Collector with high availability
  • Adequate data retention
  • Configuration backup
  • Sampling configured for volume
  • Adequate resources for stack
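
On "sampling configured for volume": a common approach is to keep a fixed fraction of traces, with the decision derived from the trace ID so that every service makes the same choice for the same request. The sketch below shows that idea only. `shouldSample` is a hypothetical function, and the way it maps a trace ID to a number is an assumption for illustration, not the algorithm used by the OpenTelemetry SDK's ratio-based sampler.

```typescript
// Decide deterministically whether to keep a trace, given a ratio in [0, 1].
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the last 8 hex characters as an unsigned 32-bit number in [0, 2^32).
  const bucket = parseInt(traceId.slice(-8), 16);
  // Keep the trace when its bucket falls in the lowest `ratio` share of the range.
  return bucket < ratio * 0x100000000;
}

// Usage: the same trace ID always yields the same decision for a given ratio.
const exampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';
const keptAtTenPercent = shouldSample(exampleTraceId, 0.1);
const keptAtFivePercent = shouldSample(exampleTraceId, 0.05);
```

Because the decision depends only on the trace ID and the ratio, no coordination between services is needed. Head sampling like this is cheap but blind to outcomes, so errors and slow requests are dropped at the same rate as healthy ones.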

Conclusion

Observability is the foundation for operating microservices in production with confidence. The key points are:

  1. Three Pillars: Traces, metrics, and logs work together to provide complete visibility
  2. OpenTelemetry: Vendor-neutral standard that simplifies instrumentation
  3. Correlation: Trace ID connects logs, metrics, and traces from the same request
  4. SLOs: Define clear objectives and monitor error budgets
  5. Smart Alerts: Alert on symptoms, not causes

With this complete series, you have all the tools to build robust microservices.

About the author

Pedro Farbo

Platform Engineering Lead & Solutions Architect with 10+ years of experience. CEO at Farbo TSC. Expert in Microservices, Kong, Backstage, and Cloud.