
Observability with OpenTelemetry: Monitoring Microservices in Production

Implement complete observability in your microservices with OpenTelemetry, Prometheus, and Grafana. Learn how to set up distributed traces, custom metrics, and log correlation for debugging in production.

This content is free! Help keep the project online.

PIX:0737160d-e98f-4a65-8392-5dba70e7ff3e

This is the fourth article in our microservices series. If you haven't read the previous articles yet, check out the microservices guide, API Gateway with Kong, and messaging with RabbitMQ.

Why Observability?

In distributed systems, debugging is exponentially harder. A request flows through multiple services, each with its own logs, metrics, and state. Without proper observability, finding the root cause of a problem is like looking for a needle in a haystack.

The Three Pillars of Observability

┌───────────────────────────────────────────────────────────┐
│                       OBSERVABILITY                       │
├───────────────────┬───────────────────┬───────────────────┤
│      TRACES       │      METRICS      │       LOGS        │
│                   │                   │                   │
│  ┌─────────────┐  │  ┌─────────────┐  │  ┌─────────────┐  │
│  │ Distributed │  │  │  Counters   │  │  │ Structured  │  │
│  │  Requests   │  │  │ Histograms  │  │  │    JSON     │  │
│  │   Latency   │  │  │   Gauges    │  │  │   Context   │  │
│  │   Errors    │  │  │ Percentiles │  │  │   TraceID   │  │
│  └─────────────┘  │  └─────────────┘  │  └─────────────┘  │
│                   │                   │                   │
│  "What happened   │  "How is the      │  "Why did it      │
│   in this         │   system          │   happen?"        │
│   request?"       │   behaving?"      │                   │
└───────────────────┴───────────────────┴───────────────────┘

OpenTelemetry: The Industry Standard

OpenTelemetry (OTel) is a CNCF project that provides APIs, SDKs, and tooling to collect telemetry (traces, metrics, and logs) in a standardized, vendor-neutral way.
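
In practice, application code depends only on the @opentelemetry/api package; the SDK, processors, and exporters are wired up separately and can be swapped without touching business code. A minimal sketch of that separation (the instrument names here are illustrative, not from this article's services):

typescript
// Application code imports only the vendor-neutral API.
import { trace, metrics } from '@opentelemetry/api';

// Illustrative names; whatever SDK/exporter is registered at startup
// (Jaeger, Prometheus, a SaaS backend...) receives this data.
const tracer = trace.getTracer('checkout-service', '1.0.0');
const meter = metrics.getMeter('checkout-service', '1.0.0');
const checkouts = meter.createCounter('checkouts_total');

export async function checkout(cartId: string): Promise<void> {
  await tracer.startActiveSpan('checkout', async (span) => {
    span.setAttribute('cart.id', cartId);
    checkouts.add(1); // a no-op if no SDK is registered
    span.end();
  });
}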

OpenTelemetry Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        APPLICATION                                  │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐     │
│  │   Auto-instr.   │  │  Manual-instr.  │  │    Baggage      │     │
│  │  (HTTP, gRPC)   │  │   (Custom)      │  │   (Context)     │     │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘     │
│           │                    │                    │               │
│           └────────────────────┼────────────────────┘               │
│                                ▼                                    │
│                    ┌─────────────────────┐                         │
│                    │   OTel SDK          │                         │
│                    │  ┌───────────────┐  │                         │
│                    │  │   Processor   │  │                         │
│                    │  │   Sampler     │  │                         │
│                    │  │   Exporter    │  │                         │
│                    │  └───────────────┘  │                         │
│                    └──────────┬──────────┘                         │
└───────────────────────────────┼─────────────────────────────────────┘
                                │
                                ▼
                    ┌─────────────────────┐
                    │   OTel Collector    │
                    │  ┌───────────────┐  │
                    │  │   Receivers   │──┼──► OTLP, Jaeger, Zipkin
                    │  │   Processors  │──┼──► Batch, Filter, Transform
                    │  │   Exporters   │──┼──► Jaeger, Prometheus, Loki
                    │  └───────────────┘  │
                    └──────────┬──────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
      ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
      │   Jaeger    │  │ Prometheus  │  │    Loki     │
      │   (Traces)  │  │  (Metrics)  │  │   (Logs)    │
      └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
             │                │                │
             └────────────────┼────────────────┘
                              ▼
                      ┌─────────────┐
                      │   Grafana   │
                      │ (Dashboard) │
                      └─────────────┘

Project Structure

observability-service/
├── src/
│   ├── instrumentation/
│   │   ├── index.ts              # Main OTel setup
│   │   ├── tracing.ts            # Trace configuration
│   │   ├── metrics.ts            # Metrics configuration
│   │   └── logging.ts            # Logging configuration
│   ├── middleware/
│   │   ├── request-context.ts    # Request context
│   │   ├── metrics.middleware.ts # HTTP metrics
│   │   └── logging.middleware.ts # Structured logs
│   ├── utils/
│   │   ├── trace-context.ts      # Trace utilities
│   │   ├── custom-metrics.ts     # Custom metrics
│   │   └── log-formatter.ts      # Log formatting
│   ├── exporters/
│   │   ├── jaeger.ts             # Jaeger exporter
│   │   ├── prometheus.ts         # Prometheus exporter
│   │   └── loki.ts               # Loki exporter
│   └── app.ts
├── docker/
│   ├── otel-collector-config.yaml
│   ├── prometheus.yml
│   ├── loki-config.yaml
│   └── grafana/
│       └── dashboards/
│           └── microservices.json
├── docker-compose.observability.yml
└── package.json

OpenTelemetry SDK Configuration

Installation

bash
# Core OpenTelemetry
npm install @opentelemetry/api @opentelemetry/sdk-node

# Automatic instrumentation
npm install @opentelemetry/auto-instrumentations-node

# Exporters
npm install @opentelemetry/exporter-trace-otlp-http
npm install @opentelemetry/exporter-metrics-otlp-http
npm install @opentelemetry/exporter-logs-otlp-http

# Resources and semantic conventions
npm install @opentelemetry/resources
npm install @opentelemetry/semantic-conventions

Main Setup

typescript
// src/instrumentation/index.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-http';
import { Resource } from '@opentelemetry/resources';
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
  SEMRESATTRS_DEPLOYMENT_ENVIRONMENT,
} from '@opentelemetry/semantic-conventions';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';

// Enable diagnostic logging for debugging
if (process.env.OTEL_DEBUG === 'true') {
  diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);
}

// Resource configuration (identifies the service)
const resource = new Resource({
  [SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown-service',
  [SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
  [SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
  'service.instance.id': process.env.HOSTNAME || 'local',
  'service.namespace': 'microservices',
});

// Exporter configuration
const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
  headers: {
    'x-api-key': process.env.OTEL_API_KEY || '',
  },
});

const metricExporter = new OTLPMetricExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/metrics',
});

const logExporter = new OTLPLogExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/logs',
});

// SDK configuration
const sdk = new NodeSDK({
  resource,
  traceExporter,
  metricReader: new PeriodicExportingMetricReader({
    exporter: metricExporter,
    exportIntervalMillis: 15000, // Export metrics every 15s
  }),
  logRecordProcessor: new BatchLogRecordProcessor(logExporter),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Per-instrumentation configuration
      '@opentelemetry/instrumentation-http': {
        requestHook: (span, request) => {
          span.setAttribute('http.request.id', request.headers['x-request-id'] || '');
        },
        responseHook: (span, response) => {
          span.setAttribute('http.response.content_length',
            response.headers['content-length'] || 0);
        },
        ignoreIncomingRequestHook: (request) => {
          // Ignore health checks
          return request.url === '/health' || request.url === '/ready';
        },
      },
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-pg': {
        enhancedDatabaseReporting: true,
      },
      '@opentelemetry/instrumentation-redis': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-amqplib': {
        enabled: true, // RabbitMQ
      },
    }),
  ],
});

// Initialization
export async function initTelemetry(): Promise<void> {
  try {
    await sdk.start();
    console.log('OpenTelemetry initialized successfully');

    // Graceful shutdown
    process.on('SIGTERM', async () => {
      try {
        await sdk.shutdown();
        console.log('OpenTelemetry shut down successfully');
      } catch (error) {
        console.error('Error shutting down OpenTelemetry', error);
      }
    });
  } catch (error) {
    console.error('Error initializing OpenTelemetry', error);
    throw error;
  }
}

export { sdk };

Application Entry Point

typescript
// src/index.ts
import { initTelemetry } from './instrumentation';

// IMPORTANT: initialize telemetry first!
async function bootstrap() {
  await initTelemetry();

  // Only then import the rest of the application
  const { createApp } = await import('./app');
  const app = await createApp();

  const port = process.env.PORT || 3000;
  app.listen(port, () => {
    console.log(`Server running on port ${port}`);
  });
}

bootstrap().catch(console.error);

Distributed Tracing

Distributed tracing lets you follow a single request across multiple services; the header example after the diagram below shows how this context travels between them.

Core Concepts

┌─────────────────────────────────────────────────────────────────┐
│                             TRACE                               │
│  TraceID: abc123                                                │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │ SPAN: API Gateway (Root Span)                             │ │
│  │ SpanID: span-1, ParentID: null                            │ │
│  │ Duration: 250ms                                           │ │
│  │ ┌───────────────────────────────────────────────────────┐ │ │
│  │ │ SPAN: User Service                                    │ │ │
│  │ │ SpanID: span-2, ParentID: span-1                      │ │ │
│  │ │ Duration: 50ms                                        │ │ │
│  │ └───────────────────────────────────────────────────────┘ │ │
│  │ ┌───────────────────────────────────────────────────────┐ │ │
│  │ │ SPAN: Order Service                                   │ │ │
│  │ │ SpanID: span-3, ParentID: span-1                      │ │ │
│  │ │ Duration: 150ms                                       │ │ │
│  │ │ ┌───────────────────────────────────────────────────┐ │ │ │
│  │ │ │ SPAN: Database Query                              │ │ │ │
│  │ │ │ SpanID: span-4, ParentID: span-3                  │ │ │ │
│  │ │ │ Duration: 45ms                                    │ │ │ │
│  │ │ └───────────────────────────────────────────────────┘ │ │ │
│  │ │ ┌───────────────────────────────────────────────────┐ │ │ │
│  │ │ │ SPAN: RabbitMQ Publish                            │ │ │ │
│  │ │ │ SpanID: span-5, ParentID: span-3                  │ │ │ │
│  │ │ │ Duration: 10ms                                    │ │ │ │
│  │ │ └───────────────────────────────────────────────────┘ │ │ │
│  │ └───────────────────────────────────────────────────────┘ │ │
│  └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
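
Between services, this parent/child relationship travels in the W3C Trace Context traceparent header, in the form version-traceid-parentspanid-flags. The IDs below are illustrative:

# version - trace-id (shared by all spans) - parent-span-id - flags (01 = sampled)
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01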

Manual Span Instrumentation

typescript
// src/utils/trace-context.ts
import { trace, SpanStatusCode, SpanKind, context, propagation } from '@opentelemetry/api';
import type { Span, SpanOptions, Context } from '@opentelemetry/api';

const tracer = trace.getTracer('microservice-tracer', '1.0.0');

// Decorator for automatic tracing
export function Traced(
  spanName?: string,
  options?: SpanOptions
): MethodDecorator {
  return function (
    target: any,
    propertyKey: string | symbol,
    descriptor: PropertyDescriptor
  ) {
    const originalMethod = descriptor.value;
    const name = spanName || `${target.constructor.name}.${String(propertyKey)}`;

    descriptor.value = async function (...args: any[]) {
      return tracer.startActiveSpan(name, options || {}, async (span: Span) => {
        try {
          // Record parameters as attributes (beware of sensitive data!)
          span.setAttribute('method.arguments.count', args.length);

          const result = await originalMethod.apply(this, args);

          span.setStatus({ code: SpanStatusCode.OK });
          return result;
        } catch (error) {
          span.setStatus({
            code: SpanStatusCode.ERROR,
            message: error instanceof Error ? error.message : 'Unknown error',
          });
          span.recordException(error as Error);
          throw error;
        } finally {
          span.end();
        }
      });
    };

    return descriptor;
  };
}

// Create a span manually
export function createSpan(
  name: string,
  fn: (span: Span) => Promise<any>,
  options?: SpanOptions
): Promise<any> {
  return tracer.startActiveSpan(name, options || {}, async (span) => {
    try {
      const result = await fn(span);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : 'Unknown error',
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Extract/inject context for propagation
export function extractContext(headers: Record<string, string>): Context {
  return propagation.extract(context.active(), headers);
}

export function injectContext(headers: Record<string, string>): void {
  propagation.inject(context.active(), headers);
}

// Add events to the active span
export function addSpanEvent(
  eventName: string,
  attributes?: Record<string, string | number | boolean>
): void {
  const span = trace.getActiveSpan();
  if (span) {
    span.addEvent(eventName, attributes);
  }
}

// Get the current trace ID
export function getCurrentTraceId(): string | undefined {
  const span = trace.getActiveSpan();
  return span?.spanContext().traceId;
}

// Get the current span ID
export function getCurrentSpanId(): string | undefined {
  const span = trace.getActiveSpan();
  return span?.spanContext().spanId;
}

Usage in Services

typescript
// src/services/order.service.ts
import { Traced, createSpan, addSpanEvent } from '../utils/trace-context';
import { trace, SpanKind } from '@opentelemetry/api';

export class OrderService {
  private readonly tracer = trace.getTracer('order-service');

  @Traced('OrderService.createOrder', { kind: SpanKind.INTERNAL })
  async createOrder(orderData: CreateOrderDTO): Promise<Order> {
    addSpanEvent('order.validation.started');

    // Validation
    await this.validateOrder(orderData);
    addSpanEvent('order.validation.completed');

    // Create a child span for a specific operation
    const order = await createSpan('order.save', async (span) => {
      span.setAttribute('order.items.count', orderData.items.length);
      span.setAttribute('order.total', orderData.total);

      const savedOrder = await this.orderRepository.save(orderData);

      span.setAttribute('order.id', savedOrder.id);
      return savedOrder;
    });

    // Publish event
    await this.publishOrderCreated(order);

    return order;
  }

  @Traced('OrderService.validateOrder')
  private async validateOrder(orderData: CreateOrderDTO): Promise<void> {
    // Validation with automatic spans
    await this.validateStock(orderData.items);
    await this.validatePayment(orderData.paymentMethod);
  }

  private async publishOrderCreated(order: Order): Promise<void> {
    // Span for messaging
    await createSpan(
      'rabbitmq.publish.order_created',
      async (span) => {
        span.setAttribute('messaging.system', 'rabbitmq');
        span.setAttribute('messaging.destination', 'orders.created');
        span.setAttribute('messaging.message_id', order.id);

        await this.messagePublisher.publish('orders.created', {
          orderId: order.id,
          timestamp: new Date().toISOString(),
        });
      },
      { kind: SpanKind.PRODUCER }
    );
  }
}

Context Propagation between Services

typescript
// src/middleware/request-context.ts
import { Request, Response, NextFunction } from 'express';
import { context, propagation, trace } from '@opentelemetry/api';
import { v4 as uuidv4 } from 'uuid';

export interface RequestContext {
  traceId: string;
  spanId: string;
  requestId: string;
  userId?: string;
  correlationId: string;
}

declare global {
  namespace Express {
    interface Request {
      context: RequestContext;
    }
  }
}

export function requestContextMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  // Extract propagated context (if present)
  const extractedContext = propagation.extract(context.active(), req.headers);

  context.with(extractedContext, () => {
    const span = trace.getActiveSpan();
    const spanContext = span?.spanContext();

    // Build the request context
    req.context = {
      traceId: spanContext?.traceId || uuidv4().replace(/-/g, ''),
      spanId: spanContext?.spanId || uuidv4().replace(/-/g, '').substring(0, 16),
      requestId: req.headers['x-request-id'] as string || uuidv4(),
      userId: req.headers['x-user-id'] as string,
      correlationId: req.headers['x-correlation-id'] as string || uuidv4(),
    };

    // Add response headers for debugging
    res.setHeader('x-trace-id', req.context.traceId);
    res.setHeader('x-request-id', req.context.requestId);

    // Add attributes to the current span
    if (span) {
      span.setAttribute('request.id', req.context.requestId);
      span.setAttribute('correlation.id', req.context.correlationId);
      if (req.context.userId) {
        span.setAttribute('user.id', req.context.userId);
      }
    }

    next();
  });
}

// Helper to propagate context in outgoing HTTP calls
export function getTracingHeaders(): Record<string, string> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  return headers;
}

HTTP Client with Automatic Propagation

typescript
// src/utils/http-client.ts
import axios, { AxiosInstance, AxiosRequestConfig } from 'axios';
import { getTracingHeaders, getCurrentTraceId } from './trace-context';

export function createTracedHttpClient(baseURL: string): AxiosInstance {
  const client = axios.create({ baseURL });

  // Interceptor that adds tracing headers
  client.interceptors.request.use((config) => {
    const tracingHeaders = getTracingHeaders();

    config.headers = {
      ...config.headers,
      ...tracingHeaders,
      'x-trace-id': getCurrentTraceId(),
    };

    return config;
  });

  // Interceptor for error logging
  client.interceptors.response.use(
    (response) => response,
    (error) => {
      const traceId = getCurrentTraceId();
      console.error(`HTTP Error [trace: ${traceId}]:`, {
        url: error.config?.url,
        method: error.config?.method,
        status: error.response?.status,
        message: error.message,
      });
      throw error;
    }
  );

  return client;
}

Custom Metrics

Metric Types

typescript
// src/instrumentation/metrics.ts
import { metrics, ValueType } from '@opentelemetry/api';

const meter = metrics.getMeter('microservice-metrics', '1.0.0');

// Counter - values that only increase
export const httpRequestsTotal = meter.createCounter('http_requests_total', {
  description: 'Total number of HTTP requests',
  unit: '1',
});

// UpDownCounter - values that can increase or decrease
export const activeConnections = meter.createUpDownCounter('active_connections', {
  description: 'Number of active connections',
  unit: '1',
});

// Histogram - distribution of values
export const httpRequestDuration = meter.createHistogram('http_request_duration_seconds', {
  description: 'Duration of HTTP requests in seconds',
  unit: 's',
  advice: {
    explicitBucketBoundaries: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  },
});

// Observable Gauge - a current value sampled on collection
export const memoryUsage = meter.createObservableGauge('process_memory_bytes', {
  description: 'Process memory usage in bytes',
  unit: 'By',
});

memoryUsage.addCallback((result) => {
  const usage = process.memoryUsage();
  result.observe(usage.heapUsed, { type: 'heap_used' });
  result.observe(usage.heapTotal, { type: 'heap_total' });
  result.observe(usage.rss, { type: 'rss' });
  result.observe(usage.external, { type: 'external' });
});

// Observable Counter - an observed, monotonically increasing counter
export const cpuUsage = meter.createObservableCounter('process_cpu_seconds_total', {
  description: 'Total CPU time spent in seconds',
  unit: 's',
});

cpuUsage.addCallback((result) => {
  // Observable counters must report cumulative totals, not deltas
  const usage = process.cpuUsage();
  result.observe((usage.user + usage.system) / 1e6, {});
});

Business Metrics

typescript
// src/utils/business-metrics.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('business-metrics', '1.0.0');

// Order metrics
export const ordersCreated = meter.createCounter('orders_created_total', {
  description: 'Total orders created',
});

export const orderValue = meter.createHistogram('order_value_dollars', {
  description: 'Order value distribution',
  unit: 'USD',
  advice: {
    explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
  },
});

export const orderProcessingTime = meter.createHistogram('order_processing_duration_seconds', {
  description: 'Time to process an order',
  unit: 's',
});

// User metrics
export const activeUsers = meter.createUpDownCounter('active_users', {
  description: 'Number of currently active users',
});

export const userRegistrations = meter.createCounter('user_registrations_total', {
  description: 'Total user registrations',
});

// Stock metrics
export const stockLevel = meter.createObservableGauge('stock_level', {
  description: 'Current stock level by product',
});

// Payment metrics
export const paymentAttempts = meter.createCounter('payment_attempts_total', {
  description: 'Total payment attempts',
});

export const paymentAmount = meter.createHistogram('payment_amount_dollars', {
  description: 'Payment amount distribution',
  unit: 'USD',
});

// Helper to record order metrics
export function recordOrderMetrics(order: {
  id: string;
  total: number;
  items: number;
  processingTimeMs: number;
  paymentMethod: string;
  region: string;
}) {
  const labels = {
    payment_method: order.paymentMethod,
    region: order.region,
  };

  ordersCreated.add(1, labels);
  orderValue.record(order.total, labels);
  orderProcessingTime.record(order.processingTimeMs / 1000, labels);
}

HTTP Metrics Middleware

typescript
// src/middleware/metrics.middleware.ts
import { Request, Response, NextFunction } from 'express';
import { httpRequestsTotal, httpRequestDuration, activeConnections } from '../instrumentation/metrics';

export function metricsMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  const startTime = process.hrtime.bigint();

  // Increment active connections
  activeConnections.add(1);

  // Common labels
  const labels = {
    method: req.method,
    route: req.route?.path || req.path,
    host: req.hostname,
  };

  // When the response finishes
  res.on('finish', () => {
    const endTime = process.hrtime.bigint();
    const durationSeconds = Number(endTime - startTime) / 1e9;

    const finalLabels = {
      ...labels,
      status_code: res.statusCode.toString(),
      status_class: `${Math.floor(res.statusCode / 100)}xx`,
    };

    // Record the metrics
    httpRequestsTotal.add(1, finalLabels);
    httpRequestDuration.record(durationSeconds, finalLabels);
    activeConnections.add(-1);
  });

  // Error/timeout case
  res.on('close', () => {
    if (!res.writableEnded) {
      activeConnections.add(-1);
    }
  });

  next();
}

Structured Logs

Logger Configuration

typescript
// src/instrumentation/logging.ts
import { logs, SeverityNumber } from '@opentelemetry/api-logs';
import { trace, context } from '@opentelemetry/api';
import pino from 'pino';

const logger = logs.getLogger('microservice-logger', '1.0.0');

// OpenTelemetry severity levels
const severityMap: Record<string, SeverityNumber> = {
  trace: SeverityNumber.TRACE,
  debug: SeverityNumber.DEBUG,
  info: SeverityNumber.INFO,
  warn: SeverityNumber.WARN,
  error: SeverityNumber.ERROR,
  fatal: SeverityNumber.FATAL,
};

export interface LogContext {
  [key: string]: unknown;
}

export function createLogger(serviceName: string) {
  // Pino for local/console logs
  const pinoLogger = pino({
    level: process.env.LOG_LEVEL || 'info',
    formatters: {
      level: (label) => ({ level: label }),
      bindings: () => ({}),
    },
    timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
    base: {
      service: serviceName,
      environment: process.env.NODE_ENV,
    },
  });

  return {
    trace: (message: string, ctx?: LogContext) => log('trace', message, ctx),
    debug: (message: string, ctx?: LogContext) => log('debug', message, ctx),
    info: (message: string, ctx?: LogContext) => log('info', message, ctx),
    warn: (message: string, ctx?: LogContext) => log('warn', message, ctx),
    error: (message: string, ctx?: LogContext) => log('error', message, ctx),
    fatal: (message: string, ctx?: LogContext) => log('fatal', message, ctx),
    child: (bindings: Record<string, unknown>) => {
      return createChildLogger(serviceName, bindings);
    },
  };

  function log(level: string, message: string, ctx?: LogContext) {
    // Console log via Pino
    pinoLogger[level as keyof typeof pinoLogger]({ ...ctx }, message);

    // OpenTelemetry log record
    const span = trace.getActiveSpan();
    const spanContext = span?.spanContext();

    logger.emit({
      severityNumber: severityMap[level],
      severityText: level.toUpperCase(),
      body: message,
      attributes: {
        'service.name': serviceName,
        'log.level': level,
        ...(spanContext && {
          'trace_id': spanContext.traceId,
          'span_id': spanContext.spanId,
        }),
        ...flattenObject(ctx || {}),
      },
    });
  }
}

function createChildLogger(serviceName: string, bindings: Record<string, unknown>) {
  const parentLogger = createLogger(serviceName);

  return {
    trace: (message: string, ctx?: LogContext) =>
      parentLogger.trace(message, { ...bindings, ...ctx }),
    debug: (message: string, ctx?: LogContext) =>
      parentLogger.debug(message, { ...bindings, ...ctx }),
    info: (message: string, ctx?: LogContext) =>
      parentLogger.info(message, { ...bindings, ...ctx }),
    warn: (message: string, ctx?: LogContext) =>
      parentLogger.warn(message, { ...bindings, ...ctx }),
    error: (message: string, ctx?: LogContext) =>
      parentLogger.error(message, { ...bindings, ...ctx }),
    fatal: (message: string, ctx?: LogContext) =>
      parentLogger.fatal(message, { ...bindings, ...ctx }),
    child: (newBindings: Record<string, unknown>) =>
      createChildLogger(serviceName, { ...bindings, ...newBindings }),
  };
}

// Flatten nested objects into attribute keys
function flattenObject(
  obj: Record<string, unknown>,
  prefix = ''
): Record<string, string | number | boolean> {
  const result: Record<string, string | number | boolean> = {};

  for (const [key, value] of Object.entries(obj)) {
    const newKey = prefix ? `${prefix}.${key}` : key;

    if (value && typeof value === 'object' && !Array.isArray(value)) {
      Object.assign(result, flattenObject(value as Record<string, unknown>, newKey));
    } else if (typeof value === 'string' || typeof value === 'number' || typeof value === 'boolean') {
      result[newKey] = value;
    } else if (value !== undefined && value !== null) {
      result[newKey] = String(value);
    }
  }

  return result;
}

export const log = createLogger(process.env.SERVICE_NAME || 'unknown-service');

Logging Middleware

typescript
// src/middleware/logging.middleware.ts
import { Request, Response, NextFunction } from 'express';
import { log } from '../instrumentation/logging';
import { getCurrentTraceId, getCurrentSpanId } from '../utils/trace-context';

export function loggingMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  const startTime = Date.now();

  // Contextual logger for this request
  const requestLogger = log.child({
    requestId: req.context?.requestId,
    traceId: getCurrentTraceId(),
    spanId: getCurrentSpanId(),
    method: req.method,
    path: req.path,
    userAgent: req.headers['user-agent'],
    ip: req.ip,
  });

  // Request start log
  requestLogger.info('Request started', {
    query: req.query,
    params: req.params,
    // Body is not logged to avoid leaking sensitive data
  });

  // Capture the response body
  const originalSend = res.send;
  res.send = function (body: any) {
    res.locals.body = body;
    return originalSend.call(this, body);
  };

  // Log on completion
  res.on('finish', () => {
    const duration = Date.now() - startTime;
    const logContext = {
      statusCode: res.statusCode,
      duration,
      contentLength: res.get('content-length'),
    };

    if (res.statusCode >= 500) {
      requestLogger.error('Request failed', logContext);
    } else if (res.statusCode >= 400) {
      requestLogger.warn('Request client error', logContext);
    } else {
      requestLogger.info('Request completed', logContext);
    }
  });

  // Expose the logger on the request
  (req as any).log = requestLogger;

  next();
}

Correlating Logs ↔ Traces

typescript
// src/utils/log-formatter.ts
import { getCurrentTraceId, getCurrentSpanId } from './trace-context';

export interface CorrelatedLogEntry {
  timestamp: string;
  level: string;
  message: string;
  traceId?: string;
  spanId?: string;
  service: string;
  [key: string]: unknown;
}

export function formatLogEntry(
  level: string,
  message: string,
  context: Record<string, unknown> = {}
): CorrelatedLogEntry {
  return {
    timestamp: new Date().toISOString(),
    level,
    message,
    traceId: getCurrentTraceId(),
    spanId: getCurrentSpanId(),
    service: process.env.SERVICE_NAME || 'unknown',
    ...context,
  };
}

// Example usage in an error handler
export function logError(error: Error, context?: Record<string, unknown>): void {
  const entry = formatLogEntry('error', error.message, {
    errorName: error.name,
    stack: error.stack,
    ...context,
  });

  console.error(JSON.stringify(entry));
}

OpenTelemetry Collector

The Collector is the central hub that receives, processes, and exports telemetry.

Collector Configuration

yaml
# docker/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Host metrics receiver
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

  # Prometheus receiver (pull-based)
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 15s
          static_configs:
            - targets: ['localhost:8888']

processors:
  # Group into batches for efficiency
  batch:
    timeout: 5s
    send_batch_size: 1000
    send_batch_max_size: 2000

  # Add metadata
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
      - key: collector.version
        value: "0.91.0"
        action: upsert

  # Scrub sensitive data
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: hash
      - key: user.email
        pattern: ^.*@
        action: hash

  # Sampling to reduce volume
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail-based sampling (keeps traces with errors)
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep traces with errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces (>2s)
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      # Sample 10% of normal traces
      - name: normal-sampling
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

  # Transformations
  transform:
    trace_statements:
      - context: span
        statements:
          - set(attributes["processed_by"], "otel-collector")
          - truncate_all(attributes, 256)

exporters:
  # Debug (development only)
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 100

  # Jaeger for traces
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Prometheus for metrics
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: microservices
    const_labels:
      environment: production
    resource_to_telemetry_conversion:
      enabled: true

  # Loki for logs
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    tenant_id: microservices
    labels:
      resource:
        service.name: "service"
        service.namespace: "namespace"
      attributes:
        log.level: "level"

  # Generic OTLP (could be Grafana Cloud, Honeycomb, etc.)
  otlp/cloud:
    endpoint: ${OTLP_ENDPOINT}
    headers:
      Authorization: Bearer ${OTLP_TOKEN}

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, pprof, zpages]

  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, attributes, tail_sampling]
      exporters: [otlp/jaeger, otlp/cloud]

    metrics:
      receivers: [otlp, hostmetrics, prometheus]
      processors: [batch, resource]
      exporters: [prometheus, otlp/cloud]

    logs:
      receivers: [otlp]
      processors: [batch, resource, attributes]
      exporters: [loki, otlp/cloud]

  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888

The Complete Observability Stack

Docker Compose

yaml
# docker-compose.observability.yml
version: '3.8'

services:
  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./docker/otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Prometheus metrics exposed by the collector
      - "8889:8889"   # Prometheus exporter metrics
      - "13133:13133" # Health check
      - "55679:55679" # zPages
    environment:
      - OTLP_ENDPOINT=${OTLP_ENDPOINT:-}
      - OTLP_TOKEN=${OTLP_TOKEN:-}
    depends_on:
      - jaeger
      - prometheus
      - loki
    networks:
      - observability

  # Jaeger - Distributed Tracing
  jaeger:
    image: jaegertracing/all-in-one:1.52
    container_name: jaeger
    ports:
      - "16686:16686" # UI
      - "14268:14268" # HTTP collector
      - "14250:14250" # gRPC collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=badger
      - BADGER_EPHEMERAL=false
      - BADGER_DIRECTORY_VALUE=/badger/data
      - BADGER_DIRECTORY_KEY=/badger/key
    volumes:
      - jaeger-data:/badger
    networks:
      - observability

  # Prometheus - Metrics
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
      - '--web.enable-remote-write-receiver'
    volumes:
      - ./docker/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - observability

  # Loki - Log Aggregation
  loki:
    image: grafana/loki:2.9.2
    container_name: loki
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - ./docker/loki-config.yaml:/etc/loki/loki-config.yaml
      - loki-data:/loki
    ports:
      - "3100:3100"
    networks:
      - observability

  # Grafana - Visualization
  grafana:
    image: grafana/grafana:10.2.2
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor
    volumes:
      - grafana-data:/var/lib/grafana
      - ./docker/grafana/provisioning:/etc/grafana/provisioning
      - ./docker/grafana/dashboards:/var/lib/grafana/dashboards
    ports:
      - "3001:3000"
    depends_on:
      - prometheus
      - loki
      - jaeger
    networks:
      - observability

  # Alertmanager - Alerts
  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    volumes:
      - ./docker/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - observability

volumes:
  jaeger-data:
  prometheus-data:
  loki-data:
  grafana-data:
  alertmanager-data:

networks:
  observability:
    driver: bridge

Prometheus Configuration

yaml
# docker/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'microservices-cluster'
    env: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # OpenTelemetry Collector
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

  # Microservices (via service discovery or static targets)
  - job_name: 'microservices'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - microservices
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

Loki Configuration

yaml
# docker/loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://alertmanager:9093

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_streams_per_user: 10000
  max_line_size: 256kb

table_manager:
  retention_deletes_enabled: true
  retention_period: 336h

Grafana Dashboards

Microservices Dashboard

json
{  "dashboard": {    "title": "Microservices Overview",    "tags": ["microservices", "observability"],    "timezone": "browser",    "panels": [      {        "title": "Request Rate",        "type": "timeseries",        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },        "targets": [          {            "expr": "sum(rate(http_requests_total[5m])) by (service)",            "legendFormat": "{{service}}"          }        ]      },      {        "title": "Error Rate",        "type": "timeseries",        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },        "targets": [          {            "expr": "sum(rate(http_requests_total{status_class='5xx'}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100",            "legendFormat": "{{service}}"          }        ],        "fieldConfig": {          "defaults": {            "unit": "percent",            "thresholds": {              "steps": [                { "color": "green", "value": null },                { "color": "yellow", "value": 1 },                { "color": "red", "value": 5 }              ]            }          }        }      },      {        "title": "P99 Latency",        "type": "timeseries",        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },        "targets": [          {            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",            "legendFormat": "{{service}}"          }        ],        "fieldConfig": {          "defaults": {            "unit": "s"          }        }      },      {        "title": "Active Connections",        "type": "stat",        "gridPos": { "h": 4, "w": 6, "x": 12, "y": 8 },        "targets": [          {            "expr": "sum(active_connections)"          }        ]      },      {        "title": "Service Map",        "type": "nodeGraph",        "gridPos": { "h": 12, "w": 24, "x": 0, "y": 16 },        "datasource": "Jaeger",        "targets": [          {            "queryType": "serviceMap"          }        ]      }    ]  }}

Alerts and SLOs

Prometheus Alert Rules

yaml
# docker/prometheus/rules/microservices-alerts.yml
groups:
  - name: microservices.rules
    interval: 30s
    rules:
      # SLI: Availability
      - record: sli:availability:rate5m
        expr: |
          sum(rate(http_requests_total{status_class!="5xx"}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # SLI: Latency (P99 < 500ms)
      - record: sli:latency_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )

  - name: microservices.alerts
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_class="5xx"}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      # High Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"
          description: "P99 latency is {{ printf \"%.3f\" $value }}s (threshold: 500ms)"

      # Service Down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

      # Memory Usage High
      - alert: HighMemoryUsage
        expr: |
          (process_memory_bytes{type="heap_used"} / ignoring(type) process_memory_bytes{type="heap_total"}) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.service }}"
          description: "Heap usage is {{ $value | humanizePercentage }}"

      # Dead Letter Queue Growing
      - alert: DLQGrowing
        expr: |
          increase(rabbitmq_queue_messages{queue=~".*\\.dlq"}[1h]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Dead Letter Queue growing"
          description: "DLQ {{ $labels.queue }} has {{ $value }} new messages in the last hour"

      # SLO Breach Risk
      - alert: SLOBreachRisk
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status_class!="5xx"}[30m])) by (service)
              /
              sum(rate(http_requests_total[30m])) by (service)
            )
          ) > (1 - 0.999) * 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "SLO breach risk for {{ $labels.service }}"
          description: "Error budget burn rate is 2x normal, risking monthly SLO breach"

SLO Dashboard

typescript
// src/slo/slo-calculator.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('slo-metrics', '1.0.0');

interface SLOConfig {
  name: string;
  target: number; // e.g. 0.999 for 99.9%
  windowDays: number;
}

interface SLOStatus {
  name: string;
  target: number;
  current: number;
  errorBudget: number;
  errorBudgetRemaining: number;
  isBreached: boolean;
}

export class SLOCalculator {
  private sloGauge = meter.createObservableGauge('slo_current', {
    description: 'Current SLO value',
  });

  private errorBudgetGauge = meter.createObservableGauge('slo_error_budget_remaining', {
    description: 'Remaining error budget percentage',
  });

  constructor(private slos: SLOConfig[]) {
    this.setupMetrics();
  }

  private setupMetrics(): void {
    this.sloGauge.addCallback(async (result) => {
      for (const slo of this.slos) {
        const status = await this.calculateSLO(slo);
        result.observe(status.current, { slo_name: slo.name });
      }
    });

    this.errorBudgetGauge.addCallback(async (result) => {
      for (const slo of this.slos) {
        const status = await this.calculateSLO(slo);
        result.observe(status.errorBudgetRemaining, { slo_name: slo.name });
      }
    });
  }

  async calculateSLO(config: SLOConfig): Promise<SLOStatus> {
    // In production this would query Prometheus or another metrics source
    const totalRequests = await this.getTotalRequests(config.windowDays);
    const successfulRequests = await this.getSuccessfulRequests(config.windowDays);

    const current = totalRequests > 0 ? successfulRequests / totalRequests : 1;
    const errorBudget = 1 - config.target;
    const errorsAllowed = totalRequests * errorBudget;
    const actualErrors = totalRequests - successfulRequests;
    const errorBudgetRemaining = Math.max(0, (errorsAllowed - actualErrors) / errorsAllowed);

    return {
      name: config.name,
      target: config.target,
      current,
      errorBudget,
      errorBudgetRemaining,
      isBreached: current < config.target,
    };
  }

  private async getTotalRequests(windowDays: number): Promise<number> {
    // A real implementation would query Prometheus
    return 1000000; // Placeholder
  }

  private async getSuccessfulRequests(windowDays: number): Promise<number> {
    // A real implementation would query Prometheus
    return 999500; // Placeholder
  }
}

// Usage
const sloCalculator = new SLOCalculator([
  { name: 'api_availability', target: 0.999, windowDays: 30 },
  { name: 'api_latency_p99', target: 0.99, windowDays: 30 },
  { name: 'order_processing', target: 0.995, windowDays: 30 },
]);

Deploying to Kubernetes

The OpenTelemetry Operator

yaml
# k8s/otel-operator.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: observability
spec:
  mode: deployment
  replicas: 2

  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000

      memory_limiter:
        check_interval: 1s
        limit_mib: 1000
        spike_limit_mib: 200

    exporters:
      otlp/jaeger:
        endpoint: jaeger-collector.observability.svc:4317
        tls:
          insecure: true

      prometheus:
        endpoint: "0.0.0.0:8889"

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]

  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi

---
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: microservices
spec:
  exporter:
    endpoint: http://otel-collector.observability.svc:4318

  propagators:
    - tracecontext
    - baggage
    - b3

  sampler:
    type: parentbased_traceidratio
    argument: "0.1"

  nodejs:
    env:
      - name: OTEL_TRACES_EXPORTER
        value: otlp
      - name: OTEL_METRICS_EXPORTER
        value: otlp
      - name: OTEL_LOGS_EXPORTER
        value: otlp

Deployment with Auto-Instrumentation

yaml
# k8s/microservice-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: microservices
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
      annotations:
        # Enables auto-instrumentation via the OTel Operator
        instrumentation.opentelemetry.io/inject-nodejs: "true"
        # Prometheus scraping
        prometheus.io/scrape: "true"
        prometheus.io/port: "9464"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: order-service
          image: order-service:latest
          ports:
            - containerPort: 3000
              name: http
            - containerPort: 9464
              name: metrics
          env:
            - name: SERVICE_NAME
              value: order-service
            - name: SERVICE_VERSION
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['version']
            - name: NODE_ENV
              value: production
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-collector.observability.svc:4318
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 128Mi
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5

Troubleshooting with Observability

Investigation Helper

typescript
// src/utils/troubleshooting.ts
import { trace, context, SpanStatusCode } from '@opentelemetry/api';
import { log } from '../instrumentation/logging';

interface Investigation {
  traceId: string;
  spans: SpanInfo[];
  errors: ErrorInfo[];
  logs: LogEntry[];
  metrics: MetricSnapshot[];
}

interface SpanInfo {
  name: string;
  duration: number;
  status: string;
  attributes: Record<string, unknown>;
}

interface ErrorInfo {
  spanName: string;
  message: string;
  stack?: string;
  timestamp: string;
}

interface LogEntry {
  level: string;
  message: string;
  timestamp: string;
  attributes: Record<string, unknown>;
}

interface MetricSnapshot {
  name: string;
  value: number;
  labels: Record<string, string>;
}

export class TroubleshootingHelper {
  // Wrapper for operations with detailed logging
  async executeWithDiagnostics<T>(
    name: string,
    operation: () => Promise<T>,
    context?: Record<string, unknown>
  ): Promise<T> {
    const tracer = trace.getTracer('troubleshooting');

    return tracer.startActiveSpan(`diagnostic:${name}`, async (span) => {
      const startTime = Date.now();

      log.info(`Starting operation: ${name}`, {
        ...context,
        operation: name,
        phase: 'start',
      });

      try {
        const result = await operation();

        const duration = Date.now() - startTime;
        span.setAttribute('operation.duration_ms', duration);
        span.setStatus({ code: SpanStatusCode.OK });

        log.info(`Completed operation: ${name}`, {
          ...context,
          operation: name,
          phase: 'complete',
          duration_ms: duration,
        });

        return result;
      } catch (error) {
        const duration = Date.now() - startTime;

        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : 'Unknown error',
        });
        span.recordException(error as Error);

        log.error(`Failed operation: ${name}`, {
          ...context,
          operation: name,
          phase: 'error',
          duration_ms: duration,
          error: error instanceof Error ? error.message : String(error),
          stack: error instanceof Error ? error.stack : undefined,
        });

        throw error;
      } finally {
        span.end();
      }
    });
  }

  // Adds breadcrumbs for debugging
  addBreadcrumb(
    category: string,
    message: string,
    data?: Record<string, unknown>
  ): void {
    const span = trace.getActiveSpan();

    if (span) {
      span.addEvent('breadcrumb', {
        'breadcrumb.category': category,
        'breadcrumb.message': message,
        ...Object.entries(data || {}).reduce((acc, [key, value]) => {
          acc[`breadcrumb.data.${key}`] = String(value);
          return acc;
        }, {} as Record<string, string>),
      });
    }

    log.debug(`[${category}] ${message}`, data);
  }

  // Health check with diagnostics
  async runHealthCheck(): Promise<{
    status: 'healthy' | 'degraded' | 'unhealthy';
    checks: Record<string, { status: string; latency: number; error?: string }>;
  }> {
    const checks: Record<string, { status: string; latency: number; error?: string }> = {};

    // Database check
    const dbStart = Date.now();
    try {
      // await database.query('SELECT 1');
      checks.database = { status: 'healthy', latency: Date.now() - dbStart };
    } catch (error) {
      checks.database = {
        status: 'unhealthy',
        latency: Date.now() - dbStart,
        error: error instanceof Error ? error.message : 'Unknown error',
      };
    }

    // Redis check
    const redisStart = Date.now();
    try {
      // await redis.ping();
      checks.redis = { status: 'healthy', latency: Date.now() - redisStart };
    } catch (error) {
      checks.redis = {
        status: 'unhealthy',
        latency: Date.now() - redisStart,
        error: error instanceof Error ? error.message : 'Unknown error',
      };
    }

    // RabbitMQ check
    const mqStart = Date.now();
    try {
      // await rabbitmq.checkConnection();
      checks.rabbitmq = { status: 'healthy', latency: Date.now() - mqStart };
    } catch (error) {
      checks.rabbitmq = {
        status: 'unhealthy',
        latency: Date.now() - mqStart,
        error: error instanceof Error ? error.message : 'Unknown error',
      };
    }

    // Determine overall status
    const unhealthyCount = Object.values(checks).filter(c => c.status === 'unhealthy').length;
    const status = unhealthyCount === 0 ? 'healthy' : unhealthyCount >= 2 ? 'unhealthy' : 'degraded';

    return { status, checks };
  }
}

export const troubleshoot = new TroubleshootingHelper();

Production Checklist

Instrumentation

  • OpenTelemetry SDK configured before any other imports
  • Auto-instrumentation enabled for HTTP, database, and messaging
  • Custom spans for critical business operations
  • Relevant attributes added to spans
  • Errors captured and recorded correctly

Metrics

  • RED metrics (Rate, Errors, Duration) for every endpoint — see the PromQL sketch after this list
  • Business metrics defined
  • Histograms with appropriate buckets
  • Consistent labels across services
  • Label cardinality kept under control
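
For reference, a sketch of the three RED signals in PromQL, assuming the http_requests_total and http_request_duration_seconds instruments defined earlier in this article:

promql
# Rate: requests per second, per service
sum(rate(http_requests_total[5m])) by (service)

# Errors: fraction of 5xx responses
sum(rate(http_requests_total{status_class="5xx"}[5m])) by (service)
  / sum(rate(http_requests_total[5m])) by (service)

# Duration: P95 latency derived from the histogram buckets
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))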

Logs

  • Structured format (JSON)
  • Correlated with the trace ID
  • Appropriate log levels
  • Sensitive data masked — see the sketch after this list
  • Rotation and retention configured
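
A minimal masking sketch that could sit in front of the logger shown earlier; the field list is illustrative and should match your own domain:

typescript
// Hypothetical helper: redact well-known sensitive fields before logging.
const SENSITIVE_KEYS = new Set(['password', 'authorization', 'token', 'creditCard']);

export function maskSensitive(ctx: Record<string, unknown>): Record<string, unknown> {
  const masked: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(ctx)) {
    masked[key] = SENSITIVE_KEYS.has(key) ? '***' : value;
  }
  return masked;
}

// Usage: log.info('User login', maskSensitive({ userId: '42', password: 'hunter2' }));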

Alerts

  • SLOs defined and monitored
  • Alerts for critical metrics
  • Runbooks for every alert
  • Escalation configured
  • Alert tests performed — see the promtool sketch after this list
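
Prometheus alert rules can be unit-tested with promtool test rules. A sketch of a test for the HighErrorRate rule defined earlier (file names and series values are illustrative):

yaml
# alerts_test.yml — run with: promtool test rules alerts_test.yml
rule_files:
  - microservices-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # ~9% of requests fail: above the 5% threshold
      - series: 'http_requests_total{service="orders", status_class="5xx"}'
        values: '0+10x15'
      - series: 'http_requests_total{service="orders", status_class="2xx"}'
        values: '0+100x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
              service: orders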

Infrastructure

  • Collector deployed with high availability
  • Adequate data retention
  • Configuration backups
  • Sampling tuned to the traffic volume
  • Adequate resources for the stack

Conclusion

Observability is the foundation for operating microservices in production with confidence. The key points:

  1. Three Pillars: traces, metrics, and logs work together to provide full visibility
  2. OpenTelemetry: a vendor-neutral standard that simplifies instrumentation
  3. Correlation: the trace ID ties together the logs, metrics, and traces of a single request
  4. SLOs: define clear objectives and monitor error budgets
  5. Smart Alerts: alert on symptoms, not causes

With this series complete, you have all the tools to build robust microservices: the architectural foundations from the microservices guide, an API Gateway with Kong, asynchronous messaging with RabbitMQ, and now end-to-end observability with OpenTelemetry.

About the author

Pedro Farbo

Platform Engineering Lead & Solutions Architect with 10+ years of experience. CEO of Farbo TSC. Specialist in Microservices, Kong, Backstage, and Cloud.

Enjoyed the content? Your contribution helps keep everything online and free!

PIX:0737160d-e98f-4a65-8392-5dba70e7ff3e