Observability with OpenTelemetry: Monitoring Microservices in Production
Implement end-to-end observability in your microservices with OpenTelemetry, Prometheus, and Grafana. Learn how to set up distributed traces, custom metrics, and log correlation for debugging in production.
This is the fourth article in our microservices series. If you haven't read the previous articles yet, check out the microservices guide, API Gateway with Kong, and messaging with RabbitMQ.
Why Observability?
In distributed systems, debugging is exponentially harder. A single request passes through multiple services, each with its own logs, metrics, and state. Without proper observability, finding the root cause of a problem is like looking for a needle in a haystack.
The Three Pillars of Observability
┌───────────────────────────────────────────────────────────┐
│                       OBSERVABILITY                       │
├───────────────────┬───────────────────┬───────────────────┤
│      TRACES       │      METRICS      │       LOGS        │
│                   │                   │                   │
│  ┌─────────────┐  │  ┌─────────────┐  │  ┌─────────────┐  │
│  │ Distributed │  │  │  Counters   │  │  │ Structured  │  │
│  │  requests   │  │  │ Histograms  │  │  │    JSON     │  │
│  │  Latency    │  │  │   Gauges    │  │  │  Context    │  │
│  │   Errors    │  │  │ Percentiles │  │  │  TraceID    │  │
│  └─────────────┘  │  └─────────────┘  │  └─────────────┘  │
│                   │                   │                   │
│ "What happened    │ "How is the       │ "Why did it       │
│  in this          │  system           │  happen?"         │
│  request?"        │  behaving?"       │                   │
└───────────────────┴───────────────────┴───────────────────┘
OpenTelemetry: The Industry Standard
OpenTelemetry (OTel) is a CNCF project that provides APIs, SDKs, and tooling to collect telemetry (traces, metrics, and logs) in a standardized, vendor-neutral way.
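A design detail worth noting before we dive in: the API and the SDK are separate packages. Application and library code depend only on the lightweight @opentelemetry/api package; if no SDK is registered at startup, every call is a cheap no-op. A minimal sketch (the checkout-lib name and checkout function are hypothetical):

// Library code depends only on the vendor-neutral API package.
// Without a registered SDK these calls are no-ops, so libraries can
// instrument themselves without forcing a telemetry backend on users.
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout-lib', '1.0.0');

export function checkout(cartId: string): void {
  const span = tracer.startSpan('checkout'); // hypothetical operation name
  span.setAttribute('cart.id', cartId);
  try {
    // ... business logic ...
  } finally {
    span.end();
  }
}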
OpenTelemetry Architecture
┌─────────────────────────────────────────────────────────────────────┐
│                             APPLICATION                             │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐    │
│  │   Auto-instr.   │   │  Manual instr.  │   │     Baggage     │    │
│  │  (HTTP, gRPC)   │   │    (Custom)     │   │    (Context)    │    │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘    │
│           │                     │                     │             │
│           └─────────────────────┼─────────────────────┘             │
│                                 ▼                                   │
│                      ┌─────────────────────┐                        │
│                      │      OTel SDK       │                        │
│                      │  ┌───────────────┐  │                        │
│                      │  │   Processor   │  │                        │
│                      │  │    Sampler    │  │                        │
│                      │  │   Exporter    │  │                        │
│                      │  └───────────────┘  │                        │
│                      └──────────┬──────────┘                        │
└─────────────────────────────────┼───────────────────────────────────┘
                                  │
                                  ▼
                       ┌─────────────────────┐
                       │   OTel Collector    │
                       │  ┌───────────────┐  │
                       │  │   Receivers   │──┼──► OTLP, Jaeger, Zipkin
                       │  │  Processors   │──┼──► Batch, Filter, Transform
                       │  │   Exporters   │──┼──► Jaeger, Prometheus, Loki
                       │  └───────────────┘  │
                       └──────────┬──────────┘
                                  │
                 ┌────────────────┼────────────────┐
                 ▼                ▼                ▼
          ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
          │   Jaeger    │  │ Prometheus  │  │    Loki     │
          │  (Traces)   │  │  (Metrics)  │  │   (Logs)    │
          └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
                 │                │                │
                 └────────────────┼────────────────┘
                                  ▼
                           ┌─────────────┐
                           │   Grafana   │
                           │ (Dashboard) │
                           └─────────────┘
Project Structure
observability-service/
├── src/
│   ├── instrumentation/
│   │   ├── index.ts               # Main OTel setup
│   │   ├── tracing.ts             # Trace configuration
│   │   ├── metrics.ts             # Metrics configuration
│   │   └── logging.ts             # Logging configuration
│   ├── middleware/
│   │   ├── request-context.ts     # Request context
│   │   ├── metrics.middleware.ts  # HTTP metrics
│   │   └── logging.middleware.ts  # Structured logs
│   ├── utils/
│   │   ├── trace-context.ts       # Trace utilities
│   │   ├── custom-metrics.ts      # Custom metrics
│   │   └── log-formatter.ts       # Log formatting
│   ├── exporters/
│   │   ├── jaeger.ts              # Jaeger exporter
│   │   ├── prometheus.ts          # Prometheus exporter
│   │   └── loki.ts                # Loki exporter
│   └── app.ts
├── docker/
│   ├── otel-collector-config.yaml
│   ├── prometheus.yml
│   ├── loki-config.yaml
│   └── grafana/
│       └── dashboards/
│           └── microservices.json
├── docker-compose.observability.yml
└── package.json
Configuring the OpenTelemetry SDK
Installation
# Core OpenTelemetry
npm install @opentelemetry/api @opentelemetry/sdk-node

# Automatic instrumentation
npm install @opentelemetry/auto-instrumentations-node

# Exporters
npm install @opentelemetry/exporter-trace-otlp-http
npm install @opentelemetry/exporter-metrics-otlp-http
npm install @opentelemetry/exporter-logs-otlp-http

# Resources and semantic conventions
npm install @opentelemetry/resources
npm install @opentelemetry/semantic-conventions

Main Setup
// src/instrumentation/index.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-http';
import { Resource } from '@opentelemetry/resources';
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
  SEMRESATTRS_DEPLOYMENT_ENVIRONMENT,
} from '@opentelemetry/semantic-conventions';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';

// Enable diagnostic logging for debugging
if (process.env.OTEL_DEBUG === 'true') {
  diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);
}

// Resource configuration (identifies the service)
const resource = new Resource({
  [SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown-service',
  [SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
  [SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
  'service.instance.id': process.env.HOSTNAME || 'local',
  'service.namespace': 'microservices',
});

// Exporter configuration
const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces',
  headers: {
    'x-api-key': process.env.OTEL_API_KEY || '',
  },
});

const metricExporter = new OTLPMetricExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/metrics',
});

const logExporter = new OTLPLogExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/logs',
});

// SDK configuration
const sdk = new NodeSDK({
  resource,
  traceExporter,
  metricReader: new PeriodicExportingMetricReader({
    exporter: metricExporter,
    exportIntervalMillis: 15000, // Export metrics every 15s
  }),
  logRecordProcessor: new BatchLogRecordProcessor(logExporter),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Per-instrumentation configuration
      '@opentelemetry/instrumentation-http': {
        requestHook: (span, request) => {
          span.setAttribute('http.request.id', request.headers['x-request-id'] || '');
        },
        responseHook: (span, response) => {
          span.setAttribute('http.response.content_length', response.headers['content-length'] || 0);
        },
        ignoreIncomingRequestHook: (request) => {
          // Ignore health checks
          return request.url === '/health' || request.url === '/ready';
        },
      },
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-pg': {
        enhancedDatabaseReporting: true,
      },
      '@opentelemetry/instrumentation-redis': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-amqplib': {
        enabled: true, // RabbitMQ
      },
    }),
  ],
});

// Initialization
export async function initTelemetry(): Promise<void> {
  try {
    await sdk.start();
    console.log('OpenTelemetry initialized successfully');

    // Graceful shutdown
    process.on('SIGTERM', async () => {
      try {
        await sdk.shutdown();
        console.log('OpenTelemetry shut down successfully');
      } catch (error) {
        console.error('Error shutting down OpenTelemetry', error);
      }
    });
  } catch (error) {
    console.error('Error initializing OpenTelemetry', error);
    throw error;
  }
}

export { sdk };

Application Entry Point
// src/index.ts
import { initTelemetry } from './instrumentation';

// IMPORTANT: initialize telemetry first!
async function bootstrap() {
  await initTelemetry();

  // Only now import the rest of the application
  const { createApp } = await import('./app');
  const app = await createApp();

  const port = process.env.PORT || 3000;
  app.listen(port, () => {
    console.log(`Server running on port ${port}`);
  });
}

bootstrap().catch(console.error);

Distributed Tracing
Distributed tracing lets you follow a single request as it travels through multiple services.
Fundamental Concepts
┌─────────────────────────────────────────────────────────────────┐
│                              TRACE                              │
│                        TraceID: abc123                          │
│                                                                 │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ SPAN: API Gateway (Root Span)                               │ │
│ │ SpanID: span-1, ParentID: null                              │ │
│ │ Duration: 250ms                                             │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ SPAN: User Service                                      │ │ │
│ │ │ SpanID: span-2, ParentID: span-1                        │ │ │
│ │ │ Duration: 50ms                                          │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ SPAN: Order Service                                     │ │ │
│ │ │ SpanID: span-3, ParentID: span-1                        │ │ │
│ │ │ Duration: 150ms                                         │ │ │
│ │ │ ┌─────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ SPAN: Database Query                                │ │ │ │
│ │ │ │ SpanID: span-4, ParentID: span-3                    │ │ │ │
│ │ │ │ Duration: 45ms                                      │ │ │ │
│ │ │ └─────────────────────────────────────────────────────┘ │ │ │
│ │ │ ┌─────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ SPAN: RabbitMQ Publish                              │ │ │ │
│ │ │ │ SpanID: span-5, ParentID: span-3                    │ │ │ │
│ │ │ │ Duration: 10ms                                      │ │ │ │
│ │ │ └─────────────────────────────────────────────────────┘ │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
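Between services, this parent/child relationship travels in the W3C Trace Context traceparent HTTP header, which the auto-instrumentation injects on outgoing calls and extracts on incoming ones. Its layout, using the example IDs from the W3C specification:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             │  │                                │                │
             │  │                                │                └─ flags (01 = sampled)
             │  │                                └─ parent span-id (8 bytes, hex)
             │  └─ trace-id (16 bytes, hex)
             └─ version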
Manual Span Instrumentation
// src/utils/trace-context.ts
import { trace, SpanStatusCode, SpanKind, context, propagation } from '@opentelemetry/api';
import type { Span, SpanOptions, Context } from '@opentelemetry/api';

const tracer = trace.getTracer('microservice-tracer', '1.0.0');

// Decorator for automatic tracing
export function Traced(
  spanName?: string,
  options?: SpanOptions
): MethodDecorator {
  return function (
    target: any,
    propertyKey: string | symbol,
    descriptor: PropertyDescriptor
  ) {
    const originalMethod = descriptor.value;
    const name = spanName || `${target.constructor.name}.${String(propertyKey)}`;

    descriptor.value = async function (...args: any[]) {
      return tracer.startActiveSpan(name, options || {}, async (span: Span) => {
        try {
          // Record parameters as attributes (beware of sensitive data!)
          span.setAttribute('method.arguments.count', args.length);

          const result = await originalMethod.apply(this, args);
          span.setStatus({ code: SpanStatusCode.OK });
          return result;
        } catch (error) {
          span.setStatus({
            code: SpanStatusCode.ERROR,
            message: error instanceof Error ? error.message : 'Unknown error',
          });
          span.recordException(error as Error);
          throw error;
        } finally {
          span.end();
        }
      });
    };

    return descriptor;
  };
}

// Create a span manually
export function createSpan(
  name: string,
  fn: (span: Span) => Promise<any>,
  options?: SpanOptions
): Promise<any> {
  return tracer.startActiveSpan(name, options || {}, async (span) => {
    try {
      const result = await fn(span);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : 'Unknown error',
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Extract/inject context for propagation
export function extractContext(headers: Record<string, string>): Context {
  return propagation.extract(context.active(), headers);
}

export function injectContext(headers: Record<string, string>): void {
  propagation.inject(context.active(), headers);
}

// Add events to a span
export function addSpanEvent(
  eventName: string,
  attributes?: Record<string, string | number | boolean>
): void {
  const span = trace.getActiveSpan();
  if (span) {
    span.addEvent(eventName, attributes);
  }
}

// Get the current trace ID
export function getCurrentTraceId(): string | undefined {
  const span = trace.getActiveSpan();
  return span?.spanContext().traceId;
}

// Get the current span ID
export function getCurrentSpanId(): string | undefined {
  const span = trace.getActiveSpan();
  return span?.spanContext().spanId;
}

Usage in Services
// src/services/order.service.ts
import { Traced, createSpan, addSpanEvent } from '../utils/trace-context';
import { trace, SpanKind } from '@opentelemetry/api';

export class OrderService {
  private readonly tracer = trace.getTracer('order-service');

  @Traced('OrderService.createOrder', { kind: SpanKind.INTERNAL })
  async createOrder(orderData: CreateOrderDTO): Promise<Order> {
    addSpanEvent('order.validation.started');

    // Validation
    await this.validateOrder(orderData);
    addSpanEvent('order.validation.completed');

    // Create a child span for a specific operation
    const order = await createSpan('order.save', async (span) => {
      span.setAttribute('order.items.count', orderData.items.length);
      span.setAttribute('order.total', orderData.total);

      const savedOrder = await this.orderRepository.save(orderData);
      span.setAttribute('order.id', savedOrder.id);
      return savedOrder;
    });

    // Publish event
    await this.publishOrderCreated(order);

    return order;
  }

  @Traced('OrderService.validateOrder')
  private async validateOrder(orderData: CreateOrderDTO): Promise<void> {
    // Validation with automatic spans
    await this.validateStock(orderData.items);
    await this.validatePayment(orderData.paymentMethod);
  }

  private async publishOrderCreated(order: Order): Promise<void> {
    // Span for messaging
    await createSpan(
      'rabbitmq.publish.order_created',
      async (span) => {
        span.setAttribute('messaging.system', 'rabbitmq');
        span.setAttribute('messaging.destination', 'orders.created');
        span.setAttribute('messaging.message_id', order.id);

        await this.messagePublisher.publish('orders.created', {
          orderId: order.id,
          timestamp: new Date().toISOString(),
        });
      },
      { kind: SpanKind.PRODUCER }
    );
  }
}

Context Propagation Between Services
// src/middleware/request-context.ts
import { Request, Response, NextFunction } from 'express';
import { context, propagation, trace } from '@opentelemetry/api';
import { v4 as uuidv4 } from 'uuid';

export interface RequestContext {
  traceId: string;
  spanId: string;
  requestId: string;
  userId?: string;
  correlationId: string;
}

declare global {
  namespace Express {
    interface Request {
      context: RequestContext;
    }
  }
}

export function requestContextMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  // Extract the propagated context (if any)
  const extractedContext = propagation.extract(context.active(), req.headers);

  context.with(extractedContext, () => {
    const span = trace.getActiveSpan();
    const spanContext = span?.spanContext();

    // Build the request context
    req.context = {
      traceId: spanContext?.traceId || uuidv4().replace(/-/g, ''),
      spanId: spanContext?.spanId || uuidv4().replace(/-/g, '').substring(0, 16),
      requestId: req.headers['x-request-id'] as string || uuidv4(),
      userId: req.headers['x-user-id'] as string,
      correlationId: req.headers['x-correlation-id'] as string || uuidv4(),
    };

    // Add response headers for debugging
    res.setHeader('x-trace-id', req.context.traceId);
    res.setHeader('x-request-id', req.context.requestId);

    // Add attributes to the current span
    if (span) {
      span.setAttribute('request.id', req.context.requestId);
      span.setAttribute('correlation.id', req.context.correlationId);
      if (req.context.userId) {
        span.setAttribute('user.id', req.context.userId);
      }
    }

    next();
  });
}

// Helper to propagate context on outgoing HTTP calls
export function getTracingHeaders(): Record<string, string> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  return headers;
}

HTTP Client with Automatic Propagation
// src/utils/http-client.ts
import axios, { AxiosInstance, AxiosRequestConfig } from 'axios';
import { getTracingHeaders, getCurrentTraceId } from './trace-context';

export function createTracedHttpClient(baseURL: string): AxiosInstance {
  const client = axios.create({ baseURL });

  // Interceptor that adds tracing headers
  client.interceptors.request.use((config) => {
    const tracingHeaders = getTracingHeaders();
    config.headers = {
      ...config.headers,
      ...tracingHeaders,
      'x-trace-id': getCurrentTraceId(),
    };
    return config;
  });

  // Interceptor for error logging
  client.interceptors.response.use(
    (response) => response,
    (error) => {
      const traceId = getCurrentTraceId();
      console.error(`HTTP Error [trace: ${traceId}]:`, {
        url: error.config?.url,
        method: error.config?.method,
        status: error.response?.status,
        message: error.message,
      });
      throw error;
    }
  );

  return client;
}

Custom Metrics
Metric Types
// src/instrumentation/metrics.ts
import { metrics, ValueType } from '@opentelemetry/api';

const meter = metrics.getMeter('microservice-metrics', '1.0.0');

// Counter - values that only go up
export const httpRequestsTotal = meter.createCounter('http_requests_total', {
  description: 'Total number of HTTP requests',
  unit: '1',
});

// UpDownCounter - values that can go up or down
export const activeConnections = meter.createUpDownCounter('active_connections', {
  description: 'Number of active connections',
  unit: '1',
});

// Histogram - distribution of values
export const httpRequestDuration = meter.createHistogram('http_request_duration_seconds', {
  description: 'Duration of HTTP requests in seconds',
  unit: 's',
  advice: {
    explicitBucketBoundaries: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  },
});

// Observable Gauge - a current value that is observed
export const memoryUsage = meter.createObservableGauge('process_memory_bytes', {
  description: 'Process memory usage in bytes',
  unit: 'By',
});

memoryUsage.addCallback((result) => {
  const usage = process.memoryUsage();
  result.observe(usage.heapUsed, { type: 'heap_used' });
  result.observe(usage.heapTotal, { type: 'heap_total' });
  result.observe(usage.rss, { type: 'rss' });
  result.observe(usage.external, { type: 'external' });
});

// Observable Counter - an observed, monotonically increasing value
export const cpuUsage = meter.createObservableCounter('process_cpu_seconds_total', {
  description: 'Total CPU time spent in seconds',
  unit: 's',
});

let previousCpuUsage = process.cpuUsage();
cpuUsage.addCallback((result) => {
  const currentCpuUsage = process.cpuUsage(previousCpuUsage);
  result.observe((currentCpuUsage.user + currentCpuUsage.system) / 1e6, {});
  previousCpuUsage = process.cpuUsage();
});

Business Metrics
// src/utils/business-metrics.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('business-metrics', '1.0.0');

// Order metrics
export const ordersCreated = meter.createCounter('orders_created_total', {
  description: 'Total orders created',
});

export const orderValue = meter.createHistogram('order_value_dollars', {
  description: 'Order value distribution',
  unit: 'USD',
  advice: {
    explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
  },
});

export const orderProcessingTime = meter.createHistogram('order_processing_duration_seconds', {
  description: 'Time to process an order',
  unit: 's',
});

// User metrics
export const activeUsers = meter.createUpDownCounter('active_users', {
  description: 'Number of currently active users',
});

export const userRegistrations = meter.createCounter('user_registrations_total', {
  description: 'Total user registrations',
});

// Stock metrics
export const stockLevel = meter.createObservableGauge('stock_level', {
  description: 'Current stock level by product',
});

// Payment metrics
export const paymentAttempts = meter.createCounter('payment_attempts_total', {
  description: 'Total payment attempts',
});

export const paymentAmount = meter.createHistogram('payment_amount_dollars', {
  description: 'Payment amount distribution',
  unit: 'USD',
});

// Helper to record order metrics
export function recordOrderMetrics(order: {
  id: string;
  total: number;
  items: number;
  processingTimeMs: number;
  paymentMethod: string;
  region: string;
}) {
  const labels = {
    payment_method: order.paymentMethod,
    region: order.region,
  };

  ordersCreated.add(1, labels);
  orderValue.record(order.total, labels);
  orderProcessingTime.record(order.processingTimeMs / 1000, labels);
}

HTTP Metrics Middleware
// src/middleware/metrics.middleware.ts
import { Request, Response, NextFunction } from 'express';
import { httpRequestsTotal, httpRequestDuration, activeConnections } from '../instrumentation/metrics';

export function metricsMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  const startTime = process.hrtime.bigint();

  // Increment active connections
  activeConnections.add(1);

  // Common labels
  const labels = {
    method: req.method,
    route: req.route?.path || req.path,
    host: req.hostname,
  };

  // When the response finishes
  res.on('finish', () => {
    const endTime = process.hrtime.bigint();
    const durationSeconds = Number(endTime - startTime) / 1e9;

    const finalLabels = {
      ...labels,
      status_code: res.statusCode.toString(),
      status_class: `${Math.floor(res.statusCode / 100)}xx`,
    };

    // Record metrics
    httpRequestsTotal.add(1, finalLabels);
    httpRequestDuration.record(durationSeconds, finalLabels);
    activeConnections.add(-1);
  });

  // Error/timeout case
  res.on('close', () => {
    if (!res.writableEnded) {
      activeConnections.add(-1);
    }
  });

  next();
}

Structured Logs
Logger Configuration
// src/instrumentation/logging.ts
import { logs, SeverityNumber } from '@opentelemetry/api-logs';
import { trace, context } from '@opentelemetry/api';
import pino from 'pino';

const logger = logs.getLogger('microservice-logger', '1.0.0');

// OpenTelemetry severity levels
const severityMap: Record<string, SeverityNumber> = {
  trace: SeverityNumber.TRACE,
  debug: SeverityNumber.DEBUG,
  info: SeverityNumber.INFO,
  warn: SeverityNumber.WARN,
  error: SeverityNumber.ERROR,
  fatal: SeverityNumber.FATAL,
};

export interface LogContext {
  [key: string]: unknown;
}

export function createLogger(serviceName: string) {
  // Pino for local/console logs
  const pinoLogger = pino({
    level: process.env.LOG_LEVEL || 'info',
    formatters: {
      level: (label) => ({ level: label }),
      bindings: () => ({}),
    },
    timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
    base: {
      service: serviceName,
      environment: process.env.NODE_ENV,
    },
  });

  return {
    trace: (message: string, ctx?: LogContext) => log('trace', message, ctx),
    debug: (message: string, ctx?: LogContext) => log('debug', message, ctx),
    info: (message: string, ctx?: LogContext) => log('info', message, ctx),
    warn: (message: string, ctx?: LogContext) => log('warn', message, ctx),
    error: (message: string, ctx?: LogContext) => log('error', message, ctx),
    fatal: (message: string, ctx?: LogContext) => log('fatal', message, ctx),
    child: (bindings: Record<string, unknown>) => {
      return createChildLogger(serviceName, bindings);
    },
  };

  function log(level: string, message: string, ctx?: LogContext) {
    // Console log via Pino
    pinoLogger[level as keyof typeof pinoLogger]({ ...ctx }, message);

    // OpenTelemetry log record
    const span = trace.getActiveSpan();
    const spanContext = span?.spanContext();

    logger.emit({
      severityNumber: severityMap[level],
      severityText: level.toUpperCase(),
      body: message,
      attributes: {
        'service.name': serviceName,
        'log.level': level,
        ...(spanContext && {
          'trace_id': spanContext.traceId,
          'span_id': spanContext.spanId,
        }),
        ...flattenObject(ctx || {}),
      },
    });
  }
}

function createChildLogger(serviceName: string, bindings: Record<string, unknown>) {
  const parentLogger = createLogger(serviceName);

  return {
    trace: (message: string, ctx?: LogContext) => parentLogger.trace(message, { ...bindings, ...ctx }),
    debug: (message: string, ctx?: LogContext) => parentLogger.debug(message, { ...bindings, ...ctx }),
    info: (message: string, ctx?: LogContext) => parentLogger.info(message, { ...bindings, ...ctx }),
    warn: (message: string, ctx?: LogContext) => parentLogger.warn(message, { ...bindings, ...ctx }),
    error: (message: string, ctx?: LogContext) => parentLogger.error(message, { ...bindings, ...ctx }),
    fatal: (message: string, ctx?: LogContext) => parentLogger.fatal(message, { ...bindings, ...ctx }),
    child: (newBindings: Record<string, unknown>) => createChildLogger(serviceName, { ...bindings, ...newBindings }),
  };
}

// Flatten nested objects into attributes
function flattenObject(
  obj: Record<string, unknown>,
  prefix = ''
): Record<string, string | number | boolean> {
  const result: Record<string, string | number | boolean> = {};

  for (const [key, value] of Object.entries(obj)) {
    const newKey = prefix ? `${prefix}.${key}` : key;

    if (value && typeof value === 'object' && !Array.isArray(value)) {
      Object.assign(result, flattenObject(value as Record<string, unknown>, newKey));
    } else if (typeof value === 'string' || typeof value === 'number' || typeof value === 'boolean') {
      result[newKey] = value;
    } else if (value !== undefined && value !== null) {
      result[newKey] = String(value);
    }
  }

  return result;
}

export const log = createLogger(process.env.SERVICE_NAME || 'unknown-service');

Logging Middleware
// src/middleware/logging.middleware.ts
import { Request, Response, NextFunction } from 'express';
import { log } from '../instrumentation/logging';
import { getCurrentTraceId, getCurrentSpanId } from '../utils/trace-context';

export function loggingMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  const startTime = Date.now();

  // Contextual logger for this request
  const requestLogger = log.child({
    requestId: req.context?.requestId,
    traceId: getCurrentTraceId(),
    spanId: getCurrentSpanId(),
    method: req.method,
    path: req.path,
    userAgent: req.headers['user-agent'],
    ip: req.ip,
  });

  // Log the start of the request
  requestLogger.info('Request started', {
    query: req.query,
    params: req.params,
    // Do not log the body, to avoid leaking sensitive data
  });

  // Capture the response body
  const originalSend = res.send;
  res.send = function (body: any) {
    res.locals.body = body;
    return originalSend.call(this, body);
  };

  // Log when the response finishes
  res.on('finish', () => {
    const duration = Date.now() - startTime;
    const logContext = {
      statusCode: res.statusCode,
      duration,
      contentLength: res.get('content-length'),
    };

    if (res.statusCode >= 500) {
      requestLogger.error('Request failed', logContext);
    } else if (res.statusCode >= 400) {
      requestLogger.warn('Request client error', logContext);
    } else {
      requestLogger.info('Request completed', logContext);
    }
  });

  // Expose the logger on the request
  (req as any).log = requestLogger;

  next();
}

Correlating Logs ↔ Traces
// src/utils/log-formatter.ts
import { getCurrentTraceId, getCurrentSpanId } from './trace-context';

export interface CorrelatedLogEntry {
  timestamp: string;
  level: string;
  message: string;
  traceId?: string;
  spanId?: string;
  service: string;
  [key: string]: unknown;
}

export function formatLogEntry(
  level: string,
  message: string,
  context: Record<string, unknown> = {}
): CorrelatedLogEntry {
  return {
    timestamp: new Date().toISOString(),
    level,
    message,
    traceId: getCurrentTraceId(),
    spanId: getCurrentSpanId(),
    service: process.env.SERVICE_NAME || 'unknown',
    ...context,
  };
}

// Example usage in an error handler
export function logError(error: Error, context?: Record<string, unknown>): void {
  const entry = formatLogEntry('error', error.message, {
    errorName: error.name,
    stack: error.stack,
    ...context,
  });

  console.error(JSON.stringify(entry));
}

OpenTelemetry Collector
The Collector is the central hub that receives, processes, and exports telemetry.
Collector Configuration
# docker/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Receiver for host metrics
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

  # Prometheus receiver (pull-based)
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 15s
          static_configs:
            - targets: ['localhost:8888']

processors:
  # Batch data for efficiency
  batch:
    timeout: 5s
    send_batch_size: 1000
    send_batch_max_size: 2000

  # Add metadata
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
      - key: collector.version
        value: "0.91.0"
        action: upsert

  # Scrub sensitive data
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: hash
      - key: user.email
        pattern: ^.*@
        action: hash

  # Sampling to reduce volume
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail-based sampling (keeps traces with errors)
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep traces with errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces (>2s)
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      # Sample 10% of normal traces
      - name: normal-sampling
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

  # Transformations
  transform:
    trace_statements:
      - context: span
        statements:
          - set(attributes["processed_by"], "otel-collector")
          - truncate_all(attributes, 256)

exporters:
  # Debug (development only)
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 100

  # Jaeger for traces
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Prometheus for metrics
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: microservices
    const_labels:
      environment: production
    resource_to_telemetry_conversion:
      enabled: true

  # Loki for logs
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    tenant_id: microservices
    labels:
      resource:
        service.name: "service"
        service.namespace: "namespace"
      attributes:
        log.level: "level"

  # Generic OTLP (could be Grafana Cloud, Honeycomb, etc.)
  otlp/cloud:
    endpoint: ${OTLP_ENDPOINT}
    headers:
      Authorization: Bearer ${OTLP_TOKEN}

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, attributes, tail_sampling]
      exporters: [otlp/jaeger, otlp/cloud]
    metrics:
      receivers: [otlp, hostmetrics, prometheus]
      processors: [batch, resource]
      exporters: [prometheus, otlp/cloud]
    logs:
      receivers: [otlp]
      processors: [batch, resource, attributes]
      exporters: [loki, otlp/cloud]
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888

The Complete Observability Stack
Docker Compose
# docker-compose.observability.yml
version: '3.8'

services:
  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.91.0
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./docker/otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Prometheus metrics exposed by the collector
      - "8889:8889"   # Prometheus exporter metrics
      - "13133:13133" # Health check
      - "55679:55679" # zPages
    environment:
      - OTLP_ENDPOINT=${OTLP_ENDPOINT:-}
      - OTLP_TOKEN=${OTLP_TOKEN:-}
    depends_on:
      - jaeger
      - prometheus
      - loki
    networks:
      - observability

  # Jaeger - Distributed Tracing
  jaeger:
    image: jaegertracing/all-in-one:1.52
    container_name: jaeger
    ports:
      - "16686:16686" # UI
      - "14268:14268" # HTTP collector
      - "14250:14250" # gRPC collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=badger
      - BADGER_EPHEMERAL=false
      - BADGER_DIRECTORY_VALUE=/badger/data
      - BADGER_DIRECTORY_KEY=/badger/key
    volumes:
      - jaeger-data:/badger
    networks:
      - observability

  # Prometheus - Metrics
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
      - '--web.enable-remote-write-receiver'
    volumes:
      - ./docker/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - observability

  # Loki - Log Aggregation
  loki:
    image: grafana/loki:2.9.2
    container_name: loki
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - ./docker/loki-config.yaml:/etc/loki/loki-config.yaml
      - loki-data:/loki
    ports:
      - "3100:3100"
    networks:
      - observability

  # Grafana - Visualization
  grafana:
    image: grafana/grafana:10.2.2
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor
    volumes:
      - grafana-data:/var/lib/grafana
      - ./docker/grafana/provisioning:/etc/grafana/provisioning
      - ./docker/grafana/dashboards:/var/lib/grafana/dashboards
    ports:
      - "3001:3000"
    depends_on:
      - prometheus
      - loki
      - jaeger
    networks:
      - observability

  # Alertmanager - Alerts
  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    volumes:
      - ./docker/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - observability

volumes:
  jaeger-data:
  prometheus-data:
  loki-data:
  grafana-data:
  alertmanager-data:

networks:
  observability:
    driver: bridge

Prometheus Configuration
# docker/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'microservices-cluster'
    env: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - '/etc/prometheus/rules/*.yml'

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # OpenTelemetry Collector
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

  # Microservices (via service discovery or static targets)
  - job_name: 'microservices'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - microservices
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

Loki Configuration
# docker/loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://alertmanager:9093

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_streams_per_user: 10000
  max_line_size: 256kb

table_manager:
  retention_deletes_enabled: true
  retention_period: 336h

Grafana Dashboards
Microservices Dashboard
{ "dashboard": { "title": "Microservices Overview", "tags": ["microservices", "observability"], "timezone": "browser", "panels": [ { "title": "Request Rate", "type": "timeseries", "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }, "targets": [ { "expr": "sum(rate(http_requests_total[5m])) by (service)", "legendFormat": "{{service}}" } ] }, { "title": "Error Rate", "type": "timeseries", "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }, "targets": [ { "expr": "sum(rate(http_requests_total{status_class='5xx'}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100", "legendFormat": "{{service}}" } ], "fieldConfig": { "defaults": { "unit": "percent", "thresholds": { "steps": [ { "color": "green", "value": null }, { "color": "yellow", "value": 1 }, { "color": "red", "value": 5 } ] } } } }, { "title": "P99 Latency", "type": "timeseries", "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }, "targets": [ { "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))", "legendFormat": "{{service}}" } ], "fieldConfig": { "defaults": { "unit": "s" } } }, { "title": "Active Connections", "type": "stat", "gridPos": { "h": 4, "w": 6, "x": 12, "y": 8 }, "targets": [ { "expr": "sum(active_connections)" } ] }, { "title": "Service Map", "type": "nodeGraph", "gridPos": { "h": 12, "w": 24, "x": 0, "y": 16 }, "datasource": "Jaeger", "targets": [ { "queryType": "serviceMap" } ] } ] }}Alertas e SLOs
Prometheus Alerting Rules
# docker/prometheus/rules/microservices-alerts.yml
groups:
  - name: microservices.rules
    interval: 30s
    rules:
      # SLI: Availability
      - record: sli:availability:rate5m
        expr: |
          sum(rate(http_requests_total{status_class!="5xx"}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # SLI: Latency (P99 < 500ms)
      - record: sli:latency_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )

  - name: microservices.alerts
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_class="5xx"}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ printf \"%.2f\" $value }}% (threshold: 5%)"

      # High Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency on {{ $labels.service }}"
          description: "P99 latency is {{ printf \"%.3f\" $value }}s (threshold: 500ms)"

      # Service Down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

      # Memory Usage High
      - alert: HighMemoryUsage
        expr: |
          (process_memory_bytes{type="heap_used"}
           /
           process_memory_bytes{type="heap_total"}) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.service }}"
          description: "Heap usage is {{ printf \"%.1f\" $value }}%"

      # Dead Letter Queue Growing
      - alert: DLQGrowing
        expr: |
          increase(rabbitmq_queue_messages{queue=~".*\\.dlq"}[1h]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Dead Letter Queue growing"
          description: "DLQ {{ $labels.queue }} has {{ $value }} new messages in the last hour"

      # SLO Breach Risk
      - alert: SLOBreachRisk
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status_class!="5xx"}[30m])) by (service)
              /
              sum(rate(http_requests_total[30m])) by (service)
            )
          ) > (1 - 0.999) * 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "SLO breach risk for {{ $labels.service }}"
          description: "Error budget burn rate is 2x normal, risking monthly SLO breach"

SLO Dashboard
// src/slo/slo-calculator.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('slo-metrics', '1.0.0');

interface SLOConfig {
  name: string;
  target: number; // e.g. 0.999 for 99.9%
  windowDays: number;
}

interface SLOStatus {
  name: string;
  target: number;
  current: number;
  errorBudget: number;
  errorBudgetRemaining: number;
  isBreached: boolean;
}

export class SLOCalculator {
  private sloGauge = meter.createObservableGauge('slo_current', {
    description: 'Current SLO value',
  });

  private errorBudgetGauge = meter.createObservableGauge('slo_error_budget_remaining', {
    description: 'Remaining error budget percentage',
  });

  constructor(private slos: SLOConfig[]) {
    this.setupMetrics();
  }

  private setupMetrics(): void {
    this.sloGauge.addCallback(async (result) => {
      for (const slo of this.slos) {
        const status = await this.calculateSLO(slo);
        result.observe(status.current, { slo_name: slo.name });
      }
    });

    this.errorBudgetGauge.addCallback(async (result) => {
      for (const slo of this.slos) {
        const status = await this.calculateSLO(slo);
        result.observe(status.errorBudgetRemaining, { slo_name: slo.name });
      }
    });
  }

  async calculateSLO(config: SLOConfig): Promise<SLOStatus> {
    // In production this would query Prometheus or another metrics source
    const totalRequests = await this.getTotalRequests(config.windowDays);
    const successfulRequests = await this.getSuccessfulRequests(config.windowDays);

    const current = totalRequests > 0 ? successfulRequests / totalRequests : 1;
    const errorBudget = 1 - config.target;
    const errorsAllowed = totalRequests * errorBudget;
    const actualErrors = totalRequests - successfulRequests;
    const errorBudgetRemaining = Math.max(0, (errorsAllowed - actualErrors) / errorsAllowed);

    return {
      name: config.name,
      target: config.target,
      current,
      errorBudget,
      errorBudgetRemaining,
      isBreached: current < config.target,
    };
  }

  private async getTotalRequests(windowDays: number): Promise<number> {
    // A real implementation would query Prometheus
    return 1000000; // Placeholder
  }

  private async getSuccessfulRequests(windowDays: number): Promise<number> {
    // A real implementation would query Prometheus
    return 999500; // Placeholder
  }
}

// Usage
const sloCalculator = new SLOCalculator([
  { name: 'api_availability', target: 0.999, windowDays: 30 },
  { name: 'api_latency_p99', target: 0.99, windowDays: 30 },
  { name: 'order_processing', target: 0.995, windowDays: 30 },
]);

Deploying to Kubernetes
The OpenTelemetry Operator
# k8s/otel-operator.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: observability
spec:
  mode: deployment
  replicas: 2
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 1s
        limit_mib: 1000
        spike_limit_mib: 200
    exporters:
      otlp/jaeger:
        endpoint: jaeger-collector.observability.svc:4317
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi

---
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: auto-instrumentation
  namespace: microservices
spec:
  exporter:
    endpoint: http://otel-collector.observability.svc:4318
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  nodejs:
    env:
      - name: OTEL_TRACES_EXPORTER
        value: otlp
      - name: OTEL_METRICS_EXPORTER
        value: otlp
      - name: OTEL_LOGS_EXPORTER
        value: otlp

Deployment with Auto-Instrumentation
# k8s/microservice-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: microservices
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
      annotations:
        # Enable OTel Operator auto-instrumentation
        instrumentation.opentelemetry.io/inject-nodejs: "true"
        # Prometheus scraping
        prometheus.io/scrape: "true"
        prometheus.io/port: "9464"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: order-service
          image: order-service:latest
          ports:
            - containerPort: 3000
              name: http
            - containerPort: 9464
              name: metrics
          env:
            - name: SERVICE_NAME
              value: order-service
            - name: SERVICE_VERSION
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['version']
            - name: NODE_ENV
              value: production
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-collector.observability.svc:4318
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 128Mi
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5

Troubleshooting with Observability
Investigation Helpers
// src/utils/troubleshooting.ts
import { trace, context, SpanStatusCode } from '@opentelemetry/api';
import { log } from '../instrumentation/logging';

interface Investigation {
  traceId: string;
  spans: SpanInfo[];
  errors: ErrorInfo[];
  logs: LogEntry[];
  metrics: MetricSnapshot[];
}

interface SpanInfo {
  name: string;
  duration: number;
  status: string;
  attributes: Record<string, unknown>;
}

interface ErrorInfo {
  spanName: string;
  message: string;
  stack?: string;
  timestamp: string;
}

interface LogEntry {
  level: string;
  message: string;
  timestamp: string;
  attributes: Record<string, unknown>;
}

interface MetricSnapshot {
  name: string;
  value: number;
  labels: Record<string, string>;
}

export class TroubleshootingHelper {
  // Wrapper for operations with detailed logging
  async executeWithDiagnostics<T>(
    name: string,
    operation: () => Promise<T>,
    context?: Record<string, unknown>
  ): Promise<T> {
    const tracer = trace.getTracer('troubleshooting');

    return tracer.startActiveSpan(`diagnostic:${name}`, async (span) => {
      const startTime = Date.now();

      log.info(`Starting operation: ${name}`, {
        ...context,
        operation: name,
        phase: 'start',
      });

      try {
        const result = await operation();
        const duration = Date.now() - startTime;

        span.setAttribute('operation.duration_ms', duration);
        span.setStatus({ code: SpanStatusCode.OK });

        log.info(`Completed operation: ${name}`, {
          ...context,
          operation: name,
          phase: 'complete',
          duration_ms: duration,
        });

        return result;
      } catch (error) {
        const duration = Date.now() - startTime;

        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : 'Unknown error',
        });
        span.recordException(error as Error);

        log.error(`Failed operation: ${name}`, {
          ...context,
          operation: name,
          phase: 'error',
          duration_ms: duration,
          error: error instanceof Error ? error.message : String(error),
          stack: error instanceof Error ? error.stack : undefined,
        });

        throw error;
      } finally {
        span.end();
      }
    });
  }

  // Add breadcrumbs for debugging
  addBreadcrumb(
    category: string,
    message: string,
    data?: Record<string, unknown>
  ): void {
    const span = trace.getActiveSpan();
    if (span) {
      span.addEvent('breadcrumb', {
        'breadcrumb.category': category,
        'breadcrumb.message': message,
        ...Object.entries(data || {}).reduce((acc, [key, value]) => {
          acc[`breadcrumb.data.${key}`] = String(value);
          return acc;
        }, {} as Record<string, string>),
      });
    }

    log.debug(`[${category}] ${message}`, data);
  }

  // Health check with diagnostics
  async runHealthCheck(): Promise<{
    status: 'healthy' | 'degraded' | 'unhealthy';
    checks: Record<string, { status: string; latency: number; error?: string }>;
  }> {
    const checks: Record<string, { status: string; latency: number; error?: string }> = {};

    // Database check
    const dbStart = Date.now();
    try {
      // await database.query('SELECT 1');
      checks.database = { status: 'healthy', latency: Date.now() - dbStart };
    } catch (error) {
      checks.database = {
        status: 'unhealthy',
        latency: Date.now() - dbStart,
        error: error instanceof Error ? error.message : 'Unknown error',
      };
    }

    // Redis check
    const redisStart = Date.now();
    try {
      // await redis.ping();
      checks.redis = { status: 'healthy', latency: Date.now() - redisStart };
    } catch (error) {
      checks.redis = {
        status: 'unhealthy',
        latency: Date.now() - redisStart,
        error: error instanceof Error ? error.message : 'Unknown error',
      };
    }

    // RabbitMQ check
    const mqStart = Date.now();
    try {
      // await rabbitmq.checkConnection();
      checks.rabbitmq = { status: 'healthy', latency: Date.now() - mqStart };
    } catch (error) {
      checks.rabbitmq = {
        status: 'unhealthy',
        latency: Date.now() - mqStart,
        error: error instanceof Error ? error.message : 'Unknown error',
      };
    }

    // Determine the overall status
    const unhealthyCount = Object.values(checks).filter(c => c.status === 'unhealthy').length;
    const status = unhealthyCount === 0 ? 'healthy' : unhealthyCount >= 2 ? 'unhealthy' : 'degraded';

    return { status, checks };
  }
}

export const troubleshoot = new TroubleshootingHelper();

Production Checklist
Instrumentation
- OpenTelemetry SDK configured before any other imports
- Auto-instrumentation enabled for HTTP, databases, and messaging
- Custom spans for critical business operations
- Relevant attributes added to spans
- Errors captured and recorded correctly
Metrics
- RED metrics (rate, errors, duration) for all endpoints
- Business metrics defined
- Histograms with appropriate buckets
- Consistent labels across services
- Label cardinality kept under control (see the sketch below)
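On that last item, a minimal sketch of what controlled cardinality means in practice (the recordRequest helper is hypothetical): record the route template, never the raw URL, because every distinct label value creates a new time series.

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('cardinality-example');
const requests = meter.createCounter('http_requests_total');

function recordRequest(routeTemplate: string, statusCode: number) {
  requests.add(1, {
    route: routeTemplate,            // '/orders/:id' — a small, bounded set
    status_code: String(statusCode), // bounded (a few dozen possible values)
    // Never: url, user_id, session_id — unbounded values explode the
    // number of time series and can overwhelm Prometheus.
  });
}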
Logs
- Structured format (JSON)
- Correlated with trace IDs
- Appropriate log levels
- Sensitive data masked (see the sketch below)
- Rotation and retention configured
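For the masking item, a minimal sketch assuming a deny-list of field names (adapt the list to your own payloads), applied to the log context before an entry is emitted:

const SENSITIVE_KEYS = new Set(['password', 'authorization', 'token', 'card_number']);

// Recursively replaces sensitive values before the entry is logged.
function maskSensitive(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      out[key] = '***';
    } else if (value && typeof value === 'object' && !Array.isArray(value)) {
      out[key] = maskSensitive(value as Record<string, unknown>);
    } else {
      out[key] = value;
    }
  }
  return out;
}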
Alerts
- SLOs defined and monitored
- Alerts for critical metrics
- Runbooks for every alert
- Escalation configured
- Alerts tested
Infrastructure
- Highly available Collector
- Adequate data retention
- Configuration backups
- Sampling tuned for the expected volume
- Adequate resources for the stack
Conclusion
Observability is the foundation for operating microservices in production with confidence. The key takeaways:
- Three Pillars: traces, metrics, and logs work together to give complete visibility
- OpenTelemetry: a vendor-neutral standard that simplifies instrumentation
- Correlation: the trace ID ties together the logs, metrics, and traces of a single request
- SLOs: define clear objectives and monitor error budgets (see the worked example after this list)
- Smart Alerts: alert on symptoms, not causes
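To make the SLO point concrete, the error budget arithmetic behind the SLOBreachRisk alert shown earlier is simple:

// Error budget for a 99.9% availability SLO over a 30-day window.
const target = 0.999;
const windowMinutes = 30 * 24 * 60;                      // 43,200 minutes
const errorBudgetMinutes = (1 - target) * windowMinutes; // ≈ 43.2 minutes

// Burning the budget at 2× the sustainable rate (the alert's threshold)
// would exhaust it in half the window — roughly 15 days.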
With this series complete, you have all the tools you need to build robust microservices:
- Microservices Architecture - Fundamentals and patterns
- API Gateway with Kong - Traffic management
- Messaging with RabbitMQ - Asynchronous communication
- Observability with OpenTelemetry (this article) - Monitoring and debugging