Observability

Status: 🌱

Motivation

Shorten incident resolution time by correlating logs, metrics, and traces with consistent context.

Connections

Practical Stack

  • Instrumentation standard: OpenTelemetry (OTel) for traces, metrics, and log metadata.
  • Collectors/agents: Fluent Bit, or Grafana Alloy (Grafana's OTel Collector distribution).
  • Logs backend: Loki.
  • Visualization and alerting: Grafana.

Data Flow (Simple Baseline)

  1. Services emit logs and traces with consistent context (service.name, env, version, trace_id, span_id).
  2. Runtime/system metrics are scraped or exported (Prometheus-style where possible).
  3. Fluent Bit/Alloy collects and enriches telemetry.
  4. Logs are sent to Loki; metrics and traces are sent to their respective backends.
  5. Grafana dashboards and alerts correlate logs, metrics, and traces for triage.
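
To make step 4 concrete, here is a minimal sketch of the JSON body shape that Loki's HTTP push endpoint (/loki/api/v1/push) accepts: a list of streams, each with a label set and [nanosecond timestamp, line] pairs. In the flow above the collector builds this for you; the helper name build_loki_push_payload and the label names are assumptions for illustration, useful mainly for smoke tests.

```python
import json
import time
from typing import Optional


def build_loki_push_payload(line: dict, labels: dict, ts_ns: Optional[int] = None) -> dict:
    """Build a body in the shape of Loki's push API (hypothetical helper)."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    return {
        "streams": [
            {
                # Stream labels: keep these low-cardinality (service, env).
                "stream": labels,
                # Each value is [timestamp in ns as a string, log line as a string].
                "values": [[str(ts_ns), json.dumps(line, separators=(",", ":"))]],
            }
        ]
    }


payload = build_loki_push_payload(
    {"level": "error", "message": "payment failed", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
    labels={"service_name": "checkout", "env": "prod"},
)
print(json.dumps(payload, indent=2))
```

Note that trace_id stays inside the log line rather than becoming a label, since per-request values as labels would create a very large number of streams.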

Log Structure Guidelines

  • Use structured JSON logs by default (avoid free-text-only logs).
  • Include required fields: timestamp, level, message, service, environment, version, request/trace correlation IDs.
  • Normalize severity (debug, info, warn, error) and avoid custom variants.
  • Add domain context (tenant, job_id, order_id) only when it improves troubleshooting.
  • Redact secrets and PII at source and again in the collection pipeline.
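
The guidelines above can be sketched with Python's standard logging module. This is one possible shape, not a prescribed implementation: the SERVICE_CONTEXT values, the key-based redaction rule, and the "context" extra field are all assumptions made for the example.

```python
import json
import logging
import re
import sys
from datetime import datetime, timezone

# Hypothetical static context; in practice this would come from deployment config.
SERVICE_CONTEXT = {"service": "checkout", "environment": "prod", "version": "1.4.2"}

# Map Python level names onto the normalized severities (debug, info, warn, error).
LEVEL_MAP = {"DEBUG": "debug", "INFO": "info", "WARNING": "warn",
             "ERROR": "error", "CRITICAL": "error"}

# Assumed redaction rule: mask values whose key name looks sensitive.
SENSITIVE_KEY = re.compile(r"password|secret|token|authorization", re.IGNORECASE)


def redact(fields: dict) -> dict:
    return {k: "[REDACTED]" if SENSITIVE_KEY.search(k) else v for k, v in fields.items()}


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": LEVEL_MAP.get(record.levelname, "info"),
            "message": record.getMessage(),
            **SERVICE_CONTEXT,
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        # Domain context is optional and never overrides a required field.
        for key, value in redact(getattr(record, "context", {})).items():
            entry.setdefault(key, value)
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.error(
    "payment failed",
    extra={
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "span_id": "00f067aa0ba902b7",
        "context": {"order_id": "o-123", "card_token": "tok_abc"},
    },
)
```

Redacting by key name only catches what the pattern anticipates, which is why the last guideline asks for a second pass in the collection pipeline.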

OpenTelemetry Alignment Checklist

  • Define semantic conventions per service type (HTTP, gRPC, messaging, DB).
  • Propagate context across async boundaries and queues.
  • Capture RED/USE-style metrics for key services and dependencies.
  • Sample traces intentionally (head- or tail-based) based on traffic and cost.
  • Treat telemetry schema changes as versioned contracts.
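
Two checklist items, context propagation across queues and head sampling, can be illustrated without any SDK. The sketch below hand-rolls the W3C traceparent header format (version, trace ID, parent span ID, flags) and carries it in message headers. In a real service the OpenTelemetry SDK's propagators would do this; the function names, the message envelope, and the simple ratio sampler here are assumptions for illustration only.

```python
import queue
import re
import secrets
from typing import Optional

# W3C traceparent: version "00", 32-hex trace ID, 16-hex parent ID, 2-hex flags.
TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def should_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head sampling: the same trace ID always gets the same answer."""
    return int(trace_id[:8], 16) < ratio * 2**32


def new_trace_context(ratio: float = 1.0) -> dict:
    trace_id = secrets.token_hex(16)
    return {"trace_id": trace_id, "span_id": secrets.token_hex(8),
            "sampled": should_sample(trace_id, ratio)}


def inject(ctx: dict, headers: dict) -> None:
    flags = "01" if ctx["sampled"] else "00"
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-{flags}"


def extract(headers: dict) -> Optional[dict]:
    match = TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None  # Missing or malformed header: the caller starts a new trace.
    trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,                # Same trace on both sides of the queue.
        "parent_span_id": parent_span_id,    # Links the consumer span to the producer.
        "span_id": secrets.token_hex(8),     # New span for the consumer's work.
        "sampled": int(flags, 16) & 1 == 1,  # Honor the upstream sampling decision.
    }


def publish(q: queue.Queue, body: dict, ctx: dict) -> None:
    headers: dict = {}
    inject(ctx, headers)
    q.put({"headers": headers, "body": body})


def consume(q: queue.Queue):
    message = q.get()
    return message["body"], extract(message["headers"])


q = queue.Queue()
producer_ctx = new_trace_context(ratio=1.0)
publish(q, {"order_id": "o-123"}, producer_ctx)
body, consumer_ctx = consume(q)
print(consumer_ctx["trace_id"] == producer_ctx["trace_id"])
```

Because the sampling flag travels with the message, the consumer does not re-decide; that is what keeps a head-sampled trace complete across the async boundary. Tail sampling, by contrast, is decided after spans are collected and would live in the collector rather than in service code.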

Fluent Bit vs Alloy (Quick Choice)

  • Fluent Bit: lightweight and strong for log collection/forwarding.
  • Alloy: broader telemetry pipeline with OTel-native workflows and Grafana ecosystem integration.
  • Both are valid; choose based on whether you need logs-only simplicity or unified telemetry pipelines.