Observability
Status: 🌱
Motivation
Shorten incident resolution time by correlating logs, metrics, and traces with consistent context.
Connections
Practical Stack
- Instrumentation standard: OpenTelemetry (OTel) for traces, metrics, and log metadata.
- Collectors/agents: Fluent Bit or Grafana Alloy (OTel Collector distro).
- Logs backend: Loki.
- Visualization and alerting: Grafana.
Data Flow (Simple Baseline)
- Services emit logs and traces with consistent context (service.name, env, version, trace_id, span_id).
- Runtime/system metrics are scraped or exported (Prometheus-style where possible).
- Fluent Bit/Alloy collects and enriches telemetry.
- Logs are sent to Loki; metrics/traces are sent to their observability backends.
- Grafana dashboards and alerts correlate logs, metrics, and traces for triage.
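As a rough illustration of the first bullet, the sketch below stamps the same context fields onto a log line and a span record so the two can be joined on trace_id later. The `TelemetryContext` class, the helper names, and the sample values ("checkout", "prod", "1.4.2") are hypothetical and are not part of any OTel SDK. In practice an OTel SDK would normally generate and carry these IDs.

```python
import json
import secrets
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TelemetryContext:
    """Shared fields for every log line and span (hypothetical helper)."""
    service_name: str
    env: str
    version: str
    trace_id: str
    span_id: str

    def as_fields(self) -> dict:
        fields = asdict(self)
        # Emit the dotted key used in this note; other keys stay unchanged.
        fields["service.name"] = fields.pop("service_name")
        return fields


def new_context(service_name: str, env: str, version: str) -> TelemetryContext:
    # 16-byte trace ID and 8-byte span ID, hex-encoded (the W3C Trace Context sizes).
    return TelemetryContext(
        service_name, env, version,
        trace_id=secrets.token_hex(16),
        span_id=secrets.token_hex(8),
    )


def log_record(ctx: TelemetryContext, level: str, message: str) -> str:
    """Serialize one log line carrying the shared context."""
    return json.dumps({"level": level, "message": message, **ctx.as_fields()})


def span_record(ctx: TelemetryContext, name: str) -> dict:
    """Build a minimal span record carrying the same shared context."""
    return {"name": name, **ctx.as_fields()}


if __name__ == "__main__":
    ctx = new_context("checkout", "prod", "1.4.2")
    print(log_record(ctx, "info", "cart loaded"))
    print(span_record(ctx, "GET /cart"))
```

Because both records are built from one context object, the log line and the span cannot drift apart on field names or values.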
Log Structure Guidelines
- Use structured JSON logs by default (avoid free-text-only logs).
- Include required fields: timestamp, level, message, service, environment, version, request/trace correlation IDs.
- Normalize severity (debug, info, warn, error) and avoid custom variants.
- Add domain context (tenant, job_id, order_id) only when it improves troubleshooting.
- Redact secrets and PII at source and again in the collection pipeline.
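One possible way to apply these guidelines with Python's standard `logging` module is sketched below. The `JsonFormatter` class, the `SENSITIVE_KEYS` deny-list, the email pattern, and the convention of passing fields through `extra={"context": {...}}` are all assumptions made for illustration. A real service would need its own redaction rules and should still redact again in the collection pipeline.

```python
import json
import logging
import re
from datetime import datetime, timezone

# Map Python's level names onto the four normalized severities.
SEVERITY = {"DEBUG": "debug", "INFO": "info", "WARNING": "warn",
            "ERROR": "error", "CRITICAL": "error"}

# Hypothetical deny-list and PII pattern; adjust to the service's own data.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(key: str, value):
    """Mask values of sensitive keys and email-like substrings in strings."""
    if key.lower() in SENSITIVE_KEYS:
        return "[REDACTED]"
    if isinstance(value, str):
        return EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    return value


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    def __init__(self, service: str, environment: str, version: str):
        super().__init__()
        self.static = {"service": service, "environment": environment,
                       "version": version}

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": SEVERITY.get(record.levelname, "info"),
            "message": redact("message", record.getMessage()),
            **self.static,
        }
        # Correlation IDs and domain context arrive via extra={"context": {...}}.
        for key, value in getattr(record, "context", {}).items():
            payload[key] = redact(key, value)
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("checkout", "prod", "1.4.2"))
    logger = logging.getLogger("checkout")
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.warning("payment retry for %s", "jane@example.com",
                   extra={"context": {"trace_id": "abc123", "token": "s3cret"}})
```

Mapping CRITICAL to error is a choice made here to stay within the four listed severities; a team that needs a distinct fatal level would have to extend the normalized set deliberately.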
OpenTelemetry Alignment Checklist
- Define semantic conventions per service type (HTTP, gRPC, messaging, DB).
- Propagate context across async boundaries and queues.
- Capture RED/USE-style metrics for key services and dependencies.
- Sample traces intentionally (head/tail) based on traffic and cost.
- Treat telemetry schema changes as versioned contracts.
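To make the context-propagation item concrete, here is a minimal, standard-library-only sketch of carrying trace context across a queue using the W3C Trace Context `traceparent` header format (version, trace ID, parent span ID, flags). The function names, the dictionary-shaped message headers, and the choice to mark a fresh root trace as sampled are assumptions for illustration. In a real service the OTel SDK's propagators would usually do this work.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace ID, 16-hex parent ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Producer side: copy the current trace context into message headers."""
    flags = "01" if sampled else "00"
    return {**headers, "traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def extract(headers: dict):
    """Consumer side: return (trace_id, parent_span_id, sampled) or None."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are treated as invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def start_child_span(headers: dict) -> dict:
    """Continue the trace found in the message, or start a new root trace."""
    parent = extract(headers)
    if parent is None:
        # No usable context arrived: begin a new trace (sampled by assumption).
        return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8),
                "parent_span_id": None, "sampled": True}
    trace_id, parent_id, sampled = parent
    return {"trace_id": trace_id, "span_id": secrets.token_hex(8),
            "parent_span_id": parent_id, "sampled": sampled}


if __name__ == "__main__":
    # Producer publishes a message; the consumer picks it up later.
    message_headers = inject({"content-type": "application/json"},
                             trace_id=secrets.token_hex(16),
                             span_id=secrets.token_hex(8))
    print(start_child_span(message_headers))
```

The consumer's span keeps the producer's trace ID and records the producer's span as its parent, which is what lets a trace view stitch the two sides of the queue together. The sampled flag travels with the message, so a head-sampling decision made upstream is honored downstream.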
Fluent Bit vs Alloy (Quick Choice)
- Fluent Bit: lightweight and strong for log collection/forwarding.
- Alloy: broader telemetry pipeline with native OTel-first workflows and Grafana ecosystem integration.
- Both are valid; choose based on whether you need logs-only simplicity or unified telemetry pipelines.