Observability

Status: 🌱

Motivation

Shorten incident resolution time by correlating logs, metrics, and traces with consistent context.

Connections

Practical Stack

Instrumentation standard: OpenTelemetry (OTel) for traces, metrics, and log metadata.
Collectors/agents: Fluent Bit, or Grafana Alloy (Grafana's OTel Collector distribution).
Logs backend: Loki.
Visualization and alerting: Grafana.

Data Flow (Simple Baseline)

Services emit logs and traces with consistent context (service.name, env, version, trace_id, span_id).
Runtime/system metrics are scraped or exported (Prometheus-style where possible).
Fluent Bit/Alloy collects and enriches telemetry.
Logs are sent to Loki; metrics and traces are sent to their respective backends.
Grafana dashboards and alerts correlate logs, metrics, and traces for triage.

Log Structure Guidelines

Use structured JSON logs by default (avoid free-text-only logs).
Include required fields: timestamp, level, message, service, environment, version, request/trace correlation IDs.
Normalize severity (debug, info, warn, error) and avoid custom variants.
Add domain context (tenant, job_id, order_id) only when it improves troubleshooting.
Redact secrets and PII at source and again in the collection pipeline.
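
The guidelines above can be sketched with Python's standard `logging` module. This is a minimal illustration, not a production formatter: the `SERVICE_CONTEXT` values, the `JsonFormatter` class name, and the single redaction pattern are all assumptions made for the example, and real PII redaction would need far broader rules.

```python
import json
import logging
import re
from datetime import datetime, timezone

# Hypothetical service identity; in practice this would come from config or env vars.
SERVICE_CONTEXT = {"service": "checkout", "environment": "dev", "version": "1.4.2"}

# Map Python level names onto the four normalized severities.
SEVERITY = {
    "DEBUG": "debug",
    "INFO": "info",
    "WARNING": "warn",
    "ERROR": "error",
    "CRITICAL": "error",
}

# Illustrative rule only: masks the value after password=, token=, or secret=.
SECRET_PATTERN = re.compile(r"(?i)\b(password|token|secret)=\S+")


def redact(text: str) -> str:
    """Replace secret-looking key=value pairs with a fixed placeholder."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": SEVERITY.get(record.levelname, "info"),
            "message": redact(record.getMessage()),
            **SERVICE_CONTEXT,
            # Correlation IDs arrive through logging's `extra=` mechanism.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("checkout")
    log.addHandler(handler)
    log.warning(
        "login failed password=hunter2",
        extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"},
    )
```

Redacting in the formatter covers the "at source" half of the rule; the collector (Fluent Bit or Alloy) would still apply its own redaction as a second line of defense.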

OpenTelemetry Alignment Checklist

Define semantic conventions per service type (HTTP, gRPC, messaging, DB).
Propagate context across async boundaries and queues.
Capture RED (rate, errors, duration) and USE (utilization, saturation, errors) metrics for key services and dependencies.
Sample traces intentionally (head-based or tail-based) based on traffic and cost.
Treat telemetry schema changes as versioned contracts.
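
Two items from this checklist, context propagation across a queue and head-based sampling, can be sketched without any SDK. The example below hand-rolls the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`). The message shape (a dict with a `headers` key) and every function name are assumptions for illustration. A real service would normally rely on the OpenTelemetry SDK's propagators and samplers rather than this code.

```python
import re
import secrets

# W3C Trace Context header, version 00 only (a simplification for this sketch).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Generate a random 8-byte span ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head sampling: the same trace_id always gets the same decision."""
    # Compare the low 64 bits of the trace ID against a threshold derived from ratio.
    return int(trace_id[-16:], 16) < int(ratio * (1 << 64))


def inject(message: dict, trace_id: str, span_id: str, sampled: bool) -> dict:
    """Return a copy of the message with a traceparent header attached."""
    flags = "01" if sampled else "00"
    headers = {
        **message.get("headers", {}),
        "traceparent": f"00-{trace_id}-{span_id}-{flags}",
    }
    return {**message, "headers": headers}


def extract(message: dict):
    """Return (trace_id, parent_span_id, sampled), or None if the header is missing or invalid."""
    match = TRACEPARENT_RE.match(message.get("headers", {}).get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the Trace Context specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 1)


if __name__ == "__main__":
    trace_id = secrets.token_hex(16)
    outgoing = inject({"body": "order-created"}, trace_id, new_span_id(), head_sample(trace_id, 0.1))
    print(outgoing["headers"]["traceparent"])
    print(extract(outgoing))
```

On the consumer side, the extracted `trace_id` is reused and a fresh span ID is generated, so the producer and consumer spans land in the same trace even though the queue decouples them in time.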

Fluent Bit vs Alloy (Quick Choice)

Fluent Bit: lightweight and strong for log collection/forwarding.
Alloy: broader telemetry pipeline (logs, metrics, traces) with OTel-native workflows and Grafana ecosystem integration.
Both are valid; choose based on whether you need logs-only simplicity or unified telemetry pipelines.