Monitoring & Logging
Introduction
Observability (metrics, logs, and traces) is what makes a distributed high-level design (HLD) operable. Without correlation IDs spanning gateway → services → DB, debugging p99 latency spikes is guesswork. The three pillars are typically implemented with Prometheus metrics, structured logs (ELK or Loki), and distributed tracing (Jaeger or Tempo, instrumented via OpenTelemetry).
Interviewers expect you to define SLIs and SLOs, alert on symptoms (error rate, latency) rather than on causes alone, and build dashboards around the four golden signals: latency, traffic, errors, and saturation.
This lesson builds observability into the architecture from day one rather than treating it as an afterthought.
Understanding the topic
Key concepts
- Metrics: time-series counters/histograms — request rate, error %, p99 latency.
- Logs: structured JSON with traceId, userId, level — searchable in ELK.
- Traces: span tree showing cross-service call waterfall.
- SLI: measured behavior (availability = successful/total). SLO: target 99.9%.
- Alerting: burn rate on error budget — page humans on user impact.
- Health vs readiness probes for orchestrator routing.
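The SLI, SLO, and burn-rate bullets above reduce to a little arithmetic. The following is a minimal sketch, assuming an availability SLI and a 99.9% target; the function names and the sample request counts are hypothetical and not taken from any monitoring library.

```python
def availability_sli(successful: int, total: int) -> float:
    """SLI: fraction of requests that succeeded in the window."""
    if total == 0:
        return 1.0  # assumption: a window with no traffic counts as healthy
    return successful / total


def error_budget(slo_target: float) -> float:
    """Allowed failure fraction, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(successful: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    observed_error_ratio = 1.0 - availability_sli(successful, total)
    return observed_error_ratio / error_budget(slo_target)


if __name__ == "__main__":
    # Hypothetical one-hour window: 100,000 requests, 500 of them failed.
    rate = burn_rate(successful=99_500, total=100_000, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
```

A burn rate of 1 means the error budget would be used up exactly at the end of the SLO window; values well above 1 indicate user impact that is worth paging on.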
Internal architecture
Architecture overview
flowchart TB
    App --> Metrics[Prometheus]
    App --> Logs[ELK]
    App --> Traces[Jaeger]
    Metrics --> Grafana
Step-by-step explanation
- OpenTelemetry SDK in every service exports traces + metrics.
- Prometheus scrapes /actuator/prometheus; Grafana dashboards per service SLO.
- Logs shipped via Fluent Bit → Elasticsearch; Kibana queries by traceId.
- API Gateway generates X-Request-Id; propagates W3C traceparent header.
- Synthetic probes hit critical user journeys every 1m from multiple regions.
- PagerDuty alert when checkout error rate >1% for 5m.
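The propagation step above relies on the W3C traceparent header, whose version-00 layout is `00-<trace-id>-<parent-id>-<flags>`. Below is a minimal sketch of generating, forwarding, and parsing that header using only the standard library. The `TraceContext` type and the function names are hypothetical, and the parser accepts only version 00, which is a simplification; in practice an OpenTelemetry SDK would handle this.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str   # shared by every span of one user request
    parent_id: str  # ID of the span that is making the outgoing call
    sampled: bool


def new_trace_context(sampled: bool = True) -> TraceContext:
    """Start a trace at the edge (e.g. the gateway) when no header arrived."""
    return TraceContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_context(ctx: TraceContext) -> TraceContext:
    """Keep the trace ID, mint a new span ID for the next hop."""
    return ctx._replace(parent_id=secrets.token_hex(8))


def to_header(ctx: TraceContext) -> str:
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.parent_id}-{flags}"


def parse_header(value: str) -> Optional[TraceContext]:
    """Return None for malformed headers so the caller can start a new trace."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are invalid per the specification
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    incoming = parse_header("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
    ctx = incoming or new_trace_context()
    print(to_header(child_context(ctx)))  # header to attach to the downstream call
```

Each service parses the incoming header, creates a child context for its own outgoing calls, and writes the trace ID into its logs, which is what lets a single ID join gateway, service, and database activity.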
Informative example
Spring Boot 3 Actuator + Micrometer Prometheus and structured logging pattern:
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    tags:
      application: order-service
  tracing:
    sampling:
      probability: 0.1
logging:
  pattern:
    console: '{"time":"%d","level":"%p","trace":"%X{traceId}","span":"%X{spanId}","msg":"%m"}%n'

# Alert rule (Prometheus)
# sum() drops the status label so both sides of the division match.
# alert: HighErrorRate
# expr: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count[5m])) > 0.01
Sample traces in production (10% here) to control cost. Always log correlation IDs. Define SLOs before launch.
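To make the alert expression concrete, here is a rough sketch of the ratio it computes: the share of 5xx responses among all responses over a window, compared against a 1% threshold. The `CounterSample` type and function names are hypothetical. Unlike Prometheus `rate()`, this sketch does not extrapolate to window boundaries and treats a counter reset as "no data".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    timestamp: float   # seconds since some epoch
    errors_5xx: float  # cumulative count of 5xx responses
    total: float       # cumulative count of all responses


def error_ratio(older: CounterSample, newer: CounterSample) -> float:
    """Fraction of requests in (older, newer] that returned 5xx."""
    delta_total = newer.total - older.total
    delta_errors = newer.errors_5xx - older.errors_5xx
    if delta_total <= 0 or delta_errors < 0:
        return 0.0  # no traffic or a counter reset: report no errors in this sketch
    return delta_errors / delta_total


def high_error_rate(older: CounterSample, newer: CounterSample,
                    threshold: float = 0.01) -> bool:
    """True when the windowed error ratio exceeds the threshold (1% by default)."""
    return error_ratio(older, newer) > threshold


if __name__ == "__main__":
    # Hypothetical samples taken five minutes (300 s) apart.
    start = CounterSample(timestamp=0.0, errors_5xx=10, total=1_000)
    end = CounterSample(timestamp=300.0, errors_5xx=40, total=2_000)
    print(error_ratio(start, end), high_error_rate(start, end))
```

Because both counters are measured over the same window, the ratio of their increases equals the ratio of their per-second rates, which is why the division in the alert rule yields an error percentage.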
Real-world use
Real-world use cases
- E-commerce Black Friday: dashboards on checkout p99 latency and queue depth.
- Banking audit: immutable log retention for 7 years.
- OTT playback errors correlated with CDN and origin traces.
- Canary deploys of a microservice compared against the baseline via error-rate metrics.
Best practices
- Structured logs only — no unparsed printf strings.
- One traceId per user request end-to-end.
- RED method metrics on every HTTP service.
- Use log levels appropriately: ERROR should be actionable, INFO should be sparse.
- Runbooks linked from alert annotations.
- Verify that alerts actually fire during staging chaos drills.
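The first two practices above (structured logs, one trace ID per request) can be sketched with the standard library alone. This is an illustrative sketch, not a production logging setup: `trace_id_var`, `JsonFormatter`, and `build_logger` are hypothetical names, and the field names simply mirror the console pattern shown earlier.

```python
import contextvars
import json
import logging

# Holds the current request's trace ID; request middleware would set it once.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "trace": trace_id_var.get(),
            "msg": record.getMessage(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to the stream (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate, unstructured output from the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("order-service")
    trace_id_var.set("4bf92f3577b34da6a3ce929d0e0e4736")
    log.info("order %s placed", 42)
```

Keeping the trace ID in a context variable means application code never has to pass it around explicitly, yet every line a request emits can be found with a single query on the `trace` field.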
Common mistakes
- Logging PII or secrets: a compliance violation.
- Sampling 100% of traces: storage and ingest costs explode.
- Alerting on CPU alone, without user-facing SLO context.
- Different log formats per service: cross-service queries become painful.
- No centralized view: engineers end up running grep over SSH during an outage.
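One way to avoid the 100% sampling trap is a deterministic, trace-ID-based sampling decision, so that every service keeps or drops the same traces. The sketch below is written in the spirit of ratio-based samplers but is not the algorithm of any specific SDK; `should_sample` is a hypothetical helper, and it assumes trace IDs are uniformly random hex strings.

```python
import random


def should_sample(trace_id_hex: str, probability: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below the threshold."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    threshold = int(probability * (1 << 64))
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < threshold


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(20_000)]
    kept = sum(should_sample(t, 0.1) for t in trace_ids)
    print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, a trace is either recorded by every service it touches or by none, which avoids partial waterfalls while keeping roughly the configured fraction of traffic.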
Advanced interview questions
Q1 (Beginner): What are the three pillars of observability?
Q2 (Beginner): What is an SLO?
Q3 (Intermediate): How do you trace a request across microservices?
Q4 (Intermediate): How do metrics differ from logs?
Q5 (Advanced): Design observability for a payment platform.
Summary
Observability combines metrics, logs, and traces, tied together by correlation IDs. Define SLIs and SLOs before scaling, and alert on user impact. OpenTelemetry and Prometheus form a widely adopted stack. Structured logs and sampled traces balance insight against cost. The golden signals are latency, traffic, errors, and saturation. Disaster recovery plans also rely on monitoring to detect when failover is needed.