High-Level Design Tutorial, Lesson 33

    Monitoring & Logging

    Observability — metrics, logs, traces — makes distributed HLD operable.


    Introduction

    Without correlation IDs spanning gateway → services → DB, debugging p99 latency spikes is guesswork. The three pillars map onto common tooling: metrics (Prometheus), structured logs (ELK or Loki), and distributed traces (Jaeger or Tempo, instrumented with OpenTelemetry).

    Interviewers expect you to define SLIs and SLOs, to alert on symptoms (error rate, latency) rather than on causes alone, and to build dashboards around the four golden signals: latency, traffic, errors, and saturation.

    This lesson treats observability as part of the architecture from day one, not as an afterthought.

    Understanding the topic

    Key concepts

    • Metrics: time-series counters and histograms, such as request rate, error percentage, and p99 latency.
    • Logs: structured JSON with traceId, userId, and level fields, searchable in ELK.
    • Traces: a span tree showing the cross-service call waterfall.
    • SLI: measured behavior (availability = successful/total). SLO: the target for that SLI, e.g. 99.9%.
    • Alerting: alert on error-budget burn rate, and page humans only for user impact.
    • Liveness vs readiness probes: liveness tells the orchestrator when to restart an instance; readiness tells it whether to route traffic to that instance.
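To make the SLI/SLO relationship concrete, here is a minimal sketch of how an availability SLI and its error budget could be computed. The `Slo` type and the function names are hypothetical, chosen for illustration; real systems usually derive these numbers from metrics queries rather than raw counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO definition: a target ratio over a rolling window."""
    target: float      # e.g. 0.999 for 99.9%
    window_days: int   # e.g. 30


def availability(successful: int, total: int) -> float:
    """SLI: fraction of successful requests (an empty window counts as available)."""
    return 1.0 if total == 0 else successful / total


def error_budget_minutes(slo: Slo) -> float:
    """Minutes of full outage the SLO tolerates within its window."""
    return (1.0 - slo.target) * slo.window_days * 24 * 60


def budget_remaining(slo: Slo, successful: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total == 0:
        return 1.0
    allowed_failures = (1.0 - slo.target) * total
    failed = total - successful
    return 1.0 - failed / allowed_failures
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of total downtime, and 500 failures out of 1,000,000 requests spend about half of the budget.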

    Internal architecture

    Architecture overview

    text
    flowchart TB
    App --> Metrics[Prometheus]
    App --> Logs[ELK]
    App --> Traces[Jaeger]
    Metrics --> Grafana

    Step-by-step explanation

    1. OpenTelemetry SDK in every service exports traces + metrics.
    2. Prometheus scrapes /actuator/prometheus; Grafana dashboards per service SLO.
    3. Logs shipped via Fluent Bit → Elasticsearch; Kibana queries by traceId.
    4. API Gateway generates X-Request-Id; propagates W3C traceparent header.
    5. Synthetic probes hit critical user journeys every 1m from multiple regions.
    6. PagerDuty alert when checkout error rate >1% for 5m.
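Step 4 can be sketched as follows: a service reads the incoming W3C `traceparent` header, keeps the trace ID, and issues a new span ID for the downstream call. This is an illustrative sketch that handles only version `00` of the header and assumes lower-cased header names in a plain dict; the function names are hypothetical, and in practice the OpenTelemetry SDK's propagators would do this work.

```python
import re
import secrets

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace, as a gateway would for a request with no context."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the W3C Trace Context spec.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, flags


def outgoing_headers(incoming: dict) -> dict:
    """Headers to attach to a downstream call: same trace ID, fresh span ID."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed is None:
        traceparent = new_traceparent()
    else:
        trace_id, _parent_id, flags = parsed
        traceparent = f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
    headers = {"traceparent": traceparent}
    if "X-Request-Id" in incoming:
        headers["X-Request-Id"] = incoming["X-Request-Id"]  # pass through unchanged
    return headers
```

Because the trace ID survives every hop while the span ID changes, the tracing backend can rebuild the span tree, and a log query by traceId returns lines from every service involved.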

    Informative example

    Spring Boot 3 Actuator + Micrometer Prometheus and structured logging pattern:

    yaml
    management:
      endpoints:
        web:
          exposure:
            include: health,info,prometheus   # expose only what is probed or scraped
      metrics:
        tags:
          application: order-service          # common tag added to every metric
      tracing:
        sampling:
          probability: 0.1                    # sample 10% of traces
    logging:
      pattern:
        console: '{"time":"%d","level":"%p","trace":"%X{traceId}","span":"%X{spanId}","msg":"%m"}%n'
    # Alert rule (Prometheus)
    # sum() aggregates across label sets so the ratio is 5xx requests over all requests
    # alert: HighErrorRate
    # expr: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count[5m])) > 0.01
    # for: 5m

    Sampling 10% of traces in production keeps tracing cost under control. Always log correlation IDs, and define the SLO before launch.
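The same "correlation ID on every log line" idea can be sketched outside Spring with the Python standard library. This is an illustrative sketch, not the pattern above: `trace_id_var`, `JsonFormatter`, `build_logger`, and `demo` are hypothetical names, and the trace ID is assumed to be set by request middleware. Building the line with `json.dumps` also escapes quotes and newlines in the message, which a hand-written string pattern does not.

```python
import contextvars
import io
import json
import logging

# Hypothetical per-request context; middleware would set it from the traceparent header.
trace_id_var = contextvars.ContextVar("trace_id", default="")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "trace": trace_id_var.get(),
            "msg": record.getMessage(),
        }
        return json.dumps(payload)  # escapes quotes and newlines inside msg


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("order-service")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def demo() -> dict:
    """Log one line under a trace ID and return it parsed back from JSON."""
    buffer = io.StringIO()
    log = build_logger(buffer)
    token = trace_id_var.set("4bf92f3577b34da6a3ce929d0e0e4736")
    try:
        log.info('payment "declined" for order %s', 42)
    finally:
        trace_id_var.reset(token)
    return json.loads(buffer.getvalue())
```

Calling `demo()` returns a dict whose `trace` field carries the ID set for the request, so a log search by trace ID finds the line even though the message itself contains quotes.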

    Real-world use

    Real-world use cases

    • E-commerce on Black Friday: dashboards on checkout p99 latency and queue depth.
    • Banking audit: immutable log retention for 7 years.
    • OTT streaming: playback errors correlated across CDN and origin traces.
    • Microservice deploys: canary compared against the baseline via error-rate metrics.
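The canary case above can be sketched as a simple comparison of error rates. Every threshold here (`max_ratio`, `min_requests`, `floor`) is a hypothetical value chosen for illustration, not a recommendation, and `canary_verdict` is not an API of any deployment tool.

```python
def canary_verdict(
    baseline_errors: int,
    baseline_total: int,
    canary_errors: int,
    canary_total: int,
    max_ratio: float = 2.0,    # canary may be at most 2x the baseline error rate
    min_requests: int = 500,   # below this, the canary sample is too small to judge
    floor: float = 0.001,      # absolute allowance so a near-zero baseline is not unfair
) -> str:
    """Return "wait", "promote", or "rollback" from request and error counts."""
    if canary_total < min_requests:
        return "wait"
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    allowed = max(baseline_rate * max_ratio, floor)
    return "promote" if canary_rate <= allowed else "rollback"
```

The minimum-request guard matters because a single failed request in a tiny canary sample would otherwise look like a large error rate.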

    Best practices

    • Use structured logs only: no unparsed printf-style strings.
    • Keep one traceId per user request, end to end.
    • Expose RED method metrics (rate, errors, duration) on every HTTP service.
    • Use log levels deliberately: ERROR means actionable, INFO stays sparse.
    • Link runbooks from alert annotations.
    • Verify that alerts actually fire during chaos drills in staging.
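As a rough illustration of the RED method, the sketch below derives rate, error ratio, and p99 duration from a window of request records. The `Request` type and `red_summary` function are hypothetical; a production service would expose counters and histograms and let Prometheus do this aggregation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one handled HTTP request."""
    status: int
    duration_ms: float


def red_summary(requests: list, window_seconds: float) -> dict:
    """Compute Rate, Errors, and Duration (p99) for one time window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: rank = ceil(0.99 * total), in integer arithmetic.
    rank = -(-99 * total // 100)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p99_ms": durations[rank - 1],
    }
```

For 100 requests over 10 seconds with durations of 1 to 100 ms and two 5xx responses, this yields a rate of 10 requests per second, an error ratio of 0.02, and a p99 of 99 ms.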

    Common mistakes

    • Logging PII or secrets: a compliance violation.
    • Sampling 100% of traces: storage and ingestion costs explode.
    • Alerting on CPU alone, without user-facing SLO context.
    • Using a different log format per service: queries become painful.
    • Having no centralized view: engineers SSH in and grep during an outage.
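One way to avoid the 100% sampling trap is to decide from the trace ID itself, so every service reaches the same keep-or-drop decision for a given request. The sketch below is loosely modeled on ratio-based trace ID samplers; `should_sample` is a hypothetical function, and the exact comparison used by any particular SDK may differ.

```python
def should_sample(trace_id_hex: str, probability: float) -> bool:
    """Deterministically keep roughly `probability` of traces, keyed on the trace ID."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    # Treat the low 64 bits of the trace ID as a uniform random number.
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < int(probability * (1 << 64))
```

Because the decision depends only on the trace ID, a trace is either recorded by every service it touches or by none, which avoids broken, partial span trees.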

    Advanced interview questions

    Q1 (Beginner): What are the three pillars of observability?
    Metrics, logs, and distributed traces.
    Q2 (Beginner): What is an SLO?
    A target level for an SLI, such as 99.9% availability over 30 days. It determines the error budget.
    Q3 (Intermediate): How do you trace a request across microservices?
    Propagate trace context headers (W3C traceparent) on every hop; OpenTelemetry uses them to link spans into one trace.
    Q4 (Intermediate): Metrics vs logs?
    Metrics are aggregated numeric time series for dashboards and alerts; logs are detailed event records for debugging.
    Q5 (Advanced): Design observability for a payment platform.
    Instrument all services with OpenTelemetry, redact logs for PCI compliance, set a 99.99% SLO on auth, run a synthetic payment every 60s, alert on error-budget burn rate, retain Jaeger traces for 7 days, and write an immutable audit trail to a Kafka topic.
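The burn-rate alerting mentioned in Q5 can be sketched as a two-window check. The default threshold of 14.4 is the figure commonly cited from the Google SRE Workbook for a 30-day SLO (2% of the budget consumed in one hour); treat it and the function names as assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,   # e.g. measured over the last 1h
    short_window_error_ratio: float,  # e.g. measured over the last 5m
    slo_target: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both windows burn fast: sustained and still happening."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

Requiring the short window as well means the page stops soon after the incident is mitigated, instead of firing until the long window drains.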

    Summary

    Observability combines metrics, logs, and traces, tied together by correlation IDs. Define SLIs and SLOs before scaling, and alert on user impact. OpenTelemetry and Prometheus form a widely adopted stack. Structured logs and sampled traces balance insight against cost. The golden signals are latency, traffic, errors, and saturation. Disaster recovery plans rely on monitoring to detect when failover is needed.
