High-Level Design Tutorial, Lesson 33: “Scalable systems, HLD interviews & case studies”

Monitoring & Logging

Observability — metrics, logs, traces — makes distributed HLD operable.

Introduction

Observability rests on three pillars (metrics, logs, and traces) and is what makes a distributed high-level design operable. Without correlation IDs spanning gateway → services → DB, debugging p99 latency spikes is guesswork. Each pillar has common tooling: Prometheus for metrics, structured logs in ELK or Loki, and distributed tracing in Jaeger or Tempo, with OpenTelemetry as the instrumentation layer.

Interviewers expect you to define SLIs and SLOs, to alert on symptoms (error rate, latency) rather than on causes alone, and to build dashboards around the four golden signals: latency, traffic, errors, saturation.

This lesson maps observability into architecture from day one, not as an afterthought.

Understanding the topic

Key concepts

Metrics: time-series counters/histograms — request rate, error %, p99 latency.
Logs: structured JSON with traceId, userId, level — searchable in ELK.
Traces: span tree showing cross-service call waterfall.
SLI: measured behavior (availability = successful/total). SLO: target 99.9%.
Alerting: burn rate on error budget — page humans on user impact.
Liveness vs readiness probes: liveness tells the orchestrator when to restart an instance; readiness tells it when to route traffic to it.
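
The SLI, SLO, and burn-rate concepts above can be sketched as plain arithmetic. This is a minimal illustration, not a production alerting rule: the `Window` type, the request counts, and the 14.4 paging threshold (a fast-burn value often quoted for one-hour windows) are assumptions introduced here, not part of the lesson.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one measurement window (hypothetical type)."""
    total: int
    failed: int


def availability_sli(w: Window) -> float:
    """SLI: fraction of requests that succeeded (successful / total)."""
    if w.total == 0:
        return 1.0  # no traffic: treat the window as meeting the objective
    return (w.total - w.failed) / w.total


def burn_rate(w: Window, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    observed_error_ratio = 1.0 - availability_sli(w)
    return observed_error_ratio / error_budget


SLO = 0.999
PAGE_THRESHOLD = 14.4  # assumed fast-burn threshold; tune per service

# Illustrative numbers only: 3,000 failures out of 200,000 requests in the last hour.
last_hour = Window(total=200_000, failed=3_000)
sli = availability_sli(last_hour)
rate = burn_rate(last_hour, SLO)
should_page = rate >= PAGE_THRESHOLD
print(f"SLI={sli:.4f} burn_rate={rate:.1f} page={should_page}")
```

A burn rate well above 1 means the error budget would run out long before the SLO window ends, which is the user-impact signal worth paging on.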

Internal architecture

Architecture overview

text

flowchart TB
  App --> Metrics[Prometheus]
  App --> Logs[ELK]
  App --> Traces[Jaeger]
  Metrics --> Grafana

Step-by-step explanation

OpenTelemetry SDK in every service exports traces + metrics.
Prometheus scrapes /actuator/prometheus; Grafana dashboards per service SLO.
Logs shipped via Fluent Bit → Elasticsearch; Kibana queries by traceId.
API Gateway generates X-Request-Id; propagates W3C traceparent header.
Synthetic probes hit critical user journeys every 1m from multiple regions.
PagerDuty alert when checkout error rate >1% for 5m.
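
The header propagation step can be sketched as follows. This is a simplified illustration assuming version `00` of the W3C `traceparent` format (`version-traceid-parentid-flags`); the function names are hypothetical, headers are treated as a case-sensitive dict for brevity, and a real service would normally let the OpenTelemetry SDK do this.

```python
import re
import secrets

# version 00, 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace, as a gateway does for a request with no context."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID, mint a new span ID for this hop."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None or set(match.group(1)) == {"0"}:
        return new_traceparent()  # missing or invalid context: start fresh
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def gateway_headers(incoming_headers: dict) -> dict:
    """Headers the gateway forwards downstream (hypothetical helper)."""
    out = dict(incoming_headers)
    out.setdefault("X-Request-Id", secrets.token_hex(16))
    out["traceparent"] = child_traceparent(incoming_headers.get("traceparent", ""))
    return out


edge = gateway_headers({})
downstream = child_traceparent(edge["traceparent"])
print(edge["traceparent"])
print(downstream)
```

Every hop shares the same trace ID while each span ID differs, which is what lets the tracing backend assemble the cross-service waterfall.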

Informative example

Spring Boot 3 Actuator + Micrometer Prometheus and structured logging pattern:

yaml

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    tags:
      application: order-service
  tracing:
    sampling:
      probability: 0.1

logging:
  pattern:
    console: '{"time":"%d","level":"%p","trace":"%X{traceId}","span":"%X{spanId}","msg":"%m"}%n'

# Alert rule (Prometheus); sum() aggregates across series so 5xx traffic is divided by total traffic
# alert: HighErrorRate
# expr: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count[5m])) > 0.01
# for: 5m

Sample traces in production (10% here) to control cost. Always log correlation IDs. Define the SLO before launch.
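
For comparison, the same structured-logging idea can be sketched outside Spring with only the Python standard library. The `trace_id_var` context variable, the `JsonFormatter` class, and the sample trace ID are illustrative assumptions; the point is that every log line carries the correlation ID and that the message is JSON-escaped rather than pasted into a string pattern.

```python
import contextvars
import io
import json
import logging

# Holds the current request's trace ID; "-" when no request is in flight.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "trace": trace_id_var.get(),
            "msg": record.getMessage(),  # json.dumps escapes quotes and newlines
        })


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("order-service")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stands in for stdout so the output can be inspected
log = build_logger(buffer)

trace_id_var.set("4bf92f3577b34da6a3ce929d0e0e4736")  # made-up trace ID
log.info('payment "declined" for order %s', 42)

line = json.loads(buffer.getvalue())
print(line["trace"], line["msg"])
```

Because each line is valid JSON, the log pipeline can index the `trace` field directly and a single query returns every line for one request.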

Real-world use

Real-world use cases

E-commerce Black Friday: dashboards on checkout p99 latency and queue depth.
Banking audit: immutable log retention for 7 years.
OTT streaming: playback errors correlated across CDN and origin traces.
Microservice deploys: canary compared against the baseline via error-rate metrics.

Best practices

Structured logs only — no unparsed printf strings.
One traceId per user request end-to-end.
RED method metrics on every HTTP service.
Use log levels deliberately: ERROR means actionable, INFO stays sparse.
Runbooks linked from alert annotations.
Verify that alerts actually fire during staging chaos drills.
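
The RED method mentioned above (Rate, Errors, Duration) can be sketched over a batch of request records. This is a toy calculation under stated assumptions: the `Request` type and the sample data are invented for illustration, and a real service would export counters and histograms to Prometheus instead of computing percentiles in process.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_ms: float


def percentile(values, p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_seconds: float) -> dict:
    """Rate, Errors, Duration for one window of HTTP requests."""
    total = len(requests)
    if total == 0:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": total / window_seconds,
        "error_ratio": errors / total,
        "p99_ms": percentile([r.duration_ms for r in requests], 99),
    }


# Illustrative window: 96 fast successes, 2 server errors, 2 slow successes.
sample = (
    [Request(200, 20.0)] * 96
    + [Request(500, 30.0)] * 2
    + [Request(200, 900.0)] * 2
)
summary = red_summary(sample, window_seconds=10)
print(summary)
```

The average latency of this sample looks healthy, while the p99 exposes the slow tail, which is why the duration signal is usually tracked as a percentile.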

Common mistakes

Logging PII or secrets: a compliance violation.
100% trace sampling: storage and ingest costs explode.
Alerts on CPU alone, without user-facing SLO context.
Different log formats per service: query hell.
No centralized view: engineers SSH in and grep during an outage.
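
One way to avoid the 100% sampling mistake is a deterministic, trace-ID-based sampler, sketched below. This is an illustration of the general idea and is not claimed to match any particular SDK's algorithm; `should_sample` and the choice of the low 64 bits are assumptions made for this example.

```python
import random


def should_sample(trace_id: str, probability: float) -> bool:
    """Keep a trace when the low 64 bits of its hex ID fall under the threshold.

    The decision depends only on the trace ID, so every service that sees
    the same trace reaches the same verdict without coordination.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    low_bits = int(trace_id[-16:], 16)
    return low_bits < probability * (1 << 64)


# Simulate 10,000 random 128-bit trace IDs with a fixed seed for repeatability.
rng = random.Random(0)
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
kept_fraction = kept / len(trace_ids)
print(f"kept {kept_fraction:.1%} of traces")
```

Because the verdict is a pure function of the trace ID, a sampled trace is complete across all hops instead of having spans randomly dropped in the middle.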

Advanced interview questions

Q1 (Beginner): What are the three pillars of observability?

Metrics, logs, and distributed traces.

Q2 (Beginner): What is an SLO?

A target level for an SLI, such as 99.9% availability over 30 days. It defines the error budget.

Q3 (Intermediate): How do you trace a request across microservices?

Propagate trace context headers (W3C traceparent) on every call; OpenTelemetry links the resulting spans into one trace.

Q4 (Intermediate): Metrics vs logs?

Metrics are aggregated numeric time series used for dashboards and alerts; logs are detailed event records used for debugging.

Q5 (Advanced): Design observability for a payment platform.

Instrument all services with OpenTelemetry; redact logs for PCI compliance; set a 99.99% SLO on auth; run a synthetic payment every 60s; alert on error-budget burn rate; retain Jaeger traces for 7 days; write an immutable audit trail to a Kafka topic.

Summary

Observability combines metrics, logs, and traces, tied together by correlation IDs. Define SLIs and SLOs before scaling, and alert on user impact. OpenTelemetry and Prometheus form a widely used stack. Structured logs and sampled traces balance insight against cost. The golden signals are latency, traffic, errors, and saturation. Disaster recovery plans rely on monitoring to detect when failover is needed.
