High-Level Design Tutorial, Lesson 28

    Retry Pattern

    Introduction

    The retry pattern re-executes failed operations when errors are likely transient — network blips, 503 from overloaded servers, deadlock timeouts. Unbounded retries amplify outages (retry storms); production systems use capped attempts, exponential backoff with jitter, and idempotency keys.

    HLD places retries at client libraries, API gateways, message consumers, and saga steps. Retries interact with circuit breakers, timeouts, and duplicate side effects — design all three together.

    This lesson covers backoff strategies, retry-safe operations, and when not to retry.

    Understanding the topic

    Key concepts

    • Transient vs permanent errors: retry timeouts and 5xx responses, not 400 validation errors.
    • Exponential backoff: 1s, 2s, 4s… up to a capped maximum delay.
    • Jitter: randomize each delay to desynchronize retry waves.
    • Retry budget: a maximum of 3 attempts is typical on a synchronous user path.
    • Idempotency is required: retrying a POST without an idempotency key may double-charge.
    • Retry storm: many clients retrying together slow recovery; use jitter plus a circuit breaker.
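
The backoff and jitter concepts above can be sketched as a small delay calculator. This is a minimal illustration, not a library API: the class name `BackoffPolicy`, its method names, and the choice of "full jitter" (a uniform delay between zero and the capped value) are all assumptions made for the example.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: names and the full-jitter strategy are illustrative assumptions.
final class BackoffPolicy {
    private final long baseDelayMs;
    private final long maxDelayMs;

    BackoffPolicy(long baseDelayMs, long maxDelayMs) {
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    // Un-jittered delay for a 1-based attempt: base, 2*base, 4*base, ... capped at maxDelayMs.
    long cappedDelayMs(int attempt) {
        long delay = baseDelayMs;
        for (int i = 1; i < attempt && delay < maxDelayMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    // Full jitter: a uniform random delay in [0, cappedDelayMs(attempt)].
    long jitteredDelayMs(int attempt) {
        return ThreadLocalRandom.current().nextLong(cappedDelayMs(attempt) + 1);
    }
}
```

With a 1000 ms base and an 8000 ms cap, the un-jittered schedule is 1s, 2s, 4s, 8s, and then stays at 8s. Each client sleeps for the jittered value, so clients that failed at the same moment are unlikely to retry at the same moment.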

    Internal architecture

    Architecture overview

    text
    sequenceDiagram
    Client->>Service: request
    Service-->>Client: timeout
    Client->>Service: retry with backoff
    Service-->>Client: success

    Step-by-step explanation

    1. HTTP client: 3 retries with exponential backoff from 100 ms to 2 s; retry only idempotent methods (GET, PUT) or requests that carry an idempotency key.
    2. Kafka consumer: route failures to a retry topic with a delay queue, or apply an exponential backoff interceptor.
    3. Gateway: retry a safe upstream GET once; never auto-retry a POST without an idempotency key.
    4. Saga step failure: retry with a dedup store before compensating.
    5. Dead-letter the message after max retries for manual investigation.
    6. Monitor for a spike in retry rate as an early outage signal.
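
The consumer-side steps above (bounded attempts, then dead letter) can be sketched without any broker. The class `RetryingConsumer` and its in-memory dead-letter list are hypothetical stand-ins; a real Kafka consumer would publish to a retry or dead-letter topic and wait between attempts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: an in-memory list stands in for a dead-letter topic.
final class RetryingConsumer<T> {
    private final int maxAttempts;
    private final Consumer<T> handler;
    private final List<T> deadLetters = new ArrayList<>();

    RetryingConsumer(int maxAttempts, Consumer<T> handler) {
        this.maxAttempts = maxAttempts;
        this.handler = handler;
    }

    // Returns the number of attempts used; parks the message after maxAttempts failures.
    int process(T message) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return attempt;
            } catch (RuntimeException e) {
                // Simplification: every exception is treated as transient.
                // A real consumer would classify the error and back off here.
            }
        }
        deadLetters.add(message);
        return maxAttempts;
    }

    List<T> deadLetters() {
        return deadLetters;
    }
}
```

Because the handler may run several times for the same message, it has to be idempotent, which is the same requirement the lesson places on retried HTTP mutations.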

    Informative example

    Spring Retry with exponential backoff on idempotent inventory check:

    java
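    import org.springframework.retry.annotation.Backoff;
    import org.springframework.retry.annotation.Recover;
    import org.springframework.retry.annotation.Retryable;
    import org.springframework.stereotype.Service;
    import org.springframework.web.client.HttpServerErrorException;
    import org.springframework.web.client.ResourceAccessException;
    import org.springframework.web.client.RestClient;
    import org.springframework.web.client.RestClientException;

    // Availability is the response type of the inventory service (not shown in this lesson).
    @Service
    public class InventoryClient {

        private final RestClient rest;

        public InventoryClient(RestClient.Builder builder) {
            this.rest = builder.baseUrl("http://inventory-service").build();
        }

        // Retry only 5xx responses and I/O failures such as timeouts; 4xx responses are not retried.
        @Retryable(
            retryFor = { HttpServerErrorException.class, ResourceAccessException.class },
            maxAttempts = 3, // total calls, including the first one
            backoff = @Backoff(delay = 200, multiplier = 2.0, maxDelay = 2000, random = true)
        )
        public Availability check(String sku, int qty) {
            return rest.get()
                .uri("/availability?sku={sku}&qty={qty}", sku, qty)
                .retrieve()
                .body(Availability.class);
        }

        // Fallback result returned once the attempts are exhausted.
        @Recover
        public Availability recover(RestClientException ex, String sku, int qty) {
            return Availability.unknown(sku);
        }
    }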

    Enable the annotations by adding @EnableRetry to a configuration class. For a POST charge, send an Idempotency-Key header and reuse the same key on every retry instead of retrying blindly.
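
To show why reusing the same key makes a retried charge safe, here is a minimal server-side sketch. The class `IdempotentChargeHandler` is hypothetical, and the in-memory map is only a stand-in for a durable dedup store; real payment APIs differ in how they store and expire keys.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Hypothetical sketch: an in-memory map stands in for a durable dedup store.
final class IdempotentChargeHandler {
    private final ConcurrentMap<String, String> resultsByKey = new ConcurrentHashMap<>();

    // Runs the charge at most once per key; a retry with the same key gets the stored result.
    // If the charge throws, nothing is stored, so a later retry executes it again.
    String handle(String idempotencyKey, Supplier<String> charge) {
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> charge.get());
    }
}
```

A client that times out after the server has already charged the card can resend the request with the same key and receive the original result, rather than triggering a second charge.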

    Real-world use

    Real-world use cases

    • Payment gateway returns intermittent 503s: retry with an idempotency key.
    • S3 upload hits a transient timeout: retry only the failed parts of a multipart upload.
    • Healthcare HL7 ACK retry to external lab system.
    • Microservice read during a rolling deploy: retry against another replica.

    Best practices

    • Always pair retries with an overall deadline that covers every attempt's timeout plus the backoff delays between attempts.
    • Use jitter on every backoff.
    • Log retry attempt count with correlation ID.
    • Classify errors in code — don't retry business rule violations.
    • Consumer retries belong in queue config, not infinite loops.
    • Load test retry behavior during simulated partial outage.
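
For the overall-deadline practice, the worst case can be estimated by adding each attempt's timeout to the backoff waits between attempts. The sketch below assumes jitter only shortens a delay, so the un-jittered schedule is the upper bound. `RetryDeadline` and its parameter names are hypothetical.

```java
// Hypothetical helper: assumes jitter never makes a delay longer than the capped value.
final class RetryDeadline {

    // Worst-case wall time: every attempt times out, plus the capped backoff between attempts.
    static long worstCaseMs(int maxAttempts, long perAttemptTimeoutMs,
                            long baseDelayMs, double multiplier, long maxDelayMs) {
        long total = 0;
        double delay = baseDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            total += perAttemptTimeoutMs;
            if (attempt < maxAttempts) { // no wait after the final attempt
                total += Math.min((long) delay, maxDelayMs);
                delay *= multiplier;
            }
        }
        return total;
    }

    // True when waiting delayMs and then running one more attempt still fits in the deadline.
    static boolean canAttempt(long nowMs, long deadlineMs, long delayMs, long perAttemptTimeoutMs) {
        return nowMs + delayMs + perAttemptTimeoutMs <= deadlineMs;
    }
}
```

For 3 attempts with a 2000 ms timeout and a 200 ms base delay doubling each time, the worst case is 2000 + 200 + 2000 + 400 + 2000 = 6600 ms. If the caller's deadline is shorter than that, reduce the attempts or the per-attempt timeout rather than letting the retries overrun it.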

    Common mistakes

    • Retrying non-idempotent POST without key — duplicates.
    • No max attempts — infinite loop hogs threads.
    • Retrying through open circuit breaker — wasted calls.
    • Same fixed delay — synchronized retry storm.
    • Retrying 401/403 auth errors — pointless hammering.
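
Several of these mistakes come down to a missing retry decision in code. The following sketch shows one possible decision function. The status list (502, 503, 504) is an illustrative assumption and is narrower than "all 5xx"; the method set follows the methods that HTTP semantics define as idempotent. `RetryClassifier` is a hypothetical name.

```java
import java.util.Set;

// Illustrative policy only: the transient status list is an assumption, tune it per upstream.
final class RetryClassifier {
    // Methods defined as idempotent by HTTP semantics.
    private static final Set<String> IDEMPOTENT_METHODS =
            Set.of("GET", "HEAD", "PUT", "DELETE", "OPTIONS");

    static boolean isTransient(int status) {
        return status == 502 || status == 503 || status == 504;
    }

    // attemptsMade counts calls already sent, including the first one.
    static boolean shouldRetry(String method, int status, boolean hasIdempotencyKey,
                               int attemptsMade, int maxAttempts) {
        if (attemptsMade >= maxAttempts) {
            return false; // capped attempts: never loop forever
        }
        if (!isTransient(status)) {
            return false; // 400, 401, 403 and similar will not succeed on retry
        }
        return hasIdempotencyKey || IDEMPOTENT_METHODS.contains(method);
    }
}
```

Under this policy a POST that receives a 503 is retried only when it carries an idempotency key, and a 401 or 403 is never retried regardless of method.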

    Advanced interview questions

    Q1 (Beginner): When should you retry a failed request?
    On transient errors such as timeouts and 503 responses, when the operation is idempotent or carries an idempotency key.
    Q2 (Beginner): Why add jitter to backoff?
    Jitter spreads retry times so that clients do not hit a recovering service in synchronized spikes.
    Q3 (Intermediate): Retry vs circuit breaker?
    Retries handle brief glitches; a breaker stops calls during a sustained failure. The two are complementary.
    Q4 (Intermediate): How do you retry safely in a Kafka consumer?
    Use limited attempts with backoff, then send the message to a DLQ; processing must be idempotent by offset or business key.
    Q5 (Advanced): Design retry for an external credit check API.
    Use a 2 s timeout per attempt, 3 retries with exponential backoff capped at 8 s, idempotency keyed by applicationId, a circuit breaker that opens at 50% failures, a cached soft decline held for 24 h, and an audit record of each attempt.
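
As a companion to Q3 and Q5, here is a minimal sketch of a failure-rate breaker that a retry loop could consult before each attempt. It uses a count-based window and omits the half-open probing state that production breakers normally have. `FailureRateBreaker` and its parameters are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: count-based window only, with no half-open state or reset timer.
final class FailureRateBreaker {
    private final int windowSize;
    private final double failureThreshold;
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call
    private int failures = 0;

    FailureRateBreaker(int windowSize, double failureThreshold) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
    }

    // Records one call outcome and evicts the oldest once the window is exceeded.
    void record(boolean failed) {
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
        if (outcomes.size() > windowSize) {
            boolean evictedFailure = outcomes.removeFirst();
            if (evictedFailure) {
                failures--;
            }
        }
    }

    // Open once the window is full and the failure ratio reaches the threshold.
    boolean isOpen() {
        return outcomes.size() >= windowSize
                && (double) failures / outcomes.size() >= failureThreshold;
    }
}
```

A retry loop would check `isOpen()` before each attempt and fail fast while it returns true. With a threshold of 0.5 this matches the "breaker at 50% failures" choice in the Q5 answer.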

    Summary

    Retry transient failures with capped attempts and backoff. Jitter prevents synchronized retry storms. Idempotency keys are mandatory for retried mutations, because they keep retries safe at the business layer. Combine retries with timeouts and circuit breakers. On async paths, send messages to a DLQ once retries are exhausted.
