Retry Pattern
Introduction
The retry pattern re-executes a failed operation when the error is likely transient: a network blip, a 503 from an overloaded server, or a deadlock timeout. Unbounded retries amplify outages (retry storms), so production systems use capped attempts, exponential backoff with jitter, and idempotency keys.
In a high-level design (HLD), retries sit in client libraries, API gateways, message consumers, and saga steps. Retries interact with circuit breakers, timeouts, and duplicate side effects, so design for all three together.
This lesson covers backoff strategies, retry-safe operations, and when not to retry.
Understanding the topic
Key concepts
- Transient vs permanent errors: retry 5xx responses and timeouts, not 400 validation failures.
- Exponential backoff: double the delay after each failure (1s, 2s, 4s, …) up to a capped maximum delay.
- Jitter: randomize each delay to desynchronize retry waves.
- Retry budget: a maximum of 3 attempts is typical on a synchronous user-facing path.
- Idempotency is required: retrying a POST without an idempotency key can duplicate the operation, for example double-charging a customer.
- Retry storm: many clients retrying at the same moment slows recovery; counter it with jitter and a circuit breaker.
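The backoff and jitter concepts above can be sketched in a few lines of Java. This is a minimal illustration, not a library API: the class name `Backoff` and its methods are hypothetical, and it assumes the "full jitter" variant, where each delay is drawn uniformly between zero and the capped exponential ceiling.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: capped exponential backoff with full jitter. */
final class Backoff {
    private final long baseMillis;
    private final long maxMillis;

    Backoff(long baseMillis, long maxMillis) {
        if (baseMillis <= 0 || maxMillis < baseMillis || maxMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("need 0 < baseMillis <= maxMillis < Long.MAX_VALUE");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    /** Ceiling for the given retry (1-based): base * 2^(retry - 1), capped at maxMillis. */
    long ceilingMillis(int retry) {
        if (retry < 1) {
            throw new IllegalArgumentException("retry is 1-based");
        }
        long delay = baseMillis;
        for (int i = 1; i < retry && delay < maxMillis; i++) {
            // Double without overflowing; jump straight to the cap when doubling would pass it.
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return delay;
    }

    /** Full jitter: a uniformly random delay in [0, ceiling] so clients do not retry in lockstep. */
    long nextDelayMillis(int retry) {
        return ThreadLocalRandom.current().nextLong(ceilingMillis(retry) + 1);
    }
}
```

With `new Backoff(1000, 8000)`, the ceilings for retries 1 to 5 are 1000, 2000, 4000, 8000, and 8000 ms, and each actual wait is a random value at or below that ceiling.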
Internal architecture
Architecture overview
sequenceDiagram
    Client->>Service: request
    Service-->>Client: timeout
    Client->>Service: retry with backoff
    Service-->>Client: success
Step-by-step explanation
- HTTP client: 3 retries with exponential backoff from 100ms to 2s; retry only idempotent GET/PUT requests, or requests that carry an idempotency key.
- Kafka consumer: route failures to a retry topic backed by a delay queue, or use an interceptor that applies exponential backoff.
- Gateway: retry a safe upstream GET once; never auto-retry a POST that lacks an idempotency key.
- Saga step failure: retry against a dedup store before falling back to compensation.
- After max retries, send the message to a dead letter queue for manual investigation.
- Monitor for spikes in retry rate as an early outage signal.
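The consumer-side steps above (bounded attempts, error classification, dead-lettering) can be sketched in-process as follows. Everything here is hypothetical scaffolding for illustration: `TransientException`, `RetryingConsumer`, and the in-memory dead letter list stand in for a real broker setup, where a failed message would usually be published to a retry topic or DLQ instead of the consumer thread sleeping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Consumer;

/** Hypothetical marker for failures worth retrying (timeouts, 5xx, deadlocks). */
class TransientException extends RuntimeException {
    TransientException(String message) { super(message); }
}

/** In-process stand-in for a consumer with bounded retries and a dead letter list. */
final class RetryingConsumer<M> {
    private final Consumer<M> handler;
    private final int maxAttempts;
    private final long baseDelayMillis;
    private final List<M> deadLetters = new ArrayList<>();

    RetryingConsumer(Consumer<M> handler, int maxAttempts, long baseDelayMillis) {
        if (maxAttempts < 1 || baseDelayMillis < 1) {
            throw new IllegalArgumentException("maxAttempts and baseDelayMillis must be >= 1");
        }
        this.handler = handler;
        this.maxAttempts = maxAttempts;
        this.baseDelayMillis = baseDelayMillis;
    }

    /** Returns true if the handler succeeded, false if the message was dead-lettered. */
    boolean consume(M message) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return true;
            } catch (TransientException e) {
                // Transient: wait and try again unless the attempt budget is spent.
                if (attempt == maxAttempts || !sleepBeforeRetry(attempt)) {
                    break;
                }
            } catch (RuntimeException e) {
                break; // Permanent error: retrying cannot help.
            }
        }
        deadLetters.add(message); // Park the message for manual investigation.
        return false;
    }

    /** Sleeps a jittered, exponentially growing delay; returns false if interrupted. */
    private boolean sleepBeforeRetry(int attempt) {
        long ceiling = baseDelayMillis << Math.min(attempt - 1, 16);
        try {
            Thread.sleep(ThreadLocalRandom.current().nextLong(ceiling + 1));
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    List<M> deadLetters() {
        return new ArrayList<>(deadLetters);
    }
}
```

A handler that throws `TransientException` twice and then succeeds is called three times and returns `true`; a handler that throws any other runtime exception is called once and the message goes straight to the dead letter list.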
Informative example
Spring Retry with exponential backoff on idempotent inventory check:
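import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Recover;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestClient;
import org.springframework.web.client.RestClientException;

@Service
public class InventoryClient {

    private final RestClient rest;

    public InventoryClient(RestClient.Builder builder) {
        this.rest = builder.baseUrl("http://inventory-service").build();
    }

    // Up to 3 attempts in total; the delay starts at 200 ms, doubles, is capped at 2 s, and is randomized.
    // Note: RestClientException also covers 4xx responses; narrow retryFor if those should not be retried.
    @Retryable(
        retryFor = { RestClientException.class },
        maxAttempts = 3,
        backoff = @Backoff(delay = 200, multiplier = 2.0, maxDelay = 2000, random = true))
    public Availability check(String sku, int qty) {
        return rest.get()
            .uri("/availability?sku={sku}&qty={qty}", sku, qty)
            .retrieve()
            .body(Availability.class);
    }

    // Fallback invoked once the attempts are exhausted.
    @Recover
    public Availability recover(RestClientException ex, String sku, int qty) {
        return Availability.unknown(sku);
    }
}
Enable retry support by adding @EnableRetry to a configuration class. For a POST that charges a customer, send an Idempotency-Key header and reuse the same key on every retry instead of retrying blindly.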
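On the receiving side, an idempotency key only helps if the server remembers which keys it has already processed. The following is a minimal in-memory sketch of that idea; `IdempotentExecutor` is a hypothetical name, and a production service would more likely use a durable store with expiry than a `ConcurrentHashMap`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical dedup store: runs an operation at most once per idempotency key. */
final class IdempotentExecutor<R> {
    private final ConcurrentMap<String, R> results = new ConcurrentHashMap<>();

    /**
     * Runs the operation the first time a key is seen and replays the stored
     * result for every later call with the same key. If the operation throws,
     * nothing is stored, so a retry with the same key runs it again.
     */
    R execute(String idempotencyKey, Supplier<R> operation) {
        return results.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

A client that times out on a charge and retries with the same key gets the original receipt back instead of triggering a second charge. The operation should not return `null`, because `computeIfAbsent` does not record a `null` result.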
Real-world use
Real-world use cases
- Payment gateway returns intermittent 503s: retry with an idempotency key.
- S3 upload hits a transient timeout: with multipart upload, retry only the failed parts.
- Healthcare integration: resend an HL7 message to an external lab system when no ACK arrives.
- Microservice read during a rolling deploy: retry against another replica.
Best practices
- Always pair retries with an overall deadline that covers every attempt's timeout plus the backoff waits between attempts.
- Use jitter on every backoff.
- Log retry attempt count with correlation ID.
- Classify errors in code — don't retry business rule violations.
- Consumer retries belong in queue config, not infinite loops.
- Load test retry behavior during simulated partial outage.
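To check a retry policy against an overall deadline, it helps to compute its worst case: every attempt runs to its timeout and every backoff wait hits its ceiling. The sketch below does that arithmetic; `RetryBudget` and its parameters are hypothetical names, and it assumes a doubling backoff whose jitter never exceeds the capped delay.

```java
/** Hypothetical helper: worst-case wall-clock time of a retry policy. */
final class RetryBudget {
    private RetryBudget() {}

    /**
     * Worst case = maxAttempts * attemptTimeout + the backoff waits between attempts,
     * where the wait doubles from baseDelayMillis and is capped at maxDelayMillis.
     */
    static long worstCaseMillis(int maxAttempts, long attemptTimeoutMillis,
                                long baseDelayMillis, long maxDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long total = (long) maxAttempts * attemptTimeoutMillis;
        long delay = baseDelayMillis;
        // There is one wait between each pair of consecutive attempts.
        for (int wait = 1; wait < maxAttempts; wait++) {
            total += Math.min(delay, maxDelayMillis);
            delay = Math.min(delay * 2, maxDelayMillis);
        }
        return total;
    }

    /** True if the policy's worst case still fits inside the caller's deadline. */
    static boolean fitsDeadline(int maxAttempts, long attemptTimeoutMillis,
                                long baseDelayMillis, long maxDelayMillis,
                                long deadlineMillis) {
        return worstCaseMillis(maxAttempts, attemptTimeoutMillis,
                baseDelayMillis, maxDelayMillis) <= deadlineMillis;
    }
}
```

For example, 3 attempts with a 500 ms timeout and a backoff of 200 ms doubling up to 2 s come to 3 × 500 + 200 + 400 = 2100 ms, so the policy does not fit a 2 s deadline even though timeout × attempts alone is only 1500 ms.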
Common mistakes
- Retrying non-idempotent POST without key — duplicates.
- No max attempts — infinite loop hogs threads.
- Retrying through open circuit breaker — wasted calls.
- Same fixed delay — synchronized retry storm.
- Retrying 401/403 auth errors — pointless hammering.
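Several of these mistakes come down to not classifying errors before retrying. One way to make the rule explicit is a small status classifier like the sketch below. `RetryClassifier` is a hypothetical name, and the exact status list is an assumption: it treats 408 and 429 as retryable alongside the common transient 5xx codes, and treats everything else, including 400, 401, 403, and 501, as permanent.

```java
/** Hypothetical classifier: which HTTP statuses are worth retrying. */
final class RetryClassifier {
    private RetryClassifier() {}

    static boolean isRetryable(int httpStatus) {
        switch (httpStatus) {
            case 408: // Request Timeout
            case 429: // Too Many Requests (assumption: retry after backing off)
            case 500: // Internal Server Error
            case 502: // Bad Gateway
            case 503: // Service Unavailable
            case 504: // Gateway Timeout
                return true;
            default:
                // Validation, auth, and other client errors will fail the same way again.
                return false;
        }
    }
}
```

A retry loop can consult `isRetryable` before sleeping, so a 401 or a 400 fails fast instead of being hammered against the server.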
Advanced interview questions
Q1 (Beginner): When should you retry a failed request?
Q2 (Beginner): Why add jitter to backoff?
Q3 (Intermediate): Retry vs circuit breaker?
Q4 (Intermediate): How do you retry safely in a Kafka consumer?
Q5 (Advanced): Design retry for an external credit check API.
Summary
Retry transient failures with capped attempts and backoff. Jitter prevents synchronized retry storms. Idempotency keys are mandatory for retried mutations. Combine retries with timeouts and circuit breakers. On async paths, send messages to a DLQ once retries are exhausted. The idempotency pattern is what makes retries safe at the business layer.