Troubleshooting Redis
Introduction
Ninety percent of Redis incidents fall into eight buckets: OOM/eviction, hot keys, slow commands, replication lag, stampede, penetration, avalanche, and fragmentation. Systematic diagnosis beats random CONFIG changes.
Start every latency ticket with SLOWLOG GET and INFO memory. Correlate the incident time with deploys, traffic spikes, and the backup window (fork latency).
Document fixes in the postmortem: the same hot key will return next sale season.
Understanding the topic
Key concepts
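- Cache stampede: a popular key expires and many clients miss at once, all hitting the backend together (thundering herd).
- Cache penetration: queries for keys that do not exist, so every request falls through the cache to the database.
- Cache avalanche: a large number of keys expire at the same time.
- Hot key: a single key receives so much traffic that it overloads one shard.
- Big key: a value large enough that DEL or GET on it blocks the event loop.
- Fragmentation: RSS creeps up without more data being stored.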
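One common guard against a cache stampede is to let only one caller per key rebuild the value while the others wait. The sketch below illustrates that idea with an in-process dictionary standing in for Redis; the class name `SingleFlightCache` and the `loader` callback are hypothetical, and a multi-process deployment would need a distributed lock instead.

```python
import threading
import time


class SingleFlightCache:
    """In-memory stand-in for a cache: one loader call per missing key."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._values = {}  # key -> (value, expires_at)
        self._locks = {}   # key -> lock guarding the reload of that key
        self._guard = threading.Lock()

    def _lock_for(self, key):
        # Create the per-key lock atomically so all threads share it.
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._values.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key, loader):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        # Miss or expired: only one thread per key runs the loader.
        with self._lock_for(key):
            entry = self._fresh(key)  # re-check after acquiring the lock
            if entry is not None:
                return entry[0]
            value = loader(key)
            self._values[key] = (value, time.monotonic() + self.ttl_seconds)
            return value
```

With twenty threads requesting the same expired key, the loader (the expensive database call) runs once and the remaining threads reuse its result.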
Step-by-step explanation
- Symptom: identify whether you are seeing latency, errors, or evictions.
- Check SLOWLOG, INFO, and recent changes.
- Classify the cause: memory, hot key, network, or fork.
- Mitigate: TTL jitter, local cache, UNLINK.
- Long-term: shard, pipeline, or fix the code.
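For a hot key, the long-term "shard" step can mean writing several copies of the value under different key names and reading a random copy, so reads spread across slots in a cluster. The sketch below is one possible shape, not a prescribed design: the `:r<N>` suffix scheme and the function names are assumptions, and a plain dict stands in for the Redis client.

```python
import random


def replica_keys(base_key, copies):
    """Key names for each copy of a hot key (hypothetical naming scheme)."""
    return [f"{base_key}:r{i}" for i in range(copies)]


def write_hot_key(store, base_key, value, copies=8):
    # Every copy must be written, so writes cost `copies` operations.
    for key in replica_keys(base_key, copies):
        store[key] = value


def read_hot_key(store, base_key, copies=8, rng=random):
    # Each read picks one copy at random to spread the load.
    key = f"{base_key}:r{rng.randrange(copies)}"
    return store.get(key)


store = {}  # stand-in for Redis
write_hot_key(store, "viral:product:sku", "in stock", copies=4)
print(read_hot_key(store, "viral:product:sku", copies=4))
```

The trade-off is write amplification and a short window where copies can disagree, so this fits read-heavy keys that change rarely.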
Syntax reference
Common commands
- redis-cli --hotkeys scans the keyspace and needs an LFU maxmemory-policy, so it is not free.
- Match the incident timestamp against the BGSAVE schedule.
- Check for disconnects caused by client output buffer limits.
SLOWLOG GET 20
redis-cli --hotkeys -i 0.1
INFO memory
INFO stats
LATENCY DOCTOR
MEMORY DOCTOR
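The output of INFO memory and INFO stats is plain `field:value` text, so a first-pass triage can be scripted. The sketch below parses that text and flags three signals using the real fields `mem_fragmentation_ratio`, `evicted_keys`, and `latest_fork_usec`; the sample values and both thresholds are illustrative assumptions, not recommended limits.

```python
def parse_info(text):
    """Turn INFO output into a dict, skipping '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def triage(fields, frag_limit=1.5, fork_limit_usec=100_000):
    """Return incident classes suggested by the parsed fields.

    frag_limit and fork_limit_usec are hypothetical thresholds.
    """
    findings = []
    if float(fields.get("mem_fragmentation_ratio", 0)) > frag_limit:
        findings.append("fragmentation")
    if int(fields.get("evicted_keys", 0)) > 0:
        findings.append("eviction")
    if int(fields.get("latest_fork_usec", 0)) > fork_limit_usec:
        findings.append("fork")
    return findings


# Made-up sample in the shape of INFO output.
SAMPLE = """# Memory
used_memory:1000000
used_memory_rss:1800000
mem_fragmentation_ratio:1.80
# Stats
evicted_keys:120
latest_fork_usec:250000
"""

print(triage(parse_info(SAMPLE)))
```

A script like this only narrows the search; SLOWLOG and the deploy timeline still need a human read.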
Informative example
Hot key incident playbook — identify and mitigate:
# 1. Confirm hot key
redis-cli --hotkeys -i 0.1
# 2. Check key size
redis-cli MEMORY USAGE viral:product:sku
# 3. Mitigate: app deploy local L1 cache for this key
# 4. Long-term: shard read replicas or duplicate read key
A celebrity product drop means a hot key. Pre-warm the cache and add a local cache before the event, and coordinate with the product team on key design.
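Step 3 of the playbook, a local L1 cache in the application, can be as small as a dictionary with a short TTL in front of the Redis read. The sketch below assumes a hypothetical `fetch_from_redis` callback in place of a real client call, and the class name and TTL are illustrative.

```python
import time


class LocalL1Cache:
    """Per-process cache with a short TTL, placed in front of Redis."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}   # key -> (value, expires_at)

    def get(self, key, fetch_from_redis):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # served locally, no Redis round trip
        value = fetch_from_redis(key)
        self._entries[key] = (value, now + self.ttl_seconds)
        return value


redis_reads = []


def fetch_from_redis(key):
    # Hypothetical stand-in for a real client GET.
    redis_reads.append(key)
    return "product payload"


l1 = LocalL1Cache(ttl_seconds=2)
for _ in range(1000):
    l1.get("viral:product:sku", fetch_from_redis)
print(len(redis_reads))
```

Each application instance now reads the hot key from Redis roughly once per TTL window, at the cost of serving data that can be up to one TTL stale.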
Real-world use
Real-world use cases
- P99 latency spike after a marketing email.
- Pod restarts caused by OOM kills.
- Recovery from an accidental FLUSHDB.
- Cluster MOVED storm from a misconfigured client.
- Session stampede after a deploy changed TTLs.
Best practices
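- Keep a runbook per incident class.
- Never run KEYS * when debugging in production; use SCAN instead.
- Add jitter to TTLs to prevent cache avalanche.
- Use a Bloom filter or negative caching against cache penetration.
- Use UNLINK rather than DEL for big keys during cleanup.
- Run game days that simulate a hot key.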
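TTL jitter from the list above means adding a small random offset to each expiry so keys written together do not expire together. A minimal sketch, assuming a 10% spread; the function name and default fraction are illustrative choices.

```python
import random


def jittered_ttl(base_seconds, jitter_fraction=0.1, rng=random):
    """Return base_seconds plus or minus up to jitter_fraction of it."""
    spread = base_seconds * jitter_fraction
    # Never return less than one second, so a key always gets a TTL.
    return max(1, int(base_seconds + rng.uniform(-spread, spread)))


# Ten keys written in the same batch get different expiries.
ttls = [jittered_ttl(600) for _ in range(10)]
print(ttls)
```

The returned value would be passed as the expiry when the key is set, for example as the seconds argument of SET with EX.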
Common mistakes
- Restarting Redis without diagnosis, which loses the evidence.
- Increasing maxmemory without fixing the TTL leak.
- Running MONITOR in production, which adds load.
- Blaming the network without checking the slowlog.
Advanced interview questions
Q1 (Beginner): What is a cache stampede?
Q2 (Beginner): What is cache penetration?
Q3 (Intermediate): How do you fix a hot key?
Q4 (Intermediate): What causes sudden latency during a backup?
Q5 (Advanced): How do you debug a Redis incident systematically?
Summary
Run SLOWLOG and INFO first on every incident.