Redis Tutorial, Lesson 40

    Troubleshooting Redis

    Ninety percent of Redis incidents fall into eight buckets: OOM/eviction, hot keys, slow commands, replication lag, stampede, penetration, avalanche, and fragmentation.


    Introduction

    Most incidents fall into the eight buckets above, so systematic diagnosis beats random CONFIG changes.

    Start with SLOWLOG and INFO memory on every latency ticket. Correlate the incident timeline with deploys, traffic spikes, and the backup window (fork latency).

    Document fixes in the postmortem: the same hot key will return next sale season.

    Understanding the topic

    Key concepts

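    • Cache stampede: many requests miss the same expired hot key at once and all hit the backing store (thundering herd).
    • Cache penetration: queries for keys that do not exist, so every request falls through the cache.
    • Cache avalanche: many keys expire at the same time.
    • Hot key: a single key overloads one shard.
    • Big key: DEL or a full read of it blocks the event loop.
    • Fragmentation: RSS creeps up without more data being stored.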
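
The stampede definition above can be made concrete with a "single-flight" guard: when a key is missing, only one caller rebuilds it while the others wait for the result. The sketch below is a minimal in-process version. The class name `SingleFlightCache` and the `loader` callback are hypothetical, a plain dict stands in for Redis, and coordinating several application instances would need a shared lock that is not shown here.

```python
import threading
from typing import Callable, Dict


class SingleFlightCache:
    """In-process cache where only one caller rebuilds a missing key."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {}            # stand-in for the real cache
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()               # protects _key_locks

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key: str, loader: Callable[[str], str]) -> str:
        value = self._values.get(key)
        if value is not None:
            return value                             # fast path: cache hit
        with self._lock_for(key):
            # Re-check: another thread may have filled the key while we waited.
            value = self._values.get(key)
            if value is None:
                value = loader(key)                  # only one thread reaches the database
                self._values[key] = value
            return value
```

With eight concurrent readers of the same missing key, the loader runs once and the other seven reuse its result.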

    Step-by-step explanation

    1. Symptom: latency, errors, or evictions.
    2. Check SLOWLOG, INFO, and recent changes.
    3. Classify: memory, hot key, network, or fork.
    4. Mitigate: TTL jitter, local cache, UNLINK.
    5. Long-term: shard, pipeline, or fix the code.
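
Steps 2 and 3 can be partly automated by parsing raw INFO output and flagging likely incident classes. The sketch below assumes the text was captured from INFO memory and INFO stats. The field names are real INFO fields, but the function names and every threshold (90% of maxmemory, fragmentation ratio 1.5, 100 ms fork) are illustrative assumptions to tune per deployment.

```python
from typing import Dict, List


def parse_info(text: str) -> Dict[str, str]:
    """Parse raw INFO output ("field:value" lines, "# Section" headers)."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        fields[name] = value
    return fields


def classify(info: Dict[str, str]) -> List[str]:
    """Return suspected incident classes; all thresholds are assumptions."""
    suspects: List[str] = []
    used = int(info.get("used_memory", 0))
    limit = int(info.get("maxmemory", 0))  # 0 means no limit configured
    # evicted_keys is cumulative since startup; in practice compare two samples.
    if int(info.get("evicted_keys", 0)) > 0 or (limit and used / limit > 0.9):
        suspects.append("memory")
    if float(info.get("mem_fragmentation_ratio", 1.0)) > 1.5:
        suspects.append("fragmentation")
    if int(info.get("latest_fork_usec", 0)) > 100_000:  # last fork took over 100 ms
        suspects.append("fork")
    return suspects
```

Hot keys and network problems do not show up in these fields, so an empty result only rules out the three classes checked here.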

    Syntax reference

    Common commands

    • redis-cli --hotkeys scans the keyspace and needs an LFU maxmemory-policy; it is not free.
    • Match the incident timestamp to the BGSAVE schedule.
    • Check for clients disconnected by output buffer limits.
    bash
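    redis-cli SLOWLOG GET 20        # 20 most recent slow log entries
    redis-cli --hotkeys -i 0.1      # scan for hot keys, pausing between SCAN batches
    redis-cli INFO memory           # used_memory, maxmemory, fragmentation ratio
    redis-cli INFO stats            # evicted_keys, latest_fork_usec, hit/miss counters
    redis-cli LATENCY DOCTOR        # needs latency-monitor-threshold > 0 to have data
    redis-cli MEMORY DOCTOR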
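
Slow log output is easier to act on once it is grouped by command. The sketch below assumes the entries were already fetched and reduced to `(duration_in_microseconds, argv)` pairs. That input shape and the function name `summarize_slowlog` are hypothetical and do not match any particular client library's return format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def summarize_slowlog(
    entries: Iterable[Tuple[int, List[str]]],
) -> List[Tuple[str, int, int]]:
    """Group slow log entries by command name.

    Returns (command, count, max_duration_us) rows, worst command first.
    """
    stats: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [count, max_us]
    for duration_us, argv in entries:
        command = argv[0].upper()
        stats[command][0] += 1
        stats[command][1] = max(stats[command][1], duration_us)
    rows = [(cmd, count, worst) for cmd, (count, worst) in stats.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

A command that dominates this summary (for example KEYS or HGETALL on a big key) points at a slow-command or big-key incident rather than a network problem.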

    Informative example

    Hot key incident playbook — identify and mitigate:

    bash
    # 1. Confirm the hot key (needs an LFU maxmemory-policy)
    redis-cli --hotkeys -i 0.1
    # 2. Check the key size in bytes
    redis-cli MEMORY USAGE viral:product:sku
    # 3. Mitigate: deploy an in-app local (L1) cache for this key
    # 4. Long-term: spread reads across replicas or duplicate the key for reads

    A celebrity product drop means a hot key. Pre-warm the cache and enable local caching before the event, and work with the product team on key design.
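
The local L1 cache from step 3 can be as small as a dict with a short expiry in front of the Redis read. The sketch below is one possible shape, not the tutorial's own implementation. `LocalL1Cache`, `fetch`, and the one-second default TTL are assumptions, and `fetch` stands in for whatever function reads the key from Redis.

```python
import time
from typing import Callable, Dict, Tuple


class LocalL1Cache:
    """Tiny per-process cache that absorbs repeated reads of a hot key."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # reads from Redis (hypothetical callback)
        self._ttl = ttl_seconds    # short on purpose: bounds staleness
        self._clock = clock        # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # served locally, no Redis round trip
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

The trade-off is staleness: each process may serve a value up to `ttl_seconds` old, which is usually acceptable for product pages but not for inventory counts.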

    Real-world use

    Real-world use cases

    • P99 spike after a marketing email.
    • Pod restarts after an OOM kill.
    • Recovery after an accidental FLUSHDB.
    • MOVED redirect storm from a misconfigured cluster client.
    • Session stampede after a deploy changes TTLs.

    Best practices

    • Keep a runbook per incident class.
    • Never run KEYS * when debugging in production; use SCAN.
    • Add jitter to TTLs to prevent avalanche.
    • Use a Bloom filter or negative caching against penetration.
    • UNLINK big keys during cleanup instead of DEL.
    • Run game days that simulate a hot key.
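
TTL jitter is the cheapest of these practices to adopt: randomize each expiry slightly so keys written together do not expire together. A minimal sketch follows. The function name `jittered_ttl` and the default spread of 10% are assumptions.

```python
import random
from typing import Optional


def jittered_ttl(
    base_seconds: int,
    spread: float = 0.1,
    rng: Optional[random.Random] = None,
) -> int:
    """Return base_seconds shifted by a random offset within +/- spread."""
    if base_seconds <= 0 or not 0 <= spread < 1:
        raise ValueError("base_seconds must be > 0 and spread in [0, 1)")
    source = rng if rng is not None else random  # module-level RNG by default
    offset = source.uniform(-spread, spread) * base_seconds
    return max(1, round(base_seconds + offset))  # never return a zero TTL
```

The result would be passed as the expiry when writing the key, so a nominal one-hour TTL lands somewhere between 54 and 66 minutes.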

    Common mistakes

    • Restarting Redis without diagnosis, which loses evidence.
    • Increasing maxmemory without fixing the TTL leak.
    • Running MONITOR in production, which adds load.
    • Blaming the network without checking the slow log.

    Advanced interview questions

    Q1 (Beginner): What is a cache stampede?
    Many requests miss the same expired hot key simultaneously and hammer the database.
    Q2 (Beginner): What is cache penetration?
    An attack or bug requests non-existent keys, so every request bypasses the cache.
    Q3 (Intermediate): How do you fix a hot key?
    Local L1 cache, read replicas, duplicate keys, splitting the value, or pre-warming.
    Q4 (Intermediate): Why does latency jump during a backup?
    BGSAVE forks the process and copy-on-write adds memory pressure; move backups to a replica and schedule them off-peak.
    Q5 (Advanced): How do you debug a Redis incident systematically?
    Build a timeline, then check SLOWLOG, INFO memory/stats, hot keys, the deploy diff, replication lag, and client buffers before applying a targeted fix.
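
For the penetration answer (Q2), a Bloom filter in front of the cache can reject IDs that were never written, at the cost of occasional false positives. The sketch below is a minimal stdlib-only illustration. The class name, the 8192-bit size, and the four SHA-256-derived hash positions are assumptions, and a production setup would size the filter from the expected key count.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Probabilistic set: no false negatives, rare false positives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self._size = size_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self._num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self._size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(
            self._bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A request whose key fails `might_contain` can be answered as "not found" without touching Redis or the database. A key that passes still needs the normal lookup.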

    Summary

    Check SLOWLOG and INFO first on every incident.
