Managing Redis Cache Issues and Monitoring Key Metrics
Overview of Four Common Redis Cache Problems
| Problem | Symptom | Mitigation Approach | Remarks |
|---|---|---|---|
| Cache Warm-Up | Service crashes shortly after launch | Load hot entries first; accelerate loading; sync master-slave data | Requires routine hot-entry analysis |
| Cache Avalanche | Mass expiration of keys → DB overload | Multi-level cache; static page rendering; optimize queries; alerting + throttling + circuit breaker + isolation; vary TTLs; permanent keys; locking; delayed refresh; adjust eviction policy | Combine prevention & reaction strategies |
| Cache Breakdown | Sudden DB spike despite stable keys | Locking; pre-set TTL for likely hot keys; delayed refresh; secondary cache | Focus on specific high-risk keys |
| Cache Penetration | Gradual hit-rate drop + high CPU + DB pressure | Cache nulls; whitelist via bitmap/Bloom filter; encrypt keys; monitoring + blacklist | Use temporarily; remove when resolved |
Cache Warm-Up
Typical Scenario
A newly deployed application using Redis crashes quickly under load due to:
- High request volume
- Heavy master-slave sync traffic
- Frequent RDBMS reads
Warm-Up Process
Preparation: Identify hot entries continuously.
- Heuristic method: Log access frequency, extract frequently read items.
- Algorithmic method: Maintain retention queue using LRU (e.g., Storm + Kafka pipeline).
Steps:
- Classify entries by priority; preload high-priority items into Redis.
- Parallelize loading across distributed nodes to shorten duration.
- Preload both master and replica instances.
Execution:
- Trigger warm-up via scheduled scripts.
- Optionally integrate CDN for better delivery.
Summary: Preloading critical entries avoids initial DB queries, letting users hit ready-to-serve cache immediately.
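The warm-up flow above can be sketched in a few lines. This is a minimal illustration, not a production loader: `db_fetch` stands in for the RDBMS read and the `cache` dict stands in for Redis; in practice you would swap in real clients and distribute the work across nodes.

```python
from concurrent.futures import ThreadPoolExecutor

cache = {}  # stand-in for Redis

def db_fetch(key):
    """Stand-in for the RDBMS read that warm-up is meant to avoid at runtime."""
    return f"value-of-{key}"

def preload(keys, workers=4):
    """Load a batch of entries into the cache in parallel, before traffic arrives."""
    def load_one(key):
        cache[key] = db_fetch(key)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(load_one, keys))  # drain the iterator so all loads finish

# Classify by priority: hottest entries are loaded first.
hot_keys = ["item:1001", "item:1002"]
warm_keys = ["item:2001"]
preload(hot_keys)
preload(warm_keys)
```

A scheduled script would run this against each master and replica before cutover.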
Cache Avalanche
Scenario
During steady operation, DB connections surge causing:
- Client errors: 408 (timeout), 500 (server error)
- Server collapse: DB, app, Redis, and cluster failures even after restart
Root Cause: Many keys expire simultaneously, forcing mass DB fetches that overwhelm the DB and cascade into broader failure.
Two origins:
- Cache layer failure → all requests hit DB.
- Bulk expiration of popular keys → direct DB hits.
Preventive Measures
- Static rendering of high-traffic pages.
- Multi-tier caching: User → HTTP cache → CDN → proxy cache → local process cache → distributed cache → DB.
- Optimize slow DB operations (long queries, heavy transactions).
- Alerting system: track CPU usage, memory, avg response time, thread count; apply throttling or degradation to shed excess load temporarily.
Tiered Cache Characteristics:
- HTTP + CDN: serve static assets efficiently.
- Proxy cache: stable dynamic resources.
- Local process cache (Ehcache, Guava, Caffeine): fast but limited; sync via MQ or timer.
- Distributed cache (Redis cluster): large scale, robust.
Resilience Patterns:
- Circuit breaker: reroute traffic away from a faulty cache node.
- Throttling: limit incoming requests at edge/proxy.
- Isolation: queue requests when cache rebuilding/preheating.
Reactive Strategies
- Mix LRU/LFU eviction policies.
- Stagger TTLs: e.g., group A = 90min, B = 80min, C = 70min; add random offset to spread expirations.
- Permanent keys for super-hot entries.
- Scheduled maintenance: analyze near-expiry access patterns, extend TTL where needed.
- Locking (use cautiously): single-threaded refresh with snapshot rebuild; primary-replica failover.
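The TTL-staggering idea above is easy to express in code. A minimal sketch, using the example groups from the text (90/80/70 minutes) plus a random offset; the group names and jitter window are illustrative values, not fixed recommendations.

```python
import random

# Base TTLs in seconds, one per priority group (from the 90/80/70-minute example).
BASE_TTLS = {"A": 90 * 60, "B": 80 * 60, "C": 70 * 60}

def staggered_ttl(group, jitter=300):
    """Base TTL for the group plus a random offset, so keys in the same
    batch do not all expire at the same instant."""
    return BASE_TTLS[group] + random.randint(0, jitter)

ttl = staggered_ttl("A")  # somewhere in [5400, 5700] seconds
```

With a real client the result would be passed as the expiry argument when the key is written.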
Summary: Avalanche stems from concentrated expirations flooding DB. Spread TTLs and combine with layered defenses plus real-time metrics tuning.
Cache Breakdown
Scenario
System runs normally, no mass key expiry, yet DB load spikes and crashes—common with viral products.
Diagnosis:
- Specific hot key expires.
- Multiple requests miss Redis and hammer DB for same record.
Cause: Single high-traffic key expiry event.
Mitigation
- Predictive TTL: identify likely hot keys (e.g., flash-sale items) and set suitable expiry.
- Live adjustment: monitor access frequency, extend TTL or make permanent during surges.
- Background renewal: refresh TTL before peak periods.
- Secondary cache: use different expiry to avoid simultaneous invalidation.
- Distributed lock: prevent concurrent DB loads on miss (mind performance impact).
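The lock-on-miss pattern from the last bullet can be sketched with a local lock standing in for a distributed one (the single-process `threading.Lock` here would be a Redis-based lock such as SETNX in a real deployment). The double-check after acquiring the lock is what ensures only one caller reloads from the DB.

```python
import threading

cache = {}
rebuild_lock = threading.Lock()  # stand-in for a distributed lock
db_loads = 0

def db_load(key):
    """Stand-in for the expensive DB query behind the cache."""
    global db_loads
    db_loads += 1
    return f"db-value:{key}"

def get_with_lock(key):
    """On a miss, only one caller rebuilds the entry; the rest reuse it."""
    value = cache.get(key)
    if value is not None:
        return value
    with rebuild_lock:
        # Double-check: another thread may have rebuilt while we waited.
        value = cache.get(key)
        if value is None:
            value = db_load(key)
            cache[key] = value
        return value

# Ten concurrent readers of the same hot key trigger a single DB load.
threads = [threading.Thread(target=get_with_lock, args=("hot-item",))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The performance caveat in the text applies: every miss serializes on the lock, so reserve this for genuinely hot keys.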
Summary: Breakdown is a single-key expiry under high concurrency. Prevent via data analysis, live monitoring, and layered cache design.
Cache Penetration
Scenario
Hit-rate declines over time, CPU usage rises, DB overloaded despite stable Redis memory—often from malicious or bogus requests.
Diagnosis:
- Widespread cache misses.
- Requests for nonexistent keys or attack URLs.
Cause: Queries for absent data return null; nulls aren’t cached, so DB is repeatedly queried.
Solutions
- Cache nulls briefly (30–300s) as interim fix.
- Whitelist known valid IDs using bitmaps or Bloom filters (more efficient than plain bitmaps).
- Monitoring + blacklist: flag abnormal hit-rate drops or null-ratio surges; apply blacklists during attacks.
- Encrypt keys: validate at app edge to block malformed requests.
Summary: Penetration targets data that does not exist, bypassing the cache entirely. Apply temporary shields such as blacklists and whitelists, and remove them once the threat subsides.
Redis Performance Monitoring Metrics
Performance
| Metric | Meaning | Note |
|---|---|---|
| Latency | Response time per request | |
| instantaneous_ops_per_sec | Average QPS | |
| Hit rate | Cache efficiency | Low rate signals stress or a poor expiry strategy |
Memory
| Metric | Meaning |
|---|---|
| used_memory | Memory consumed |
| mem_fragmentation_ratio | Fragmentation level |
| evicted_keys | Keys removed due to maxmemory limit |
| blocked_clients | Clients stalled on blocking list commands (BRPOP, etc.) |
Activity
| Metric | Meaning |
|---|---|
| connected_clients | Active client connections |
| connected_slaves | Replica count |
| master_last_io_seconds_ago | Seconds since last master-slave interaction |
| keyspace | Total keys in DB; sudden drops may precede avalanche |
Persistence
| Metric | Meaning |
|---|---|
| rdb_last_save_time | Timestamp of last RDB save |
| rdb_changes_since_last_save | Count of writes since last save |
Errors
| Metric | Meaning |
|---|---|
| rejected_connections | Connections denied by the maxclients limit |
| keyspace_misses | Cache misses |
| master_link_down_since_seconds | Duration of master-slave disconnect |
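The hit rate in the performance table is not reported directly; it is derived from two INFO counters, `keyspace_hits` and `keyspace_misses`. A minimal sketch with hypothetical counter values:

```python
# Hypothetical values as they would appear in the stats section of INFO.
info = {"keyspace_hits": 9200, "keyspace_misses": 800}

def hit_rate(stats):
    """Hit rate = hits / (hits + misses); Redis only exposes the raw counters."""
    total = stats["keyspace_hits"] + stats["keyspace_misses"]
    return stats["keyspace_hits"] / total if total else 0.0

rate = hit_rate(info)  # 9200 / 10000 = 0.92
```

Tracking this ratio over time is what surfaces the gradual decline described under cache penetration.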
Monitoring Tools & Commands
Tools: Cloud Insight Redis, Prometheus, Redis-stat, Redis-faina, RedisLive, Zabbix.
Commands: redis-benchmark, redis-cli, monitor, slowlog.
Configure slow log:

```
slowlog-log-slower-than 1000   # microseconds
slowlog-max-len 100            # max entries
```

Retrieve slow log info:

```
slowlog get     # fetch entries
slowlog len     # entry count
slowlog reset   # clear log
```
Bloom Filter
Use case: Fast duplicate username check at registration.
Definition: Space-efficient probabilistic structure combining a bit array and multiple hash functions to test set membership.
Traits:
- A fixed-size filter can accept unlimited elements, but the false-positive rate rises as it fills; once every bit is 1, all queries appear present.
- Can yield false positives but not false negatives.
- Does not support deletion.
How It Works
Add element:
- Compute K hashes of the value.
- Map each hash to an index in the bit array; set those bits to 1.
Check existence:
- Recompute the K hashes.
- If all corresponding bits are 1 → possibly present; any bit 0 → definitely absent.
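The add/check steps above can be sketched directly. This is a minimal illustration, not a production filter: the K hash functions are derived from salted sha256 digests (an assumption for simplicity; real implementations typically use faster non-cryptographic hashes), and the size and K are arbitrary small values.

```python
import hashlib

class BloomFilter:
    """Minimal sketch: a bit array plus K hash functions."""

    def __init__(self, size=1024, k=3):
        self.size = size
        self.k = k
        self.bits = bytearray(size // 8 + 1)

    def _indexes(self, value):
        # Derive K indexes by salting the value with the hash number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        # Set the bit at each of the K positions.
        for idx in self._indexes(value):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, value):
        # Any unset bit -> definitely absent; all set -> possibly present.
        return all(self.bits[i // 8] & (1 << (i % 8))
                   for i in self._indexes(value))
```

For the registration use case, every taken username is added at startup; a `might_contain` miss guarantees the name is free, while a hit requires a confirming DB check because of possible false positives.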