Fading Coder

One Final Commit for the Last Sprint


Managing Redis Cache Issues and Monitoring Key Metrics

Notes · May 15

Overview of Four Common Redis Cache Problems

Cache Warm-Up
  • Symptom: Service crashes shortly after launch
  • Mitigation: Load hot entries first; accelerate loading; sync master-slave data
  • Remarks: Requires routine hot-entry analysis

Cache Avalanche
  • Symptom: Mass expiration of keys → DB overload
  • Mitigation: Multi-level cache; static page rendering; optimize queries; alerting + throttling + circuit breaker + isolation; vary TTLs; permanent keys; locking; delayed refresh; adjust eviction policy
  • Remarks: Combine prevention and reaction strategies

Cache Breakdown
  • Symptom: Sudden DB spike despite stable keys
  • Mitigation: Locking; pre-set TTL for likely hot keys; delayed refresh; secondary cache
  • Remarks: Focus on specific high-risk keys

Cache Penetration
  • Symptom: Gradual hit-rate drop + high CPU + DB pressure
  • Mitigation: Cache nulls; whitelist via bitmap/Bloom filter; encrypt keys; monitoring + blacklist
  • Remarks: Use temporarily; remove when resolved

Cache Warm-Up

Typical Scenario

A newly deployed application using Redis crashes quickly under load due to:

  • High request volume
  • Heavy master-slave sync traffic
  • Frequent RDBMS reads

Warm-Up Process

Preparation: Identify hot entries continuously.

  • Heuristic method: Log access frequency, extract frequently read items.
  • Algorithmic method: Maintain retention queue using LRU (e.g., Storm + Kafka pipeline).

Steps:

  1. Classify entries by priority; preload high-priority items into Redis.
  2. Parallelize loading across distributed nodes to shorten duration.
  3. Preload both master and replica instances.

Execution:

  • Trigger warm-up via scheduled scripts.
  • Optionally integrate CDN for better delivery.
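The preparation and preload steps above can be sketched in a few lines. This is a minimal illustration, not a production warm-up script: a plain dict stands in for the Redis client, and `load_from_db` is a hypothetical placeholder for the real RDBMS read.

```python
from concurrent.futures import ThreadPoolExecutor

# A dict stands in for the Redis client; in practice you would call
# redis.Redis().set(...) here instead.
cache = {}

def load_from_db(key):
    # Hypothetical placeholder for the real database read.
    return f"value-for-{key}"

def preload(hot_keys, workers=4):
    """Preload hot entries in parallel so the first user requests hit cache."""
    def warm(key):
        cache[key] = load_from_db(key)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() blocks until every key has been warmed.
        list(pool.map(warm, hot_keys))

preload(["item:1", "item:2", "item:3"])
```

The thread pool mirrors step 2 above (parallelizing the load); in a real deployment each distributed node would warm its own shard of the hot-key list.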

Summary: Preloading critical entries avoids initial DB queries, letting users hit ready-to-serve cache immediately.

Cache Avalanche

Scenario

During steady operation, DB connections surge causing:

  • Client errors: 408 (timeout), 500 (server error)
  • Server collapse: DB, app, Redis, and cluster failures even after restart

Root Cause: Many keys expire simultaneously, forcing mass DB fetches that overwhelm the DB and trigger cascading failure.

Two origins:

  1. Cache layer failure → all requests hit DB.
  2. Bulk expiration of popular keys → direct DB hits.

Preventive Measures

  • Static rendering of high-traffic pages.
  • Multi-tier caching: User → HTTP cache → CDN → proxy cache → local process cache → distributed cache → DB.
  • Optimize slow DB operations (long queries, heavy transactions).
  • Alerting system: track CPU usage, memory, avg response time, thread count; apply throttling or degradation to shed excess load temporarily.

Tiered Cache Characteristics:

  • HTTP + CDN: serve static assets efficiently.
  • Proxy cache: stable dynamic resources.
  • Local process cache (Ehcache, Guava, Caffeine): fast but limited; sync via MQ or timer.
  • Distributed cache (Redis cluster): large scale, robust.
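The read path through these tiers can be sketched as follows. This is a simplified, assumption-laden model: two dicts stand in for the local process cache and the Redis cluster, and `db_read` is a hypothetical stand-in for the database query.

```python
local_cache = {}        # in-process tier (Ehcache/Guava/Caffeine in Java land)
distributed_cache = {}  # stands in for a Redis cluster

def db_read(key):
    # Hypothetical placeholder for the real database query.
    return f"row:{key}"

def get(key):
    """Read through the tiers: local -> distributed -> DB, backfilling on miss."""
    if key in local_cache:
        return local_cache[key]
    if key in distributed_cache:
        value = distributed_cache[key]
        local_cache[key] = value       # backfill the faster tier
        return value
    value = db_read(key)               # last resort: hit the database
    distributed_cache[key] = value
    local_cache[key] = value
    return value
```

Each tier absorbs traffic before it reaches the next, which is exactly why a multi-tier layout blunts an avalanche: only the small fraction of requests that miss every cache layer ever reaches the DB.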

Resilience Patterns:

  • Circuit breaker: reroute traffic from faulty cache node.
  • Throttling: limit incoming requests at edge/proxy.
  • Isolation: queue requests when cache rebuilding/preheating.

Reactive Strategies

  • Mix LRU/LFU eviction policies.
  • Stagger TTLs: e.g., group A = 90min, B = 80min, C = 70min; add random offset to spread expirations.
  • Permanent keys for super-hot entries.
  • Scheduled maintenance: analyze near-expiry access patterns, extend TTL where needed.
  • Locking (use cautiously): single-threaded refresh with snapshot rebuild; primary-replica failover.
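The TTL-staggering idea above is easy to make concrete. A minimal sketch, assuming the three example groups from the text (90/80/70 minutes) and a configurable random jitter window:

```python
import random

# Base TTLs per data group, in seconds (the 90/80/70-minute example above).
BASE_TTLS = {"A": 90 * 60, "B": 80 * 60, "C": 70 * 60}

def staggered_ttl(group, jitter_seconds=300):
    """Base TTL for the group plus a random offset so keys never expire together."""
    return BASE_TTLS[group] + random.randint(0, jitter_seconds)
```

When every write uses `staggered_ttl`, expirations are smeared across the jitter window instead of landing on the same instant, which removes the synchronized-expiry trigger for an avalanche.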

Summary: Avalanche stems from concentrated expirations flooding DB. Spread TTLs and combine with layered defenses plus real-time metrics tuning.

Cache Breakdown

Scenario

System runs normally with no mass key expiry, yet DB load spikes and the system crashes; this pattern is common with viral products.

Diagnosis:

  • Specific hot key expires.
  • Multiple requests miss Redis and hammer DB for same record.

Cause: Single high-traffic key expiry event.

Mitigation

  • Predictive TTL: identify likely hot keys (e.g., flash-sale items) and set suitable expiry.
  • Live adjustment: monitor access frequency, extend TTL or make permanent during surges.
  • Background renewal: refresh TTL before peak periods.
  • Secondary cache: use different expiry to avoid simultaneous invalidation.
  • Distributed lock: prevent concurrent DB loads on miss (mind performance impact).

Summary: Breakdown is a single-key expiry under high concurrency. Prevent via data analysis, live monitoring, and layered cache design.

Cache Penetration

Scenario

The hit rate declines over time, CPU usage rises, and the DB is overloaded despite stable Redis memory, often due to malicious or bogus requests.

Diagnosis:

  • Widespread cache misses.
  • Requests for nonexistent keys or attack URLs.

Cause: Queries for absent data return null; nulls aren’t cached, so DB is repeatedly queried.

Solutions

  • Cache nulls briefly (30–300s) as interim fix.
  • Whitelist known valid IDs using bitmaps or Bloom filters (more efficient than plain bitmaps).
  • Monitoring + blacklist: flag abnormal hit-rate drops or null-ratio surges; apply blacklists during attacks.
  • Encrypt keys: validate at app edge to block malformed requests.

Summary: Penetration is traffic that targets non-existent data and therefore bypasses the cache entirely. Use temporary shields such as black/white lists and remove them once the threat ends.

Redis Performance Monitoring Metrics

Performance

Metric Meaning
Latency Response time per request
instantaneous_ops_per_sec Commands processed per second (instantaneous)
Hit rate Cache efficiency; a low rate signals stress or a poor expiry strategy
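Redis does not report the hit rate directly; it is derived from the `keyspace_hits` and `keyspace_misses` counters in the `INFO stats` section. A small helper, guarding against the no-traffic case:

```python
def hit_rate(keyspace_hits, keyspace_misses):
    """Cache hit rate from the INFO stats counters; 0.0 when no lookups yet."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else 0.0
```

A healthy read-heavy workload typically sits well above 0.9; a steady decline in this ratio is the early symptom of the penetration scenario described above.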

Memory

Metric Meaning
used_memory Memory consumed
mem_fragmentation_ratio Fragmentation level
evicted_keys Keys removed due to maxmemory limit
blocked_clients Clients stalled on blocking list commands (BRPOP, etc.)

Activity

Metric Meaning
connected_clients Active client connections
connected_slaves Replica count
master_last_io_seconds_ago Seconds since last master-slave interaction
keyspace Total keys in DB; sudden drops may precede avalanche

Persistence

Metric Meaning
rdb_last_save_time Timestamp of last RDB save
rdb_changes_since_last_save Count of writes since last save

Errors

Metric Meaning
rejected_connections Connections denied by the maxclients limit
keyspace_misses Cache misses
master_link_down_since_seconds Duration of master-slave disconnection

Monitoring Tools & Commands

Tools: Cloud Insight Redis, Prometheus, Redis-stat, Redis-faina, RedisLive, Zabbix. Commands: redis-benchmark, redis-cli, monitor, slowlog.

Configure slow log:

slowlog-log-slower-than 1000   # microseconds
slowlog-max-len 100           # max entries

Retrieve slow log info:

slowlog get   # fetch entries
slowlog len   # entry count
slowlog reset # clear log

Bloom Filter

Use case: Fast duplicate username check at registration.

Definition: Space-efficient probabilistic structure combining a bit array and multiple hash functions to test set membership.

Traits:

  • A filter of fixed size can accept any number of elements, but the more it holds, the higher the false-positive rate; once every bit is 1, every query appears present.
  • Can yield false positives but not false negatives.
  • Does not support deletion.

How It Works

Add element:

  1. Compute K hashes of the value.
  2. Map each hash to an index in the bit array; set those bits to 1.

Check existence:

  1. Recompute the K hashes.
  2. If all corresponding bits are 1 → possibly present; any bit 0 → definitely absent.
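The add/check procedure above can be implemented in a few lines. This is a minimal educational sketch (in production you would use a library such as RedisBloom); the K hash positions are derived here by salting SHA-256 with the hash index, which is one common construction, not the only one.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a fixed bit array plus K derived hash positions."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [0] * size

    def _positions(self, value):
        # Derive K positions by salting SHA-256 with the hash index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # All bits set -> possibly present; any zero bit -> definitely absent.
        return all(self.bits[pos] for pos in self._positions(value))
```

Note the asymmetry in `might_contain`: a `False` answer is definitive (no false negatives), while a `True` answer only means "possibly present", which is exactly why the filter works as a penetration whitelist: requests for IDs the filter rejects can be dropped without touching the cache or DB.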
