
Caching & Redis


Caching appears in virtually every system design interview. Redis is the default distributed cache. Master both the concept and the tradeoffs.


Why Cache?

  • DB reads are ~1-10ms. Cache reads are ~0.1ms.
  • Reduce load on DB — same DB can handle 10x traffic with cache in front
  • Absorb read spikes without scaling storage layer

Cache hierarchy (fastest → slowest): CPU registers → L1/L2/L3 cache → RAM → SSD → Network (Redis) → HDD → DB


Cache Strategies

Cache-Aside (Lazy Loading) — Most Common

Read:  check cache → miss → read DB → write to cache → return
Write: write DB → invalidate (or update) cache
import json
# assumes `redis` (a connected redis-py client) and `db` are set up elsewhere

def get_user(user_id):
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)          # cache hit

    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))  # TTL 1hr
    return user

def update_user(user_id, data):
    db.update("UPDATE users SET ...", user_id, data)
    redis.delete(f"user:{user_id}")        # invalidate

Pros: Cache only what's needed. DB is source of truth. Cache failure = degraded, not broken. Cons: Cache miss penalty. First request is slow (cold start).

Write-Through

Write: write DB AND cache simultaneously
Read:  always in cache (if key not expired)

Pros: Cache always consistent with DB. Cons: Write latency includes cache write. Write amplification for data that's rarely read.
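A minimal sketch of the write-through path, assuming a `db` helper and a redis-py-style `cache` client (names are illustrative):

```python
import json

def write_through_save(db, cache, user_id, data, ttl=3600):
    # Write-through: DB and cache are updated on the same write path,
    # so reads always find a fresh cache entry (until the TTL expires).
    db.update(user_id, data)                               # source of truth
    cache.setex(f"user:{user_id}", ttl, json.dumps(data))  # cache, same path

def read_user(cache, user_id):
    cached = cache.get(f"user:{user_id}")
    return json.loads(cached) if cached else None          # miss only on expiry
```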

Write-Behind (Write-Back)

Write: write cache → async flush to DB later (batch)

Pros: Very low write latency. Cons: Data loss risk if cache crashes before DB flush. Complexity.

Read-Through

App reads from cache only. Cache is responsible for loading from DB on miss. Common in managed caches (e.g. DAX, the read-through cache for DynamoDB).
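A read-through wrapper can be sketched like this — the `loader` callback stands in for the DB query; all names are illustrative:

```python
import json

class ReadThroughCache:
    # Read-through: the app only talks to the cache; on a miss the cache
    # itself calls the loader (the DB query) and stores the result.
    def __init__(self, cache, loader, ttl=3600):
        self.cache, self.loader, self.ttl = cache, loader, ttl

    def get(self, key):
        cached = self.cache.get(key)
        if cached is not None:
            return json.loads(cached)                 # hit
        obj = self.loader(key)                        # miss: cache loads, not the app
        self.cache.setex(key, self.ttl, json.dumps(obj))
        return obj
```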


Eviction Policies

| Policy | How | Use when |
| --- | --- | --- |
| LRU (Least Recently Used) | Evict least recently accessed | General purpose — most common |
| LFU (Least Frequently Used) | Evict least frequently accessed | When access frequency matters (hot items stay forever) |
| TTL (Time To Live) | Evict after fixed time | Freshness required (session tokens, rate limiting) |
| FIFO | Evict oldest inserted | Simple queues |
| Random | Evict random | Rarely used |

Redis's default maxmemory-policy is noeviction (writes fail once memory is full); for caching, configure allkeys-lru (evict any key, LRU) or volatile-lru (only TTL-set keys).


Cache Invalidation Problems

Cache invalidation is one of the hardest problems in CS. Two strategies:

  1. TTL-based (expiry): Set a TTL, accept stale data up to TTL duration. Simple, slightly stale.
  2. Event-driven invalidation: On DB write, delete/update cache key. Consistent but complex.

Cache Stampede (Thundering Herd)

When a popular cached item expires, many requests simultaneously miss → flood DB.

Fix: probabilistic early expiration or mutex/lock on cache miss:

import time

def get_with_lock(key, ttl):
    val = redis.get(key)
    if val:
        return val

    # Only one process recalculates (nx = set if not exists)
    lock = redis.set(f"lock:{key}", 1, nx=True, ex=30)
    if lock:
        val = db.query(...)
        redis.setex(key, ttl, val)
        redis.delete(f"lock:{key}")
        return val
    time.sleep(0.1)                  # another process holds the lock: wait, retry
    return get_with_lock(key, ttl)
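The other fix mentioned above, probabilistic early expiration, can be sketched in one function ("XFetch" style — each reader independently decides to refresh slightly before the real expiry, with probability rising as expiry approaches; parameter names are illustrative):

```python
import math
import random
import time

def should_refresh_early(expiry_ts, recompute_cost, beta=1.0, now=None):
    # Treat the entry as expired at now - cost * beta * ln(rand), i.e. a
    # random moment slightly before its real expiry. -ln(rand) is a positive
    # random factor, so expensive recomputations start refreshing earlier,
    # and at most a few readers recompute at once -- the herd never forms.
    now = time.time() if now is None else now
    return now - recompute_cost * beta * math.log(random.random()) >= expiry_ts
```

A reader that gets True recomputes and rewrites the key; everyone else keeps serving the still-valid cached value.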

Redis Data Structures

| Type | Commands | Use cases |
| --- | --- | --- |
| String | GET, SET, INCR, SETEX | Cache values, counters, session tokens |
| Hash | HGET, HSET, HGETALL | Object fields (user profile) |
| List | LPUSH, RPOP, LRANGE | Queue, activity feed, recent items |
| Set | SADD, SMEMBERS, SISMEMBER | Unique visitors, tags, friend list |
| Sorted Set (ZSet) | ZADD, ZRANGE, ZRANGEBYSCORE | Leaderboard, rate limiting, timeline |
| Bitmap | SETBIT, BITCOUNT | Daily active users, feature flags |
| HyperLogLog | PFADD, PFCOUNT | Approximate cardinality (unique count) |

Redis for Rate Limiting

import time

def is_rate_limited(user_id: str, limit: int, window_seconds: int) -> bool:
    key = f"rate:{user_id}:{int(time.time() // window_seconds)}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, window_seconds)
    return count > limit

Sliding window: use sorted set, score = timestamp.
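A sketch of that sliding window, assuming a redis-py-style client `r` (the `now` parameter is only there for testability):

```python
import time
import uuid

def allow_request(r, user_id, limit, window_seconds, now=None):
    # Sliding-window rate limit on a sorted set: one member per request,
    # score = timestamp. Count = requests seen in the last window_seconds.
    key = f"rate:{user_id}"
    now = time.time() if now is None else now
    r.zremrangebyscore(key, 0, now - window_seconds)  # drop requests outside window
    r.zadd(key, {f"{now}:{uuid.uuid4()}": now})       # record this request
    allowed = r.zcard(key) <= limit                   # count inside the window
    r.expire(key, window_seconds)                     # GC the key when idle
    return allowed
```

In production the four commands would go through a pipeline or a Lua script — as written, two concurrent callers can interleave.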

See: [[System Design/Problem Designs/Rate Limiter]]


Redis Persistence

| Mode | How | Tradeoff |
| --- | --- | --- |
| RDB (snapshot) | Periodic dump to .rdb file | Fast restart, some data loss on crash |
| AOF (append-only file) | Log every write command | Near-zero data loss, slower restart |
| Both | RDB + AOF | Best durability, most disk I/O |
| None | Pure in-memory | Fastest, loses all data on restart |

For pure cache (DB is source of truth): no persistence or RDB only. For session store / rate limiter (data loss = bad): AOF.


Redis Replication & Clustering

Master-Replica: One master (writes), N replicas (reads). Automatic failover with Redis Sentinel.

Redis Cluster: Hash slots 0-16383 split across nodes. Each node owns ~16384/N slots. Supports horizontal scaling.

Node A: slots 0–5460
Node B: slots 5461–10922
Node C: slots 10923–16383

Key user:123 → hash → slot → correct node. Client needs cluster-aware library.
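That routing step can be reproduced in a few lines — Redis Cluster uses CRC-16/XMODEM mod 16384, hashing only the `{...}` hash tag if a non-empty one is present (this sketch follows the published cluster spec):

```python
def crc16_xmodem(data: bytes) -> int:
    # CRC-16/XMODEM (polynomial 0x1021), the checksum Redis Cluster
    # uses to route keys to hash slots.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # Map a key to one of the 16384 cluster hash slots. If the key contains
    # a non-empty hash tag like {user}, only the tag is hashed, so tagged
    # keys land on the same slot (and therefore the same node).
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384
```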


CDN as Cache

For static assets (images, JS, CSS), CDN = global cache layer:

  • Push CDN: Upload assets to CDN explicitly
  • Pull CDN: CDN fetches from origin on first miss, caches by TTL

See: [[System Design/System Design Basics]] — scalability evolution


Interview Tradeoffs

| Decision | Option A | Option B |
| --- | --- | --- |
| Consistency | TTL invalidation (slightly stale) | Event-driven invalidation (consistent) |
| Write strategy | Cache-aside (simple) | Write-through (consistent) |
| Eviction | LRU (recency) | LFU (frequency) |
| Scale | Single Redis | Redis Cluster / sharded |
| Durability | In-memory only | AOF persistence |

Common interview answer structure: "I'd use cache-aside with LRU eviction and a 5-minute TTL. For writes, I'd invalidate the cache key immediately. For read-heavy data like user profiles, this gives ~95% cache hit rate at steady state."



Redis Architecture Deep Dive

Source: [[raw/Redis Architecture Deep Dive What I Learned Building High-Performance Systems]]

Why Redis Is Fast: Single-Threaded + I/O Multiplexing

Single-threaded — Redis processes one command at a time on one thread.

Sounds wrong. Actually eliminates entire problem classes:

  • No race conditions — commands execute atomically, no locks needed
  • No context switching overhead
  • Predictable, consistent performance (multi-threaded DBs have lock contention variance)

Critical implication: Slow commands block EVERYTHING.

# This freezes your entire application — KEYS * on 10M keys = 30 second freeze
KEYS *

# Use SCAN instead — iterates without blocking
SCAN 0 MATCH "user:*" COUNT 100

Never use in production: KEYS *, FLUSHALL, SMEMBERS on huge sets, LINDEX on distant positions.
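From a client, the SCAN loop looks like this — a sketch using the redis-py-style `scan(cursor, match, count)` call; the generator shape is illustrative:

```python
def scan_keys(r, pattern, count=100):
    # Cursor-based iteration: each SCAN call does a small, bounded amount of
    # work on the server, so it never blocks other clients the way KEYS does.
    cursor = 0
    while True:
        cursor, batch = r.scan(cursor=cursor, match=pattern, count=count)
        yield from batch
        if cursor == 0:          # cursor 0 means the iteration is complete
            break
```

SCAN guarantees every key present for the whole iteration is returned at least once, but a key may appear twice — dedupe client-side if that matters.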

I/O Multiplexing (epoll/kqueue) — how single-threaded Redis handles thousands of simultaneous connections:

epoll monitors thousands of sockets
→ OS notifies Redis when data arrives on any socket
→ Redis reads command, processes it (sub-ms), writes response
→ Moves to next ready socket
All on one thread. No per-connection thread overhead.

Scales to ~100,000 concurrent connections on modern hardware.

Data Structure Internals

Strings — 3 internal encodings:

| Value | Encoding | Why |
| --- | --- | --- |
| Integer | int | Stored as an actual 64-bit int — fast INCR, no parsing (small ints 0-9999 also come from a shared object pool) |
| Short string (≤44 bytes) | embstr | Embedded in value object — single allocation |
| Long string (>44 bytes) | raw | Separate allocation with pointer |

Lists — linked list gotcha:

Redis lists are quicklists — a doubly-linked list of compact listpack/ziplist nodes. LPUSH/RPOP = O(1). But LINDEX list 50000 = O(N) — scans toward the 50,000th element. Never access list middle elements at scale.

Sorted Sets — skip list + hash table dual storage:

O(log N) for ZADD, ZRANK, ZRANGEBYSCORE
O(1) for ZSCORE

Memory warning: Each member stored twice (skip list + hash table). 10M member leaderboard ≈ 500MB+ RAM.

Hashes — ziplist memory savings (3x):

Small hashes (by default ≤128 fields with values ≤64 bytes — hash-max-listpack-entries / hash-max-listpack-value) are stored as a listpack (formerly ziplist) — contiguous memory, no pointer overhead.

# Bad: one key per field — at 1000 users, per-key overhead adds ≈ 100KB+
SET user:1:name "John"
SET user:1:email "john@example.com"

# Good: one hash per user, ziplist-encoded — same data ≈ 33KB
HSET user:1 name "John" email "john@example.com"

When building millions of objects in Redis, hashes reduce memory by ~60-70%.

Persistence Deep Dive

RDB (snapshot) — the fork problem:

1. Redis forks a child process (copy-on-write — efficient for reads)
2. Child writes snapshot to temp file
3. Child renames to dump.rdb
4. Parent continues serving requests

Problem: Fork on 10GB dataset with active writes = 200-500ms where Redis doesn't respond. Every scheduled save causes a latency spike.

Fix: Use a replica for RDB saves. Ensure 2x free memory for copy-on-write pages.

AOF (append-only file) — growth problem:

AOF logs every write command. After a week: 50GB AOF file for a 2GB dataset.

Fix: BGREWRITEAOF — creates compact AOF with current state. Also uses fork (same latency spike risk).

appendfsync everysec   # fsync every second — good balance (default)
appendfsync always     # fsync every command — slowest, safest
appendfsync no         # let OS decide — fastest, data loss risk

Production guide:

| Use case | Persistence mode |
| --- | --- |
| Pure cache (DB is source of truth) | None or infrequent RDB |
| Session store | AOF everysec |
| Rate limiter counters | AOF everysec |
| Critical data (payment state) | AOF always + replicas |
| Development | None (performance) |

Lua Scripts — True Atomicity

When you need multi-step operations without race conditions, use Lua — not MULTI/EXEC.

Rate limiter example (GET → check → INCR → EXPIRE has a race condition):

-- rate_limit.lua: atomic check-and-increment
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local ttl   = tonumber(ARGV[2])

local current = redis.call('GET', key)
if current == false then current = 0
else current = tonumber(current) end

if current < limit then
    redis.call('INCR', key)
    if current == 0 then
        redis.call('EXPIRE', key, ttl)  -- set TTL only when the window opens;
    end                                 -- resetting it per request would extend the window
    return 1   -- allowed
else
    return 0   -- rate limited
end

Entire script executes atomically — no interleaving possible.

Use Lua when: Atomic multi-step (rate limiting, distributed locks, dedup). Don't use for: Simple single commands, long-running computation (blocks Redis).

Cluster note: All keys in a Lua script must be in the same hash slot. Use KEYS[] (not strings) and hash tags {...} to guarantee co-location.

Pipelining — 100x Bulk Throughput

# Without pipelining: 1000 network round trips ≈ 1 second
for i in range(1000):
    redis.set(f"key:{i}", f"value:{i}")

# With pipelining: 1 round trip ≈ 10ms
pipe = redis.pipeline()
for i in range(1000):
    pipe.set(f"key:{i}", f"value:{i}")
pipe.execute()

Important: Pipelined commands are NOT transactional. They execute independently. For atomicity, use Lua.

Transactions Gotcha — No Rollback

Redis MULTI/EXEC ≠ SQL transactions.

MULTI
SET key1 "value1"
INCR key1              # This will error (key1 is a string, not int)
SET key2 "value2"
EXEC

Result: key1 = "value1" ✅, INCR fails ❌, key2 = "value2" ✅.

Redis transactions DO NOT rollback on error. Other commands in the transaction still execute. If you need true atomicity, use Lua scripts.

Transactions guarantee: commands execute in order, atomically (no interleaving), but NOT rollback on failure.

Memory Management

Eviction policies (when maxmemory is hit):

| Policy | Evicts | Use when |
| --- | --- | --- |
| noeviction | Nothing — returns error | Can't afford data loss |
| allkeys-lru | Any key, least recently used | General-purpose cache |
| volatile-lru | Keys with TTL, LRU | Keep permanent keys safe |
| allkeys-random | Any key, random | Uniform access pattern |
| volatile-ttl | Key closest to expiry | Prioritize short-lived data |

Note: Redis LRU is approximate — samples N random keys, evicts least recently used among them. Faster than true LRU but not exact.
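The sampling trick is easy to picture — a toy model (not Redis source) where each key tracks a last-access timestamp:

```python
import random

def approx_lru_victim(last_access, sample_size=5, rng=random):
    # Approximate LRU, Redis-style: sample a handful of random keys and evict
    # the least recently used *among the sample* -- no global LRU list needed.
    sample = rng.sample(list(last_access.items()), min(sample_size, len(last_access)))
    return min(sample, key=lambda kv: kv[1])[0]   # kv[1] = last-access timestamp
```

Redis exposes the sample size as maxmemory-samples (default 5); raising it approaches true LRU at more CPU cost.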

Memory fragmentation:

INFO memory
# Check: mem_fragmentation_ratio
# > 1.5 = significant fragmentation
# used_memory: what Redis uses
# used_memory_rss: what OS allocated to Redis

# Example: used=2.5GB, rss=4.8GB, ratio=1.92 → wasting 2.3GB

Fix: CONFIG SET activedefrag yes (Redis 4.0+ background defrag). Or restart Redis periodically.

Monitoring

SLOWLOG GET 10          # last 10 slow commands (default threshold: 10ms)

INFO memory             # used_memory, mem_fragmentation_ratio
INFO stats              # keyspace_hits, keyspace_misses → cache hit rate
INFO clients            # connected_clients
INFO replication        # master/replica lag

# Alert thresholds:
# memory > 80% of maxmemory
# cache hit rate < 95%
# evicted_keys > 0 (unexpected eviction)
# replication lag > 1s
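The hit-rate alert above comes straight from two counters in INFO stats — a small helper, assuming `info_stats` is the dict returned by redis-py's `r.info("stats")`:

```python
def cache_hit_rate(info_stats):
    # hit rate = hits / (hits + misses). The counters only reset on restart
    # or CONFIG RESETSTAT, so compute deltas between samples for a windowed rate.
    hits = info_stats.get("keyspace_hits", 0)
    misses = info_stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else 0.0
```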

Cluster Hash Tags

In Redis Cluster, multi-key operations must be on the same hash slot.

# These might be on different nodes → CROSSSLOT error
MGET "user:1" "user:2"

# Hash tags force same slot: {user} hashes to same slot
MGET "{user}:1" "{user}:2"   ✅

# Lua scripts: all KEYS must be same slot
# Use hash tags in key names for any multi-key Lua scripts

Related

  • [[Distributed Systems Concepts]] — CAP, consistency
  • [[System Design/Problem Designs/Rate Limiter]] — Redis sliding window
  • [[Message Queues & Kafka]] — Redis Pub/Sub vs Kafka
  • [[synthesis/Interview Prep Hub]] — SD talking points