# Senior SD Mindset — L7 Lessons
Source: [[raw/The Day a Google L7 Engineer Tore My System Design to Shreds]]
The gap between mid-level and staff-level system design isn't more patterns — it's understanding the physics behind the patterns.
## The Pattern Trap
Most engineers study SD by memorizing what big companies did. They can draw CDN + Redis + Kafka from memory and explain Netflix's tech stack.
An L7 interviewer doesn't care whether you can copy Netflix. They ask: "Why did Netflix have to move to that architecture in the first place?"
Line cook vs Master Chef: A line cook follows the recipe — add salt when it says add salt. A Master Chef understands the chemistry — if tomatoes are more acidic today, they adjust the sugar. System design is the same. Your "ingredients" are your constraints (latency budget, throughput, consistency requirements).
Symptom you're in the Pattern Trap: You add a component (cache, queue, CDN) because you've seen it in other designs — not because your current numbers demand it.
Fix: For every component you add, state: the dollar cost, the latency cost, the maintenance cost, and the failure mode.
## Hot Key Problem
Classic example: viral URL in a URL shortener (Super Bowl ad).
What happens:
- Viral shortCode → every app server hammers the same cache key / same DB row
- Adding more app servers → MORE concurrent requests to the same single contention point
- Result: horizontal scaling makes it worse
Solutions:
| Fix | How | When to use |
|---|---|---|
| Request collapsing | Multiple identical requests wait for one DB read; result shared (see sketch below) | High read concurrency on same key |
| Local in-process cache | Each app server caches top-N keys in memory | Predictable hot keys, can tolerate slight staleness |
| Adaptive caching | Detect hot keys dynamically, promote to edge/local tier | Unpredictable viral spikes |
| Read replicas per shard | Route hot-key reads to dedicated replica | When DB is the bottleneck, not cache |
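A minimal sketch of request collapsing (first row above), assuming a synchronous `loader` callable that does the DB read; the class name and threading approach are illustrative, not from the source:

```python
import threading

class RequestCollapser:
    """Collapse concurrent reads of the same key into one backend call."""

    def __init__(self, loader):
        self._loader = loader            # e.g. key -> row from the DB
        self._lock = threading.Lock()
        self._in_flight = {}             # key -> (done_event, result_box)

    def get(self, key):
        with self._lock:
            entry = self._in_flight.get(key)
            if entry is None:
                # First caller becomes the leader and performs the load.
                entry = (threading.Event(), {})
                self._in_flight[key] = entry
                leader = True
            else:
                leader = False
        event, box = entry

        if leader:
            try:
                box["value"] = self._loader(key)
            finally:
                with self._lock:
                    del self._in_flight[key]
                event.set()              # release everyone waiting on this key
        else:
            event.wait()                 # piggyback on the leader's single read
        return box.get("value")
```

Go ships this pattern as golang.org/x/sync/singleflight. Note the result is shared only among requests that overlap in time; the next burst triggers one fresh read.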
Interview signal: Mention hot key problem proactively when discussing caching for any read-heavy system. Saying "we'll just add more cache nodes" is a junior answer.
## Cache Hit Rate Math
The formula:
Effective Latency = (HitRate × CacheLatency) + ((1 - HitRate) × DBLatency)
Why it matters — worked example:
Assume: Cache = 1ms, DB = 10ms
| Hit Rate | Effective Latency | Net benefit after cache overhead? |
|---|---|---|
| 90% | 0.9ms + 1ms = 1.9ms | Yes — massive win |
| 50% | 0.5ms + 5ms = 5.5ms | Marginal |
| 10% | 0.1ms + 9ms = 9.1ms + network/serialization overhead | Net negative |
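The table above as a quick script, so you can plug in your own numbers. The 1ms/10ms figures come from the example; the `miss_overhead_ms` knob is an assumption I've added to model the wasted cache round trip on a miss, which the simple formula ignores:

```python
def effective_latency(hit_rate, cache_ms=1.0, db_ms=10.0, miss_overhead_ms=0.0):
    # Effective Latency = (HitRate x CacheLatency) + ((1 - HitRate) x DBLatency)
    # miss_overhead_ms: extra cost of the failed cache lookup before the DB read.
    return hit_rate * cache_ms + (1 - hit_rate) * (miss_overhead_ms + db_ms)

for hr in (0.9, 0.5, 0.1):
    print(f"hit rate {hr:.0%}: {effective_latency(hr):.1f} ms")
# hit rate 90%: 1.9 ms
# hit rate 50%: 5.5 ms
# hit rate 10%: 9.1 ms  (before network/serialization overhead)
```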
Long-tail data problem: if 90% of shortened URLs are clicked only once, the cache hit rate is ≈10%. Adding Redis then imposes its ~5ms network/serialization overhead on the 90% of requests that miss, for no benefit, and it introduces stale-data risk.
Decision rule: If expected hit rate < ~50%, don't add a cache yet. Start naked, instrument, add cache only when numbers prove it helps.
The brave answer in interviews: "In this traffic pattern, I'd skip the distributed cache and benchmark first."
## Failure-First Thinking
An L7 doesn't ask "How does this work?" They ask: "How does this break?"
Design for each failure mode explicitly:
| Component | Failure | Your answer |
|---|---|---|
| Cache goes down | All traffic hits DB cold | Circuit breaker (sketched below), fallback reads, pre-warm on restart |
| DB primary fails | Writes blocked during failover | Failover time SLA, WAL sync lag, what data is lost |
| Key generator fails | Can't create new short URLs | Fallback: hash-based generation |
| Short code duplicate | Two URLs collide | Check-before-insert, counter suffix |
| Disk fills | Writes rejected | Lazy delete of expired URLs + TTL cleanup job |
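A minimal circuit-breaker sketch for the cache-down row, assuming injected `cache_read`/`db_read` callables; the thresholds and the half-open probe are illustrative defaults, not from the source:

```python
import time

class CircuitBreaker:
    """Fail fast past a dead cache instead of paying a timeout on every read."""

    def __init__(self, max_failures=5, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None            # None = closed (cache in use)

    def _is_open(self):
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            self.opened_at = None        # half-open: let one probe through
            return False
        return True

    def read(self, cache_read, db_read, key):
        if self._is_open():
            return db_read(key)          # breaker open: skip the dead cache
        try:
            value = cache_read(key)
            self.failures = 0            # any success closes the breaker
            return value
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
                self.failures = 0
            return db_read(key)          # fallback read for this request
```

Pre-warming on restart is the separate piece: replay the top-N keys into the cache before putting it back in rotation, or the reopened breaker just sends a cold-cache stampede at the DB.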
State inconsistency during failover: when the primary DB in US-East fails and traffic routes to US-West, replication may be behind, especially if the cross-region link is throttled. Users will see stale or missing data for the length of the lag window. This is unavoidable; the question is how you handle it gracefully (serve stale, queue writes, or surface an error).
Netflix Chaos Monkey principle: In distributed systems, failure isn't a possibility — it's a certainty. Design for it explicitly, not as an afterthought.
## Trade-offs Are the Answer
System design has no right answer. Every choice is a trade-off. State trade-offs explicitly and defend your choice for your constraints:
| Axis | Option A | Option B | How to choose |
|---|---|---|---|
| Consistency vs Availability | Exact correct data | Page loads even if data is 2s stale | Depends on domain — bank account vs social feed |
| Latency vs Cost | Sub-10ms (more infra) | 50ms (cheaper) | Check the SLA, not your ego |
| Complexity vs Maintainability | 50 microservices, theoretically optimal | 3 services that 3 engineers can manage | Default to fewer services until you prove you need more |
| Horizontal vs Vertical | More nodes, coordination overhead | Bigger machine, simpler | Coordination overhead is real — don't add nodes reflexively |
Interview signal: Interviewers promote candidates who say "I chose X over Y because of constraint Z, and I'd revisit this decision if Z changes."
## Prep Methodology (L7 Standard)
- Naked system first — design with 1 server + 1 DB. Then add complexity only when back-of-envelope math proves you've hit a limit.
- Cost every component — for each box you draw: latency cost, dollar cost, operational cost, failure mode.
- Learn the math cold — QPS estimation, storage for 5 years, bandwidth. These are table stakes at senior level (worked as runnable code after this list):
  - Writes/sec = (100M writes/day) / (86,400 sec) ≈ 1,200 RPS
  - Storage = 1B URLs × 500 bytes = 500 GB
- Study real outage post-mortems — AWS, Cloudflare, Slack. Real systems fail in ways textbooks don't cover; post-mortems show what "perfect" designs miss.
- Know when NOT to use a tool — being able to say "we don't need Kafka here yet" or "skip the cache for this traffic pattern" is more impressive than adding every component.
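The two formulas from the math item as runnable arithmetic; the 100M writes/day and 500-byte record size are the note's assumed inputs:

```python
SECONDS_PER_DAY = 86_400

writes_per_day = 100_000_000                   # assumed load: 100M writes/day
print(f"writes/sec ≈ {writes_per_day / SECONDS_PER_DAY:,.0f}")
# writes/sec ≈ 1,157 -> round up to ~1,200 RPS for headroom

total_urls = 1_000_000_000                     # assumed: 1B URLs stored
bytes_per_url = 500                            # short code + long URL + metadata
print(f"storage ≈ {total_urls * bytes_per_url / 1e9:,.0f} GB")
# storage ≈ 500 GB
```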
## Related
- [[System Design/Problem Designs/Design a URL shortener]] — hot key problem + cache hit rate applied
- [[Caching & Redis]] — cache strategies, LRU vs LFU, stampede
- [[Distributed Systems Concepts]] — consistency models, replication lag
- [[synthesis/Interview Prep Hub]] — SD interview formula + problem checklist