
Senior SD Mindset — L7 Lessons

Source: [[raw/The Day a Google L7 Engineer Tore My System Design to Shreds]]

The gap between mid-level and staff-level system design isn't more patterns — it's understanding the physics behind the patterns.


The Pattern Trap

Most engineers study SD by memorizing what big companies did. They can draw CDN + Redis + Kafka from memory and explain Netflix's tech stack.

An L7 interviewer doesn't care if you can copy Netflix. They ask: "Why did Netflix have to make that move?"

Line cook vs Master Chef: A line cook follows the recipe — add salt when it says add salt. A Master Chef understands the chemistry — if tomatoes are more acidic today, they adjust the sugar. System design is the same. Your "ingredients" are your constraints (latency budget, throughput, consistency requirements).

Symptom you're in the Pattern Trap: You add a component (cache, queue, CDN) because you've seen it in other designs — not because your current numbers demand it.

Fix: For every component you add, state: the dollar cost, the latency cost, the maintenance cost, and the failure mode.


Hot Key Problem

Classic example: viral URL in a URL shortener (Super Bowl ad).

What happens:

Viral shortCode → every app server hammers same cache key / same DB row
Adding more app servers → MORE concurrent requests to same single contention point
Result: horizontal scaling makes it worse

Solutions:

| Fix | How | When to use |
| --- | --- | --- |
| Request collapsing | Multiple identical requests wait for one DB read; result shared | High read concurrency on same key |
| Local in-process cache | Each app server caches top-N keys in memory | Predictable hot keys, can tolerate slight staleness |
| Adaptive caching | Detect hot keys dynamically, promote to edge/local tier | Unpredictable viral spikes |
| Read replicas per shard | Route hot-key reads to dedicated replica | When DB is the bottleneck, not cache |
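
A minimal sketch of request collapsing (single-flight), assuming a threaded app server and a hypothetical loader callable that performs the real DB read. Concurrent readers of the same key block on one event while a single "leader" thread does the read and shares the result:

    import threading

    class RequestCollapser:
        """Collapse concurrent reads of the same hot key into one loader call."""

        def __init__(self, loader):
            self.loader = loader            # hypothetical: key -> value, e.g. a DB read
            self.lock = threading.Lock()
            self.in_flight = {}             # key -> (event, result holder)

        def get(self, key):
            with self.lock:
                entry = self.in_flight.get(key)
                if entry is None:
                    # First caller becomes the leader and does the real read.
                    entry = (threading.Event(), {})
                    self.in_flight[key] = entry
                    leader = True
                else:
                    leader = False
            event, holder = entry
            if leader:
                try:
                    holder["value"] = self.loader(key)   # the single DB read
                finally:
                    with self.lock:
                        del self.in_flight[key]
                    event.set()                          # wake all waiters
            else:
                event.wait()                             # share the leader's result
            return holder.get("value")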

Interview signal: Mention the hot key problem proactively when discussing caching for any read-heavy system. Saying "we'll just add more cache nodes" is a junior answer.


Cache Hit Rate Math

The formula:

Effective Latency = (HitRate × CacheLatency) + ((1 - HitRate) × DBLatency)

Why it matters — worked example:

Assume: Cache = 1ms, DB = 10ms

| Hit Rate | Effective Latency | Net benefit after cache overhead? |
| --- | --- | --- |
| 90% | 0.9ms + 1ms = 1.9ms | Yes — massive win |
| 50% | 0.5ms + 5ms = 5.5ms | Marginal |
| 10% | 0.1ms + 9ms = 9.1ms, plus network/serialization overhead | Net negative |
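
The same arithmetic as a quick script, using the assumed 1ms/10ms latencies from the example:

    CACHE_MS, DB_MS = 1.0, 10.0   # assumed latencies from the example above

    def effective_latency(hit_rate):
        # Effective Latency = (HitRate × CacheLatency) + ((1 - HitRate) × DBLatency)
        return hit_rate * CACHE_MS + (1 - hit_rate) * DB_MS

    for hr in (0.9, 0.5, 0.1):
        # Prints 1.9 ms, 5.5 ms, 9.1 ms, matching the table rows.
        print(f"{hr:.0%} hit rate -> {effective_latency(hr):.1f} ms")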

Long-tail data problem: If 90% of shortened URLs are clicked only once, cache hit rate ≈ 10%. Adding Redis then adds ~5ms of lookup overhead (network + serialization) to the ~90% of requests that miss, for no benefit — and introduces stale data risk.

Decision rule: If expected hit rate < ~50%, don't add a cache yet. Start naked, instrument, add cache only when numbers prove it helps.

The brave answer in interviews: "In this traffic pattern, I'd skip the distributed cache and benchmark first."


Failure-First Thinking

L7 doesn't ask "How does this work?" — they ask "How does this break?"

Design for each failure mode explicitly:

| Failure | Impact | Your answer |
| --- | --- | --- |
| Cache goes down | All traffic hits DB cold | Circuit breaker, fallback reads, pre-warm on restart |
| DB primary fails | Writes blocked during failover | Failover time SLA, WAL sync lag, what data is lost |
| Key generator fails | Can't create new short URLs | Fallback: hash-based generation |
| Short code duplicate | Two URLs collide | Check-before-insert, counter suffix |
| Disk fills | Writes rejected | Lazy delete of expired URLs + TTL cleanup job |
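
To make the first row concrete, a minimal circuit-breaker sketch, assuming hypothetical read_cache / read_db callables: after a few consecutive cache failures it stops touching the cache for a cooldown window and serves reads straight from the DB:

    import time

    class CacheCircuitBreaker:
        """Skip a failing cache entirely instead of paying a timeout per request."""

        def __init__(self, read_cache, read_db, threshold=5, cooldown_s=30):
            self.read_cache, self.read_db = read_cache, read_db
            self.threshold, self.cooldown_s = threshold, cooldown_s
            self.failures = 0
            self.open_until = 0.0    # while now < open_until, the breaker is open

        def get(self, key):
            if time.monotonic() < self.open_until:
                return self.read_db(key)          # breaker open: fallback read
            try:
                value = self.read_cache(key)
                self.failures = 0                 # a success closes the breaker
                return value
            except Exception:
                self.failures += 1
                if self.failures >= self.threshold:
                    self.open_until = time.monotonic() + self.cooldown_s
                    self.failures = 0
                return self.read_db(key)          # fallback read on cache error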

State inconsistency during failover: When the primary DB in US-East fails and traffic routes to US-West, replication may be behind because the cross-region network link is throttled. Users will see stale or missing data for the lag window. This is unavoidable — the question is how you handle it gracefully (serve stale, queue writes, surface an error).
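
A sketch of the "serve stale" option, assuming a hypothetical replica client that can report its own replication lag:

    MAX_ACCEPTABLE_LAG_S = 5.0   # assumed product tolerance for staleness

    def read_during_failover(replica, key):
        # get_with_lag is a hypothetical client API: returns (value, lag in seconds).
        value, lag_s = replica.get_with_lag(key)
        if lag_s > MAX_ACCEPTABLE_LAG_S:
            # Degrade gracefully: still serve, but tell the caller the data is stale
            # so the UI can surface it (banner, timestamp, retry hint).
            return value, {"stale": True, "lag_seconds": lag_s}
        return value, {"stale": False}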

Netflix Chaos Monkey principle: In distributed systems, failure isn't a possibility — it's a certainty. Design for it explicitly, not as an afterthought.


Trade-offs Are the Answer

System design has no right answer. Every choice is a trade-off. State trade-offs explicitly and defend your choice for your constraints:

| Axis | Option A | Option B | How to choose |
| --- | --- | --- | --- |
| Consistency vs Availability | Exactly correct data | Page loads even if data is 2s stale | Depends on domain — bank account vs social feed |
| Latency vs Cost | Sub-10ms (more infra) | 50ms (cheaper) | Check the SLA, not your ego |
| Complexity vs Maintainability | 50 microservices, optimal | 3 services, 3 engineers can manage | Default to fewer services until you prove you need more |
| Horizontal vs Vertical | More nodes, coordination overhead | Bigger machine, simpler | Coordination overhead is real — don't add nodes reflexively |

Interview signal: Interviewers promote candidates who say "I chose X over Y because of constraint Z, and I'd revisit this decision if Z changes."


Prep Methodology (L7 Standard)

  1. Naked system first — Design with 1 server + 1 DB. Then add complexity only when back-of-envelope proves you've hit a limit.

  2. Cost every component — For each box you draw: latency cost, dollar cost, operational cost, failure mode.

  3. Learn the math cold — QPS estimation, storage for 5 years, bandwidth. These are table stakes at senior level (a quick helper is sketched after this list).

    Writes/sec = (100M writes/day) / (86,400 sec) ≈ 1,200 RPS
    Storage = 1B URLs × 500 bytes = 500 GB
    
  4. Study real outage post-mortems — AWS, Cloudflare, Slack. Real systems fail in ways textbooks don't cover. Post-mortems show what "perfect" designs miss.

  5. Know when NOT to use a tool — Being able to say "we don't need Kafka here yet" or "skip the cache for this traffic pattern" is more impressive than adding every component.
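
The step-3 estimates as a tiny back-of-envelope helper (a sketch; the inputs are the assumed figures from the example above):

    SECONDS_PER_DAY = 86_400

    def qps(events_per_day):
        return events_per_day / SECONDS_PER_DAY

    def storage_gb(records, bytes_per_record):
        return records * bytes_per_record / 1e9

    # 100M writes/day -> ~1,157 writes/sec (round up to ~1,200 for headroom)
    print(f"{qps(100e6):,.0f} writes/sec")
    # 1B URLs × 500 bytes = 500 GB
    print(f"{storage_gb(1e9, 500):,.0f} GB")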


Related

  • [[System Design/Problem Designs/Design a URL shortener]] — hot key problem + cache hit rate applied
  • [[Caching & Redis]] — cache strategies, LRU vs LFU, stampede
  • [[Distributed Systems Concepts]] — consistency models, replication lag
  • [[synthesis/Interview Prep Hub]] — SD interview formula + problem checklist