Message Queues & Kafka
Message queues decouple producers from consumers, enabling async processing and absorbing traffic spikes. Kafka is the dominant distributed event streaming platform. It shows up in notification-system, news-feed, Uber-style, and most other distributed system designs.
Why Message Queues?
Problems they solve:
- Async decoupling: Producer doesn't wait for consumer to process
- Load leveling: Absorb traffic spikes — producer writes fast, consumer processes at its own pace
- Reliability: Message persisted in queue even if consumer crashes
- Fan-out: One message → multiple independent consumers
- Ordering: Process events in sequence
Without queue: Payment API directly calls email service. Email service down → payment fails. With queue: Payment API → queue → email service. Payment succeeds regardless.
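The payment → queue → email flow above can be sketched in-process with Python's standard library. This is a hypothetical stand-in (a real system would use SQS/Kafka; `handle_payment` and the worker are illustrative names, not a real API):

```python
import queue
import threading

email_queue = queue.Queue()

def handle_payment(order_id):
    # ... charge the card ...
    email_queue.put({"order_id": order_id})  # enqueue and return immediately
    return "payment ok"

def email_worker():
    # Plays the role of the (possibly slow or temporarily down) email service
    while True:
        msg = email_queue.get()
        if msg is None:       # sentinel to stop the worker
            break
        # ... send confirmation email at the worker's own pace ...
        email_queue.task_done()

t = threading.Thread(target=email_worker, daemon=True)
t.start()
print(handle_payment(42))     # succeeds even if the email worker is behind
email_queue.put(None)
```

The key property: `handle_payment` never waits on (or even knows about) the email path, so a slow or crashed consumer cannot fail the payment.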
Queue Types
Point-to-Point (SQS standard)
- One producer, one consumer group
- Message consumed and deleted
- At-least-once delivery (can deliver twice on failure)
Producer → [Queue] → Consumer 1
                   ↘ Consumer 2   (competing consumers — only one gets each message)
Publish-Subscribe (Kafka, SNS)
- One producer, multiple independent consumer groups
- Each consumer group reads all messages independently
- Message retained after consumption (for duration of retention period)
Producer → [Topic] → Consumer Group A (reads all messages)
                   → Consumer Group B (reads all messages)
                   → Consumer Group C (reads all messages)
Kafka Architecture
Producers → Topics → Partitions → Consumer Groups
Core Concepts
| Concept | Description |
|---|---|
| Topic | Named stream of messages (like a table) |
| Partition | Ordered, immutable log within a topic. Unit of parallelism. |
| Offset | Position of a message within a partition. Consumer tracks its own offset. |
| Consumer Group | Set of consumers sharing a topic. Each partition assigned to one consumer. |
| Broker | Server storing partition replicas |
| Replication Factor | N copies of each partition across N brokers |
Partitioning for Scale
Topic: "orders" with 3 partitions
Partition 0: order_id % 3 == 0
Partition 1: order_id % 3 == 1
Partition 2: order_id % 3 == 2
Consumer Group has 3 consumers → each gets 1 partition
Consumer Group has 6 consumers → 3 consumers idle (partitions < consumers = wasted)
Rule: max consumer parallelism (and thus max consumer-side throughput) = number of partitions. To scale consumer throughput beyond that, add partitions.
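Key-to-partition assignment can be sketched as follows. This is a simplified stand-in for Kafka's default partitioner (which actually uses murmur2 hashing); the function name and use of MD5 here are illustrative assumptions:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Hash the key, mod the partition count — deterministic, so the same
    # key always lands on the same partition.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same key → same partition → per-key ordering holds
assert partition_for(b"user-42", 3) == partition_for(b"user-42", 3)
```

Note this is also why adding partitions later is disruptive: changing `num_partitions` changes the key → partition mapping, so ordering for existing keys breaks across the boundary.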
Message Ordering
- Within a partition: Guaranteed order
- Across partitions: No guarantee
- Solution: Key-based partitioning (same key → same partition → ordered)
```python
# Same user_id always goes to the same partition → that user's events are ordered
producer.send("orders", key=str(user_id).encode(), value=order_json)
```
Delivery Guarantees
| Guarantee | How | Risk |
|---|---|---|
| At-most-once | Commit offset before processing | Message loss if consumer crashes |
| At-least-once | Process then commit offset | Duplicate messages possible |
| Exactly-once | Idempotent producer + transactional consumer | Complex, lower throughput |
In practice: at-least-once + idempotent consumer (make processing idempotent — duplicate = no effect).
```python
# Idempotent consumer: track processed message IDs in a Redis set
def process_order(order_id, data):
    if redis.sismember("processed_orders", order_id):
        return  # already processed, skip duplicate
    # ... process ...
    redis.sadd("processed_orders", order_id)  # mark done only after processing
    # Note: check-then-mark leaves a small race window if duplicates arrive
    # concurrently; SADD's return value can make check-and-mark atomic instead.
```
Kafka vs SQS vs RabbitMQ
| | Kafka | SQS | RabbitMQ |
|---|---|---|---|
| Model | Log (retained) | Queue (deleted after consume) | Queue (deleted after ack) |
| Replay | Yes — rewind offset | No | No |
| Fan-out | Yes — consumer groups | No (need SNS for fan-out) | Yes (exchanges) |
| Throughput | Very high (millions/sec) | High | Medium |
| Ordering | Per partition | FIFO queue only | Per queue |
| Retention | Configurable (days/weeks) | Short (14 days max) | Until consumed |
| Use case | Event streaming, audit log, replay | Simple async decoupling, AWS native | Complex routing, low-latency |
Common Patterns
Fan-out (Notification System)
User posts → "events" topic →
[email-service consumer group] → sends email
[push-service consumer group] → sends push notification
[feed-service consumer group] → updates follower feeds
[analytics consumer group] → logs event
One event touches 4 services — all decoupled. See: [[System Design/Problem Designs/Notification System]]
Dead Letter Queue (DLQ)
When a message fails to process N times, move to DLQ:
- Prevents poison pill (one bad message blocking entire queue)
- Ops can inspect/replay DLQ messages
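The retry-then-DLQ flow can be sketched with in-memory queues (a minimal illustration under assumed names; real brokers track delivery counts for you, e.g. SQS's `maxReceiveCount`):

```python
import queue

main_q = queue.Queue()
dlq = queue.Queue()
MAX_ATTEMPTS = 3  # assumed retry budget

def consume_one(handler):
    """Pop one message; on failure re-enqueue, after N failures move to DLQ."""
    msg = main_q.get_nowait()
    try:
        handler(msg["body"])
    except Exception:
        msg["attempts"] = msg.get("attempts", 0) + 1
        if msg["attempts"] >= MAX_ATTEMPTS:
            dlq.put(msg)        # poison pill parked for inspection/replay
        else:
            main_q.put(msg)     # retry later

def handler(body):
    raise ValueError("cannot parse")  # always fails → ends up in DLQ

main_q.put({"body": "bad-payload"})
for _ in range(MAX_ATTEMPTS):
    consume_one(handler)
print(dlq.qsize())  # → 1: the poison message landed in the DLQ
```

Without the DLQ branch, the bad message would be re-enqueued forever and block (or starve) healthy messages behind it.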
Event Sourcing
Store every state change as an event. Current state = replay all events.
- Full audit trail
- Rebuild state at any point in time
- Kafka as the event log (immutable, replayable)
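"Current state = replay all events" can be shown with a toy account balance (a sketch; the event schema here is invented for illustration, with the event list standing in for a Kafka topic):

```python
# State is never stored directly — it is derived by replaying the immutable log
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]

def replay(log):
    balance = 0
    for e in log:
        if e["type"] == "deposited":
            balance += e["amount"]
        elif e["type"] == "withdrawn":
            balance -= e["amount"]
    return balance

assert replay(events) == 75        # current state
assert replay(events[:2]) == 70    # state at any earlier point = replay a prefix
```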
CQRS (Command Query Responsibility Segregation)
Write path → produces events → Kafka → read models updated asynchronously. Query hits read model (optimized for reads), not the write DB.
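A minimal CQRS sketch, assuming a list as the event log and a dict as the read model (the function names are illustrative, not a framework API):

```python
event_log = []   # stands in for the Kafka topic
read_model = {}  # denormalized view, e.g. order_id → current status

def handle_command(order_id, status):
    # Write path: append an event, nothing else
    event_log.append({"order_id": order_id, "status": status})

def project(offset):
    # Read-model updater: consumes events from `offset`, like a consumer group
    for e in event_log[offset:]:
        read_model[e["order_id"]] = e["status"]
    return len(event_log)  # new offset to resume from

handle_command(1, "created")
handle_command(1, "shipped")
offset = project(0)
assert read_model[1] == "shipped"  # queries hit the projection, not the write DB
```

Because the projector runs asynchronously, reads are eventually consistent with writes — the usual CQRS trade-off.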
Back-Pressure
When consumers are slower than producers, queue fills up. Options:
- Scale consumers horizontally
- Rate-limit producers
- Add more partitions (increase consumer parallelism)
- Drop messages (acceptable for metrics/analytics, not for orders)
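The producer-side options above can be illustrated with a bounded queue — when it fills, the producer is forced to block, drop, or slow down rather than let the backlog grow unbounded (an in-process sketch; real systems get this from broker quotas or bounded buffers):

```python
import queue

q = queue.Queue(maxsize=2)  # bounded buffer

q.put("a")
q.put("b")
try:
    q.put_nowait("c")       # queue full → producer feels back-pressure
except queue.Full:
    dropped = "c"           # here we drop; alternatively block or rate-limit

q.get()                     # consumer drains one slot...
q.put("c")                  # ...and the producer can proceed again
```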
Interview Questions
"Why Kafka over a simple DB queue?" Kafka persists messages in an ordered, replayable log. A DB-backed queue requires polling, suffers lock contention under load, and makes replay awkward (rows are deleted or flagged once processed). Kafka separates storage from compute — consumers can rewind to any offset.
"How do you ensure message order?" Partition by the entity key (user_id, order_id). All events for the same entity land in the same partition → same consumer → ordered.
"How would you handle a slow consumer?" Increase consumer group size up to partition count. If consumers = partitions and still slow, add partitions (requires re-keying). Also consider: async processing within consumer, batching.
"What's the difference between a queue and a topic?" Queue: message consumed and removed, one consumer group. Topic: message retained, multiple consumer groups each read independently.
Redis Pub/Sub vs Kafka
| | Redis Pub/Sub | Kafka |
|---|---|---|
| Persistence | No — fire and forget | Yes — log retained |
| Replay | No | Yes |
| At-least-once | No | Yes |
| Throughput | Low-medium | Very high |
| Use case | Real-time notifications, live chat | Durable event streaming |
Use Redis Pub/Sub for ephemeral notifications (typing indicators). Use Kafka for anything that needs reliability or replay.
Related
- [[System Design/Problem Designs/Notification System]] — Kafka fan-out
- [[System Design/Problem Designs/Rate Limiter]] — SQS for async decoupling
- [[Distributed Systems Concepts]] — consistency, ordering
- [[Caching & Redis]] — Redis Pub/Sub comparison
- [[AWS/SQS & SNS]] — AWS managed messaging