Message Queues & Kafka
Message queues decouple producers from consumers, enabling async processing and absorbing traffic spikes. Kafka is the dominant distributed event streaming platform. It shows up in notification-system, news-feed, Uber-style, and most other distributed system designs.
Why Message Queues?
Problems they solve:
- Async decoupling: Producer doesn't wait for consumer to process
- Load leveling: Absorb traffic spikes — producer writes fast, consumer processes at its own pace
- Reliability: Message persisted in queue even if consumer crashes
- Fan-out: One message → multiple independent consumers
- Ordering: Process events in sequence
Without queue: Payment API directly calls email service. Email service down → payment fails. With queue: Payment API → queue → email service. Payment succeeds regardless.
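The payment → queue → email flow above can be sketched in-process with Python's standard library. This is a hypothetical stand-in (a real system would use SQS/Kafka; `handle_payment` and the worker are illustrative names, not a real API):

```python
import queue
import threading

email_queue = queue.Queue()

def handle_payment(order_id):
    # ... charge the card ...
    email_queue.put({"order_id": order_id})  # enqueue and return immediately
    return "payment ok"

def email_worker():
    # Plays the role of the (possibly slow or temporarily down) email service
    while True:
        msg = email_queue.get()
        if msg is None:       # sentinel to stop the worker
            break
        # ... send confirmation email at the worker's own pace ...
        email_queue.task_done()

t = threading.Thread(target=email_worker, daemon=True)
t.start()
print(handle_payment(42))     # succeeds even if the email worker is behind
email_queue.put(None)
```

The key property: `handle_payment` never waits on (or even knows about) the email path, so a slow or crashed consumer cannot fail the payment.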
Queue Types
Point-to-Point (SQS standard)
- One producer, one consumer group
- Message consumed and deleted
- At-least-once delivery (can deliver twice on failure)
Producer → [Queue] → Consumer 1
                   ↘ Consumer 2   (competing consumers — only one gets each message)
Publish-Subscribe (Kafka, SNS)
- One producer, multiple independent consumer groups
- Each consumer group reads all messages independently
- Message retained after consumption (for duration of retention period)
Producer → [Topic] → Consumer Group A (reads all messages)
                   → Consumer Group B (reads all messages)
                   → Consumer Group C (reads all messages)
Kafka Architecture
Producers → Topics → Partitions → Consumer Groups
Core Concepts
| Concept | Description |
|---|---|
| Topic | Named stream of messages (like a table) |
| Partition | Ordered, immutable log within a topic. Unit of parallelism. |
| Offset | Position of a message within a partition. Consumer tracks its own offset. |
| Consumer Group | Set of consumers sharing a topic. Each partition assigned to one consumer. |
| Broker | Server storing partition replicas |
| Replication Factor | N copies of each partition across N brokers |
Partitioning for Scale
Topic: "orders" with 3 partitions
Partition 0: order_id % 3 == 0
Partition 1: order_id % 3 == 1
Partition 2: order_id % 3 == 2
Consumer Group has 3 consumers → each gets 1 partition
Consumer Group has 6 consumers → 3 consumers idle (partitions < consumers = wasted)
Rule: max consumer parallelism (and thus max consumer-side throughput) = number of partitions. To scale consumer throughput beyond that, add partitions.
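Key-to-partition assignment can be sketched as follows. This is a simplified stand-in for Kafka's default partitioner (which actually uses murmur2 hashing); the function name and use of MD5 here are illustrative assumptions:

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Hash the key, mod the partition count — deterministic, so the same
    # key always lands on the same partition.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same key → same partition → per-key ordering holds
assert partition_for(b"user-42", 3) == partition_for(b"user-42", 3)
```

Note this is also why adding partitions later is disruptive: changing `num_partitions` changes the key → partition mapping, so ordering for existing keys breaks across the boundary.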
Message Ordering
- Within a partition: Guaranteed order
- Across partitions: No guarantee
- Solution: Key-based partitioning (same key → same partition → ordered)
```python
# Same user_id always goes to the same partition → that user's events are ordered
producer.send("orders", key=str(user_id).encode(), value=order_json)
```
Delivery Guarantees
| Guarantee | How | Risk |
|---|---|---|
| At-most-once | Commit offset before processing | Message loss if consumer crashes |
| At-least-once | Process then commit offset | Duplicate messages possible |
| Exactly-once | Idempotent producer + transactional consumer | Complex, lower throughput |
In practice: at-least-once + idempotent consumer (make processing idempotent — duplicate = no effect).
```python
# Idempotent consumer: track processed message IDs in a Redis set
def process_order(order_id, data):
    if redis.sismember("processed_orders", order_id):
        return  # already processed, skip duplicate
    # ... process ...
    redis.sadd("processed_orders", order_id)  # mark done only after processing
    # Note: check-then-mark leaves a small race window if duplicates arrive
    # concurrently; SADD's return value can make check-and-mark atomic instead.
```
Kafka vs SQS vs RabbitMQ
| | Kafka | SQS | RabbitMQ |
|---|---|---|---|
| Model | Log (retained) | Queue (deleted after consume) | Queue (deleted after ack) |
| Replay | Yes — rewind offset | No | No |
| Fan-out | Yes — consumer groups | No (need SNS for fan-out) | Yes (exchanges) |
| Throughput | Very high (millions/sec) | High | Medium |
| Ordering | Per partition | FIFO queue only | Per queue |
| Retention | Configurable (days/weeks) | Short (14 days max) | Until consumed |
| Use case | Event streaming, audit log, replay | Simple async decoupling, AWS native | Complex routing, low-latency |
Common Patterns
Fan-out (Notification System)
User posts → "events" topic →
[email-service consumer group] → sends email
[push-service consumer group] → sends push notification
[feed-service consumer group] → updates follower feeds
[analytics consumer group] → logs event
One event touches 4 services — all decoupled. See: [[System Design/Problem Designs/Notification System]]
Dead Letter Queue (DLQ)
When a message fails to process N times, move to DLQ:
- Prevents poison pill (one bad message blocking entire queue)
- Ops can inspect/replay DLQ messages
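The retry-then-DLQ flow can be sketched with in-memory queues (a minimal illustration under assumed names; real brokers track delivery counts for you, e.g. SQS's `maxReceiveCount`):

```python
import queue

main_q = queue.Queue()
dlq = queue.Queue()
MAX_ATTEMPTS = 3  # assumed retry budget

def consume_one(handler):
    """Pop one message; on failure re-enqueue, after N failures move to DLQ."""
    msg = main_q.get_nowait()
    try:
        handler(msg["body"])
    except Exception:
        msg["attempts"] = msg.get("attempts", 0) + 1
        if msg["attempts"] >= MAX_ATTEMPTS:
            dlq.put(msg)        # poison pill parked for inspection/replay
        else:
            main_q.put(msg)     # retry later

def handler(body):
    raise ValueError("cannot parse")  # always fails → ends up in DLQ

main_q.put({"body": "bad-payload"})
for _ in range(MAX_ATTEMPTS):
    consume_one(handler)
print(dlq.qsize())  # → 1: the poison message landed in the DLQ
```

Without the DLQ branch, the bad message would be re-enqueued forever and block (or starve) healthy messages behind it.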
Event Sourcing
Store every state change as an event. Current state = replay all events.
- Full audit trail
- Rebuild state at any point in time
- Kafka as the event log (immutable, replayable)
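"Current state = replay all events" can be shown with a toy account balance (a sketch; the event schema here is invented for illustration, with the event list standing in for a Kafka topic):

```python
# State is never stored directly — it is derived by replaying the immutable log
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]

def replay(log):
    balance = 0
    for e in log:
        if e["type"] == "deposited":
            balance += e["amount"]
        elif e["type"] == "withdrawn":
            balance -= e["amount"]
    return balance

assert replay(events) == 75        # current state
assert replay(events[:2]) == 70    # state at any earlier point = replay a prefix
```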
CQRS (Command Query Responsibility Segregation)
Write path → produces events → Kafka → read models updated asynchronously. Query hits read model (optimized for reads), not the write DB.
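A minimal CQRS sketch, assuming a list as the event log and a dict as the read model (the function names are illustrative, not a framework API):

```python
event_log = []   # stands in for the Kafka topic
read_model = {}  # denormalized view, e.g. order_id → current status

def handle_command(order_id, status):
    # Write path: append an event, nothing else
    event_log.append({"order_id": order_id, "status": status})

def project(offset):
    # Read-model updater: consumes events from `offset`, like a consumer group
    for e in event_log[offset:]:
        read_model[e["order_id"]] = e["status"]
    return len(event_log)  # new offset to resume from

handle_command(1, "created")
handle_command(1, "shipped")
offset = project(0)
assert read_model[1] == "shipped"  # queries hit the projection, not the write DB
```

Because the projector runs asynchronously, reads are eventually consistent with writes — the usual CQRS trade-off.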
Back-Pressure
When consumers are slower than producers, queue fills up. Options:
- Scale consumers horizontally
- Rate-limit producers
- Add more partitions (increase consumer parallelism)
- Drop messages (acceptable for metrics/analytics, not for orders)
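The producer-side options above can be illustrated with a bounded queue — when it fills, the producer is forced to block, drop, or slow down rather than let the backlog grow unbounded (an in-process sketch; real systems get this from broker quotas or bounded buffers):

```python
import queue

q = queue.Queue(maxsize=2)  # bounded buffer

q.put("a")
q.put("b")
try:
    q.put_nowait("c")       # queue full → producer feels back-pressure
except queue.Full:
    dropped = "c"           # here we drop; alternatively block or rate-limit

q.get()                     # consumer drains one slot...
q.put("c")                  # ...and the producer can proceed again
```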
Interview Questions
"Why Kafka over a simple DB queue?" Kafka persists messages in an ordered, replayable log. A DB-backed queue requires polling, suffers lock contention under load, and makes replay awkward (rows are deleted or flagged once processed). Kafka separates storage from compute — consumers can rewind to any offset.
"How do you ensure message order?" Partition by the entity key (user_id, order_id). All events for the same entity land in the same partition → same consumer → ordered.
"How would you handle a slow consumer?" Increase consumer group size up to partition count. If consumers = partitions and still slow, add partitions (requires re-keying). Also consider: async processing within consumer, batching.
"What's the difference between a queue and a topic?" Queue: message consumed and removed, one consumer group. Topic: message retained, multiple consumer groups each read independently.
Redis Pub/Sub vs Kafka
| | Redis Pub/Sub | Kafka |
|---|---|---|
| Persistence | No — fire and forget | Yes — log retained |
| Replay | No | Yes |
| At-least-once | No | Yes |
| Throughput | Low-medium | Very high |
| Use case | Real-time notifications, live chat | Durable event streaming |
Use Redis Pub/Sub for ephemeral notifications (typing indicators). Use Kafka for anything that needs reliability or replay.
Related
- [[System Design/Problem Designs/Notification System]] — Kafka fan-out
- [[System Design/Problem Designs/Rate Limiter]] — SQS for async decoupling
- [[Distributed Systems Concepts]] — consistency, ordering
- [[Caching & Redis]] — Redis Pub/Sub comparison
- [[AWS/SQS & SNS]] — AWS managed messaging