Message Queues & Kafka

Message queues decouple producers from consumers, enabling async processing and absorbing traffic spikes. Kafka is the dominant distributed event streaming platform. Appears in notification, feed, Uber, and most distributed system designs.


Why Message Queues?

Problems they solve:

  • Async decoupling: Producer doesn't wait for consumer to process
  • Load leveling: Absorb traffic spikes — producer writes fast, consumer processes at its own pace
  • Reliability: Message persisted in queue even if consumer crashes
  • Fan-out: One message → multiple independent consumers
  • Ordering: Process events in sequence

Without queue: Payment API directly calls email service. Email service down → payment fails. With queue: Payment API → queue → email service. Payment succeeds regardless.
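The decoupling above can be sketched with Python's in-process `queue.Queue` standing in for a real broker (all names here are illustrative, not a real payment API):

```python
import queue
import threading

email_queue = queue.Queue()  # stand-in for a real broker (SQS/Kafka)
sent = []                    # records what the consumer has processed

def email_worker():
    """Consumer: drains the queue at its own pace, independent of the producer."""
    while True:
        msg = email_queue.get()
        if msg is None:  # shutdown sentinel
            break
        sent.append(f"email to {msg['user']}")
        email_queue.task_done()

threading.Thread(target=email_worker, daemon=True).start()

def handle_payment(user):
    """Producer: enqueue and return immediately. The payment succeeds
    even if the email worker is slow or temporarily down."""
    email_queue.put({"user": user})
    return "payment ok"

print(handle_payment("alice"))  # returns without waiting for the email
email_queue.join()              # (demo only) wait for the consumer to drain
print(sent)
```

The producer's latency no longer depends on the consumer; the queue absorbs the gap.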


Queue Types

Point-to-Point (SQS standard)

  • One producer, one consumer group
  • Message consumed and deleted
  • At-least-once delivery (can deliver twice on failure)
Producer → [Queue] → Consumer A
                   ↘ Consumer B
(competing consumers — only one gets each message)

Publish-Subscribe (Kafka, SNS)

  • One producer, multiple independent consumer groups
  • Each consumer group reads all messages independently
  • Message retained after consumption (for duration of retention period)
Producer → [Topic] → Consumer Group A (reads all messages)
                   → Consumer Group B (reads all messages)
                   → Consumer Group C (reads all messages)
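The difference from a point-to-point queue can be sketched as a retained log with per-group offsets (a simplification: real Kafka tracks offsets per partition, not per topic):

```python
log = ["msg-0", "msg-1", "msg-2"]        # retained log: nothing deleted on consume
offsets = {"group-a": 0, "group-b": 0}   # each consumer group tracks its own position

def poll(group):
    """Return the next unread message for this group and advance its offset."""
    pos = offsets[group]
    if pos >= len(log):
        return None
    offsets[group] = pos + 1
    return log[pos]

# Both groups read the full stream independently:
print([poll("group-a") for _ in range(3)])  # ['msg-0', 'msg-1', 'msg-2']
print(poll("group-b"))                      # 'msg-0': group B has its own offset
```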

Kafka Architecture

Producers → Topics → Partitions → Consumer Groups

Core Concepts

| Concept | Description |
|---|---|
| Topic | Named stream of messages (like a table) |
| Partition | Ordered, immutable log within a topic. Unit of parallelism. |
| Offset | Position of a message within a partition. Consumer tracks its own offset. |
| Consumer Group | Set of consumers sharing a topic. Each partition assigned to one consumer. |
| Broker | Server storing partition replicas |
| Replication Factor | N copies of each partition across N brokers |

Partitioning for Scale

Topic: "orders" with 3 partitions
Partition 0: order_id % 3 == 0
Partition 1: order_id % 3 == 1
Partition 2: order_id % 3 == 2

Consumer Group has 3 consumers → each gets 1 partition
Consumer Group has 6 consumers → 3 consumers idle (partitions < consumers = wasted)

Rule: max consumer parallelism within a group = number of partitions. To scale consumer throughput, add partitions (then add consumers to match).

Message Ordering

  • Within a partition: Guaranteed order
  • Across partitions: No guarantee
  • Solution: Key-based partitioning (same key → same partition → ordered)
# Same user_id always goes to same partition → user events are ordered
producer.send("orders", key=str(user_id).encode(), value=order_json)
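Under the hood the partitioner is essentially a hash of the key modulo the partition count. Kafka's default partitioner uses murmur2; the sketch below uses `crc32` purely to show the idea (Python's built-in `hash` is salted per process for strings, so it would not be stable):

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes) -> int:
    """Same key always hashes to the same partition, so all events for
    one entity land on one partition and stay ordered.
    (Illustrative: Kafka's default partitioner uses murmur2, not crc32.)"""
    return zlib.crc32(key) % NUM_PARTITIONS

# Deterministic: user-42's events always map to the same partition
assert partition_for(b"user-42") == partition_for(b"user-42")
print(partition_for(b"user-42"))
```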

Delivery Guarantees

| Guarantee | How | Risk |
|---|---|---|
| At-most-once | Commit offset before processing | Message loss if consumer crashes |
| At-least-once | Process, then commit offset | Duplicate messages possible |
| Exactly-once | Idempotent producer + transactional consumer | Complex, lower throughput |

In practice: at-least-once + idempotent consumer (make processing idempotent — duplicate = no effect).

# Idempotent consumer: track processed message IDs in a Redis set
def process_order(order_id, data):
    # SISMEMBER-then-SADD has a race window between concurrent workers;
    # acceptable in Kafka because each partition is consumed by only
    # one consumer in the group
    if redis.sismember("processed_orders", order_id):
        return  # duplicate delivery, skip

    # process the order...
    redis.sadd("processed_orders", order_id)

Kafka vs SQS vs RabbitMQ

| | Kafka | SQS | RabbitMQ |
|---|---|---|---|
| Model | Log (retained) | Queue (deleted after consume) | Queue (deleted after ack) |
| Replay | Yes — rewind offset | No | No |
| Fan-out | Yes — consumer groups | No (need SNS for fan-out) | Yes (exchanges) |
| Throughput | Very high (millions/sec) | High | Medium |
| Ordering | Per partition | FIFO queue only | Per queue |
| Retention | Configurable (days/weeks) | Short (14 days max) | Until consumed |
| Use case | Event streaming, audit log, replay | Simple async decoupling, AWS native | Complex routing, low-latency |

Common Patterns

Fan-out (Notification System)

User posts → "events" topic →
    [email-service consumer group]  → sends email
    [push-service consumer group]   → sends push notification
    [feed-service consumer group]   → updates follower feeds
    [analytics consumer group]      → logs event

One event touches 4 services — all decoupled. See: [[System Design/Problem Designs/Notification System]]

Dead Letter Queue (DLQ)

When a message fails to process N times, move to DLQ:

  • Prevents poison pill (one bad message blocking entire queue)
  • Ops can inspect/replay DLQ messages
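A minimal retry-then-DLQ sketch (in production the retry count usually travels in message headers and the DLQ is a real topic/queue; `MAX_ATTEMPTS` and the handler are illustrative):

```python
MAX_ATTEMPTS = 3
dead_letter_queue = []  # stand-in for a real DLQ topic

def consume(message, handler):
    """Try the handler up to MAX_ATTEMPTS times; on repeated failure,
    park the message in the DLQ instead of blocking the stream."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return handler(message)
        except Exception:
            continue  # transient failure: retry
    dead_letter_queue.append(message)  # poison pill: ops can inspect/replay later

def always_fails(msg):
    raise ValueError("poison pill")

consume({"id": 1}, always_fails)
print(dead_letter_queue)  # [{'id': 1}], and the queue keeps moving
```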

Event Sourcing

Store every state change as an event. Current state = replay all events.

  • Full audit trail
  • Rebuild state at any point in time
  • Kafka as the event log (immutable, replayable)
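The "current state = replay all events" idea is just a fold over the log; a toy account balance makes it concrete (event shapes are illustrative):

```python
# Event log: every state change is appended, never mutated in place
events = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit", "amount": 5},
]

def replay(events):
    """Current state is a fold over the full event history."""
    balance = 0
    for e in events:
        if e["type"] == "deposit":
            balance += e["amount"]
        elif e["type"] == "withdraw":
            balance -= e["amount"]
    return balance

print(replay(events))      # 75: current state
print(replay(events[:1]))  # 100: state as of any earlier point in time
```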

CQRS (Command Query Responsibility Segregation)

Write path → produces events → Kafka → read models updated asynchronously. Query hits read model (optimized for reads), not the write DB.
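The shape of CQRS in miniature, with a list standing in for the Kafka topic and a dict for the read model (all names are illustrative):

```python
event_log = []   # write path appends events (a Kafka topic in a real system)
read_model = {}  # denormalized view, optimized for queries

def place_order(order_id, user):
    """Command: validate and append an event. No read-model work here."""
    event_log.append({"type": "order_placed", "order_id": order_id, "user": user})

def project(event):
    """Projector: consumes events and updates the read model asynchronously."""
    if event["type"] == "order_placed":
        read_model.setdefault(event["user"], []).append(event["order_id"])

place_order("o-1", "alice")
for e in event_log:  # in production this loop is a Kafka consumer
    project(e)

print(read_model["alice"])  # ['o-1']: queries hit this, not the write path
```

The read model lags the write path slightly (eventual consistency), which is the price of the split.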


Back-Pressure

When consumers are slower than producers, queue fills up. Options:

  1. Scale consumers horizontally
  2. Rate-limit producers
  3. Add more partitions (increase consumer parallelism)
  4. Drop messages (acceptable for metrics/analytics, not for orders)
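Option 2 is what a bounded buffer gives you for free: when the queue is full, the producer blocks or fails fast instead of growing memory without bound. A sketch with Python's `queue.Queue`:

```python
import queue

# Bounded queue: a full queue pushes back-pressure onto the producer
buf = queue.Queue(maxsize=2)

buf.put("a")
buf.put("b")
try:
    buf.put("c", block=False)  # queue full: producer is rate-limited
except queue.Full:
    print("back-pressure: producer must slow down or drop")

buf.get()     # consumer catches up...
buf.put("c")  # ...and the producer proceeds
```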

Interview Questions

"Why Kafka over a simple DB queue?" Kafka persists messages in an ordered, replayable log. A DB-backed queue requires polling, suffers lock contention on the queue table at scale, and makes replay hard. Kafka separates storage from compute — consumers can rewind to any offset.

"How do you ensure message order?" Partition by the entity key (user_id, order_id). All events for the same entity land in the same partition → same consumer → ordered.

"How would you handle a slow consumer?" Increase consumer group size up to partition count. If consumers = partitions and still slow, add partitions (requires re-keying). Also consider: async processing within consumer, batching.

"What's the difference between a queue and a topic?" Queue: message consumed and removed, one consumer group. Topic: message retained, multiple consumer groups each read independently.


Redis Pub/Sub vs Kafka

| | Redis Pub/Sub | Kafka |
|---|---|---|
| Persistence | No — fire and forget | Yes — log retained |
| Replay | No | Yes |
| At-least-once | No | Yes |
| Throughput | Low-medium | Very high |
| Use case | Real-time notifications, live chat | Durable event streaming |

Use Redis Pub/Sub for ephemeral notifications (typing indicators). Use Kafka for anything that needs reliability or replay.


Related

  • [[System Design/Problem Designs/Notification System]] — Kafka fan-out
  • [[System Design/Problem Designs/Rate Limiter]] — SQS for async decoupling
  • [[Distributed Systems Concepts]] — consistency, ordering
  • [[Caching & Redis]] — Redis Pub/Sub comparison
  • [[AWS/SQS & SNS]] — AWS managed messaging