Design Google Docs

The hardest SD problem. Core challenge: multiple users editing same document simultaneously without conflicts.

Requirements Clarification

Functional:

Multiple users edit same document in real-time
All users see each other's changes (near real-time)
Cursor positions visible
Persistent storage (no data loss)
Version history

Non-functional:

Conflict resolution — concurrent edits must merge correctly
Low latency (< 100ms local feel)
Eventual consistency across all clients

The Core Problem: Concurrent Edits

Doc: "Hello"
User A (pos 5): inserts " World" → "Hello World"
User B (pos 5): inserts " Earth" → "Hello Earth"

Both sent to server simultaneously.
What should final doc be?

Two algorithms to know: OT and CRDT.

Operational Transformation (OT) — Google Docs' actual approach

Transform operations against each other to account for concurrent changes.

Operations:
  Insert(pos, char)
  Delete(pos)

Transformation function T(op1, op2) → op1'
  "Transform op1 as if op2 had already happened"

Example:
  Doc: "ab"
  User A: Insert(1, 'x') → "axb"
  User B: Insert(1, 'y') → "ayb" (sent concurrently)

  Server receives A first:
    Apply A: "axb"
    Transform B against A: Insert(1, 'y') → Insert(2, 'y') (pos shifted by A's insert)
    Apply B: "axyb" ← consistent result

Why OT is hard: Transformation function must handle all operation pairs correctly. Multi-client OT (Jupiter algorithm) requires central server to serialize operations.

CRDT (Conflict-free Replicated Data Types) — modern alternative

Assign unique, globally ordered IDs to every character.
Position = relationship to neighbours (not integer index).

Example (RGA - Replicated Growable Array):
  Each char: (value, unique_id, left_id)
  "a" = (a, id1, null)
  "b" = (b, id2, id1)   ← 'b' is to the right of 'a'

Concurrent inserts between same two chars:
  Both get right parent = same id.
  Tie-break by site ID (alphabetical or timestamp).
  → Deterministic, no server needed to transform.

Pros: no central server needed, peer-to-peer possible
Cons: tombstones (deleted chars still stored), higher memory

Architecture

[Client Browser]
  - Local copy of document (in-memory)
  - Apply own edits immediately (optimistic)
  - Queue ops for server
  │
  │ WebSocket (persistent)
  ▼
[Document Server]
  - Receives ops from all clients
  - Applies OT transformation (for Google's approach)
  - Broadcasts transformed op to all other clients
  - Persists to storage
  │
  ├── Op log (Kafka) — ordered, replayable
  ├── Document store (Colossus / S3) — snapshots
  └── Active session store (Redis) — who's editing, cursors

Detailed Flow

1. User A types 'x' at position 5
   → Immediately applied to local doc (no lag)
   → Op queued: Insert(5, 'x', client_version=42)

2. Client sends op to server via WebSocket

3. Server transforms op against any ops received since client_version=42
   → Applies to server doc
   → Broadcasts transformed op to all other clients

4. Other clients receive op
   → Transform against their own pending (unsent) ops
   → Apply to local doc
   → Render update

Cursor Presence

Separate from document edits (lightweight).
Each client broadcasts: { user_id, cursor_pos, color }
Server relays to all other clients.
Transform cursor positions same as insert/delete ops.
Ephemeral — stored in Redis only (not in doc history).

Persistence + Version History

Snapshot strategy:
  - Apply ops to in-memory doc
  - Every N ops (or every minute): save snapshot to S3
  - Op log stored in Kafka/Cassandra (append-only)

Version history:
  - Snapshots = named versions
  - Replay ops from snapshot to reach any point in time

Recovery:
  Server restarts → load latest snapshot → replay ops from op log

Scaling

Single document → single server node (simplest, correct OT)
  Why: OT requires total ordering of ops → easier with one coordinator

Multiple documents → consistent hash docs to servers
  Each server handles N documents

If server dies:
  Re-assign document to another server
  Load from latest snapshot + replay op log

Trade-offs

Decision	OT (Google Docs)	CRDT (Figma, others)
Conflict resolution	Central server transforms	Distributed, no coordination
Implementation	Complex transform functions	Complex data structure
Memory	Efficient	Tombstones accumulate
Offline support	Harder	Natural fit
Latency	Server round-trip to transform	Local-first

Google chose OT because they built it in 2006. Modern systems prefer CRDTs.

What to Say in Interview

"The core challenge is concurrent edit conflict resolution. I'd use OT with a central server that serializes operations and transforms them before broadcasting. Each client applies changes optimistically locally, then reconciles when the server's transformed op arrives. For storage, ops are appended to an immutable log (Kafka), with periodic snapshots to S3 for fast recovery. Cursors are ephemeral — tracked in Redis and broadcast separately from document ops."

[[Message Queues & Kafka]] — op log
[[Caching & Redis]] — cursor presence, active sessions
[[System Design/Problem Designs/WhatsApp]] — real-time delivery (WebSocket pattern)

Design Google Docs (Collaborative Editor)