Design Google Docs (Collaborative Editor)
One of the hardest system design problems. Core challenge: multiple users editing the same document simultaneously without conflicts.
Requirements Clarification
Functional:
- Multiple users edit same document in real-time
- All users see each other's changes (near real-time)
- Cursor positions visible
- Persistent storage (no data loss)
- Version history
Non-functional:
- Conflict resolution — concurrent edits must merge correctly
- Low latency (< 100ms local feel)
- Eventual consistency across all clients
The Core Problem: Concurrent Edits
Doc: "Hello"
User A (pos 5): inserts " World" → "Hello World"
User B (pos 5): inserts " Earth" → "Hello Earth"
Both sent to server simultaneously.
What should final doc be?
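A minimal sketch of why this is a real problem: if each replica just applies remote ops at their original positions with no transformation, replicas diverge (function and variable names here are illustrative).

```python
# Naive approach: apply every op at its original position, no transformation.
def insert(doc: str, pos: int, text: str) -> str:
    return doc[:pos] + text + doc[pos:]

doc = "Hello"
op_a = (5, " World")  # User A's insert
op_b = (5, " Earth")  # User B's concurrent insert

# Replica A applies its own op first, then B's:
replica_a = insert(insert(doc, *op_a), *op_b)
# Replica B applies its own op first, then A's:
replica_b = insert(insert(doc, *op_b), *op_a)

print(replica_a)  # "Hello Earth World"
print(replica_b)  # "Hello World Earth"
# The replicas disagree — this divergence is what OT and CRDTs solve.
```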
Two algorithms to know: OT and CRDT.
Operational Transformation (OT) — Google Docs' actual approach
Transform operations against each other to account for concurrent changes.
Operations:
Insert(pos, char)
Delete(pos)
Transformation function T(op1, op2) → op1'
"Transform op1 as if op2 had already happened"
Example:
Doc: "ab"
User A: Insert(1, 'x') → "axb"
User B: Insert(1, 'y') → "ayb" (sent concurrently)
Server receives A first:
Apply A: "axb"
Transform B against A: Insert(1, 'y') → Insert(2, 'y') (pos shifted by A's insert)
Apply B: "axyb" ← consistent result
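The transform step above can be sketched as a small function for the insert/insert case (the dataclass and site-ID tie-break are assumptions for illustration; a full OT system needs transforms for every operation pair):

```python
from dataclasses import dataclass

@dataclass
class Insert:
    pos: int
    char: str
    site: str  # client/site ID, used only to break position ties

def transform_insert(op1: Insert, op2: Insert) -> Insert:
    """Transform op1 as if op2 had already been applied."""
    if op2.pos < op1.pos or (op2.pos == op1.pos and op2.site < op1.site):
        # op2 inserted at or before op1's position: shift op1 right
        return Insert(op1.pos + 1, op1.char, op1.site)
    return op1

# The "ab" example: server received A first, so transform B against A.
a = Insert(1, "x", site="A")
b = Insert(1, "y", site="B")
b_prime = transform_insert(b, a)
print(b_prime.pos)  # 2 — B's insert shifted past A's 'x', giving "axyb"
```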
Why OT is hard: Transformation function must handle all operation pairs correctly. Multi-client OT (Jupiter algorithm) requires central server to serialize operations.
CRDT (Conflict-free Replicated Data Types) — modern alternative
Assign unique, globally ordered IDs to every character.
Position = relationship to neighbours (not integer index).
Example (RGA - Replicated Growable Array):
Each char: (value, unique_id, left_id)
"a" = (a, id1, null)
"b" = (b, id2, id1) ← 'b' is to the right of 'a'
Concurrent inserts between same two chars:
Both reference the same left neighbour (same left_id).
Tie-break by site ID (alphabetical or timestamp).
→ Deterministic, no server needed to transform.
Pros: no central server needed, peer-to-peer possible
Cons: tombstones (deleted chars still stored), higher memory
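A minimal RGA-style sketch of the scheme above (illustrative only: a real RGA must also skip descendants of concurrent siblings and handle deletes via tombstones). Each char carries a unique ID plus the ID of its left neighbour; concurrent siblings are ordered by ID, so every replica converges without a coordinating server.

```python
def rga_insert(chars, value, uid, left_id):
    """chars: list of (value, uid, left_id) tuples in document order."""
    # Find the slot just after left_id (or the head if left_id is None)...
    i = 0 if left_id is None else next(
        k + 1 for k, c in enumerate(chars) if c[1] == left_id)
    # ...then skip concurrent siblings with a larger ID (deterministic tie-break)
    while i < len(chars) and chars[i][2] == left_id and chars[i][1] > uid:
        i += 1
    chars.insert(i, (value, uid, left_id))

doc = []
rga_insert(doc, "a", "id1", None)
rga_insert(doc, "b", "id2", "id1")
# Two sites concurrently insert after 'a'; arrival order doesn't matter:
rga_insert(doc, "x", "siteA-3", "id1")
rga_insert(doc, "y", "siteB-3", "id1")
print("".join(c[0] for c in doc))  # "ayxb" on every replica
```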
Architecture
[Client Browser]
- Local copy of document (in-memory)
- Apply own edits immediately (optimistic)
- Queue ops for server
│
│ WebSocket (persistent)
▼
[Document Server]
- Receives ops from all clients
- Applies OT transformation (for Google's approach)
- Broadcasts transformed op to all other clients
- Persists to storage
│
├── Op log (Kafka) — ordered, replayable
├── Document store (Colossus / S3) — snapshots
└── Active session store (Redis) — who's editing, cursors
Detailed Flow
1. User A types 'x' at position 5
→ Immediately applied to local doc (no lag)
→ Op queued: Insert(5, 'x', client_version=42)
2. Client sends op to server via WebSocket
3. Server transforms op against any ops received since client_version=42
→ Applies to server doc
→ Broadcasts transformed op to all other clients
4. Other clients receive op
→ Transform against their own pending (unsent) ops
→ Apply to local doc
→ Render update
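Step 3 can be sketched as follows (class and method names are assumptions): the server keeps an ordered op log, and an incoming op is transformed against every op the sending client hadn't yet seen (everything logged after client_version) before being applied and broadcast.

```python
from dataclasses import dataclass

@dataclass
class Insert:
    pos: int
    char: str
    site: str

def transform_insert(op1, op2):
    # Shift op1 right if op2 landed at or before it (tie-break by site ID)
    if op2.pos < op1.pos or (op2.pos == op1.pos and op2.site < op1.site):
        return Insert(op1.pos + 1, op1.char, op1.site)
    return op1

class DocServer:
    def __init__(self, doc=""):
        self.doc = doc
        self.log = []  # ordered, append-only: one entry per applied op

    def receive(self, op, client_version):
        """client_version = length of the log the client had seen."""
        for seen in self.log[client_version:]:  # ops the client missed
            op = transform_insert(op, seen)
        self.doc = self.doc[:op.pos] + op.char + self.doc[op.pos:]
        self.log.append(op)
        return op  # the transformed op is what gets broadcast

server = DocServer("ab")
server.receive(Insert(1, "x", "A"), client_version=0)
server.receive(Insert(1, "y", "B"), client_version=0)  # concurrent edit
print(server.doc)  # "axyb"
```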
Cursor Presence
Separate from document edits (lightweight).
Each client broadcasts: { user_id, cursor_pos, color }
Server relays to all other clients.
Transform cursor positions against incoming insert/delete ops, same as document edits.
Ephemeral — stored in Redis only (not in doc history).
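The cursor-transform rule is the same position shift used for ops — a sketch, assuming character-granularity positions:

```python
def transform_cursor(cursor: int, op_type: str, op_pos: int) -> int:
    """Shift a remote cursor when an edit lands at or before it."""
    if op_type == "insert" and op_pos <= cursor:
        return cursor + 1
    if op_type == "delete" and op_pos < cursor:
        return cursor - 1
    return cursor  # edit is after the cursor: no shift

print(transform_cursor(5, "insert", 2))  # 6 — insert before cursor shifts it right
print(transform_cursor(5, "delete", 7))  # 5 — edit after cursor, no change
```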
Persistence + Version History
Snapshot strategy:
- Apply ops to in-memory doc
- Every N ops (or every minute): save snapshot to S3
- Op log stored in Kafka/Cassandra (append-only)
Version history:
- Snapshots = named versions
- Replay ops from snapshot to reach any point in time
Recovery:
Server restarts → load latest snapshot → replay ops from op log
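Recovery can be sketched like this (snapshot/op-log shapes are assumed): load the latest snapshot, then replay only the ops whose sequence number is past the snapshot's.

```python
def recover(snapshot: dict, op_log: list) -> str:
    """Rebuild the doc from a snapshot plus the ops logged after it."""
    doc = snapshot["doc"]
    for seq, pos, char in op_log:
        if seq > snapshot["seq"]:  # skip ops already baked into the snapshot
            doc = doc[:pos] + char + doc[pos:]
    return doc

snap = {"doc": "ax", "seq": 2}                     # snapshot after op 2
log = [(1, 0, "a"), (2, 1, "x"), (3, 2, "b")]      # (seq, pos, char)
print(recover(snap, log))  # "axb" — only op 3 is replayed
```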
Scaling
Single document → single server node (simplest, correct OT)
Why: OT requires total ordering of ops → easier with one coordinator
Multiple documents → consistent hash docs to servers
Each server handles N documents
If server dies:
Re-assign document to another server
Load from latest snapshot + replay op log
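The doc-to-server assignment can be sketched with a consistent-hash ring (server names and vnode count are illustrative); a given document always routes to one coordinator, and a dead server's docs redistribute to its ring neighbours.

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Stable hash onto the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, vnodes=100):
        # Each server gets many virtual nodes for even load spread
        self.ring = sorted(
            (h(f"{s}#{v}"), s) for s in servers for v in range(vnodes))

    def server_for(self, doc_id: str) -> str:
        # First vnode clockwise from the doc's hash owns the doc
        i = bisect.bisect(self.ring, (h(doc_id),)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["doc-server-1", "doc-server-2", "doc-server-3"])
owner = ring.server_for("doc-42")  # same doc always routes to the same server
```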
Trade-offs
| Decision | OT (Google Docs) | CRDT (Figma, others) |
|---|---|---|
| Conflict resolution | Central server transforms | Distributed, no coordination |
| Implementation | Complex transform functions | Complex data structure |
| Memory | Efficient | Tombstones accumulate |
| Offline support | Harder | Natural fit |
| Latency | Server round-trip to transform | Local-first |
Google adopted OT when Docs was built (mid-2000s), before CRDTs had matured. Many modern collaborative systems favour CRDTs for their offline and peer-to-peer properties.
What to Say in Interview
"The core challenge is concurrent edit conflict resolution. I'd use OT with a central server that serializes operations and transforms them before broadcasting. Each client applies changes optimistically locally, then reconciles when the server's transformed op arrives. For storage, ops are appended to an immutable log (Kafka), with periodic snapshots to S3 for fast recovery. Cursors are ephemeral — tracked in Redis and broadcast separately from document ops."
Related
- [[Message Queues & Kafka]] — op log
- [[Caching & Redis]] — cursor presence, active sessions
- [[System Design/Problem Designs/WhatsApp]] — real-time delivery (WebSocket pattern)