SQL Indexes & Performance
SQL Indexes & Performance
What Is an Index?
Separate data structure (usually B-tree) that stores column values + pointers to table rows. Trades write overhead + storage for read speed.
-- Create index
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Create index concurrently (no table lock — use in production)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
-- Drop index
DROP INDEX idx_orders_user_id;
-- List indexes on a table (PostgreSQL)
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'orders';
Index Types
B-Tree (Default)
Balanced tree. Good for: equality (=), range (>, <, BETWEEN), ORDER BY, LIKE 'prefix%'.
CREATE INDEX idx_salary ON employees(salary);
-- These queries USE the index:
WHERE salary = 80000
WHERE salary > 70000
WHERE salary BETWEEN 60000 AND 90000
ORDER BY salary
-- This does NOT use it:
WHERE LOWER(name) = 'alice' -- function call on indexed column defeats index
Hash Index
Equality only. Faster than B-tree for = but useless for range queries. Rarely used.
CREATE INDEX idx_hash_email ON users USING HASH (email);
-- Only good for: WHERE email = 'x@x.com'
GIN / GiST (PostgreSQL)
GIN— full-text search, arrays, JSONBGiST— geometric data, range types
CREATE INDEX idx_gin_tags ON articles USING GIN (tags);
SELECT * FROM articles WHERE tags @> ARRAY['python'];
Composite Index
Index on multiple columns. Column order matters.
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
Left-prefix rule: index helps queries that use the leading columns.
| Query | Uses index? |
|---|---|
WHERE user_id = 5 | ✅ leading column |
WHERE user_id = 5 AND status = 'active' | ✅ both columns |
WHERE status = 'active' | ❌ not leading column |
WHERE user_id = 5 ORDER BY created_at | ❌ created_at not in index |
General rule: put the most selective column first. Put equality conditions before range conditions.
-- Better order: equality col first, range col second
CREATE INDEX idx_orders_status_date ON orders(user_id, status, created_at);
-- WHERE user_id = 5 AND status = 'active' AND created_at > '2026-01-01' ✅
Partial Index
Index only a subset of rows. Smaller index, faster for targeted queries.
-- Only index active orders (common query pattern)
CREATE INDEX idx_active_orders ON orders(user_id) WHERE status = 'active';
-- Helps: WHERE user_id = 5 AND status = 'active'
-- Does NOT help: WHERE user_id = 5 AND status = 'cancelled'
-- Index only non-deleted rows
CREATE INDEX idx_users_email ON users(email) WHERE deleted_at IS NULL;
Covering Index (Index-Only Scan)
Include all columns the query needs so DB never touches the main table.
-- Query: SELECT user_id, status FROM orders WHERE user_id = 5
-- Covering index — includes status so no heap fetch needed
CREATE INDEX idx_orders_covering ON orders(user_id) INCLUDE (status, total);
-- PostgreSQL: INCLUDE clause
-- MySQL: just add columns to the index
Functional / Expression Index
Index on a computed expression.
-- Enable case-insensitive email lookup
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- Now this uses the index:
WHERE LOWER(email) = 'alice@example.com'
EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
Key terms to look for:
| Term | Meaning |
|---|---|
Seq Scan | Full table scan — bad for large tables |
Index Scan | Using index, fetches heap rows |
Index Only Scan | Using covering index — fastest |
Bitmap Index Scan | Multi-condition index combining |
rows=X | Estimated rows |
actual rows=Y | Actual rows — large gap = stale stats |
cost=X..Y | Estimated cost (startup..total) |
loops=N | How many times node executed |
-- Force stats refresh if estimates are off
ANALYZE orders;
ANALYZE employees;
-- Full explain with all stats
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT ...;
When Indexes Are Not Used
-- Function on indexed column
WHERE YEAR(created_at) = 2026 -- ❌ use range instead
WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01' -- ✅
-- Implicit type cast
WHERE user_id = '123' -- user_id is INT — cast defeats index in some DBs
-- OR condition (can defeat index — DB may prefer seq scan)
WHERE status = 'active' OR status = 'pending'
-- Better: WHERE status IN ('active', 'pending')
-- Leading wildcard in LIKE
WHERE name LIKE '%alice%' -- ❌ can't use B-tree
WHERE name LIKE 'alice%' -- ✅ prefix match uses B-tree
-- Low selectivity column — index not worth it
WHERE is_deleted = FALSE -- 95% rows are FALSE — seq scan cheaper
Index Tradeoffs
| Benefit | Cost |
|---|---|
| Faster SELECT | Slower INSERT/UPDATE/DELETE (index must be maintained) |
| Faster ORDER BY | Extra disk space |
| Faster JOINs | More memory for index pages |
Rule: Index foreign keys and columns used in WHERE/JOIN/ORDER BY. Don't over-index high-write tables.
Query Optimization Checklist
1. EXPLAIN ANALYZE — find bottleneck (Seq Scan on large table?)
2. Add index on filter/join column
3. Check composite index column order (equality before range)
4. Avoid functions on indexed columns in WHERE
5. Use covering index to eliminate heap fetches
6. ANALYZE table if estimated rows far from actual
7. Consider partial index if filtering on common value
N+1 Problem
-- Bad: 1 query to get users + N queries to get each user's orders
for user in users:
orders = query("SELECT * FROM orders WHERE user_id = %s", user.id)
-- Fix 1: JOIN in one query
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
-- Fix 2: batch load
SELECT * FROM orders WHERE user_id IN (1, 2, 3, 4, ...);
Related
- [[Transactions]] — locking and index interaction during writes
- [[SQL for Interviews]] — EXPLAIN ANALYZE interview questions
- [[JOINs]] — join performance and index use on join columns