Back to Notes

SQL Indexes & Performance

SQL Indexes & Performance

What Is an Index?

Separate data structure (usually B-tree) that stores column values + pointers to table rows. Trades write overhead + storage for read speed.

-- Create index
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Create index concurrently (no table lock — use in production)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

-- Drop index
DROP INDEX idx_orders_user_id;

-- List indexes on a table (PostgreSQL)
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'orders';

Index Types

B-Tree (Default)

Balanced tree. Good for: equality (=), range (>, <, BETWEEN), ORDER BY, LIKE 'prefix%'.

CREATE INDEX idx_salary ON employees(salary);

-- These queries USE the index:
WHERE salary = 80000
WHERE salary > 70000
WHERE salary BETWEEN 60000 AND 90000
ORDER BY salary

-- This does NOT use it:
WHERE LOWER(name) = 'alice'  -- function call on indexed column defeats index

Hash Index

Equality only. Faster than B-tree for = but useless for range queries. Rarely used.

CREATE INDEX idx_hash_email ON users USING HASH (email);
-- Only good for: WHERE email = 'x@x.com'

GIN / GiST (PostgreSQL)

  • GIN — full-text search, arrays, JSONB
  • GiST — geometric data, range types
CREATE INDEX idx_gin_tags ON articles USING GIN (tags);
SELECT * FROM articles WHERE tags @> ARRAY['python'];

Composite Index

Index on multiple columns. Column order matters.

CREATE INDEX idx_orders_user_status ON orders(user_id, status);

Left-prefix rule: index helps queries that use the leading columns.

QueryUses index?
WHERE user_id = 5✅ leading column
WHERE user_id = 5 AND status = 'active'✅ both columns
WHERE status = 'active'❌ not leading column
WHERE user_id = 5 ORDER BY created_at❌ created_at not in index

General rule: put the most selective column first. Put equality conditions before range conditions.

-- Better order: equality col first, range col second
CREATE INDEX idx_orders_status_date ON orders(user_id, status, created_at);
-- WHERE user_id = 5 AND status = 'active' AND created_at > '2026-01-01'  ✅

Partial Index

Index only a subset of rows. Smaller index, faster for targeted queries.

-- Only index active orders (common query pattern)
CREATE INDEX idx_active_orders ON orders(user_id) WHERE status = 'active';
-- Helps: WHERE user_id = 5 AND status = 'active'
-- Does NOT help: WHERE user_id = 5 AND status = 'cancelled'

-- Index only non-deleted rows
CREATE INDEX idx_users_email ON users(email) WHERE deleted_at IS NULL;

Covering Index (Index-Only Scan)

Include all columns the query needs so DB never touches the main table.

-- Query: SELECT user_id, status FROM orders WHERE user_id = 5
-- Covering index — includes status so no heap fetch needed
CREATE INDEX idx_orders_covering ON orders(user_id) INCLUDE (status, total);
-- PostgreSQL: INCLUDE clause
-- MySQL: just add columns to the index

Functional / Expression Index

Index on a computed expression.

-- Enable case-insensitive email lookup
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- Now this uses the index:
WHERE LOWER(email) = 'alice@example.com'

EXPLAIN ANALYZE

EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;

Key terms to look for:

TermMeaning
Seq ScanFull table scan — bad for large tables
Index ScanUsing index, fetches heap rows
Index Only ScanUsing covering index — fastest
Bitmap Index ScanMulti-condition index combining
rows=XEstimated rows
actual rows=YActual rows — large gap = stale stats
cost=X..YEstimated cost (startup..total)
loops=NHow many times node executed
-- Force stats refresh if estimates are off
ANALYZE orders;
ANALYZE employees;

-- Full explain with all stats
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT ...;

When Indexes Are Not Used

-- Function on indexed column
WHERE YEAR(created_at) = 2026          -- ❌ use range instead
WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01'  -- ✅

-- Implicit type cast
WHERE user_id = '123'    -- user_id is INT — cast defeats index in some DBs

-- OR condition (can defeat index — DB may prefer seq scan)
WHERE status = 'active' OR status = 'pending'
-- Better: WHERE status IN ('active', 'pending')

-- Leading wildcard in LIKE
WHERE name LIKE '%alice%'   -- ❌ can't use B-tree
WHERE name LIKE 'alice%'    -- ✅ prefix match uses B-tree

-- Low selectivity column — index not worth it
WHERE is_deleted = FALSE    -- 95% rows are FALSE — seq scan cheaper

Index Tradeoffs

BenefitCost
Faster SELECTSlower INSERT/UPDATE/DELETE (index must be maintained)
Faster ORDER BYExtra disk space
Faster JOINsMore memory for index pages

Rule: Index foreign keys and columns used in WHERE/JOIN/ORDER BY. Don't over-index high-write tables.


Query Optimization Checklist

1. EXPLAIN ANALYZE — find bottleneck (Seq Scan on large table?)
2. Add index on filter/join column
3. Check composite index column order (equality before range)
4. Avoid functions on indexed columns in WHERE
5. Use covering index to eliminate heap fetches
6. ANALYZE table if estimated rows far from actual
7. Consider partial index if filtering on common value

N+1 Problem

-- Bad: 1 query to get users + N queries to get each user's orders
for user in users:
    orders = query("SELECT * FROM orders WHERE user_id = %s", user.id)

-- Fix 1: JOIN in one query
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;

-- Fix 2: batch load
SELECT * FROM orders WHERE user_id IN (1, 2, 3, 4, ...);

Related

  • [[Transactions]] — locking and index interaction during writes
  • [[SQL for Interviews]] — EXPLAIN ANALYZE interview questions
  • [[JOINs]] — join performance and index use on join columns