SQL for Interviews
Tested at: Razorpay, PhonePe, Juspay, Zepto, most fintech/product companies. Round format: 2-3 queries in 30-45 min. Medium complexity expected.
Core Clauses — Order of Execution
-- Logical execution order (not written order):
FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT
Key: WHERE runs BEFORE GROUP BY. HAVING runs AFTER. Don't filter aggregates with WHERE.
JOINs
-- INNER JOIN: only rows matching in both tables
SELECT u.name, o.total
FROM users u
INNER JOIN orders o ON u.id = o.user_id;
-- LEFT JOIN: all users, nulls where no order
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
-- Self join: employees and their managers
SELECT e.name AS emp, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
Interview question: "Difference between INNER and LEFT JOIN?" → INNER returns only matched rows. LEFT returns all left rows, NULLs for unmatched right.
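Quick runnable check of that answer, using Python's sqlite3 on toy data (table contents are made up for the demo; one user has no orders):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'asha'), (2, 'ravi'), (3, 'meena');
    INSERT INTO orders VALUES (10, 1, 500.0), (11, 1, 250.0), (12, 2, 120.0);
""")

inner_rows = conn.execute(
    "SELECT u.name, o.total FROM users u INNER JOIN orders o ON u.id = o.user_id"
).fetchall()

left_rows = conn.execute(
    "SELECT u.name, o.total FROM users u LEFT JOIN orders o ON u.id = o.user_id"
).fetchall()

# INNER drops meena (no orders); LEFT keeps her with total = NULL (None).
print(len(inner_rows))  # 3
print(len(left_rows))   # 4
```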
GROUP BY + HAVING
-- Users with more than 3 orders
SELECT user_id, COUNT(*) as order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 3; -- not WHERE — this is post-aggregation filter
-- Revenue per category, only categories > 10000
SELECT category, SUM(price) as revenue
FROM products
GROUP BY category
HAVING SUM(price) > 10000
ORDER BY revenue DESC;
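To make the WHERE-before-GROUP-BY point concrete: a sqlite3 sketch on toy data, showing HAVING filtering the aggregate and WHERE on an aggregate failing outright:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO orders (user_id) VALUES (1),(1),(1),(1),(2),(2),(3);
""")

# HAVING filters after aggregation: only user 1 has more than 3 orders.
heavy = conn.execute("""
    SELECT user_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY user_id
    HAVING COUNT(*) > 3
""").fetchall()
print(heavy)  # [(1, 4)]

# WHERE runs before GROUP BY, so an aggregate there is an error.
where_error = None
try:
    conn.execute("SELECT user_id FROM orders WHERE COUNT(*) > 3 GROUP BY user_id")
except sqlite3.OperationalError as e:
    where_error = str(e)  # e.g. "misuse of aggregate: COUNT()"
```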
CTEs (Common Table Expressions)
-- CTEs = named subqueries, more readable than nested subqueries
WITH active_users AS (
SELECT user_id, COUNT(*) as sessions
FROM user_sessions
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY user_id
),
high_value AS (
SELECT user_id, SUM(amount) as total
FROM orders
GROUP BY user_id
HAVING SUM(amount) > 1000
)
SELECT a.user_id, a.sessions, h.total
FROM active_users a
JOIN high_value h ON a.user_id = h.user_id;
When CTE vs subquery: CTE when you reuse the result or want readability. Subquery for one-off inline filters.
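The two-CTE pattern above, runnable via sqlite3 on a toy dataset (the Postgres `INTERVAL` filter is dropped for brevity; data and numbers are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_sessions (user_id INTEGER);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO user_sessions VALUES (1),(1),(2);
    INSERT INTO orders VALUES (1, 800),(1, 600),(2, 300);
""")

rows = conn.execute("""
    WITH active_users AS (
        SELECT user_id, COUNT(*) AS sessions
        FROM user_sessions
        GROUP BY user_id
    ),
    high_value AS (
        SELECT user_id, SUM(amount) AS total
        FROM orders
        GROUP BY user_id
        HAVING SUM(amount) > 1000
    )
    SELECT a.user_id, a.sessions, h.total
    FROM active_users a
    JOIN high_value h ON a.user_id = h.user_id
""").fetchall()

# Only user 1 clears the 1000 threshold (800 + 600 = 1400).
print(rows)  # [(1, 2, 1400.0)]
```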
Window Functions ← come up constantly at senior level
-- ROW_NUMBER: unique rank (no ties)
-- RANK: same rank for ties, gaps after (1,1,3)
-- DENSE_RANK: same rank for ties, no gaps (1,1,2)
SELECT name, salary,
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rn,
RANK() OVER (PARTITION BY dept ORDER BY salary DESC) as rnk,
DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) as drnk
FROM employees;
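Runnable check of the (1,1,3) vs (1,1,2) behavior on a salary tie, via sqlite3 (window functions need SQLite >= 3.25, bundled with any recent Python; data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('a', 'eng', 300), ('b', 'eng', 300), ('c', 'eng', 200);
""")

rows = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn,
           RANK()       OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS drnk
    FROM employees
    ORDER BY rn
""").fetchall()

# Tied 300s: ROW_NUMBER forces 1,2; RANK gives 1,1 then skips to 3;
# DENSE_RANK gives 1,1 then 2.
ranks = [(rn, rnk, drnk) for _, rn, rnk, drnk in rows]
print(ranks)  # [(1, 1, 1), (2, 1, 1), (3, 3, 2)]
```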
-- LAG / LEAD: access previous/next row
SELECT date, revenue,
LAG(revenue, 1) OVER (ORDER BY date) as prev_day,
revenue - LAG(revenue, 1) OVER (ORDER BY date) as day_over_day
FROM daily_revenue;
-- Running total
SELECT date, amount,
SUM(amount) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) as running_total
FROM transactions;
-- Partition running total (per user)
-- (no ROWS clause → default RANGE frame: rows tied on date share one total)
SELECT user_id, date, amount,
SUM(amount) OVER (PARTITION BY user_id ORDER BY date) as user_running_total
FROM transactions;
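LAG and a running total together, runnable on toy daily_revenue rows via sqlite3 (note LAG yields NULL, i.e. Python None, on the first row since there is no previous day):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_revenue (date TEXT, revenue INTEGER);
    INSERT INTO daily_revenue VALUES
        ('2024-01-01', 100), ('2024-01-02', 150), ('2024-01-03', 120);
""")

rows = conn.execute("""
    SELECT date,
           revenue - LAG(revenue, 1) OVER (ORDER BY date) AS day_over_day,
           SUM(revenue) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS running_total
    FROM daily_revenue
""").fetchall()

print(rows)
# [('2024-01-01', None, 100), ('2024-01-02', 50, 250), ('2024-01-03', -30, 370)]
```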
Classic Interview Problems
Second highest salary
-- Method 1: DENSE_RANK (preferred — handles ties correctly)
SELECT salary FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as rnk
FROM employees
) ranked
WHERE rnk = 2;
-- Method 2: Subquery (ties collapse naturally; returns NULL if no second
-- distinct salary exists, where Method 1 returns zero rows)
SELECT MAX(salary) FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
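Both methods verified on a dataset with a tie at the top (toy data; second-highest distinct salary is 400):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES ('a', 500), ('b', 500), ('c', 400), ('d', 300);
""")

# Method 1: DENSE_RANK collapses the two 500s into rank 1, so rank 2 = 400.
dense_rank_answer = conn.execute("""
    SELECT salary FROM (
        SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    ) ranked
    WHERE rnk = 2
""").fetchone()[0]

# Method 2: MAX of all salaries strictly below the overall MAX.
subquery_answer = conn.execute("""
    SELECT MAX(salary) FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchone()[0]

print(dense_rank_answer, subquery_answer)  # 400 400
```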
Top N per group (top 3 earners per department)
SELECT dept, name, salary FROM (
SELECT dept, name, salary,
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rn
FROM employees
) ranked
WHERE rn <= 3;
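Same pattern end to end in sqlite3, on toy data where one department has more than three rows and one has fewer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (dept TEXT, name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('eng', 'a', 500), ('eng', 'b', 400), ('eng', 'c', 300), ('eng', 'd', 200),
        ('hr',  'e', 450), ('hr',  'f', 350);
""")

rows = conn.execute("""
    SELECT dept, name, salary FROM (
        SELECT dept, name, salary,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
        FROM employees
    ) ranked
    WHERE rn <= 3
    ORDER BY dept, salary DESC
""").fetchall()

# 'd' (4th in eng) is cut; hr keeps both of its rows since it has < 3.
top_names = [name for _, name, _ in rows]
print(top_names)  # ['a', 'b', 'c', 'e', 'f']
```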
Consecutive duplicates
SELECT DISTINCT l1.num AS ConsecutiveNums
FROM logs l1, logs l2, logs l3
WHERE l1.id = l2.id - 1 AND l2.id = l3.id - 1
AND l1.num = l2.num AND l2.num = l3.num;
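The triple self-join verified on the classic dataset shape for this problem (ids 1..7; only num = 1 appears three times in a row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (id INTEGER, num INTEGER);
    INSERT INTO logs VALUES (1,1),(2,1),(3,1),(4,2),(5,1),(6,2),(7,2);
""")

rows = conn.execute("""
    SELECT DISTINCT l1.num AS ConsecutiveNums
    FROM logs l1, logs l2, logs l3
    WHERE l1.id = l2.id - 1 AND l2.id = l3.id - 1
      AND l1.num = l2.num AND l2.num = l3.num
""").fetchall()

# ids 1-3 are all num=1; no other run of three exists (2 appears at 4,6,7).
print(rows)  # [(1,)]
```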
Users active in last 30 days
SELECT COUNT(DISTINCT user_id) as active_users
FROM activity
WHERE activity_date >= CURRENT_DATE - INTERVAL '30 days';
Indexes
-- B-tree index (default) — good for range queries, equality, ORDER BY
CREATE INDEX idx_orders_user_id ON orders(user_id);
-- Composite index — column order MATTERS
-- This index helps: WHERE user_id = ? AND status = ?
-- Also helps: WHERE user_id = ? (leading column)
-- Does NOT help: WHERE status = ? alone
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- Partial index — index subset of rows
CREATE INDEX idx_active_orders ON orders(user_id) WHERE status = 'active';
When indexes slow you down: INSERT/UPDATE/DELETE — index must be maintained. High-write tables: be selective.
EXPLAIN ANALYZE:
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
-- Look for: Seq Scan (bad for large tables) vs Index Scan (good)
-- Rows estimated vs actual — large gap = stale statistics, run ANALYZE
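The example above is Postgres; SQLite's rough analog is EXPLAIN QUERY PLAN, which is enough to demo the same check in Python: full scan before the index, index search after (toy table; output wording varies slightly by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT);
    INSERT INTO orders (user_id, status) VALUES (123, 'active'), (456, 'done');
""")

def plan(sql):
    # Last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT * FROM orders WHERE user_id = 123")
conn.execute("CREATE INDEX idx_orders_user_id ON orders(user_id)")
after = plan("SELECT * FROM orders WHERE user_id = 123")

print(before)  # full table scan, analogous to a Postgres Seq Scan
print(after)   # SEARCH ... USING INDEX idx_orders_user_id (user_id=?)
```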
Transactions + ACID
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- both or neither
-- On error:
ROLLBACK;
ACID verbal:
- Atomicity — all or nothing
- Consistency — constraints always satisfied after transaction
- Isolation — concurrent transactions don't see each other's partial state
- Durability — committed data survives crash (WAL / write-ahead log)
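Atomicity in action via sqlite3: start the transfer, simulate a failure between the two UPDATEs, ROLLBACK, and confirm the debit never became visible (toy accounts table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    INSERT INTO accounts VALUES (1, 1000), (2, 1000);
""")
conn.isolation_level = None  # manage BEGIN/COMMIT/ROLLBACK explicitly

conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
# ... imagine the second UPDATE fails here ...
conn.execute("ROLLBACK")

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 1000, 2: 1000} (the half-done debit was undone)
```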
N+1 Problem
-- Bad: 1 query to get users, N queries to get each user's orders
for user in users:
orders = query("SELECT * FROM orders WHERE user_id = ?", user.id)
-- Fix: JOIN in one query
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
-- Or batch load:
SELECT * FROM orders WHERE user_id IN (1, 2, 3, ...)
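The batch-load fix sketched in Python with sqlite3: count the round trips the N+1 loop costs, then collapse them into a single parameterized IN query (toy data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'asha'), (2, 'ravi'), (3, 'meena');
    INSERT INTO orders VALUES (10, 1, 500.0), (11, 2, 120.0), (12, 2, 80.0);
""")

user_ids = [row[0] for row in conn.execute("SELECT id FROM users")]

# N+1: 1 query for users plus one per user.
queries_n_plus_1 = 1 + len(user_ids)  # 4 round trips for 3 users
for uid in user_ids:
    conn.execute("SELECT * FROM orders WHERE user_id = ?", (uid,)).fetchall()

# Batch fix: one IN query, one placeholder per id.
placeholders = ",".join("?" * len(user_ids))
batched = conn.execute(
    f"SELECT * FROM orders WHERE user_id IN ({placeholders})", user_ids
).fetchall()
print(queries_n_plus_1, len(batched))  # 4 round trips become 1; same 3 rows
```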
Practice Track
Resources: DataLemur (concept + company-tagged) → LeetCode SQL 50 (volume).
Target: write any medium query without reference in under 5 minutes.
DataLemur Progress
| Difficulty | Done | Notes |
|---|---|---|
| Easy | | |
| Medium | | |
| Hard | | |
LeetCode SQL 50 Progress
| # | Problem | Difficulty | Done |
|---|---|---|---|
| 1 | Select... | Easy | |
| — | (fill as you go) | | |
| Total | /50 | | |
Concept Checklist
- JOINs — INNER, LEFT, RIGHT, FULL OUTER, self-join
- GROUP BY + HAVING vs WHERE execution order
- CTEs — non-recursive + recursive
- Window: ROW_NUMBER, RANK, DENSE_RANK
- Window: LAG, LEAD, running totals
- Classic: 2nd highest salary (DENSE_RANK + subquery both)
- Classic: top-N per group
- Indexes — B-tree, composite column order, EXPLAIN ANALYZE
- N+1 problem — detect + fix
- ACID — verbal 4 sentences cold
DataLemur Instacart Case Study
Instacart dataset appears frequently on DataLemur. Key tables:
orders(order_id, user_id, order_dow, order_hour_of_day, days_since_prior_order)
order_products(order_id, product_id, add_to_cart_order, reordered)
products(product_id, product_name, aisle_id, department_id)
departments(department_id, department)
aisles(aisle_id, aisle)
Common problem types:
- Reorder rate per product: SUM(reordered) / COUNT(*)
- Most popular products per department: ROW_NUMBER() OVER (PARTITION BY dept ORDER BY count DESC)
- Basket analysis: users who bought product A — what else did they buy?
- Peak order hours: GROUP BY order_hour_of_day
- Customer churn: days_since_prior_order > 30
-- Example: top 5 most reordered products
SELECT p.product_name,
ROUND(100.0 * SUM(op.reordered) / COUNT(*), 2) AS reorder_rate
FROM order_products op
JOIN products p ON op.product_id = p.product_id
GROUP BY p.product_id, p.product_name
HAVING COUNT(*) > 50 -- filter low-volume products
ORDER BY reorder_rate DESC
LIMIT 5;
Related
- [[System Design/Backend 101/Data Base/Basics]] — SQL vs NoSQL, indexing theory
- [[AWS/RDS]] — PostgreSQL on AWS, connection pooling