Back to Notes

SQL for Interviews

SQL for Interviews

Tested at: Razorpay, PhonePe, Juspay, Zepto, most fintech/product companies. Round format: 2-3 queries in 30-45 min. Medium complexity expected.


Core Clauses — Order of Execution

-- Logical execution order (not written order):
FROMJOINWHEREGROUP BYHAVINGSELECTORDER BY → LIMIT

Key: WHERE runs BEFORE GROUP BY. HAVING runs AFTER. Don't filter aggregates with WHERE.


JOINs

-- INNER JOIN: only rows matching in both tables
SELECT u.name, o.total
FROM users u
INNER JOIN orders o ON u.id = o.user_id;

-- LEFT JOIN: all users, nulls where no order
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;

-- Self join: employees and their managers
SELECT e.name AS emp, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

Interview question: "Difference between INNER and LEFT JOIN?" → INNER returns only matched rows. LEFT returns all left rows, NULLs for unmatched right.


GROUP BY + HAVING

-- Users with more than 3 orders
SELECT user_id, COUNT(*) as order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 3;  -- not WHERE — this is post-aggregation filter

-- Revenue per category, only categories > 10000
SELECT category, SUM(price) as revenue
FROM products
GROUP BY category
HAVING SUM(price) > 10000
ORDER BY revenue DESC;

CTEs (Common Table Expressions)

-- CTEs = named subqueries, more readable than nested subqueries
WITH active_users AS (
    SELECT user_id, COUNT(*) as sessions
    FROM user_sessions
    WHERE created_at > NOW() - INTERVAL '30 days'
    GROUP BY user_id
),
high_value AS (
    SELECT user_id, SUM(amount) as total
    FROM orders
    GROUP BY user_id
    HAVING SUM(amount) > 1000
)
SELECT a.user_id, a.sessions, h.total
FROM active_users a
JOIN high_value h ON a.user_id = h.user_id;

When CTE vs subquery: CTE when you reuse the result or want readability. Subquery for one-off inline filters.


Window Functions ← come up constantly at senior level

-- ROW_NUMBER: unique rank (no ties)
-- RANK: same rank for ties, gaps after (1,1,3)
-- DENSE_RANK: same rank for ties, no gaps (1,1,2)

SELECT name, salary,
    ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rn,
    RANK()       OVER (PARTITION BY dept ORDER BY salary DESC) as rnk,
    DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) as drnk
FROM employees;

-- LAG / LEAD: access previous/next row
SELECT date, revenue,
    LAG(revenue, 1) OVER (ORDER BY date) as prev_day,
    revenue - LAG(revenue, 1) OVER (ORDER BY date) as day_over_day
FROM daily_revenue;

-- Running total
SELECT date, amount,
    SUM(amount) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) as running_total
FROM transactions;

-- Partition running total (per user)
SELECT user_id, date, amount,
    SUM(amount) OVER (PARTITION BY user_id ORDER BY date) as user_running_total
FROM transactions;

Classic Interview Problems

Second highest salary

-- Method 1: DENSE_RANK (preferred — handles ties correctly)
SELECT salary FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as rnk
    FROM employees
) ranked
WHERE rnk = 2;

-- Method 2: Subquery
SELECT MAX(salary) FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

Top N per group (top 3 earners per department)

SELECT dept, name, salary FROM (
    SELECT dept, name, salary,
        ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rn
    FROM employees
) ranked
WHERE rn <= 3;

Consecutive duplicates

SELECT DISTINCT l1.num AS ConsecutiveNums
FROM logs l1, logs l2, logs l3
WHERE l1.id = l2.id - 1 AND l2.id = l3.id - 1
AND l1.num = l2.num AND l2.num = l3.num;

Users active in last 30 days

SELECT COUNT(DISTINCT user_id) as active_users
FROM activity
WHERE activity_date >= CURRENT_DATE - INTERVAL '30 days';

Indexes

-- B-tree index (default) — good for range queries, equality, ORDER BY
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Composite index — column order MATTERS
-- This index helps: WHERE user_id = ? AND status = ?
-- Also helps: WHERE user_id = ?  (leading column)
-- Does NOT help: WHERE status = ? alone
CREATE INDEX idx_orders_user_status ON orders(user_id, status);

-- Partial index — index subset of rows
CREATE INDEX idx_active_orders ON orders(user_id) WHERE status = 'active';

When indexes slow you down: INSERT/UPDATE/DELETE — index must be maintained. High-write tables: be selective.

EXPLAIN ANALYZE:

EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;
-- Look for: Seq Scan (bad for large tables) vs Index Scan (good)
-- Rows estimated vs actual — large gap = stale statistics, run ANALYZE

Transactions + ACID

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- both or neither

-- On error:
ROLLBACK;

ACID verbal:

  • Atomicity — all or nothing
  • Consistency — constraints always satisfied after transaction
  • Isolation — concurrent transactions don't see each other's partial state
  • Durability — committed data survives crash (WAL / write-ahead log)

N+1 Problem

-- Bad: 1 query to get users, N queries to get each user's orders
for user in users:
    orders = query("SELECT * FROM orders WHERE user_id = ?", user.id)

-- Fix: JOIN in one query
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;

-- Or batch load:
SELECT * FROM orders WHERE user_id IN (1, 2, 3, ...)

Practice Track

Resources: DataLemur (concept + company-tagged) → LeetCode SQL 50 (volume) Target: write any medium query without reference in under 5 minutes.

DataLemur Progress

DifficultyDoneNotes
Easy
Medium
Hard

LeetCode SQL 50 Progress

#ProblemDifficultyDone
1Select...Easy
(fill as you go)
Total/50

Concept Checklist

  • JOINs — INNER, LEFT, RIGHT, FULL OUTER, self-join
  • GROUP BY + HAVING vs WHERE execution order
  • CTEs — non-recursive + recursive
  • Window: ROW_NUMBER, RANK, DENSE_RANK
  • Window: LAG, LEAD, running totals
  • Classic: 2nd highest salary (DENSE_RANK + subquery both)
  • Classic: top-N per group
  • Indexes — B-tree, composite column order, EXPLAIN ANALYZE
  • N+1 problem — detect + fix
  • ACID — verbal 4 sentences cold

DataLemur Instacart Case Study

Instacart dataset appears frequently on DataLemur. Key tables:

orders(order_id, user_id, order_dow, order_hour_of_day, days_since_prior_order)
order_products(order_id, product_id, add_to_cart_order, reordered)
products(product_id, product_name, aisle_id, department_id)
departments(department_id, department)
aisles(aisle_id, aisle)

Common problem types:

  • Reorder rate per product: SUM(reordered) / COUNT(*)
  • Most popular products per department: ROW_NUMBER() OVER (PARTITION BY dept ORDER BY count DESC)
  • Basket analysis: users who bought product A — what else did they buy?
  • Peak order hours: GROUP BY order_hour_of_day
  • Customer churn: days_since_prior_order > 30
-- Example: top 5 most reordered products
SELECT p.product_name,
  ROUND(100.0 * SUM(op.reordered) / COUNT(*), 2) AS reorder_rate
FROM order_products op
JOIN products p ON op.product_id = p.product_id
GROUP BY p.product_id, p.product_name
HAVING COUNT(*) > 50  -- filter low-volume products
ORDER BY reorder_rate DESC
LIMIT 5;

Related

  • [[System Design/Backend 101/Data Base/Basics]] — SQL vs NoSQL, indexing theory
  • [[AWS/RDS]] — PostgreSQL on AWS, connection pooling