Skip to content

SQL

Status: 🌿

Motivation

Strengthen practical SQL fluency for analytics, application backends, and operational troubleshooting.

Starter Points

  • Query shaping: joins, CTEs, window functions.
  • Performance basics: indexing, execution plans, cardinality awareness.
  • Data quality checks and migration-safe changes.

Performance Deep Dive

JOIN vs SELECT IN

JOIN (usually faster for large datasets):

-- Preferred for most cases
SELECT o.*
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'Europe';

SELECT IN (simpler, but can be slower):

-- Subquery executes first, may create temp table
SELECT *
FROM orders
WHERE customer_id IN (
    SELECT id FROM customers WHERE region = 'Europe'
);

Performance Notes: - JOINs allow the optimizer to choose the best execution plan - IN clauses may force nested loops or temporary tables - Modern optimizers often convert IN to JOIN internally

DISTINCT vs GROUP BY

DISTINCT (removes duplicates from result set):

-- Simple deduplication
SELECT DISTINCT product_id
FROM order_items;

GROUP BY (groups rows, enables aggregations):

-- More powerful, supports aggregations
SELECT
    product_id,
    COUNT(*) AS order_count,
    SUM(quantity) AS total_quantity
FROM order_items
GROUP BY product_id;

When to Use Each: - Use DISTINCT for simple deduplication of a single column - Use GROUP BY when you need aggregations or multi-column grouping - DISTINCT can sometimes be optimized better than GROUP BY for single-column cases

FULL GROUP BY Mode

MySQL's ONLY_FULL_GROUP_BY SQL mode (enabled by default in 5.7+) requires: - All non-aggregated columns in SELECT must appear in GROUP BY - Prevents ambiguous queries that could return indeterminate values

Invalid (without FULL GROUP BY):

-- Which customer_name should be returned for each region?
SELECT
    region,
    customer_name,
    COUNT(*)
FROM customers
GROUP BY region;

Valid (with FULL GROUP BY):

-- Explicit about which customer_name to return
SELECT
    region,
    MAX(customer_name) AS example_customer,
    COUNT(*)
FROM customers
GROUP BY region;

Why It Matters: - Prevents subtle bugs from indeterminate column values - Makes queries more predictable and portable - Encourages explicit aggregation functions

GROUP BY Optimization

  1. Index grouping columns:

    CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
    

  2. Filter before grouping:

    SELECT customer_id, COUNT(*)
    FROM orders
    WHERE order_date > '2023-01-01'
    GROUP BY customer_id;
    

  3. Avoid SELECT *:

    -- Bad: selects unnecessary columns
    SELECT * FROM orders GROUP BY customer_id;
    
    -- Good: only what you need
    SELECT customer_id, COUNT(*), SUM(amount)
    FROM orders
    GROUP BY customer_id;
    

Aggregation Functions

Common aggregation functions used with GROUP BY:

Basic Aggregations

SELECT
    department,
    COUNT(*) AS employee_count,      -- Count rows
    SUM(salary) AS total_payroll,      -- Sum values
    AVG(salary) AS avg_salary,        -- Average
    MIN(salary) AS min_salary,        -- Minimum
    MAX(salary) AS max_salary         -- Maximum
FROM employees
GROUP BY department;

Statistical Aggregations

SELECT
    product_category,
    STDDEV(price) AS price_stddev,    -- Standard deviation
    VARIANCE(price) AS price_variance, -- Variance
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median_price
FROM products
GROUP BY product_category;

String Aggregations

-- PostgreSQL: concatenate strings
SELECT
    department,
    STRING_AGG(employee_name, ', ' ORDER BY hire_date) AS employees
FROM employees
GROUP BY department;

-- MySQL: group concatenation
SELECT
    department,
    GROUP_CONCAT(employee_name ORDER BY hire_date SEPARATOR ', ') AS employees
FROM employees
GROUP BY department;

Conditional Aggregations

SELECT
    department,
    COUNT(*) AS total_employees,
    COUNT(CASE WHEN salary > 100000 THEN 1 END) AS high_earners,
    SUM(CASE WHEN gender = 'F' THEN 1 ELSE 0 END) AS female_count,
    AVG(CASE WHEN on_leave = true THEN 0 ELSE 1 END) AS attendance_rate
FROM employees
GROUP BY department;

Execution Plans and Index Selection

Understanding EXPLAIN

-- Basic EXPLAIN shows the query plan
EXPLAIN SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;

-- EXPLAIN ANALYZE executes the query and shows actual timing
EXPLAIN ANALYZE SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;

How Indexes Are Chosen

The query optimizer considers: 1. Index availability on WHERE, JOIN, and GROUP BY columns 2. Cardinality (number of unique values) of indexed columns 3. Selectivity (how much the index reduces the dataset) 4. Query complexity and estimated cost

Example with index:

-- Index on grouping column
CREATE INDEX idx_orders_customer ON orders(customer_id);

EXPLAIN ANALYZE
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;

-- Likely uses "Index Only Scan" or "GroupAggregate" with the index

What Happens Without an Index

Sequential Scan (Full Table Scan):

-- No index on customer_id
EXPLAIN ANALYZE
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;

-- Shows: "Seq Scan on orders" followed by "HashAggregate"
-- Performance: O(n log n) sorting or O(n) hashing of entire table

Performance Impact: - Small tables (<10K rows): Often acceptable - Medium tables (10K-1M rows): Noticeable slowdown - Large tables (>1M rows): Can take minutes or hours

Visualizing the Difference:

With Index:          Without Index:
┌─────────────┐       ┌─────────────┐
│ Index Scan  │       │ Seq Scan    │
│ (Fast)      │       │ (Slow)      │
└────────┬────┘       └────────┬────┘
         │                     │
         ▼                     ▼
┌────────────────┐       ┌─────────────┐
│ GroupAggregate │       │ Sort/Hash   │
│ (Efficient)    │       │ (Expensive) │
└────────────────┘       └─────────────┘

Forcing Index Usage

-- Hint to use a specific index (syntax varies by database)
-- PostgreSQL:
SELECT /*+ IndexScan(orders idx_orders_customer) */
    customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;

-- MySQL:
SELECT customer_id, COUNT(*)
FROM orders FORCE INDEX (idx_orders_customer)
GROUP BY customer_id;

Common Anti-Patterns

  1. GROUP BY on non-indexed columns:

    -- Slow: no index on product_category
    SELECT product_category, COUNT(*)
    FROM products
    GROUP BY product_category;
    

  2. GROUP BY with expensive expressions:

    -- Slow: function call on every row
    SELECT DATE_TRUNC('month', order_date), COUNT(*)
    FROM orders
    GROUP BY DATE_TRUNC('month', order_date);
    
    -- Better: pre-compute or use generated columns
    

  3. Multiple GROUP BY columns without composite index:

    -- Slow: separate indexes on customer_id and order_date
    SELECT customer_id, order_date, COUNT(*)
    FROM orders
    GROUP BY customer_id, order_date;
    
    -- Better: composite index
    CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
    

Field Notes (2026-02-13)

  • SQL helped debug a production-adjacent issue even without introducing new query concepts.
  • Running backup/import test queries for courses increased confidence in operational recovery flow.
  • Practical takeaway: repeated operational exercises reinforce tool fluency and reduce incident response time.

Field Notes (2026-03-09)

  • Investigated discrepancies between Moodle report builder queries and API results
  • Root cause: Report builder uses dynamic GROUP BY with different filtering logic than API endpoints
  • Resolution: Standardized query patterns and ensured consistent GROUP BY clauses across both systems
  • Outcome: Results now match between report builder and API endpoints