SQL
Status: 🌿
Motivation
Strengthen practical SQL fluency for analytics, application backends, and operational troubleshooting.
Starter Points
- Query shaping: joins, CTEs, window functions.
- Performance basics: indexing, execution plans, cardinality awareness.
- Data quality checks and migration-safe changes.
Performance Deep Dive
JOIN vs SELECT IN
JOIN (usually faster for large datasets):
-- Preferred for most cases
SELECT o.*
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'Europe';
SELECT IN (simpler, but can be slower):
-- Subquery executes first, may create temp table
SELECT *
FROM orders
WHERE customer_id IN (
SELECT id FROM customers WHERE region = 'Europe'
);
Performance Notes: - JOINs allow the optimizer to choose the best execution plan - IN clauses may force nested loops or temporary tables - Modern optimizers often convert IN to JOIN internally
DISTINCT vs GROUP BY
DISTINCT (removes duplicates from result set):
-- Simple deduplication
SELECT DISTINCT product_id
FROM order_items;
GROUP BY (groups rows, enables aggregations):
-- More powerful, supports aggregations
SELECT
product_id,
COUNT(*) AS order_count,
SUM(quantity) AS total_quantity
FROM order_items
GROUP BY product_id;
When to Use Each:
- Use DISTINCT for simple deduplication of a single column
- Use GROUP BY when you need aggregations or multi-column grouping
- DISTINCT can sometimes be optimized better than GROUP BY for single-column cases
FULL GROUP BY Mode
MySQL's ONLY_FULL_GROUP_BY SQL mode (enabled by default in 5.7+) requires:
- All non-aggregated columns in SELECT must appear in GROUP BY
- Prevents ambiguous queries that could return indeterminate values
Invalid (without FULL GROUP BY):
-- Which customer_name should be returned for each region?
SELECT
region,
customer_name,
COUNT(*)
FROM customers
GROUP BY region;
Valid (with FULL GROUP BY):
-- Explicit about which customer_name to return
SELECT
region,
MAX(customer_name) AS example_customer,
COUNT(*)
FROM customers
GROUP BY region;
Why It Matters: - Prevents subtle bugs from indeterminate column values - Makes queries more predictable and portable - Encourages explicit aggregation functions
GROUP BY Optimization
-
Index grouping columns:
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date); -
Filter before grouping:
SELECT customer_id, COUNT(*) FROM orders WHERE order_date > '2023-01-01' GROUP BY customer_id; -
Avoid SELECT *:
-- Bad: selects unnecessary columns SELECT * FROM orders GROUP BY customer_id; -- Good: only what you need SELECT customer_id, COUNT(*), SUM(amount) FROM orders GROUP BY customer_id;
Aggregation Functions
Common aggregation functions used with GROUP BY:
Basic Aggregations
SELECT
department,
COUNT(*) AS employee_count, -- Count rows
SUM(salary) AS total_payroll, -- Sum values
AVG(salary) AS avg_salary, -- Average
MIN(salary) AS min_salary, -- Minimum
MAX(salary) AS max_salary -- Maximum
FROM employees
GROUP BY department;
Statistical Aggregations
SELECT
product_category,
STDDEV(price) AS price_stddev, -- Standard deviation
VARIANCE(price) AS price_variance, -- Variance
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median_price
FROM products
GROUP BY product_category;
String Aggregations
-- PostgreSQL: concatenate strings
SELECT
department,
STRING_AGG(employee_name, ', ' ORDER BY hire_date) AS employees
FROM employees
GROUP BY department;
-- MySQL: group concatenation
SELECT
department,
GROUP_CONCAT(employee_name ORDER BY hire_date SEPARATOR ', ') AS employees
FROM employees
GROUP BY department;
Conditional Aggregations
SELECT
department,
COUNT(*) AS total_employees,
COUNT(CASE WHEN salary > 100000 THEN 1 END) AS high_earners,
SUM(CASE WHEN gender = 'F' THEN 1 ELSE 0 END) AS female_count,
AVG(CASE WHEN on_leave = true THEN 0 ELSE 1 END) AS attendance_rate
FROM employees
GROUP BY department;
Execution Plans and Index Selection
Understanding EXPLAIN
-- Basic EXPLAIN shows the query plan
EXPLAIN SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
-- EXPLAIN ANALYZE executes the query and shows actual timing
EXPLAIN ANALYZE SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
How Indexes Are Chosen
The query optimizer considers: 1. Index availability on WHERE, JOIN, and GROUP BY columns 2. Cardinality (number of unique values) of indexed columns 3. Selectivity (how much the index reduces the dataset) 4. Query complexity and estimated cost
Example with index:
-- Index on grouping column
CREATE INDEX idx_orders_customer ON orders(customer_id);
EXPLAIN ANALYZE
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
-- Likely uses "Index Only Scan" or "GroupAggregate" with the index
What Happens Without an Index
Sequential Scan (Full Table Scan):
-- No index on customer_id
EXPLAIN ANALYZE
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
-- Shows: "Seq Scan on orders" followed by "HashAggregate"
-- Performance: O(n log n) sorting or O(n) hashing of entire table
Performance Impact: - Small tables (<10K rows): Often acceptable - Medium tables (10K-1M rows): Noticeable slowdown - Large tables (>1M rows): Can take minutes or hours
Visualizing the Difference:
With Index: Without Index:
┌─────────────┐ ┌─────────────┐
│ Index Scan │ │ Seq Scan │
│ (Fast) │ │ (Slow) │
└────────┬────┘ └────────┬────┘
│ │
▼ ▼
┌────────────────┐ ┌─────────────┐
│ GroupAggregate │ │ Sort/Hash │
│ (Efficient) │ │ (Expensive) │
└────────────────┘ └─────────────┘
Forcing Index Usage
-- Hint to use a specific index (syntax varies by database)
-- PostgreSQL:
SELECT /*+ IndexScan(orders idx_orders_customer) */
customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
-- MySQL:
SELECT customer_id, COUNT(*)
FROM orders FORCE INDEX (idx_orders_customer)
GROUP BY customer_id;
Common Anti-Patterns
-
GROUP BY on non-indexed columns:
-- Slow: no index on product_category SELECT product_category, COUNT(*) FROM products GROUP BY product_category; -
GROUP BY with expensive expressions:
-- Slow: function call on every row SELECT DATE_TRUNC('month', order_date), COUNT(*) FROM orders GROUP BY DATE_TRUNC('month', order_date); -- Better: pre-compute or use generated columns -
Multiple GROUP BY columns without composite index:
-- Slow: separate indexes on customer_id and order_date SELECT customer_id, order_date, COUNT(*) FROM orders GROUP BY customer_id, order_date; -- Better: composite index CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
Field Notes (2026-02-13)
- SQL helped debug a production-adjacent issue even without introducing new query concepts.
- Running backup/import test queries for courses increased confidence in operational recovery flow.
- Practical takeaway: repeated operational exercises reinforce tool fluency and reduce incident response time.
Field Notes (2026-03-09)
- Investigated discrepancies between Moodle report builder queries and API results
- Root cause: Report builder uses dynamic GROUP BY with different filtering logic than API endpoints
- Resolution: Standardized query patterns and ensured consistent GROUP BY clauses across both systems
- Outcome: Results now match between report builder and API endpoints