7 Practical SQL Patterns Every Data Scientist Should Know

Practical SQL Tricks Every Data Scientist Should Know

Introduction

Focusing only on SELECT, WHERE, and GROUP BY is enough for basic aggregation, but many real analytical tasks require patterns that go beyond simple queries. Examples include detecting consecutive activity streaks, segmenting customers by spend tier, smoothing noisy time-series data, or tracing plan upgrade paths across rows.

This article walks through 7 practical SQL patterns beyond the basics, focusing on techniques that solve real analytical problems.

Setting Up the Dataset

We’ll use a sample customer transactions table from a fictional subscription software as a service (SaaS) company:

CREATE TABLE transactions (
    transaction_id   SERIAL PRIMARY KEY,
    customer_id      INT,
    plan_type        VARCHAR(20),   -- 'starter', 'pro', 'enterprise'
    amount           NUMERIC(10,2),
    status           VARCHAR(20),   -- 'completed', 'refunded', 'failed'
    created_at       TIMESTAMP
);

The full dataset of 36 transactions across 7 customers, spanning September 2023 through June 2024, is available in seed.sql. Run it before you move on to the queries.

1. Measuring Time Between Events with `LAG()`

LAG() and LEAD() let you access a previous or next row’s value without a self-join. They’re particularly useful for calculating gaps between events like renewal cadence, churn signals, and re-engagement delays.

Task: Calculate how many days elapsed between each customer’s successive completed transactions.

SELECT
    customer_id,
    created_at,
    LAG(created_at) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
    ) AS previous_transaction_at,
    ROUND(
        EXTRACT(EPOCH FROM (
            created_at - LAG(created_at) OVER (
                PARTITION BY customer_id
                ORDER BY created_at
            )
        )) / 86400
    ) AS days_since_last
FROM transactions
WHERE status = 'completed'
ORDER BY customer_id, created_at;

Output (truncated):

customer_id |     created_at      | previous_transaction_at | days_since_last
-------------+---------------------+-------------------------+-----------------
        3317 | 2024-01-03 11:02:00 |                         |
        3317 | 2024-03-15 10:45:00 | 2024-01-03 11:02:00     |              72
        3317 | 2024-05-22 09:30:00 | 2024-03-15 10:45:00     |              68
        4482 | 2023-09-10 09:00:00 |                         |
        4482 | 2023-10-10 09:00:00 | 2023-09-10 09:00:00     |              30
        4482 | 2023-11-10 09:14:00 | 2023-10-10 09:00:00     |              31
        4482 | 2024-01-03 09:14:00 | 2023-11-10 09:14:00     |              54
        4482 | 2024-03-03 08:20:00 | 2024-01-03 09:14:00     |              60
        4482 | 2024-04-03 10:00:00 | 2024-03-03 08:20:00     |              31
        4482 | 2024-05-01 11:00:00 | 2024-04-03 10:00:00     |              28
        ...
        7891 | 2024-02-01 09:00:00 |                         |
        7891 | 2024-04-01 09:00:00 | 2024-02-01 09:00:00     |              60
        7891 | 2024-05-15 09:00:00 | 2024-04-01 09:00:00     |              44
        8810 | 2024-01-05 12:00:00 |                         |
        8810 | 2024-02-05 12:00:00 | 2024-01-05 12:00:00     |              31
        8810 | 2024-04-05 12:00:00 | 2024-02-05 12:00:00     |              60
(29 rows)

The first row per customer always has NULL for both columns — there’s no prior event to reference. EXTRACT(EPOCH ...) converts the timestamp interval to seconds; dividing by 86400 gives days.

LEAD() works the same way but looks forward instead of backward, making it useful for calculating time-to-next-renewal or flagging the last transaction before churn.

2. Comparing a Row to Other Rows in the Same Table with a Self-Join

A self-join relates rows within the same table to each other. It’s the right tool when you need to compare two events for the same entity across time — upgrades, downgrades, re-activations, or any before/after pattern.

Task: Find customers who upgraded from starter to pro (or pro to enterprise) at any point.

SELECT DISTINCT t1.customer_id
FROM transactions t1
JOIN transactions t2
    ON  t1.customer_id = t2.customer_id
    AND t1.plan_type   = 'starter'
    AND t2.plan_type   = 'pro'
    AND t2.created_at  > t1.created_at
WHERE t1.status = 'completed'
  AND t2.status = 'completed'
ORDER BY t1.customer_id;

Output:

customer_id
-------------
        4482
        6204
        7891
(3 rows)

The table is aliased twice (t1, t2) so each alias can represent a different point in time for the same customer. The condition t2.created_at > t1.created_at enforces temporal order — without it, you’d match customers who simply had both plan types in any order, including the wrong one. DISTINCT collapses cases where a customer had multiple starter transactions before upgrading, which would otherwise produce duplicate rows.

This same structure works for detecting downgrades, finding customers who churned and came back, or comparing any two states that need to be ordered by time.

3. Selecting the Top Row per Group with `ROW_NUMBER()`

When you need the top-N rows per category — highest transaction per customer, most recent event per account, first purchase per cohort — ROW_NUMBER() inside a common table expression (CTE) is the standard approach.

Task: Get each customer’s single highest completed transaction.

WITH ranked AS (
    SELECT
        customer_id,
        transaction_id,
        amount,
        plan_type,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY amount DESC, created_at DESC
        ) AS rn
    FROM transactions
    WHERE status = 'completed'
)
SELECT customer_id, transaction_id, amount, plan_type
FROM ranked
WHERE rn = 1
ORDER BY customer_id;

Output:

customer_id  | transaction_id  | amount  | plan_type
-------------+----------------+--------+------------
        3317 |             12 |  19.00 | starter
        4482 |              8 | 299.00 | enterprise
        5901 |             19 | 299.00 | enterprise
        6103 |             25 | 299.00 | enterprise
        6204 |             28 |  79.00 | pro
        7891 |             32 |  79.00 | pro
        8810 |             36 |  79.00 | pro
(7 rows)

ROW_NUMBER() assigns 1 to the row that sorts first within each partition. The outer query then filters to only those rows. The secondary sort on created_at DESC acts as a tiebreaker; when two transactions have the same amount, the more recent one wins.

If you want ties included rather than broken, swap ROW_NUMBER() for RANK(). RANK() assigns the same number to tied rows and skips the next rank (1, 1, 3), while DENSE_RANK() does the same without skipping (1, 1, 2).

4. Segmenting Customers by Spend with `NTILE(n)`

NTILE(n) divides ordered rows into n roughly equal buckets and assigns each row a bucket number. It’s the right tool for customer tiering, spend quartiles, or building cohorts for A/B analysis without hardcoding thresholds.

Task: Rank customers into spend quartiles based on their total completed transaction value.

WITH customer_spend AS (
    SELECT
        customer_id,
        SUM(amount) AS total_spend,
        COUNT(*) AS total_transactions
    FROM transactions
    WHERE status = 'completed'
    GROUP BY customer_id
)
SELECT
    customer_id,
    total_spend,
    total_transactions,
    NTILE(4) OVER (ORDER BY total_spend) AS spend_quartile
FROM customer_spend
ORDER BY total_spend DESC;

Output:

customer_id | total_spend | total_transactions | spend_quartile
-------------+-------------+--------------------+----------------
        5901 |     1495.00 |                  5 |              4
        6103 |      835.00 |                  5 |              3
        4482 |      653.00 |                  7 |              3
        8810 |      237.00 |                  3 |              2
        6204 |      177.00 |                  3 |              2
        7891 |      177.00 |                  3 |              1
        3317 |       57.00 |                  3 |              1
(7 rows)

Quartile 4 is your highest spenders; quartile 1 is your lowest. NTILE() doesn’t hardcode spend thresholds, so the buckets recalibrate automatically as new customers are added. This makes it more robust than static cutoffs like CASE WHEN total_spend > 500.

5. Smoothing Noisy Data with a Rolling Window

A rolling (or moving) average smooths out month-to-month volatility, making trends in time-series data much easier to read. Window functions with an explicit ROWS BETWEEN frame give you precise control over how many periods to include.

Task: Calculate a 3-month rolling average of monthly revenue to smooth out noise.

WITH monthly AS (
    SELECT
        DATE_TRUNC('month', created_at)::DATE AS month,
        SUM(amount) AS monthly_revenue
    FROM transactions
    WHERE status = 'completed'
    GROUP BY DATE_TRUNC('month', created_at)
)
SELECT
    month,
    monthly_revenue,
    ROUND(AVG(monthly_revenue) OVER (
        ORDER BY month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2) AS revenue_3mo_avg
FROM monthly
ORDER BY month;

Output:

month    | monthly_revenue | revenue_3mo_avg
-------------+-----------------+-----------------
 2023-09-01  |           19.00 |           19.00
 2023-10-01  |           19.00 |           19.00
 2023-11-01  |           79.00 |           39.00
 2024-01-01  |          275.00 |          124.33
 2024-02-01  |          476.00 |          276.67
 2024-03-01  |          555.00 |          435.33
 2024-04-01  |          835.00 |          622.00
 2024-05-01  |          775.00 |          721.67
 2024-06-01  |          598.00 |          736.00
(9 rows)

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW tells the window function to look at the current row and the two rows before it. The first two rows use fewer inputs since there’s no prior history, so they act as a 1-month and 2-month average respectively. Swap ROWS for RANGE if you want the frame defined by value distance rather than physical row position — useful when your time series has gaps and you want consistent interval-based windows rather than a fixed row count.

Introduction

Setting Up the Dataset

1. Measuring Time Between Events with LAG()

2. Comparing a Row to Other Rows in the Same Table with a Self-Join

3. Selecting the Top Row per Group with ROW_NUMBER()

4. Segmenting Customers by Spend with NTILE(n)

5. Smoothing Noisy Data with a Rolling Window

1. Measuring Time Between Events with `LAG()`

3. Selecting the Top Row per Group with `ROW_NUMBER()`

4. Segmenting Customers by Spend with `NTILE(n)`