Conquering the SQL Islands Problem: Finding Continuous Ranges in Date Data

Working with date ranges in SQL often presents a classic challenge: identifying “islands” of continuous data amidst potentially overlapping periods. This problem frequently appears in scenarios like analyzing employee work history, managing booking systems, tracking system uptime, or understanding customer activity patterns.

In this post, we’ll explore the “islands” problem, its importance, and how to conquer it using powerful SQL techniques like window functions and common table expressions (CTEs).

The Challenge: Overlapping Dates and Continuous Periods

The “islands” problem arises when we need to group overlapping or contiguous date ranges into single continuous periods. Consider a table tracking employee work assignments:

employee_id: Unique identifier for each employee.
start_date: Start date of the assignment.
end_date: End date of the assignment.

The challenge is to identify the continuous periods of work for each employee, even when assignments overlap.

Sample Data: Let’s Get Practical

Let’s create a sample table and populate it with data:

CREATE TABLE work_assignments (
    employee_id INT,
    start_date DATE,
    end_date DATE
);

INSERT INTO work_assignments (employee_id, start_date, end_date) VALUES
(1, '2024-01-01', '2024-01-15'),
(1, '2024-01-10', '2024-01-25'),
(1, '2024-02-01', '2024-02-10'),
(1, '2024-02-15', '2024-02-28'),
(2, '2024-01-05', '2024-01-20'),
(2, '2024-01-18', '2024-02-05'),
(1, '2024-03-01', NULL); -- Added an ongoing assignment

The SQL Solution: Step-by-Step

Here’s the SQL query to identify the islands:

WITH date_ranges AS (
    SELECT
        employee_id,
        start_date,
        end_date,
        LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS prev_end_date
    FROM
        work_assignments
),
islands AS (
    SELECT
        employee_id,
        start_date,
        end_date,
        SUM(CASE WHEN start_date <= prev_end_date THEN 0 ELSE 1 END) 
            OVER (PARTITION BY employee_id ORDER BY start_date) AS island_group
    FROM
        date_ranges
)
SELECT
    employee_id,
    MIN(start_date) AS island_start_date,
    MAX(end_date) AS island_end_date
FROM
    islands
GROUP BY
    employee_id, island_group
ORDER BY
    employee_id, island_start_date;

Breaking Down the Query:

The date_ranges CTE:

WITH date_ranges AS (
    SELECT
        employee_id,
        start_date,
        end_date,
        LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS prev_end_date
    FROM
        work_assignments
)
SELECT * FROM date_ranges;

Output of date_ranges:

employee_id	start_date	end_date	prev_end_date
1	2024-01-01	2024-01-15	NULL
1	2024-01-10	2024-01-25	2024-01-15
1	2024-02-01	2024-02-10	2024-01-25
1	2024-02-15	2024-02-28	2024-02-10
2	2024-01-05	2024-01-20	NULL
2	2024-01-18	2024-02-05	2024-01-20

The LAG(end_date) window function gets the end_date of the previous row within the same employee_id partition, ordered by start_date. This prev_end_date is crucial for determining overlaps.

** The islands CTE:**

WITH date_ranges AS (
    -- ... (previous code)
),
islands AS (
    SELECT
        employee_id,
        start_date,
        end_date,
        SUM(CASE WHEN start_date <= prev_end_date THEN 0 ELSE 1 END) 
            OVER (PARTITION BY employee_id ORDER BY start_date) AS island_group
    FROM
        date_ranges
)
SELECT * FROM islands;

Output of islands:

employee_id	start_date	end_date	island_group
1	2024-01-01	2024-01-15	1
1	2024-01-10	2024-01-25	1
1	2024-02-01	2024-02-10	2
1	2024-02-15	2024-02-28	3
2	2024-01-05	2024-01-20	1
2	2024-01-18	2024-02-05	1

The SUM(CASE WHEN start_date <= prev_end_date THEN 0 ELSE 1 END) OVER (...) expression is the core of the island detection. It’s a running sum. If a start_date is less than or equal to the prev_end_date, it means there’s an overlap or the ranges are contiguous, so we add 0 to the running sum, keeping the island_group the same. If there’s a gap (start_date > prev_end_date), we add 1, incrementing the island_group and starting a new island.

The Final SELECT Statement:

This statement groups the results by employee_id and island_group to find the minimum start_date and maximum end_date for each island. (This part remains the same as in the original explanation). This gives you the start and end dates of each continuous period of work.

Handling Edge Cases

The provided SQL solution handles most common scenarios effectively. However, certain edge cases require attention:

NULL end_date (Ongoing Assignments): A NULL end_date usually signifies an ongoing assignment. The current query handles this gracefully as the LAG() function will treat NULL as less than any non-NULL date. This ensures that ongoing assignments are correctly included in islands. To illustrate, let’s add an ongoing assignment to our sample data:

INSERT INTO work_assignments (employee_id, start_date, end_date) VALUES
(1, '2024-03-01', NULL);

Now, the final query will correctly include this ongoing assignment as a new island for employee 1.

Single-Day Assignments: Single-day assignments (where start_date = end_date) are handled correctly as well. They are treated as normal date ranges and included in the appropriate islands based on overlaps or contiguity.

Overlapping End and Start Dates: If an employee’s end_date and the subsequent start_date are the same, this will be treated as contiguous (part of the same island) by the current query. If you need to treat this as a gap, you’d need to modify the islands CTE’s CASE statement to check for start_date > prev_end_date instead of start_date <= prev_end_date.

Alternative Approaches

While the presented solution using LAG() and a running sum is efficient and clear, there are alternative approaches to the gaps and islands problem:

Self-Join: You can join the work_assignments table to itself, comparing each row with all subsequent rows for the same employee. This approach can be less efficient than using window functions, especially for large datasets.

SELECT wa1.employee_id, wa1.start_date, wa2.end_date
FROM work_assignments wa1
JOIN work_assignments wa2 ON wa1.employee_id = wa2.employee_id
    AND wa1.start_date <= wa2.end_date AND wa2.start_date <= wa1.end_date -- Overlap condition
WHERE NOT EXISTS (
    SELECT 1
    FROM work_assignments wa3
    WHERE wa3.employee_id = wa1.employee_id
      AND wa3.start_date < wa1.start_date
      AND wa3.end_date >= wa1.start_date
);

Recursive CTEs: A recursive CTE can be used to iteratively build the islands. This approach can be more complex to understand but can be useful for handling more intricate scenarios.

Numbering and Grouping: Assign row numbers within each employee partition ordered by start and end dates. Then, group rows where the difference between row numbers and a partitioned row number based on start and end dates are the same. This approach can be more complex but offers another way to identify the islands.

The choice of approach depends on the specific requirements of the situation, the size of the data, and performance considerations. The LAG() and running sum method offers a good balance of clarity and efficiency for many common use cases.

The Task Activity Problem

Let’s say you have a dataset that records when tasks are assigned to an employee. For each task, you have a start date and an end date. Your goal is to determine the periods when tasks were active on a day-to-day basis and to track which tasks were active on any given date.

Here’s an example dataset:

employee_id	date	type	task_id
1	2024-01-01	start	101
1	2024-01-15	end	101
1	2024-01-10	start	102
1	2024-01-25	end	102
1	2024-02-01	start	103
1	2024-02-10	end	103
1	2024-02-15	start	104
1	2024-02-28	end	104

This dataset contains both start and end events for four tasks assigned to the same employee. We want to create a representation of each day between the start and end dates and determine which tasks were active on that day.

Step-by-Step Solution in Snowflake SQL

1. Create the Recursive CTE to Generate Dates

We will use a recursive CTE to generate all the dates between the start and end dates for each task_id.

2. Group by Date to Collect Active Tasks

We will group the generated dates by date and collect the active tasks into an array using Snowflake’s ARRAY_AGG() function.

3. Generate a Full Date Range

We will also generate a full date range to ensure that we include dates with no tasks (gaps). This will cover the entire period from the earliest start date to the latest end date.

4. Left Join to Include Dates with No Tasks

Finally, we’ll perform a left join between the full date range and the active tasks per day to ensure that days with no tasks are included (as NULL).

SQL Query in Snowflake

WITH task_data AS (
    -- Your raw data
    SELECT * FROM (
        VALUES
            (1, '2024-01-01', 'start', 101),
            (1, '2024-01-15', 'end', 101),
            (1, '2024-01-10', 'start', 102),
            (1, '2024-01-25', 'end', 102),
            (1, '2024-02-01', 'start', 103),
            (1, '2024-02-10', 'end', 103),
            (1, '2024-02-15', 'start', 104),
            (1, '2024-02-28', 'end', 104)
    ) AS tasks(employee_id, date, type, task_id)
),
task_ranges AS (
    -- Pivot the start and end dates for each task
    SELECT
        task_id,
        MIN(CASE WHEN type = 'start' THEN date END) AS start_date,
        MIN(CASE WHEN type = 'end' THEN date END) AS end_date
    FROM task_data
    GROUP BY task_id
),
date_exploded AS (
    -- Recursive CTE to generate all dates between start and end for each task
    SELECT
        task_id,
        start_date AS date
    FROM task_ranges
    UNION ALL
    SELECT
        task_id,
        DATEADD('DAY', 1, date) AS date
    FROM date_exploded
    JOIN task_ranges ON date_exploded.task_id = task_ranges.task_id
    WHERE date < task_ranges.end_date
),
active_tasks_per_day AS (
    -- Group by date and aggregate active tasks into an array
    SELECT
        date,
        ARRAY_AGG(task_id) AS active_tasks
    FROM date_exploded
    GROUP BY date
),
full_date_range AS (
    -- Generate full date range from the earliest start date to the latest end date
    SELECT
        DATEADD('DAY', SEQ4(), (SELECT MIN(start_date) FROM task_ranges)) AS date
    FROM TABLE(GENERATOR(ROWCOUNT => DATEDIFF('DAY', (SELECT MIN(start_date) FROM task_ranges), (SELECT MAX(end_date) FROM task_ranges))))
)
-- Left join to ensure all dates are included, even those with no tasks
SELECT
    fdr.date,
    atpd.active_tasks
FROM full_date_range fdr
LEFT JOIN active_tasks_per_day atpd ON fdr.date = atpd.date
ORDER BY fdr.date;

Explanation of the Query:

task_data CTE: This is where we load the raw data, simulating your table of tasks with their start and end dates.
task_ranges CTE: We pivot the data to get each task’s start_date and end_date using CASE statements. This creates a row for each task_id with its start and end dates.
date_exploded CTE: This is the key part. We use a recursive CTE to generate all dates between the start_date and end_date for each task. It starts with the start_date and recursively adds one day (DATEADD('DAY', 1, ...)) until it reaches the end_date.
active_tasks_per_day CTE: Here, we group the generated dates by date and use the ARRAY_AGG() function to collect the task_ids into an array (which corresponds to the list of active tasks on that day).
full_date_range CTE: This generates a continuous sequence of dates from the earliest start_date to the latest end_date across all tasks. We use Snowflake’s TABLE(GENERATOR(...)) function to create a sequence of numbers and convert them into dates.
Final SELECT: We perform a left join between the full date range and the active tasks per day to ensure all dates are included in the final output, even if no tasks are active on those dates. Any dates with no active tasks will show NULL in the active_tasks column.

Output:

The query will give you the following output:

         date active_tasks
0  2024-01-01       [101]
1  2024-01-02       [101]
2  2024-01-03       [101]
3  2024-01-04       [101]
4  2024-01-05       [101]
5  2024-01-06       [101]
6  2024-01-07       [101]
7  2024-01-08       [101]
8  2024-01-09       [101]
9  2024-01-10  [101, 102]
10 2024-01-11  [101, 102]
11 2024-01-12  [101, 102]
12 2024-01-13  [101, 102]
13 2024-01-14  [101, 102]
14 2024-01-15  [101, 102]
15 2024-01-16       [102]
16 2024-01-17       [102]
17 2024-01-18       [102]
18 2024-01-19       [102]
19 2024-01-20       [102]
20 2024-01-21       [102]
21 2024-01-22       [102]
22 2024-01-23       [102]
23 2024-01-24       [102]
24 2024-01-25       [102]
25 2024-01-26        NULL
26 2024-01-27        NULL
27 2024-01-28        NULL
28 2024-01-29        NULL
29 2024-01-30        NULL
30 2024-01-31        NULL
31 2024-02-01       [103]
...

Key Points:

Recursive CTE: We used recursion to generate all the dates between the start and end dates for each task.
ARRAY_AGG(): This Snowflake function allowed us to aggregate the tasks active on each day into an array.
Generator for Full Date Range: Snowflake’s GENERATOR function was used to create a sequence of dates, ensuring all dates are included, even those with no active tasks.
Left Join: Ensures that dates with no tasks appear in the final results as NULL.

This approach efficiently solves the problem of generating a list of active tasks per day, including days where no tasks are active, in Snowflake SQL.