Working with date ranges in SQL often presents a classic challenge: identifying “islands” of continuous data amidst potentially overlapping periods. This problem frequently appears in scenarios like analyzing employee work history, managing booking systems, tracking system uptime, or understanding customer activity patterns.
In this post, we’ll explore the “islands” problem, its importance, and how to conquer it using powerful SQL techniques like window functions and common table expressions (CTEs).
The Challenge: Overlapping Dates and Continuous Periods
The “islands” problem arises when we need to group overlapping or contiguous date ranges into single continuous periods. Consider a table tracking employee work assignments:
employee_id
: Unique identifier for each employee.start_date
: Start date of the assignment.end_date
: End date of the assignment.
The challenge is to identify the continuous periods of work for each employee, even when assignments overlap.
Sample Data: Let’s Get Practical
Let’s create a sample table and populate it with data:
CREATE TABLE work_assignments (
employee_id INT,
start_date DATE,
end_date DATE
);
INSERT INTO work_assignments (employee_id, start_date, end_date) VALUES
(1, '2024-01-01', '2024-01-15'),
(1, '2024-01-10', '2024-01-25'),
(1, '2024-02-01', '2024-02-10'),
(1, '2024-02-15', '2024-02-28'),
(2, '2024-01-05', '2024-01-20'),
(2, '2024-01-18', '2024-02-05'),
(1, '2024-03-01', NULL); -- Added an ongoing assignment
The SQL Solution: Step-by-Step
Here’s the SQL query to identify the islands:
WITH date_ranges AS (
SELECT
employee_id,
start_date,
end_date,
LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS prev_end_date
FROM
work_assignments
),
islands AS (
SELECT
employee_id,
start_date,
end_date,
SUM(CASE WHEN start_date <= prev_end_date THEN 0 ELSE 1 END)
OVER (PARTITION BY employee_id ORDER BY start_date) AS island_group
FROM
date_ranges
)
SELECT
employee_id,
MIN(start_date) AS island_start_date,
MAX(end_date) AS island_end_date
FROM
islands
GROUP BY
employee_id, island_group
ORDER BY
employee_id, island_start_date;
Breaking Down the Query:
The date_ranges
CTE:
WITH date_ranges AS (
SELECT
employee_id,
start_date,
end_date,
LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS prev_end_date
FROM
work_assignments
)
SELECT * FROM date_ranges;
Output of date_ranges
:
employee_id | start_date | end_date | prev_end_date |
---|---|---|---|
1 | 2024-01-01 | 2024-01-15 | NULL |
1 | 2024-01-10 | 2024-01-25 | 2024-01-15 |
1 | 2024-02-01 | 2024-02-10 | 2024-01-25 |
1 | 2024-02-15 | 2024-02-28 | 2024-02-10 |
2 | 2024-01-05 | 2024-01-20 | NULL |
2 | 2024-01-18 | 2024-02-05 | 2024-01-20 |
The LAG(end_date)
window function gets the end_date
of the previous row within the same employee_id
partition, ordered by start_date
. This prev_end_date
is crucial for determining overlaps.
** The islands
CTE:**
WITH date_ranges AS (
-- ... (previous code)
),
islands AS (
SELECT
employee_id,
start_date,
end_date,
SUM(CASE WHEN start_date <= prev_end_date THEN 0 ELSE 1 END)
OVER (PARTITION BY employee_id ORDER BY start_date) AS island_group
FROM
date_ranges
)
SELECT * FROM islands;
Output of islands
:
employee_id | start_date | end_date | island_group |
---|---|---|---|
1 | 2024-01-01 | 2024-01-15 | 1 |
1 | 2024-01-10 | 2024-01-25 | 1 |
1 | 2024-02-01 | 2024-02-10 | 2 |
1 | 2024-02-15 | 2024-02-28 | 3 |
2 | 2024-01-05 | 2024-01-20 | 1 |
2 | 2024-01-18 | 2024-02-05 | 1 |
The SUM(CASE WHEN start_date <= prev_end_date THEN 0 ELSE 1 END) OVER (...)
expression is the core of the island detection. It’s a running sum. If a start_date
is less than or equal to the prev_end_date
, it means there’s an overlap or the ranges are contiguous, so we add 0 to the running sum, keeping the island_group
the same. If there’s a gap (start_date > prev_end_date), we add 1, incrementing the island_group
and starting a new island.
The Final SELECT
Statement:
This statement groups the results by employee_id
and island_group
to find the minimum start_date
and maximum end_date
for each island. (This part remains the same as in the original explanation). This gives you the start and end dates of each continuous period of work.
Handling Edge Cases
The provided SQL solution handles most common scenarios effectively. However, certain edge cases require attention:
NULL end_date
(Ongoing Assignments): A NULL
end_date
usually signifies an ongoing assignment. The current query handles this gracefully as the LAG()
function will treat NULL
as less than any non-NULL
date. This ensures that ongoing assignments are correctly included in islands. To illustrate, let’s add an ongoing assignment to our sample data:
INSERT INTO work_assignments (employee_id, start_date, end_date) VALUES
(1, '2024-03-01', NULL);
Now, the final query will correctly include this ongoing assignment as a new island for employee 1.
Single-Day Assignments: Single-day assignments (where start_date
= end_date
) are handled correctly as well. They are treated as normal date ranges and included in the appropriate islands based on overlaps or contiguity.
Overlapping End and Start Dates: If an employee’s end_date and the subsequent start_date are the same, this will be treated as contiguous (part of the same island) by the current query. If you need to treat this as a gap, you’d need to modify the islands
CTE’s CASE
statement to check for start_date > prev_end_date
instead of start_date <= prev_end_date
.
Alternative Approaches
While the presented solution using LAG()
and a running sum is efficient and clear, there are alternative approaches to the gaps and islands problem:
Self-Join: You can join the work_assignments
table to itself, comparing each row with all subsequent rows for the same employee. This approach can be less efficient than using window functions, especially for large datasets.
SELECT wa1.employee_id, wa1.start_date, wa2.end_date
FROM work_assignments wa1
JOIN work_assignments wa2 ON wa1.employee_id = wa2.employee_id
AND wa1.start_date <= wa2.end_date AND wa2.start_date <= wa1.end_date -- Overlap condition
WHERE NOT EXISTS (
SELECT 1
FROM work_assignments wa3
WHERE wa3.employee_id = wa1.employee_id
AND wa3.start_date < wa1.start_date
AND wa3.end_date >= wa1.start_date
);
Recursive CTEs: A recursive CTE can be used to iteratively build the islands. This approach can be more complex to understand but can be useful for handling more intricate scenarios.
Numbering and Grouping: Assign row numbers within each employee partition ordered by start and end dates. Then, group rows where the difference between row numbers and a partitioned row number based on start and end dates are the same. This approach can be more complex but offers another way to identify the islands.
The choice of approach depends on the specific requirements of the situation, the size of the data, and performance considerations. The LAG()
and running sum method offers a good balance of clarity and efficiency for many common use cases.
The Task Activity Problem
Let’s say you have a dataset that records when tasks are assigned to an employee. For each task, you have a start
date and an end
date. Your goal is to determine the periods when tasks were active on a day-to-day basis and to track which tasks were active on any given date.
Here’s an example dataset:
employee_id | date | type | task_id |
---|---|---|---|
1 | 2024-01-01 | start | 101 |
1 | 2024-01-15 | end | 101 |
1 | 2024-01-10 | start | 102 |
1 | 2024-01-25 | end | 102 |
1 | 2024-02-01 | start | 103 |
1 | 2024-02-10 | end | 103 |
1 | 2024-02-15 | start | 104 |
1 | 2024-02-28 | end | 104 |
This dataset contains both start
and end
events for four tasks assigned to the same employee. We want to create a representation of each day between the start and end dates and determine which tasks were active on that day.
Step-by-Step Solution in Snowflake SQL
1. Create the Recursive CTE to Generate Dates
We will use a recursive CTE to generate all the dates between the start
and end
dates for each task_id
.
2. Group by Date to Collect Active Tasks
We will group the generated dates by date
and collect the active tasks into an array using Snowflake’s ARRAY_AGG()
function.
3. Generate a Full Date Range
We will also generate a full date range to ensure that we include dates with no tasks (gaps). This will cover the entire period from the earliest start date to the latest end date.
4. Left Join to Include Dates with No Tasks
Finally, we’ll perform a left join between the full date range and the active tasks per day to ensure that days with no tasks are included (as NULL
).
SQL Query in Snowflake
WITH task_data AS (
-- Your raw data
SELECT * FROM (
VALUES
(1, '2024-01-01', 'start', 101),
(1, '2024-01-15', 'end', 101),
(1, '2024-01-10', 'start', 102),
(1, '2024-01-25', 'end', 102),
(1, '2024-02-01', 'start', 103),
(1, '2024-02-10', 'end', 103),
(1, '2024-02-15', 'start', 104),
(1, '2024-02-28', 'end', 104)
) AS tasks(employee_id, date, type, task_id)
),
task_ranges AS (
-- Pivot the start and end dates for each task
SELECT
task_id,
MIN(CASE WHEN type = 'start' THEN date END) AS start_date,
MIN(CASE WHEN type = 'end' THEN date END) AS end_date
FROM task_data
GROUP BY task_id
),
date_exploded AS (
-- Recursive CTE to generate all dates between start and end for each task
SELECT
task_id,
start_date AS date
FROM task_ranges
UNION ALL
SELECT
task_id,
DATEADD('DAY', 1, date) AS date
FROM date_exploded
JOIN task_ranges ON date_exploded.task_id = task_ranges.task_id
WHERE date < task_ranges.end_date
),
active_tasks_per_day AS (
-- Group by date and aggregate active tasks into an array
SELECT
date,
ARRAY_AGG(task_id) AS active_tasks
FROM date_exploded
GROUP BY date
),
full_date_range AS (
-- Generate full date range from the earliest start date to the latest end date
SELECT
DATEADD('DAY', SEQ4(), (SELECT MIN(start_date) FROM task_ranges)) AS date
FROM TABLE(GENERATOR(ROWCOUNT => DATEDIFF('DAY', (SELECT MIN(start_date) FROM task_ranges), (SELECT MAX(end_date) FROM task_ranges))))
)
-- Left join to ensure all dates are included, even those with no tasks
SELECT
fdr.date,
atpd.active_tasks
FROM full_date_range fdr
LEFT JOIN active_tasks_per_day atpd ON fdr.date = atpd.date
ORDER BY fdr.date;
Explanation of the Query:
-
task_data CTE: This is where we load the raw data, simulating your table of tasks with their
start
andend
dates. -
task_ranges CTE: We pivot the data to get each task’s
start_date
andend_date
usingCASE
statements. This creates a row for eachtask_id
with itsstart
andend
dates. -
date_exploded CTE: This is the key part. We use a recursive CTE to generate all dates between the
start_date
andend_date
for each task. It starts with thestart_date
and recursively adds one day (DATEADD('DAY', 1, ...)
) until it reaches theend_date
. -
active_tasks_per_day CTE: Here, we group the generated dates by
date
and use theARRAY_AGG()
function to collect thetask_id
s into an array (which corresponds to the list of active tasks on that day). -
full_date_range CTE: This generates a continuous sequence of dates from the earliest
start_date
to the latestend_date
across all tasks. We use Snowflake’sTABLE(GENERATOR(...))
function to create a sequence of numbers and convert them into dates. -
Final SELECT: We perform a left join between the full date range and the active tasks per day to ensure all dates are included in the final output, even if no tasks are active on those dates. Any dates with no active tasks will show
NULL
in theactive_tasks
column.
Output:
The query will give you the following output:
date active_tasks
0 2024-01-01 [101]
1 2024-01-02 [101]
2 2024-01-03 [101]
3 2024-01-04 [101]
4 2024-01-05 [101]
5 2024-01-06 [101]
6 2024-01-07 [101]
7 2024-01-08 [101]
8 2024-01-09 [101]
9 2024-01-10 [101, 102]
10 2024-01-11 [101, 102]
11 2024-01-12 [101, 102]
12 2024-01-13 [101, 102]
13 2024-01-14 [101, 102]
14 2024-01-15 [101, 102]
15 2024-01-16 [102]
16 2024-01-17 [102]
17 2024-01-18 [102]
18 2024-01-19 [102]
19 2024-01-20 [102]
20 2024-01-21 [102]
21 2024-01-22 [102]
22 2024-01-23 [102]
23 2024-01-24 [102]
24 2024-01-25 [102]
25 2024-01-26 NULL
26 2024-01-27 NULL
27 2024-01-28 NULL
28 2024-01-29 NULL
29 2024-01-30 NULL
30 2024-01-31 NULL
31 2024-02-01 [103]
...
Key Points:
- Recursive CTE: We used recursion to generate all the dates between the
start
andend
dates for each task. - ARRAY_AGG(): This Snowflake function allowed us to aggregate the tasks active on each day into an array.
- Generator for Full Date Range: Snowflake’s
GENERATOR
function was used to create a sequence of dates, ensuring all dates are included, even those with no active tasks. - Left Join: Ensures that dates with no tasks appear in the final results as
NULL
.
This approach efficiently solves the problem of generating a list of active tasks per day, including days where no tasks are active, in Snowflake SQL.