4 min read
Conquering the SQL Islands Problem: Finding Continuous Ranges in Date Data

Working with date ranges in SQL often presents a classic challenge: identifying “islands” of continuous data amidst potentially overlapping periods. This problem frequently appears in scenarios like analyzing employee work history, managing booking systems, tracking system uptime, or understanding customer activity patterns.

In this post, we’ll explore the “islands” problem, its importance, and how to conquer it using powerful SQL techniques like window functions and common table expressions (CTEs).

The Challenge: Overlapping Dates and Continuous Periods

The “islands” problem arises when we need to group overlapping or contiguous date ranges into single continuous periods. Consider a table tracking employee work assignments:

  • employee_id: Unique identifier for each employee.
  • start_date: Start date of the assignment.
  • end_date: End date of the assignment.

The challenge is to identify the continuous periods of work for each employee, even when assignments overlap.

Sample Data: Let’s Get Practical

Let’s create a sample table and populate it with data:

CREATE TABLE work_assignments (
    employee_id INT,
    start_date DATE,
    end_date DATE
);

INSERT INTO work_assignments (employee_id, start_date, end_date) VALUES
(1, '2024-01-01', '2024-01-15'),
(1, '2024-01-10', '2024-01-25'),
(1, '2024-02-01', '2024-02-10'),
(1, '2024-02-15', '2024-02-28'),
(2, '2024-01-05', '2024-01-20'),
(2, '2024-01-18', '2024-02-05'),
(1, '2024-03-01', NULL); -- Added an ongoing assignment

The SQL Solution: Step-by-Step

Here’s the SQL query to identify the islands:

WITH date_ranges AS (
    SELECT
        employee_id,
        start_date,
        end_date,
        LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS prev_end_date
    FROM
        work_assignments
),
islands AS (
    SELECT
        employee_id,
        start_date,
        end_date,
        SUM(CASE WHEN start_date <= prev_end_date THEN 0 ELSE 1 END) 
            OVER (PARTITION BY employee_id ORDER BY start_date) AS island_group
    FROM
        date_ranges
)
SELECT
    employee_id,
    MIN(start_date) AS island_start_date,
    MAX(end_date) AS island_end_date
FROM
    islands
GROUP BY
    employee_id, island_group
ORDER BY
    employee_id, island_start_date;

Breaking Down the Query:

** The date_ranges CTE:**

WITH date_ranges AS (
    SELECT
        employee_id,
        start_date,
        end_date,
        LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS prev_end_date
    FROM
        work_assignments
)
SELECT * FROM date_ranges;

Output of date_ranges:

employee_idstart_dateend_dateprev_end_date
12024-01-012024-01-15NULL
12024-01-102024-01-252024-01-15
12024-02-012024-02-102024-01-25
12024-02-152024-02-282024-02-10
22024-01-052024-01-20NULL
22024-01-182024-02-052024-01-20

The LAG(end_date) window function gets the end_date of the previous row within the same employee_id partition, ordered by start_date. This prev_end_date is crucial for determining overlaps.

** The islands CTE:**

WITH date_ranges AS (
    -- ... (previous code)
),
islands AS (
    SELECT
        employee_id,
        start_date,
        end_date,
        SUM(CASE WHEN start_date <= prev_end_date THEN 0 ELSE 1 END) 
            OVER (PARTITION BY employee_id ORDER BY start_date) AS island_group
    FROM
        date_ranges
)
SELECT * FROM islands;

Output of islands:

employee_idstart_dateend_dateisland_group
12024-01-012024-01-151
12024-01-102024-01-251
12024-02-012024-02-102
12024-02-152024-02-283
22024-01-052024-01-201
22024-01-182024-02-051

The SUM(CASE WHEN start_date <= prev_end_date THEN 0 ELSE 1 END) OVER (...) expression is the core of the island detection. It’s a running sum. If a start_date is less than or equal to the prev_end_date, it means there’s an overlap or the ranges are contiguous, so we add 0 to the running sum, keeping the island_group the same. If there’s a gap (start_date > prev_end_date), we add 1, incrementing the island_group and starting a new island.

The Final SELECT Statement:

This statement groups the results by employee_id and island_group to find the minimum start_date and maximum end_date for each island. (This part remains the same as in the original explanation). This gives you the start and end dates of each continuous period of work.

Handling Edge Cases

The provided SQL solution handles most common scenarios effectively. However, certain edge cases require attention:

NULL end_date (Ongoing Assignments): A NULL end_date usually signifies an ongoing assignment. The current query handles this gracefully as the LAG() function will treat NULL as less than any non-NULL date. This ensures that ongoing assignments are correctly included in islands. To illustrate, let’s add an ongoing assignment to our sample data:

INSERT INTO work_assignments (employee_id, start_date, end_date) VALUES
(1, '2024-03-01', NULL);

Now, the final query will correctly include this ongoing assignment as a new island for employee 1.

Single-Day Assignments: Single-day assignments (where start_date = end_date) are handled correctly as well. They are treated as normal date ranges and included in the appropriate islands based on overlaps or contiguity.

Overlapping End and Start Dates: If an employee’s end_date and the subsequent start_date are the same, this will be treated as contiguous (part of the same island) by the current query. If you need to treat this as a gap, you’d need to modify the islands CTE’s CASE statement to check for start_date > prev_end_date instead of start_date <= prev_end_date.

Alternative Approaches

While the presented solution using LAG() and a running sum is efficient and clear, there are alternative approaches to the gaps and islands problem:

Self-Join: You can join the work_assignments table to itself, comparing each row with all subsequent rows for the same employee. This approach can be less efficient than using window functions, especially for large datasets.

SELECT wa1.employee_id, wa1.start_date, wa2.end_date
FROM work_assignments wa1
JOIN work_assignments wa2 ON wa1.employee_id = wa2.employee_id
    AND wa1.start_date <= wa2.end_date AND wa2.start_date <= wa1.end_date -- Overlap condition
WHERE NOT EXISTS (
    SELECT 1
    FROM work_assignments wa3
    WHERE wa3.employee_id = wa1.employee_id
      AND wa3.start_date < wa1.start_date
      AND wa3.end_date >= wa1.start_date
);

Recursive CTEs: A recursive CTE can be used to iteratively build the islands. This approach can be more complex to understand but can be useful for handling more intricate scenarios.

Numbering and Grouping: Assign row numbers within each employee partition ordered by start and end dates. Then, group rows where the difference between row numbers and a partitioned row number based on start and end dates are the same. This approach can be more complex but offers another way to identify the islands.

The choice of approach depends on the specific requirements of the situation, the size of the data, and performance considerations. The LAG() and running sum method offers a good balance of clarity and efficiency for many common use cases.

Conclusion

Mastering the “islands” problem empowers you to tackle complex date range scenarios in SQL. By understanding these techniques, you’ll gain valuable insights from your time-based data and make more informed decisions.