Solving the Gaps and Islands Problem Using Python Pandas

The “gaps and islands” problem is a well-known challenge in SQL, where continuous sequences (islands) and gaps between them need to be identified, particularly in date ranges. While this problem is traditionally solved using SQL (see my post here), Python’s Pandas library offers a versatile and intuitive way to tackle it as well.

In this post, we’ll explore how two ways to solve the gaps and islands problem using the Python package Pandas.

The Problem: Gaps and Islands

As a quick recap, the “gaps and islands” problem involves identifying sequences of contiguous records (islands) and the gaps between them. This is particularly common when working with date ranges, such as tracking employee work periods, booking reservations, or monitoring system uptime.

Example Scenario

Imagine you have a DataFrame that tracks employee work assignments with the following columns:

employee_id: An integer that uniquely identifies an employee.
start_date: The date when the employee’s assignment starts.
end_date: The date when the employee’s assignment ends.

The goal is to identify continuous periods of work (islands) for each employee and highlight any gaps between these periods. Let’s visualize this with a simple example:

| employee_id | start_date | end_date   |
|-------------|------------|------------|
| 1           | 2024-01-01 | 2024-01-15 |
| 1           | 2024-01-10 | 2024-01-25 |  <- Overlapping, part of the same island for employee_id=1
| 1           | 2024-02-01 | 2024-02-10 |  <- New island
| 1           | 2024-02-15 | 2024-02-28 |  <- New island
| 2           | 2024-01-05 | 2024-01-20 |
| 2           | 2024-01-18 | 2024-02-05 |  <- Overlapping, part of the same island for employee_id=2

The Pandas Solution

To solve this problem using Pandas, we’ll follow these steps:

Sort the Data: Ensure the data is sorted by employee ID and start date.
Identify Gaps: Create a boolean mask that identifies gaps between consecutive date ranges. We’ll compare the current start_date with the previous end_date.
Group Islands: Use the mask to create a group identifier for each continuous sequence (island). The cumsum() function is key here.
Aggregate the Islands: Aggregate the data to get the start and end dates of each island (This will be added in the enhanced example).

Here’s how you can implement this in Python using Pandas:

import pandas as pd

data = {'employee_id': [1, 1, 1, 1, 2, 2],
        'start_date': ['2024-01-01', '2024-01-10', '2024-02-01', '2024-02-15', '2024-01-05', '2024-01-18'],
        'end_date': ['2024-01-15', '2024-01-25', '2024-02-10', '2024-02-28', '2024-01-20', '2024-02-05']}

df = pd.DataFrame(data)

# Convert dates to datetime objects
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

# Sort by employee_id and start_date
df = df.sort_values(['employee_id', 'start_date'])

# Group by employee_id and apply the island grouping logic
df['island_group'] = df.groupby('employee_id').apply(
    lambda g: (g['start_date'] > g['end_date'].shift()).cumsum() + 1  # The core logic
).reset_index(drop=True)

# Display the result
print(df)

# Aggregate to get island start and end dates
islands = df.groupby(['employee_id', 'island_group']).agg(
    island_start=('start_date', 'min'),
    island_end=('end_date', 'max')
).reset_index()
print("\nAggregated Islands:")
print(islands)

Breaking Down the Code

Data Preparation: The code begins by creating a Pandas DataFrame and converting the date strings to datetime objects. This is crucial for date comparisons.

Sorting: The sort_values function sorts the DataFrame by employee_id and start_date, which is essential for identifying consecutive periods.

Island Grouping: This is the core of the solution. The groupby method groups the data by employee_id. Within each group, the apply method uses a lambda function to perform the following:

g['end_date'].shift(): Shifts the end_date one row up within each employee group. This allows comparison of the current start_date with the previous end_date.
g['start_date'] > g['end_date'].shift(): This comparison creates a boolean Series. True indicates a gap (the current start_date is after the previous end_date), and False indicates continuity.
.cumsum(): The cumulative sum of the boolean Series creates the island_group identifier. Each True value increments the group number, effectively marking the start of a new island. The +1 is added to start group numbering from 1 instead of 0.

The Task Activity Problem (An application of Gaps and Islands)

Let’s say you have a dataset that records when tasks are assigned to an employee. For each task, you have a start date and an end date. Your goal is to determine the periods when tasks were active on a day-to-day basis and to track which tasks were active on any given date.

Here’s an example dataset:

employee_id	date	type	task_id
1	2024-01-01	start	101
1	2024-01-15	end	101
1	2024-01-10	start	102
1	2024-01-25	end	102
1	2024-02-01	start	103
1	2024-02-10	end	103
1	2024-02-15	start	104
1	2024-02-28	end	104

This dataset contains both start and end events for four tasks assigned to the same employee. We want to create a representation of each day between the start and end dates and determine which tasks were active on that day.

The Solution

We’ll use the power of pandas to:

Reshape the data so that each task has both its start and end dates in the same row.
Create a date range for each task’s active period.
Explode the date ranges into individual rows, with one row for each day the task is active.
Group the tasks by date to track which tasks are active on which day.

Let’s walk through the solution.

Step 1: Setup the Data

We begin by creating a pandas DataFrame from our sample data:

import pandas as pd

# Sample data
data = {
    'employee_id': [1, 1, 1, 1, 1, 1, 1, 1],
    'date': ['2024-01-01', '2024-01-15', '2024-01-10', '2024-01-25', '2024-02-01', '2024-02-10', '2024-02-15', '2024-02-28'],
    'type': ['start', 'end', 'start', 'end', 'start', 'end', 'start', 'end'],
    'task_id': [101, 101, 102, 102, 103, 103, 104, 104]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Convert the 'date' column to datetime
df['date'] = pd.to_datetime(df['date'])

Here, we have converted the date column to a datetime object so that we can easily manipulate the dates.

Step 2: Reshape the Data

Next, we want to reshape the data so that each task (task_id) has both its start and end dates in the same row. We’ll use the pivot function for this:

# Reshape the data to get task-specific start and end dates
df_tasks = df.pivot(index='task_id', columns='type', values='date').reset_index()

This transforms the original DataFrame into the following format:

task_id	start	end
101	2024-01-01	2024-01-15
102	2024-01-10	2024-01-25
103	2024-02-01	2024-02-10
104	2024-02-15	2024-02-28

Now, each task has its start and end dates in a single row.

Step 3: Create Date Ranges for Each Task

Next, we’ll generate a date range for each task, representing all the dates between its start and end:

# Create a time series for each task's active period
df_tasks['date_range'] = df_tasks.apply(lambda row: pd.date_range(start=row['start'], end=row['end'], freq='D'), axis=1)

This creates a new column date_range which, for each task, contains a list of dates when the task was active.

Step 4: Explode the Date Ranges

We now want to “explode” the date ranges into individual rows, so that each row corresponds to a single day when a task was active:

# Explode the date ranges into individual rows
df_expanded = df_tasks[['task_id', 'date_range']].explode('date_range').reset_index(drop=True)

This produces a DataFrame where each task is repeated for each day it was active:

task_id	date
101	2024-01-01
101	2024-01-02
101	2024-01-03
101	2024-01-04
…	…

Step 5: Group by Date to Track Active Tasks

Finally, we want to group the tasks by date so that we can see which tasks were active on each day. We do this by grouping the DataFrame by date and aggregating the task_id values into lists:

# Group by date and aggregate tasks into lists
df_active_tasks = df_expanded.groupby('date')['task_id'].apply(list).reset_index()

# Rename columns to match the expected output
df_active_tasks.columns = ['date', 'active_tasks']

This gives us the following output:

date	active_tasks
2024-01-01	[101]
2024-01-02	[101]
2024-01-03	[101]
2024-01-04	[101]
…	…
2024-01-10	[101, 102]
…	…

Each row now shows a date and the list of tasks that were active on that day. As you see above the inverse operation of explode() is to group or aggregate multiple rows back into a list. This can be done using groupby() and agg() (or apply(list)), which will collect the exploded rows back into list-like structures.

If you want to include the dates where no tasks are active, you need to ensure that the final DataFrame contains a continuous date range, even if no tasks were assigned on certain days. This can be accomplished by creating a complete date range that spans from the earliest task start date to the latest task end date, and then merging this range with the active tasks data.

Explanation:

Full Date Range: We create a complete date range that spans from the earliest task start date to the latest task end date:

full_date_range = pd.DataFrame(pd.date_range(df['date'].min(), df['date'].max()), columns=['date'])

Merging: We use pd.merge to join the full date range with the DataFrame that contains active tasks (df_active_tasks). This ensures that all dates are included in the final DataFrame, even if no tasks were active on some of those dates.

# Merge the full date range with the active tasks to fill missing dates with NaN
merged_df = pd.merge(full_date_range, active_tasks_by_date, on='date', how='left')

Handling Empty Days: The how='left' argument in the merge ensures that we get all dates from the full date range, and rows where no tasks were active will have NaN in the active_tasks column.

Output:

         date active_tasks
0  2024-01-01       [101]
1  2024-01-02       [101]
2  2024-01-03       [101]
3  2024-01-04       [101]
4  2024-01-05       [101]
5  2024-01-06       [101]
6  2024-01-07       [101]
7  2024-01-08       [101]
8  2024-01-09       [101]
9  2024-01-10  [101, 102]
10 2024-01-11  [101, 102]
11 2024-01-12  [101, 102]
12 2024-01-13  [101, 102]
13 2024-01-14  [101, 102]
14 2024-01-15  [101, 102]
15 2024-01-16       [102]
16 2024-01-17       [102]
17 2024-01-18       [102]
18 2024-01-19       [102]
19 2024-01-20       [102]
20 2024-01-21       [102]
21 2024-01-22       [102]
22 2024-01-23       [102]
23 2024-01-24       [102]
24 2024-01-25       [102]
25 2024-01-26        NaN
26 2024-01-27        NaN
27 2024-01-28        NaN
28 2024-01-29        NaN
29 2024-01-30        NaN
30 2024-01-31        NaN
31 2024-02-01       [103]
...

Dates with No Tasks: As you can see in the output, dates like 2024-01-26 to 2024-01-31 have no tasks active, so their active_tasks column shows NaN.
Continuous Date Range: We’ve ensured that the final DataFrame includes every single day within the period covered by the tasks, even if no tasks were active on that day.

The task activity problem is a practical example of the Gaps and Islands Problem:

Islands represent continuous periods of task activity.
Gaps represent days where no tasks are active.

By using pandas to manipulate the data, we can effectively solve this problem through techniques like reshaping, exploding date ranges, and merging with a full date range to handle missing periods. The underlying logic is similar to the gaps and islands problem in SQL, where we group or identify continuous sequences and the gaps between them.