The “gaps and islands” problem is a well-known challenge in SQL, where continuous sequences (islands) and gaps between them need to be identified, particularly in date ranges. While this problem is traditionally solved using SQL (see my post here), Python’s Pandas library offers a versatile and intuitive way to tackle it as well.
In this post, we’ll explore how two ways to solve the gaps and islands problem using the Python package Pandas.
The Problem: Gaps and Islands
As a quick recap, the “gaps and islands” problem involves identifying sequences of contiguous records (islands) and the gaps between them. This is particularly common when working with date ranges, such as tracking employee work periods, booking reservations, or monitoring system uptime.
Example Scenario
Imagine you have a DataFrame that tracks employee work assignments with the following columns:
- employee_id: An integer that uniquely identifies an employee.
- start_date: The date when the employee’s assignment starts.
- end_date: The date when the employee’s assignment ends.
The goal is to identify continuous periods of work (islands) for each employee and highlight any gaps between these periods. Let’s visualize this with a simple example:
| employee_id | start_date | end_date |
|-------------|------------|------------|
| 1 | 2024-01-01 | 2024-01-15 |
| 1 | 2024-01-10 | 2024-01-25 | <- Overlapping, part of the same island for employee_id=1
| 1 | 2024-02-01 | 2024-02-10 | <- New island
| 1 | 2024-02-15 | 2024-02-28 | <- New island
| 2 | 2024-01-05 | 2024-01-20 |
| 2 | 2024-01-18 | 2024-02-05 | <- Overlapping, part of the same island for employee_id=2
The Pandas Solution
To solve this problem using Pandas, we’ll follow these steps:
- Sort the Data: Ensure the data is sorted by employee ID and start date.
- Identify Gaps: Create a boolean mask that identifies gaps between consecutive date ranges. We’ll compare the current
start_date
with the previousend_date
. - Group Islands: Use the mask to create a group identifier for each continuous sequence (island). The
cumsum()
function is key here. - Aggregate the Islands: Aggregate the data to get the start and end dates of each island (This will be added in the enhanced example).
Here’s how you can implement this in Python using Pandas:
import pandas as pd
data = {'employee_id': [1, 1, 1, 1, 2, 2],
'start_date': ['2024-01-01', '2024-01-10', '2024-02-01', '2024-02-15', '2024-01-05', '2024-01-18'],
'end_date': ['2024-01-15', '2024-01-25', '2024-02-10', '2024-02-28', '2024-01-20', '2024-02-05']}
df = pd.DataFrame(data)
# Convert dates to datetime objects
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# Sort by employee_id and start_date
df = df.sort_values(['employee_id', 'start_date'])
# Group by employee_id and apply the island grouping logic
df['island_group'] = df.groupby('employee_id').apply(
lambda g: (g['start_date'] > g['end_date'].shift()).cumsum() + 1 # The core logic
).reset_index(drop=True)
# Display the result
print(df)
# Aggregate to get island start and end dates
islands = df.groupby(['employee_id', 'island_group']).agg(
island_start=('start_date', 'min'),
island_end=('end_date', 'max')
).reset_index()
print("\nAggregated Islands:")
print(islands)
Breaking Down the Code
Data Preparation: The code begins by creating a Pandas DataFrame and converting the date strings to datetime objects. This is crucial for date comparisons.
Sorting: The sort_values
function sorts the DataFrame by employee_id
and start_date
, which is essential for identifying consecutive periods.
Island Grouping: This is the core of the solution. The groupby
method groups the data by employee_id
. Within each group, the apply
method uses a lambda function to perform the following:
g['end_date'].shift()
: Shifts theend_date
one row up within each employee group. This allows comparison of the currentstart_date
with the previousend_date
.g['start_date'] > g['end_date'].shift()
: This comparison creates a boolean Series.True
indicates a gap (the currentstart_date
is after the previousend_date
), andFalse
indicates continuity..cumsum()
: The cumulative sum of the boolean Series creates theisland_group
identifier. EachTrue
value increments the group number, effectively marking the start of a new island. The+1
is added to start group numbering from 1 instead of 0.
The Task Activity Problem (An application of Gaps and Islands)
Let’s say you have a dataset that records when tasks are assigned to an employee. For each task, you have a start
date and an end
date. Your goal is to determine the periods when tasks were active on a day-to-day basis and to track which tasks were active on any given date.
Here’s an example dataset:
employee_id | date | type | task_id |
---|---|---|---|
1 | 2024-01-01 | start | 101 |
1 | 2024-01-15 | end | 101 |
1 | 2024-01-10 | start | 102 |
1 | 2024-01-25 | end | 102 |
1 | 2024-02-01 | start | 103 |
1 | 2024-02-10 | end | 103 |
1 | 2024-02-15 | start | 104 |
1 | 2024-02-28 | end | 104 |
This dataset contains both start
and end
events for four tasks assigned to the same employee. We want to create a representation of each day between the start and end dates and determine which tasks were active on that day.
The Solution
We’ll use the power of pandas
to:
- Reshape the data so that each task has both its start and end dates in the same row.
- Create a date range for each task’s active period.
- Explode the date ranges into individual rows, with one row for each day the task is active.
- Group the tasks by date to track which tasks are active on which day.
Let’s walk through the solution.
Step 1: Setup the Data
We begin by creating a pandas DataFrame from our sample data:
import pandas as pd
# Sample data
data = {
'employee_id': [1, 1, 1, 1, 1, 1, 1, 1],
'date': ['2024-01-01', '2024-01-15', '2024-01-10', '2024-01-25', '2024-02-01', '2024-02-10', '2024-02-15', '2024-02-28'],
'type': ['start', 'end', 'start', 'end', 'start', 'end', 'start', 'end'],
'task_id': [101, 101, 102, 102, 103, 103, 104, 104]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Convert the 'date' column to datetime
df['date'] = pd.to_datetime(df['date'])
Here, we have converted the date
column to a datetime
object so that we can easily manipulate the dates.
Step 2: Reshape the Data
Next, we want to reshape the data so that each task (task_id
) has both its start
and end
dates in the same row. We’ll use the pivot
function for this:
# Reshape the data to get task-specific start and end dates
df_tasks = df.pivot(index='task_id', columns='type', values='date').reset_index()
This transforms the original DataFrame into the following format:
task_id | start | end |
---|---|---|
101 | 2024-01-01 | 2024-01-15 |
102 | 2024-01-10 | 2024-01-25 |
103 | 2024-02-01 | 2024-02-10 |
104 | 2024-02-15 | 2024-02-28 |
Now, each task has its start
and end
dates in a single row.
Step 3: Create Date Ranges for Each Task
Next, we’ll generate a date range for each task, representing all the dates between its start and end:
# Create a time series for each task's active period
df_tasks['date_range'] = df_tasks.apply(lambda row: pd.date_range(start=row['start'], end=row['end'], freq='D'), axis=1)
This creates a new column date_range
which, for each task, contains a list of dates when the task was active.
Step 4: Explode the Date Ranges
We now want to “explode” the date ranges into individual rows, so that each row corresponds to a single day when a task was active:
# Explode the date ranges into individual rows
df_expanded = df_tasks[['task_id', 'date_range']].explode('date_range').reset_index(drop=True)
This produces a DataFrame where each task is repeated for each day it was active:
task_id | date |
---|---|
101 | 2024-01-01 |
101 | 2024-01-02 |
101 | 2024-01-03 |
101 | 2024-01-04 |
… | … |
Step 5: Group by Date to Track Active Tasks
Finally, we want to group the tasks by date so that we can see which tasks were active on each day. We do this by grouping the DataFrame by date
and aggregating the task_id
values into lists:
# Group by date and aggregate tasks into lists
df_active_tasks = df_expanded.groupby('date')['task_id'].apply(list).reset_index()
# Rename columns to match the expected output
df_active_tasks.columns = ['date', 'active_tasks']
This gives us the following output:
date | active_tasks |
---|---|
2024-01-01 | [101] |
2024-01-02 | [101] |
2024-01-03 | [101] |
2024-01-04 | [101] |
… | … |
2024-01-10 | [101, 102] |
… | … |
Each row now shows a date and the list of tasks that were active on that day. As you see above the inverse operation of explode() is to group or aggregate multiple rows back into a list. This can be done using groupby() and agg() (or apply(list)), which will collect the exploded rows back into list-like structures.
If you want to include the dates where no tasks are active, you need to ensure that the final DataFrame contains a continuous date range, even if no tasks were assigned on certain days. This can be accomplished by creating a complete date range that spans from the earliest task start date to the latest task end date, and then merging this range with the active tasks data.
Explanation:
- Full Date Range: We create a complete date range that spans from the earliest task start date to the latest task end date:
full_date_range = pd.DataFrame(pd.date_range(df['date'].min(), df['date'].max()), columns=['date'])
- Merging: We use
pd.merge
to join the full date range with the DataFrame that contains active tasks (df_active_tasks
). This ensures that all dates are included in the final DataFrame, even if no tasks were active on some of those dates.
# Merge the full date range with the active tasks to fill missing dates with NaN
merged_df = pd.merge(full_date_range, active_tasks_by_date, on='date', how='left')
- Handling Empty Days: The
how='left'
argument in themerge
ensures that we get all dates from the full date range, and rows where no tasks were active will haveNaN
in theactive_tasks
column.
Output:
date active_tasks
0 2024-01-01 [101]
1 2024-01-02 [101]
2 2024-01-03 [101]
3 2024-01-04 [101]
4 2024-01-05 [101]
5 2024-01-06 [101]
6 2024-01-07 [101]
7 2024-01-08 [101]
8 2024-01-09 [101]
9 2024-01-10 [101, 102]
10 2024-01-11 [101, 102]
11 2024-01-12 [101, 102]
12 2024-01-13 [101, 102]
13 2024-01-14 [101, 102]
14 2024-01-15 [101, 102]
15 2024-01-16 [102]
16 2024-01-17 [102]
17 2024-01-18 [102]
18 2024-01-19 [102]
19 2024-01-20 [102]
20 2024-01-21 [102]
21 2024-01-22 [102]
22 2024-01-23 [102]
23 2024-01-24 [102]
24 2024-01-25 [102]
25 2024-01-26 NaN
26 2024-01-27 NaN
27 2024-01-28 NaN
28 2024-01-29 NaN
29 2024-01-30 NaN
30 2024-01-31 NaN
31 2024-02-01 [103]
...
- Dates with No Tasks: As you can see in the output, dates like
2024-01-26
to2024-01-31
have no tasks active, so theiractive_tasks
column showsNaN
. - Continuous Date Range: We’ve ensured that the final DataFrame includes every single day within the period covered by the tasks, even if no tasks were active on that day.
The task activity problem is a practical example of the Gaps and Islands Problem:
- Islands represent continuous periods of task activity.
- Gaps represent days where no tasks are active.
By using pandas to manipulate the data, we can effectively solve this problem through techniques like reshaping, exploding date ranges, and merging with a full date range to handle missing periods. The underlying logic is similar to the gaps and islands problem in SQL, where we group or identify continuous sequences and the gaps between them.