The “gaps and islands” problem is a well-known challenge in SQL, where continuous sequences (islands) and gaps between them need to be identified, particularly in date ranges. While this problem is traditionally solved using SQL, Python’s Pandas library offers a versatile and intuitive way to tackle it as well.
In this post, we’ll explore how to solve the gaps and islands problem using Pandas, a powerful data manipulation tool that is widely used in the data science community.
The Problem: Gaps and Islands
As a quick recap, the “gaps and islands” problem involves identifying sequences of contiguous records (islands) and the gaps between them. This is particularly common when working with date ranges, such as tracking employee work periods, booking reservations, or monitoring system uptime.
Example Scenario
Imagine you have a DataFrame that tracks employee work assignments with the following columns:
- employee_id: An integer that uniquely identifies an employee.
- start_date: The date when the employee’s assignment starts.
- end_date: The date when the employee’s assignment ends.
The goal is to identify continuous periods of work (islands) for each employee and highlight any gaps between these periods. Let’s visualize this with a simple example:
employee_id | start_date | end_date |
---|---|---|
1 | 2024-01-01 | 2024-01-15 |
1 | 2024-01-10 | 2024-01-25 |
1 | 2024-02-01 | 2024-02-10 |
1 | 2024-02-15 | 2024-02-28 |
2 | 2024-01-05 | 2024-01-20 |
2 | 2024-01-18 | 2024-02-05 |
The Pandas Solution
To solve this problem using Pandas, we’ll follow these steps:
- Sort the Data: Ensure the data is sorted by employee ID and start date.
- Identify Gaps: Create a boolean mask that identifies gaps between consecutive date ranges. We’ll compare the current
start_date
with the previousend_date
. - Group Islands: Use the mask to create a group identifier for each continuous sequence (island). The
cumsum()
function is key here. - Aggregate the Islands: Aggregate the data to get the start and end dates of each island (This will be added in the enhanced example).
Here’s how you can implement this in Python using Pandas:
import pandas as pd
data = {'employee_id': [1, 1, 1, 1, 2, 2],
'start_date': ['2024-01-01', '2024-01-10', '2024-02-01', '2024-02-15', '2024-01-05', '2024-01-18'],
'end_date': ['2024-01-15', '2024-01-25', '2024-02-10', '2024-02-28', '2024-01-20', '2024-02-05']}
df = pd.DataFrame(data)
# Convert dates to datetime objects
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
# Sort by employee_id and start_date
df = df.sort_values(['employee_id', 'start_date'])
# Group by employee_id and apply the island grouping logic
df['island_group'] = df.groupby('employee_id').apply(
lambda g: (g['start_date'] > g['end_date'].shift()).cumsum() + 1 # The core logic
).reset_index(drop=True)
# Display the result
print(df)
# Aggregate to get island start and end dates
islands = df.groupby(['employee_id', 'island_group']).agg(
island_start=('start_date', 'min'),
island_end=('end_date', 'max')
).reset_index()
print("\nAggregated Islands:")
print(islands)
Breaking Down the Code
Data Preparation: The code begins by creating a Pandas DataFrame and converting the date strings to datetime objects. This is crucial for date comparisons.
Sorting: The sort_values
function sorts the DataFrame by employee_id
and start_date
, which is essential for identifying consecutive periods.
Island Grouping: This is the core of the solution. The groupby
method groups the data by employee_id
. Within each group, the apply
method uses a lambda function to perform the following:
g['end_date'].shift()
: Shifts theend_date
one row up within each employee group. This allows comparison of the currentstart_date
with the previousend_date
.g['start_date'] > g['end_date'].shift()
: This comparison creates a boolean Series.True
indicates a gap (the currentstart_date
is after the previousend_date
), andFalse
indicates continuity..cumsum()
: The cumulative sum of the boolean Series creates theisland_group
identifier. EachTrue
value increments the group number, effectively marking the start of a new island. The+1
is added to start group numbering from 1 instead of 0.
Why Use Pandas?
Pandas offers several advantages for solving this type of problem:
Vectorized Operations: Pandas excels at vectorized operations, making it efficient for processing large datasets. Intuitive Syntax: The syntax for data manipulation is generally more readable and easier to understand than SQL for similar tasks. Integration with Data Science Ecosystem: Pandas seamlessly integrates with other Python libraries commonly used in data science, such as NumPy, Scikit-learn, and Matplotlib.