3 min read
Solving the Gaps and Islands Problem Using Python Pandas

The “gaps and islands” problem is a well-known challenge in SQL, where continuous sequences (islands) and gaps between them need to be identified, particularly in date ranges. While this problem is traditionally solved using SQL, Python’s Pandas library offers a versatile and intuitive way to tackle it as well.

In this post, we’ll explore how to solve the gaps and islands problem using Pandas, a powerful data manipulation tool that is widely used in the data science community.

The Problem: Gaps and Islands

As a quick recap, the “gaps and islands” problem involves identifying sequences of contiguous records (islands) and the gaps between them. This is particularly common when working with date ranges, such as tracking employee work periods, booking reservations, or monitoring system uptime.

Example Scenario

Imagine you have a DataFrame that tracks employee work assignments with the following columns:

  • employee_id: An integer that uniquely identifies an employee.
  • start_date: The date when the employee’s assignment starts.
  • end_date: The date when the employee’s assignment ends.

The goal is to identify continuous periods of work (islands) for each employee and highlight any gaps between these periods. Let’s visualize this with a simple example:

employee_idstart_dateend_date
12024-01-012024-01-15
12024-01-102024-01-25
12024-02-012024-02-10
12024-02-152024-02-28
22024-01-052024-01-20
22024-01-182024-02-05

The Pandas Solution

To solve this problem using Pandas, we’ll follow these steps:

  1. Sort the Data: Ensure the data is sorted by employee ID and start date.
  2. Identify Gaps: Create a boolean mask that identifies gaps between consecutive date ranges. We’ll compare the current start_date with the previous end_date.
  3. Group Islands: Use the mask to create a group identifier for each continuous sequence (island). The cumsum() function is key here.
  4. Aggregate the Islands: Aggregate the data to get the start and end dates of each island (This will be added in the enhanced example).

Here’s how you can implement this in Python using Pandas:

import pandas as pd

data = {'employee_id': [1, 1, 1, 1, 2, 2],
        'start_date': ['2024-01-01', '2024-01-10', '2024-02-01', '2024-02-15', '2024-01-05', '2024-01-18'],
        'end_date': ['2024-01-15', '2024-01-25', '2024-02-10', '2024-02-28', '2024-01-20', '2024-02-05']}

df = pd.DataFrame(data)

# Convert dates to datetime objects
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

# Sort by employee_id and start_date
df = df.sort_values(['employee_id', 'start_date'])

# Group by employee_id and apply the island grouping logic
df['island_group'] = df.groupby('employee_id').apply(
    lambda g: (g['start_date'] > g['end_date'].shift()).cumsum() + 1  # The core logic
).reset_index(drop=True)

# Display the result
print(df)

# Aggregate to get island start and end dates
islands = df.groupby(['employee_id', 'island_group']).agg(
    island_start=('start_date', 'min'),
    island_end=('end_date', 'max')
).reset_index()
print("\nAggregated Islands:")
print(islands)

Breaking Down the Code

Data Preparation: The code begins by creating a Pandas DataFrame and converting the date strings to datetime objects. This is crucial for date comparisons.

Sorting: The sort_values function sorts the DataFrame by employee_id and start_date, which is essential for identifying consecutive periods.

Island Grouping: This is the core of the solution. The groupby method groups the data by employee_id. Within each group, the apply method uses a lambda function to perform the following:

  • g['end_date'].shift(): Shifts the end_date one row up within each employee group. This allows comparison of the current start_date with the previous end_date.
  • g['start_date'] > g['end_date'].shift(): This comparison creates a boolean Series. True indicates a gap (the current start_date is after the previous end_date), and False indicates continuity.
  • .cumsum(): The cumulative sum of the boolean Series creates the island_group identifier. Each True value increments the group number, effectively marking the start of a new island. The +1 is added to start group numbering from 1 instead of 0.

Why Use Pandas?

Pandas offers several advantages for solving this type of problem:

Vectorized Operations: Pandas excels at vectorized operations, making it efficient for processing large datasets. Intuitive Syntax: The syntax for data manipulation is generally more readable and easier to understand than SQL for similar tasks. Integration with Data Science Ecosystem: Pandas seamlessly integrates with other Python libraries commonly used in data science, such as NumPy, Scikit-learn, and Matplotlib.