Sliding and Expanding Windows in SQL and Pandas: A Comprehensive Guide

Sliding and expanding windows are two powerful tools for analyzing sequential data, especially in time series analysis. While both approaches involve calculating statistics over a subset of data, they differ in how the window is defined and applied:

  • Sliding Windows: These involve a fixed-size window that moves or “rolls” across the data. Each window contains a fixed number of observations or a fixed time range. As new observations are added, older ones are dropped, keeping the window size constant. This is useful for calculating moving averages, rolling sums, or other statistics where each window is of equal size and provides a localized view of the data.

  • Expanding Windows: In contrast, expanding windows start at the beginning of the dataset and grow progressively with each new observation. Instead of maintaining a constant window size, the window includes all past observations up to the current point. This makes expanding windows particularly useful for cumulative statistics like running totals, cumulative averages, or cumulative counts, where you want to aggregate all previous data.

Both sliding and expanding windows are essential in different contexts, and choosing the right one depends on what kind of insight you’re trying to derive from the data. Sliding windows are great for keeping the analysis focused on a fixed period, offering a balanced view of recent changes. Expanding windows, on the other hand, give a broader, cumulative picture that grows over time, making them ideal for capturing long-term trends.
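To make the distinction concrete, here is a minimal sketch on a toy series (values chosen purely for illustration):

import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# Sliding window: mean over the last 3 observations
s.rolling(window=3).mean()   # NaN, NaN, 20.0, 30.0, 40.0

# Expanding window: mean over all observations so far
s.expanding().mean()         # 10.0, 15.0, 20.0, 25.0, 30.0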

Overview of Rolling Windows in Pandas

Function Signature of DataFrame.rolling()

DataFrame.rolling(window, min_periods=None, center=False, win_type=None, 
                  on=None, axis=<no_default>, closed=None, step=None, 
                  method='single')

The key parameters of the function are:

  • window (Required)
  • min_periods
  • center
  • win_type
  • on
  • axis
  • closed
  • step
  • method

Let’s go over these parameters in detail.


1. window (Required)

This parameter specifies the size of the moving window. It can accept different types of values:

  • Fixed number of observations: An integer, such as window=3, defines a window of 3 consecutive observations.
  • Variable window based on time offset: A string, such as window='5D', defines a window based on a time duration (e.g., 5 days).

Key Concept:
This is the heart of the rolling operation, as it determines how many data points are included in each rolling window. When specifying time-based windows, the data must have a DatetimeIndex.

Example:

df['rolling_mean'] = df['value'].rolling(window=3).mean()
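A minimal sketch of the offset-based variant, assuming df has a datetime column named date that we promote to the index:

# Time-based window: the index must be a DatetimeIndex
df = df.set_index('date')
df['rolling_mean_5d'] = df['value'].rolling(window='5D').mean()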

SQL Equivalent:

  • Fixed window: SQL supports rolling windows through the OVER clause combined with ROWS or RANGE specifications. For example, a fixed window of 3 rows can be achieved like this:
SELECT value,
        AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM table;

This tells SQL to compute the rolling average over the current row and the two preceding rows.

  • Variable window (time-based): SQL can also handle time-based rolling windows using RANGE instead of ROWS, which is particularly useful when dealing with time-series data. For example, a rolling window based on a 7-day range (note that the exact INTERVAL syntax varies by database engine):
SELECT value,
        AVG(value) OVER (ORDER BY date RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) AS rolling_avg
FROM table;

2. min_periods

  • Default: None (for a fixed integer window this effectively means the window size; for an offset-based window it means 1)
  • Description: This specifies the minimum number of observations in the window required to have a valid result. If fewer observations are available, the result is NaN.

Key Concept:
This parameter is crucial when dealing with incomplete or missing data. It controls how many non-NaN values are needed to compute the result for each window.

Example:

# Will only compute mean if at least 2 values are present in the window
df['rolling_mean'] = df['value'].rolling(window=3, min_periods=2).mean()
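To see the effect on a toy series (recall that for an integer window, the default min_periods equals the window size):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Default min_periods (= window size): the first two results are NaN
s.rolling(window=3).mean()                  # NaN, NaN, 2.0, 3.0

# min_periods=2: a result appears as soon as two observations are available
s.rolling(window=3, min_periods=2).mean()   # NaN, 1.5, 2.0, 3.0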

SQL Equivalent:
SQL does not have a direct min_periods equivalent, but you can often achieve similar functionality using conditional aggregation. For instance, you can count the number of rows in the window and only calculate the average if the count exceeds a threshold:

SELECT value,
       CASE 
           WHEN COUNT(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) >= 2 
           THEN AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
           ELSE NULL
       END AS rolling_avg
FROM table;

Here, the rolling average is only calculated if at least 2 values are present in the window.


3. center

  • Default: False
  • Description: This parameter determines whether the window should be centered around the current observation. If True, the window is centered (i.e., the observation is in the middle of the window); otherwise, the window is right-aligned.

Key Concept:
Centering can be useful in certain analyses where you want to smooth the data symmetrically around each observation.

Example:

# Center the rolling window
df['rolling_mean'] = df['value'].rolling(window=3, center=True).mean()

SQL Equivalent:

SQL doesn’t have a built-in equivalent of centered windows like Pandas does, but you can simulate the behavior manually by adjusting the window frame to include rows both before and after the current row. In SQL’s OVER clause, you typically define a moving window using ROWS BETWEEN X PRECEDING AND Y FOLLOWING. By adjusting both the preceding and following parts, you can create a centered window.

Let’s say you want a centered rolling window over 3 rows where the current row is in the center (1 row before, the current row, and 1 row after). You can specify this using the ROWS BETWEEN clause.

Here’s an example SQL query that achieves this:

SELECT value,
       AVG(value) OVER (
           ORDER BY date 
           ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
       ) AS centered_rolling_avg
FROM table;

Explanation:

  • ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: This defines a window that includes:

    • 1 row before the current row,
    • the current row,
    • 1 row after the current row.

    This effectively creates a centered window of size 3, similar to how Pandas’ center=True works with a rolling window.
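A quick sketch of the difference on a toy series:

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Right-aligned (default): each result uses the current row and the 2 before it
s.rolling(window=3).mean()               # NaN, NaN, 2.0, 3.0, 4.0

# Centered: each result uses 1 row before, the current row, and 1 row after
s.rolling(window=3, center=True).mean()  # NaN, 2.0, 3.0, 4.0, NaN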


4. win_type

  • Default: None
  • Description: This specifies the type of window function to use. Possible values include boxcar, triang, blackman, hamming, bartlett, and other scipy.signal window types (SciPy must be installed to use them). These are weighted windows that apply different weighting schemes to the values within the window.

Key Concept:
Weighted windows are useful when you don’t want to treat all observations equally within the window. In a Hamming window, for example, values near the center of the window receive more weight than values at the edges.

Example:

# Using a window with a Hamming function
df['weighted_rolling_mean'] = df['value'].rolling(window=5, win_type='hamming').mean()

SQL Equivalent:
SQL does not have built-in support for weighted window functions like Pandas. Window functions also cannot be nested inside one another, so positional weights are usually expressed with LAG(), multiplying each offset value by its own weight. For instance, with weights 0.2, 0.3, and 0.5 from oldest to newest:

SELECT value,
       0.2 * LAG(value, 2) OVER (ORDER BY date)
     + 0.3 * LAG(value, 1) OVER (ORDER BY date)
     + 0.5 * value AS weighted_rolling_avg
FROM table;

The first two rows return NULL (because LAG() has nothing to look back at), which parallels the NaN values Pandas produces for incomplete windows. This manual approach works, but it is far more cumbersome than Pandas’ built-in win_type.
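If you want full control over the weights on the Pandas side (instead of a named win_type), one option is rolling().apply() with an explicit weight vector. A minimal sketch, with illustrative weights that sum to 1:

import numpy as np

# Hypothetical weights: heaviest on the most recent observation
weights = np.array([0.2, 0.3, 0.5])

# Each window is passed to np.dot as a plain array (oldest value first)
df['weighted_rolling_mean'] = (
    df['value']
    .rolling(window=3)
    .apply(lambda x: np.dot(x, weights), raw=True)
)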


5. on

  • Default: None
  • Description: This specifies the column to use for the rolling window, typically a Datetime column. It’s used when the DataFrame does not have a DatetimeIndex, but you still want to perform rolling operations based on a time-based column.

Key Concept:
If your DataFrame has a separate column for dates or timestamps (and not as the index), this parameter helps you tell Pandas which column to use for time-based rolling windows.

Example:

# Rolling window based on a datetime column
df['rolling_mean'] = df.rolling(window='7D', on='date')['value'].mean()

SQL Equivalent:
In SQL, the ORDER BY clause within the OVER clause serves a similar purpose. It specifies the column by which the rolling window is ordered (typically a date column for time-series data).

SELECT value,
       AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM table;

Here, the date column is used to define the order of the rolling window, similar to how the on parameter works in Pandas.


6. axis

  • Default: 0 (rows)
  • Description: This specifies the axis over which the rolling window is applied. 0 applies the window over rows, while 1 applies the window over columns.

Key Concept:
Typically, the rolling operation is done over rows (axis=0). It can also be applied across columns with axis=1, although this argument is deprecated in recent Pandas releases (see the transpose-based alternative below).

Example:

# Rolling mean across columns instead of rows (returns a DataFrame)
rolling_across_cols = df.rolling(window=3, axis=1).mean()
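Since axis is deprecated, the commonly suggested replacement is to transpose, roll over rows, and transpose back. A sketch, assuming an all-numeric DataFrame:

# Equivalent of axis=1 without the deprecated argument
rolling_across_cols = df.T.rolling(window=3).mean().T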

SQL Equivalent:
SQL operates in a row-wise manner by default, so there is no direct equivalent to changing the axis in SQL. SQL window functions are always applied across rows, and pivoting or transposing data to operate across columns is not straightforward. You would need to reshape the data using PIVOT or other methods, which is not as flexible as Pandas’ axis=1.


7. closed

  • Default: None
  • Description: This defines whether the intervals are closed on the right, left, both, or neither. Possible values are 'right', 'left', 'both', and 'neither'.

Key Concept:
This is particularly useful when working with time-based rolling windows, as it determines whether the current time point is included in the window.

Example:

# Rolling window with intervals closed on both sides
df['rolling_mean'] = df.rolling(window='7D', closed='both', on='date')['value'].mean()
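A sketch of how closed changes which rows fall inside a 2-day window, on a daily toy series:

import pandas as pd

idx = pd.date_range('2023-01-01', periods=4, freq='D')
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Default for offset windows is closed='right': the interval is (t - 2D, t]
s.rolling('2D').sum()                  # 1.0, 3.0, 5.0, 7.0

# closed='both': the interval is [t - 2D, t], so the boundary row is included
s.rolling('2D', closed='both').sum()   # 1.0, 3.0, 6.0, 9.0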

SQL Equivalent:
SQL gives some control over the window frame via the ROWS/RANGE boundary specification, and some engines (e.g., PostgreSQL 11+) support the standard frame-exclusion clauses such as EXCLUDE CURRENT ROW. In general, though, SQL windows default to right-closed (<=) intervals, and there is no built-in mechanism that maps one-to-one onto Pandas’ closed parameter.

For example, SQL’s default behavior is usually equivalent to a right-closed interval:

SELECT value,
       AVG(value) OVER (ORDER BY date RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) AS rolling_avg
FROM table;

To simulate other closures, you would need to adjust the frame boundaries (or use a frame-exclusion clause where supported), which is not as straightforward as flipping a single parameter in Pandas.


8. step

  • Default: None
  • Description: This defines the step size between windows. For example, a step=2 would mean applying the rolling function every 2 observations, skipping one in between.

Key Concept:
This parameter allows you to control the frequency at which the rolling operation is applied. It’s useful when you want less granular results.

Example:

# Evaluate the rolling window only at every second row (the result has fewer rows)
rolling_mean_stepped = df['value'].rolling(window=3, step=2).mean()
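A sketch of what step does to the output shape, on a toy series:

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# The window is evaluated at rows 0, 2, 4 only, so the result has 3 entries
s.rolling(window=3, step=2).mean()   # index 0: NaN, index 2: 2.0, index 4: 4.0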

SQL Equivalent:
SQL does not have a built-in equivalent for skipping rows in rolling windows. However, you can simulate this behavior by creating a row number and filtering based on the desired step:

WITH numbered_rows AS (
    SELECT value,
           ROW_NUMBER() OVER (ORDER BY date) AS row_num,
           AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg
    FROM table
)
SELECT value, rolling_avg
FROM numbered_rows
WHERE row_num % 2 = 1;

The rolling average is computed over all rows inside the CTE, and the outer query then keeps every second row (row_num % 2 = 1 corresponds to Pandas evaluating the window at rows 0, 2, 4, ...). The filter must live in the outer query: a WHERE clause in the same query as the window function would remove rows before the window sees them, changing the averages.


9. method

  • Default: 'single'
  • Description: This parameter defines whether the operation is performed column-by-column ('single') or over the entire DataFrame at once ('table'). The 'table' option only works together with engine='numba' in the aggregation call and can be helpful for wide datasets.

Key Concept:
For large, wide datasets, 'table' can improve performance by letting Numba process the whole table in one pass instead of column by column.

Example:

# Whole-table rolling operation (requires the numba package)
rolling_means = df.rolling(window=3, method='table').mean(engine='numba')

SQL Equivalent:
SQL databases handle performance optimization internally, so there is no direct control over whether window functions are computed single-threaded or multi-threaded. The SQL query planner and optimizer will automatically determine the best execution strategy, depending on the database engine (PostgreSQL, MySQL, etc.) and its parallel processing capabilities.

Some databases (e.g., PostgreSQL) offer ways to control parallelism in general, but there is no direct equivalent to Pandas’ method parameter.

Fixed vs. Variable Windows

An important distinction is between fixed windows (based on a number of observations) and variable windows (based on time offsets):

Fixed Windows

  • These are windows that contain a fixed number of observations, regardless of their timestamp or index.
  • Example: rolling(window=3) will always calculate the rolling statistic over 3 consecutive data points.

Variable Windows (Offset-based)

  • These windows are based on time intervals, which adjust dynamically based on the data’s timestamps.
  • Example: rolling(window='7D') will include all data points within the last 7 days for each calculation, which may be more or fewer than a fixed number of rows.

Example:

# Fixed window of 3 observations
df['fixed_window_mean'] = df['value'].rolling(window=3).mean()

# Variable window of 7 days
df['variable_window_mean'] = df.rolling(window='7D', on='date')['value'].mean()
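With irregular timestamps the window contents vary in size; a sketch that uses count() to make this visible:

import pandas as pd

idx = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05', '2023-01-10'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Each 7-day window contains however many observations fall in the interval
s.rolling('7D').count()   # 1.0, 2.0, 3.0, 2.0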

Function Signature of DataFrame.expanding()

DataFrame.expanding(min_periods=1, axis=<no_default>, method='single')

This function returns a window object that enables expanding computations (e.g., cumulative sums, means) on the data. We’ll explore the three main parameters of this function: min_periods, axis, and method.


1. min_periods

  • Type: int
  • Default: 1
  • Description: This parameter specifies the minimum number of observations in the expanding window required to have a valid result. If the number of observations is less than min_periods, the result will be NaN.

Key Concepts:

  • Cumulative Calculations: In an expanding window, the window starts from the first observation and includes all previous values up to the current point. The min_periods parameter ensures that only after a certain number of observations have been included in the window will the result be computed.

  • Handling Missing Data: If your data contains missing values (NaN), you can use min_periods to control when the calculation will start producing non-NaN results. For instance, setting min_periods=3 means that the expanding window will only return a result starting with the third row, as there won’t be enough data points before that.

Example:

# Expanding window with a minimum of 3 observations
df['expanding_mean'] = df['value'].expanding(min_periods=3).mean()

In this example, the cumulative mean is only calculated from the third row onward; the first two rows will contain NaN.

Important Concept:

  • When to Use: The min_periods parameter is especially important when working with datasets that may not have many observations in the early part of the series. It prevents early calculations when there isn’t enough data, ensuring that the results are statistically meaningful.

In SQL, similar cumulative operations can be achieved using window functions. Cumulative aggregates like SUM(), AVG(), and COUNT() with the OVER clause can replicate the core behavior of expanding windows, although there are some differences in how SQL handles these operations compared to Pandas.

SQL Equivalent:
SQL does not have a direct equivalent for min_periods, but you can replicate the behavior using conditional logic (e.g., CASE statements) to check the number of observations in the window and return NULL if the count is below a certain threshold.

Example:

Suppose you want to calculate the cumulative sum in SQL but only return a result if there are at least 3 rows (equivalent to min_periods=3 in Pandas).

SELECT value,
       CASE 
           WHEN COUNT(value) OVER (ORDER BY date) >= 3 
           THEN SUM(value) OVER (ORDER BY date)
           ELSE NULL
       END AS expanding_sum
FROM table;

In this query:

  • COUNT(value) OVER (ORDER BY date) counts the number of rows up to the current row.
  • The CASE statement checks if the count is greater than or equal to 3. If true, it calculates the cumulative sum using SUM() OVER, otherwise it returns NULL.

This mimics the behavior of min_periods in Pandas.


2. axis

  • Type: int or str
  • Default: 0
  • Description: This parameter specifies the axis along which the expanding window is applied. It can be either:
    • 0 (or 'index') to apply the expanding operation row-wise (default),
    • 1 (or 'columns') to apply the expanding operation column-wise.

Key Concepts:

  • Row-wise (axis=0): The expanding operation is applied to each column independently for each row. This is the most common use case, especially for time series, where you want to calculate cumulative statistics over time for each column.

  • Column-wise (axis=1): The expanding operation is applied across the columns for each row. This can be useful in cases where the columns represent different variables, and you want to calculate cumulative statistics across them.

Example:

# Apply expanding mean row-wise (default)
df['expanding_mean'] = df['value'].expanding().mean()

# Apply expanding sum across columns (returns a DataFrame)
expanding_sum_across_columns = df.expanding(axis=1).sum()

Important Concept:

  • When to Use: In most time-series applications, you’ll use axis=0 because you’re typically interested in cumulative operations over rows (e.g., dates). Using axis=1 is less common but can be useful when multiple features sit in columns and you want to aggregate across them for each row. Note that, as with rolling(), the axis argument is deprecated in recent Pandas releases.

3. method

  • Type: str
  • Options: {'single', 'table'}
  • Default: 'single'
  • Description: This parameter controls how the expanding operation is executed:
    • 'single': Performs the expanding operation over a single column or row at a time.
    • 'table': Executes the expanding operation over the entire DataFrame in one go, which can be more efficient for wide data. As with rolling(), 'table' only works together with engine='numba' in the aggregation call.

Key Concepts:

  • Single vs. Table:
    • The 'single' method applies the expanding window function to each column or row individually. This is the default behavior and is often sufficient for smaller datasets or when working with one column at a time.
    • The 'table' method processes the entire DataFrame or Series in one operation, which can be faster for large datasets but may require more memory and computational resources.

Example:

# Apply expanding sum using the single method (default)
df['expanding_sum'] = df['value'].expanding(method='single').sum()

# Apply expanding mean using the table method (requires the numba package)
expanding_means = df.expanding(method='table').mean(engine='numba')

Important Concept:

  • When to Use: The method='table' option can be more efficient for large DataFrames, especially when performing operations over many columns, but it requires the numba package and is rarely necessary for smaller datasets.

Running Totals with Additional Conditions

Sometimes, you may need to calculate running totals based on specific conditions, such as filtering certain rows.

SQL Approach

In SQL, incorporating conditions into window functions can be achieved using CASE WHEN statements within the aggregate function or by applying a WHERE clause in a subquery.

Example Scenario: Calculate a running total of sales amounts, but only include sales from a specific region.

SELECT 
    date,
    region,
    amount,
    SUM(CASE WHEN region = 'North' THEN amount ELSE 0 END) OVER (
        ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total_north
FROM 
    sales
ORDER BY 
    date;

Explanation:

  • CASE WHEN region = 'North' THEN amount ELSE 0 END: Ensures that only sales from the ‘North’ region are included in the sum.
  • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: Defines the window from the start of the dataset up to the current row, creating a running total.

Pandas Approach

Pandas allows for more straightforward conditional operations using boolean indexing and the cumsum() function.

Example Scenario: Using the same sales data, calculate a running total for the ‘North’ region.

import pandas as pd

# Sample data
data = {
    'date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'region': ['North', 'South', 'North', 'East', 'North', 'West', 'North', 'South', 'North', 'East'],
    'amount': [100, 150, 200, 130, 170, 160, 180, 190, 210, 220]
}
df = pd.DataFrame(data)

# Ensure data is sorted by date
df = df.sort_values('date')

# Calculate running total for 'North' region
df['running_total_north'] = (
    df[df['region'] == 'North']
    .sort_values('date')['amount']
    .cumsum()
)
# Fill NaN for non-North rows: carry the last running total forward, start at 0
df['running_total_north'] = df['running_total_north'].ffill().fillna(0)

print(df)

Output:

        date region  amount  running_total_north
0 2023-01-01  North     100                 100.0
1 2023-01-02  South     150                 100.0
2 2023-01-03  North     200                 300.0
3 2023-01-04   East     130                 300.0
4 2023-01-05  North     170                 470.0
5 2023-01-06   West     160                 470.0
6 2023-01-07  North     180                 650.0
7 2023-01-08  South     190                 650.0
8 2023-01-09  North     210                 860.0
9 2023-01-10   East     220                 860.0

Explanation:

  • df[df['region'] == 'North']: Filters the DataFrame to include only rows from the ‘North’ region.
  • .cumsum(): Calculates the cumulative sum.
  • .ffill().fillna(0): Fills NaN values for non-North rows by carrying the last running total forward and replacing any remaining NaN (before the first North sale) with 0.
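An alternative that mirrors the SQL CASE WHEN more directly is to zero out non-matching amounts before taking the cumulative sum; a minimal sketch on the same DataFrame:

# Zero out non-North amounts, then take a running total over all rows
df['running_total_north'] = df['amount'].where(df['region'] == 'North', 0).cumsum()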