Sliding and expanding windows are two powerful tools for analyzing sequential data, especially in time series analysis. While both approaches involve calculating statistics over a subset of data, they differ in how the window is defined and applied:
- Sliding Windows: These involve a fixed-size window that moves or “rolls” across the data. Each window contains a fixed number of observations or a fixed time range. As new observations are added, older ones are dropped, keeping the window size constant. This is useful for calculating moving averages, rolling sums, or other statistics where each window is of equal size and provides a localized view of the data.
- Expanding Windows: In contrast, expanding windows start at the beginning of the dataset and grow progressively with each new observation. Instead of maintaining a constant window size, the window includes all past observations up to the current point. This makes expanding windows particularly useful for cumulative statistics like running totals, cumulative averages, or cumulative counts, where you want to aggregate all previous data.
Both sliding and expanding windows are essential in different contexts, and choosing the right one depends on what kind of insight you’re trying to derive from the data. Sliding windows are great for keeping the analysis focused on a fixed period, offering a balanced view of recent changes. Expanding windows, on the other hand, give a broader, cumulative picture that grows over time, making them ideal for capturing long-term trends.
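To make the contrast concrete, here is a minimal sketch comparing the two window types on a small Series (the values are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Sliding window: each result uses exactly the last 2 observations.
sliding_sum = s.rolling(window=2).sum()

# Expanding window: each result uses all observations seen so far.
expanding_sum = s.expanding().sum()

print(sliding_sum.tolist())    # first entry is NaN (incomplete window)
print(expanding_sum.tolist())  # cumulative: 1, 3, 6, 10
```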
Overview of Rolling Windows in Pandas
Function Signature of DataFrame.rolling()
DataFrame.rolling(window, min_periods=None, center=False, win_type=None,
on=None, axis=<no_default>, closed=None, step=None,
method='single')
The key parameters of the function are:
- window (required)
- min_periods
- center
- win_type
- on
- axis
- closed
- step
- method
Let’s go over these parameters in detail.
1. window (Required)
This parameter specifies the size of the moving window. It can accept different types of values:
- Fixed number of observations: An integer, such as window=3, defines a window of 3 consecutive observations.
- Variable window based on time offset: A string, such as window='5D', defines a window based on a time duration (e.g., 5 days).
Key Concept:
This is the heart of the rolling operation, as it determines how many data points are included in each rolling window. When specifying time-based windows, the data must have a DatetimeIndex.
Example:
df['rolling_mean'] = df['value'].rolling(window=3).mean()
SQL Equivalent:
- Fixed window: SQL supports rolling windows through the OVER clause combined with ROWS or RANGE specifications.
- For example, a fixed window of 3 rows can be achieved like this:
SELECT value,
AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM table;
This tells SQL to compute the rolling average over the current row and the two preceding rows.
- Variable window (time-based): SQL can also handle time-based rolling windows using RANGE instead of ROWS. This is particularly useful when dealing with time-series data.
- For example, a rolling window based on a 7-day range:
SELECT value,
AVG(value) OVER (ORDER BY date RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) AS rolling_avg
FROM table;
2. min_periods
- Default: None (for an integer window this effectively defaults to the window size; for an offset-based window, to 1)
- Description: This specifies the minimum number of observations in the window required to have a valid result. If fewer observations are available, the result is NaN.
Key Concept:
This parameter is crucial when dealing with incomplete or missing data. It controls how many non-NaN values are needed to compute the result for each window.
Example:
# Will only compute mean if at least 2 values are present in the window
df['rolling_mean'] = df['value'].rolling(window=3, min_periods=2).mean()
SQL Equivalent:
SQL does not have a direct min_periods equivalent, but you can often achieve similar functionality using conditional aggregation. For instance, you can count the number of rows in the window and only calculate the average if the count exceeds a threshold:
SELECT value,
CASE
WHEN COUNT(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) >= 2
THEN AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ELSE NULL
END AS rolling_avg
FROM table;
Here, the rolling average is only calculated if at least 2 values are present in the window.
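Back in Pandas, a quick sketch (with illustrative values) shows exactly where NaN gives way to real results once min_periods is relaxed:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# window=3 normally needs 3 values, so the first two results are NaN;
# min_periods=2 lets the second window (holding only 2 values) produce a result.
strict = s.rolling(window=3).mean()
relaxed = s.rolling(window=3, min_periods=2).mean()

print(strict.tolist())   # NaN, NaN, 2.0, 3.0
print(relaxed.tolist())  # NaN, 1.5, 2.0, 3.0
```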
3. center
- Default: False
- Description: This parameter determines whether the window should be centered around the current observation. If True, the window is centered (i.e., the observation is in the middle of the window); otherwise, the window is right-aligned.
Key Concept:
Centering can be useful in certain analyses where you want to smooth the data symmetrically around each observation.
Example:
# Center the rolling window
df['rolling_mean'] = df['value'].rolling(window=3, center=True).mean()
SQL Equivalent:
SQL doesn’t have a built-in equivalent of centered windows like Pandas does, but you can simulate the behavior manually by adjusting the window frame to include rows both before and after the current row. In SQL’s OVER clause, you typically define a moving window using ROWS BETWEEN X PRECEDING AND Y FOLLOWING. By adjusting both the preceding and following parts, you can create a centered window.
Let’s say you want a centered rolling window over 3 rows where the current row is in the center (1 row before, the current row, and 1 row after). You can specify this using the ROWS BETWEEN clause.
Here’s an example SQL query that achieves this:
SELECT value,
AVG(value) OVER (
ORDER BY date
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS centered_rolling_avg
FROM table;
Explanation:
- ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: This defines a window that includes:
  - 1 row before the current row,
  - the current row,
  - 1 row after the current row.
This effectively creates a centered window of size 3, similar to how Pandas’ center=True works with a rolling window.
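Back in Pandas, a small sketch (illustrative values) shows how centering shifts the NaN padding from one end to both ends:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Right-aligned (default): the window ends at the current observation.
right_aligned = s.rolling(window=3).mean()          # NaN, NaN, 2, 3, 4

# Centered: the current observation sits in the middle of the window,
# so incomplete windows appear at both ends of the series.
centered = s.rolling(window=3, center=True).mean()  # NaN, 2, 3, 4, NaN
```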
4. win_type
- Default: None
- Description: This specifies the type of window function to use. Possible values include boxcar, triang, blackman, hamming, bartlett, and more (any window supported by scipy.signal.get_window; SciPy must be installed). These are weighted windows that apply different weighting schemes to the values within the window.
Key Concept:
Weighted windows are useful when you don’t want to treat all observations equally within the window. For example, in a Hamming window, the values near the center of the window might receive more weight than the values at the edges.
Example:
# Using a window with a Hamming function
df['weighted_rolling_mean'] = df['value'].rolling(window=5, win_type='hamming').mean()
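Since win_type relies on SciPy, a NumPy-only sketch can show what the weighting actually computes: a weighted rolling mean of the form sum(w * x) / sum(w) per window. The series values here are illustrative, and the claim that this matches win_type='hamming' assumes SciPy's symmetric Hamming window (which np.hamming reproduces):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Symmetric 5-point Hamming weights; the center of the window gets the most weight.
weights = np.hamming(5)

# Manual weighted rolling mean: should agree with
# s.rolling(window=5, win_type='hamming').mean() when SciPy is installed.
weighted = s.rolling(window=5).apply(lambda x: np.average(x, weights=weights))

print(weighted.tolist())
```

Because the weights are symmetric and the input is an arithmetic sequence, each weighted mean lands exactly on the window's center value.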
SQL Equivalent:
SQL does not have built-in support for weighted window functions like Pandas. However, you can manually create weighted windows by using custom formulas. For instance, you could assign weights to rows based on their position in the window:
SELECT value,
       (0.2 * LAG(value, 2) OVER (ORDER BY date)
      + 0.3 * LAG(value, 1) OVER (ORDER BY date)
      + 0.5 * value) AS weighted_rolling_avg
FROM table;
Here each position in the 3-row window gets its own weight (0.2 for the oldest row, 0.3 for the middle one, 0.5 for the current one). This is a simple example of manually applying weights, but it’s more cumbersome compared to Pandas’ built-in win_type.
5. on
- Default: None
- Description: This specifies the column to use for the rolling window, typically a Datetime column. It’s used when the DataFrame does not have a DatetimeIndex, but you still want to perform rolling operations based on a time-based column.
Key Concept:
If your DataFrame has a separate column for dates or timestamps (and not as the index), this parameter helps you tell Pandas which column to use for time-based rolling windows.
Example:
# Rolling window based on a datetime column
df['rolling_mean'] = df.rolling(window='7D', on='date')['value'].mean()
SQL Equivalent:
In SQL, the ORDER BY clause within the OVER clause serves a similar purpose. It specifies the column by which the rolling window is ordered (typically a date column for time-series data).
SELECT value,
AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM table;
Here, the date column is used to define the order of the rolling window, similar to how the on parameter works in Pandas.
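A runnable sketch of the on parameter (column names are illustrative), where the dates live in an ordinary column rather than the index:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=5, freq='D'),
    'value': [1, 2, 3, 4, 5],
})

# 'date' is a regular column (not the index), so we pass it via on=
rolling_3d_sum = df.rolling(window='3D', on='date')['value'].sum()

print(rolling_3d_sum.tolist())  # each window covers the last 3 days
```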
6. axis
- Default: 0 (rows)
- Description: This specifies the axis over which the rolling window is applied. 0 applies the window over rows, while 1 applies the window over columns. Note that axis=1 is deprecated in recent versions of Pandas (hence axis=<no_default> in the signature); transposing the DataFrame first is the recommended alternative.
Key Concept:
Typically, the rolling operation is done over rows (axis=0), but it can also be applied across columns with axis=1.
Example:
# Rolling mean across columns instead of rows
# (the result is a DataFrame, not a single column)
rolling_across_columns = df.rolling(window=3, axis=1).mean()
SQL Equivalent:
SQL operates in a row-wise manner by default, so there is no direct equivalent to changing the axis in SQL. SQL window functions are always applied across rows, and pivoting or transposing data to operate across columns is not straightforward. You would need to reshape the data using PIVOT or other methods, which is not as flexible as Pandas’ axis=1.
7. closed
- Default: None
- Description: This defines whether the intervals are closed on the right, left, both, or neither. Possible values are 'right', 'left', 'both', and 'neither'.
Key Concept:
This is particularly useful when working with time-based rolling windows, as it determines whether the current time point is included in the window.
Example:
# Rolling window with intervals closed on both sides
df['rolling_mean'] = df.rolling(window='7D', closed='both', on='date')['value'].mean()
SQL Equivalent:
SQL provides control over whether the window is inclusive or exclusive via the BETWEEN clause in the ROWS or RANGE specification. However, SQL typically defaults to right-closed (<=) intervals, and there is no built-in mechanism for fully controlling the window closure as in Pandas.
For example, SQL’s default behavior is usually equivalent to a right-closed interval:
SELECT value,
AVG(value) OVER (ORDER BY date RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) AS rolling_avg
FROM table;
To simulate a left-closed window, you would need to manipulate the ORDER BY and window logic, but this is not straightforward in SQL.
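In Pandas, the effect of each closed option is easiest to see side by side on a small time-indexed Series (illustrative daily values; the sums below are for the third day, whose '2D' window boundary falls exactly on the first day):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=pd.date_range('2023-01-01', periods=5, freq='D'))

# Sums for 2023-01-03 with a '2D' window under each closure rule:
print(s.rolling('2D', closed='right').sum().iloc[2])    # (t-2D, t]  -> 2+3 = 5
print(s.rolling('2D', closed='both').sum().iloc[2])     # [t-2D, t]  -> 1+2+3 = 6
print(s.rolling('2D', closed='left').sum().iloc[2])     # [t-2D, t)  -> 1+2 = 3
print(s.rolling('2D', closed='neither').sum().iloc[2])  # (t-2D, t)  -> 2
```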
8. step
- Default: None
- Description: This defines the step size between windows. For example, step=2 means the rolling function is evaluated at every 2nd observation, skipping one in between.
Key Concept:
This parameter allows you to control the frequency at which the rolling operation is applied. It’s useful when you want less granular results.
Example:
# Rolling window evaluated at every second observation
# (the result has fewer rows than df, so keep it as a separate Series)
stepped_mean = df['value'].rolling(window=3, step=2).mean()
SQL Equivalent:
SQL does not have a built-in equivalent for skipping rows in rolling windows. However, you can simulate this behavior by computing the rolling average over all rows first and then filtering to every Nth row. (Filtering in the same query would change the result, because the window function would then only see the filtered rows.)
WITH rolled AS (
    SELECT value,
           ROW_NUMBER() OVER (ORDER BY date) AS row_num,
           AVG(value) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg
    FROM table
)
SELECT value, rolling_avg
FROM rolled
WHERE row_num % 2 = 1;
This query uses the ROW_NUMBER() function to generate row numbers in a CTE and then filters the final result based on the step size.
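Back in Pandas, a quick sketch (illustrative values) confirms that step shrinks the result: only every step-th window is evaluated, so the output has fewer rows than the input:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])

# Windows are evaluated only at positions 0, 2, and 4,
# so the result has 3 rows instead of 6.
stepped = s.rolling(window=3, step=2).mean()

print(stepped)  # index [0, 2, 4]; first value is NaN (incomplete window)
```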
9. method
- Default: 'single'
- Description: This parameter defines whether the operation is performed per single column or row ('single') or over the entire DataFrame at once ('table'). The 'table' option is only implemented when engine='numba' is specified in the method call, and can be helpful for large datasets.
Key Concept:
For large datasets, using 'table' (together with engine='numba') can improve performance by computing the operation over the whole table at once.
Example:
# Whole-table rolling operation (requires the numba engine)
result = df.rolling(window=3, method='table').mean(engine='numba')
SQL Equivalent:
SQL databases handle performance optimization internally, so there is no direct control over whether window functions are computed single-threaded or multi-threaded. The SQL query planner and optimizer will automatically determine the best execution strategy, depending on the database engine (PostgreSQL, MySQL, etc.) and its parallel processing capabilities.
Some databases (e.g., PostgreSQL) offer ways to control parallelism in general, but there is no direct equivalent to Pandas’ method parameter.
Fixed vs. Variable Windows
An important distinction is between fixed windows (based on a number of observations) and variable windows (based on time offsets):
Fixed Windows
- These are windows that contain a fixed number of observations, regardless of their timestamp or index.
- Example: rolling(window=3) will always calculate the rolling statistic over 3 consecutive data points.
Variable Windows (Offset-based)
- These windows are based on time intervals, which adjust dynamically based on the data’s timestamps.
- Example: rolling(window='7D') will include all data points within the last 7 days for each calculation, which may be more or fewer than a fixed number of rows.
Example:
# Fixed window of 3 observations
df['fixed_window_mean'] = df['value'].rolling(window=3).mean()
# Variable window of 7 days
df['variable_window_mean'] = df.rolling(window='7D', on='date')['value'].mean()
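The difference matters most for irregularly spaced data. A sketch with a deliberate gap in the dates (illustrative values) shows the two window types diverging:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-10']),
    'value': [1, 2, 3],
})

# Fixed window: always 2 observations, regardless of the 8-day gap.
fixed = df['value'].rolling(window=2).sum()                   # NaN, 3, 5

# Variable window: only observations within the last 7 days count,
# so the row on 2023-01-10 stands alone.
variable = df.rolling(window='7D', on='date')['value'].sum()  # 1, 3, 3
```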
Function Signature of DataFrame.expanding()
DataFrame.expanding(min_periods=1, axis=<no_default>, method='single')
This function returns a window object that enables expanding computations (e.g., cumulative sums, means) on the data. We’ll explore the three main parameters of this function: min_periods, axis, and method.
1. min_periods
- Type: int
- Default: 1
- Description: This parameter specifies the minimum number of observations in the expanding window required to have a valid result. If the number of observations is less than min_periods, the result will be NaN.
Key Concepts:
- Cumulative Calculations: In an expanding window, the window starts from the first observation and includes all previous values up to the current point. The min_periods parameter ensures that a result is computed only after a certain number of observations have been included in the window.
- Handling Missing Data: If your data contains missing values (NaN), you can use min_periods to control when the calculation will start producing non-NaN results. For instance, setting min_periods=3 means that the expanding window will only return a result starting with the third row, as there won’t be enough data points before that.
Example:
# Expanding window with a minimum of 3 observations
df['expanding_mean'] = df['value'].expanding(min_periods=3).mean()
In this example, the cumulative mean will only start being calculated at the third row. The first two rows will contain NaN.
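A runnable sketch (illustrative values) showing the NaN padding produced by min_periods on an expanding window:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# No result until the expanding window holds at least 3 observations.
expanding_mean = s.expanding(min_periods=3).mean()

print(expanding_mean.tolist())  # NaN, NaN, 20.0, 25.0
```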
Important Concept:
- When to Use: The min_periods parameter is especially important when working with datasets that may not have many observations in the early part of the series. It prevents early calculations when there isn’t enough data, ensuring that the results are statistically meaningful.
In SQL, similar cumulative operations can be achieved using window functions. SQL window functions, particularly cumulative aggregates like SUM(), AVG(), and COUNT() with the OVER clause, can replicate the core behavior of expanding windows. However, there are some differences in how SQL handles these operations compared to Pandas.
SQL Equivalent:
SQL does not have a direct equivalent for min_periods, but you can replicate the behavior using conditional logic (e.g., CASE statements) to check the number of observations in the window and return NULL if the count is below a certain threshold.
Example:
Suppose you want to calculate the cumulative sum in SQL but only return a result if there are at least 3 rows (equivalent to min_periods=3 in Pandas).
SELECT value,
CASE
WHEN COUNT(value) OVER (ORDER BY date) >= 3
THEN SUM(value) OVER (ORDER BY date)
ELSE NULL
END AS expanding_sum
FROM table;
In this query:
- COUNT(value) OVER (ORDER BY date) counts the number of rows up to the current row.
- The CASE statement checks if the count is greater than or equal to 3. If true, it calculates the cumulative sum using SUM() OVER; otherwise it returns NULL.
This mimics the behavior of min_periods in Pandas.
2. axis
- Type: int or str
- Default: 0
- Description: This parameter specifies the axis along which the expanding window is applied. It can be either:
  - 0 (or 'index') to apply the expanding operation row-wise (default),
  - 1 (or 'columns') to apply the expanding operation column-wise.
Key Concepts:
- Row-wise (axis=0): The expanding operation is applied to each column independently for each row. This is the most common use case, especially for time series, where you want to calculate cumulative statistics over time for each column.
- Column-wise (axis=1): The expanding operation is applied across the columns for each row. This can be useful in cases where the columns represent different variables, and you want to calculate cumulative statistics across them.
Example:
# Apply expanding mean row-wise (default)
df['expanding_mean'] = df['value'].expanding().mean()
# Apply expanding sum across columns (returns a DataFrame, not a single column)
expanding_sum_across_columns = df.expanding(axis=1).sum()
Important Concept:
- When to Use: In most time-series applications, you’ll use axis=0 because you’re typically interested in cumulative operations over rows (e.g., dates). Using axis=1 is less common but can be useful in cases where you have multiple features in columns and want to aggregate across them for each row.
3. method
- Type: str
- Options: {'single', 'table'}
- Default: 'single'
- Description: This parameter controls how the expanding operation is executed:
  - 'single': Performs the expanding operation over a single column or row at a time.
  - 'table': Executes the expanding operation over the entire DataFrame in one go, which can be more efficient depending on the operation and the size of the data. Note that 'table' is only implemented when engine='numba' is specified in the method call.
Key Concepts:
- Single vs. Table:
  - The 'single' method applies the expanding window function to each column or row individually. This is the default behavior and is often sufficient for smaller datasets or when working with one column at a time.
  - The 'table' method processes the entire DataFrame in one operation, which can be faster for large datasets but may require more memory and computational resources.
Example:
# Apply expanding sum using the single method (default)
df['expanding_sum'] = df['value'].expanding(method='single').sum()
# Apply expanding mean using the table method (requires the numba engine)
expanding_table_mean = df.expanding(method='table').mean(engine='numba')
Important Concept:
- When to Use: The method='table' option can be more efficient for large datasets, especially when performing operations on multiple columns, but it may not always be necessary for smaller datasets. Use the 'table' method when you’re working with a large DataFrame and need performance optimization.
Running Totals with Additional Conditions
Sometimes, you may need to calculate running totals based on specific conditions, such as filtering certain rows.
SQL Approach
In SQL, incorporating conditions into window functions can be achieved using CASE WHEN statements within the aggregate function or by applying a WHERE clause in a subquery.
Example Scenario: Calculate a running total of sales amounts, but only include sales from a specific region.
SELECT
date,
region,
amount,
SUM(CASE WHEN region = 'North' THEN amount ELSE 0 END) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total_north
FROM
sales
ORDER BY
date;
Explanation:
- CASE WHEN region = 'North' THEN amount ELSE 0 END: Ensures that only sales from the ‘North’ region are included in the sum.
- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: Defines the window from the start of the dataset up to the current row, creating a running total.
Pandas Approach
Pandas allows for more straightforward conditional operations using boolean indexing and the cumsum() function.
Example Scenario: Using the same sales data, calculate a running total for the ‘North’ region.
import pandas as pd
# Sample data
data = {
'date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
'region': ['North', 'South', 'North', 'East', 'North', 'West', 'North', 'South', 'North', 'East'],
'amount': [100, 150, 200, 130, 170, 160, 180, 190, 210, 220]
}
df = pd.DataFrame(data)
# Ensure data is sorted by date
df = df.sort_values('date')
# Calculate running total for 'North' region
df['running_total_north'] = (
df[df['region'] == 'North']
.sort_values('date')['amount']
.cumsum()
)
# Fill NaN for non-North regions: forward-fill the running total,
# then use 0 before the first 'North' sale
df['running_total_north'] = df['running_total_north'].ffill().fillna(0)
print(df)
Output:
date region amount running_total_north
0 2023-01-01 North 100 100.0
1 2023-01-02 South 150 100.0
2 2023-01-03 North 200 300.0
3 2023-01-04 East 130 300.0
4 2023-01-05 North 170 470.0
5 2023-01-06 West 160 470.0
6 2023-01-07 North 180 650.0
7 2023-01-08 South 190 650.0
8 2023-01-09 North 210 860.0
9 2023-01-10 East 220 860.0
Explanation:
- df[df['region'] == 'North']: Filters the DataFrame to include only rows from the ‘North’ region.
- .cumsum(): Calculates the cumulative sum over those rows.
- Forward filling then propagates the last running total to the non-North rows, and any remaining NaN values (before the first ‘North’ sale) are replaced with 0.
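An alternative sketch that mirrors the SQL CASE WHEN approach more directly: zero out the non-North amounts, then take a single cumulative sum over the whole column, which avoids the filter-then-fill dance entirely (the sample data matches the example above):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'region': ['North', 'South', 'North', 'East', 'North',
               'West', 'North', 'South', 'North', 'East'],
    'amount': [100, 150, 200, 130, 170, 160, 180, 190, 210, 220],
})

# Equivalent of SUM(CASE WHEN region = 'North' THEN amount ELSE 0 END) OVER (...):
# non-North amounts become 0, so cumsum() yields the running total on every row.
df['running_total_north'] = df['amount'].where(df['region'] == 'North', 0).cumsum()

print(df['running_total_north'].tolist())
```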