Advanced Pandas Data Analysis: Window Functions, Correlation, and Indexing
DataFrame Window Calculations
The rolling method in Pandas enables sliding window calculations, which are essential for smoothing time-series data or identifying trends over a specific interval. For instance, calculating a 5-day moving average for stock prices involves defining a window size and applying an aggregation function like mean().
To demonstrate this without relying on external live feeds, we can simulate a stock price dataset.
import pandas as pd
import numpy as np
# Simulate stock price data
dates = pd.date_range(start='2023-01-01', periods=20, freq='D')
prices = np.random.randint(100, 150, size=20)
stock_df = pd.DataFrame({'Date': dates, 'Price': prices})
# Set Date as index
stock_df.set_index('Date', inplace=True)
# Calculate 5-day moving average
stock_df['5_Day_MA'] = stock_df['Price'].rolling(window=5).mean()
print(stock_df.head(10))
The first window − 1 rows of the resulting moving-average column will contain NaN (Not a Number) because there are not yet enough data points to fill the window at the beginning of the series.
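If partial windows are acceptable, the optional min_periods parameter (not used in the example above) lets the average start from the very first row, averaging over however many points exist so far:

```python
import pandas as pd
import numpy as np

dates = pd.date_range(start='2023-01-01', periods=20, freq='D')
prices = np.random.randint(100, 150, size=20)
stock_df = pd.DataFrame({'Price': prices}, index=dates)

# With min_periods=1 the first row averages over a single point,
# the second over two, and so on, so no NaN values appear
stock_df['5_Day_MA'] = stock_df['Price'].rolling(window=5, min_periods=1).mean()
print(stock_df.head())
```

This trades the NaN padding for partial-window averages at the start, which may or may not be appropriate depending on how the smoothed series is used.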
Statistical Correlation Analysis
In statistical analysis, determining the relationship between variables is crucial. Covariance measures how two variables vary together, but its magnitude is often hard to interpret directly. Therefore, the Pearson correlation coefficient is commonly used to normalize this measure, providing a value between -1 and 1.
- Positive Correlation (ρ > 0): Variables move in the same direction.
- Negative Correlation (ρ < 0): Variables move in opposite directions.
- Strength: Values closer to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak or no linear relationship.
The Pandas DataFrame provides cov() for covariance and corr() for correlation. By default, corr() computes the Pearson correlation, but you can also pass method='kendall' or method='spearman' for non-parametric data.
The following example uses a synthetic real estate dataset to calculate correlations between property features.
# Synthetic real estate data
real_estate_data = {
'Area_sqft': [1500, 2500, 1800, 3200, 1200],
'Age_Years': [10, 5, 15, 2, 20],
'Distance_km': [5.5, 10.2, 2.1, 15.5, 1.0],
'Price': [300000, 550000, 350000, 700000, 250000]
}
prop_df = pd.DataFrame(real_estate_data)
# Calculate Pearson Correlation
pearson_corr = prop_df.corr(method='pearson')
print(pearson_corr)
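For comparison, the unnormalized covariance matrix discussed above can be computed on the same synthetic dataset. Note how its magnitudes depend on the units of each feature, which is exactly why the normalized correlation is easier to interpret:

```python
import pandas as pd

real_estate_data = {
    'Area_sqft': [1500, 2500, 1800, 3200, 1200],
    'Age_Years': [10, 5, 15, 2, 20],
    'Distance_km': [5.5, 10.2, 2.1, 15.5, 1.0],
    'Price': [300000, 550000, 350000, 700000, 250000]
}
prop_df = pd.DataFrame(real_estate_data)

# Covariance shares the sign of the correlation, but its magnitude
# reflects the raw units (sqft * dollars, years * dollars, ...)
cov_matrix = prop_df.cov()
print(cov_matrix)
```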
For data that does not meet normality assumptions or contains ordinal data, the Spearman rank correlation is often preferred.
# Calculate Spearman Correlation
spearman_corr = prop_df.corr(method='spearman')
# Visualize correlation for the 'Price' column
styled_corr = spearman_corr.style.background_gradient(cmap='coolwarm', subset=['Price'])
styled_corr
The background_gradient method highlights the strength of the relationship using a color scale, making it easier to identify which features are most influential on the target variable.
Advanced Indexing Structures
Pandas provides various Index types to optimize data storage and retrieval depending on the data nature.
RangeIndex
RangeIndex is an optimized memory-saving index for sequential integer data. It is the default index for most DataFrames.
month_idx = pd.RangeIndex(start=1, stop=13, name='Month')
sales_series = pd.Series(np.random.randint(100, 500, 12), index=month_idx)
print(sales_series.head())
CategoricalIndex
This index type is useful when the index consists of a fixed set of categories, offering performance improvements and memory efficiency.
categories = ['Electronics', 'Furniture', 'Clothing']
cat_idx = pd.CategoricalIndex(
['Electronics', 'Furniture', 'Electronics', 'Clothing'],
categories=categories,
ordered=True
)
inventory = pd.Series([10, 5, 8, 20], index=cat_idx)
print(inventory)
Grouping operations become very efficient with categorical indexes.
print(inventory.groupby(level=0).sum())
MultiIndex
MultiIndex (or hierarchical indexing) allows for multiple index levels on a single axis, enabling representation of higher-dimensional data in a 2D structure. A common way to create one is the from_product class method, which forms a Cartesian product of iterables.
students = ['S101', 'S102']
terms = ['Midterm', 'Final']
subjects = ['Math', 'Science', 'History']
# Create Hierarchical Index
h_index = pd.MultiIndex.from_product([students, terms], names=['Student_ID', 'Term'])
# Generate random grades
grades = np.random.randint(60, 100, size=(4, 3))
report_card = pd.DataFrame(grades, index=h_index, columns=subjects)
print(report_card)
You can perform complex aggregations on specific levels. For example, calculating a final grade where the Final exam counts for 70% and the Midterm for 30%:
def calc_final(x):
    # Within each student group, rows follow the Term order used in
    # from_product: iloc[0] is Midterm (30%), iloc[1] is Final (70%)
    return x.iloc[0] * 0.3 + x.iloc[1] * 0.7
final_grades = report_card.groupby(level=0).apply(calc_final)
print(final_grades)
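Beyond aggregation, individual levels can also be sliced directly. A small sketch using the same report card structure, selecting all 'Final' rows with the cross-section method xs:

```python
import pandas as pd
import numpy as np

students = ['S101', 'S102']
terms = ['Midterm', 'Final']
subjects = ['Math', 'Science', 'History']
h_index = pd.MultiIndex.from_product([students, terms], names=['Student_ID', 'Term'])
grades = np.random.randint(60, 100, size=(4, 3))
report_card = pd.DataFrame(grades, index=h_index, columns=subjects)

# Select every 'Final' row across all students without a boolean mask;
# the 'Term' level is dropped from the result's index
finals = report_card.xs('Final', level='Term')
print(finals)
```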
DatetimeIndex
Time-series analysis relies heavily on DatetimeIndex.
1. Generation: The date_range function creates sequences of dates.
# Generate daily dates
dates = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
print(dates)
2. Time Offsets: DateOffset allows for arithmetic operations on dates that respect calendar logic (e.g., adding months).
offset_dates = dates + pd.DateOffset(days=2)
print(offset_dates)
3. Manipulation Methods:
- shift(n): Moves data forward or backward by n periods, useful for calculating day-over-day changes.
- asfreq(freq): Converts the time series to a specified frequency, filling missing values if necessary.
- resample(rule): A powerful grouping operation based on time frequency (e.g., 'M' for month, 'W' for week).
# Create a time series DataFrame
ts_df = pd.DataFrame({'Value': range(10)}, index=dates)
# Shift data by 1 day
ts_df['Shifted'] = ts_df['Value'].shift(1)
# Resample to 3-day frequency and calculate sum
resampled = ts_df.resample('3D').sum()
print(resampled)
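The asfreq method, listed above but not shown, differs from resample in that it selects the values falling at the new frequency rather than aggregating them. A quick sketch on the same daily series:

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
ts_df = pd.DataFrame({'Value': range(10)}, index=dates)

# Keep every second day; rows between the new timestamps are simply
# dropped rather than summed or averaged
every_other = ts_df.asfreq('2D')
print(every_other)
```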
4. Time Zones: Pandas supports timezone localization and conversion using tz_localize and tz_convert.
# Localize to UTC
ts_utc = ts_df.tz_localize('UTC')
# Convert to US/Eastern
ts_eastern = ts_utc.tz_convert('US/Eastern')
print(ts_eastern.head())