Time Series Forecasting for Electricity Demand Using Pandas and LightGBM
Problem Overview
This challenge focuses on forecasting electricity consumption for multiple households using historical time-series data. Given sequences of past power usage labeled by household ID and day index (dt), the objective is to predict future target values — representing actual electricity demand.
The task falls under univariate time-series regression, where temporal dependencies, seasonality, and cross-household heterogeneity must be accounted for in modeling.
Baseline Solution: Static Historical Averaging
A minimal yet interpretable baseline computes the average target value per household over a recent window (days 11–20) and applies it uniformly across all test entries for that household:
import pandas as pd
import numpy as np
# Load datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Compute mean target per 'id' over dt ∈ [11, 20]
recent_window = train.query('11 <= dt <= 20').groupby('id')['target'].mean().reset_index(name='pred_mean')
# Attach predictions to test set via left join
submission = test.merge(recent_window, on='id', how='left')
# Export final submission
submission[['id', 'dt', 'pred_mean']].rename(columns={'pred_mean': 'target'}).to_csv('submit.csv', index=False)
Key operations:
- `.query()` filters rows more readably than boolean indexing.
- `.groupby(...).mean().reset_index()` aggregates and restores a flat structure.
- `.merge(..., how='left')` preserves all test rows, filling missing matches with `NaN`.
This approach assumes stationarity within the short horizon and serves as a performance floor.
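As a quick sanity check, the baseline pipeline can be exercised on a tiny synthetic dataset (the household IDs and target values below are made up for illustration):

```python
import pandas as pd

# Toy training data: two households with 30 days each (hypothetical values)
train = pd.DataFrame({
    'id': ['a'] * 30 + ['b'] * 30,
    'dt': list(range(1, 31)) * 2,
    'target': [2.0] * 30 + [5.0] * 30,
})
test = pd.DataFrame({'id': ['a', 'b'], 'dt': [31, 31]})

# Same steps as the baseline: window mean per id, left-joined onto test
recent = (train.query('11 <= dt <= 20')
               .groupby('id')['target'].mean()
               .reset_index(name='pred_mean'))
submission = test.merge(recent, on='id', how='left')
print(submission['pred_mean'].tolist())  # household 'a' -> 2.0, 'b' -> 5.0
```

Each household's prediction is simply its own window mean, so constant series reproduce their constant exactly.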
Advanced Modeling with LightGBM
To capture non-linear patterns and interactions, we adopt LightGBM — a gradient-boosted decision tree framework optimized for speed and memory efficiency.
Data Preparation
We first unify and sort data chronologically per household, then engineer lagged and rolling features:
import pandas as pd
import numpy as np
# Concatenate, then sort each household from oldest to newest
# (dt counts days back from the forecast date, so descending dt = oldest first)
data = pd.concat([train, test], ignore_index=True)
data = data.sort_values(['id', 'dt'], ascending=[True, False]).reset_index(drop=True)
# Generate lag features: target values from 10–29 days prior
for lag in range(10, 30):
    data[f't_{lag}'] = data.groupby('id')['target'].shift(lag)
# Average the three nearest lags (t_10, t_11, t_12) into a short-window feature
data['t_3day_avg'] = data[['t_10', 't_11', 't_12']].mean(axis=1)
# Split back into train/test using presence of 'target'
train_df = data[data['target'].notna()].reset_index(drop=True)
test_df = data[data['target'].isna()].reset_index(drop=True)
# Define feature set (exclude identifiers and target)
feature_cols = [col for col in data.columns if col not in {'id', 'dt', 'type', 'target'}]
Note: Since dt indexes days back in time, sorting it in descending order places each household's oldest rows first, so shift(lag) retrieves the value from lag days earlier within each id group.
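The lag construction can be verified on a toy series. Assuming dt counts days back from the forecast date, descending sort puts the oldest row first, so shift(1) pulls the value from one day earlier (the values below are invented; the full pipeline uses lags 10–29 in the same way):

```python
import pandas as pd

# One household, dt = 3 (oldest) .. 1 (newest), with made-up targets
df = pd.DataFrame({'id': ['a', 'a', 'a'],
                   'dt': [1, 2, 3],
                   'target': [10.0, 20.0, 30.0]})
df = df.sort_values(['id', 'dt'], ascending=[True, False]).reset_index(drop=True)

# shift(1) within each id fetches the previous (older) row's target
df['t_1'] = df.groupby('id')['target'].shift(1)
print(df[['dt', 'target', 't_1']])
# dt=3 has no older row (NaN); dt=2 sees 30.0; dt=1 sees 20.0
```

The NaN in the oldest row shows why early rows lack lag features; LightGBM handles those missing values natively.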
Train-Validation Split Strategy
Instead of a random split, we separate by date:

- Training: samples with `dt >= 31`
- Validation: samples with `dt <= 30`

Because dt counts backward in time, `dt >= 31` selects the older observations and `dt <= 30` the most recent ones. This mimics real-world deployment, where a model fit on past data predicts newer, unseen periods — preserving temporal integrity.
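A quick check confirms the two subsets partition the data cleanly (the dt range below is illustrative):

```python
import pandas as pd

# Illustrative training frame with dt spanning 1-60
train_df = pd.DataFrame({'dt': range(1, 61), 'target': [0.0] * 60})

trn_mask = train_df['dt'] >= 31
val_mask = train_df['dt'] <= 30

# The masks are disjoint and together cover every row
assert not (trn_mask & val_mask).any()
assert (trn_mask | val_mask).all()
print(int(trn_mask.sum()), int(val_mask.sum()))  # 30 training rows, 30 validation rows
```

Keeping the boundary strictly between 30 and 31 guarantees no sample leaks into both sets.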
Model Configuration and Training
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
# Define feature and label subsets
X_trn = train_df.query('dt >= 31')[feature_cols]
y_trn = train_df.query('dt >= 31')['target']
X_val = train_df.query('dt <= 30')[feature_cols]
y_val = train_df.query('dt <= 30')['target']
# Construct LightGBM datasets
dtrain = lgb.Dataset(X_trn, label=y_trn)
dvalid = lgb.Dataset(X_val, label=y_val, reference=dtrain)
# Hyperparameters tuned for stability and generalization
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 32,
    'learning_rate': 0.045,
    'feature_fraction': 0.75,
    'bagging_fraction': 0.85,
    'bagging_freq': 5,
    'lambda_l2': 8.0,
    'min_data_in_leaf': 20,
    'seed': 42,
    'verbose': -1
}
# Train with early stopping
model = lgb.train(
    params,
    dtrain,
    valid_sets=[dtrain, dvalid],
    num_boost_round=10000,
    callbacks=[lgb.early_stopping(stopping_rounds=500, verbose=True)]
)
# Predict and evaluate
y_val_pred = model.predict(X_val)
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
print(f'Validation RMSE: {val_rmse:.4f}')
# Apply to test set
test_predictions = model.predict(test_df[feature_cols])
test_df['target'] = test_predictions
test_df[['id', 'dt', 'target']].to_csv('submit.csv', index=False)
Critical considerations:
- `rmse` is used as the primary metric during training and evaluation.
- `early_stopping` prevents overfitting by halting optimization if validation loss stagnates for 500 rounds.
- Categorical variables like `type` are omitted here but could be encoded and added to `feature_cols` for richer modeling.
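As a sketch of that encoding step, `type` could be mapped to integer codes with pandas before training (the column values below are hypothetical):

```python
import pandas as pd

# Hypothetical household metadata with a categorical 'type' column
data = pd.DataFrame({'id': [1, 2, 3], 'type': ['urban', 'rural', 'urban']})

# Map each category to a stable integer code
# (categories are sorted alphabetically, so rural -> 0, urban -> 1)
data['type_code'] = data['type'].astype('category').cat.codes
print(data['type_code'].tolist())  # [1, 0, 1]
```

The encoded column could then be appended to `feature_cols`; LightGBM also accepts a `categorical_feature` argument on `lgb.Dataset` so the encoded column is split on as a category rather than as an ordered number.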
Feature importance can be inspected via `model.feature_importance()` to guide iterative feature engineering.
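Pairing the importance array with feature names makes the ranking readable. The sketch below assumes a trained booster; a made-up importance array stands in for `model.feature_importance()` so it runs standalone:

```python
import numpy as np
import pandas as pd

feature_cols = ['t_10', 't_11', 't_12', 't_3day_avg']  # as engineered above
# In practice: importances = model.feature_importance(importance_type='gain')
importances = np.array([120.0, 45.0, 30.0, 210.0])  # made-up values

# Rank features from most to least informative
ranking = (pd.Series(importances, index=feature_cols)
             .sort_values(ascending=False))
print(ranking.index.tolist())  # ['t_3day_avg', 't_10', 't_11', 't_12']
```

Gain-based importance (`importance_type='gain'`) usually separates useful lags from noise more sharply than the default split-count measure.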