
Time Series Forecasting for Electricity Demand Using Pandas and LightGBM


Problem Overview

This challenge focuses on forecasting electricity consumption for multiple households using historical time-series data. Given sequences of past power usage labeled by household ID and day index (dt), the objective is to predict future target values — representing actual electricity demand.

The task falls under univariate time-series regression, where temporal dependencies, seasonality, and cross-household heterogeneity must be accounted for in modeling.
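Concretely, the data can be pictured as a long-format table with one row per household per day. The column names below follow the code later in this post; the household IDs and values are invented for illustration:

```python
import pandas as pd

# Hypothetical excerpt of train.csv: one row per household per day
train = pd.DataFrame({
    'id':     ['hh_001', 'hh_001', 'hh_002', 'hh_002'],  # household identifier (names invented)
    'dt':     [12, 11, 12, 11],                          # day index
    'target': [3.2, 3.5, 1.1, 0.9],                      # electricity demand
})
print(train)
```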


Baseline Solution: Static Historical Averaging

A minimal yet interpretable baseline computes the average target value per household over a recent window (days 11–20) and applies it uniformly across all test entries for that household:

import pandas as pd
import numpy as np

# Load datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Compute mean target per 'id' over dt ∈ [11, 20]
recent_window = train.query('11 <= dt <= 20').groupby('id')['target'].mean().reset_index(name='pred_mean')

# Attach predictions to test set via left join
submission = test.merge(recent_window, on='id', how='left')

# Export final submission
submission[['id', 'dt', 'pred_mean']].rename(columns={'pred_mean': 'target'}).to_csv('submit.csv', index=False)

Key operations:

  • .query() filters rows more readably than boolean indexing.
  • .groupby(...).mean().reset_index() aggregates and restores flat structure.
  • .merge(..., how='left') preserves all test rows, filling missing matches with NaN.

This approach assumes stationarity within the short horizon and serves as a performance floor.
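One edge case of the left join is worth sketching: a household with no rows in the averaging window gets NaN from the merge. A global-mean fallback (an addition here, not part of the original pipeline) keeps every test row populated:

```python
import pandas as pd

# Per-household means from the recent window; household 'c' has no history
recent_window = pd.DataFrame({'id': ['a', 'b'], 'pred_mean': [1.5, 2.5]})
test = pd.DataFrame({'id': ['a', 'b', 'c'], 'dt': [1, 1, 1]})

# Left join keeps all test rows; unmatched ids get NaN
submission = test.merge(recent_window, on='id', how='left')

# Fall back to the global mean so every row carries a prediction
submission['pred_mean'] = submission['pred_mean'].fillna(recent_window['pred_mean'].mean())
```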


Advanced Modeling with LightGBM

To capture non-linear patterns and interactions, we adopt LightGBM — a gradient-boosted decision tree framework optimized for speed and memory efficiency.

Data Preparation

We first unify and sort data chronologically per household, then engineer lagged and rolling features:

import pandas as pd
import numpy as np

# Concatenate and sort by dt descending per household
# (dt counts days back from the forecast point, so this is chronological order)
data = pd.concat([train, test], ignore_index=True)
data = data.sort_values(['id', 'dt'], ascending=[True, False]).reset_index(drop=True)

# Generate lag features: target values from 10–29 days prior
for lag in range(10, 30):
    data[f't_{lag}'] = data.groupby('id')['target'].shift(lag)

# Average the three most recent lags (t_10..t_12) as a short-horizon smoothing feature
data['t_3day_avg'] = data[['t_10', 't_11', 't_12']].mean(axis=1)

# Split back into train/test using presence of 'target'
train_df = data[data['target'].notna()].reset_index(drop=True)
test_df = data[data['target'].isna()].reset_index(drop=True)

# Define feature set (exclude identifiers and target)
feature_cols = [col for col in data.columns if col not in {'id', 'dt', 'type', 'target'}]

Note: because dt indexes days counting back from the forecast point (smaller dt is more recent), sorting dt in descending order places each household's rows in chronological order, so shift(lag) retrieves values from lag days earlier rather than leaking future targets.
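A toy check of that claim, with dt counting down toward the forecast point (dt=1 being the most recent day), shows shift(1) pulling the previous day's value rather than a future one:

```python
import pandas as pd

# One household; dt counts down toward the forecast point (dt=1 is newest)
df = pd.DataFrame({
    'id': ['a'] * 4,
    'dt': [4, 3, 2, 1],                 # already sorted descending, i.e. oldest first
    'target': [40.0, 30.0, 20.0, 10.0],
})

# shift(1) within each household pulls the value one row up: the previous day
df['t_1'] = df.groupby('id')['target'].shift(1)
print(df)
# The dt=1 row receives the dt=2 target (20.0), a past value, not a future one
```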

Train-Validation Split Strategy

Instead of random or temporal splits, we use date-based separation:

  • Training: samples with dt >= 31
  • Validation: samples with dt <= 30

This mirrors real-world deployment: a model fit on older observations (larger dt) predicts the most recent unseen ones (smaller dt), preserving temporal integrity.
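A quick sanity check that the two dt ranges partition the training data cleanly (the dt span below is assumed for illustration):

```python
import pandas as pd

# Toy training frame spanning dt 11..50
train_df = pd.DataFrame({'dt': range(11, 51)})

train_mask = train_df['dt'] >= 31   # older observations: model fitting
valid_mask = train_df['dt'] <= 30   # most recent observations: validation

# The masks are disjoint and together cover every row
assert not (train_mask & valid_mask).any()
assert (train_mask | valid_mask).all()
print(train_mask.sum(), valid_mask.sum())  # 20 20
```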

Model Configuration and Training

import lightgbm as lgb
from sklearn.metrics import mean_squared_error

# Define feature and label subsets
X_trn = train_df.query('dt >= 31')[feature_cols]
y_trn = train_df.query('dt >= 31')['target']
X_val = train_df.query('dt <= 30')[feature_cols]
y_val = train_df.query('dt <= 30')['target']

# Construct LightGBM datasets
dtrain = lgb.Dataset(X_trn, label=y_trn)
dvalid = lgb.Dataset(X_val, label=y_val, reference=dtrain)

# Hyperparameters tuned for stability and generalization
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 32,
    'learning_rate': 0.045,
    'feature_fraction': 0.75,
    'bagging_fraction': 0.85,
    'bagging_freq': 5,
    'lambda_l2': 8.0,
    'min_data_in_leaf': 20,
    'seed': 42,
    'verbose': -1
}

# Train with early stopping
model = lgb.train(
    params,
    dtrain,
    valid_sets=[dtrain, dvalid],
    num_boost_round=10000,
    callbacks=[lgb.early_stopping(stopping_rounds=500, verbose=True)]
)

# Predict and evaluate
y_val_pred = model.predict(X_val)
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
print(f'Validation RMSE: {val_rmse:.4f}')

# Apply to test set
test_predictions = model.predict(test_df[feature_cols])
test_df['target'] = test_predictions

test_df[['id', 'dt', 'target']].to_csv('submit.csv', index=False)

Critical considerations:

  • rmse is used as the primary metric during training and evaluation.
  • early_stopping prevents overfitting by halting training if the validation metric fails to improve for 500 consecutive rounds.
  • Categorical variables like type are omitted here but could be encoded and added to feature_cols for richer modeling.

Feature importance can be inspected via model.feature_importance() to guide iterative engineering.
