

Python Data Analysis and Application Homework: Processing User Electricity Consumption Data


This homework uses NumPy, Matplotlib, and pandas to process electricity consumption data from the file data.csv for 200 users (IDs 1-200). The dataset has three columns: CONS_NO (user ID), DATA_DATE (date, e.g., 2015/1/1), and KWH (electricity consumption). Tasks are as follows:

  1. Transpose data: Rows become user IDs, columns become dates, values become daily consumption.
  2. Identify and handle anomalies.
  3. Compute basic statistics per user: min, max, mean, median, sum, variance, skewness, kurtosis.
  4. Daily differencing per user and compute basic statistics of the differences.
  5. Compute the 5th percentile for each user.
  6. Weekly aggregation and differencing (7-day weeks, separate years) and compute statistics of the differences.
  7. Count days where daily consumption is ≥ 0.9 times the user's maximum.
  8. Find the month with the most occurrences of the maximum/minimum value per user (if multiple, choose the month with the highest count).
  9. Compute ratios of July-August total to March-April total, maxima, minima, and means.
  10. Combine all derived features.
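
The original data.csv is not provided with the article. To run the snippets below, a synthetic stand-in with the same three columns can be generated (the Gamma parameters are arbitrary, and the output path should be adjusted to match your layout):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data.csv: 200 users, daily readings for 2015-2016.
rng = np.random.default_rng(0)
dates = pd.date_range('2015-01-01', '2016-12-31', freq='D')
frames = [
    pd.DataFrame({
        'CONS_NO': cons_no,
        'DATA_DATE': dates,
        'KWH': rng.gamma(shape=2.0, scale=5.0, size=len(dates)),
    })
    for cons_no in range(1, 201)
]
data = pd.concat(frames, ignore_index=True)
data.to_csv('data.csv', index=False)
```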

1. Transpose Data

import pandas as pd

data = pd.read_csv('./data/data.csv', encoding='gbk')
data['DATA_DATE'] = pd.to_datetime(data['DATA_DATE'])
result = pd.pivot_table(data, index='CONS_NO', columns='DATA_DATE', values='KWH')
print(result)

The date column is converted to datetime to ensure chronological ordering in the transposed table.
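
On a toy frame the reshape looks like this; note that pivot_table's default aggfunc is 'mean', which is harmless here because each user/date pair occurs exactly once, but would average silently if a pair were ever duplicated:

```python
import pandas as pd

# Toy frame: two users, two days each.
demo = pd.DataFrame({
    'CONS_NO': [1, 1, 2, 2],
    'DATA_DATE': pd.to_datetime(['2015-01-01', '2015-01-02'] * 2),
    'KWH': [1.0, 2.0, 3.0, 4.0],
})
# Rows become users, columns become dates, values become daily consumption.
wide = pd.pivot_table(demo, index='CONS_NO', columns='DATA_DATE', values='KWH')
```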

2. Anomaly Detection and Handling

# Missing value identification and forward fill
null_count = data.isnull().sum()
print("Missing values before:", null_count)
# fillna(method='pad') is deprecated; filling within each user's group also
# avoids carrying the last reading of one user into the next user's series.
data['KWH'] = data.groupby('CONS_NO')['KWH'].ffill()
null_count = data.isnull().sum()
print("Missing values after:", null_count)

# Outlier detection using 3-sigma rule
u = data['KWH'].mean()
sigma = data['KWH'].std()
outliers = (data['KWH'] > u + 3 * sigma) | (data['KWH'] < u - 3 * sigma)
print("Outliers based on 3-sigma:", data.loc[outliers, 'KWH'])

Missing values are filled with the previous valid value (forward fill). Outliers are flagged with the 3σ rule but left in place.
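
If the flagged outliers should also be treated, one option (illustrative only, not part of the original solution) is to mask them as missing and forward-fill, mirroring the missing-value strategy above:

```python
import numpy as np
import pandas as pd

# Toy series: 1000 normal readings with one planted outlier.
rng = np.random.default_rng(1)
s = pd.Series(rng.normal(10.0, 1.0, size=1000))
s.iloc[100] = 50.0  # obvious outlier

u, sigma = s.mean(), s.std()
# mask() turns flagged values into NaN; ffill() then repairs them.
cleaned = s.mask((s < u - 3 * sigma) | (s > u + 3 * sigma)).ffill()
```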

3. Basic Statistics per User

def compute_statistics(df):
    stats = pd.concat([df.min(), df.max(), df.mean(), df.median(), df.sum(), df.var(), df.skew(), df.kurt()], axis=1)
    stats.columns = ['min', 'max', 'mean', 'median', 'sum', 'var', 'skew', 'kurt']
    return stats

print(compute_statistics(result.T))

The transposed DataFrame is transposed again (so rows=users) and passed to a custom function that returns a DataFrame with all requested statistics.
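
The same table can also be built in one call with DataFrame.agg, passing the method names as strings (shown on a small numeric frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3))
# agg computes each named statistic per column; .T puts one row per column,
# one statistic per column of the result, matching compute_statistics above.
stats = df.agg(['min', 'max', 'mean', 'median', 'sum', 'var', 'skew', 'kurt']).T
```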

4. Daily Differencing and Statistics

diff_daily = result.T.diff(1)
print("Daily differences:", diff_daily)
print("Statistics of differences:", compute_statistics(diff_daily))

5. 5th Percentile

print("5th percentile:", result.T.describe(percentiles=[0.05]))
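
describe prints a full summary table; when only the 5th percentile is wanted, DataFrame.quantile returns it directly (toy example with two fake users as columns):

```python
import pandas as pd

df = pd.DataFrame({'u1': range(1, 101), 'u2': range(101, 201)})
# One value per column; equals the '5%' row of describe(percentiles=[0.05]).
p5 = df.quantile(0.05)
```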

6. Weekly Aggregation and Differencing

week_period = pd.PeriodIndex(data['DATA_DATE'], freq='W')
# Restrict to KWH so the datetime column is not summed, then unstack users into columns
weekly_sum = data.groupby(by=['CONS_NO', week_period])['KWH'].sum()
weekly_pivot = weekly_sum.unstack(level='CONS_NO')  # rows: weeks, columns: users
weekly_diff = weekly_pivot.diff(1)
print("Weekly differences:", weekly_diff)
print("Statistics of weekly differences:", compute_statistics(weekly_diff))

Note: Years are not separated in this grouping.
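
To honour the "separate years" requirement, one option (a sketch, not the original solution) is to group by year plus a 7-day block index that restarts every January 1, instead of calendar weeks that can straddle the year boundary:

```python
import numpy as np
import pandas as pd

# Toy frame: one user, 20 consecutive days spanning a year boundary.
toy = pd.DataFrame({
    'CONS_NO': 1,
    'DATA_DATE': pd.date_range('2015-12-25', periods=20, freq='D'),
    'KWH': np.ones(20),
})
year = toy['DATA_DATE'].dt.year.rename('YEAR')
# (day-of-year - 1) // 7 gives 7-day blocks 0, 1, 2, ... within each year.
week = ((toy['DATA_DATE'].dt.dayofyear - 1) // 7).rename('WEEK')
weekly = toy.groupby(['CONS_NO', year, week])['KWH'].sum()
```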

7. Count Days with Consumption ≥ 0.9 * Max

count_high = result.T.apply(lambda x: x >= x.max() * 0.9).sum()
print("Counts of high consumption days:", count_high)

8. Month with Most Max/Min Occurrences

# For maximum values: keep rows where KWH equals that user's own maximum,
# then count occurrences per month and pick the month with the highest count
max_rows = data[data['KWH'] == data.groupby('CONS_NO')['KWH'].transform('max')]
max_month_period = pd.PeriodIndex(max_rows['DATA_DATE'], freq='M')
max_count = max_rows.groupby(by=['CONS_NO', max_month_period])['KWH'].count()
max_count_df = pd.DataFrame(max_count)
max_count_idx = max_count_df.reset_index().groupby('CONS_NO')['KWH'].idxmax()
max_month_values = max_count_df.iloc[max_count_idx]
max_month_values.columns = ['Max Count']
print("Month with most max values:", max_month_values)

# For minimum values: same idea, using each user's own minimum
# (a global data['KWH'].min() would only cover the single lowest reading)
min_rows = data[data['KWH'] == data.groupby('CONS_NO')['KWH'].transform('min')]
min_month_period = pd.PeriodIndex(min_rows['DATA_DATE'], freq='M')
min_count = min_rows.groupby(by=['CONS_NO', min_month_period])['KWH'].count()
min_count_df = pd.DataFrame(min_count)
min_count_idx = min_count_df.reset_index().groupby('CONS_NO')['KWH'].idxmax()
min_month_values = min_count_df.iloc[min_count_idx]
min_month_values.columns = ['Min Count']
print("Month with most min values:", min_month_values)

9. Ratios of Summer to Spring

def filter_dates(df):
    idx = pd.IndexSlice
    summer = df.loc[idx[:, ['2015-7', '2015-8', '2016-7', '2016-8']], :]
    spring = df.loc[idx[:, ['2015-3', '2015-4', '2016-3', '2016-4']], :]
    return summer, spring

def merge_ratios(df_summer, df_spring, name):
    merged = pd.merge(df_summer, df_spring, on='CONS_NO')
    merged.columns = ['Jul-Aug', 'Mar-Apr']
    merged[name] = merged['Jul-Aug'] / merged['Mar-Apr']
    return merged

key = pd.PeriodIndex(data['DATA_DATE'], freq='M')
monthly = data.groupby(by=['CONS_NO', key])[['KWH']]  # keep only KWH so aggregations skip the date column

# Sum ratio
monthly_sum = monthly.sum()
s_summer, s_spring = filter_dates(monthly_sum)
sum_summer = s_summer.groupby('CONS_NO').sum()
sum_spring = s_spring.groupby('CONS_NO').sum()
sum_ratio = merge_ratios(sum_summer, sum_spring, 'sum_ratio')
print("Sum ratio:", sum_ratio)

# Max ratio
monthly_max = monthly.max()
s_max_summer, s_max_spring = filter_dates(monthly_max)
max_summer = s_max_summer.groupby('CONS_NO').max()['KWH']
max_spring = s_max_spring.groupby('CONS_NO').max()['KWH']
max_ratio = merge_ratios(max_summer, max_spring, 'max_ratio')
print("Max ratio:", max_ratio)

# Min ratio
monthly_min = monthly.min()
s_min_summer, s_min_spring = filter_dates(monthly_min)
min_summer = s_min_summer.groupby('CONS_NO').min()['KWH']
min_spring = s_min_spring.groupby('CONS_NO').min()['KWH']
min_ratio = merge_ratios(min_summer, min_spring, 'min_ratio')
print("Min ratio:", min_ratio)

# Mean ratio (daily averages: Jul-Aug spans 124 days over 2015-2016, Mar-Apr 122)
monthly_sum_means = monthly.sum()
s_mean_summer, s_mean_spring = filter_dates(monthly_sum_means)
mean_summer = s_mean_summer.groupby('CONS_NO').apply(lambda x: x.sum() / 124)
mean_spring = s_mean_spring.groupby('CONS_NO').apply(lambda x: x.sum() / 122)
mean_ratio = merge_ratios(mean_summer, mean_spring, 'mean_ratio')
print("Mean ratio:", mean_ratio)

10. Combine Features

all_features = pd.concat([
    sum_ratio['sum_ratio'],
    max_ratio['max_ratio'],
    min_ratio['min_ratio'],
    mean_ratio['mean_ratio']
], axis=1)
print("Combined features:", all_features)

Note: The original solution consolidated only the ratio features from task 9 rather than all features from tasks 1-9. The code above follows the same approach.
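
Consolidating every feature from tasks 2-9 would follow the same pattern: give each derived frame a CONS_NO index and concatenate along axis=1. The frames below are hypothetical placeholders standing in for the real per-user statistics and ratios:

```python
import pandas as pd

# Hypothetical feature frames, all indexed by user ID (CONS_NO).
users = pd.Index([1, 2, 3], name='CONS_NO')
daily_stats = pd.DataFrame({'mean': [1.0, 2.0, 3.0], 'var': [0.1, 0.2, 0.3]}, index=users)
ratios = pd.DataFrame({'sum_ratio': [1.1, 0.9, 1.0]}, index=users)

# concat aligns on the shared CONS_NO index and joins the columns side by side.
features = pd.concat([daily_stats, ratios], axis=1)
```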
