Python Data Analysis and Application Homework: Processing User Electricity Consumption Data
This homework applies NumPy, Matplotlib, and pandas to electricity consumption data in data.csv for 200 users (IDs 1-200). The dataset has three columns: CONS_NO (user ID), DATA_DATE (date, e.g., 2015/1/1), and KWH (daily electricity consumption). Tasks are as follows:
- Transpose data: Rows become user IDs, columns become dates, values become daily consumption.
- Identify and handle anomalies.
- Compute basic statistics per user: min, max, mean, median, sum, variance, skewness, kurtosis.
- Daily differencing per user and compute basic statistics of the differences.
- Compute the 5th percentile for each user.
- Weekly aggregation and differencing (7-day weeks, separate years) and compute statistics of the differences.
- Count days where daily consumption is ≥ 0.9 times the user's maximum.
- Find the month with the most occurrences of the maximum/minimum value per user (if multiple, choose the month with the highest count).
- Compute July-August to March-April ratios for the totals, maxima, minima, and daily means.
- Combine all derived features.
1. Transpose Data
import pandas as pd
data = pd.read_csv('./data/data.csv', encoding='gbk')
data['DATA_DATE'] = pd.to_datetime(data['DATA_DATE'])
result = pd.pivot_table(data, index='CONS_NO', columns='DATA_DATE', values='KWH')
print(result)
The date column is converted to datetime to ensure chronological ordering in the transposed table.
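As a quick sanity check, the same pivot on a four-row toy frame (values are made up) yields one row per user and one column per date:

```python
import pandas as pd

# Toy data in the same three-column shape as data.csv (values illustrative)
toy = pd.DataFrame({
    'CONS_NO': [1, 1, 2, 2],
    'DATA_DATE': pd.to_datetime(['2015/1/1', '2015/1/2', '2015/1/1', '2015/1/2']),
    'KWH': [1.5, 2.0, 0.5, 0.7],
})
# One cell per (user, date); with unique pairs the default aggfunc ('mean') is a no-op
pivot = pd.pivot_table(toy, index='CONS_NO', columns='DATA_DATE', values='KWH')
print(pivot)
```

With 200 users and two years of dates, the real table has the same structure at 200 rows by roughly 730 columns.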
2. Anomaly Detection and Handling
# Missing value identification and forward fill
null_count = data.isnull().sum()
print("Missing values before:", null_count)
data = data.ffill()  # forward fill; fillna(method='pad') is deprecated in recent pandas
null_count = data.isnull().sum()
print("Missing values after:", null_count)
# Outlier detection using 3-sigma rule
u = data['KWH'].mean()
sigma = data['KWH'].std()
outliers = (data['KWH'] > u + 3 * sigma) | (data['KWH'] < u - 3 * sigma)
print("Outliers based on 3-sigma:", data.loc[outliers, 'KWH'])
Missing KWH values are forward-filled with the previous valid reading; note that a plain forward fill can carry a value across a user boundary, so sorting by CONS_NO and DATA_DATE first is advisable. Outliers are flagged with the 3σ rule but deliberately left in place.
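The 3σ rule can be verified on a synthetic series with one obvious spike (values illustrative; with very short series the spike inflates σ enough to hide itself, so a reasonably long baseline is used):

```python
import pandas as pd

# 30 ordinary readings plus one spike
kwh = pd.Series([1.0] * 30 + [50.0])
u, sigma = kwh.mean(), kwh.std()

# Vectorized 3-sigma mask, equivalent to the apply() version in the text
mask = (kwh > u + 3 * sigma) | (kwh < u - 3 * sigma)
print(kwh[mask])
```

Only the spike is flagged; all baseline readings fall inside the 3σ band.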
3. Basic Statistics per User
def compute_statistics(df):
    stats = pd.concat([df.min(), df.max(), df.mean(), df.median(),
                       df.sum(), df.var(), df.skew(), df.kurt()], axis=1)
    stats.columns = ['min', 'max', 'mean', 'median', 'sum', 'var', 'skew', 'kurt']
    return stats
print(compute_statistics(result.T))
The transposed DataFrame is transposed again (so rows=users) and passed to a custom function that returns a DataFrame with all requested statistics.
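The shape of the result is easy to confirm on a toy table with two users and four days (values illustrative):

```python
import pandas as pd

def compute_statistics(df):
    # One output row per column of df (per user), one output column per statistic
    stats = pd.concat([df.min(), df.max(), df.mean(), df.median(),
                       df.sum(), df.var(), df.skew(), df.kurt()], axis=1)
    stats.columns = ['min', 'max', 'mean', 'median', 'sum', 'var', 'skew', 'kurt']
    return stats

# Toy transposed table: rows are days, columns are users
daily = pd.DataFrame({'user_1': [1.0, 2.0, 3.0, 4.0],
                      'user_2': [0.5, 0.5, 1.5, 1.5]})
print(compute_statistics(daily))
```

Each per-column reduction (min, max, ...) returns a Series indexed by user, so concatenating them along axis=1 produces the users-by-statistics table.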
4. Daily Differencing and Statistics
diff_daily = result.T.diff(1)
print("Daily differences:", diff_daily)
print("Statistics of differences:", compute_statistics(diff_daily))
5. 5th Percentile
print("5th percentile:", result.T.describe(percentiles=[0.05]))
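describe() prints a full summary table; if only the 5th percentile is needed, DataFrame.quantile returns it directly. A minimal sketch on illustrative data:

```python
import pandas as pd

# One user with readings 1..100 (illustrative)
daily = pd.DataFrame({'user_1': range(1, 101)})

# quantile(0.05) gives just the 5th percentile per column (linear interpolation by default)
p5 = daily.quantile(0.05)
print(p5)
```

For 1..100 the linearly interpolated 5th percentile is 5.95, matching the "5%" row that describe(percentiles=[0.05]) would report.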
6. Weekly Aggregation and Differencing
week_period = pd.PeriodIndex(data['DATA_DATE'], freq='W')
weekly_sum = data.groupby(by=['CONS_NO', week_period])['KWH'].sum()  # select KWH: datetimes cannot be summed
weekly_pivot = weekly_sum.unstack('CONS_NO')  # rows: weeks, columns: users
weekly_diff = weekly_pivot.diff(1)
print("Weekly differences:", weekly_diff)
print("Statistics of weekly differences:", compute_statistics(weekly_diff))
Note: a PeriodIndex with freq='W' uses calendar weeks that can span a year boundary, so this grouping does not separate years as the task requires.
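One way to honor the "7-day weeks, separate years" requirement is to split by calendar year first and then cut each year into 7-day blocks counted from January 1. A minimal sketch on toy data spanning a year boundary (values illustrative):

```python
import pandas as pd

# One user, 21 consecutive days crossing from 2015 into 2016
dates = pd.date_range('2015-12-25', periods=21, freq='D')
toy = pd.DataFrame({'CONS_NO': 1, 'DATA_DATE': dates, 'KWH': 1.0})

# Year key and 7-day block number within the year (day-of-year 1-7 -> block 0, etc.)
year = toy['DATA_DATE'].dt.year.rename('year')
week = toy['DATA_DATE'].dt.dayofyear.sub(1).floordiv(7).rename('week')

weekly = toy.groupby(['CONS_NO', year, week])['KWH'].sum()
print(weekly)
```

The days around New Year land in different groups: the tail of 2015 stays in 2015's last blocks, while 2016 restarts at block 0.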
7. Count Days with Consumption ≥ 0.9 * Max
count_high = (result.T >= result.T.max() * 0.9).sum()  # >= matches the task's "at least 0.9 x max"
print("Counts of high consumption days:", count_high)
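The comparison broadcasts each user's column against that user's own maximum; a toy column (values illustrative) makes the count easy to check by eye:

```python
import pandas as pd

# Transposed layout: rows are days, columns are users
daily = pd.DataFrame({'user_1': [10.0, 9.5, 8.0, 2.0]})

# Days at or above 90% of each user's maximum; >= also counts the max day itself
count_high = (daily >= daily.max() * 0.9).sum()
print(count_high)
```

Here the threshold is 9.0, so the days with 10.0 and 9.5 qualify and the count is 2.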
8. Month with Most Max/Min Occurrences
# For maximum values: count, per user, how often that user's maximum occurs in each month
max_data = data[data['KWH'] == data.groupby('CONS_NO')['KWH'].transform('max')]
max_month_period = pd.PeriodIndex(max_data['DATA_DATE'], freq='M')
max_count = max_data.groupby(by=['CONS_NO', max_month_period])['KWH'].count()
max_count_df = pd.DataFrame(max_count)
max_count_idx = max_count_df.reset_index().groupby('CONS_NO')['KWH'].idxmax()
max_month_values = max_count_df.iloc[max_count_idx]
max_month_values.columns = ['Max Count']
print("Month with most max values:", max_month_values)
# For minimum values: same approach, using each user's own minimum rather than the global one
min_data = data[data['KWH'] == data.groupby('CONS_NO')['KWH'].transform('min')]
min_month_period = pd.PeriodIndex(min_data['DATA_DATE'], freq='M')
min_count = min_data.groupby(by=['CONS_NO', min_month_period])['KWH'].count()
min_count_df = pd.DataFrame(min_count)
min_count_idx = min_count_df.reset_index().groupby('CONS_NO')['KWH'].idxmax()
min_month_values = min_count_df.iloc[min_count_idx]
min_month_values.columns = ['Min Count']
print("Month with most min values:", min_month_values)
9. Ratios of Summer to Spring
def filter_dates(df):
    idx = pd.IndexSlice
    summer = df.loc[idx[:, ['2015-7', '2015-8', '2016-7', '2016-8']], :]
    spring = df.loc[idx[:, ['2015-3', '2015-4', '2016-3', '2016-4']], :]
    return summer, spring
def merge_ratios(df_summer, df_spring, name):
    merged = pd.merge(df_summer, df_spring, on='CONS_NO')
    merged.columns = ['Jul-Aug', 'Mar-Apr']
    merged[name] = merged['Jul-Aug'] / merged['Mar-Apr']
    return merged
key = pd.PeriodIndex(data['DATA_DATE'], freq='M')
monthly = data.groupby(by=['CONS_NO', key])[['KWH']]  # keep only the numeric column so sum()/max()/min() work
# Sum ratio
monthly_sum = monthly.sum()
s_summer, s_spring = filter_dates(monthly_sum)
sum_summer = s_summer.groupby('CONS_NO').sum()
sum_spring = s_spring.groupby('CONS_NO').sum()
sum_ratio = merge_ratios(sum_summer, sum_spring, 'sum_ratio')
print("Sum ratio:", sum_ratio)
# Max ratio
monthly_max = monthly.max()
s_max_summer, s_max_spring = filter_dates(monthly_max)
max_summer = s_max_summer.groupby('CONS_NO').max()['KWH']
max_spring = s_max_spring.groupby('CONS_NO').max()['KWH']
max_ratio = merge_ratios(max_summer, max_spring, 'max_ratio')
print("Max ratio:", max_ratio)
# Min ratio
monthly_min = monthly.min()
s_min_summer, s_min_spring = filter_dates(monthly_min)
min_summer = s_min_summer.groupby('CONS_NO').min()['KWH']
min_spring = s_min_spring.groupby('CONS_NO').min()['KWH']
min_ratio = merge_ratios(min_summer, min_spring, 'min_ratio')
print("Min ratio:", min_ratio)
# Mean ratio (daily averages: Jul-Aug covers 124 days over the two years, Mar-Apr covers 122)
monthly_sum_means = monthly.sum()
s_mean_summer, s_mean_spring = filter_dates(monthly_sum_means)
mean_summer = s_mean_summer.groupby('CONS_NO').apply(lambda x: x.sum() / 124)
mean_spring = s_mean_spring.groupby('CONS_NO').apply(lambda x: x.sum() / 122)
mean_ratio = merge_ratios(mean_summer, mean_spring, 'mean_ratio')
print("Mean ratio:", mean_ratio)
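The ratio helper itself is easy to exercise in isolation on toy per-user totals (values illustrative; an index-based merge is used here as an equivalent way to align the two tables on CONS_NO):

```python
import pandas as pd

def merge_ratios(df_summer, df_spring, name):
    # Align the two per-user tables on their shared CONS_NO index
    merged = pd.merge(df_summer, df_spring, left_index=True, right_index=True)
    merged.columns = ['Jul-Aug', 'Mar-Apr']
    merged[name] = merged['Jul-Aug'] / merged['Mar-Apr']
    return merged

users = pd.Index([1, 2], name='CONS_NO')
summer = pd.DataFrame({'KWH': [200.0, 90.0]}, index=users)  # toy Jul-Aug totals
spring = pd.DataFrame({'KWH': [100.0, 60.0]}, index=users)  # toy Mar-Apr totals
print(merge_ratios(summer, spring, 'sum_ratio'))
```

User 1's summer total is twice its spring total, so its sum_ratio is 2.0; user 2's is 1.5.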
10. Combine Features
all_features = pd.concat([
sum_ratio['sum_ratio'],
max_ratio['max_ratio'],
min_ratio['min_ratio'],
mean_ratio['mean_ratio']
], axis=1)
print("Combined features:", all_features)
Note: The original solution consolidated only the ratio features from task 9 rather than all features from tasks 1-9. The code above follows the same approach.