Introduction
The Major Power Outage Risks in the U.S. dataset collects large-scale outages in the continental United States from 2000–2016. Each row records where the outage occurred, why it happened, how long it lasted, and how many customers were affected.
This project focuses on a location-based question:

> Are major power outages more severe, in terms of the number of customers affected, in some parts of the U.S. than in others?

I compare outages in the South climate region to those in West North Central, using the region where the outage occurred and the number of customers affected.
The raw Excel export contains 1541 rows and 57 columns. After cleaning (removing non-data rows, parsing dates, and dropping empty rows), the working dataset has 1534 outages and 54 columns. Columns most relevant to this question include:
- `CUSTOMERS.AFFECTED` – number of customers impacted by the outage.
- `CLIMATE.REGION` – coarse climate region for each state.
- `OUTAGE.DURATION` – outage length in minutes.
- `YEAR` and `MONTH` – when the outage occurred.
Data Cleaning and Exploratory Data Analysis
- Removed title/description rows and used row 5 as the source of column names.
- Dropped a non-data "variables" column and the units row beneath the header.
- Combined separate date/time columns into timestamp features `OUTAGE.START` and `OUTAGE.RESTORATION`.
- Converted key numeric columns (including `CUSTOMERS.AFFECTED` and `OUTAGE.DURATION`) to numeric types, coercing invalid entries to missing values.
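The timestamp and numeric-coercion steps can be sketched in pandas. The column names follow the dataset's naming scheme; the tiny frame below is an illustrative stand-in for the raw export, not the real data:

```python
import pandas as pd

# Tiny synthetic frame mimicking the raw export (values are illustrative only).
raw = pd.DataFrame({
    "OUTAGE.START.DATE": ["2011-07-01", "2014-05-11"],
    "OUTAGE.START.TIME": ["17:00:00", "18:38:00"],
    "CUSTOMERS.AFFECTED": ["70000", "n/a"],   # one invalid entry
    "OUTAGE.DURATION": ["3060", "2550"],
})

# Combine the separate date and time columns into one timestamp feature.
raw["OUTAGE.START"] = pd.to_datetime(
    raw["OUTAGE.START.DATE"] + " " + raw["OUTAGE.START.TIME"]
)

# Convert key numeric columns, coercing invalid entries to NaN.
for col in ["CUSTOMERS.AFFECTED", "OUTAGE.DURATION"]:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")
```

The `errors="coerce"` option is what turns entries like `"n/a"` into missing values instead of raising an exception.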
To understand the data, I first looked at the number of major
outages per climate region. The South and West regions experience
the largest counts of major outages, while West North Central
has relatively few.
To summarize the data at the state level, I also examine a grouped table showing, for each state, the mean `TOTAL.PRICE` and the mean `TOTAL.CUSTOMERS` (customers served) across major outages:
| State | Mean total price | Mean total customers |
|---|---|---|
| Alabama | 7.99 | 2.40e6 |
| Alaska | NaN | 2.74e5 |
| Arizona | 9.12 | 2.80e6 |
| Arkansas | 7.62 | 1.54e6 |
| California | 13.09 | 1.47e7 |
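A table like this comes from a single groupby-and-aggregate. The small frame below is a stand-in for the cleaned dataset, with made-up values rather than the real state means:

```python
import pandas as pd

# Minimal stand-in for the cleaned dataset (values illustrative).
df = pd.DataFrame({
    "U.S._STATE": ["Alabama", "Alabama", "Arizona"],
    "TOTAL.PRICE": [7.5, 8.5, 9.12],
    "TOTAL.CUSTOMERS": [2.4e6, 2.4e6, 2.8e6],
})

# Mean price and mean customers served per state across major outages.
state_summary = (
    df.groupby("U.S._STATE")[["TOTAL.PRICE", "TOTAL.CUSTOMERS"]]
      .mean()
)
```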
Assessment of Missingness
I focus on missing values in `CUSTOMERS.AFFECTED`, since this is my main outcome variable. I create a boolean indicator `CUSTOMERS.AFFECTED_missing` and study how it relates to other columns using permutation tests on a chi-square statistic from contingency tables.
Two comparisons are of particular interest:
- Missingness vs. `U.S._STATE` – The observed chi-square statistic is about 242.5 with an empirical p-value ≈ 0.000, suggesting that missingness in `CUSTOMERS.AFFECTED` depends on the state. Some states are much more likely to be missing values than others.
- Missingness vs. `CLIMATE.CATEGORY` – Here the chi-square statistic is about 2.57 with p ≈ 0.286, so I do not find evidence that missingness depends on this higher-level climate category.
Substantively, this suggests that missing customer counts may be MAR (Missing At Random) with respect to state-level properties like reporting practices or utility infrastructure, but not strongly tied to the broad climate category.
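A minimal sketch of this kind of missingness test: compute the chi-square statistic from the contingency table of a column's values against the missingness indicator, then shuffle the column's labels to simulate the null. The helper names here are my own, not necessarily those used in the project:

```python
import numpy as np
import pandas as pd

def chi2_stat(labels, missing):
    """Chi-square statistic from the labels x missingness contingency table."""
    observed = pd.crosstab(labels, missing)
    row = observed.sum(axis=1).to_numpy()[:, None]
    col = observed.sum(axis=0).to_numpy()[None, :]
    expected = row * col / observed.to_numpy().sum()
    return ((observed.to_numpy() - expected) ** 2 / expected).sum()

def missingness_perm_test(labels, missing, n_perm=1000, seed=0):
    """Empirical p-value: shuffle the labels, recompute the chi-square statistic."""
    rng = np.random.default_rng(seed)
    observed = chi2_stat(labels, missing)
    labels = np.asarray(labels)
    sims = np.array([
        chi2_stat(rng.permutation(labels), missing) for _ in range(n_perm)
    ])
    return observed, (sims >= observed).mean()
```

Because permuting the labels preserves their overall counts, every simulated contingency table has the same margins in expectation as the observed one.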
Hypothesis Testing
To formally answer the main research question, I perform a permutation
test comparing the mean number of customers affected in the
South and West North Central climate regions.
- Null hypothesis: The mean `CUSTOMERS.AFFECTED` is the same in the South and West North Central.
- Alternative hypothesis: The mean `CUSTOMERS.AFFECTED` is higher in the South.
- Test statistic: Difference in sample means, South minus West North Central.
In the observed data, the South has about 136,185 more customers affected on average than West North Central. Using 5,000 label permutations, the empirical one-sided p-value is ≈ 0.040, which is below a 5% significance threshold.
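A sketch of this permutation test; the function name and exact shuffling mechanics are my own, but the logic matches the description above (pool the values, re-split into groups of the original sizes, recompute the mean difference):

```python
import numpy as np

def perm_test_mean_diff(south, wnc, n_perm=5000, seed=0):
    """One-sided permutation test for mean(south) - mean(wnc) > 0."""
    rng = np.random.default_rng(seed)
    south = np.asarray(south, dtype=float)
    wnc = np.asarray(wnc, dtype=float)
    observed = south.mean() - wnc.mean()

    pooled = np.concatenate([south, wnc])
    n = len(south)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(pooled)
        # Re-split the shuffled values into groups of the original sizes.
        diffs[i] = perm[:n].mean() - perm[n:].mean()

    # One-sided p-value: fraction of shuffles at least as extreme as observed.
    p_value = (diffs >= observed).mean()
    return observed, p_value
```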
Conclusion: I reject the null hypothesis at the 0.05 level. The analysis provides evidence that major outages in the South affect more customers on average than those in West North Central, although this conclusion is still subject to the limitations of observational data and the modeling choices made here.
Framing a Prediction Problem
Beyond hypothesis testing, I frame a regression problem:
predict CUSTOMERS.AFFECTED for a large outage using only
information that would be available at the time of prediction.
The response variable is CUSTOMERS.AFFECTED. I evaluate
models using Root Mean Squared Error (RMSE), which keeps
the metric in the same units as the outcome and penalizes large errors.
Features are restricted to attributes known before or at the very start of the event, such as:
- `YEAR` and `MONTH` of the outage.
- `CLIMATE.REGION`, `U.S._STATE`, and `NERC.REGION`.
- When appropriate, early estimates like `OUTAGE.DURATION` if they are available at prediction time.
The data is split into train and test sets using a 75/25 split with a fixed random seed. All subsequent models use the same split so their performance is directly comparable.
Baseline Model
As a baseline, I fit a linear regression model using two features:
- `YEAR` (quantitative, passed through unchanged).
- `CLIMATE.REGION` (categorical, one-hot encoded).
These features are combined in a scikit-learn `Pipeline` with a `ColumnTransformer` that one-hot encodes the climate region, followed by `LinearRegression`.
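A minimal version of this baseline pipeline, using a synthetic stand-in for the real outage table so the sketch is self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the cleaned outage data (values illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "YEAR": rng.integers(2000, 2017, size=80),
    "CLIMATE.REGION": rng.choice(["South", "West", "West North Central"], size=80),
    "CUSTOMERS.AFFECTED": rng.integers(1_000, 500_000, size=80).astype(float),
})

X = df[["YEAR", "CLIMATE.REGION"]]
y = df["CUSTOMERS.AFFECTED"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# One-hot encode the climate region; pass YEAR through unchanged.
preprocess = ColumnTransformer(
    [("region", OneHotEncoder(handle_unknown="ignore"), ["CLIMATE.REGION"])],
    remainder="passthrough",
)
baseline = Pipeline([("prep", preprocess), ("lr", LinearRegression())])
baseline.fit(X_train, y_train)
rmse = float(np.sqrt(np.mean((baseline.predict(X_test) - y_test) ** 2)))
```

Keeping the encoder inside the pipeline ensures the same preprocessing is applied at fit and predict time, and `handle_unknown="ignore"` protects against a region appearing only in the test split.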
Residual plots for the baseline model show wide dispersion and substantial under-prediction on the largest events, which is not surprising given the limited feature set and the heavy-tailed nature of the outcome variable.
Final Model
To improve performance, the final model adds more information and performs explicit hyperparameter tuning. I:
- Add two new quantitative features: `MONTH` and `OUTAGE.DURATION`.
- Standardize numeric features and one-hot encode `CLIMATE.REGION` inside a single `Pipeline`.
- Use `Ridge` regression and select the regularization strength `alpha` via `GridSearchCV` over {0.01, 0.1, 1.0, 10.0}.
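The final model can be sketched as follows. The synthetic data and pipeline step names are illustrative assumptions; the alpha grid matches the one listed above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned outage data (values illustrative).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "YEAR": rng.integers(2000, 2017, size=100),
    "MONTH": rng.integers(1, 13, size=100),
    "OUTAGE.DURATION": rng.integers(60, 10_000, size=100).astype(float),
    "CLIMATE.REGION": rng.choice(["South", "West", "West North Central"], size=100),
})
y = 50 * df["OUTAGE.DURATION"] + rng.normal(0, 1e4, size=100)

# Standardize numeric features; one-hot encode the climate region.
numeric = ["YEAR", "MONTH", "OUTAGE.DURATION"]
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("region", OneHotEncoder(handle_unknown="ignore"), ["CLIMATE.REGION"]),
])
model = Pipeline([("prep", preprocess), ("ridge", Ridge())])

# Tune the regularization strength via cross-validated grid search.
search = GridSearchCV(
    model,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(df, y)
best_alpha = search.best_params_["ridge__alpha"]
```

Because the scaler and encoder live inside the pipeline, `GridSearchCV` refits the preprocessing on each training fold, avoiding leakage from the validation folds into the standardization statistics.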
While the improvement over the baseline is modest, it is consistent: both train and test RMSE decrease, and residual plots show slightly tighter dispersion around the diagonal. The model still struggles with the largest outages, suggesting that additional features (such as utility-level characteristics or weather severity) might be needed to capture these extreme events.
To interpret the model, I also examine the magnitude of each Ridge coefficient after preprocessing. The plot below shows the features with the largest absolute coefficients; features at the top contribute most strongly (in either direction) to the predicted number of customers affected.
Fairness Analysis
Finally, I assess whether the final model performs equally well for
different climate regions. I focus again on the South and
West North Central groups and compute the RMSE of the
final model on the test set within each group.
The model clearly performs much worse for the South: its typical error
there is more than half a million customers, compared to only tens of
thousands in West North Central. To quantify whether this difference
could be due to chance, I run a permutation test that shuffles climate
region labels across the test set and recomputes
|RMSE_South − RMSE_WNC|.
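A sketch of this group-wise RMSE permutation test (function names are my own; the project's code may be structured differently):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error in the units of the outcome."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def fairness_perm_test(y_true, y_pred, groups, g1, g2, n_perm=1000, seed=0):
    """Permutation test on |RMSE_g1 - RMSE_g2| with shuffled group labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    groups = np.asarray(groups)

    def gap(g):
        # Absolute difference in RMSE between the two groups under labels g.
        return abs(rmse(y_true[g == g1], y_pred[g == g1])
                   - rmse(y_true[g == g2], y_pred[g == g2]))

    rng = np.random.default_rng(seed)
    observed = gap(groups)
    sims = np.array([gap(rng.permutation(groups)) for _ in range(n_perm)])
    return observed, (sims >= observed).mean()
```

Only the group labels are shuffled; predictions and true values stay paired, so the test asks whether the error gap could arise if group membership were unrelated to error size.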
The observed difference is about 575,439 customers, and the permutation distribution yields an empirical p-value ≈ 0.051. This is borderline with respect to a 5% significance level: the evidence points toward the model being unfair to the South, but just misses the conventional cutoff.
Substantively, the huge gap in RMSE matters more than the exact p-value. The model systematically makes larger errors in the South, the very region where outages already appear more severe. Improving the model for this group, perhaps by incorporating more detailed weather or infrastructure features, would be an important direction for future work.