Introduction
The Major Power Outage Risks in the U.S. dataset collects large-scale outages in the continental United States from 2000–2016. Each row records where the outage occurred, why it happened, how long it lasted, and how many customers were affected.
This project focuses on a location-based question:

> Are major power outages more severe, in terms of the number of customers affected, in some parts of the U.S. than in others?

I compare outages in the South climate region to those in West North Central, using the region where the outage occurred and the number of customers affected.
The raw Excel export contains 1541 rows and 57 columns. After cleaning (removing non-data rows, parsing dates, and dropping empty rows), the working dataset has 1534 outages and 54 columns. Columns most relevant to this question include:
- `CUSTOMERS.AFFECTED` – number of customers impacted by the outage.
- `CLIMATE.REGION` – coarse climate region for each state.
- `OUTAGE.DURATION` – outage length in minutes.
- `YEAR` and `MONTH` – when the outage occurred.
Data Cleaning and Exploratory Data Analysis
- Removed title/description rows and used row 5 as the source of column names.
- Dropped a non-data "variables" column and the units row beneath the header.
- Combined separate date/time columns into timestamp features `OUTAGE.START` and `OUTAGE.RESTORATION`.
- Converted key numeric columns (including `CUSTOMERS.AFFECTED` and `OUTAGE.DURATION`) to numeric types, coercing invalid entries to missing values.
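The timestamp and numeric-coercion steps can be sketched in pandas. The column names follow the dataset's naming scheme; the tiny frame below is an illustrative stand-in for the raw export, not the real data:

```python
import pandas as pd

# Tiny synthetic frame mimicking the raw export (values are illustrative only).
raw = pd.DataFrame({
    "OUTAGE.START.DATE": ["2011-07-01", "2014-05-11"],
    "OUTAGE.START.TIME": ["17:00:00", "18:38:00"],
    "CUSTOMERS.AFFECTED": ["70000", "n/a"],   # one invalid entry
    "OUTAGE.DURATION": ["3060", "2550"],
})

# Combine the separate date and time columns into one timestamp feature.
raw["OUTAGE.START"] = pd.to_datetime(
    raw["OUTAGE.START.DATE"] + " " + raw["OUTAGE.START.TIME"]
)

# Convert key numeric columns, coercing invalid entries to NaN.
for col in ["CUSTOMERS.AFFECTED", "OUTAGE.DURATION"]:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")
```

The `errors="coerce"` option is what turns entries like `"n/a"` into missing values instead of raising an exception.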
To understand the data, I first looked at the number of major
outages per climate region. The South and West regions experience
the largest counts of major outages, while West North Central
has relatively few.
To summarize the data at the state level, I also examine a grouped table showing, for each state, the mean `TOTAL.PRICE` and the mean `TOTAL.CUSTOMERS` (customers served) across major outages:
| State | Mean total price | Mean total customers |
|---|---|---|
| Alabama | 7.99 | 2.40e6 |
| Alaska | NaN | 2.74e5 |
| Arizona | 9.12 | 2.80e6 |
| Arkansas | 7.62 | 1.54e6 |
| California | 13.09 | 1.47e7 |
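A table like this comes from a single groupby-and-aggregate. The small frame below is a stand-in for the cleaned dataset, with made-up values rather than the real state means:

```python
import pandas as pd

# Minimal stand-in for the cleaned dataset (values illustrative).
df = pd.DataFrame({
    "U.S._STATE": ["Alabama", "Alabama", "Arizona"],
    "TOTAL.PRICE": [7.5, 8.5, 9.12],
    "TOTAL.CUSTOMERS": [2.4e6, 2.4e6, 2.8e6],
})

# Mean price and mean customers served per state across major outages.
state_summary = (
    df.groupby("U.S._STATE")[["TOTAL.PRICE", "TOTAL.CUSTOMERS"]]
      .mean()
)
```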
Assessment of Missingness
I focus on missing values in `CUSTOMERS.AFFECTED`, since this is my main outcome variable. I create a boolean indicator `CUSTOMERS.AFFECTED_missing` and study how it relates to other columns using permutation tests on a chi-square statistic from contingency tables.
Two comparisons are of particular interest:
- Missingness vs. `U.S._STATE` – The observed chi-square statistic is about 242.5 with an empirical p-value ≈ 0.000, suggesting that missingness in `CUSTOMERS.AFFECTED` depends on the state. Some states are much more likely to be missing values than others.
- Missingness vs. `CLIMATE.CATEGORY` – Here the chi-square statistic is about 2.57 with p ≈ 0.286, so I do not find evidence that missingness depends on this higher-level climate category.
Substantively, this suggests that missing customer counts may be MAR (Missing At Random) with respect to state-level properties like reporting practices or utility infrastructure, but not strongly tied to the broad climate category.
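A minimal sketch of this kind of missingness test: compute the chi-square statistic from the contingency table of a column's values against the missingness indicator, then shuffle the column's labels to simulate the null. The helper names here are my own, not necessarily those used in the project:

```python
import numpy as np
import pandas as pd

def chi2_stat(labels, missing):
    """Chi-square statistic from the labels x missingness contingency table."""
    observed = pd.crosstab(labels, missing)
    row = observed.sum(axis=1).to_numpy()[:, None]
    col = observed.sum(axis=0).to_numpy()[None, :]
    expected = row * col / observed.to_numpy().sum()
    return ((observed.to_numpy() - expected) ** 2 / expected).sum()

def missingness_perm_test(labels, missing, n_perm=1000, seed=0):
    """Empirical p-value: shuffle the labels, recompute the chi-square statistic."""
    rng = np.random.default_rng(seed)
    observed = chi2_stat(labels, missing)
    labels = np.asarray(labels)
    sims = np.array([
        chi2_stat(rng.permutation(labels), missing) for _ in range(n_perm)
    ])
    return observed, (sims >= observed).mean()
```

Because permuting the labels preserves their overall counts, every simulated contingency table has the same margins in expectation as the observed one.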
Hypothesis Testing
To formally answer the main research question, I perform a permutation
test comparing the mean number of customers affected in the
South and West North Central climate regions.
- Null hypothesis: The mean `CUSTOMERS.AFFECTED` is the same in the South and West North Central.
- Alternative hypothesis: The mean `CUSTOMERS.AFFECTED` is higher in the South.
- Test statistic: Difference in sample means, South minus West North Central.
In the observed data, the South has about 136,185 more customers affected on average than West North Central. Using 5,000 label permutations, the empirical one-sided p-value is ≈ 0.040, which is below a 5% significance threshold.
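A sketch of this permutation test; the function name and exact shuffling mechanics are my own, but the logic matches the description above (pool the values, re-split into groups of the original sizes, recompute the mean difference):

```python
import numpy as np

def perm_test_mean_diff(south, wnc, n_perm=5000, seed=0):
    """One-sided permutation test for mean(south) - mean(wnc) > 0."""
    rng = np.random.default_rng(seed)
    south = np.asarray(south, dtype=float)
    wnc = np.asarray(wnc, dtype=float)
    observed = south.mean() - wnc.mean()

    pooled = np.concatenate([south, wnc])
    n = len(south)
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(pooled)
        # Re-split the shuffled values into groups of the original sizes.
        diffs[i] = perm[:n].mean() - perm[n:].mean()

    # One-sided p-value: fraction of shuffles at least as extreme as observed.
    p_value = (diffs >= observed).mean()
    return observed, p_value
```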
Conclusion: I reject the null hypothesis at the 0.05 level. The analysis provides evidence that major outages in the South affect more customers on average than those in West North Central, although this conclusion is still subject to the limitations of observational data and the modeling choices made here.
Framing a Prediction Problem
Beyond hypothesis testing, I frame a regression problem:
predict CUSTOMERS.AFFECTED for a large outage using only
information that would be available at the time of prediction.
The response variable is CUSTOMERS.AFFECTED. I evaluate
models using Root Mean Squared Error (RMSE), which keeps
the metric in the same units as the outcome and penalizes large errors.
Features are restricted to attributes known before or at the very start of the event, such as:
- `YEAR` and `MONTH` of the outage.
- `CLIMATE.REGION`, `U.S._STATE`, and `NERC.REGION`.
- When appropriate, early estimates like `OUTAGE.DURATION` if they are available at prediction time.
The data is split into train and test sets using a 75/25 split with a fixed random seed. All subsequent models use the same split so their performance is directly comparable.
Baseline Model
As a baseline, I fit a linear regression model using two features:
- `YEAR` (quantitative, passed through unchanged).
- `CLIMATE.REGION` (categorical, one-hot encoded).
These features are combined in a scikit-learn `Pipeline` with a `ColumnTransformer` that one-hot encodes the climate region, followed by `LinearRegression`.
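A minimal version of this baseline pipeline, using a synthetic stand-in for the real outage table so the sketch is self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the cleaned outage data (values illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "YEAR": rng.integers(2000, 2017, size=80),
    "CLIMATE.REGION": rng.choice(["South", "West", "West North Central"], size=80),
    "CUSTOMERS.AFFECTED": rng.integers(1_000, 500_000, size=80).astype(float),
})

X = df[["YEAR", "CLIMATE.REGION"]]
y = df["CUSTOMERS.AFFECTED"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# One-hot encode the climate region; pass YEAR through unchanged.
preprocess = ColumnTransformer(
    [("region", OneHotEncoder(handle_unknown="ignore"), ["CLIMATE.REGION"])],
    remainder="passthrough",
)
baseline = Pipeline([("prep", preprocess), ("lr", LinearRegression())])
baseline.fit(X_train, y_train)
rmse = float(np.sqrt(np.mean((baseline.predict(X_test) - y_test) ** 2)))
```

Keeping the encoder inside the pipeline ensures the same preprocessing is applied at fit and predict time, and `handle_unknown="ignore"` protects against a region appearing only in the test split.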
Residual plots for the baseline model show wide dispersion and substantial under-prediction on the largest events, which is not surprising given the limited feature set and the heavy-tailed nature of the outcome variable.
Final Model
To improve performance, the final model adds more information and performs explicit hyperparameter tuning. I:
- Add two new quantitative features: `MONTH` and `OUTAGE.DURATION`.
- Standardize numeric features and one-hot encode `CLIMATE.REGION` inside a single `Pipeline`.
- Use `Ridge` regression and select the regularization strength `alpha` via `GridSearchCV` over {0.01, 0.1, 1.0, 10.0}.
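The final model can be sketched as follows. The synthetic data and pipeline step names are illustrative assumptions; the alpha grid matches the one listed above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned outage data (values illustrative).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "YEAR": rng.integers(2000, 2017, size=100),
    "MONTH": rng.integers(1, 13, size=100),
    "OUTAGE.DURATION": rng.integers(60, 10_000, size=100).astype(float),
    "CLIMATE.REGION": rng.choice(["South", "West", "West North Central"], size=100),
})
y = 50 * df["OUTAGE.DURATION"] + rng.normal(0, 1e4, size=100)

# Standardize numeric features; one-hot encode the climate region.
numeric = ["YEAR", "MONTH", "OUTAGE.DURATION"]
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("region", OneHotEncoder(handle_unknown="ignore"), ["CLIMATE.REGION"]),
])
model = Pipeline([("prep", preprocess), ("ridge", Ridge())])

# Tune the regularization strength via cross-validated grid search.
search = GridSearchCV(
    model,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(df, y)
best_alpha = search.best_params_["ridge__alpha"]
```

Because the scaler and encoder live inside the pipeline, `GridSearchCV` refits the preprocessing on each training fold, avoiding leakage from the validation folds into the standardization statistics.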
While the improvement over the baseline is modest, it is consistent: both train and test RMSE decrease, and residual plots show slightly tighter dispersion around the diagonal. The model still struggles with the largest outages, suggesting that additional features (such as utility-level characteristics or weather severity) might be needed to capture these extreme events.
To interpret the model, I also examine the magnitude of each Ridge coefficient after preprocessing. The plot below shows the features with the largest absolute coefficients; features at the top contribute most strongly (in either direction) to the predicted number of customers affected.
Fairness Analysis
Finally, I assess whether the final model performs equally well for
different climate regions. I focus again on the South and
West North Central groups and compute the RMSE of the
final model on the test set within each group.
The model clearly performs much worse for the South: its typical error
there is more than half a million customers, compared to only tens of
thousands in West North Central. To quantify whether this difference
could be due to chance, I run a permutation test that shuffles climate
region labels across the test set and recomputes
|RMSE_South − RMSE_WNC|.
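A sketch of this group-wise RMSE permutation test (function names are my own; the project's code may be structured differently):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error in the units of the outcome."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def fairness_perm_test(y_true, y_pred, groups, g1, g2, n_perm=1000, seed=0):
    """Permutation test on |RMSE_g1 - RMSE_g2| with shuffled group labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    groups = np.asarray(groups)

    def gap(g):
        # Absolute difference in RMSE between the two groups under labels g.
        return abs(rmse(y_true[g == g1], y_pred[g == g1])
                   - rmse(y_true[g == g2], y_pred[g == g2]))

    rng = np.random.default_rng(seed)
    observed = gap(groups)
    sims = np.array([gap(rng.permutation(groups)) for _ in range(n_perm)])
    return observed, (sims >= observed).mean()
```

Only the group labels are shuffled; predictions and true values stay paired, so the test asks whether the error gap could arise if group membership were unrelated to error size.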
The observed difference is about 575,439 customers, and the permutation distribution yields an empirical p-value ≈ 0.051. This is borderline with respect to a 5% significance level: the evidence points toward the model being unfair to the South, but just misses the conventional cutoff.
Substantively, the huge gap in RMSE matters more than the exact p-value. The model systematically makes larger errors in the South, the very region where outages already appear more severe. Improving the model for this group, perhaps by incorporating more detailed weather or infrastructure features, would be an important direction for future work.