Analyzing and Predicting Lengthy Power Outages: Insights from Data and Model Development


Introduction

We are given a comprehensive dataset on the power outages that occurred in the United States from 2000 through 2016. The table below describes the columns that we investigated during this analysis. Each column represents a feature of the dataset.

Power Outage Data Legends and Attributes

{% include table.md %}

Note: “NA” in the data file indicates data not available.

While the dataset has many columns, we focus primarily on those that align with our research question: how do socio-economic factors and historical reliability metrics influence the resilience of power systems in different regions, and can we predict areas at high risk of prolonged outages?

We initially planned to investigate the correlation between socio-economic factors and the duration of power outages. We hypothesized that regions with weaker economies, or with a lower economic contribution from the utility sector, would experience longer outages. However, after investigating the socio-economic columns, such as the real GSP and the GSP contributed by the utility sector, we did not obtain any plausible findings that supported this hypothesis, nor could we identify any other pattern in these features.

We therefore analyzed and visualized other features in the dataset to look for interesting correlations. After several analyses, we decided to study the relationship between the frequency of power outages and the duration of outages occurring in a particular month of a particular year.

Data Cleaning and Exploratory Data Analysis

Data Fetching and Cleaning

Before starting the analysis, we need to read the data from the given Excel sheet, parse it into a pandas DataFrame, and clean it as necessary. For the sake of completeness, we do not drop any data up front. Missing (N/A) values are handled case by case: depending on the analysis, we drop the affected rows, call fillna(0), or use another applicable method to fill in the missing values.
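The case-by-case handling of missing values described above could look roughly like the sketch below. It is only an illustration: the toy rows, the helper name clean_for_duration_analysis, and the choice of which column to drop on and which to fill with zero are assumptions, not the cleaning rules actually used in this project.

```python
import numpy as np
import pandas as pd

# In the real analysis the frame would come from the Excel sheet, e.g.
# raw = pd.read_excel("outage.xlsx")  # hypothetical file name
# A tiny hand-made frame stands in for it here.
raw = pd.DataFrame({
    "OUTAGE.DURATION": [120.0, np.nan, 30.0, 4500.0],
    "CUSTOMERS.AFFECTED": [50000.0, np.nan, 1200.0, np.nan],
    "CAUSE.CATEGORY": ["severe weather", "intentional attack",
                       "islanding", "severe weather"],
})

def clean_for_duration_analysis(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy for analyses that need a known duration."""
    # Assumption: a row without a duration is dropped for this analysis only.
    out = df.dropna(subset=["OUTAGE.DURATION"])
    # Assumption: a missing customer count is treated as zero customers.
    out["CUSTOMERS.AFFECTED"] = out["CUSTOMERS.AFFECTED"].fillna(0)
    return out

cleaned = clean_for_duration_analysis(raw)
print(cleaned)
```

Because the helper works on a derived frame, the raw data stays untouched and a different analysis can apply a different rule to the same columns.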

Data Visualization

Real GSP Contributed by the Utility Industry vs. the Duration of the Power Outage

We began by investigating the correlation between outage duration and the real GSP contributed by the utility industry in each state/region, using a box plot as the visualization.

State utility sector’s income as a percentage of total U.S. utility sector income (%) vs. the Duration of the Power Outage

We then analyzed the correlation between the state utility sector’s income as a percentage of total U.S. utility sector income (%) and the duration of the power outage. The rationale is that a larger percentage indicates a larger state utility sector relative to the rest of the country, and a state with a larger utility industry might handle power outages more efficiently than one with a smaller industry.

Residential Electricity Price vs. the Duration of the Power Outage

As one of our last attempts to find a relationship between socio-economic features and outage duration, we investigated the correlation between the residential electricity price and the duration of the power outage. Our assumption was that a region with a higher electricity price would experience shorter outages, since customers paying a higher price should give the utility industry more funding for emergency response.

However, the visualization did not provide sufficient evidence to support this assumption. Although the highest price category does have the shortest outages, this may be because too few utilities charge such an unusually high price, which effectively makes this category an outlier that does not reflect the overall pattern.

The Cause of Power Outages and Its Effects

At this point, we moved away from our initial proposal of studying how socio-economic factors affect outage duration. Instead, we made two analyses of how the cause of a power outage affects its duration and the loss in power demand.

Bivariate Analysis

The analyses and visualizations above focused on univariate analysis. We also explored the dataset further to gain a deeper understanding of its patterns. The first exploration is the trend in the frequency of power outages over the years.

We analyzed the trend across months within a year, but did not find a specific pattern indicating which months are likely to have more power outages. We therefore combined all months and years into a new DataFrame and created a heatmap of the frequency of power outages in each month of each year. The heatmap combines two variables: the month and the year in which the outage occurred.
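A minimal sketch of how such a year-by-month frequency table could be built with pandas is shown below. The column name YEAR and the toy rows are assumptions made for illustration. The resulting table could then be passed to a plotting function such as seaborn.heatmap to draw the figure.

```python
import pandas as pd

# Toy stand-in for the outage records: one row per outage.
# YEAR is an assumed column name; MONTH follows the feature named later on.
events = pd.DataFrame({
    "YEAR":  [2010, 2010, 2010, 2011, 2011, 2012],
    "MONTH": [1, 1, 7, 7, 8, 1],
})

# Rows are years, columns are months, cells are outage counts.
freq = pd.crosstab(events["YEAR"], events["MONTH"])

# Keep all twelve months as columns, even those without any outage.
freq = freq.reindex(columns=range(1, 13), fill_value=0)
print(freq)
```

Reindexing to all twelve months keeps the grid rectangular, so an empty month shows up as a zero cell instead of disappearing from the plot.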

Framing a Prediction Problem

Conclusion From the Exploratory Data Analysis

In our exploratory data analysis, we initially investigated correlations between socio-economic factors and power outage durations. However, no strong or reliable patterns were identified between outage duration and variables like the real GSP contributed by the utility industry, state utility sector income as a percentage of the total U.S. utility income, and residential electricity price. Our assumptions that these socio-economic variables influence outage durations could not be substantiated.

Development of Our Prediction Problem

The focus then shifted to analyzing the causes of power outages and their impact. Clear correlations were observed:

- The cause of the power outage had a significant influence on its duration.
- The cause also affected the loss in power demand during outages.

Additionally, trends in outage frequency across years and months indicated that:

- Frequency surged in 2010 but showed overall fluctuations in later years.
- Months with higher outage frequency tended to have shorter average durations.

These insights led to the refinement of our prediction problem to focus on classifying outages as either long or short based on available features like CUSTOMERS.AFFECTED and CAUSE.CATEGORY.

The prediction problem is framed as a binary classification task to determine whether a power outage will last 60 minutes or longer. Long outages are defined as outages lasting one hour or more, with the target variable LONG_OUTAGE set to 1 for long outages and 0 for short outages. The following aspects were considered:

Features (X):
- CUSTOMERS.AFFECTED
- CAUSE.CATEGORY
Target (y):
- LONG_OUTAGE

The goal was to preprocess the features and train a model to accurately classify power outages based on the available data.
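The target definition above can be sketched in a few lines of pandas. The toy rows are made up, and the sketch assumes OUTAGE.DURATION is recorded in minutes, which is what the 60-minute threshold implies.

```python
import pandas as pd

# Toy rows; durations are assumed to be in minutes.
df = pd.DataFrame({
    "OUTAGE.DURATION": [15, 59, 60, 61, 3000],
    "CUSTOMERS.AFFECTED": [100, 2500, 0, 40000, 120000],
    "CAUSE.CATEGORY": ["islanding", "intentional attack", "intentional attack",
                       "severe weather", "severe weather"],
})

# 1 for outages lasting 60 minutes or longer, 0 otherwise.
df["LONG_OUTAGE"] = (df["OUTAGE.DURATION"] >= 60).astype(int)

X = df[["CUSTOMERS.AFFECTED", "CAUSE.CATEGORY"]]  # features
y = df["LONG_OUTAGE"]                             # target
print(y.tolist())
```

The comparison uses >= so that an outage of exactly 60 minutes counts as long, matching the "one hour or more" wording.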

Baseline Model

Model Development

The baseline model was developed using Python's scikit-learn library to classify outages as long or short based on key features. Below is a summary of the steps involved in the development:

Data Preparation:
- The dataset includes features such as CUSTOMERS.AFFECTED, CAUSE.CATEGORY, and OUTAGE.DURATION.
- Missing values were handled by filling with zeros for numerical features like LONG.OUTAGE.
Feature Selection:
- The selected features for training included:
  - Numerical: CUSTOMERS.AFFECTED, OUTAGE.DURATION.
  - Categorical: CAUSE.CATEGORY.
Data Splitting:
- The dataset was split into training and testing subsets with an 80-20 ratio to ensure unbiased evaluation.
- A random seed (random_state=42) was used to allow reproducibility.
Preprocessing:
- A ColumnTransformer was used to preprocess the features:
  - Numerical features were standardized using StandardScaler.
  - Categorical features were encoded using OneHotEncoder.
Pipeline:
- A Pipeline was built to streamline preprocessing and model fitting:
  - Preprocessing steps were incorporated into the pipeline.
  - Logistic regression (LogisticRegression) was used as the classification model for simplicity and interpretability.
Training:
- The pipeline was fit on the training data using the selected features.
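The steps above can be sketched as follows. This is not the project's actual code: the data is synthetic, the labelling rule is invented so that the toy problem is learnable, and max_iter=1000 is an assumed setting. The sketch also leaves OUTAGE.DURATION out of the inputs, since LONG_OUTAGE is derived directly from it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): 200 outages with made-up values.
rng = np.random.default_rng(0)
n = 200
causes = rng.choice(
    ["severe weather", "intentional attack", "equipment failure"], size=n
)
X = pd.DataFrame({
    "CUSTOMERS.AFFECTED": rng.integers(0, 200_000, size=n).astype(float),
    "CAUSE.CATEGORY": causes,
})
# Invented rule for the toy data: weather-related outages are long.
y = pd.Series((causes == "severe weather").astype(int), name="LONG_OUTAGE")

# 80-20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the numerical column and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["CUSTOMERS.AFFECTED"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["CAUSE.CATEGORY"]),
])

baseline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)

test_accuracy = baseline.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

Putting the preprocessing inside the pipeline means the scaler and encoder are fit on the training split only, so no information from the test split leaks into training.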

Performance Evaluation

The model was evaluated on the test dataset using the following metrics:

- Precision: Indicates how many of the predicted positive results were correctly classified.
- Recall: Measures the ability of the model to identify all positive instances.
- F1-Score: Harmonic mean of precision and recall, providing a balance between the two.
- Accuracy: Proportion of correctly classified instances over the total number of predictions.
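To make these definitions concrete, the sketch below computes the four metrics from the counts of a binary confusion matrix. The counts are hypothetical and are not taken from this project's results. The helper name binary_metrics is also an illustrative choice.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F1 and accuracy for the positive class.

    Assumes tp + fp and tp + fn are both non-zero.
    """
    precision = tp / (tp + fp)   # share of predicted positives that are right
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Hypothetical counts: 80 true positives, 10 false positives,
# 20 false negatives, 40 true negatives.
m = binary_metrics(tp=80, fp=10, fn=20, tn=40)
print({name: round(value, 3) for name, value in m.items()})
```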

The classification report generated for the test data is as follows:

Class	Precision	Recall	F1-Score	Support
0	0.47	0.57	0.52	70
1	0.87	0.78	0.82	237
Overall	0.77	0.76	0.77	307

Macro Average: Precision (0.69), Recall (0.68), F1-Score (0.69).
Weighted Average: Precision (0.77), Recall (0.76), F1-Score (0.77).
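The two averages differ only in how the per-class scores are combined: the macro average treats each class equally, while the weighted average weights each class by its support. A small sketch with hypothetical per-class scores and supports (not the values reported above) illustrates the difference.

```python
def macro_average(scores):
    """Unweighted mean of the per-class scores."""
    return sum(scores) / len(scores)

def weighted_average(scores, supports):
    """Mean of the per-class scores, weighted by class support."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# Hypothetical per-class scores and supports for classes 0 and 1.
scores = [0.50, 0.90]
supports = [100, 300]

macro = macro_average(scores)
weighted = weighted_average(scores, supports)
print(round(macro, 2), round(weighted, 2))
```

With an imbalanced dataset the weighted average is pulled toward the majority class, so the macro average is usually the more informative of the two for judging minority-class performance.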

Insights

- The model achieved an overall accuracy of 77%, which is a reasonable performance for a baseline model.
- Class 1 (majority class) had significantly higher precision, recall, and F1-score, indicating the model performs better for this class.
- Class 0 (minority class) performance was weaker, highlighting class imbalance as a challenge for improvement.

Next Steps

Cross-validation:
- Use cross-validation to ensure the model generalizes well across different subsets of the data.
Feature Engineering:
- Explore additional features or transformations to enhance predictive power.
Hyperparameter Tuning:
- Optimize the logistic regression model using techniques like grid search or random search.

Final Model

Features and Their Importance

For the final model, we added MONTH as an additional feature, for the following reasons:

- Temporal patterns, such as seasonal effects, could play a role in outage duration, especially in regions prone to severe weather.
- The heatmap from the exploratory data analysis showed that some months in some years tend to have more frequent power outages, which could correlate with the duration of the outages in those months.

Modeling Algorithm

We replaced Logistic Regression with a Decision Tree Classifier to explore non-linear relationships between features and the target variable. Decision trees can capture complex patterns in the data, making them a suitable choice for this task. To optimize the model, we used the following techniques:

Hyperparameter Tuning:
- The following hyperparameters were tuned using GridSearchCV:
  - max_depth: Controls the maximum depth of the tree. Values tested: [5, 10, 20].
  - min_samples_split: Specifies the minimum number of samples required to split an internal node. Values tested: [2, 5, 10].
- The best parameters were:
  - max_depth: 5
  - min_samples_split: 2
Cross-Validation:
- Used 5-fold cross-validation to ensure the model generalizes well across different subsets of the data.
Pipeline Integration:
- A pipeline was constructed to combine preprocessing steps and the Decision Tree Classifier, ensuring streamlined model training and evaluation.
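The tuning setup above can be sketched as follows. The parameter grid and the 5-fold setting come from the description. Everything else is an assumption for illustration: the synthetic data, the invented labelling rule, the accuracy scoring, and the choice to pass the numerical columns through unscaled (decision trees do not depend on feature scale).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 300 outages with made-up values.
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "CUSTOMERS.AFFECTED": rng.integers(0, 200_000, size=n).astype(float),
    "CAUSE.CATEGORY": rng.choice(["severe weather", "intentional attack"], size=n),
    "MONTH": rng.integers(1, 13, size=n),
})
# Invented non-linear rule: weather outages in the second half of the year are long.
y = ((X["CAUSE.CATEGORY"] == "severe weather") & (X["MONTH"] >= 6)).astype(int)

# One-hot encode the cause; pass the numerical columns through unchanged.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["CAUSE.CATEGORY"])],
    remainder="passthrough",
)
pipe = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Grid from the text; the "tree__" prefix routes each value to the pipeline step.
param_grid = {
    "tree__max_depth": [5, 10, 20],
    "tree__min_samples_split": [2, 5, 10],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Searching over the whole pipeline, instead of over a bare classifier, refits the encoder inside every fold, which keeps the cross-validated scores free of leakage from the held-out fold.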

Performance Comparison

Final Model Metrics (Decision Tree Classifier):

Metric	Class 0	Class 1	Accuracy	Macro Avg	Weighted Avg
Precision	0.66	0.89		0.77	0.83
Recall	0.64	0.90		0.77	0.83
F1-Score	0.65	0.89	0.83	0.77	0.83

Key Insights

Improved Balance Across Classes:

The final model achieved more balanced precision, recall, and F1-score across the two classes than the baseline, although Class 0 still trails Class 1.

Class 0 (Short Outages):

Precision improved from 0.47 to 0.66, and recall rose from 0.57 to 0.64, raising the F1-score from 0.52 to 0.65.

Class 1 (Long Outages):

Maintained strong precision (0.89) and recall (0.90), highlighting the model's ability to correctly identify long outages.

Overall Metrics:

Accuracy improved to 83%, with macro and weighted averages also reflecting balanced performance.

Conclusion

After adding the month as a feature and replacing the logistic regression model with a Decision Tree Classifier, we saw a significant improvement in handling the minority class (Class 0) while maintaining strong performance for the majority class (Class 1). The model's ability to capture non-linear relationships contributed to its better overall performance compared to the baseline Logistic Regression model.

Acknowledgments

This project report was authored by Daniel X. He and Michael D. Boze as part of the EECS 398-003 Practical Data Science portfolio homework by Professor Suraj Rampure at the University of Michigan College of Engineering.

The dataset was provided by the Laboratory for Advancing Sustainable Critical Infrastructure at Purdue University.
