Analysis of median household income and environmental impact metrics (deforestation and CO₂ emissions) across U.S. states
This project implements an ETL (Extract, Transform, Load) pipeline that combines US state-level income data from the National Center for Education Statistics (NCES) with deforestation and CO₂ emissions data from Global Forest Watch (GFW). The pipeline processes and merges these datasets to facilitate analysis of the relationship between income levels and environmental factors across US states.
The pipeline extracts data from online sources, performs necessary transformations and cleanups, and loads the merged dataset into a SQLite database for further analysis. This tool is designed for researchers, policymakers, and data analysts interested in exploring the connections between economic indicators and environmental changes at the state level in the United States.
Project structure:

.
├── project
│   ├── data_report.md
│   ├── ETL
│   │   ├── extract.py
│   │   ├── load.py
│   │   └── transform.py
│   ├── pipeline.py
│   ├── pipeline.sh
│   ├── project-plan.md
│   ├── test.py
│   └── tests.sh
└── README.md
- project/pipeline.py: Main entry point for the ETL pipeline (a minimal sketch follows this list)
- project/ETL/extract.py: Handles data extraction from the NCES and GFW sources
- project/ETL/transform.py: Performs data cleaning and transformation
- project/ETL/load.py: Loads the processed data into the SQLite database
- project/test.py: Contains unit tests for the pipeline
- project/tests.sh: Shell script to run the test suite
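For orientation, project/pipeline.py might wire the stages together roughly as follows. This is a minimal sketch, not the actual implementation: the function names extract_data, transform_data, and load_data are illustrative assumptions.

```python
# Illustrative sketch of the ETL entry point (function names are assumptions;
# see project/ETL/*.py for the real implementations).
import logging

from ETL.extract import extract_data      # download the NCES and GFW workbooks
from ETL.transform import transform_data  # clean and merge the two datasets
from ETL.load import load_data            # write the merged table to SQLite

logging.basicConfig(filename="etl_pipeline.log", level=logging.INFO)


def run_pipeline() -> None:
    """Run extract -> transform -> load, logging each stage."""
    logging.info("Starting extraction")
    raw_income, raw_gfw = extract_data()
    logging.info("Starting transformation")
    merged = transform_data(raw_income, raw_gfw)
    logging.info("Loading into SQLite")
    load_data(merged)
    logging.info("Pipeline finished")


if __name__ == "__main__":
    run_pipeline()
```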
Prerequisites:
- Python 3.11
- The packages listed in project/requirements.txt

Installation:
- Clone the repository:
  git clone <repository_url>
  cd <repository_local_name>
- Install the required Python packages:
  pip install -r project/requirements.txt
To execute the ETL pipeline:
cd project
./pipeline.sh
This will download the latest data from NCES and GFW, process it, and store the results in a SQLite database.
To run the test suite:
cd project
./tests.sh
The pipeline uses the following data sources by default:
- NCES Income Data: https://nces.ed.gov/programs/digest/d22/tables/xls/tabn102.30.xlsx
- GFW Deforestation Data: https://gfw2-data.s3.amazonaws.com/country-pages/country_stats/download/2023/USA.xlsx
To modify these sources, update the URLs in project/pipeline.py.
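As a rough illustration of what such a change involves (the constant names below are assumptions, not the actual variable names in the code):

```python
# Hypothetical constants in project/pipeline.py; update these if the
# upstream files move. The names are illustrative only.
NCES_INCOME_URL = "https://nces.ed.gov/programs/digest/d22/tables/xls/tabn102.30.xlsx"
GFW_USA_URL = "https://gfw2-data.s3.amazonaws.com/country-pages/country_stats/download/2023/USA.xlsx"
```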
Common issues and solutions:
- Data Download Failures
  - Problem: Unable to download data from the NCES or GFW URLs
  - Solution:
    - Check your internet connection
    - Verify the URLs in project/pipeline.py are still valid
    - If the URLs have changed, update them in the code
- Database Creation Issues
  - Problem: The SQLite database is not created after running the pipeline
  - Solution:
    - Ensure you have write permissions in the project directory
    - Check if SQLite is installed and functioning on your system
    - Verify the database path in project/ETL/load.py
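For either issue, a quick standalone check can help narrow down the cause. The sketch below uses the requests and sqlite3 libraries; the URL and database path are placeholders to adjust to your setup, not values used by the pipeline.

```python
# Diagnostic sketch: verify that a source URL is reachable and that SQLite
# can create a database in the current directory.
import sqlite3

import requests

URL = "https://nces.ed.gov/programs/digest/d22/tables/xls/tabn102.30.xlsx"
DB_PATH = "diagnostic_check.db"  # placeholder database path

resp = requests.head(URL, allow_redirects=True, timeout=30)
print("URL status:", resp.status_code)  # 200 means the file is reachable

with sqlite3.connect(DB_PATH) as conn:  # fails if the directory is not writable
    conn.execute("CREATE TABLE IF NOT EXISTS ping (ok INTEGER)")
print("SQLite write check passed")
```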
To enable debug logging:
- Open project/pipeline.py
- Locate the logging configuration section
- Change level=logging.INFO to level=logging.DEBUG
- Run the pipeline again to see detailed debug output

Log files are stored in etl_pipeline.log in the project directory.
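The logging setup in project/pipeline.py looks roughly like the snippet below (a sketch: the log file name and level switch are as described above, while the format string is an assumed example).

```python
import logging

# Switch level=logging.INFO to level=logging.DEBUG for verbose output.
# Output goes to etl_pipeline.log in the project directory.
logging.basicConfig(
    filename="etl_pipeline.log",
    level=logging.DEBUG,  # was logging.INFO
    format="%(asctime)s %(levelname)s %(message)s",  # assumed format string
)
logging.debug("Debug logging enabled")
```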
The ETL pipeline processes data through the following steps:
- Extraction (a minimal sketch follows this list):
  - Downloads NCES income data and GFW deforestation data from the specified URLs
  - Saves the raw data as Excel files
- Transformation (sketched after the data-flow diagram below):
  - Cleans the NCES income data by removing unnecessary rows and columns
  - Processes the GFW data to extract the relevant deforestation and CO₂ emission information
  - Merges the cleaned income and environmental data based on state-level information
- Loading (also sketched after the diagram):
  - Creates an SQLite database
  - Loads the merged and transformed data into the database
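A minimal sketch of the extraction step, assuming pandas is used to read the workbooks directly from the source URLs (the function name and local file names are illustrative):

```python
# Illustrative extraction sketch: download both source workbooks and keep
# local Excel copies. Function and file names are assumptions.
import pandas as pd

NCES_INCOME_URL = "https://nces.ed.gov/programs/digest/d22/tables/xls/tabn102.30.xlsx"
GFW_USA_URL = "https://gfw2-data.s3.amazonaws.com/country-pages/country_stats/download/2023/USA.xlsx"


def extract_data() -> tuple[pd.DataFrame, pd.DataFrame]:
    """Download the raw NCES and GFW workbooks and save them locally."""
    income_raw = pd.read_excel(NCES_INCOME_URL)          # NCES income table
    gfw_raw = pd.read_excel(GFW_USA_URL, sheet_name=0)   # first sheet of the GFW workbook
    income_raw.to_excel("nces_income_raw.xlsx", index=False)
    gfw_raw.to_excel("gfw_usa_raw.xlsx", index=False)
    return income_raw, gfw_raw
```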
Data flow:

[NCES Data] --> [Extract] --> [Clean Income Data] ---\
                                                      \
[GFW Data] --> [Extract] --> [Clean GFW Data] ---------> [Merge Data] --> [Load to SQLite]
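The transformation and loading steps might look roughly like this. It is a sketch under stated assumptions: the column names ("State", "Year"), the merge keys, the database name, and the table name are illustrative, and the real cleaning logic in project/ETL/transform.py is more involved.

```python
# Illustrative transform-and-load sketch. Column names, merge keys, database
# name, and table name are assumptions, not the pipeline's actual schema.
import sqlite3

import pandas as pd


def transform_data(income_raw: pd.DataFrame, gfw_raw: pd.DataFrame) -> pd.DataFrame:
    """Clean both frames and merge them on state and year."""
    income = income_raw.dropna(how="all").rename(columns=str.strip)
    gfw = gfw_raw.dropna(how="all").rename(columns=str.strip)
    # Align on state and year so income and environmental records match up.
    return pd.merge(income, gfw, on=["State", "Year"], how="inner")


def load_data(merged: pd.DataFrame, db_path: str = "income_environment.db") -> None:
    """Write the merged table into a SQLite database."""
    with sqlite3.connect(db_path) as conn:
        merged.to_sql("income_environment", conn, if_exists="replace", index=False)
```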
Notes:
- The pipeline ensures data consistency by aligning years between the income and environmental data.
- Error handling and logging are implemented throughout the pipeline.
- Further data and correlation analysis can be found in project/report.ipynb.