Data Processing with PySpark and Delta Lake on AWS EMR

This is the companion code for the Unskew data blog post here.

We use PySpark to process data and write it to S3 as a Delta Lake table. Later, we discuss how to deploy this PySpark application to EMR.
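As a rough illustration of the write step described above, here is a minimal sketch. It is not the code in this repository. The bucket name, prefix, CSV input format, and the function names `delta_spark_conf`, `table_path`, and `write_delta_table` are all assumptions made for the example. It also assumes the Delta Lake jars are already on the Spark classpath (for example, supplied through `--packages` at submit time).

```python
from typing import Dict

# Hypothetical bucket and prefix, used only for illustration.
BUCKET = "my-example-bucket"
PREFIX = "tables/events"


def delta_spark_conf() -> Dict[str, str]:
    """Spark settings that enable Delta Lake's SQL extension and catalog."""
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": (
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"
        ),
    }


def table_path(bucket: str, prefix: str) -> str:
    """Build the S3 location of the table, dropping stray slashes in the prefix."""
    return f"s3://{bucket}/{prefix.strip('/')}"


def write_delta_table(input_path: str, output_path: str) -> None:
    """Read CSV input (an assumed format) and overwrite a Delta table at output_path."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("pyspark-emr-delta-sketch")
    for key, value in delta_spark_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    df = spark.read.option("header", "true").csv(input_path)
    # format("delta") writes Parquet data files plus the Delta transaction log.
    df.write.format("delta").mode("overwrite").save(output_path)
    spark.stop()


# Example call (needs Spark, the Delta jars, and S3 credentials):
# write_delta_table("data/", table_path(BUCKET, PREFIX))
```

The `s3://` scheme is the one normally used on EMR. When running Spark outside EMR, an `s3a://` path with the Hadoop AWS connector is the usual substitute.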

As always, please write to us with any questions, comments or improvements.

Repository contents:

data
pyspark_emr_delta
.gitignore
README.md
poetry.lock
prepare_submit.sh
pyproject.toml
requirements.txt
