
apache-airflow-spark

This repository implements a simple ETL process: data is extracted from an API, transformed with Spark, and loaded into an AWS S3 bucket. The batch process is orchestrated with Airflow, which submits the Spark job through a Spark submit operator.

We use data from the Indian Mutual Fund API; you can read more about that API online. The focus of this repository is Spark and Airflow.
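
For orientation, here is a minimal sketch of what such a Spark ETL job could look like. It is illustrative only and not the repository's spark_etl_script.py; the mfapi.in endpoint, scheme code, and S3 bucket name are assumptions/placeholders.

# spark_etl_sketch.py -- illustrative sketch, not the repo's spark_etl_script.py.
# The API endpoint, scheme code, and bucket name below are placeholders/assumptions.
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mutual-fund-etl").getOrCreate()

# Extract: fetch NAV history for one scheme from the mutual fund API (hypothetical scheme code).
payload = requests.get("https://api.mfapi.in/mf/118550", timeout=30).json()

# Transform: turn the list of {"date", "nav"} records into a DataFrame and cast the NAV to double.
df = spark.createDataFrame(payload["data"])
df = df.withColumn("nav", df["nav"].cast("double"))

# Load: write the result to S3 as Parquet (assumes the hadoop-aws / s3a configuration is in place).
df.write.mode("overwrite").parquet("s3a://your-bucket/mutual_funds/")

spark.stop()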

Things to do:

  • Set up Apache Spark locally.
  • Set up Apache Airflow locally.
  • Write the Spark jobs to extract, transform, and load the data.
  • Design the Airflow DAG to trigger and schedule the Spark jobs.

Set up Apache Spark locally

bash scripts/spark_installation.sh
source ~/.bashrc
start-master.sh
start-slave.sh spark://XXXXXXXXXXXX:7077
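
Optionally, before running the full job, a quick connectivity check can confirm the worker has registered with the master. This is a sketch with assumed names; replace the spark:// URL with the one printed by start-master.sh.

# cluster_smoke_test.py -- optional sanity check; the master URL below is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://XXXXXXXXXXXX:7077")
    .appName("cluster-smoke-test")
    .getOrCreate()
)

# A trivial distributed count; prints 10 if the master and worker are reachable.
print(spark.range(10).count())
spark.stop()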

Set up Apache Airflow locally

bash scripts/airflow_installer.sh

Run the Spark job to see if everything works on the Spark side

spark-submit --master spark://XXXXXXXXXXXX:7077 spark_etl_script.py
  • If everything works as expected on the Spark side, create the Airflow dags folder in AIRFLOW_HOME
mkdir ~/airflow/dags
  • Move the Spark job DAG file to the Airflow dags folder
mv dags/spark_jobs_dag.py ~/airflow/dags
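
The DAG in dags/spark_jobs_dag.py is what Airflow schedules. As a rough sketch (not the repository's actual file), a DAG using the SparkSubmitOperator from the apache-airflow-providers-apache-spark package could look like the following; the dag_id, schedule, application path, and connection id are assumptions.

# spark_jobs_dag_sketch.py -- a rough sketch, not the repository's dags/spark_jobs_dag.py.
# dag_id, schedule, application path, and connection id below are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_etl_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_etl = SparkSubmitOperator(
        task_id="submit_spark_etl",
        application="/home/user/airflow/dags/spark_etl_script.py",  # path to the Spark job
        conn_id="spark_default",  # Airflow connection pointing at spark://<master-host>:7077
        verbose=True,
    )

With the DAG file in ~/airflow/dags, the Airflow scheduler picks it up and submits the Spark job to the master configured in the spark_default connection.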
