Perform data engineering on healthcare data by applying EDA and data correctness checks using Great Expectations. We also built an end-to-end data pipeline using Apache Airflow that reads, cleans, and structures the data according to the Tuva schema. At the end we performed data analysis on the processed data.
- Database schema
- Clone the repository and make it your working directory.
- Open the notebook `Data Analysis Health Care.ipynb` and run the data assessment and data analysis on the merged data once you have run the pipeline. Make sure Jupyter Notebook is installed, or use Google Colab to run the notebook.
- Create a Python virtual environment: `$ python -m venv .airflow_venv`
- Activate the virtual environment: `$ source .airflow_venv/bin/activate`
- Upgrade pip: `$ pip install --upgrade pip`
- Install requirements: `$ pip install -r requirements.txt`
  Or follow the Airflow docs to install Airflow. Some important commands are as follows:
  a. `$ export AIRFLOW_HOME="$(pwd)"` (run this command from the `./airflow` directory)
  b. `$ pip` [install by following docs]
- Run the service:
  a. `$ airflow db init`
  b. `$ airflow db migrate`
  c. `$ airflow users create --username admin --firstname Younis --lastname Ali --role Admin --email [email protected]`
  Or simply run the following command, which executes all of the above steps and creates the user `admin` with a password:
  `$ airflow standalone`
- Then visit http://0.0.0.0:8080/dags/health_data/grid to trigger the pipeline.
- Before triggering the pipeline, make sure you have the required files in the `airflow/data/` directory in the given format.
Visit the link to see the database schema, which gives a sense of how the different tables are linked: Database Schema
I am doing EDA and data validation on the patients data only. The same mechanism can be used for the other four datasets; there will be small differences in implementation, but the core concept remains the same.
I used Great Expectations to add expectations, or rules, on the patients data. I obtained the JSON file of expectations and saved it as `./gx/expectations/patients_ex_suite.json`. This JSON can be used later on to validate the data and examine the dataset against our set expectations. The same expectations JSON can be obtained for the other datasets.
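A minimal sketch of how such an expectation suite could be built and saved as JSON is shown below, using the legacy `ge.from_pandas` dataset API (the exact calls depend on the Great Expectations version you have installed). The input path, the `Id` column, and the gender value set are assumptions; only the `GENDER` column and the suite path come from the description above.

```python
import json

import great_expectations as ge
import pandas as pd

# Load the raw patients dataset (path is an assumption).
patients_df = pd.read_csv("airflow/data/patients.csv")

# Wrap the dataframe so expectations can be attached to it.
patients_ds = ge.from_pandas(patients_df)

# Example rules: the patient id should exist and be unique, gender should be a known value.
patients_ds.expect_column_values_to_not_be_null("Id")
patients_ds.expect_column_values_to_be_unique("Id")
patients_ds.expect_column_values_to_be_in_set("GENDER", ["M", "F"])

# Persist the suite so the pipeline can re-validate incoming data later.
suite = patients_ds.get_expectation_suite(discard_failed_expectations=False)
with open("gx/expectations/patients_ex_suite.json", "w") as f:
    json.dump(suite.to_json_dict(), f, indent=2)
```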
I did the EDA and data correctness checks only on the patients data; the same approach can be followed for the other datasets.
Using dataframes, I read the 6 different datasets into respective dataframes. I use XCom to return the dataframes as JSON so that they can be used by other tasks in the pipeline.
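As a rough illustration of that pattern, the TaskFlow-style sketch below (assuming Airflow 2.4+) returns a dataframe from one task as a JSON string, which Airflow pushes to XCom, and rebuilds it in the next task. The DAG id, task names, path, and the placeholder cleaning step are assumptions, not the actual pipeline code.

```python
from datetime import datetime
from io import StringIO

import pandas as pd
from airflow.decorators import dag, task


@dag(dag_id="health_data_xcom_example", start_date=datetime(2024, 1, 1),
     schedule=None, catchup=False)
def health_data_xcom_example():

    @task
    def read_patients() -> str:
        # The return value is pushed to XCom automatically; JSON keeps it serializable.
        df = pd.read_csv("airflow/data/patients.csv")
        return df.to_json(orient="records")

    @task
    def clean_patients(patients_json: str) -> str:
        # Rebuild the dataframe from the JSON pulled out of XCom.
        df = pd.read_json(StringIO(patients_json), orient="records")
        df = df.dropna(how="all")  # placeholder cleaning step
        return df.to_json(orient="records")

    clean_patients(read_patients())


health_data_xcom_example()
```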
In the cleaning process I check for data quality issues using Great Expectations. There is a lot of scope for data cleaning, such as managing null values, handling data types, etc. I resolved two main issues (a pandas sketch of both fixes follows this list):
- In the `patients` raw data, the `GENDER` column has null values, so I used the `patient_gender` metadata to update the `patient` data with the actual values.
- The `Symptoms` table, which corresponds to `Observation` in the Tuva model, has no primary key, so I added a PK to this table.
- I dropped the columns from the datasets that did not match the Tuva model schema.
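A rough pandas sketch of the two fixes, assuming the metadata is a CSV and that the join key and column names look roughly like this (only `GENDER`, `patient_gender`, and the Symptoms-as-Observation mapping come from the description above):

```python
import pandas as pd

patients = pd.read_csv("airflow/data/patients.csv")
patient_gender = pd.read_csv("airflow/data/patient_gender.csv")  # metadata with the actual genders
symptoms = pd.read_csv("airflow/data/symptoms.csv")

# 1. Fill null GENDER values in patients from the patient_gender metadata.
patients = patients.merge(
    patient_gender.rename(columns={"GENDER": "GENDER_META"}),
    left_on="Id", right_on="PATIENT_ID", how="left",
)
patients["GENDER"] = patients["GENDER"].fillna(patients["GENDER_META"])
patients = patients.drop(columns=["GENDER_META", "PATIENT_ID"])

# 2. Symptoms (Tuva "Observation") has no primary key, so add a surrogate one.
symptoms.insert(0, "observation_id", list(range(1, len(symptoms) + 1)))
```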
- Check the schema of the Tuva project for each dataset.
- Drop the columns from the given datasets that are not in the respective Tuva input layer schema.
- The file `./resources/tuva_schema_map.json` contains the mapping of our dataset fields to the Tuva schema. This enables the user to handle the mapping easily by simply configuring this file (see the sketch after this list).
- Rename the remaining columns in each dataset and then dump the processed CSV files into the `processed_data` directory. Though the pipeline receives data of different types, we use a single type, CSV, to store the data.
- In the future we can use an open database like PostgreSQL to read and save the data in the pipeline.
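A minimal sketch of that mapping step for one dataset, assuming `tuva_schema_map.json` maps each dataset name to a `{source column: Tuva column}` dictionary (the actual structure of the file may differ):

```python
import json

import pandas as pd

with open("resources/tuva_schema_map.json") as f:
    schema_map = json.load(f)

patients = pd.read_csv("airflow/data/patients.csv")
patient_map = schema_map["patients"]  # assumed shape: {source column: Tuva column}

# Keep only the columns covered by the Tuva input layer, then rename them.
patients = patients[list(patient_map.keys())].rename(columns=patient_map)
patients.to_csv("processed_data/patients.csv", index=False)
```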
- On the basis of patient_id we merge the whole dataset into a single CSV, `./processed_data/merged_data` (a sketch of this merge is shown below).
- This merged data can be accessed in the Jupyter notebook for data analysis purposes.
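A small sketch of the merge step, assuming the processed files carry a common `patient_id` column after the Tuva renaming (the file names other than the output path are illustrative):

```python
import pandas as pd

patients = pd.read_csv("processed_data/patients.csv")
encounters = pd.read_csv("processed_data/encounters.csv")
symptoms = pd.read_csv("processed_data/symptoms.csv")

# Join everything on the patient id and write out a single merged CSV.
merged = (
    patients
    .merge(encounters, on="patient_id", how="left")
    .merge(symptoms, on="patient_id", how="left")
)
merged.to_csv("processed_data/merged_data", index=False)
```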
We perform basic data analysis on the merged data in the Jupyter notebook.
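For instance, a first pass at that assessment in the notebook could look like the following (the merged-data path comes from the pipeline output above; the specific checks are just examples):

```python
import pandas as pd

# Load the merged output of the pipeline.
merged = pd.read_csv("processed_data/merged_data")

# Basic assessment: shape, missing values, and summary statistics.
print(merged.shape)
print(merged.isna().sum())
print(merged.describe(include="all"))
```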
- Miro. I used this tool to create the ER schema of the database, which gives a sense of how the different relations are linked together.
- Great Expectations. I used Great Expectations, an open-source Python-based framework, to validate data and maintain data quality and correctness.
- Pandas. Pandas, a Python-based data manipulation framework, is used to perform a basic level of EDA.