├── ana_code
│ ├── combined_data_analysis.hql
│ ├── screenshots
│ │ ├── **/*.png
├── data
│ ├── raw
│ │ ├── **/*.csv
├── data_ingest
│ ├── data_ingestion.sh
│ ├── data_ingest_screenshot.png
├── etl_code
│ ├── nicole
│ │ ├── cleaned_children_blood_lead.scala
│ │ ├── **/*.png
│ ├── seoeun
│ │ ├── cleaned_housing_con.scala
│ │ ├── **/*.png
│ ├── nicole&seoeun
│ │ ├── cleaned_economic.scala
│ │ ├── **/*.png
├── profiling_code
│ ├── nicole
│ │ ├── before_clean
│ │ │ ├── children_blood_lead_profile_before_clean.scala
│ │ │ ├── **/*.png
│ │ ├── after_clean
│ │ │ ├── children_blood_lead_profile_clean.hql
│ │ │ ├── **/*.png
│ ├── seoeun
│ │ ├── before_clean
│ │ │ ├── housing_con_data_profile_before_clean.scala
│ │ │ ├── **/*.png
│ │ ├── after_clean
│ │ │ ├── housing_con_profile_clean.hql
│ │ │ ├── **/*.png
│ ├── nicole&seoeun
│ │ ├── before_clean
│ │ │ ├── economics_data_profile_before_clean.scala
│ │ │ ├── **/*.png
│ │ ├── after_clean
│ │ │ ├── econ_profile_clean.hql
│ │ │ ├── **/*.png
└── README.md
Data ingestion:

1. Start Dataproc.
2. Upload the original versions of the CSV datasets, found in `data/raw`, to Dataproc.
3. Upload the script file `data_ingestion.sh`, found in `data_ingest`, to Dataproc.
4. Make the script executable and run it with `chmod +x data_ingestion.sh` followed by `./data_ingestion.sh`. This sets up a folder called `originalDataSets` on HDFS containing all of your CSV files.
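The upload-and-run sequence for ingestion can be sketched as a shell dry run. The use of `gcloud compute scp` and the master-node name `my-cluster-m` are assumptions (one common way to copy files onto a Dataproc master); the sketch prints each command rather than executing it, so run the printed lines by hand on the cluster.

```shell
# Dry-run sketch of the ingestion steps. The cluster master-node name
# "my-cluster-m" and the `gcloud compute scp` upload are assumptions;
# adjust for your own cluster. Commands are printed, not executed.
master="my-cluster-m"   # hypothetical Dataproc master node name

steps="gcloud compute scp data/raw/*.csv data_ingest/data_ingestion.sh ${master}:~
chmod +x data_ingestion.sh
./data_ingestion.sh"

printf '%s\n' "$steps"
```

Running `data_ingestion.sh` on the master node is what creates the `originalDataSets` folder on HDFS.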
Data profiling (before cleaning):

1. Upload `children_blood_lead_profile_before_clean.scala`, located in `profiling_code/nicole/before_clean`, to Dataproc.
2. Run the Scala file with `spark-shell --deploy-mode client -i children_blood_lead_profile_before_clean.scala`.
3. After running this command, you will see the profiling results in the Spark Scala shell.
4. Repeat steps 1-3 for the Scala files `housing_con_data_profile_before_clean.scala` and `economics_data_profile_before_clean.scala`, located in `profiling_code/seoeun/before_clean` and `profiling_code/nicole&seoeun/before_clean` respectively.
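The "repeat steps 1-3" instruction for the before-clean profiling scripts can be sketched as a single shell loop. This is a dry run that only prints the `spark-shell` invocation for each script (it assumes the scripts have already been uploaded to the working directory); drop the `echo` to execute them on the Dataproc master node.

```shell
# Dry run: print the spark-shell invocation for each before-clean
# profiling script. Remove "echo" to actually run them on the cluster.
ran=0
for script in \
    children_blood_lead_profile_before_clean.scala \
    housing_con_data_profile_before_clean.scala \
    economics_data_profile_before_clean.scala
do
  echo spark-shell --deploy-mode client -i "$script"
  ran=$((ran + 1))
done
```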
ETL (data cleaning):

1. Upload `cleaned_children_blood_lead.scala`, located in `etl_code/nicole`, to Dataproc.
2. Run the Scala file with `spark-shell --deploy-mode client -i cleaned_children_blood_lead.scala`.
3. After running this command, you will find the results in `finalCode/lead` on HDFS. Note that the results for the other Scala files will be located in `finalCode/housing` and `finalCode/econ`.
4. Repeat steps 1-3 for the Scala files `cleaned_housing_con.scala` and `cleaned_economic.scala`, located in `etl_code/seoeun` and `etl_code/nicole&seoeun` respectively.
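The three cleaning jobs and the HDFS directory each one writes can be summarized in one dry-run loop. The script:output pairing is taken from the steps above; the loop only prints each `spark-shell` command, so remove the `echo` to execute on the cluster.

```shell
# Dry run: each cleaning job paired with the HDFS output directory it
# writes, as "script:outputdir" pairs taken from the README steps.
jobs="cleaned_children_blood_lead.scala:finalCode/lead
cleaned_housing_con.scala:finalCode/housing
cleaned_economic.scala:finalCode/econ"

count=0
for job in $jobs; do          # default IFS splits on the newlines
  script=${job%%:*}
  outdir=${job#*:}
  echo spark-shell --deploy-mode client -i "$script" "# writes $outdir"
  count=$((count + 1))
done
```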
Data profiling (after cleaning):

1. Upload `children_blood_lead_profile_clean.hql`, located in `profiling_code/nicole/after_clean`, to Dataproc.
2. Run the HiveQL file with `beeline -u "jdbc:hive2://localhost:10000" -f children_blood_lead_profile_clean.hql`.
3. After running this command, you will see the profiling results in the Hive shell.
4. Repeat steps 1-3 for the HiveQL files `housing_con_profile_clean.hql` and `econ_profile_clean.hql`, located in `profiling_code/seoeun/after_clean` and `profiling_code/nicole&seoeun/after_clean` respectively.
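Likewise, the after-clean profiling scripts can be driven by one loop over the three `.hql` files. This dry run only prints each `beeline` invocation (note the shell strips the quotes around the JDBC URL when printing); remove the `echo` to run them for real.

```shell
# Dry run: print the beeline invocation for each after-clean profiling
# script. Remove "echo" to execute on the cluster.
n=0
for hql in \
    children_blood_lead_profile_clean.hql \
    housing_con_profile_clean.hql \
    econ_profile_clean.hql
do
  echo beeline -u "jdbc:hive2://localhost:10000" -f "$hql"
  n=$((n + 1))
done
```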
Data analysis:

1. Upload `combined_data_analysis.hql`, located in `ana_code`, to Dataproc.
2. Run the HiveQL file with `beeline -u "jdbc:hive2://localhost:10000" -f combined_data_analysis.hql`.
3. After running this command, you will see the analysis results in the Hive shell.

We put our original input data into the `originalDataSets` directory and our cleaned input data into the `finalCode` directory.
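As a sanity check once the pipeline has run, the HDFS directories named above can be listed. This dry run just prints the `hdfs dfs -ls` command for each directory the pipeline populates; drop the `echo` to run the checks on the cluster.

```shell
# Dry run: list the HDFS directories the pipeline populates
# (originalDataSets from ingestion, finalCode/* from the ETL jobs).
checked=0
for dir in originalDataSets finalCode/lead finalCode/housing finalCode/econ; do
  echo hdfs dfs -ls "$dir"
  checked=$((checked + 1))
done
```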