GitHub - UCSF-MS-DCC/msbio_origins_etl: etl files for adding origins patient data to msbioscreen database

UCSF-MS-DCC / msbio_origins_etl Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

etl files for adding origins patient data to msbioscreen database

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.DS_Store		.DS_Store
.gitignore		.gitignore
README.txt		README.txt
extract_attacks.sql		extract_attacks.sql
extract_patients.sql		extract_patients.sql
extract_treatments.sql		extract_treatments.sql
extract_visits.sql		extract_visits.sql
load_attacks.rb		load_attacks.rb
load_patients.rb		load_patients.rb
load_treatments.rb		load_treatments.rb
load_visits.rb		load_visits.rb
upload_attacks.rb		upload_attacks.rb
upload_patients.rb		upload_patients.rb
upload_treatments.rb		upload_treatments.rb
upload_visits.rb		upload_visits.rb

Repository files navigation

ETL OF ORIGINS DATA POINTS FROM MSBASE TO MSBIOSCREEN - July 10, 2017 edited August 2, 2017

The process of adding Origins patient data followed these steps:

1) Extract the data points using .sql statements executed on the msbase databases using phpmyadmin and outputting the query results to a .csv file.

2) Sanity checking the query results in excel.

3) Parsing and uploading the data to the database via the msbioscreen api using ruby scripts.

EXTRACTING DATA POINTS

Using the msbioscreen api documentation as a guide, I created sql queries to gather the datapoints for uploading to the subjects, visits, attacks, and treatments endpoints. These files are named extract_modelname.

PARSING AND UPLOADING DATA

To upload the resulting .csv files, I created simple ruby scripts to parse the csv and upload the data row by row. This was achieved using the CSV and HTTParty gems. Each script parses the csv row by row and executes a post request to the appropriate api endpoint for the data points involved. I tried some elementary error handling, but removed it as it was not working as intended. All data so far has uploaded without problem. The scripts to upload data all have names that begin with load_ followed by the model name of the data. In
most cases this is the same name as the api endpoint to post the data to. The load scripts may be run from a terminal command line, either in a ruby console (irb) or by executing with the ruby command (ruby load_visits.rb).

STILL TO DO

Automate the extraction/upsertion process to run several times an hour.
Find out why uploaded data points are not appearing in the bioscreen app (any version).

OTHER NOTES

This directory contains a .gitignore file to ensure that .csv files and vars.rb are not checked into github.

To use the load scripts, you will need to create a file called vars.rb in the same directory as the load scripts. This is where you should place variable data such as urls and token data. DO NOT check this file into github!

vars.rb variable format:

@acceptance_base_url = "https://path/to/acceptance/url"
@prod_base_url = "https://path/to/production/url"

@patients_url = "api/v1/subjects"
@visits_url = "api/v1/visits"
@attacks_url = "api/v1/attacks"
@treatments_url = "api/v1/treatments"

@acceptance_auth_token = "AUTH_TOKEN_GOES_HERE"
@prod_auth_token = "AUTH_TOKEN_GOES_HERE"

@working_dir_base = absolute path of the local directory containing the folder with the csv files
@working_dir = name of the directory containing the csv files to be loaded into the database

Note that the variable names are configured to work with the load scripts. The variable values are subject to change. Substitute your actual authorization token for AUTH_TOKEN_GOES_HERE.