[ETL-153] Changes necessary for exporting study data in prod #65

Merged
merged 24 commits into from
May 10, 2022
Changes from 22 commits
9ca1496
Initial prod stacks for mtb studies
philerooski Apr 13, 2022
46d750f
Update backfill json datasets script to use new workflow parameters
philerooski Feb 23, 2022
34ed903
extend functionality of and rename backfill_json_datasets
philerooski Feb 25, 2022
e5377d2
Remove support for submitting Synapse folders
philerooski Feb 25, 2022
c93240e
Add dockerfile for bootstrap trigger job
philerooski Feb 25, 2022
2bafcb9
submit to workflow in batches, rather than all files at once
philerooski Mar 7, 2022
06bbee4
Update dataset mapping
philerooski Apr 14, 2022
0e5df8d
Add microphone table and resources
philerooski Apr 14, 2022
fc7bcc1
Use a dataset mapping with all tables to create dataset resources
philerooski Apr 14, 2022
0ac4faa
Upload EC2 resources (crontab) to artifacts bucket
philerooski Apr 21, 2022
0c603bb
Add parameter for json to parquet trigger schedule
philerooski Apr 21, 2022
e3f14fa
Add rest of prod study stacks, add ec2-bootstrap-trigger template
philerooski Apr 22, 2022
4b01649
changes in response to tom's comments
philerooski Apr 22, 2022
c2657f5
Add AmazonSSMManagedInstanceCore managed policy to bootstrap trigger ec2
philerooski Apr 26, 2022
b66d419
Update dataset mapping
philerooski Apr 14, 2022
21cc7b7
Add microphone table and resources
philerooski Apr 14, 2022
bb9dad7
Fix botched rebasing
philerooski May 2, 2022
b0eedd9
Place bootstrap trigger EC2 in private subnet
philerooski May 3, 2022
4021c37
Add crontab for bootstrap trigger cron job
philerooski May 3, 2022
62473cb
Use new ec2-bootstrap-trigger template w/private subnets
philerooski May 3, 2022
85df163
Update crontab to use prod ECR repository
philerooski May 9, 2022
181f2b3
Parameterize ECR repository for use across dev/prod
philerooski May 9, 2022
d179e8b
Parameterize security group which ec2 is placed in
philerooski May 10, 2022
0571a58
Fix cron expression in prod study stacks
philerooski May 10, 2022
1 change: 0 additions & 1 deletion config/config.yaml
@@ -2,7 +2,6 @@ project_code: bridge_downstream
namespace: {{ var.namespace | default('bridge-downstream') }}
latest_version: v0.1
region: us-east-1
artifact_bucket_name: sceptre-cloudformation-bucket-bucket-65ci2qog5w6l
synapseAuthSsmParameterName: synapse-bridgedownstream-auth
admincentral_cf_bucket: bootstrap-awss3cloudformationbucket-19qromfd235z9
default_stack_tags:
1 change: 1 addition & 0 deletions config/develop/config.yaml
@@ -0,0 +1 @@
artifact_bucket_name: sceptre-cloudformation-bucket-bucket-65ci2qog5w6l
1 change: 1 addition & 0 deletions config/prod/config.yaml
@@ -0,0 +1 @@
artifact_bucket_name: sceptre-cloudformation-bucket-bucket-10mwvvuhlvtk9
thomasyu888 marked this conversation as resolved.
7 changes: 7 additions & 0 deletions config/prod/ec2-bootstrap-trigger.yaml
@@ -0,0 +1,7 @@
template_path: ec2-bootstrap-trigger.yaml
stack_name: ec2-bootstrap-trigger
parameters:
  SsmParameterName: synapse-bridgedownstream-auth
  CrontabURI: s3://{{ stack_group_config.artifact_bucket_name }}/BridgeDownstream/{{ stack_group_config.latest_version }}/ec2/resources/crontab
  DockerImage: 611413694531.dkr.ecr.us-east-1.amazonaws.com/bootstrap_trigger
  SubnetId: !stack_output_external vpc-mini::PrivateSubnet
7 changes: 7 additions & 0 deletions config/prod/glue-classifier-array-of-records.yaml
@@ -0,0 +1,7 @@
#{% set classifier_name= stack_group_config.namespace + '-array-of-records-classifier' %}
template_path: glue-json-classifier.yaml
stack_name: {{classifier_name}}
parameters:
  ClassifierName: {{classifier_name}}
stack_tags:
  {{ stack_group_config.default_stack_tags }}
16 changes: 16 additions & 0 deletions config/prod/glue-job-S3ToJsonS3.yaml
@@ -0,0 +1,16 @@
template_path: glue-spark-job.j2
dependencies:
  - prod/s3-intermediate-bucket.yaml
stack_name: '{{ stack_group_config.namespace }}-glue-job-S3ToJsonS3'
parameters:
  JobRole: !stack_output_external glue-job-role::RoleArn
  S3BucketName: !stack_output_external bridge-downstream-intermediate-bucket::BucketName
  BookmarkOption: job-bookmark-disable
  JobDescription: Convert data to JSONS3 data
  MaxConcurrentRuns: '150'
  S3ScriptLocation: s3://{{ stack_group_config.artifact_bucket_name }}/BridgeDownstream/{{ stack_group_config.latest_version }}/glue/jobs/s3_to_json_s3.py
  SynapseAuthSsmParameterName: {{ stack_group_config.synapseAuthSsmParameterName }}
  AdditionalPythonModules: 'synapseclient'
  DatasetMapping: s3://{{ stack_group_config.artifact_bucket_name }}/BridgeDownstream/{{ stack_group_config.namespace }}/glue/resources/dataset_mapping.json
stack_tags:
  {{ stack_group_config.default_stack_tags }}
6 changes: 6 additions & 0 deletions config/prod/glue-job-role.yaml
@@ -0,0 +1,6 @@
template_path: glue-job-role.yaml
stack_name: glue-job-role
parameters:
  SsmParameterName: {{ stack_group_config.synapseAuthSsmParameterName }}
stack_tags:
  {{ stack_group_config.default_stack_tags }}
10 changes: 10 additions & 0 deletions config/prod/s3-intermediate-bucket.yaml
@@ -0,0 +1,10 @@
template_path: s3-bucket.yaml
stack_name: bridge-downstream-intermediate-bucket
dependencies:
  - prod/glue-job-role.yaml
parameters:
  BucketName: bridge-downstream-intermediate-data
  ReadWriteAccessArns:
    - !stack_output_external glue-job-role::RoleArn
stack_tags:
  {{ stack_group_config.default_stack_tags }}
1 change: 1 addition & 0 deletions config/prod/s3-parquet-bucket.yaml
@@ -6,5 +6,6 @@ parameters:
  BucketName: bridge-downstream-parquet
  ReadWriteAccessArns:
    - !stack_output_external glue-job-role::RoleArn
  SynapseIds: '3432808'
stack_tags:
  {{ stack_group_config.default_stack_tags }}
30 changes: 30 additions & 0 deletions config/prod/studies/mobile-toolbox-cxhnxd.yaml
@@ -0,0 +1,30 @@
template_path: study-pipeline-infra.j2
stack_name: '{{ stack_group_config.namespace }}-cxhnxd'
dependencies:
  - prod/glue-job-role.yaml
  - prod/s3-intermediate-bucket.yaml
  - prod/s3-parquet-bucket.yaml
  - prod/glue-classifier-array-of-records.yaml
  - prod/glue-job-S3ToJsonS3.yaml
parameters:
  Namespace: {{ stack_group_config.namespace }}
  AppName: mobile-toolbox
  StudyName: cxhnxd
  TemplateBucketName: {{ stack_group_config.artifact_bucket_name }}
  ArtifactRef: {{ stack_group_config.latest_version }}
  JsonBucketName: !stack_output_external bridge-downstream-intermediate-bucket::BucketName
  ParquetBucketName: !stack_output_external bridge-downstream-parquet-bucket::BucketName
  RoleArn: !stack_output_external glue-job-role::RoleArn
  ClassifierName: !stack_output_external '{{ stack_group_config.namespace }}-array-of-records-classifier::ClassifierName'
  SynapseAuthSsmParameterName: '{{ stack_group_config.synapseAuthSsmParameterName }}'
  S3ToJsonS3JobName: !stack_output_external '{{ stack_group_config.namespace }}-glue-job-S3ToJsonS3::JobName'
  JsonToParquetTriggerSchedule: 'cron(30 * * * ? *)'

stack_tags:
  {{ stack_group_config.default_stack_tags }}

sceptre_user_data:
  dataset_version_mapping: !file src/glue/resources/dataset_mapping.json
  dataset_crawler_assignments: !file src/glue/resources/dataset_crawler_assignments.yaml
  # this needs to be replaced with real versioned schemas
  dataset_schemas: !file src/glue/resources/table_columns.yaml
30 changes: 30 additions & 0 deletions config/prod/studies/mobile-toolbox-fmqcjv.yaml
@@ -0,0 +1,30 @@
template_path: study-pipeline-infra.j2
stack_name: '{{ stack_group_config.namespace }}-fmqcjv'
dependencies:
  - prod/glue-job-role.yaml
  - prod/s3-intermediate-bucket.yaml
  - prod/s3-parquet-bucket.yaml
  - prod/glue-classifier-array-of-records.yaml
  - prod/glue-job-S3ToJsonS3.yaml
parameters:
  Namespace: {{ stack_group_config.namespace }}
  AppName: mobile-toolbox
  StudyName: fmqcjv
  TemplateBucketName: {{ stack_group_config.artifact_bucket_name }}
  ArtifactRef: {{ stack_group_config.latest_version }}
  JsonBucketName: !stack_output_external bridge-downstream-intermediate-bucket::BucketName
  ParquetBucketName: !stack_output_external bridge-downstream-parquet-bucket::BucketName
  RoleArn: !stack_output_external glue-job-role::RoleArn
  ClassifierName: !stack_output_external '{{ stack_group_config.namespace }}-array-of-records-classifier::ClassifierName'
  SynapseAuthSsmParameterName: '{{ stack_group_config.synapseAuthSsmParameterName }}'
  S3ToJsonS3JobName: !stack_output_external '{{ stack_group_config.namespace }}-glue-job-S3ToJsonS3::JobName'
  JsonToParquetTriggerSchedule: 'cron(40 * * * ? *)'

stack_tags:
  {{ stack_group_config.default_stack_tags }}

sceptre_user_data:
  dataset_version_mapping: !file src/glue/resources/dataset_mapping.json
  dataset_crawler_assignments: !file src/glue/resources/dataset_crawler_assignments.yaml
  # this needs to be replaced with real versioned schemas
  dataset_schemas: !file src/glue/resources/table_columns.yaml
30 changes: 30 additions & 0 deletions config/prod/studies/mobile-toolbox-htshxm.yaml
@@ -0,0 +1,30 @@
template_path: study-pipeline-infra.j2
stack_name: '{{ stack_group_config.namespace }}-htshxm'
dependencies:
  - prod/glue-job-role.yaml
  - prod/s3-intermediate-bucket.yaml
  - prod/s3-parquet-bucket.yaml
  - prod/glue-classifier-array-of-records.yaml
  - prod/glue-job-S3ToJsonS3.yaml
parameters:
  Namespace: {{ stack_group_config.namespace }}
  AppName: mobile-toolbox
  StudyName: htshxm
  TemplateBucketName: {{ stack_group_config.artifact_bucket_name }}
  ArtifactRef: {{ stack_group_config.latest_version }}
  JsonBucketName: !stack_output_external bridge-downstream-intermediate-bucket::BucketName
  ParquetBucketName: !stack_output_external bridge-downstream-parquet-bucket::BucketName
  RoleArn: !stack_output_external glue-job-role::RoleArn
  ClassifierName: !stack_output_external '{{ stack_group_config.namespace }}-array-of-records-classifier::ClassifierName'
  SynapseAuthSsmParameterName: '{{ stack_group_config.synapseAuthSsmParameterName }}'
  S3ToJsonS3JobName: !stack_output_external '{{ stack_group_config.namespace }}-glue-job-S3ToJsonS3::JobName'
  JsonToParquetTriggerSchedule: 'cron(35 * * * ? *)'
thomasyu888 marked this conversation as resolved.

stack_tags:
  {{ stack_group_config.default_stack_tags }}

sceptre_user_data:
  dataset_version_mapping: !file src/glue/resources/dataset_mapping.json
  dataset_crawler_assignments: !file src/glue/resources/dataset_crawler_assignments.yaml
  # this needs to be replaced with real versioned schemas
  dataset_schemas: !file src/glue/resources/table_columns.yaml
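Note on the `JsonToParquetTriggerSchedule` values in the study stacks: AWS schedule expressions of the form `cron(...)` take six fields (minutes, hours, day-of-month, month, day-of-week, year), and the day-of-month and day-of-week fields cannot both be specified, so one of the two has to be `?`. An expression with `*` in both positions would be rejected. The following is a minimal sketch of a pre-deploy check for that rule. The `check_aws_cron` helper is hypothetical and not part of this PR, and it only checks the field count and the `?` rule, not the full cron grammar.

```python
import re
from typing import List

# Hypothetical helper (not part of this PR): a shallow check of an AWS
# schedule expression such as 'cron(30 * * * ? *)'.
_CRON_RE = re.compile(r"^cron\((?P<body>[^()]*)\)$")


def check_aws_cron(expression: str) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    match = _CRON_RE.match(expression.strip())
    if match is None:
        return ["expression is not wrapped in cron(...)"]
    fields = match.group("body").split()
    if len(fields) != 6:
        return [f"expected 6 fields, found {len(fields)}"]
    _minutes, _hours, day_of_month, _month, day_of_week, _year = fields
    problems = []
    # AWS requires '?' in one of these two fields when the other is set.
    if day_of_month != "?" and day_of_week != "?":
        problems.append(
            "day-of-month and day-of-week are both set; one must be '?'"
        )
    return problems


if __name__ == "__main__":
    for expr in ("cron(30 * * * * *)", "cron(30 * * * ? *)"):
        print(expr, "->", check_aws_cron(expr) or "ok")
```

A check like this could run over every study config before `sceptre launch`, so a rejected schedule is caught before CloudFormation reports it.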
3 changes: 3 additions & 0 deletions src/ec2/resources/crontab
@@ -0,0 +1,3 @@
0 * * * * ec2-user docker run --rm 611413694531.dkr.ecr.us-east-1.amazonaws.com/bootstrap_trigger python /root/BridgeDownstream/src/scripts/bootstrap_trigger/bootstrap_trigger.py --file-view syn29357512 --raw-folder-id syn29300097 --glue-workflow bridge-downstream-mobile-toolbox-cxhnxd-S3ToJsonWorkflow --ssm-parameter synapse-bridgedownstream-auth
5 * * * * ec2-user docker run --rm 611413694531.dkr.ecr.us-east-1.amazonaws.com/bootstrap_trigger python /root/BridgeDownstream/src/scripts/bootstrap_trigger/bootstrap_trigger.py --file-view syn29357540 --raw-folder-id syn29300165 --glue-workflow bridge-downstream-mobile-toolbox-htshxm-S3ToJsonWorkflow --ssm-parameter synapse-bridgedownstream-auth
10 * * * * ec2-user docker run --rm 611413694531.dkr.ecr.us-east-1.amazonaws.com/bootstrap_trigger python /root/BridgeDownstream/src/scripts/bootstrap_trigger/bootstrap_trigger.py --file-view syn29357582 --raw-folder-id syn29300178 --glue-workflow bridge-downstream-mobile-toolbox-fmqcjv-S3ToJsonWorkflow --ssm-parameter synapse-bridgedownstream-auth
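The three crontab entries differ only in the minute offset, the Synapse IDs, and the study name embedded in the Glue workflow name. As a minimal sketch, such lines could be generated from a list of studies so that the offsets stay staggered as studies are added. The `Study` dataclass and `render_crontab` helper are hypothetical and not part of the repository; the image URI, script path, flags, and IDs are copied from the entries above.

```python
from dataclasses import dataclass
from typing import List

# Values copied from the crontab in this PR.
IMAGE = "611413694531.dkr.ecr.us-east-1.amazonaws.com/bootstrap_trigger"
SCRIPT = "/root/BridgeDownstream/src/scripts/bootstrap_trigger/bootstrap_trigger.py"


@dataclass(frozen=True)
class Study:
    """Hypothetical container for the per-study values in one crontab line."""
    name: str
    file_view: str
    raw_folder_id: str


def render_crontab(studies: List[Study], step_minutes: int = 5) -> List[str]:
    """Render one hourly system-crontab line per study, staggered by step_minutes."""
    lines = []
    for index, study in enumerate(studies):
        minute = index * step_minutes
        if minute > 59:
            raise ValueError("too many studies for an hourly schedule at this step")
        workflow = f"bridge-downstream-mobile-toolbox-{study.name}-S3ToJsonWorkflow"
        lines.append(
            f"{minute} * * * * ec2-user docker run --rm {IMAGE} python {SCRIPT} "
            f"--file-view {study.file_view} --raw-folder-id {study.raw_folder_id} "
            f"--glue-workflow {workflow} --ssm-parameter synapse-bridgedownstream-auth"
        )
    return lines


if __name__ == "__main__":
    studies = [
        Study("cxhnxd", "syn29357512", "syn29300097"),
        Study("htshxm", "syn29357540", "syn29300165"),
        Study("fmqcjv", "syn29357582", "syn29300178"),
    ]
    print("\n".join(render_crontab(studies)))
```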
1 change: 1 addition & 0 deletions src/glue/resources/dataset_crawler_assignments.yaml
@@ -7,4 +7,5 @@ standard:
  - weather_v1
array_of_records:
  - microphone_levels_v1
  - microphone_v1
  - motion_v1
24 changes: 24 additions & 0 deletions src/glue/resources/dataset_mapping.json
@@ -15,6 +15,18 @@
"taskdata": "v1",
"taskresult": "v1",
"weather": "v1"
},
"14":
{
"answers": "v1",
"info": "v1",
"metadata": "v1",
"microphone_levels": "v1",
"microphone": "v1",
"motion": "v1",
"taskdata": "v1",
"taskresult": "v1",
"weather": "v1"
}
}
},
@@ -32,6 +44,18 @@
"taskdata": "v1",
"taskresult": "v1",
"weather": "v1"
},
"74":
{
"answers": "v1",
"info": "v1",
"metadata": "v1",
"microphone_levels": "v1",
"microphone": "v1",
"motion": "v1",
"taskdata": "v1",
"taskresult": "v1",
"weather": "v1"
}
}
}
35 changes: 35 additions & 0 deletions src/glue/resources/table_columns.yaml
@@ -197,6 +197,41 @@ tables:
        Type: string
      - Name: day
        Type: string
  microphone_v1:
    columns:
      - Name: uptime
        Type: double
      - Name: unit
        Type: string
      - Name: peak
        Type: double
      - Name: average
        Type: double
      - Name: steppath
        Type: string
      - Name: timeinterval
        Type: int
      - Name: timestamp
        Type: double
      - Name: assessmentid
        Type: string
      - Name: year
        Type: int
      - Name: month
        Type: int
      - Name: day
        Type: int
      - Name: recordid
        Type: string
    partition_keys:
      - Name: assessmentid
        Type: string
      - Name: year
        Type: string
      - Name: month
        Type: string
      - Name: day
        Type: string
  motion_v1:
    columns:
      - Name: uptime
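In `microphone_v1`, the fields `year`, `month`, and `day` are typed `int` under `columns` but `string` under `partition_keys`. The PR does not say whether that difference is intentional. A minimal sketch that lists such differences follows, assuming the table definition has already been loaded into a plain dict of the shape shown in the diff. The `partition_type_mismatches` helper is hypothetical, and the example table is trimmed to the relevant columns.

```python
from typing import Dict, List, Tuple

Field = Dict[str, str]


def partition_type_mismatches(table: Dict[str, List[Field]]) -> List[Tuple[str, str, str]]:
    """Return (name, column_type, partition_key_type) for fields whose types differ."""
    column_types = {col["Name"]: col["Type"] for col in table.get("columns", [])}
    mismatches = []
    for key in table.get("partition_keys", []):
        name, key_type = key["Name"], key["Type"]
        column_type = column_types.get(name)
        # Only report fields that appear in both sections with different types.
        if column_type is not None and column_type != key_type:
            mismatches.append((name, column_type, key_type))
    return mismatches


# Trimmed copy of the microphone_v1 entry from table_columns.yaml.
microphone_v1 = {
    "columns": [
        {"Name": "assessmentid", "Type": "string"},
        {"Name": "year", "Type": "int"},
        {"Name": "month", "Type": "int"},
        {"Name": "day", "Type": "int"},
        {"Name": "recordid", "Type": "string"},
    ],
    "partition_keys": [
        {"Name": "assessmentid", "Type": "string"},
        {"Name": "year", "Type": "string"},
        {"Name": "month", "Type": "string"},
        {"Name": "day", "Type": "string"},
    ],
}

if __name__ == "__main__":
    for name, column_type, key_type in partition_type_mismatches(microphone_v1):
        print(f"{name}: column is {column_type}, partition key is {key_type}")
```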
110 changes: 0 additions & 110 deletions src/scripts/backfill_json_datasets/backfill_json_datasets.py

This file was deleted.

4 changes: 4 additions & 0 deletions src/scripts/bootstrap_trigger/Dockerfile
@@ -0,0 +1,4 @@
FROM python:3.9.10

RUN pip install boto3==1.16.33 pandas==1.1.5 synapseclient==2.5.1 pyarrow==4.0.0
RUN git clone https://github.com/Sage-Bionetworks/BridgeDownstream.git /root/BridgeDownstream
Make sure the docker build doesn't build from an existing cache or new changes to the repo won't get pulled in.
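One way to address this, sketched under assumptions: pass `--no-cache` to `docker build` so that the `git clone` layer is re-executed on every build instead of being served from a cached layer. The `docker_build_command` helper, the image tag, and the build-context path below are hypothetical; `--no-cache` and `-t` are standard `docker build` flags.

```python
from typing import List


def docker_build_command(tag: str, context_dir: str, no_cache: bool = True) -> List[str]:
    """Assemble a `docker build` argument list, skipping the layer cache by default."""
    command = ["docker", "build"]
    if no_cache:
        # Forces every layer, including the git clone, to be rebuilt.
        command.append("--no-cache")
    command += ["-t", tag, context_dir]
    return command


if __name__ == "__main__":
    # Hypothetical tag and build context; adjust to the actual repository layout.
    cmd = docker_build_command("bootstrap_trigger", "src/scripts/bootstrap_trigger")
    print(" ".join(cmd))
    # The list can be passed to subprocess.run(cmd, check=True) where Docker is available.
```

Skipping the cache rebuilds the `pip install` layer as well, so builds take longer; that is the trade-off for always cloning the current state of the repository.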
