Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ETL-153] Changes necessary for exporting study data in prod #65

Merged
merged 24 commits into from
May 10, 2022

Conversation

philerooski
Copy link
Contributor

@philerooski philerooski commented Apr 22, 2022

Change summary:

  • Move artifact bucket from global config to dev/prod specific config
  • Introduce prod stacks
  • Add microphone_v1 table. (Same structure as microphone_levels_v1, the developers decided to rename this file for some reason).
  • Added Android build 14 and iPhone OS build 74 to dataset mapping
  • Add bootstrap_trigger script (formerly backfill json datasets)
  • Add ec2-bootstrap-trigger stack, which is used to run the bootstrap trigger script in prod
  • Add crontab for bootstrap trigger script to src/ec2/resources/
  • Added JsonToParquetTriggerSchedule parameter to study-pipeline template. Omitting this parameter creates an on-demand json to parquet trigger. Passing a cron expression to this parameter creates a json to parquet trigger which runs on a schedule. 6. t4tg. 6edrfgvb

@philerooski philerooski temporarily deployed to develop April 22, 2022 17:58 Inactive
@philerooski philerooski marked this pull request as ready for review April 22, 2022 17:59
@philerooski philerooski requested a review from a team as a code owner April 22, 2022 17:59
@philerooski philerooski changed the title Changes necessary for exporting study data in prod [ETL-153] Changes necessary for exporting study data in prod Apr 22, 2022
@philerooski philerooski temporarily deployed to develop April 22, 2022 18:01 Inactive
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: !Sub "The security group for ${AWS::StackName}"
SecurityGroupIngress:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this security group? also there probably needs to be tighter control on this instance. We should probably put it behind our VPN.

Copy link
Member

@thomasyu888 thomasyu888 Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding VpcId - something like this: https://github.com/nlpsandbox/nlpsandbox-infra/blob/main/sceptre/nlpsandbox/templates/nlpsandbox.yaml#L188..

Do we have a VPC set up? If there is no VPC set up, we have to do something like this: https://github.com/Sage-Bionetworks-Challenges/cnb-aws-infra/pull/1/files

That would be another JIRA ticket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made another commit to add the security group to the VPC

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasyu888 Does the EC2 need to be in a VPC because of the ingress rules? I should be able to remove this rule so that it's not possible to connect to the instance. Everything is taken care of by the EC2 user data. The ssh access is just for debugging purposes.

Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, but there is a security concern about the ec2. We need to put it behind our VPN.

@philerooski philerooski temporarily deployed to develop April 22, 2022 20:18 Inactive
@philerooski philerooski temporarily deployed to develop April 22, 2022 20:21 Inactive
@philerooski philerooski requested a review from thomasyu888 April 22, 2022 20:22
SsmParameterName: synapse-bridgedownstream-auth
PrivateKey: ec2-bootstrap-trigger
CrontabURI: s3://{{ stack_group_config.artifact_bucket_name }}/BridgeDownstream/{{ stack_group_config.latest_version }}/ec2/resources/crontab
VPCId: vpc-d6c2a9ab
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this vpc-d6c2a9ab? Did you create it yourself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can view the VPC in the console. I didn't create it myself.

Copy link
Member

@thomasyu888 thomasyu888 Apr 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at this, and the VPC is just the default VPC created and it doesn't meet our security standards. This resource needs to sit behind the Sage VPN: https://sagebionetworks.jira.com/browse/ETL-176

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philerooski philerooski temporarily deployed to develop April 26, 2022 23:13 Inactive
@philerooski philerooski temporarily deployed to develop April 26, 2022 23:15 Inactive
@philerooski philerooski temporarily deployed to develop May 2, 2022 21:50 Inactive
@philerooski philerooski temporarily deployed to develop May 2, 2022 22:13 Inactive
@philerooski philerooski temporarily deployed to develop May 2, 2022 22:15 Inactive
@philerooski philerooski temporarily deployed to develop May 3, 2022 19:13 Inactive
@philerooski philerooski temporarily deployed to develop May 3, 2022 19:15 Inactive
In addition to submitting files on Synapse to a workflow, this script
can now diff those Synapse files with a parquet dataset on S3 before
submitting to a workflow. The script has been renamed to reflect this
additional functionality.
@philerooski philerooski temporarily deployed to develop May 9, 2022 18:40 Inactive
@philerooski philerooski temporarily deployed to develop May 9, 2022 18:44 Inactive
@philerooski philerooski temporarily deployed to develop May 9, 2022 23:48 Inactive
@philerooski philerooski temporarily deployed to develop May 9, 2022 23:52 Inactive
FROM python:3.9.10

RUN pip install boto3==1.16.33 pandas==1.1.5 synapseclient==2.5.1 pyarrow==4.0.0
RUN git clone -b https://github.com/Sage-Bionetworks/BridgeDownstream.git /root/BridgeDownstream
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure the docker build doesn't build from an existing cache or new changes to the repo won't get pulled in.

@philerooski philerooski temporarily deployed to develop May 10, 2022 15:46 Inactive
@philerooski philerooski temporarily deployed to develop May 10, 2022 15:51 Inactive
@philerooski philerooski temporarily deployed to develop May 10, 2022 17:02 Inactive
@philerooski philerooski requested a review from zaro0508 May 10, 2022 17:07
@philerooski philerooski temporarily deployed to develop May 10, 2022 17:17 Inactive
@philerooski philerooski temporarily deployed to develop May 10, 2022 17:17 Inactive
@philerooski philerooski merged commit 442a662 into main May 10, 2022
@philerooski philerooski deleted the etl-153 branch May 10, 2022 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants