Initial commit archive_dataset script #33
Conversation
Also added module comments to the top of the other scripts I wrote in the past few weeks. Additionally, extended backfill_json_datasets to use cached Synapse credentials if an ssm-parameter name is not passed.
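A rough sketch of that fallback behavior, assuming a synapseclient login plus an SSM lookup through boto3. The function name and parameter handling below are illustrative only, not the script's actual code:

from typing import Optional

import boto3
import synapseclient


def get_synapse_client(ssm_parameter: Optional[str] = None) -> synapseclient.Synapse:
    """Log in to Synapse, preferring an SSM-stored token when a parameter name is given."""
    syn = synapseclient.Synapse()
    if ssm_parameter is not None:
        # Fetch the Synapse personal access token from AWS SSM Parameter Store
        token = boto3.client("ssm").get_parameter(
            Name=ssm_parameter, WithDecryption=True
        )["Parameter"]["Value"]
        syn.login(authToken=token)
    else:
        # Fall back to credentials already cached on this machine,
        # e.g. from a prior `synapse login` or the Synapse config file
        syn.login()
    return syn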
Some general comments for you. I like how you add examples in the docstrings - very informative.
It's not completely necessary to include the Jira link in the description, but I understand it's more convenient than going to the ETL project in Jira and searching for the issue.
if len(relevant_archive_dataset_prefixes) == 0:
    # No datasets of this type and version in archive
    return 0
preexisting_update_nums = [
This will break if the archive number is accidentally not added.
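For context, a hypothetical illustration of the parsing that comment refers to. The prefix values and variable contents here are invented, not the PR's actual data: if each archive prefix is expected to end in an update number, a prefix that accidentally lacks one makes the int() conversion raise a ValueError.

relevant_archive_dataset_prefixes = [
    "archive/app/dataset_v1_1/",
    "archive/app/dataset_v1_2/",
]

# Take the trailing "_<n>" of each prefix as its update number
preexisting_update_nums = [
    int(prefix.rstrip("/").split("_")[-1])
    for prefix in relevant_archive_dataset_prefixes
]

next_update_num = max(preexisting_update_nums) + 1  # 3

# A prefix like "archive/app/dataset_v1/" (no trailing number) would make
# int("v1") raise, which is the failure mode described above.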
Also joined prefixes with the os module for clarity.
LGTM - just some comments. We might also want to wait until Tess comes back so she is looped into the process you created.
Keyword arguments:
s3_client -- An AWS S3 client object.
bucket -- The S3 bucket name.
app -- The app identifier, as used in the S3 prefix in the bucket, to archive.
dataset -- The name of the dataset to archive.
dataset_version -- The dataset version to archive.

Returns:
A dictionary with source prefixes as keys and destination
prefixes as values.
Just out of curiosity, where did you find this Python docstring standard? I'm only familiar with:
- pydoc
- numpy
dest = source_and_dest[source]
bash_command = f"aws s3 cp --recursive {source} {dest}"
subprocess.run(
    shlex.split(bash_command),
This is very cool! I did not know about shlex
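For anyone else unfamiliar with it, a small self-contained sketch of the pattern in the diff above: shlex.split turns a shell-style command string into the argument list subprocess.run expects, so no shell=True is needed. The bucket paths are placeholders, not values from the PR:

import shlex
import subprocess

source = "s3://example-bucket/app/dataset_v1/"
dest = "s3://example-bucket/archive/app/dataset_v1_1/"

# Copy everything under the source prefix to the destination prefix
bash_command = f"aws s3 cp --recursive {source} {dest}"
subprocess.run(shlex.split(bash_command), check=True)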
Addresses: https://sagebionetworks.jira.com/browse/ETL-89