
Initial commit archive_dataset script #33


Merged

philerooski merged 2 commits into main from etl-89 on Jan 10, 2022

Conversation

philerooski (Contributor) commented:

Addresses: https://sagebionetworks.jira.com/browse/ETL-89

Also added module comments to the top of the other scripts I wrote
in the past few weeks. Additionally, extended backfill_json_datasets
to use cached Synapse credentials if an ssm-parameter name is not
passed.
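
A minimal sketch of that credential fallback, assuming a helper like the one below (the function name and argument are hypothetical, not the script's actual interface):

```python
import boto3
import synapseclient

def get_synapse_client(ssm_parameter_name=None):
    """Log in to Synapse, preferring an SSM-stored token when given."""
    if ssm_parameter_name is not None:
        # Fetch the Synapse auth token from AWS SSM Parameter Store
        ssm_client = boto3.client("ssm")
        token = ssm_client.get_parameter(
            Name=ssm_parameter_name,
            WithDecryption=True,
        )["Parameter"]["Value"]
        return synapseclient.login(authToken=token)
    # No parameter name passed: fall back to locally cached credentials
    return synapseclient.login()
```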

philerooski requested a review from tthyer on December 17, 2021 at 23:34
philerooski requested a review from a team as a code owner on December 17, 2021 at 23:34
thomasyu888 (Member) left a comment:

Some general comments for you. I like how you add examples in the docstrings - very informative.

It's not strictly necessary to include the Jira link in the description, but I understand it's more convenient than going to the ETL project in Jira and searching for the issue.

```python
if len(relevant_archive_dataset_prefixes) == 0:
    # No datasets of this type and version in archive
    return 0
preexisting_update_nums = [
```
thomasyu888 (Member):

This will break if the archive number accidentally doesn't get added.
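
A sketch of that concern, assuming the prefixes end in an update number (the helper and prefix format here are illustrative, not the PR's actual code): a naive `int()` parse raises as soon as the numeric suffix is missing, so a guard like the one below would be needed.

```python
import posixpath

def parse_update_num(prefix):
    # e.g. "app/dataset_v1/archive/3/" -> 3; returns None instead of
    # raising ValueError when the trailing component is not numeric
    last_component = posixpath.basename(prefix.rstrip("/"))
    return int(last_component) if last_component.isdigit() else None

preexisting_update_nums = [
    num
    for num in map(parse_update_num, relevant_archive_dataset_prefixes)
    if num is not None
]
```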

Also joined prefixes with the os module for clarity
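
Illustratively, the join mentioned in the note above might look like this (the variable values are made up). Since S3 keys always use `/`, `posixpath.join` (the same API `os.path.join` delegates to on POSIX) avoids platform-specific separators:

```python
import posixpath

app, dataset = "example-app", "example-dataset"  # illustrative values

# an explicit join instead of f-string concatenation:
archive_prefix = posixpath.join(app, dataset, "archive")
print(archive_prefix)  # example-app/example-dataset/archive
```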
philerooski temporarily deployed to develop on December 20, 2021 at 23:20 (Inactive)
thomasyu888 (Member) left a comment:

LGTM - just some comments. We might also want to wait until Tess comes back so she is looped into the process you created.

Comment on lines +67 to +76
```python
Keyword arguments:
s3_client -- An AWS S3 client object.
bucket -- The S3 bucket name.
app -- The app identifier, as used in the S3 prefix in the bucket, to archive.
dataset -- The name of the dataset to archive.
dataset_version -- The dataset version to archive.

Returns:
A dictionary with source prefixes as keys and destination
prefixes as values.
```
thomasyu888 (Member):

Just out of curiosity, where did you find this python docstring standard? I'm only familiar with

  • google
  • pydoc
  • numpy
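
For what it's worth, the `name -- description` layout under a `Keyword arguments:` heading matches the multi-line docstring example in PEP 257. For comparison, the same docstring in Google style (one of the conventions listed above) might read as follows; the function signature is an assumption for illustration:

```python
def archive_dataset(s3_client, bucket, app, dataset, dataset_version):
    """Archive a dataset.

    Args:
        s3_client: An AWS S3 client object.
        bucket: The S3 bucket name.
        app: The app identifier, as used in the S3 prefix, to archive.
        dataset: The name of the dataset to archive.
        dataset_version: The dataset version to archive.

    Returns:
        A dictionary with source prefixes as keys and destination
        prefixes as values.
    """
```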

philerooski (Contributor, Author) replied.

```python
dest = source_and_dest[source]
bash_command = f"aws s3 cp --recursive {source} {dest}"
subprocess.run(
    shlex.split(bash_command),
```
thomasyu888 (Member):

This is very cool! I did not know about shlex.
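
For anyone else unfamiliar with it: `shlex.split` tokenizes a command string the way a POSIX shell would, which lets `subprocess.run` take an argument list instead of relying on `shell=True`. A quick illustration:

```python
import shlex

args = shlex.split('aws s3 cp --recursive "s3://bucket/a b" s3://bucket/c')
print(args)
# ['aws', 's3', 'cp', '--recursive', 's3://bucket/a b', 's3://bucket/c']
# Note the quoted key with a space survives as a single argument.
```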

philerooski merged commit bb8c912 into main on Jan 10, 2022
philerooski deleted the etl-89 branch on January 10, 2022 at 21:51