Initial commit archive_dataset script #33
Conversation
Also added module comments to the top of the other scripts I wrote in the past few weeks. Additionally, extended backfill_json_datasets to use cached Synapse credentials if an ssm-parameter name is not passed.
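A rough sketch of that fallback behavior, assuming a synapseclient login plus an SSM lookup through boto3. The function name and parameter handling below are illustrative only, not the script's actual code:

from typing import Optional

import boto3
import synapseclient


def get_synapse_client(ssm_parameter: Optional[str] = None) -> synapseclient.Synapse:
    """Log in to Synapse, preferring an SSM-stored token when a parameter name is given."""
    syn = synapseclient.Synapse()
    if ssm_parameter is not None:
        # Fetch the Synapse personal access token from AWS SSM Parameter Store
        token = boto3.client("ssm").get_parameter(
            Name=ssm_parameter, WithDecryption=True
        )["Parameter"]["Value"]
        syn.login(authToken=token)
    else:
        # Fall back to credentials already cached on this machine,
        # e.g. from a prior `synapse login` or the Synapse config file
        syn.login()
    return syn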
Some general comments for you. I like how you add examples in the docstrings - very informative.
It's not completely necessary to include the Jira link in the description, but I understand it's more convenient than going to the ETL project in Jira and searching for the issue.
if len(relevant_archive_dataset_prefixes) == 0:
    # No datasets of this type and version in archive
    return 0
preexisting_update_nums = [
This will break if the archive number is accidentally not added.
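For context, a hypothetical illustration of the parsing that comment refers to. The prefix values and variable contents here are invented, not the PR's actual data: if each archive prefix is expected to end in an update number, a prefix that accidentally lacks one makes the int() conversion raise a ValueError.

relevant_archive_dataset_prefixes = [
    "archive/app/dataset_v1_1/",
    "archive/app/dataset_v1_2/",
]

# Take the trailing "_<n>" of each prefix as its update number
preexisting_update_nums = [
    int(prefix.rstrip("/").split("_")[-1])
    for prefix in relevant_archive_dataset_prefixes
]

next_update_num = max(preexisting_update_nums) + 1  # 3

# A prefix like "archive/app/dataset_v1/" (no trailing number) would make
# int("v1") raise, which is the failure mode described above.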
Also joined prefixes with the os module for clarity.
LGTM - just some comments. We might also want to wait until Tess comes back so she is looped into the process you created.
Keyword arguments:
s3_client -- An AWS S3 client object.
bucket -- The S3 bucket name.
app -- The app identifier, as used in the S3 prefix in the bucket, to archive.
dataset -- The name of the dataset to archive.
dataset_version -- The dataset version to archive.

Returns:
A dictionary with source prefixes as keys and destination
prefixes as values.
Just out of curiosity, where did you find this Python docstring standard? I'm only familiar with:
- pydoc
- numpy
dest = source_and_dest[source]
bash_command = f"aws s3 cp --recursive {source} {dest}"
subprocess.run(
    shlex.split(bash_command),
This is very cool! I did not know about shlex
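For anyone else unfamiliar with it, a small self-contained sketch of the pattern in the diff above: shlex.split turns a shell-style command string into the argument list subprocess.run expects, so no shell=True is needed. The bucket paths are placeholders, not values from the PR:

import shlex
import subprocess

source = "s3://example-bucket/app/dataset_v1/"
dest = "s3://example-bucket/archive/app/dataset_v1_1/"

# Copy everything under the source prefix to the destination prefix
bash_command = f"aws s3 cp --recursive {source} {dest}"
subprocess.run(shlex.split(bash_command), check=True)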
Addresses: https://sagebionetworks.jira.com/browse/ETL-89