-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for subfolders for aws s3 #1191
Add support for subfolders for aws s3 #1191
Conversation
1483fac
to
24499ba
Compare
@vsreekanti should this subfolder thing also apply when the uses the integration as a data integration? Eg. should the file paths be relative to the subdirectory also? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kenxu95 looks good overall, left a few questions and comments.
# Check that any user-supplied root directory exists. | ||
if self.root_dir != "": | ||
# If nothing is returned by this filter call, then the directory does not exist. | ||
if len(list(self.s3.Bucket(self.bucket).objects.filter(Prefix=self.root_dir))) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's a lot of objects in this root directory, this call may take very long to finish. I wonder if there's a lower overhead way of checking if a s3 directory exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, let's discuss in https://aqueducthq.slack.com/archives/C01KF131001/p1681168125986639? We already use this method in our extract
implementation when fetching entire directories.
region: str = "" | ||
|
||
# Expected to be in the format "path/to/dir/" (without a leading slash, but with a trailing one). | ||
root_dir: str = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used in data connection? I thought it's only relevant for S3 used as storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has to be here because the model forbids extra fields. I'll add a note that it's actually unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to add the note.
return bucket, key | ||
|
||
if root_dir_path != "": | ||
path_parts += [_sanitize_path(root_dir_path)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if root_dir_path is assumed to be in the format path/to/dir/
, why do we need to sanitize it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is assumed, but we don't check it here. I didn't want to be too strict with the assertions in case there was a codepath I missed. Agree, it's confusing - maybe I can just remove the comment for now and make no assumptions about the root path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to remove the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. See reminders inline to add the notes.
return bucket, key | ||
|
||
if root_dir_path != "": | ||
path_parts += [_sanitize_path(root_dir_path)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to remove the comment.
region: str = "" | ||
|
||
# Expected to be in the format "path/to/dir/" (without a leading slash, but with a trailing one). | ||
root_dir: str = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder to add the note.
Describe your changes and why you are making these changes
When registering an S3 integration as the storage layer, the user can specify a subfolder that then acts as the root of artifact storage. The vault/ folder is created inside this subfolder. This subfolder can be defined with optional leading/trailing slashes - the backend will sanitize it into the consistent form path/to/dir before writing it to the database.
Related issue number (if any)
ENG-2738
Loom demo (if any)
https://www.loom.com/share/3dbf7fd6fa394117a79d5b77661af3d3
Checklist before requesting a review
python3 scripts/run_linters.py -h
for usage).run_integration_test
: Runs integration testsskip_integration_test
: Skips integration tests (Should be used when changes are ONLY documentation/UI)