Spark: remove object storage data path in destination table for snapshot table action #2966
kbendick
left a comment
A few comments, particularly around documenting (via code comments or in the docs) that we're explicitly not bringing the object storage path along with the snapshot, but overall I think this is a good idea.
If a user snapshots a table, sharing the same object storage path between the source and the snapshot could have some funky implications for things like removing orphan files. So this seems safer in my opinion.
properties.remove(LOCATION);
properties.remove(TableProperties.WRITE_METADATA_LOCATION);
properties.remove(TableProperties.WRITE_NEW_DATA_LOCATION);
properties.remove(TableProperties.OBJECT_STORE_PATH);
Non-blocking: it might make sense to add a comment here that we’re explicitly choosing not to bring along OBJECT_STORE_PATH in the snapshot?
Either a comment, or possibly updating the ObjectStorageLocationProvider docs / snapshot docs with this detail would be great 🙂. Documentation updates can be done in a separate PR of course (and happy to assist there if you’d like).
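A minimal sketch of what such a comment could look like, written against a plain `Map` so that it runs without Iceberg on the classpath. The class name, the method name, and the key strings are assumptions made for illustration (the strings are meant to mirror the `TableProperties` constants in the diff above); this is not the code from the PR.

```java
import java.util.HashMap;
import java.util.Map;

class SnapshotPropertiesSketch {
  // Assumed string values of the constants referenced in the diff.
  static final String LOCATION = "location";
  static final String WRITE_METADATA_LOCATION = "write.metadata.path";
  static final String WRITE_NEW_DATA_LOCATION = "write.folder-storage.path";
  static final String OBJECT_STORE_PATH = "write.object-storage.path";

  // Hypothetical helper: copies the source table properties and drops
  // every location-related entry before they are applied to the snapshot.
  static Map<String, String> destinationProperties(Map<String, String> source) {
    Map<String, String> properties = new HashMap<>(source);

    // Deliberately NOT carried over to the snapshot table: if both tables
    // wrote under the same paths, maintenance on one table (for example
    // removing orphan files) could touch files that belong to the other.
    properties.remove(LOCATION);
    properties.remove(WRITE_METADATA_LOCATION);
    properties.remove(WRITE_NEW_DATA_LOCATION);
    properties.remove(OBJECT_STORE_PATH);

    return properties;
  }
}
```

All other properties, such as the flag that enables the object storage location provider, are kept, and the source map itself is left unchanged.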
Assert.assertEquals("should use object storage location provider",
"org.apache.iceberg.LocationProviders$ObjectStoreLocationProvider",
locationProvider.getClass().getName());
Assert.assertTrue("should use table folder storage path",
Nit: we might want to further clarify what we're testing for in the assertion.
Something like "should use table folder storage path after unsetting the object storage location path" or "should use table folder storage path if present when object storage path is not present".
One could argue that these assertion messages are subject to the same problems as comment rot if the tests get changed, so I'll defer to your judgement.
Also: given that the names of the constants and their string representations are a little funky (particularly folder storage path / WRITE_NEW_DATA_LOCATION), it might make sense to refer to both at some point? Again, I'll leave that to your discretion, but I think it might help clarify things for readers. 🙂
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Follow-up for #2845. Because we now allow a fallback mechanism for the object storage data path, it is safe to remove this property when doing a table snapshot. The destination table will use its default location as the data path, and users can configure a new path later if necessary.
@kbendick @aokolnychyi
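To make the fallback in the description concrete, here is a rough sketch of how a data path might be resolved once the object storage path has been removed. The precedence shown (object storage path, then folder storage path, then a `data` folder under the table location), the key strings, and the class and method names are all assumptions for illustration; the actual behavior is whatever #2845 implemented.

```java
import java.util.Map;

class DataPathFallbackSketch {
  // Assumed property keys; see TableProperties for the real constants.
  static final String OBJECT_STORE_PATH = "write.object-storage.path";
  static final String WRITE_NEW_DATA_LOCATION = "write.folder-storage.path";

  // Hypothetical resolver: picks the first configured location, otherwise
  // falls back to a default data folder under the table's own location.
  static String resolveDataPath(String tableLocation, Map<String, String> properties) {
    String path = properties.get(OBJECT_STORE_PATH);
    if (path == null) {
      path = properties.get(WRITE_NEW_DATA_LOCATION);
    }
    if (path == null) {
      // Strip a single trailing slash so the result has no "//".
      String base = tableLocation.endsWith("/")
          ? tableLocation.substring(0, tableLocation.length() - 1)
          : tableLocation;
      path = base + "/data";
    }
    return path;
  }
}
```

Under these assumptions, a snapshot table whose location properties were stripped resolves to a path under its own location, and setting either property later redirects new data files.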