Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add schema to table location #134

Closed

Conversation

kmcclenn
Copy link
Collaborator

Summary

Issue Changed table locations to include the scheme.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Changed both the implementation of creating the table location (and dealing with its ramifications) and the tests that test the table location. Allows all table locations to include the schema (eg. hdfs://tmp/... instead of tmp/...). Furthermore, added this fix for the old constructTablePath method and the new allocateTableStorage method. Removed outdated overriding methods for Local and HDFS storage that don't include the scheme.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Changed tests to reflect the table locations now containing the schema.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

@kmcclenn kmcclenn changed the title Kai/add schema to table location Add schema to table location Jul 11, 2024
@kmcclenn kmcclenn closed this Jul 11, 2024
@kmcclenn kmcclenn reopened this Jul 11, 2024
Copy link
Collaborator

@HotSushi HotSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @kmcclenn I'm curious about the necessity of this change. There are some prereqs before we tackle #121 .

@@ -235,7 +235,8 @@ public void runRetention(
}

private Path getTrashPath(String path, String filePath, String trashDir) {
return new Path(filePath.replace(path, new Path(path, trashDir).toString()));
return new Path(
filePath.replace(new Path(path).toString(), new Path(path, trashDir).toString()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we need this change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the schema, the string path was file:///tmp/.... However, we were trying to replace it with something like file:/tmp/..., because the Path type only allows for single forward slashes. So, I converted it to a Path to standardize the format to file:/tmp/... so they would match.

log.info("Detected {} orphan files", orphanFiles.size());
for (String of : orphanFiles) {
List<String> orphanFilesWithoutMetadata =
orphanFiles.stream()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why metadata.json files filtered out?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that the tests were meant for locations without schema: See this slack message from Stas

@@ -60,10 +60,6 @@ public String allocateTableLocation(
String databaseId, String tableId, String tableUUID, String tableCreator) {
Preconditions.checkArgument(databaseId != null, "Database ID cannot be null");
Preconditions.checkArgument(tableId != null, "Table ID cannot be null");
Preconditions.checkArgument(tableUUID != null, "Table UUID cannot be null");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change seem regressive, its valid to ensure tableUUID is not null/ storageproperties is correctly configured

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was a bit confused here. I removed it because it was failing this test because the STAGED_TABLE_DTO did not have a UUID; there is probably a better solution here but I wasn't sure what was best

* @return the table location
*/
@Override
public String allocateTableLocation(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we ought to be careful with this. The reason we didn't make this change earlier was to ensure backward compatibilty. I would hold off checking this in until we keep the behavior compatible in li-openhouse. Could you make a change there to not call super.allocateTableStorage?

* href="https://github.com/linkedin/openhouse/issues/121">
*/
@Override
public String allocateTableLocation(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have to be careful with this change as well. tables-test-fixtures uses non-schemed paths. Please validate tables-test-fixtures continue to work after this change cc: @jiang95-dev

storageManager.getDefaultStorage().getClient().getRootPrefix(),
databaseID,
tableId + "-" + tableUUID);
String endpoint = storageManager.getDefaultStorage().getClient().getEndpoint();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to make this change yet. constructTablePath is only getting used in CTAS (and only for hdfs). non-hdfs storages need much more things to correctly support ctas: BDP-23723

@@ -144,8 +145,10 @@ private void validatePathOfProvidedRequest(
extractFromTblPropsIfExists(databaseId + "." + tableId, tableProperties, TBL_RAW_KEY);

java.nio.file.Path previousPath =
InternalRepositoryUtils.constructTablePath(
storageManager, dbIdFromProps, tblIdFromProps, tableUUIDProperty);
Paths.get(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets avoid making change to this class. CTAS support is not present for non-hdfs storages, and this needs to be solved thoughtfully: BDP-23723

see doesPathExist(..) UnsupportedOperartionException

@kmcclenn
Copy link
Collaborator Author

@HotSushi Thanks for the comments. I originally started the PR because the ADLS storage was failing without schema in the table location. I added changes for constructTablePath and all of the related tests before I realized that this was changed in main by the allocateTableLocation function. So it may not be needed anymore, especially if I will be breaking the backwards compatibility with the PR. Do you think I should just close it and move forward with allocateTableLocation for ADLS?

@HotSushi
Copy link
Collaborator

@HotSushi Thanks for the comments. I originally started the PR because the ADLS storage was failing without schema in the table location. I added changes for constructTablePath and all of the related tests before I realized that this was changed in main by the allocateTableLocation function. So it may not be needed anymore, especially if I will be breaking the backwards compatibility with the PR. Do you think I should just close it and move forward with allocateTableLocation for ADLS?

yes allocateTableLocation is the way to go about it. We can pursue this PR later once backward compatibility is ensured.

@jiang95-dev
Copy link
Collaborator

Thanks @kmcclenn. You can close this pr if allocateTableLocation solves the problem.
Thanks @HotSushi. I have some questions about the design and plan:

  1. Are we going to remove allocateTableLocation once the backward compatibility is ensured? I feel it is still needed to parse the location whether or not it is with scheme?
  2. What will be the state for the backward compatibility? Old tables storing non-schemed tableLocation, and new tables storing schemed tableLocation, and OH server can read/write both correctly?

@kmcclenn kmcclenn closed this Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants