Skip to content

Conversation

@kbendick
Copy link
Contributor

Motivation

As part of the REST Catalog, and for interoperability with other systems, we need to be able to accept a TableMetadata instance as JSON from an external source.

Currently, TableMetadata takes in FileIO as an argument, which it pushes down to BaseSnapshot (the main implementation of Snapshot).

File IO is used lazily in BaseSnapshot, only when we need to fetch added / deleted files or any kind of manifest from the snapshot. The call for these changes is done once, and then the results are cached.

Changes

Because we cannot be sure of which FileIO is the appropriate one to use when parsing TableMetadata, and even if we were sure, it would be very difficult to pass it to the TableMetadataDeserializer, this PR deprecates the stored FileIO state from BaseSnapshot.

With these changes, users will need to call addedFiles / deletedFiles / allManifests / dataManifests / deleteManifests on Snapshot with the desired FileIO implementation.

This is relatively easy to do, as the correct FileIO tends to be readily available where these functions are called. This also makes it much easier for session based use cases, which might have different FileIO with different levels of access, within the same program, such as multi-tenant Trino-style programs.

Remaining

This PR only updates latest Spark and Flink and Hive versions. We'll need to backport for earlier versions, which can either be done in this PR or in another PR. But for the sake of reviewers, I've left it to just the latest versions of all.

cc @rdblue

Comment on lines +92 to +109
BaseSnapshot(long sequenceNumber,
long snapshotId,
Long parentId,
long timestampMillis,
String operation,
Map<String, String> summary,
Integer schemaId,
String manifestList) {
this.io = null;
this.sequenceNumber = sequenceNumber;
this.snapshotId = snapshotId;
this.parentId = parentId;
this.timestampMillis = timestampMillis;
this.operation = operation;
this.summary = summary;
this.schemaId = schemaId;
this.manifestListLocation = manifestList;
}
Copy link
Contributor Author

@kbendick kbendick May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor is not presently used. In most situations, it's ok to simply pass null as the FileIO for now.

I didn't update the specific constructor used in most places in an attempt to make this easier to cherry-pick. If we'd like, I can go ahead and change the constructor used in most cases.

@kbendick kbendick force-pushed the kb-deprecate-fileio-state-in-basesnapshot branch from c4bede0 to 1e0449c Compare May 25, 2022 22:26
Copy link
Contributor Author

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @nastra as TestNessieTable has been touched. Additionally, I believe you were previously working around this problem (not having FileIO on hand when parsing TableMetadata from external JSON).

Hopefully this can help simplify some changes for you in the medium to long term 🙂

@kbendick
Copy link
Contributor Author

Also cc @danielcweeks

* Return all {@link ManifestFile} instances for either data or delete manifests in this snapshot.
*
* @return a list of ManifestFile
* @deprecated since 1.0.0 - Use {@link Snapshot#allManifests(FileIO)} instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of saying when it was deprecated, I think we should say when it will be removed. I'm thinking that will probably be 1.0.0, assuming that we do an 0.14.0 and then follow up with a 1.0.0 to remove deprecations, but open to other opinions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that's a good idea. I believe Yufei did something similar (noting when it would be deprecated) when deprecating the old metadata write directory configurations.

I also think that this gives a little bit more of an argument for a 0.14.0. We can go through, determine what needs to be deprecated, cut a release, and then turn around and cut a 1.0.0 release that from then on has stronger API / ABI guarantees.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied existing language and mentioned it's deprecated since 0.14.0 and will be deprecated in 1.0.0.

I found a deprecation notice saying something would be removed in 0.15.0, so I think some ambiguity about 0.14.0 vs 1.0.0 is fine for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I’ll make a note of the 0.15.0 thing and we’ll get that cleared up. Found it basically git grep-Ing and it’s very unlikely we’ll have a 0.15.0.

return schemaId;
}

private void cacheManifests(FileIO fileIO) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you replace the existing cacheManifests method with this, rather than duplicating the implementation?

Copy link
Contributor Author

@kbendick kbendick May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use one cacheManifests function (and same for cacheChanges) - passing in the class level io for the existing (deprecated) functions that call it.

The only changes are:

  • Using the passed in argument
  • Throwing IllegalArgumentException instead of IllegalStateException.

@rdblue rdblue changed the title Core - Let TableMetadata be parsed from JSON without FileIO by deprecating FileIO in BaseSnapshot API: Pass FileIO into Snapshot methods that read metadata May 26, 2022
@rdblue
Copy link
Contributor

rdblue commented May 26, 2022

Looks great! Thanks @kbendick. It's great to have this cleaned up. I've been wanting to do it for a long time now.

@kbendick
Copy link
Contributor Author

Looks great! Thanks @kbendick. It's great to have this cleaned up. I've been wanting to do it for a long time now.

Thanks @rdblue. This makes a lot of sense to me as well, as the TableMetadata should be transmissible in a way that doesn’t immediately couple itself to the FileSystem (or choice of FileIO etc).

I know it seems like a large change but it’s really mostly just tests and the FileIO is already properly serialized for Kryo etc where need be. So the change isn’t nearly as large as it seems.

Question: Should I update tests and code for earlier Spark versions and Flink versions in this PR or in another? It would make code maintenance simpler to keep other versions in sync (either here or in another PR). But happy to slice the PRs however is deemed best.

@nastra
Copy link
Contributor

nastra commented May 26, 2022

@kbendick thanks, I'll review on monday but if anything comes up we could just do a follow-up, so no need to block on merging this.

@rdblue
Copy link
Contributor

rdblue commented May 26, 2022

Should I update tests and code for earlier Spark versions and Flink versions in this PR or in another?

I think in another PR would be best.

@rdblue rdblue merged commit 128d5a1 into apache:master May 26, 2022
@rdblue
Copy link
Contributor

rdblue commented May 26, 2022

Thanks, @kbendick! I merged this so we can move on to the follow-ups.

@kbendick
Copy link
Contributor Author

@kbendick thanks, I'll review on monday but if anything comes up we could just do a follow-up, so no need to block on merging this.

Cool. Let me know if there's anything in particular you'd like to go over or have any questions on as I believe this will benefit you as well.

@nastra
Copy link
Contributor

nastra commented May 30, 2022

retro +1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants