Core: View metadata implementation #7759
@nastra Is it possible to abstract a common interface for table and view because they have very similar methods?
amogh-jahagirdar
left a comment
Overall this looks great to me. There are just some minor remaining comments, and then it should be good to merge. Thanks a ton for carrying this forward @nastra! I'll reply on the original PR and we can probably close that one.
core/src/test/resources/org.apache.iceberg.view/ViewMetadataInvalidCurrentSchema.json
@nastra Thanks for your response, it indeed makes the code more complex.
  return versionsById().get(versionId);
}

@Value.Lazy
I think this should be Derived rather than Lazy. Lazy makes it more complicated and there's no real benefit to not just calculating this at construction.
Generally I agree with you that @Value.Derived is the better option.
However, given that we're deriving from a collection, it becomes difficult or impossible to validate the collection before the eager computation of the derived field kicks in.
See also my comment in #7759 (comment) about default collection behavior and why it's better to have @Value.Lazy here.
In that case, does it make sense to make this Lazy at all? They're not lazy or derived in TableMetadata:
public Schema schema() {
  return schemasById.get(currentSchemaId);
}
I'm OK with removing @Value.Lazy here, since the only thing it buys us is that it doesn't re-compute the value from the underlying hash map, which is O(1) anyway.
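To make the outcome of this thread concrete, here is a minimal plain-Java sketch of the non-lazy, non-derived accessor style quoted from TableMetadata. It is not the Iceberg or Immutables code: the class name SchemaHolder is hypothetical, and String stands in for the real Schema type.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata class; String replaces the real Schema type.
final class SchemaHolder {
  private final Map<Integer, String> schemasById;
  private final int currentSchemaId;

  SchemaHolder(Map<Integer, String> schemasById, int currentSchemaId) {
    // Defensive copy so later changes to the caller's map do not leak in.
    this.schemasById = new HashMap<>(schemasById);
    this.currentSchemaId = currentSchemaId;
  }

  // No caching and no derived field: each call is a single O(1) hash-map lookup.
  String schema() {
    return schemasById.get(currentSchemaId);
  }
}
```

Because nothing is computed at construction time, any validation of the map can run before the accessor is ever called, which is the property the Lazy variant was meant to preserve.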
    versions().subList(versions().size() - versionHistorySizeToKeep(), versions().size());
List<ViewHistoryEntry> history =
    history().subList(history().size() - versionHistorySizeToKeep(), history().size());
return ImmutableViewMetadata.builder().from(this).versions(versions).history(history).build();
I think this is confusing. The builder should enforce the maximum number of versions, not the check method. It's awkward that this returns a copy of the view metadata.
This seems to be the "official" way of normalizing an object: http://immutables.github.io/immutable.html#normalization
I'm not sure it's worth adding our own builder class for this particular case.
This is strange to me and I'm not a big fan of having a checkAndNormalize in the interface, but okay I guess? It doesn't seem concerning enough to block using Immutables but it is annoying that the pattern requires exposing additional methods that don't have a clear contract (like validate).
I agree here completely.
It would be great if there could be a better & shorter alternative to achieve this.
The (longer) alternative would be to have our own builder that internally uses ImmutableViewMetadata.builder() and does the "normalization" but that doesn't prevent anyone from using ImmutableViewMetadata.builder() directly
Yeah, but if we're using our own builder then what's the point of immutables?
For now, I think we should just get this in to unblock the next PR. We should follow up with the API that adds a new view representation and version. If that goes smoothly then it will be fine. If it doesn't fit then we can remove immutables and go with a direct implementation.
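The trimming step debated in this thread can be sketched in isolation. The helper below is hypothetical (VersionHistoryTrimmer and keepLast are not Iceberg or Immutables names) and only illustrates keeping the trailing entries of a list via subList, assuming entries are appended in chronological order. The Math.max guard is an addition of this sketch; the excerpt above does not show how the PR handles a list shorter than the limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Iceberg or Immutables.
final class VersionHistoryTrimmer {
  private VersionHistoryTrimmer() {}

  // Returns a copy holding at most `limit` trailing entries, i.e. the newest
  // ones if entries were appended in chronological order.
  static <T> List<T> keepLast(List<T> entries, int limit) {
    if (limit <= 0) {
      throw new IllegalArgumentException("limit must be positive: " + limit);
    }
    // Guard against lists shorter than the limit so subList never sees a negative index.
    int from = Math.max(0, entries.size() - limit);
    return new ArrayList<>(entries.subList(from, entries.size()));
  }
}
```

A dedicated builder could call such a helper before delegating to the generated builder, which is the longer alternative mentioned above; it would still not stop callers from using the generated builder directly.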
assertThatThrownBy(() -> ImmutableViewMetadata.builder().build())
    .isInstanceOf(IllegalStateException.class)
    .hasMessage(
        "Cannot build ViewMetadata, some of required attributes are not set [formatVersion, location, currentSchemaId, currentVersionId]");
Aren't schemas and versions required to be non-null and non-empty?
Collections are empty by default if not set, and they are typically checked in a @Value.Check method. See also immutables/immutables#1429 (comment) for some historical reasons why collections are empty by default if not set.
The other alternative would be to use the Java Bean Validation API, which would allow expressing validations on collections via annotations, but I don't think we'd want to do that here.
That's the main reason why I'd like to keep currentVersion() and schema() as @Value.Lazy, so that we don't have misleading error messages.
For example, if currentVersionId() is set but versions() is empty, we would see java.lang.NullPointerException: currentVersion (because currentVersion cannot be derived from versions() and thus ends up being null) rather than java.lang.IllegalArgumentException: Cannot find current version 23 in view versions: [1]
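The error-message concern can be illustrated without Immutables. The sketch below is a plain-Java stand-in (CurrentVersionCheck is a hypothetical name, and integer ids replace the real view version objects): it validates the collection explicitly, so a missing id produces the descriptive IllegalArgumentException quoted above instead of a null that later surfaces as a NullPointerException.

```java
import java.util.List;

// Hypothetical stand-in; integer ids replace the real view version objects.
final class CurrentVersionCheck {
  private CurrentVersionCheck() {}

  // Validate the collection first so that a missing id yields a descriptive
  // error rather than a null value that fails later with a NullPointerException.
  static int requireCurrentVersion(List<Integer> versionIds, int currentVersionId) {
    if (!versionIds.contains(currentVersionId)) {
      throw new IllegalArgumentException(
          String.format(
              "Cannot find current version %d in view versions: %s",
              currentVersionId, versionIds));
    }
    return currentVersionId;
  }
}
```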
core/src/test/java/org/apache/iceberg/view/TestViewMetadata.java
69d38ba to 29dd333
        properties(),
        ViewProperties.VERSION_HISTORY_SIZE,
        ViewProperties.VERSION_HISTORY_SIZE_DEFAULT);
Preconditions.checkArgument(
If this validation fails, won't it cause the view to be broken and unfixable? We won't be able to construct the ViewMetadata so there would be no way to fix this if another implementation sets it to -1 or something for unlimited history. We would either need to document this requirement in the spec or fail more gracefully.
Yes, that is true: one wouldn't be able to construct the ViewMetadata object, because it's technically failing validation.
For keeping unlimited history I think we have the following options:
- allow -1 to indicate unlimited (not sure if we have any other properties that carry such semantics).
- let users use a large value, with Integer.MAX_VALUE being the maximum. This seems to be more appropriate IMO, since the view spec mentions "The number of versions to retain is controlled by the table property: version.history.num-entries." However, I can also see the argument of "which value is large enough?".
Whatever we think carries the right semantics to indicate unlimited history should go into the validation logic + the spec.
I think I would rather not fail construction because of this. If the value is invalid, then let's just ignore it, warn, and not change the versions list.
Seems reasonable. I've updated this to issue a WARN and not modify the history.
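A rough sketch of the "ignore it, warn, and leave the versions untouched" behavior agreed on here, under stated assumptions: LenientHistoryLimit is a hypothetical name, java.util.logging is used only to keep the example free of external dependencies, and treating zero or negative values as invalid is this sketch's guess, since the thread does not spell out the exact check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper; java.util.logging keeps the sketch stdlib-only.
final class LenientHistoryLimit {
  private static final Logger LOG = Logger.getLogger(LenientHistoryLimit.class.getName());

  private LenientHistoryLimit() {}

  // Applies the history limit when it is usable; otherwise warns and returns
  // the versions unchanged instead of failing construction.
  static <T> List<T> apply(List<T> versions, int historySize) {
    if (historySize <= 0) {
      // Assumption: non-positive values are the invalid case.
      LOG.warning("Ignoring invalid version history size: " + historySize);
      return versions;
    }
    if (versions.size() <= historySize) {
      return versions;
    }
    return new ArrayList<>(versions.subList(versions.size() - historySize, versions.size()));
  }
}
```

With this shape, metadata written by another implementation with a value such as -1 can still be loaded and then corrected, rather than being rejected outright.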
rdblue
left a comment
This looks about ready, although there are some minor comments.
Co-authored-by: John Zhuge <[email protected]> Co-authored-by: Eduard Tudenhoefner <[email protected]>
Thanks, @nastra! And thanks to @amogh-jahagirdar and @jzhuge for major parts of this as well!
Co-authored-by: Amogh Jahagirdar <[email protected]> Co-authored-by: John Zhuge <[email protected]> (cherry picked from commit f2b01f8)
This is a continuation PR of #6559 and I've addressed all remaining comments.
@stevenzwu @jackye1995 @rdblue since you guys were reviewing the original PR, I've added you here as well. Please take a look when you can.