Core: View representation core implementation #6598
/** The query output schema id at version create time */
int schemaId();
Since we have the flexibility to change this API until the core implementation is complete, I changed the API to return schemaId instead of schema. This mostly simplifies the parsing logic, and it is still easy for a caller to obtain the schema from the top-level view schema mapping.
Currently in the view spec, the schema ID is stored per SQL view representation, so with the current implementation, building the representation during parsing is straightforward.
There are a few options here:
1.) Maintain the schema ID in metadata and have the API return the schema ID as well, which keeps the parsing logic simple. That's what this PR does, and I think it is preferable since looking up the schema via view.schemas().get(schemaID) should be straightforward.
2.) Preserve the schema at the API level and maintain the schema ID in metadata. This complicates the parsing logic (although not too much) because during parsing we need to pass the top-level schemas list (see https://iceberg.apache.org/view-spec/#view-metadata) to SQLViewRepresentationParser and then obtain the schema based on the parsed schema ID.
3.) Update the spec so that the entire schema object is stored in metadata, and then serialize/deserialize it.
Another topic, independent of which option we choose: the spec currently marks the schema ID as optional for the SQL view representation. I think it should be required. I can't think of a case where we would not want a SQL representation to maintain a well-defined Iceberg schema, and engines can still choose to ignore it if they want to. I can raise a PR to update the spec if we think that's the right approach.
cc: @jzhuge @rdblue @jackye1995 @nastra
Let me know which of the approaches above you prefer, and whether we agree that the schema ID should be required for the SQL view representation in the spec!
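To make option 1 concrete, here is a minimal sketch of how a caller might resolve a representation's schema from its ID. The `Schema` and `SqlRepr` classes and the `schemaFor` helper are hypothetical stand-ins written for illustration, not the Iceberg classes; only the idea of looking the ID up in a view-level schema mapping comes from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

class SchemaLookupSketch {
  // Hypothetical stand-in for a view-level schema; not the Iceberg Schema class.
  static final class Schema {
    final int id;
    final String columns;

    Schema(int id, String columns) {
      this.id = id;
      this.columns = columns;
    }
  }

  // Hypothetical SQL representation that carries only the schema ID (option 1).
  static final class SqlRepr {
    final String sql;
    final int schemaId;

    SqlRepr(String sql, int schemaId) {
      this.sql = sql;
      this.schemaId = schemaId;
    }
  }

  // Resolve the representation's ID against the view-level schema mapping.
  static Schema schemaFor(Map<Integer, Schema> viewSchemas, SqlRepr repr) {
    Schema schema = viewSchemas.get(repr.schemaId);
    if (schema == null) {
      throw new IllegalArgumentException("Unknown schema ID: " + repr.schemaId);
    }
    return schema;
  }

  public static void main(String[] args) {
    Map<Integer, Schema> viewSchemas = new HashMap<>();
    viewSchemas.put(1, new Schema(1, "id: long, name: string"));
    SqlRepr repr = new SqlRepr("SELECT id, name FROM t", 1);
    System.out.println(schemaFor(viewSchemas, repr).columns);
  }
}
```

With this shape the parser never needs the schema list; the cross-reference happens only when a caller actually wants the schema.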
I think there are still ways to use Schema, and that is preferable from an API perspective. For the parser, similar to the PartitionSpec parser, I would imagine it taking the schemas and using them. Is the same strategy achievable here?
Thanks @jackye1995, yeah, that's achievable; that's what I meant by approach 2 above. It does complicate parsing more than I'd like, but it's very doable. I'll update the PR so the community can take a look and we can see which one we prefer.
After implementing it: while it's doable, it complicates things more than I thought. I'll see about simplifying it, but we may just want to go back to surfacing schema IDs at the data model level, or even serialize the entire schema, to keep parsing really simple. With just the schema ID, engines can select their representation and then call view.schema(repr.schemaId()).
I see. After taking a deeper look, there are multiple places where we have schemaId() as part of the API model, such as in Snapshot.
I was using the case of PartitionSpec to argue that we should stick with schema(), but that works because there is a process of binding to a schema, and the serialized version of a partition spec is unbound.
So the question here is not which one makes the parser more convenient to implement, but whether a view representation needs to bind to a schema at runtime or only has a static schema. I think the answer is that it is static, as described by "ID of the view’s schema when the version was created". From that perspective, I agree schemaId() seems like the better choice, so we don't need to cross-reference existing schemas to make the parser work.
Any thoughts? @amogh-jahagirdar @nastra
@jackye1995 exactly my thoughts as well: a schema is known at the time the SQL representation is created, so late binding like we do with PartitionSpec doesn't apply and would needlessly complicate the logic of constructing the representation. Will hold for @nastra's and @rdblue's thoughts as well.
After reading through the discussion, just having schemaId makes a lot of sense to me.
Updated to use Integer schemaId(). Since the schema ID may not be defined at creation time, as discussed in https://github.com/apache/iceberg/pull/6611/files, schemaId() can return null.
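One consequence of returning a boxed `Integer` is that callers have to handle null before unboxing. The following is a rough sketch of one way to do that; the `SqlRepr` interface and the `of`, `schemaIdOf`, and `describe` helpers are hypothetical and only mirror the `Integer schemaId()` shape mentioned above.

```java
import java.util.Optional;

class NullableSchemaIdSketch {
  // Hypothetical interface mirroring the Integer schemaId() shape; not the Iceberg API.
  interface SqlRepr {
    String sql();

    Integer schemaId(); // may be null when no schema was assigned at creation time
  }

  // Wrap the nullable ID so callers cannot accidentally unbox null into an int.
  static Optional<Integer> schemaIdOf(SqlRepr repr) {
    return Optional.ofNullable(repr.schemaId());
  }

  static String describe(SqlRepr repr) {
    return schemaIdOf(repr).map(id -> "schema " + id).orElse("no schema assigned");
  }

  // Small factory for the sketch.
  static SqlRepr of(String sql, Integer schemaId) {
    return new SqlRepr() {
      @Override
      public String sql() {
        return sql;
      }

      @Override
      public Integer schemaId() {
        return schemaId;
      }
    };
  }

  public static void main(String[] args) {
    System.out.println(describe(of("SELECT 1", 3)));
    System.out.println(describe(of("SELECT 1", null)));
  }
}
```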
Force-pushed from 40e07af to 947a539.
api/src/main/java/org/apache/iceberg/view/SQLViewRepresentation.java (outdated, resolved)
api/src/main/java/org/apache/iceberg/view/ViewRepresentation.java (outdated, resolved)
Force-pushed from 947a539 to ff3e0fd.
core/src/test/java/org/apache/iceberg/view/TestViewRepresentationParser.java (outdated, resolved)
Force-pushed from ff3e0fd to 046c877.
core/src/main/java/org/apache/iceberg/view/ViewRepresentationParser.java (outdated, resolved)
core/src/test/java/org/apache/iceberg/view/TestViewRepresentationParser.java (outdated, resolved)
api/src/main/java/org/apache/iceberg/view/SQLViewRepresentation.java (outdated, resolved)
core/src/main/java/org/apache/iceberg/view/SQLViewRepresentationParser.java (outdated, resolved)
Force-pushed from e36d2d5 to 02a1938.
core/src/test/java/org/apache/iceberg/view/TestViewRepresentationParser.java (resolved)
 */
package org.apache.iceberg.view;

import edu.umd.cs.findbugs.annotations.Nullable;
Can you remove this? We don't use nullable annotations.
Since SQLViewRepresentation is an Immutable, this annotation indicates to the builder that certain fields are not required and can be null when an instance is constructed.
It seems like we are already using Nullable in a few places, for example https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/metrics/CommitMetricsResult.java#L54. I'm fine with reverting this, but that would imply not using Immutables here either (unless there's another acceptable way to indicate to Immutables that a field can be null).
Nullable annotations have been used in puffin and metrics, but I do think we should standardize on javax.annotation.Nullable as opposed to this findbugs version if we are going to be explicit about nullability.
Makes sense. There are still a few other places in the Iceberg code base that use the umd.cs annotation instead of javax. It makes sense to use the javax one; it looks like the umd.cs one is deprecated? I'll create a tracking issue so that we can update the remaining parts of the code to use javax.
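For readers following the Immutables point in this thread, the sketch below is a hand-written builder that imitates the distinction being discussed: a required attribute must be set and cannot be null, while an attribute treated as nullable may be left unset. This is not generated Immutables code and not the class from this PR; `Repr`, `Builder`, and the error message are hypothetical, and the behavior is written from my understanding that generated builders reject missing required attributes at `build()` time.

```java
import java.util.Objects;

class NullableBuilderSketch {
  // Hypothetical value class: sql is required, schemaId is allowed to be null.
  static final class Repr {
    final String sql;
    final Integer schemaId;

    private Repr(String sql, Integer schemaId) {
      this.sql = sql;
      this.schemaId = schemaId;
    }
  }

  static final class Builder {
    private String sql;
    private Integer schemaId;

    Builder sql(String value) {
      // Required attribute: null is rejected as soon as it is passed in.
      this.sql = Objects.requireNonNull(value, "sql");
      return this;
    }

    Builder schemaId(Integer value) {
      // Nullable attribute: null is accepted and simply stored.
      this.schemaId = value;
      return this;
    }

    Repr build() {
      // Required attributes must have been set before building.
      if (sql == null) {
        throw new IllegalStateException("Cannot build Repr, required attribute not set: sql");
      }
      return new Repr(sql, schemaId);
    }
  }

  public static void main(String[] args) {
    Repr repr = new Builder().sql("SELECT 1").build();
    System.out.println("schemaId is null: " + (repr.schemaId == null));
  }
}
```

Without some marker on `schemaId`, a builder of this kind would have to treat it like `sql`, which is why dropping the annotation and keeping Immutables are hard to combine.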
public String typeName() {
  return name().toLowerCase(Locale.ENGLISH);
}
public static final String SQL = "sql";
Why change this from an enum to a String?
I made the suggestion because I see that some other enum-like classes, such as DataOperations, are also implemented directly as strings, and it seems to simplify the code a bit. But if there is a specific reason for using an enum, we can stick with that.
I think it depends on whether we want to use this enum in switch statements in our code or if we want to extend it. For example, we could have a reference to the parser in the enum so we look up the symbol and then call something like ViewRepresentation.SQL.parse(jsonNode).
Since we only have the parser selection right now, it doesn't seem like it matters much.
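As a rough illustration of the enum variant described above, each constant could hold a reference to its parser so that lookup and parsing go through the enum. Everything here is a hypothetical sketch: the `Type` enum, `fromTypeName`, and the `String` payload (standing in for a JSON node) are assumptions, and only `typeName()` follows the shape shown in the diff.

```java
import java.util.Locale;
import java.util.function.Function;

class RepresentationTypeSketch {
  // Hypothetical parsed result; a real parser would read a JSON node, not a String.
  static final class SqlRepr {
    final String sql;

    SqlRepr(String sql) {
      this.sql = sql;
    }
  }

  enum Type {
    SQL(SqlRepr::new);

    private final Function<String, SqlRepr> parser;

    Type(Function<String, SqlRepr> parser) {
      this.parser = parser;
    }

    // Same shape as the typeName() in the diff: lower-cased constant name.
    String typeName() {
      return name().toLowerCase(Locale.ENGLISH);
    }

    // Delegate to the parser attached to this constant.
    SqlRepr parse(String payload) {
      return parser.apply(payload);
    }

    // Look up the constant from the serialized type name; unknown names throw
    // IllegalArgumentException from valueOf.
    static Type fromTypeName(String typeName) {
      return valueOf(typeName.toUpperCase(Locale.ENGLISH));
    }
  }

  public static void main(String[] args) {
    Type type = Type.fromTypeName("sql");
    System.out.println(type.typeName() + " -> " + type.parse("SELECT 1").sql);
  }
}
```

The string-constant approach avoids this machinery, which is reasonable while parser selection is the only use.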
Force-pushed from 02a1938 to 3ec7def.
Force-pushed from 9dcd73d to ac923f4.
jackye1995
left a comment
Looks like we still need to settle the debates of:
- use schemaId() instead of schema()
- use string or enum for representation type
- use nullable or not
Once those points are addressed I am good.
We probably want to establish a standard in the community at this point on whether or not to use Immutables/Nullable. Right now we're in a partial state, where it's used in some cases. Defining a standard can help focus the discussion on fundamental areas. I do think Immutables is really nice for keeping boilerplate code to a minimum, but I don't have a strong opinion other than wanting a standard practice in Iceberg :) Maybe it's worth kicking off a thread on the mailing list to get the wider community's perspective?
nastra
left a comment
mostly LGTM, just some small nits
core/src/main/java/org/apache/iceberg/view/SQLViewRepresentationParser.java (resolved)
core/src/main/java/org/apache/iceberg/view/ViewRepresentationParser.java (outdated, resolved)
yyanyy
left a comment
LGTM, minor comment
DEFAULT_NAMESPACE, Arrays.asList(view.defaultNamespace().levels()), generator);
}

if (view.fieldAliases() != null && !view.fieldAliases().isEmpty()) {
nit: if we decide to use the nullable annotation, it sounds like fieldAliases and fieldComments can be null since we do a null check here; should we mark them as nullable in SQLViewRepresentation?
See #6598 (comment); there are pros and cons to allowing null at the data model level versus only in serialization. In the end we opted for an empty list at the data model layer, so a client doesn't have to do a null check. So one possible cleanup here is to drop the null check, since we have the guarantee at the API level!
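A small sketch of that cleanup, under stated assumptions: if the data model normalizes a missing list to an empty one at construction time, the serializer only needs the emptiness check. The `Repr` class, the hand-rolled `toJson` (which does no escaping), and the `field-aliases` key are hypothetical and for illustration only; the real writer in this PR uses a JSON generator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FieldAliasesSketch {
  // Hypothetical data model: fieldAliases() never returns null.
  static final class Repr {
    private final List<String> fieldAliases;

    Repr(List<String> fieldAliases) {
      // Normalize null to an empty list once, so clients never need a null check.
      this.fieldAliases =
          fieldAliases == null
              ? Collections.emptyList()
              : Collections.unmodifiableList(new ArrayList<>(fieldAliases));
    }

    List<String> fieldAliases() {
      return fieldAliases;
    }
  }

  // Naive serializer for the sketch: the field is written only when non-empty.
  static String toJson(Repr repr) {
    StringBuilder json = new StringBuilder("{\"type\":\"sql\"");
    if (!repr.fieldAliases().isEmpty()) {
      json.append(",\"field-aliases\":[");
      String separator = "";
      for (String alias : repr.fieldAliases()) {
        json.append(separator).append('"').append(alias).append('"');
        separator = ",";
      }
      json.append("]");
    }
    return json.append("}").toString();
  }

  public static void main(String[] args) {
    System.out.println(toJson(new Repr(null)));
    System.out.println(toJson(new Repr(List.of("a", "b"))));
  }
}
```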
Force-pushed from ac923f4 to b4979e6.
core/src/main/java/org/apache/iceberg/view/ViewRepresentationParser.java (outdated, resolved)
Force-pushed from 90d9b43 to 1c9d016.
Force-pushed from 1c9d016 to 069d6e0.
nastra
left a comment
just a few cleanups needed but LGTM otherwise
core/src/main/java/org/apache/iceberg/view/SQLViewRepresentationParser.java (outdated, resolved)
core/src/main/java/org/apache/iceberg/view/ViewRepresentationParser.java (outdated, resolved)
core/src/main/java/org/apache/iceberg/view/UnknownViewRepresentation.java (outdated, resolved)
core/src/main/java/org/apache/iceberg/view/ViewRepresentationParser.java (outdated, resolved)
Force-pushed from 8f04f41 to ff24c0e.
core/src/main/java/org/apache/iceberg/view/ViewRepresentationParser.java (outdated, resolved)
core/src/main/java/org/apache/iceberg/view/ViewRepresentationParser.java (resolved)
Force-pushed from ff24c0e to 31d8f04.
jackye1995
left a comment
nastra
left a comment
LGTM as well, thanks @amogh-jahagirdar!
api/src/main/java/org/apache/iceberg/view/SQLViewRepresentation.java (outdated, resolved)
package org.apache.iceberg.view;

import java.util.Locale;
import org.immutables.value.Value;
Same here: javax.annotation.Nullable
ViewRepresentation currently has a single required "type" field, so it wasn't marked as Nullable to begin with. Let me know if I'm missing something!
Force-pushed from 31d8f04 to 36162ea.
Co-authored-by: John Zhuge <[email protected]>
Force-pushed from 36162ea to 987c96b.
@rdblue @danielcweeks the PR looks ready to be merged; do you have any additional comments?
Seems like we don't have much movement on the review at this point. Given that all current comments are addressed and this is one of many PRs for view catalog integration, I will go ahead and merge this. We can address any remaining comments in #6559. Thanks @amogh-jahagirdar and @jzhuge for the work, and thanks everyone for the review!
org.apache.parquet:* = 1.12.3
org.apache.pig:pig = 0.14.0
com.fasterxml.jackson.*:* = 2.14.1
com.google.code.findbugs:jsr305 = 3.0.2
@jackye1995 and @amogh-jahagirdar, this should be a banned dependency that is replaced by stephenc's reimplementation. This is a 1.2.0 release blocker.
rdblue
left a comment
Mostly looks good, except for the banned dependency. Thanks for getting this in, and sorry that my review was late!
No problem! Let me raise a PR to address the banned dependency, sorry about that.
Co-authored-by: John Zhuge <[email protected]> (cherry picked from commit e5846a5)
View representation core implementation
Co-authored-by: John Zhuge [email protected]