Core: View representation core implementation #6598

amogh-jahagirdar · 2023-01-15T21:52:20Z

View representation core implementation
Co-authored-by: John Zhuge [email protected]

amogh-jahagirdar · 2023-01-15T22:02:21Z

api/src/main/java/org/apache/iceberg/view/SQLViewRepresentation.java

+  /** The query output schema id at version create time */
+  int schemaId();


Since we have the flexibility to change this API until the core implementation is complete, I changed the API to return schemaId instead of schema. This mostly simplifies the parsing logic, and a caller can still easily obtain the schema from the top-level view schema mapping.

Currently in the view spec, the schema ID is stored per SQL view representation. So in the current implementation building the representation when parsing is straightforward.

There are a few options here:

1.) Maintain schema ID in metadata and the API just returns schema ID as well (keeping parsing logic simple). That's what's done in this PR and I think is preferable since looking up schema via view.schemas().get(schemaID) should be straightforward.

2.) Preserve the schema at the API level, maintain schema ID in metadata. This complicates parsing logic (although not too much) because during parsing we need to pass the top-level schemas list (see https://iceberg.apache.org/view-spec/#view-metadata) to SQLViewRepresentationParser and then obtain the schema based on the parsed schema ID.

3.) Update the spec so that the entire schema object is stored in metadata and then serialize/deserialize.
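As a rough illustration of option 1, the sketch below keeps only the schema ID on the representation and resolves the schema through the view's top-level mapping. All types here (`SimpleSchema`, `SqlRepresentation`, `SimpleView`) are hypothetical stand-ins written for this example, not the Iceberg classes, and the `schema(int)` lookup is assumed rather than taken from the actual API.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaIdLookupSketch {

  /** Hypothetical stand-in for an Iceberg schema. */
  static final class SimpleSchema {
    final int id;
    final String columns;

    SimpleSchema(int id, String columns) {
      this.id = id;
      this.columns = columns;
    }
  }

  /** Hypothetical SQL representation that stores only the schema ID (option 1). */
  static final class SqlRepresentation {
    final String sql;
    final int schemaId;

    SqlRepresentation(String sql, int schemaId) {
      this.sql = sql;
      this.schemaId = schemaId;
    }
  }

  /** Hypothetical view holding the top-level schema mapping, keyed by schema ID. */
  static final class SimpleView {
    private final Map<Integer, SimpleSchema> schemas = new HashMap<>();

    void addSchema(SimpleSchema schema) {
      schemas.put(schema.id, schema);
    }

    /** Returns the schema registered under the ID, or null if the ID is unknown. */
    SimpleSchema schema(int schemaId) {
      return schemas.get(schemaId);
    }
  }

  public static void main(String[] args) {
    SimpleView view = new SimpleView();
    view.addSchema(new SimpleSchema(1, "id: long"));
    view.addSchema(new SimpleSchema(2, "id: long, name: string"));

    // The parser only has to read an integer; no schema list is needed at parse time.
    SqlRepresentation repr = new SqlRepresentation("SELECT id, name FROM t", 2);

    // The caller resolves the schema later, through the view.
    SimpleSchema schema = view.schema(repr.schemaId);
    System.out.println(schema.columns);
  }
}
```

Under option 2 the same lookup would have to happen inside the parser, which is why the schemas list would need to be passed into it.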

Another topic (independent of which option we choose) is that the spec currently marks schemaID as optional for the SQL view representation. I think it must be required: I can't think of a case where we would not want to maintain a well-defined Iceberg schema for a SQL representation, and engines can still choose to ignore it if they want to. I can raise a PR to update that if we think it's the right approach.

cc: @jzhuge @rdblue @jackye1995 @nastra

Let me know your thoughts on which approach above you find preferable and if we agree schema should be updated to be required for SQL view representation in the spec!

I think there are still ways to use Schema, and ideally that is preferred from an API perspective. For the parser, similar to the PartitionSpec parser, I would imagine the parser taking the schemas and using them. Is the same strategy achievable here?

Thanks @jackye1995 yeah that's achievable, that's what I meant in approach 2 above. It does complicate parsing more than I'd like but it's very doable. I'll update the PR and the community can take a look and we can see which one we prefer more.

After implementing it, while it's doable, it does seem to complicate things more than I thought. I'll see about simplifying it, but we may just want to go back to surfacing schema IDs at the data model level, or even serializing the entire schema, to keep parsing really simple. With just the schema ID, engines can select their representation and then call view.schema(repr.schemaId()).

I see. After taking a deeper look, there are multiple places that we have schemaId() as a part of the API model, such as in Snapshot.

I was using the case of PartitionSpec to argue that we should stick with schema(), but that case works because there is a process of binding to a schema, and the serialized version of a partition spec is unbound.

So the question here is not which one makes the parser more convenient to implement, but whether a view representation needs to bind to a schema at runtime or only has a static schema. I think the answer is that it is static, as described by "ID of the view’s schema when the version was created". So from that perspective, I agree schemaId() seems like a better choice, so we don't need to cross-reference existing schemas to make the parser work.

Any thoughts? @amogh-jahagirdar @nastra

@jackye1995 exactly my thoughts as well. A schema is known at the time the SQL representation is created, so late binding like we do with PartitionSpec doesn't apply, or would needlessly complicate the logic of constructing the representation. Will hold for @nastra's and @rdblue's thoughts as well.

After reading through the discussion I think just having schemaId makes a lot of sense to me

Updated to use Integer schemaId(). Since schemaId may not be defined at creation time, as discussed in https://github.com/apache/iceberg/pull/6611/files, schemaId() can return null.
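For callers, a boxed `Integer schemaId()` means the lookup has to tolerate a missing ID. The following is a minimal sketch of one way a caller might handle that. The `SqlRepresentation` interface, the `of` factory, and the `schemaFor` helper are hypothetical names introduced here for illustration only, and schemas are represented as plain strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class NullableSchemaIdSketch {

  /** Hypothetical representation whose schema ID may be null. */
  interface SqlRepresentation {
    String sql();

    Integer schemaId();
  }

  /** Hypothetical factory for the example; not part of any real API. */
  static SqlRepresentation of(String sql, Integer schemaId) {
    return new SqlRepresentation() {
      @Override
      public String sql() {
        return sql;
      }

      @Override
      public Integer schemaId() {
        return schemaId;
      }
    };
  }

  /** Resolves the schema if the representation carries an ID and the ID is known. */
  static Optional<String> schemaFor(Map<Integer, String> schemas, SqlRepresentation repr) {
    Integer id = repr.schemaId();
    if (id == null) {
      return Optional.empty();
    }
    return Optional.ofNullable(schemas.get(id));
  }

  public static void main(String[] args) {
    Map<Integer, String> schemas = new HashMap<>();
    schemas.put(1, "id: long");

    System.out.println(schemaFor(schemas, of("SELECT id FROM t", 1)).isPresent());
    System.out.println(schemaFor(schemas, of("SELECT id FROM t", null)).isPresent());
  }
}
```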


rdblue · 2023-01-18T00:36:38Z

api/src/main/java/org/apache/iceberg/view/SQLViewRepresentation.java

 */
 package org.apache.iceberg.view;

+import edu.umd.cs.findbugs.annotations.Nullable;


Can you remove this? We don't use nullable annotations.

since SQLViewRepresentation is an Immutable, this indicates to the builder that certain fields are not required and can indeed be null when an instance is constructed

It seems like we are using Nullable in a few places already, for example https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/metrics/CommitMetricsResult.java#L54 . I'm good to revert this but that would imply not using Immutables as well here (unless there's another acceptable way to indicate to Immutable values that a field can be null).

Nullable annotations have been used in puffin and metrics, but I do think we should standardize on javax.annotation.Nullable as opposed to this findbugs version if we are going to be explicit about nullability.

Makes sense. There are still a few other places in the Iceberg code base that use the umd.cs annotation instead of javax. It makes sense to use the javax one; it looks like the umd.cs one is deprecated? I'll create a tracking issue so that we can update the remaining parts of the code to use javax.
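The role the nullable marker plays for a builder can be sketched by hand. The code below is a plain-Java approximation of the required-versus-optional distinction discussed in this thread. It is not Immutables-generated code, and the `Representation` and `Builder` classes are hypothetical: `sql` is treated as required, while `schemaId` is treated as the field that would carry a nullable annotation.

```java
public class NullableBuilderSketch {

  /** Hypothetical value class: sql is required, schemaId may be null. */
  static final class Representation {
    final String sql;
    final Integer schemaId;

    private Representation(String sql, Integer schemaId) {
      this.sql = sql;
      this.schemaId = schemaId;
    }
  }

  /** Hand-written builder that rejects a missing required field but allows a null optional one. */
  static final class Builder {
    private String sql;
    private Integer schemaId;

    Builder sql(String value) {
      this.sql = value;
      return this;
    }

    Builder schemaId(Integer value) {
      this.schemaId = value;
      return this;
    }

    Representation build() {
      if (sql == null) {
        // Required attribute: refuse to construct an incomplete instance.
        throw new IllegalStateException("Cannot build Representation, required attribute not set: sql");
      }
      // Optional attribute: null is passed through unchanged.
      return new Representation(sql, schemaId);
    }
  }

  public static void main(String[] args) {
    Representation repr = new Builder().sql("SELECT 1").build();
    System.out.println(repr.schemaId == null);
  }
}
```

Without some marker of this kind, a generated builder has no way to tell which unset fields are acceptable, which is the trade-off raised above.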

rdblue · 2023-01-18T00:37:09Z

api/src/main/java/org/apache/iceberg/view/ViewRepresentation.java

-    public String typeName() {
-      return name().toLowerCase(Locale.ENGLISH);
-    }
+    public static final String SQL = "sql";


Why change this from an enum to a String?

I made the suggestion since I see some other enum like classes are also implemented directly as strings, such as DataOperations, and it seems to simplify the code a bit. But if there is a specific reason for using enum we can stick to that.

Yeah it seems like an established pattern elsewhere just to use a constant string and it simplifies the parsing logic a bit but I'm happy to revert back to enum if there's other advantages. Let me know your thoughts @rdblue @jzhuge

I think it depends on whether we want to use this enum in switch statements in our code or if we want to extend it. For example, we could have a reference to the parser in the enum so we look up the symbol and then call something like ViewRepresentation.SQL.parse(jsonNode).

Since we only have the parser selection right now, it doesn't seem like it matters much.
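The enum-with-parser idea can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the `Type` enum, `fromTypeName`, and the string-based `parse` are hypothetical, and a real version would presumably accept a Jackson JsonNode rather than a String payload.

```java
import java.util.Locale;
import java.util.function.Function;

public class RepresentationTypeSketch {

  /** Hypothetical representation type that carries a reference to its own parser. */
  enum Type {
    SQL(payload -> "sql representation: " + payload);

    private final Function<String, String> parser;

    Type(Function<String, String> parser) {
      this.parser = parser;
    }

    /** Lower-case name as it would appear in serialized metadata. */
    String typeName() {
      return name().toLowerCase(Locale.ENGLISH);
    }

    /** Delegates to the parser attached to this constant. */
    String parse(String payload) {
      return parser.apply(payload);
    }

    /** Looks up the constant for a serialized type name such as "sql". */
    static Type fromTypeName(String typeName) {
      return valueOf(typeName.toUpperCase(Locale.ENGLISH));
    }
  }

  public static void main(String[] args) {
    Type type = Type.fromTypeName("sql");
    System.out.println(type.parse("SELECT 1"));
  }
}
```

With a plain String constant, the dispatch instead lives in the caller, typically as an if or switch on the type string. An unknown type name makes `valueOf` throw IllegalArgumentException, so handling unrecognized types would need an explicit fallback.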

jackye1995

looks like we still need to settle the debates of:

use schemaId() instead of schema()
use string or enum for representation type
use nullable or not

Once those points are addressed I am good

amogh-jahagirdar · 2023-01-19T23:49:51Z

We probably want to establish a standard in the community at this point on Immutable/Nullable or not. Right now we're in this partial state, where it's used in some cases. Defining a standard can help focus the discussion on fundamental areas.

I do think Immutables are really nice for keeping boilerplate code to a minimum, but I don't have a strong opinion other than wanting a standard practice in Iceberg :)

Maybe it's worth kicking off a discussion on the mailing list just to get the wider community's perspective?

nastra

mostly LGTM, just some small nits


yyanyy

LGTM, minor comment

yyanyy · 2023-01-20T18:56:24Z

core/src/main/java/org/apache/iceberg/view/SQLViewRepresentationParser.java

+          DEFAULT_NAMESPACE, Arrays.asList(view.defaultNamespace().levels()), generator);
+    }
+
+    if (view.fieldAliases() != null && !view.fieldAliases().isEmpty()) {


nit: if we decide to use nullable annotation, sounds like fieldAliases and fieldComments can be null since we do null check here; should we mark them as nullable in SQLViewRepresentation?

See #6598 (comment): there are pros and cons to allowing null at the data model level versus only in serialization. In the end we opted for an empty list at the data model layer, so a client doesn't have to do a null check. So one possible cleanup here is to drop the null check, since we have the guarantee at the API level!
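A minimal sketch of that choice follows: the data model normalizes a missing list to an empty one, so the serializer only needs an emptiness check. The `Representation` class, the `toJsonFragment` helper, and the "field-aliases" key are hypothetical and used for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class EmptyListDefaultSketch {

  /** Hypothetical data model that never exposes a null list. */
  static final class Representation {
    private final List<String> fieldAliases;

    Representation(List<String> fieldAliases) {
      // Normalize null to an empty list once, at construction time.
      this.fieldAliases =
          fieldAliases == null
              ? Collections.emptyList()
              : Collections.unmodifiableList(new ArrayList<>(fieldAliases));
    }

    List<String> fieldAliases() {
      return fieldAliases;
    }
  }

  /** Writes the aliases only when present; no null check is needed here. */
  static String toJsonFragment(Representation repr) {
    if (repr.fieldAliases().isEmpty()) {
      return "";
    }
    String items =
        repr.fieldAliases().stream()
            .map(alias -> "\"" + alias + "\"")
            .collect(Collectors.joining(","));
    return "\"field-aliases\":[" + items + "]";
  }

  public static void main(String[] args) {
    System.out.println(toJsonFragment(new Representation(null)).isEmpty());
  }
}
```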


nastra

just a few cleanups needed but LGTM otherwise


jackye1995

Thanks for the fixes, I don't have further comments, waiting for other comments to be resolved by @nastra and @rdblue

nastra

LGTM as well, thanks @amogh-jahagirdar!


danielcweeks · 2023-02-03T19:38:20Z

api/src/main/java/org/apache/iceberg/view/ViewRepresentation.java

 package org.apache.iceberg.view;

-import java.util.Locale;
+import org.immutables.value.Value;


Same here: javax.annotation.Nullable

ViewRepresentation currently has a single required "type" field, so nothing was marked as Nullable to begin with. Let me know if I'm missing something!


jackye1995 · 2023-02-08T18:34:22Z

@rdblue @danielcweeks the PR looks good to be merged, do you have any additional comment?

jackye1995 · 2023-02-10T00:10:26Z

Seems like we don't have much movement on the review at this point. Given that all current comments are addressed and this is one of many PRs for view catalog integration, I will go ahead and merge this. We can address any remaining comments in #6559. Thanks @amogh-jahagirdar and @jzhuge for the work, and thanks everyone for the review!

rdblue · 2023-02-12T22:38:34Z

versions.props

 org.apache.parquet:* = 1.12.3
 org.apache.pig:pig = 0.14.0
 com.fasterxml.jackson.*:* = 2.14.1
+com.google.code.findbugs:jsr305 = 3.0.2


@jackye1995 and @amogh-jahagirdar, this should be a banned dependency that is replaced by stephenc's reimplementation. This is a 1.2.0 release blocker.

rdblue

Mostly looks good, except for the banned dependency. Thanks for getting this in, and sorry that my review was late!

amogh-jahagirdar · 2023-02-12T23:08:11Z

No problem! Let me raise a PR to address the banned dependency, sorry about that


github-actions bot added API core labels Jan 15, 2023

amogh-jahagirdar commented Jan 15, 2023

amogh-jahagirdar force-pushed the view-representation-parser branch from 40e07af to 947a539 January 15, 2023 22:07

jackye1995 reviewed Jan 15, 2023

jackye1995 reviewed Jan 15, 2023

amogh-jahagirdar force-pushed the view-representation-parser branch from 947a539 to ff3e0fd January 16, 2023 01:20

amogh-jahagirdar commented Jan 16, 2023

amogh-jahagirdar commented Jan 16, 2023

amogh-jahagirdar force-pushed the view-representation-parser branch from ff3e0fd to 046c877 January 16, 2023 01:58

amogh-jahagirdar commented Jan 16, 2023

nastra reviewed Jan 16, 2023

amogh-jahagirdar force-pushed the view-representation-parser branch 3 times, most recently from e36d2d5 to 02a1938 January 16, 2023 16:36

nastra reviewed Jan 16, 2023

rdblue reviewed Jan 18, 2023

amogh-jahagirdar force-pushed the view-representation-parser branch from 02a1938 to 3ec7def January 19, 2023 18:29

github-actions bot added the docs label Jan 19, 2023

amogh-jahagirdar force-pushed the view-representation-parser branch 2 times, most recently from 9dcd73d to ac923f4 January 19, 2023 18:33

jackye1995 reviewed Jan 19, 2023

nastra reviewed Jan 20, 2023

yyanyy reviewed Jan 20, 2023

amogh-jahagirdar force-pushed the view-representation-parser branch from ac923f4 to b4979e6 January 23, 2023 18:03

rdblue reviewed Jan 23, 2023

amogh-jahagirdar force-pushed the view-representation-parser branch 2 times, most recently from 90d9b43 to 1c9d016 January 24, 2023 21:52

amogh-jahagirdar force-pushed the view-representation-parser branch from 1c9d016 to 069d6e0 January 24, 2023 21:56

nastra reviewed Jan 25, 2023

amogh-jahagirdar force-pushed the view-representation-parser branch 2 times, most recently from 8f04f41 to ff24c0e January 26, 2023 23:37

jackye1995 reviewed Jan 26, 2023

jackye1995 reviewed Jan 26, 2023

amogh-jahagirdar force-pushed the view-representation-parser branch from ff24c0e to 31d8f04 January 27, 2023 00:50

jackye1995 approved these changes Jan 27, 2023

nastra approved these changes Jan 27, 2023

yyanyy approved these changes Jan 31, 2023

danielcweeks self-requested a review February 2, 2023 21:41

danielcweeks reviewed Feb 3, 2023

amogh-jahagirdar force-pushed the view-representation-parser branch from 31d8f04 to 36162ea February 3, 2023 21:08

github-actions bot added the build label Feb 3, 2023

Commit 987c96b: Core: View representation core implementation

amogh-jahagirdar force-pushed the view-representation-parser branch from 36162ea to 987c96b February 3, 2023 21:09

amogh-jahagirdar requested a review from danielcweeks February 6, 2023 20:56

jackye1995 merged commit e5846a5 into apache:master Feb 10, 2023

rdblue reviewed Feb 12, 2023

amogh-jahagirdar mentioned this pull request Feb 13, 2023: API: Revert to using stephenc findbugs dependency for Nullable #6815 (Merged)

krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023: Core: View representation core implementation (apache#6598), commit 5ee7284

zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025: Core: View representation core implementation (apache#6598), commit eff1d8d (cherry picked from commit e5846a5)
