Skip to content

Conversation

@amogh-jahagirdar
Copy link
Contributor

View representation core implementation
Co-authored-by: John Zhuge [email protected]

Comment on lines 48 to 49
/** The query output schema id at version create time */
int schemaId();
Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have the flexibility to change this API until core implementation is complete, I changed the API to schemaID instead of schema. This is mostly because it simplifies the parsing logic and then it's still easy for a caller to obtain the schema from the top level view schema mapping.

Currently in the view spec, the schema ID is stored per SQL view representation. So in the current implementation building the representation when parsing is straightforward.

There are a few options here:

1.) Maintain schema ID in metadata and the API just returns schema ID as well (keeping parsing logic simple). That's what's done in this PR and I think is preferable since looking up schema via view.schemas().get(schemaID) should be straightforward.

2.) Preserve the schema at the API level, maintain schema ID in metadata. This complicates parsing logic (although not too much) because during parsing we need to pass the top level schemas list to SQLViewRepresentationParser https://iceberg.apache.org/view-spec/#view-metadata and then obtain the schema based on the parsed schema ID.

3.) Update the spec so that the entire schema object is stored in metadata and then serialize/deserialize.

Another topic (independent of which option we choose) is the spec currently marks schemaID as optional for sql view representation. I think it must be required (I can't think of a case where for a SQL representation we don't want to maintain a well defined Iceberg schema, engines can still choose to ignore it if they want to although I can't really think of such a case). I can raise a PR to update that if we think it's the right approach.

cc: @jzhuge @rdblue @jackye1995 @nastra

Let me know your thoughts on which approach above you find preferable and if we agree schema should be updated to be required for SQL view representation in the spec!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are still ways to use Schema, and ideally that is preferred from API perspective. For parser, similar to PartitionSpec parser, I would imagine the parser to take the schemas and use that. Is the same strategy achieveable here?

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jackye1995 yeah that's achievable, that's what I meant in approach 2 above. It does complicate parsing more than I'd like but it's very doable. I'll update the PR and the community can take a look and we can see which one we prefer more.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After implementing it, while it's doable it does seem to complicate more than I thought. I'll see about simplifying it, but we may just want to go back to surfacing schema IDs at the data model level or even serializing the entire schema to keep parsing really simple. For just using schema ID, then engines can select their representation and then do view.schema(repr.schemaId())

Copy link
Contributor

@jackye1995 jackye1995 Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. After taking a deeper look, there are multiple places that we have schemaId() as a part of the API model, such as in Snapshot.

I was using the case of PartitionSpec to argue that we should stick with using schema(), but it was because there is a process of binding to a schema, and the serialized version of partition spec is unbounded.

So the question here is not which one is more convenient to implement parser, but does a view representation needs to bind to a schema at runtime, or only have a static schema. I think the answer is that it is static, as described as "ID of the view’s schema when the version was created". So from that perspective, I agree schemaId() seems like a better choice so we don't need to cross reference existing schemas to make the parser work.

Any thoughts? @amogh-jahagirdar @nastra

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackye1995 exactly my thoughts as well, a schema is known at the time of SQL representation creation so late binding like we do with PartitionSpec doesn't apply or would needlessly complicate the logic of constructing the representation. Will hold for @nastra and @rdblue thought as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading through the discussion I think just having schemaId makes a lot of sense to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use Integer schemaId(). Since schemaId may not be defined at creation time as discussed in https://github.com/apache/iceberg/pull/6611/files schemaId() can return null

@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch from 40e07af to 947a539 Compare January 15, 2023 22:07
@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch from 947a539 to ff3e0fd Compare January 16, 2023 01:20
@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch from ff3e0fd to 046c877 Compare January 16, 2023 01:58
@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch 3 times, most recently from e36d2d5 to 02a1938 Compare January 16, 2023 16:36
*/
package org.apache.iceberg.view;

import edu.umd.cs.findbugs.annotations.Nullable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove this? We don't use nullable annotations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since SQLViewRepresentation is an Immutable, this indicates to the builder that certain fields are not required and can indeed be null when an instance is constructed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we are using Nullable in a few places already, for example https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/metrics/CommitMetricsResult.java#L54 . I'm good to revert this but that would imply not using Immutables as well here (unless there's another acceptable way to indicate to Immutable values that a field can be null).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nullable annotations have been used in puffin and metrics, but I do think we should standardize on javax.annotation.Nullable as opposed to this findbugs version if we are going to be explicit about nullability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. There's still a few other places in the Iceberg code base which are using the umd.cs instead of javax. It make sense to use the javax one, it looks like the umd.cs one is deprecated? I'll create a tracking issue so that we can update to use javax in the remaining parts of the code

public String typeName() {
return name().toLowerCase(Locale.ENGLISH);
}
public static final String SQL = "sql";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this from an enum to a String?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the suggestion since I see some other enum like classes are also implemented directly as strings, such as DataOperations, and it seems to simplify the code a bit. But if there is a specific reason for using enum we can stick to that.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it seems like an established pattern elsewhere just to use a constant string and it simplifies the parsing logic a bit but I'm happy to revert back to enum if there's other advantages. Let me know your thoughts @rdblue @jzhuge

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends on whether we want to use this enum in switch statements in our code or if we want to extend it. For example, we could have a reference to the parser in the enum so we look up the symbol and then call something like ViewRepresentation.SQL.parse(jsonNode).

Since we only have the parser selection right now, it doesn't seem like it matters much.

@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch from 02a1938 to 3ec7def Compare January 19, 2023 18:29
@github-actions github-actions bot added the docs label Jan 19, 2023
@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch 2 times, most recently from 9dcd73d to ac923f4 Compare January 19, 2023 18:33
Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like we still need to settle the debates of:

  1. use schemaId() instead of schema()
  2. use string or enum for representation type
  3. use nullable or not

Once those points are addressed I am good

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Jan 19, 2023

We probably want to establish a standard in the community at this point on Immutable/Nullable or not. Right now we're in this partial state, where it's used in some cases. Defining a standard can help focus the discussion on fundamental areas.

I do think Immutables are really nice at keeping boiler plate code to a minimum but I don't have a strong opinion other than just setting a standard practice in Iceberg :)

Maybe it's worth kicking off on the mailing list just to get the wider community's perspective?

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly LGTM, just some small nits

Copy link
Contributor

@yyanyy yyanyy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor comment

DEFAULT_NAMESPACE, Arrays.asList(view.defaultNamespace().levels()), generator);
}

if (view.fieldAliases() != null && !view.fieldAliases().isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if we decide to use nullable annotation, sounds like fieldAliases and fieldComments can be null since we do null check here; should we mark them as nullable in SQLViewRepresentation?

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #6598 (comment) , there's pro's and cons for either leaving it be null at the data model level or just in serialization. In the end we just opted for having an empty list at the data model layer (and a client doesn't have to do a null check). So here one possible cleanup is not do the null check anyways since we have the guarantee at the API level!

@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch from ac923f4 to b4979e6 Compare January 23, 2023 18:03
@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch 2 times, most recently from 90d9b43 to 1c9d016 Compare January 24, 2023 21:52
@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch from 1c9d016 to 069d6e0 Compare January 24, 2023 21:56
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few cleanups needed but LGTM otherwise

@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch 2 times, most recently from 8f04f41 to ff24c0e Compare January 26, 2023 23:37
@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch from ff24c0e to 31d8f04 Compare January 27, 2023 00:50
Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes, I don't have further comments, waiting for other comments to be resolved by @nastra and @rdblue

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well, thanks @amogh-jahagirdar!

@danielcweeks danielcweeks self-requested a review February 2, 2023 21:41
package org.apache.iceberg.view;

import java.util.Locale;
import org.immutables.value.Value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: javax.annotation.Nullable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ViewRepresentations currently has a single required "type" field so it wasn't marked as Nullable to begin with. Let me know if I'm missing something!

@amogh-jahagirdar amogh-jahagirdar force-pushed the view-representation-parser branch from 31d8f04 to 36162ea Compare February 3, 2023 21:08
@github-actions github-actions bot added the build label Feb 3, 2023
@jackye1995
Copy link
Contributor

@rdblue @danielcweeks the PR looks good to be merged, do you have any additional comment?

@jackye1995
Copy link
Contributor

Seems like we don't have much movement on the review at this point, given the fact that all current comments are addressed and this is a part of many PRs for view catalog integration, I will go ahead to merge this. We can address any remaining comments in #6559 if any. Thanks @amogh-jahagirdar and @jzhuge for the work and thanks everyone for review!

@jackye1995 jackye1995 merged commit e5846a5 into apache:master Feb 10, 2023
org.apache.parquet:* = 1.12.3
org.apache.pig:pig = 0.14.0
com.fasterxml.jackson.*:* = 2.14.1
com.google.code.findbugs:jsr305 = 3.0.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackye1995 and @amogh-jahagirdar, this should be a banned dependency that is replaced by stephenc's reimplementation. This is a 1.2.0 release blocker.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, except for the banned dependency. Thanks for getting this in, and sorry that my review was late!

@amogh-jahagirdar
Copy link
Contributor Author

No problem! Let me raise a PR to address the banned dependency, sorry about that

krvikash pushed a commit to krvikash/iceberg that referenced this pull request Mar 16, 2023
zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025
Co-authored-by: John Zhuge <[email protected]>

(cherry picked from commit e5846a5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants