Core: Simplify/improve View APIs #7992

nastra · 2023-07-05T12:59:35Z

This PR re-visits the existing View APIs and aims at simplifying/improving a few things, namely:

field-aliases and field-comments should be part of the View's schema rather than be part of each individual SQL representation. Both fields are informational, and having them be part of each SQL representation could lead to issues across dialects
default-catalog: should be the name of the current view catalog that loaded this view, thus storing this information isn't required
default-namespace: should be the same across multiple SQL representations. We might want to enforce this and I've added a note in the view spec
Per dialect there should be no more than one SQL representation. We might want to enforce this and I've added a note in the view spec
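The "no more than one SQL representation per dialect" rule could be enforced with a check along the following lines. This is a minimal sketch and not code from this PR: `SqlRep` and `DialectCheck` are hypothetical names, and comparing dialects case-insensitively is an assumption.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a SQL view representation (not the Iceberg class).
final class SqlRep {
  final String dialect;
  final String sql;

  SqlRep(String dialect, String sql) {
    this.dialect = dialect;
    this.sql = sql;
  }
}

final class DialectCheck {
  private DialectCheck() {}

  // Throws if two representations share a dialect.
  // Case-insensitive comparison is an assumption made for this sketch.
  static void validateUniqueDialects(List<SqlRep> reps) {
    Set<String> seen = new HashSet<>();
    for (SqlRep rep : reps) {
      String key = rep.dialect.toLowerCase(Locale.ROOT);
      if (!seen.add(key)) {
        throw new IllegalArgumentException("Duplicate SQL representation for dialect: " + key);
      }
    }
  }
}
```

With this check, a list holding one "spark" and one "trino" representation passes, while a list holding "spark" and "Spark" is rejected.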

Creating/updating a view with one or more representations

A few options around improving and simplifying the ViewBuilder API have been explored. The PR deprecates the existing fine-grained methods for defining a view's representation and adds a withQuery(...) method that is used both when creating a View and when updating existing representations:

Creating a View with one representation

View view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("spark", "select a,b from ns.tbl")
  .createOrReplace()

Creating a view with multiple representations

View view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("spark", "select a,b from ns.tbl")
  .withQuery("trino", "select a,b from ns.tbl")
  .createOrReplace()

Adding an additional representation for an existing View

view.updateRepresentation()
  .withQuery("spark", "select...")  // adds this definition if it doesn't exist or updates the existing one
  .remove("trino")                  // removes the definition for this dialect
  .commit()

Updating a View's properties

view.updateProperties()
  .set("key1", "val1")
  .set("key2", "val2")
  .remove("key3")
  .commit()

nastra · 2023-07-05T14:01:05Z

I am actually surprised that RevAPI doesn't fail on those changes. It appears that this is related to the recent Gradle version bump in #7955

api/src/main/java/org/apache/iceberg/view/SQLViewRepresentation.java

rdblue · 2023-07-05T17:43:47Z

api/src/main/java/org/apache/iceberg/view/UpdateViewRepresentation.java

+ * <p>When committing, these changes will be applied to the current view metadata. Commit conflicts
+ * will be resolved by applying the pending changes to the new view metadata.
+ */
+public interface UpdateViewRepresentation extends PendingUpdate<ViewVersion> {


I think we need a better name than UpdateViewRepresentation. That doesn't capture what is happening.

I'm also not sure we even need this in the initial API. Are these operations that people will need to use? What is the SQL that exercises them or what is the use case?

At this stage, I think it is more important to focus on the ViewVersion API. We can add a way to add/remove representations based on the current version later.

One more thing: this may actually be dangerous with multiple people interacting. If you replace a view with a new query at the same time that I add a dialect to the old version, we could end up mixing SQL that doesn't produce the same result.

One more thing: this may actually be dangerous with multiple people interacting. If you replace a view with a new query at the same time that I add a dialect to the old version, we could end up mixing SQL that doesn't produce the same result.

Modifying any of this would write a new view version id and conflict detection would detect that the base view version of the second update changed (that's how it's currently implemented in #7913)
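The base-version check described here can be illustrated with a compare-and-set on the current version id. This is a simplified sketch under stated assumptions, not the implementation in #7913: `VersionPointer` is a hypothetical class that tracks only the current view version id.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy catalog pointer holding only the current view version id (hypothetical, not Iceberg code).
final class VersionPointer {
  private final AtomicInteger currentVersionId = new AtomicInteger(1);

  int current() {
    return currentVersionId.get();
  }

  // The commit succeeds only if the version the update was based on is still current.
  boolean commit(int baseVersionId, int newVersionId) {
    return currentVersionId.compareAndSet(baseVersionId, newVersionId);
  }
}
```

If two writers both read version 1 as their base, the first commit moves the pointer to its new version and the second commit is rejected, because its base is no longer the current version.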

I'm also not sure we even need this in the initial API. Are these operations that people will need to use? What is the SQL that exercises them or what is the use case?

The main use case would be to have an API to update a view's representation without having to reconstruct all other representations. Say you have representations for trino and spark and you'd want to only update the SQL definition of spark. In a normal replace you would replace the existing view completely with the "new" information, but you don't want to lose the representation for trino.

So I think the current issue is the semantics of .replace() in Iceberg (or maybe I just understood them wrong).

While a CREATE OR REPLACE is clear from an engine's perspective, I don't think the semantics are clear enough from Iceberg's perspective at the Builder level when calling createOrReplace().

View representations & replace

In #7880 I've implemented .replace() as a full replace, meaning that if you had a previous representation for spark, it would get completely replaced with whatever new representation was defined (which can be seen here).
In hindsight I think Iceberg's semantics for .replace() should keep previous view representations, so that we get the following behavior:

// create the view when executing "CREATE OR REPLACE VIEW ..." from Spark
View view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("spark", "select a,b from ns.tbl")
  .createOrReplace()

// replace the view when executing "CREATE OR REPLACE VIEW ..." from Trino
view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("trino", "select a,b from ns.tbl")
  .createOrReplace()

The current version of the view would then contain both representations:

// this would contain both representations
assertThat(view.currentVersion().representations()).hasSize(2)

View Properties & replace

To have the same semantics as with view representations during a replace, we should keep previous properties, so that view.properties() would return {"key1": "val1", "key2": "val2"} after executing the below code:

// create the view with a property
View view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("spark", "select a,b from ns.tbl")
  .withProperty("key1", "val1")
  .createOrReplace()

// replace the view
view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("trino", "select a,b from ns.tbl")
  .withProperty("key2", "val2")
  .createOrReplace()

#7880 currently handles properties during a replace as an actual replace and doesn't keep previous properties.

TLDR

The semantics of .replace() at the View API level should mean that previously configured representations and properties are carried over.

That being said, I think if we use these semantics, then we can probably remove the UpdateViewRepresentation API (in case we'd ever want to drop a view representation for a particular dialect, then we could add an API that does that)

The semantics of .replace() at the View API level should mean that previously configured representations and properties are carried over.

Yes, this is what we do for tables. If you want to erase the history, then you can drop and re-create the object.

we can probably remove the UpdateViewRepresentation

I think we should remove this. What we want instead is a way to replace the version, like newVersion() or replaceVersion(). I think that has mostly the same API as the ViewBuilder, probably through a shared interface like ViewVersionBuilder that has all the version-related methods.
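One way such a shared interface could look is a self-typed builder that both the view builder and the replace operation implement. The following is only a sketch of the idea: all three type names are hypothetical, and only the `T withQuery(...)` shape is taken from this discussion.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical shared interface holding the version-related methods.
interface VersionBuilderSketch<T> {
  T withQuery(String dialect, String sql);
}

// Hypothetical builder used when creating a view.
final class ViewBuilderSketch implements VersionBuilderSketch<ViewBuilderSketch> {
  final Map<String, String> queries = new LinkedHashMap<>();
  final Map<String, String> properties = new LinkedHashMap<>();

  @Override
  public ViewBuilderSketch withQuery(String dialect, String sql) {
    queries.put(dialect, sql);
    return this;
  }

  // View-level setting that is not part of a version.
  ViewBuilderSketch withProperty(String key, String value) {
    properties.put(key, value);
    return this;
  }
}

// Hypothetical update used when replacing the version of an existing view.
final class ReplaceVersionSketch implements VersionBuilderSketch<ReplaceVersionSketch> {
  final Map<String, String> queries = new LinkedHashMap<>();

  @Override
  public ReplaceVersionSketch withQuery(String dialect, String sql) {
    queries.put(dialect, sql);
    return this;
  }
}
```

The type parameter lets each implementation return its own type from the shared methods, so chained calls keep access to methods such as `withProperty` that exist only on the view builder.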

To summarize: doing a replace should carry over properties, but previous representations will not be carried over (because we don't know whether a previous view representation created for spark is compatible with a new view representation created for trino). I'll add a note to the view spec so that this behavior is clear and doesn't cause surprises.

@nastra, it isn't clear when you talk about carrying over representations because you're missing a layer of metadata. When you replace a view, the previous properties and versions are preserved. Replace also creates an entirely new version of the view that is independent of the other versions and the representations within those versions.

Iceberg is not responsible for somehow merging a view's current version with new SQL provided by the replace.
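The layering described above can be made concrete with a toy model: the view keeps a list of versions plus view-level properties, and a replace appends a new, independent version. This is an illustrative sketch only; `ToyViewMetadata` is a hypothetical class, and a version is reduced to a dialect-to-SQL map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of view metadata (hypothetical, not the Iceberg classes).
final class ToyViewMetadata {
  // Each version maps dialect -> SQL; the last entry is the current version.
  final List<Map<String, String>> versions = new ArrayList<>();
  final Map<String, String> properties = new LinkedHashMap<>();

  // A replace appends an independent version; it never merges with the current one.
  void replace(Map<String, String> newRepresentations, Map<String, String> newProperties) {
    versions.add(new LinkedHashMap<>(newRepresentations));
    properties.putAll(newProperties); // previously set properties are kept
  }

  Map<String, String> currentVersion() {
    return versions.get(versions.size() - 1);
  }
}
```

After one replace with a spark query and key1, followed by one with a trino query and key2, the model holds two versions and both properties, while the current version contains only the trino representation.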

danielcweeks · 2023-07-05T18:16:18Z

I feel like we lost some context from #7880, specifically the discussion around how we want to represent these APIs.

I agree with @jackye1995 about favoring option B. I just feel like this is too SQL specific and limits our options going forward with different representations and creates unnecessary differences with the spec representation.

Also, I'm not convinced that you can/should have field-aliases and field-comments at the top-level. Some engines may support field aliases, but others do not so they don't apply equally to the representations and is inconsistent with the spec, which has them at the representation level.

I think it's fine to omit the default catalog and default namespaces, especially if we can appropriately qualify them in the engine when constructing the sql representation.

api/src/main/java/org/apache/iceberg/view/View.java

api/src/main/java/org/apache/iceberg/view/ViewBuilder.java

format/view-spec.md

rdblue · 2023-07-05T19:15:47Z

@nastra, the example you gave would not be allowed:

View view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("spark", "select a,b from ns.tbl")
  .withQuery("trino", "select a,b,c from ns.tbl")
  .createOrReplace()

The representations must produce the same schema.
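A check for this rule might compare the output columns of each query against the view schema. The sketch below assumes the caller (for example, the engine) can supply the ordered output column names per dialect; that input, and the `SchemaCheck` class itself, are hypothetical and not part of this PR.

```java
import java.util.List;
import java.util.Map;

final class SchemaCheck {
  private SchemaCheck() {}

  // columnsByDialect: dialect -> ordered output column names.
  // These are assumed to be supplied by the caller; nothing here parses SQL.
  static void validateSameColumns(
      List<String> viewColumns, Map<String, List<String>> columnsByDialect) {
    for (Map.Entry<String, List<String>> entry : columnsByDialect.entrySet()) {
      if (!viewColumns.equals(entry.getValue())) {
        throw new IllegalArgumentException(
            "Dialect " + entry.getKey() + " produces " + entry.getValue()
                + " but the view schema is " + viewColumns);
      }
    }
  }
}
```

For the example above, a view schema of (a, b) accepts the spark query but rejects the trino query, which also selects c.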

api/src/main/java/org/apache/iceberg/view/ReplaceViewVersion.java

rdblue · 2023-07-10T17:02:44Z

api/src/main/java/org/apache/iceberg/view/VersionBuilder.java

+   * @param defaultNamespace The default namespace to use when the SQL does not contain a namespace
+   * @return this for method chaining
+   */
+  T withQuery(String dialect, String sql, Namespace defaultNamespace);


The default namespace should be a property of the view version, not a representation. withDefaultNamespace should be a separate method.

sorry, I was under the impression we would do #7992 (comment). I've updated the API to properly reflect this and moved the default catalog to the ViewVersion

api/src/main/java/org/apache/iceberg/view/VersionBuilder.java

api/src/main/java/org/apache/iceberg/view/ViewBuilder.java

format/view-spec.md

rdblue · 2023-07-16T23:10:22Z

.palantir/revapi.yml

+        \ java.lang.String) @ org.apache.iceberg.view.ViewBuilder"
+      justification: "Acceptable break due to updating View APIs and the View Spec"
+    - code: "java.method.removed"
+      old: "method java.lang.String org.apache.iceberg.view.ImmutableSQLViewRepresentation::defaultCatalog()"


I think there's a problem: the generated Immutable classes should not be in the api module. All of the implementation classes for API interfaces are in core and it makes no sense to publicly expose these in API when we don't publicly expose the others.

Immutables has a way to configure the visibility or the package of the generated class, but unfortunately there's nothing currently available that would put the generated class into a different project :(

rdblue

I think this is really close to being ready. Overall, I think it's a big improvement to the view definitions and makes behavior more clear and simplifies the API.

The main issue that I think we need to coordinate is the breaking changes in the public API. @aokolnychyi, @jackye1995, @danielcweeks, @Fokko, it would be great to get your input on whether we should move forward with the breaking changes.

In particular, it's really unfortunate that I didn't notice that Immutables was generating the implementations in the api module until now. That means far more was exposed and will break with these changes -- otherwise we could have simply made the removed methods return null or update them to return values from the ViewVersion and kept the getters. But the implementations are changing and had public builders as well.

I think since this is an unused API, we can probably get away with the breaking changes but I'd like to insulate us from having this issue in the future by moving the implementation classes into core like the others.

nastra · 2023-07-19T15:27:44Z

In particular, it's really unfortunate that I didn't notice that Immutables was generating the implementations in the api module until now. That means far more was exposed and will break with these changes -- otherwise we could have simply made the removed methods return null or update them to return values from the ViewVersion and kept the getters. But the implementations are changing and had public builders as well.

It's a mess up on my part and I should have noticed this when @Value.Immutable usage was added to iceberg-api. I was the main driving factor to get @Value.Immutable into Iceberg and it should have been my responsibility to make sure we're not exposing more than we should, especially in the iceberg-api module.

I have moved the View-specific Immutable classes as part of this PR to iceberg-core and I also opened #8099 to do the same for the other places in iceberg-api.

Once both PRs are merged, I'll add a follow-up and mention in the docs that @Value.Immutable usage is discouraged on iceberg-api (with a checkstyle rule that enforces it) and will also remove the Immutable dependency

rdblue · 2023-07-25T22:58:04Z

Once both PRs are merged, I'll add a follow-up and mention in the docs that @Value.Immutable usage is discouraged on iceberg-api (with a checkstyle rule that enforces it) and will also remove the Immutable dependency

I think we should disallow it, not just discourage it. A checkstyle rule sounds like the right approach though, so maybe we're saying the same thing.

rdblue · 2023-07-25T23:10:50Z

format/view-spec.md


 View metadata storage mirrors how Iceberg table metadata is stored and retrieved. View metadata is maintained in metadata files. All changes to view state create a new view metadata file and completely replace the old metadata using an atomic swap. Like Iceberg tables, this atomic swap is delegated to the metastore that tracks tables and/or views by name. The view metadata file tracks the view schema, custom properties, current and past versions, as well as other metadata.

+When view metadata is replaced, then previously set properties are carried over.


I think it's incorrect to make this statement in the spec. The "replace" operation for REPLACE VIEW is a SQL operation with chosen semantics that are represented in the Iceberg view metadata as you describe. But it's a very different statement to say that "when view metadata is replaced" something happens.

Saying that when metadata is "replaced" here could mean any update to the view metadata object, which is a much broader statement than you intended.

As in the table spec, we don't want to add broad statements like this. You can add a section explaining how a SQL REPLACE VIEW works, but even then I wouldn't say that it is a requirement of the spec.

makes sense, thanks for updating this

format/view-spec.md

rdblue · 2023-07-25T23:12:08Z

format/view-spec.md

+| _required_  | `summary`           | A string to string map of [summary metadata](#summary) about the version      |
+| _required_  | `representations`   | A list of [representations](#representations) for the view definition         |
+| _optional_  | `default-catalog`   | Catalog name to use when a reference in the SELECT does not contain a catalog |
+| _optional_  | `default-namespace` | Namespace to use when a reference in the SELECT is a single identifier        |


This should not be optional.

Updated to required and added clarification for how to handle this when it is null or missing.
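How an engine might apply the two defaults when resolving a table reference can be sketched as follows. This is an assumption-heavy illustration: `IdentifierResolver` is hypothetical, and it treats a two-part reference as namespace.table and anything longer as already fully qualified, which does not hold for multi-level namespaces.

```java
final class IdentifierResolver {
  private IdentifierResolver() {}

  // Qualifies a reference from the view SQL using the version's defaults.
  // Assumption: 1 part = table, 2 parts = namespace.table, 3+ parts = fully qualified.
  static String resolve(String reference, String defaultCatalog, String defaultNamespace) {
    String[] parts = reference.split("\\.");
    switch (parts.length) {
      case 1:
        // single identifier: apply both default-namespace and default-catalog
        return defaultCatalog + "." + defaultNamespace + "." + reference;
      case 2:
        // no catalog in the reference: apply default-catalog only
        return defaultCatalog + "." + reference;
      default:
        return reference;
    }
  }
}
```

Under these assumptions, with default catalog `prod` and default namespace `ns`, both `tbl` and `ns.tbl` resolve to `prod.ns.tbl`, and `other.ns.tbl` is left unchanged.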

rdblue · 2023-07-25T23:14:01Z

format/view-spec.md

+| _optional_  | `default-catalog`   | Catalog name to use when a reference in the SELECT does not contain a catalog |
+| _optional_  | `default-namespace` | Namespace to use when a reference in the SELECT is a single identifier        |
+
+When writing a new view version, all the information from the previous view version is replaced, meaning that `representations` / `default-catalog` / `default-namespace` / `summary` are not carried over.


This is also incorrect for the spec.

There is nothing preventing us from creating an operation that does exactly this carry over, as long as it is the user's intent. For example:

ALTER VIEW v ADD REPRESENTATION spark AS SELECT * FROM foo

This SQL clearly adds to the existing representations. It isn't correct or incorrect to do that. It is simply the behavior that a SQL operation could have. The spec allows all these things.

format/view-spec.md

rdblue · 2023-07-25T23:19:02Z

I had a few comments on the spec that I added as changes and committed. +1 now, so I'll merge when tests are passing.

rdblue · 2023-07-25T23:20:19Z

Thanks, @nastra!

nastra · 2023-07-26T06:11:33Z

Once both PRs are merged, I'll add a follow-up and mention in the docs that @Value.Immutable usage is discouraged on iceberg-api (with a checkstyle rule that enforces it) and will also remove the Immutable dependency

I think we should disallow it, not just discourage it. A checkstyle rule sounds like the right approach though, so maybe we're saying the same thing.

Yes we are saying the same thing here. The goal with the docs part was mainly to describe why @Value.Immutable should not be in iceberg-api and show 1-2 examples of how to apply @Value.Immutable to an API that is defined in iceberg-api.

wmoustafa · 2023-10-14T20:28:37Z

Also, I'm not convinced that you can/should have field-aliases and field-comments at the top-level. Some engines may support field aliases, but others do not so they don't apply equally to the representations and is inconsistent with the spec, which has them at the representation level.

+1. Also worth considering whether "Representation" really refers to an engine or a dialect. Right now it looks like dialect, but even if we go with the above we will be making an implicit correlation between dialect and engine. I realize that aliases were removed from this PR, but this idea about engine x dialect correlation is worth considering in future designs.

I think it's fine to omit the default catalog and default namespaces, especially if we can appropriately qualify them in the engine when constructing the sql representation.

+1. How are we thinking about the fact that catalog names are not the same across engines, even when the underlying tables end up being looked up from the same metastore/catalog? E.g., hive.db.t in Trino might refer to the same table as spark_catalog.db.t in Spark?

github-actions bot added API core labels Jul 5, 2023

nastra force-pushed the simplified-view-apis branch 2 times, most recently from 0b2d8e4 to 16376b6 Compare July 5, 2023 13:46

nastra mentioned this pull request Jul 5, 2023

Revert "Bump Gradle to 8.2 (#7955)" #7995

Merged

nastra requested review from amogh-jahagirdar, danielcweeks, jackye1995 and rdblue July 5, 2023 14:09

nastra mentioned this pull request Jul 5, 2023

Core: Add remaining View APIs and support for InMemoryCatalog #7880

Merged

rdblue reviewed Jul 5, 2023

api/src/main/java/org/apache/iceberg/view/SQLViewRepresentation.java

rdblue reviewed Jul 5, 2023

api/src/main/java/org/apache/iceberg/view/View.java

rdblue reviewed Jul 5, 2023

api/src/main/java/org/apache/iceberg/view/ViewBuilder.java

rdblue reviewed Jul 5, 2023

format/view-spec.md

rdblue reviewed Jul 5, 2023

format/view-spec.md

nastra force-pushed the simplified-view-apis branch 3 times, most recently from 987d96a to f9c00d7 Compare July 7, 2023 07:41

rdblue reviewed Jul 7, 2023

api/src/main/java/org/apache/iceberg/view/ReplaceViewVersion.java

nastra force-pushed the simplified-view-apis branch 2 times, most recently from 7a4f7bf to 6fdf558 Compare July 10, 2023 16:41

rdblue reviewed Jul 10, 2023

api/src/main/java/org/apache/iceberg/view/VersionBuilder.java

rdblue reviewed Jul 10, 2023

api/src/main/java/org/apache/iceberg/view/ViewBuilder.java

nastra force-pushed the simplified-view-apis branch 2 times, most recently from 067f3c2 to df45723 Compare July 11, 2023 06:54

rdblue reviewed Jul 16, 2023

format/view-spec.md

rdblue reviewed Jul 16, 2023

rdblue requested changes Jul 16, 2023

Core: Update View Spec and View APIs

0999931

nastra force-pushed the simplified-view-apis branch from 3e80052 to 0999931 Compare July 17, 2023 06:38

nastra requested a review from aokolnychyi July 17, 2023 07:01

github-actions bot added the build label Jul 19, 2023

nastra force-pushed the simplified-view-apis branch from ee00d7d to ce2c9dd Compare July 19, 2023 08:19

Remove Immutables from iceberg-api

02b63b1

nastra force-pushed the simplified-view-apis branch from ce2c9dd to 02b63b1 Compare July 19, 2023 08:27

nastra mentioned this pull request Jul 19, 2023

API, Core: Move @Value.Immutable usage from iceberg-api to iceberg-core #8099

Merged

rdblue reviewed Jul 25, 2023

format/view-spec.md

rdblue reviewed Jul 25, 2023

format/view-spec.md

rdblue reviewed Jul 25, 2023

format/view-spec.md

Apply suggestions from code review

909ec9f

rdblue reviewed Jul 25, 2023

format/view-spec.md

Update format/view-spec.md

8f3426e

rdblue approved these changes Jul 25, 2023

rdblue merged commit 2a35542 into apache:master Jul 25, 2023

nastra deleted the simplified-view-apis branch July 26, 2023 06:00

zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025

Core: Simplify and improve View APIs (apache#7992)

e568170

(cherry picked from commit 2a35542)


		View metadata storage mirrors how Iceberg table metadata is stored and retrieved. View metadata is maintained in metadata files. All changes to view state create a new view metadata file and completely replace the old metadata using an atomic swap. Like Iceberg tables, this atomic swap is delegated to the metastore that tracks tables and/or views by name. The view metadata file tracks the view schema, custom properties, current and past versions, as well as other metadata.

		When view metadata is replaced, then previously set properties are carried over.

Core: Simplify/improve View APIs #7992

Core: Simplify/improve View APIs #7992

Uh oh!

Conversation

nastra commented Jul 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Creating/updating a view with one or more representations

Creating a View with one representation

Creating a view with multiple representations

Adding an additional representation for an existing View

Updating a View's properties

Uh oh!

nastra commented Jul 5, 2023

Uh oh!

Uh oh!

rdblue Jul 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

View representations & replace

View Properties & replace

TLDR

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danielcweeks commented Jul 5, 2023

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rdblue commented Jul 5, 2023

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue left a comment

Choose a reason for hiding this comment

Uh oh!

nastra commented Jul 19, 2023

Uh oh!

rdblue commented Jul 25, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nastra commented Jul 5, 2023 •

edited

Loading

rdblue Jul 5, 2023 •

edited

Loading

wmoustafa commented Oct 14, 2023 •

edited

Loading