Contributor

@nastra nastra commented Jul 5, 2023

This PR re-visits the existing View APIs and aims at simplifying/improving a few things, namely:

  • field-aliases and field-comments should be part of the View's schema rather than be part of each individual SQL representation. Both fields are informational, and having them be part of each SQL representation could lead to issues across dialects
  • default-catalog: should be the name of the current view catalog that loaded this view, thus storing this information isn't required
  • default-namespace: should be the same across multiple SQL representations. We might want to enforce this and I've added a note in the view spec
  • Per dialect there should be no more than one SQL representation. We might want to enforce this and I've added a note in the view spec
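
To make the last bullet concrete, here is a minimal sketch of how "at most one SQL representation per dialect" could be enforced. `SqlRepresentation` and `RepresentationChecks` are hypothetical stand-ins rather than classes from this PR, and the case-insensitive dialect comparison is an assumption of the sketch.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a SQL view representation; not Iceberg's actual class.
final class SqlRepresentation {
  final String dialect;
  final String sql;

  SqlRepresentation(String dialect, String sql) {
    this.dialect = dialect;
    this.sql = sql;
  }
}

final class RepresentationChecks {
  private RepresentationChecks() {}

  // Rejects a list that contains more than one representation for the same dialect.
  // Assumption: dialect names are compared case-insensitively.
  static void checkOnePerDialect(List<SqlRepresentation> representations) {
    Set<String> seen = new HashSet<>();
    for (SqlRepresentation representation : representations) {
      String dialect = representation.dialect.toLowerCase(Locale.ROOT);
      if (!seen.add(dialect)) {
        throw new IllegalArgumentException(
            "Cannot add multiple queries for dialect: " + dialect);
      }
    }
  }
}
```

A builder could run a check like this before committing, so that a second `withQuery("spark", ...)` call either fails fast or replaces the earlier entry, depending on which behavior is chosen.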

Creating/updating a view with one or more representations

A few options around improving and simplifying the ViewBuilder API have been explored. The PR deprecates the existing fine-grained methods for defining a view's representation and adds a withQuery(...) method for use when creating a View or when updating existing representations:

Creating a View with one representation

View view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("spark", "select a,b from ns.tbl")
  .createOrReplace();

Creating a view with multiple representations

View view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("spark", "select a,b from ns.tbl")
  .withQuery("trino", "select a,b from ns.tbl")
  .createOrReplace();

Adding an additional representation for an existing View

view.updateRepresentation()
  .withQuery("spark", "select...")  // adds this definition if it doesn't exist or updates the existing one
  .remove("trino")                  // removes the definition for this dialect
  .commit();

Updating a View's properties

view.updateProperties()
  .set("key1", "val1")
  .set("key2", "val2")
  .remove("key3")
  .commit();

@nastra nastra force-pushed the simplified-view-apis branch 2 times, most recently from 0b2d8e4 to 16376b6 Compare July 5, 2023 13:46
Contributor Author

nastra commented Jul 5, 2023

I am actually surprised that RevAPI doesn't fail on those changes. It appears that this is related to the recent Gradle version bump in #7955.

* <p>When committing, these changes will be applied to the current view metadata. Commit conflicts
* will be resolved by applying the pending changes to the new view metadata.
*/
public interface UpdateViewRepresentation extends PendingUpdate<ViewVersion> {
Contributor

@rdblue rdblue Jul 5, 2023

I think we need a better name than UpdateViewRepresentation. That doesn't capture what is happening.

I'm also not sure we even need this in the initial API. Are these operations that people will need to use? What is the SQL that exercises them or what is the use case?

At this stage, I think it is more important to focus on the ViewVersion API. We can add a way to add/remove representations based on the current version later.

Contributor

One more thing: this may actually be dangerous with multiple people interacting. If you replace a view with a new query at the same time that I add a dialect to the old version, we could end up mixing SQL that doesn't produce the same result.

Contributor Author

One more thing: this may actually be dangerous with multiple people interacting. If you replace a view with a new query at the same time that I add a dialect to the old version, we could end up mixing SQL that doesn't produce the same result.

Modifying any of this would write a new view version id and conflict detection would detect that the base view version of the second update changed (that's how it's currently implemented in #7913)
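
As a rough illustration of that kind of conflict detection (not the implementation in #7913, which may differ), the sketch below commits a new version only if the base the writer started from is still current. `ViewState` and `ViewCatalogEntry` are hypothetical names, and the atomic swap is modeled with an in-memory `AtomicReference`.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified view state: only the current version id and its SQL.
final class ViewState {
  final int versionId;
  final String sql;

  ViewState(int versionId, String sql) {
    this.versionId = versionId;
    this.sql = sql;
  }
}

// Hypothetical catalog entry that swaps view state atomically.
final class ViewCatalogEntry {
  private final AtomicReference<ViewState> current;

  ViewCatalogEntry(ViewState initial) {
    this.current = new AtomicReference<>(initial);
  }

  ViewState load() {
    return current.get();
  }

  // Commits a new version only if `base` is still the current state.
  // Returns false when another writer committed first, i.e. a conflict.
  boolean commit(ViewState base, String newSql) {
    ViewState updated = new ViewState(base.versionId + 1, newSql);
    return current.compareAndSet(base, updated);
  }
}
```

With two writers loading the same base, the first commit succeeds and the second is rejected, so the second writer has to reload and decide whether its change still applies to the new version.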

Contributor Author

I'm also not sure we even need this in the initial API. Are these operations that people will need to use? What is the SQL that exercises them or what is the use case?

The main use case would be to have an API to update a view's representation without having to reconstruct all other representations. Say you have representations for trino and spark and you'd want to only update the SQL definition of spark. In a normal replace you would replace the existing view completely with the "new" information, but you don't want to lose the representation for trino.

Contributor Author

So I think the current issue is the semantics of .replace() in Iceberg (or maybe I just understood them wrong).

While a CREATE OR REPLACE is clear from an engine's perspective, I don't think the semantics are clear enough from Iceberg's perspective at the Builder level when calling createOrReplace().

View representations & replace

In #7880 I've implemented .replace() as a full replace, meaning that if you had a previous representation for spark, it would get completely replaced with whatever new representation was defined (which can be seen here).
In hindsight I think Iceberg's semantics for .replace() should keep previous view representations, so that we get the following behavior:

// create the view when executing "CREATE OR REPLACE VIEW ..." from Spark
View view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("spark", "select a,b from ns.tbl")
  .createOrReplace();

// replace the view when executing "CREATE OR REPLACE VIEW ..." from Trino
view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("trino", "select a,b from ns.tbl")
  .createOrReplace();

The current version of the view would then contain both representations:

// this would contain both representations
assertThat(view.currentVersion().representations()).hasSize(2);

View Properties & replace

To have the same semantics as with view representations during a replace, we should keep previous properties, so that view.properties() would return {"key1": "val1", "key2": "val2"} after executing the below code:

// create the view with a property
View view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("spark", "select a,b from ns.tbl")
  .withProperty("key1", "val1")
  .createOrReplace();

// replace the view
view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("trino", "select a,b from ns.tbl")
  .withProperty("key2", "val2")
  .createOrReplace();

#7880 currently handles properties during a replace as an actual replace and doesn't keep previous properties.

TLDR

The semantics of .replace() at the View API level should mean that previously configured representations and properties are carried over.

That being said, I think if we use these semantics, then we can probably remove the UpdateViewRepresentation API (in case we'd ever want to drop a view representation for a particular dialect, then we could add an API that does that)

Contributor

The semantics of .replace() at the View API level should mean that previously configured representations and properties are carried over.

Yes, this is what we do for tables. If you want to erase the history, then you can drop and re-create the object.

we can probably remove the UpdateViewRepresentation

I think we should remove this. What we want instead is a way to replace the version, like newVersion() or replaceVersion(). I think that has mostly the same API as the ViewBuilder, probably through a shared interface like ViewVersionBuilder that has all the version-related methods.

Contributor Author

To summarize: doing a replace should carry over properties, but previous representations will not be carried over (because we don't know whether a previous view representation created for spark is compatible with a new view representation created for trino). I'll add a note to the view spec so that this behavior is clear and doesn't cause surprises.

Contributor

@nastra, it isn't clear when you talk about carrying over representations because you're missing a layer of metadata. When you replace a view, the previous properties and versions are preserved. Replace also creates an entirely new version of the view that is independent of the other versions and the representations within those versions.

Iceberg is not responsible for somehow merging a view's current version with new SQL provided by the replace.
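
A small model may help separate the two metadata layers. The sketch below is only an illustration under assumptions: `VersionSketch` and `ViewMetadataSketch` are hypothetical classes, representations are reduced to a dialect-to-SQL map, and on a key clash the new property value wins. A replace keeps the properties and the version history, and adds a new version that contains only the representations supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a single view version: dialect -> SQL.
final class VersionSketch {
  final int id;
  final Map<String, String> representations;

  VersionSketch(int id, Map<String, String> representations) {
    this.id = id;
    this.representations = Collections.unmodifiableMap(new LinkedHashMap<>(representations));
  }
}

// Hypothetical model of view metadata: properties plus the full version history.
final class ViewMetadataSketch {
  final Map<String, String> properties;
  final List<VersionSketch> versions;

  ViewMetadataSketch(Map<String, String> properties, List<VersionSketch> versions) {
    this.properties = Collections.unmodifiableMap(new LinkedHashMap<>(properties));
    this.versions = Collections.unmodifiableList(new ArrayList<>(versions));
  }

  static ViewMetadataSketch create(
      Map<String, String> representations, Map<String, String> properties) {
    List<VersionSketch> versions = new ArrayList<>();
    versions.add(new VersionSketch(1, representations));
    return new ViewMetadataSketch(properties, versions);
  }

  VersionSketch currentVersion() {
    return versions.get(versions.size() - 1);
  }

  // Replace: earlier versions and properties are preserved, while the new version
  // holds only the representations passed in (nothing is merged from the old version).
  ViewMetadataSketch replace(
      Map<String, String> newRepresentations, Map<String, String> newProperties) {
    Map<String, String> mergedProperties = new LinkedHashMap<>(properties);
    mergedProperties.putAll(newProperties); // assumption: new values win on key clashes
    List<VersionSketch> allVersions = new ArrayList<>(versions);
    allVersions.add(new VersionSketch(currentVersion().id + 1, newRepresentations));
    return new ViewMetadataSketch(mergedProperties, allVersions);
  }
}
```

In this model, creating a view with a spark query and then replacing it with a trino query leaves two versions in the history. The current version has only the trino representation, and the spark SQL is still reachable through the first version.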

@danielcweeks
Contributor

I feel like we lost some context from #7880, specifically the discussion around how we want to represent these APIs.

I agree with @jackye1995 about favoring option B. I just feel like this is too SQL specific and limits our options going forward with different representations and creates unnecessary differences with the spec representation.

Also, I'm not convinced that you can/should have field-aliases and field-comments at the top level. Some engines may support field aliases while others do not, so they don't apply equally to all representations, and moving them is inconsistent with the spec, which has them at the representation level.

I think it's fine to omit the default catalog and default namespaces, especially if we can appropriately qualify them in the engine when constructing the sql representation.

Contributor

rdblue commented Jul 5, 2023

@nastra, the example you gave would not be allowed:

View view = catalog.buildView(identifier)
  .withSchema(schema)
  .withQuery("spark", "select a,b from ns.tbl")
  .withQuery("trino", "select a,b,c from ns.tbl")
  .createOrReplace();

The representations must produce the same schema.

@nastra nastra force-pushed the simplified-view-apis branch 3 times, most recently from 987d96a to f9c00d7 Compare July 7, 2023 07:41
@nastra nastra force-pushed the simplified-view-apis branch 2 times, most recently from 7a4f7bf to 6fdf558 Compare July 10, 2023 16:41
* @param defaultNamespace The default namespace to use when the SQL does not contain a namespace
* @return this for method chaining
*/
T withQuery(String dialect, String sql, Namespace defaultNamespace);
Contributor

The default namespace should be a property of the view version, not a representation. withDefaultNamespace should be a separate method.
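
One possible shape for that separation, sketched under assumptions: the interface and class names below are hypothetical, the namespace is a plain `String` instead of Iceberg's `Namespace` type, and the self-typed generic parameter mirrors the `T withQuery(...)` signature in the hunk above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical shared interface for version-related builder methods.
// The default namespace is set once per version, not per query.
interface VersionBuilderSketch<T> {
  T withQuery(String dialect, String sql);

  T withDefaultNamespace(String namespace);
}

// Hypothetical builder that implements the shared interface with a fluent self type.
final class ViewBuilderSketch implements VersionBuilderSketch<ViewBuilderSketch> {
  private final Map<String, String> queries = new LinkedHashMap<>();
  private String defaultNamespace;

  @Override
  public ViewBuilderSketch withQuery(String dialect, String sql) {
    queries.put(dialect, sql); // a later call for the same dialect replaces the earlier one
    return this;
  }

  @Override
  public ViewBuilderSketch withDefaultNamespace(String namespace) {
    this.defaultNamespace = namespace;
    return this;
  }

  String defaultNamespace() {
    return defaultNamespace;
  }

  Map<String, String> queries() {
    return Collections.unmodifiableMap(queries);
  }
}
```

Because the namespace lives on the builder rather than on each query, every representation in the resulting version shares the same default namespace by construction.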

Contributor Author

sorry, I was under the impression we would do #7992 (comment). I've updated the API to properly reflect this and moved the default catalog to the ViewVersion

@nastra nastra force-pushed the simplified-view-apis branch 2 times, most recently from 067f3c2 to df45723 Compare July 11, 2023 06:54
\ java.lang.String) @ org.apache.iceberg.view.ViewBuilder"
justification: "Acceptable break due to updating View APIs and the View Spec"
- code: "java.method.removed"
old: "method java.lang.String org.apache.iceberg.view.ImmutableSQLViewRepresentation::defaultCatalog()"
Contributor

I think there's a problem: the generated Immutable classes should not be in the api module. All of the implementation classes for API interfaces are in core and it makes no sense to publicly expose these in API when we don't publicly expose the others.

Contributor Author

Immutables has a way to configure the visibility or the package of the generated class, but unfortunately there's nothing currently available that would put the generated class into a different project :(

Contributor

@rdblue rdblue left a comment

I think this is really close to being ready. Overall, I think it's a big improvement to the view definitions and makes behavior more clear and simplifies the API.

The main issue that I think we need to coordinate is the breaking changes in the public API. @aokolnychyi, @jackye1995, @danielcweeks, @Fokko, it would be great to get your input on whether we should move forward with the breaking changes.

In particular, it's really unfortunate that I didn't notice that Immutables was generating the implementations in the api module until now. That means far more was exposed and will break with these changes -- otherwise we could have simply made the removed methods return null or update them to return values from the ViewVersion and kept the getters. But the implementations are changing and had public builders as well.

I think since this is an unused API, we can probably get away with the breaking changes but I'd like to insulate us from having this issue in the future by moving the implementation classes into core like the others.

@nastra nastra force-pushed the simplified-view-apis branch from 3e80052 to 0999931 Compare July 17, 2023 06:38
@nastra nastra requested a review from aokolnychyi July 17, 2023 07:01
@github-actions github-actions bot added the build label Jul 19, 2023
@nastra nastra force-pushed the simplified-view-apis branch from ee00d7d to ce2c9dd Compare July 19, 2023 08:19
Contributor Author

nastra commented Jul 19, 2023

In particular, it's really unfortunate that I didn't notice that Immutables was generating the implementations in the api module until now. That means far more was exposed and will break with these changes -- otherwise we could have simply made the removed methods return null or update them to return values from the ViewVersion and kept the getters. But the implementations are changing and had public builders as well.

It's a mess-up on my part and I should have noticed this when @Value.Immutable usage was added to iceberg-api. I was the main driving factor to get @Value.Immutable into Iceberg and it should have been my responsibility to make sure we're not exposing more than we should, especially in the iceberg-api module.

I have moved the View-specific Immutable classes as part of this PR to iceberg-core and I also opened #8099 to do the same for the other places in iceberg-api.

Once both PRs are merged, I'll add a follow-up and mention in the docs that @Value.Immutable usage is discouraged on iceberg-api (with a checkstyle rule that enforces it) and will also remove the Immutable dependency

Contributor

rdblue commented Jul 25, 2023

Once both PRs are merged, I'll add a follow-up and mention in the docs that @Value.Immutable usage is discouraged on iceberg-api (with a checkstyle rule that enforces it) and will also remove the Immutable dependency

I think we should disallow it, not just discourage it. A checkstyle rule sounds like the right approach though, so maybe we're saying the same thing.


View metadata storage mirrors how Iceberg table metadata is stored and retrieved. View metadata is maintained in metadata files. All changes to view state create a new view metadata file and completely replace the old metadata using an atomic swap. Like Iceberg tables, this atomic swap is delegated to the metastore that tracks tables and/or views by name. The view metadata file tracks the view schema, custom properties, current and past versions, as well as other metadata.

When view metadata is replaced, then previously set properties are carried over.
Contributor

I think it's incorrect to make this statement in the spec. The "replace" operation for REPLACE VIEW is a SQL operation with chosen semantics that are represented in the Iceberg view metadata as you describe. But it's a very different statement to say that "when view metadata is replaced" something happens.

Saying that when metadata is "replaced" here could mean any update to the view metadata object, which is a much broader statement than you intended.

As in the table spec, we don't want to add broad statements like this. You can add a section explaining how a SQL REPLACE VIEW works, but even then I wouldn't say that it is a requirement of the spec.

Contributor

Removed.

Contributor Author

makes sense, thanks for updating this

| _required_ | `summary` | A string to string map of [summary metadata](#summary) about the version |
| _required_ | `representations` | A list of [representations](#representations) for the view definition |
| _optional_ | `default-catalog` | Catalog name to use when a reference in the SELECT does not contain a catalog |
| _optional_ | `default-namespace` | Namespace to use when a reference in the SELECT is a single identifier |
Contributor

This should not be optional.

Contributor

Updated to required and added clarification for how to handle this when it is null or missing.
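
For readers following the thread, here is a hedged sketch of how an engine might apply `default-catalog` and `default-namespace` when qualifying a table reference. It is not the spec's wording. `ReferenceResolver` is a hypothetical helper. The sketch assumes single-level namespaces, so one part is a table, two parts are namespace.table, and three parts are fully qualified. It also assumes that a missing default catalog falls back to the catalog that loaded the view, as suggested in the PR description.

```java
// Hypothetical helper; real engines resolve identifiers with their own rules.
final class ReferenceResolver {
  private ReferenceResolver() {}

  // Qualifies a dotted table reference found in the view's SELECT.
  // Assumption: namespaces have exactly one level.
  static String qualify(
      String reference, String defaultCatalog, String defaultNamespace, String loadingCatalog) {
    // Fall back to the catalog that loaded the view when no default catalog is set.
    String catalog = defaultCatalog != null ? defaultCatalog : loadingCatalog;
    String[] parts = reference.split("\\.");
    switch (parts.length) {
      case 1:
        // Single identifier: apply the default namespace, then the catalog.
        return catalog + "." + defaultNamespace + "." + parts[0];
      case 2:
        // Namespace is present but the catalog is missing: apply the catalog only.
        return catalog + "." + reference;
      default:
        // Already fully qualified under this sketch's assumption.
        return reference;
    }
  }
}
```

Engines with multi-level namespaces cannot tell a catalog from a namespace by counting parts, which is one reason the handling of a null or missing default catalog needs to be spelled out.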

Contributor Author

thanks

| _optional_ | `default-catalog` | Catalog name to use when a reference in the SELECT does not contain a catalog |
| _optional_ | `default-namespace` | Namespace to use when a reference in the SELECT is a single identifier |

When writing a new view version, all the information from the previous view version is replaced, meaning that `representations` / `default-catalog` / `default-namespace` / `summary` are not carried over.
Contributor

This is also incorrect for the spec.

There is nothing preventing us from creating an operation that does exactly this carry over, as long as it is the user's intent. For example:

ALTER VIEW v ADD REPRESENTATION spark AS
SELECT * FROM foo

This SQL clearly adds to the existing representations. It isn't correct or incorrect to do that. It is simply the behavior that a SQL operation could have. The spec allows all these things.

Contributor

Removed.

Contributor

rdblue commented Jul 25, 2023

I had a few comments on the spec that I added as changes and committed. +1 now, so I'll merge when tests are passing.

@rdblue rdblue merged commit 2a35542 into apache:master Jul 25, 2023
Contributor

rdblue commented Jul 25, 2023

Thanks, @nastra!

@nastra nastra deleted the simplified-view-apis branch July 26, 2023 06:00
Contributor Author

nastra commented Jul 26, 2023

Once both PRs are merged, I'll add a follow-up and mention in the docs that @Value.Immutable usage is discouraged on iceberg-api (with a checkstyle rule that enforces it) and will also remove the Immutable dependency

I think we should disallow it, not just discourage it. A checkstyle rule sounds like the right approach though, so maybe we're saying the same thing.

Yes, we are saying the same thing here. The goal with the docs part was mainly to describe why @Value.Immutable should not be in iceberg-api and show 1-2 examples of how to apply @Value.Immutable to an API that is defined in iceberg-api.

Contributor

wmoustafa commented Oct 14, 2023

Also, I'm not convinced that you can/should have field-aliases and field-comments at the top level. Some engines may support field aliases while others do not, so they don't apply equally to all representations, and moving them is inconsistent with the spec, which has them at the representation level.

+1. Also worth considering whether "Representation" really refers to an engine or a dialect. Right now it looks like dialect, but even if we go with the above we will be making an implicit correlation between dialect and engine. I realize that aliases were removed from this PR, but this idea about engine x dialect correlation is worth considering in future designs.

I think it's fine to omit the default catalog and default namespaces, especially if we can appropriately qualify them in the engine when constructing the sql representation.

+1. How are we thinking about the fact that catalog names are not the same across engines, even when the underlying tables end up being looked up from the same metastore/catalog? E.g., hive.db.t in Trino might refer to the same table as spark_catalog.db.t in Spark?

zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025