-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: Simplify/improve View APIs #7992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0b2d8e4 to
16376b6
Compare
|
I am actually surprised that RevAPI doesn't fail on those changes. It appears that this is related to to recent Gradle version bump in #7955 |
| * <p>When committing, these changes will be applied to the current view metadata. Commit conflicts | ||
| * will be resolved by applying the pending changes to the new view metadata. | ||
| */ | ||
| public interface UpdateViewRepresentation extends PendingUpdate<ViewVersion> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a better name than UpdateViewRepresentation. That doesn't capture what is happening.
I'm also not sure we even need this in the initial API. Are these operations that people will need to use? What is the SQL that exercises them or what is the use case?
At this stage, I think it is more important to focus on the ViewVersion API. We can add a way to add/remove representations based on the current version later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing: this may actually be dangerous with multiple people interacting. If you replace a view with a new query at the same time that I add a dialect to the old verison, we could end up mixing SQL that doesn't produce the same result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing: this may actually be dangerous with multiple people interacting. If you replace a view with a new query at the same time that I add a dialect to the old verison, we could end up mixing SQL that doesn't produce the same result.
Modifying any of this would write a new view version id and conflict detection would detect that the base view version of the second update changed (that's how it's currently implemented in #7913)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not sure we even need this in the initial API. Are these operations that people will need to use? What is the SQL that exercises them or what is the use case?
The main use case would be to have an API to update a view's representation without having to reconstruct all other representations. Say you have representations for trino and spark and you'd want to only update the SQL definition of spark. In a normal replace you would replace the existing view completely with the "new" information, but you don't want to lose the representation for trino.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think the current issue is the semantics of .replace() in Iceberg (or maybe I just understood them wrong).
While a CREATE OR REPLACE is clear from an engine's perspective, I don't think the semantics are clear enough from Iceberg's perspective at the Builder level when calling createOrReplace().
View representations & replace
In #7880 I've implemented .replace() as a full replace, meaning that if you had a previous representation for spark, it would get completely replaced with whatever new representation was defined (which can be seen here).
In hindsight I think Iceberg's semantics for .replace() should keep previous view representations, so that we get the following behavior:
// create the view when executing "CREATE OR REPLACE VIEW ..." from Spark
View view = catalog.buildView(identifier)
.withSchema(schema)
.withQuery("spark", "select a,b from ns.tbl")
.createOrReplace()
// replace the view when executing "CREATE OR REPLACE VIEW ..." from Trino
view = catalog.buildView(identifier)
.withSchema(schema)
.withQuery("trino", "select a,b from ns.tbl")
.createOrReplace()
Checking the representations of the current version of the view would contain both representations
// this would contain both representations
assertThat(view.currentVersion().representations()).hasSize(2)
View Properties & replace
To have the same semantics as with view representations during a replace, we should keep previous properties, so that view.properties() would return {"key1": "val1", "key2": "val2"} after executing the below code:
// create the view with a property
View view = catalog.buildView(identifier)
.withSchema(schema)
.withQuery("spark", "select a,b from ns.tbl")
.withProperty("key1", "val1")
.createOrReplace()
// replace the view
view = catalog.buildView(identifier)
.withSchema(schema)
.withQuery("trino", "select a,b from ns.tbl")
.withProperty("key2", "val2")
.createOrReplace()
#7880 currently handles properties during a replace as an actual replace and doesn't keep previous properties.
TLDR
The semantics of .replace() at the View API level should mean that previously configured representations and properties are carried over.
That being said, I think if we use these semantics, then we can probably remove the UpdateViewRepresentation API (in case we'd ever want to drop a view representation for a particular dialect, then we could add an API that does that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantics of .replace() at the View API level should mean that previously configured representations and properties are carried over.
Yes, this is what we do for tables. If you want to erase the history, then you can drop and re-create the object.
we can probably remove the UpdateViewRepresentation
I think we should remove this. What we want instead is a way to replace the version, like newVersion() or replaceVersion(). I think that has mostly the same API as the ViewBuilder, probably through a shared interface like ViewVersionBuilder that has all the version-related methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To summarize: doing a replace should carry over properties, but previous representations will not be carried over (because we don't know whether a previous view representation created for spark is compatible with a new view representation created for trino). I'll add a note to the view spec so that this behavior is clear and doesn't cause surprises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nastra, it isn't clear when you talk about carrying over representations because you're missing a layer of metadata. When you replace a view, the previous properties and versions are preserved. Replace also creates an entirely new version of the view that is independent of the other versions and the representations within those versions.
Iceberg is not responsible for somehow merging a view's current version with new SQL provided by the replace.
|
I feel like we lost some context from #7880, specifically the discussion around how we want to represent these APIs. I agree with @jackye1995 about favoring option B. I just feel like this is too SQL specific and limits our options going forward with different representations and creates unnecessary differences with the spec representation. Also, I'm not convinced that you can/should have I think it's fine to omit the default catalog and default namespaces, especially if we can appropriately qualify them in the engine when constructing the sql representation. |
|
@nastra, the example you gave would not be allowed: View view = catalog.buildView(identifier)
.withSchema(schema)
.withQuery("spark", "select a,b from ns.tbl")
.withQuery("trino", "select a,b,c from ns.tbl")
.createOrReplace()The representations must produce the same schema. |
987d96a to
f9c00d7
Compare
api/src/main/java/org/apache/iceberg/view/ReplaceViewVersion.java
Outdated
Show resolved
Hide resolved
7a4f7bf to
6fdf558
Compare
| * @param defaultNamespace The default namespace to use when the SQL does not contain a namespace | ||
| * @return this for method chaining | ||
| */ | ||
| T withQuery(String dialect, String sql, Namespace defaultNamespace); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default namespace should be a property of the view version, not a representation. withDefaultNamespace should be a separate method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I was under the impression we would do #7992 (comment). I've updated the API to properly reflect this and moved the default catalog to the ViewVersion
067f3c2 to
df45723
Compare
.palantir/revapi.yml
Outdated
| \ java.lang.String) @ org.apache.iceberg.view.ViewBuilder" | ||
| justification: "Acceptable break due to updating View APIs and the View Spec" | ||
| - code: "java.method.removed" | ||
| old: "method java.lang.String org.apache.iceberg.view.ImmutableSQLViewRepresentation::defaultCatalog()" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a problem: the generated Immutable classes should not be in the api module. All of the implementation classes for API interfaces are in core and it makes no sense to publicly expose these in API when we don't publicly expose the others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Immutables has a a way to configure the visibility or the package of the generated class, but unfortunately there's nothing currently available that would put the generated class into a different project :(
rdblue
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is really close to being ready. Overall, I think it's a big improvement to the view definitions and makes behavior more clear and simplifies the API.
The main issue that I think we need to coordinate is the breaking changes in the public API. @aokolnychyi, @jackye1995, @danielcweeks, @Fokko, it would be great to get your input on whether we should move forward with the breaking changes.
In particular, it's really unfortunate that I didn't notice that Immutables was generating the implementations in the api module until now. That means far more was exposed and will break with these changes -- otherwise we could have simply made the removed methods return null or update them to return values from the ViewVersion and kept the getters. But the implementations are changing and had public builders as well.
I think since this is an unused API, we can probably get away with the breaking changes but I'd like to insulate us from having this issue in the future by moving the implementation classes into core like the others.
3e80052 to
0999931
Compare
ee00d7d to
ce2c9dd
Compare
ce2c9dd to
02b63b1
Compare
It's a mess up on my part and I should have noticed this when I have moved the View-specific Immutable classes as part of this PR to Once both PRs are merged, I'll add a follow-up and mention in the docs that |
I think we should disallow it, not just discourage it. A checkstyle rule sounds like the right approach though, so maybe we're saying the same thing. |
format/view-spec.md
Outdated
|
|
||
| View metadata storage mirrors how Iceberg table metadata is stored and retrieved. View metadata is maintained in metadata files. All changes to view state create a new view metadata file and completely replace the old metadata using an atomic swap. Like Iceberg tables, this atomic swap is delegated to the metastore that tracks tables and/or views by name. The view metadata file tracks the view schema, custom properties, current and past versions, as well as other metadata. | ||
|
|
||
| When view metadata is replaced, then previously set properties are carried over. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's incorrect to make this statement in the spec. The "replace" operation for REPLACE VIEW is a SQL operation with chosen semantics that are represented in the Iceberg view metadata as you describe. But it's a very different statement to say that "when view metadata is replaced" something happens.
Saying that when metadata is "replaced" here could mean any update to the view metadata object, which is a much broader statement than you intended.
As in the table spec, we don't want to add broad statements like this. You can add a section explaining how a SQL REPLACE VIEW works, but even then I wouldn't say that it is a requirement of the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, thanks for updating this
format/view-spec.md
Outdated
| | _required_ | `summary` | A string to string map of [summary metadata](#summary) about the version | | ||
| | _required_ | `representations` | A list of [representations](#representations) for the view definition | | ||
| | _optional_ | `default-catalog` | Catalog name to use when a reference in the SELECT does not contain a catalog | | ||
| | _optional_ | `default-namespace` | Namespace to use when a reference in the SELECT is a single identifier | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to required and added clarification for how to handle this when it is null or missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks
format/view-spec.md
Outdated
| | _optional_ | `default-catalog` | Catalog name to use when a reference in the SELECT does not contain a catalog | | ||
| | _optional_ | `default-namespace` | Namespace to use when a reference in the SELECT is a single identifier | | ||
|
|
||
| When writing a new view version, all the information from the previous view version is replaced, meaning that `representations` / `default-catalog` / `default-namespace` / `summary` are not carried over. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also incorrect for the spec.
There is nothing preventing us from creating an operation that does exactly this carry over, as long as it is the user's intent. For example:
ALTER VIEW v ADD REPRESENTATION spark AS
SELECT * FROM fooThis SQL clearly adds to the existing representations. It isn't correct or incorrect to do that. It is simply the behavior that a SQL operation could have. The spec allows all these things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
|
I had a few comments on the spec that I added as changes and committed. +1 now, so I'll merge when tests are passing. |
|
Thanks, @nastra! |
Yes we are saying the same thing here. The goal with the docs part was mainly to describe why |
+1. Also worth considering whether "Representation" really refers to an engine or a dialect. Right now it looks like dialect, but even if we go with the above we will be making an implicit correlation between dialect and engine. I realize that aliases were removed from this PR, but this idea about engine x dialect correlation is worth considering in future designs.
+1. How are we thinking about the fact that catalog names are not the same across engines, even when the underlying tables end up being looked up from the same metastore/catalog? E.g., |
(cherry picked from commit 2a35542)
This PR re-visits the existing View APIs and aims at simplifying/improving a few things, namely:
field-aliasesandfield-commentsshould be part of the View's schema rather than be part of each individual SQL representation. Both fields are informational, and having them be part of each SQL representation could lead to issues across dialectsdefault-catalog: should be the name of the current view catalog that loaded this view, thus storing this information isn't requireddefault-namespace: should be the same across multiple SQL representations. We might want to enforce this and I've added a note in the view specCreating/updating a view with one or more representations
A few options around improving and simplifying the
ViewBuilderAPI have been explored. The PR deprecates the existing fine-grained methods on for defining a view's representation and adds awithQuery(...)method when creating a View or when updating existing representations:Creating a View with one representation
Creating a view with multiple representations
Adding an additional representation for an existing View
Updating a View's properties