-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Add common view format spec #3188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
FYI @jacques-n. If you want to help review this to coordinate or deconflict with the Substrait effort, that would be really helpful! |
site/docs/view-spec.md
Outdated
|
|
||
| ### Metadata Location Management | ||
|
|
||
| The view is created in the default warehouse location with database name and view name in the suffix. An example path is: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to mention these defaults, because the default root location is different across different catalog implementations and should not be relied on. If it's managed exactly like table location, we should just say that instead.
site/docs/view-spec.md
Outdated
|
|
||
| Imagine the following sequence of operations: | ||
|
|
||
| * `create table base_tab(c1 int, c2 varchar);` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit capitalize SQL keywords
site/docs/view-spec.md
Outdated
| * Propose a common metadata format for view metadata, similar to how Iceberg supports a common table format for tables. The common metadata format specification Includes storage format as well as APIs to write/read the metadata | ||
| * Support versioning of views to track how a view evolved over time | ||
|
|
||
| ## Proposal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we going to display this in the website? Should we just make it official documentation for the view spec under the Format section instead of saying it's a proposal? @rdblue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, at this point I think we should edit this doc as though it is a pending spec, not a proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
site/docs/view-spec.md
Outdated
| | Required | version-id | Monotonically increasing id indicating the version of the view. Starts with “1”. | | ||
| | Required | timestamp-ms | Timestamp expressed in ms since epoch at which the version of the view was created. | | ||
| | Required | summary | A string to string map of view properties to track version metadata. This field can be used by engines to store any necessary properties. Two currently required properties are described below. | | ||
| | Required | sql | A string representing SQL definition of the view as input | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the development of Substrait, I feel we can use a more generic word (like "content", "representation"?), and then in the future we can make "substrait" a dialect? Would that work? @jacques-n
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jacques-n could you comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue and I both had comments below. In general, it seems like @rdblue is leaning towards a collection of representations with at least one defined. I interpreted that as something like this (pseudo-json this time):
representations: [
{"type": "sql", "dialect": "trino", "version": "361", "text": "CREATE VIEW v1 as SELECT * from foo"}
{"type": "sql", "dialect": "hive", "version": "3.1.2", "text": "CREATE VIEW v1 as SELECT * from foo"}
{"type": "substrait", "version": "1", "text": "..."}
{"type": "substrait", "version": "1", "binary": "..."}
]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. Representation field "type" can be used to interpret the rest of the fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version of spec will only spell out "sql" type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After giving it some thought, I'd suggest adding representations later. Here are a few reasons:
- For the majority of the use cases, we just need to store the sql dialect when the view is created.
- Current spec does not spell how to convert and store other sql dialect. When that day comes, we might decide to convert on the fly without storing converted sql in json.
- When a concrete representation is proposed, we can discuss how to extend the json. The field
sqlanddialectmight be a fallback for therepresentationslist.
In short, field sql and dialect seem to be enough for now. Let's keep it simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jacques-n How about this trade-off?
- Support a list of representations
- Types should be well defined
- Only allow one object per type in the list
- Only define one type "OriginalSql" in this spec
- Need to go through future discussions to add more types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree @jzhuge that we shouldn't define a ton of representations up front. Though I am not sure what the downside of @jacques-n proposal is. Specifically: adding sql type and dialect. It is likely that all initial implementations would only contain 1 element in the list and only the dialect that the view was created by. But thats ok, we could add features to extend that list later: dialect translation or substrait etc.
I am not sure what the OriginalSql above would be? To me we have to specify a dialect so others know if they can read the sql or not right?
site/docs/view-spec.md
Outdated
| Each compute engine stores the metadata of the view in its proprietary format in the metastore of choice. Thus, views created from one engine can not be read or altered easily from another engine even when engines share the metastore as well as the storage system. This document proposes standardizing the view metadata for ease of sharing the views across engines. | ||
|
|
||
| ### Goals | ||
| * Propose a common metadata format for view metadata, similar to how Iceberg supports a common table format for tables. The common metadata format specification Includes storage format as well as APIs to write/read the metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bullet point before Includes
|
@rdblue, just saw your message. Didn't get a notification for some reason. Will take a look. |
|
A couple of thoughts:
|
|
To further the point on resolved sql: there is also something weird when working with table references and it not being clear if a reference is absolute or relative. For example, if I create a view when my current schema is What is expected behavior in the case of a Iceberg view? (note that I'm using |
|
I take it that we are driving towards consensus. @jzhuge will take this forward with the PR for implementation (as well address any outstanding comments). |
|
@jacques-n, thanks for the example. I was going to ask for one to clarify about SQL resolution but you added one first!
This is a case I hadn't considered much because I thought that table resolution would not change based on the existence of tables. The resolution in Trino never has this case because it always needs 3 parts:
We also specifically avoided this ambiguity in Spark when building the multi-part identifier support. We created rules so that there is only ever one way to resolve a table reference:
For a fixed set of catalogs, these rules provide an unambiguous table reference. There is some ambiguity if the set of catalogs changes, but that's a rare event because it is something done primarily by admins rather than individual users (like creating tables). We decided against the idea of falling back to trying the other branch if using rule 2 or rule 3 failed to load a table. For the view spec, we have tried to keep enough information that resolution is always done the same way. That's why the current catalog and namespace are stored as part of the view definition. We could also store the set of catalogs defined, but if there are some SQL systems that don't have an unambiguous resolution (meaning not based on the existence of tables) then it might not matter and we should decide how to handle it. The argument against requiring a resolved plan is that it may not be possible to produce one. I think Spark in particular used to store resolved plans but gave up on being able to produce them at all. They're also much longer and tended to cause views to break by being truncated in a database column even faster. And if we have a substrait plan, we probably don't need the resolved SQL. I think my preference is to allow storing multiple optional representations. We know that we will need the original SQL for SQL-configured views. And we know that storing a standard query plan (like Substrait) is preferred in the long term over even a resolved SQL string because of dialect problems. Since even the SQL may be optional for views created directly as portable plans, I think it makes sense to make all of these representations possible and optional. This is looking more like @jackye1995's original suggestion to be able to store different SQL for a given dialect. Maybe we should reconsider that. |
|
@rdblue , I think your comments are mostly focused on table identifiers. I'll respond to those below. Field names are probably the more important of the two. We should specify clearly what it means for a view to be "invalid". For example, if the SQL parses but the column names or column types are different, I would expect this to fail with an invalid view definition error. Additionally, I think we should probably cleanup the weird dual column definition pattern (if a user expressed column name aliases in the create view versus the population of derived * column names in the schema). Also, if we're capturing what the user wrote, we should probably capture the CREATE VIEW ... part of the command (not just the inner sql). I think this is consistent with how other tools expose the data. WRT to table identifiers, I struggle with being opinionated about resolution given that end users may work in a system that already has a particular resolution behavior. |
@jacques-n Unfortunately it will make it harder for other engine to consume the sql text, e.g., Presto doesn't understand Spark's syntax with column alias. To make it work, we might have to divide CREATE VIEW text into 2 parts, 1 for sql text, 1 for the rest. However, since CREATE VIEW syntax is fairly stable for years, field |
site/docs/view-spec.md
Outdated
|
|
||
| ### Storage Format | ||
|
|
||
| The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on S3 for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than refer to a specific technology I suggest a generic storage name
| The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on S3 for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view. | |
| The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view. |
site/docs/view-spec.md
Outdated
| | Required/Optional | Field Name | Description | | ||
| |-------------------|------------|-------------| | ||
| | Required | format-version | Json format version number for the view metadata spec. The view metadata spec and the corresponding format-version is independent of table spec. Starts with 1 and is incremented when there is a breaking change to view metadata. | | ||
| | Required | object-type | String identifying the type of object this metadata file is for. ‘View’ for all objects covered in this spec. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the table spec need to be updated to add this field also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, as it is necessary to differentiate view from table for catalogs without metastore which typically specifies the object type.
site/docs/view-spec.md
Outdated
| | Optional | current-schema-id | ID of the current schema of the view | | ||
|
|
||
|
|
||
| ‘schemas’ is an array of structs with a field as shown below: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dumb question, why is this optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some catalogs may decide to analyze the schema when using the view thus there is no need to store the schema when creating the view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be made non-optional later? I think optional is fine as long as views are dialect specific. Eg EngineX knows it computes schema at run time so it doesn't write it and view creation time. But we expect to add dialect free views in later versions, in which case I think schema would no longer be optional. That sounds like a breaking change in this spec, maybe easier to make it required now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with me as I haven't seen a use case where schema is not available (too expensive to obtain?) at creation. Anyone object to making it required?
site/docs/view-spec.md
Outdated
| | Required | version-id | Monotonically increasing id indicating the version of the view. Starts with “1”. | | ||
| | Required | timestamp-ms | Timestamp expressed in ms since epoch at which the version of the view was created. | | ||
| | Required | summary | A string to string map of view properties to track version metadata. This field can be used by engines to store any necessary properties. Two currently required properties are described below. | | ||
| | Required | sql | A string representing SQL definition of the view as input | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree @jzhuge that we shouldn't define a ton of representations up front. Though I am not sure what the downside of @jacques-n proposal is. Specifically: adding sql type and dialect. It is likely that all initial implementations would only contain 1 element in the list and only the dialect that the view was created by. But thats ok, we could add features to extend that list later: dialect translation or substrait etc.
I am not sure what the OriginalSql above would be? To me we have to specify a dialect so others know if they can read the sql or not right?
site/docs/view-spec.md
Outdated
| | Required | summary | A string to string map of view properties to track version metadata. This field can be used by engines to store any necessary properties. Two currently required properties are described below. | | ||
| | Required | sql | A string representing SQL definition of the view as input | | ||
| | Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. | | ||
| | Required | default-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables used in the view definition. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was brought up on the original proposal as well, these two fields seem a bit specific to certain types of engines. Maybe it would be helpful to me to see some examples from different engines and how views would be represented in them.
A few more questions:
- what if the catalog is set up differently or with a different name, can these be overriden?
- why are they required?
- how are these used at run time. If I have a query
select * from bazand thedefault-catalog=fooanddefault-namespace=barare we doing string replacement to then executeselect * from foo.bar.bazor is it assumed that the user has to run aUSE CATALOG foo;USE NAMESPACE barstatement first? String replacement sounds iffy as it can't really be done w/o understanding the sql (which we don't want to do yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Catalog setup has to be consistent (e.g., no rename) between view creation and usage.
- Not required. Fixed. Also change the field names to
session-catalogandsession-namespace, because "default" is a little confusing. - Fully qualify identifiers in the parsed plan. Don't do string replacement which is iffy as you pointed out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re 3. how do we get "Fully qualify identifiers in the parsed plan. " It still isn't clear to me how these values should be used at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, view sql is select * from tbl1, session-catalog is cat1, and session-namespace is ns1. When the view is accessed, after CTE substitution, we will perform view substitution which looks for unresolved relations in the parsed plan for view sql. Here we find tbl which does not have namespace and catalog, we will convert it to cat1.ns1.tbl1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So an engine has to use these fields in its planning stage? That makes sense, sorry for the misunderstanding. I suppose this is potentially iterative. Eg search for tbl1 and not found, search for ns1.tbl1 not found, search for cat1.ns1.tbl1 found?
Or is it assumed that only iceberg tables are in an iceberg view so the lookup can be done from inside iceberg code w/o engines having to change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No iterative search. If the identifier has 1 part, tbl -> cat1.ns1; if the identifier has 2 parts, ns.tbl -> cat1.ns.tbl. Ultimately the purpose of these 2 fields is to help query engines resolve identifiers without ambiguity. Of course, engines can choose to send us sql text with all identifiers qualified, then these 2 fields are optional.
site/docs/view-spec.md
Outdated
| | Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. | | ||
| | Required | default-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables used in the view definition. | | ||
| | Required | default-namespace | An array of strings indicating namespace at the time view was created / replaced. Used similar to ‘default-catalog’ above. | | ||
| | Optional | field-aliases | A list of strings of field aliases E.g. a list of alias_name info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand the purpose of this field. Could you please clarify? eg how would this alias be different from aalias in select a as aalias from foo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aliasing the columns in select star sql. This is supported in Spark, Hive, and a few other engines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry @jzhuge I am not sure I understand your response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the syntax:
CREATE VIEW v1 (alias1 COMMENT 'comment for col1', alias2) AS
SELECT *
FROM tbl
Table tbl has 2 columns: col1 and col2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are not the first one who find this feature hard to grasp. Presto does not support it. I'd suggest removing this field for this PR and have a follow-up discussion about the view column alias feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah I get it now. Interesting. Yeah, I would prefer to defer this as it sounds rather engine specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This syntax is actually quite common. It's reasonable to not include it right now to get this in, but I think we'll want to add it fairly soon so that we can store the whole set of CREATE VIEW metadata.
site/docs/view-spec.md
Outdated
| | Required/Optional | Key | Value | | ||
| |-------------------|-----|-------| | ||
| | Required | operation |A string value indicating the view operation that caused this metadata to be created. Allowed values are “CREATE” and “REPLACE” | | ||
| | Required | engine-version | A string value indicating the version of the engine that performed the operation (create / replace) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this field is redundant if we go with the suggestion above for representations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right
|
Hi @rymurr @jackye1995 @jacques-n @losipiuk @rdblue , the spec has been updated with all review comments addressed. The main addition is the "representations" list. There are also a few minor reorgs and cleanups. Please take another look. |
|
this looks great @jzhuge! Thanks! |
nastra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks really great
site/docs/view-spec.md
Outdated
| "type" : "string", | ||
| "doc" : "" | ||
| } ] | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In future, is there a plan to parse the SQL and capture referenced tables/views by the view as part of the View Metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not planned. What use case do you have in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, use-cases are warehouse management related. For example, table-level audit logging access and ACLs enforcements for tables queried through views.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds interesting. For this pull request, we are trying to keep things minimal, but we can definitely start a separate thread for discussion.
As a workaround, you can store the information as view properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, let's discuss the expansion separately. Great PR. 👍
|
@rdblue @anjalinorwood @jacques-n @jackye1995 Are we ready to merge this PR? |
|
Hi John, LGTM! Thank you for picking up this PR to drive it to completion. |
site/docs/spec.md
Outdated
| | _required_ | _required_ | `partition-spec` | JSON fields representation of the partition spec used to write the manifest | | ||
| | _optional_ | _required_ | `partition-spec-id` | ID of the partition spec used to write the manifest as a string | | ||
| | _optional_ | _required_ | `format-version` | Table format version number of the manifest as a string | | ||
| | _optional_ | _optional_ | `object-type` | Type of object this metadata file is for: "table" or "view". Default is "table". | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would state that "if missing, assumed to be "table"` rather than "default is table". Using "default" doesn't quite seem correct to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be better to say spec-type? Not sure if it is only for me, but object-type sounds very Java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not like object-type either. How about manifest-type over spec-type?
Type of this manifest file, if missing, assumed to be "table".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be careful about using the term "manifest". That refers to manifest files in the table spec and not these metadata files.
I'm okay with "spec-type" or "object-type". I don't think we should worry too much about this.
site/docs/spec.md
Outdated
| |Metadata field|JSON representation|Example| | ||
| |--- |--- |--- | | ||
| |**`format-version`**|`JSON int`|`1`| | ||
| |**`object-type`**|`JSON string`|`table`| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: There should be double quotes around table because the example should be a JSON string, like the other strings below.
site/docs/view-spec.md
Outdated
|
|
||
| ## Background and Motivation | ||
|
|
||
| Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to say "commonly known as views"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already say "A view is a logical table" later, so we can save the first sentence to only say "most compute engines (e.g. Trino and Apache Spark) support views".
site/docs/view-spec.md
Outdated
|
|
||
| ## Background and Motivation | ||
|
|
||
| Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last sentence, "Views and tables occupy the same namespace" doesn't seem necessary. That is a catalog choice, right? It doesn't affect this spec if someone uses it for catalogs that do not mix views and tables so that they occupy separate namespaces.
site/docs/view-spec.md
Outdated
| ## Background and Motivation | ||
|
|
||
| Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace. | ||
| Each compute engine stores the metadata of the view in its proprietary format in the metastore of choice. Thus, views created from one engine can not be read or altered easily from another engine even when engines share the metastore as well as the storage system. This document standardizes the view metadata for ease of sharing the views across engines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intended to be a separate paragraph? If so, I think you need a newline between this line and the previous one.
site/docs/view-spec.md
Outdated
|
|
||
| ## Overview | ||
|
|
||
| The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is good to call out that the metadata model matches tables, but this goes a bit too far and states requirements for the metastore, like tracking "database name, owner, create time, last access time, ..."
I think this should be more similar to the table spec and bring in a lot of similar wording:
View metadata storage mirrors how Iceberg table metadata is stored and retrieved. View metadata is maintained in metadata files. All changes to view state create a new view metadata file and completely replace the old metadata using an atomic swap. Like Iceberg tables, this atomic swap is delegated to the metastore that tracks tables and/or views by name. The view metadata file tracks the view schema, custom properties, current and past versions, as well as other metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks!
site/docs/view-spec.md
Outdated
|
|
||
| The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view. | ||
|
|
||
| Each ‘CREATE OR REPLACE VIEW’ statement creates a new view version metadata file for that view. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you intend to use backticks and fixed-width font for CREATE OR REPLACE VIEW?
Is it necessary to state this in terms of SQL? Above, I suggested "all changes to view state create a new metadata file and completely replace the old metadata ..." That seems sufficient to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. Combined 2 paragraphs.
site/docs/view-spec.md
Outdated
| The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view. | ||
|
|
||
| Each ‘CREATE OR REPLACE VIEW’ statement creates a new view version metadata file for that view. | ||
| Each metadata file is self-sufficient. It contains the history of the last few operations performed on the view and can be used to roll back the view to a previous version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good thing to call out: "Each metadata file is self-sufficient".
I'm not sure what you mean by "and can be used to roll back the view to a previous version". It sounds like this is a suggestion that rolling back the pointer to an old metadata file is a possibility, which is not a good idea. Metadata files should always roll forward. Otherwise, structures like a current version log are not maintained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "roll back" here does not mean the metadata file, instead, it means rolling back the "version". I'd expect the metadata file to be rolled "forward" in this case.
site/docs/view-spec.md
Outdated
| | Required | versions | An array of structs describing the last few versions of the view. Controlled by the table property: “version.history.num_entries”. See more below. | | ||
| | Required | version-log | An array of structs describing the log of created versions. See more below. | | ||
| | Optional | schemas | A list of schemas, the same as the ‘schemas’ field from Iceberg table spec. | | ||
| | Optional | current-schema-id | ID of the current schema of the view | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be stored with each version rather than here? I think this is required to match the current version's schema, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Table's current-schema-id is stored in the similar place. I think it tracks the schema id of the current version, right?
Should we add schema-id to Version struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually added schema-id to SQL representation.
site/docs/view-spec.md
Outdated
| | Optional | field-aliases | A list of strings of field aliases E.g. a list of alias_name info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` | | ||
| | Optional | field-docs | A list of strings of field comments E.g. a list of ‘comment’ info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` | | ||
|
|
||
| “version-log” is an array of structs describing the log of the versions created. The struct has the following fields: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This describes when each version was considered "current". Creation is different and is stored in each version's metadata. This allows you to reconstruct what someone would have seen at some point in time. If the view has been updated and rolled back, this will show it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
site/docs/view-spec.md
Outdated
|
|
||
| | Required/Optional | Field Name | Description | | ||
| |-------------------|------------|-------------| | ||
| | Required | timestamp-ms | Timestamp expressed as ms since epoch for when the version of the view indicated in version-id field below was created | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't correct. This timestamp should be when the referenced version was made the current version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
site/docs/view-spec.md
Outdated
| "representations" : [ { => SQL metadata of the view | ||
| "type" : "sql", | ||
| "sql" : "SELECT *\nFROM\n base_tab\n", => original view SQL | ||
| "schema" : { => Schema of the view expressed in Iceberg types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be in schemas and referenced by ID either here or in versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Updated.
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new docs website and repository, please move the spec changes to docs/common/format, thank you!
site/docs/view-spec.md
Outdated
|
|
||
| ## Background and Motivation | ||
|
|
||
| Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already say "A view is a logical table" later, so we can save the first sentence to only say "most compute engines (e.g. Trino and Apache Spark) support views".
site/docs/spec.md
Outdated
| | _required_ | _required_ | `partition-spec` | JSON fields representation of the partition spec used to write the manifest | | ||
| | _optional_ | _required_ | `partition-spec-id` | ID of the partition spec used to write the manifest as a string | | ||
| | _optional_ | _required_ | `format-version` | Table format version number of the manifest as a string | | ||
| | _optional_ | _optional_ | `object-type` | Type of object this metadata file is for: "table" or "view". Default is "table". | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be better to say spec-type? Not sure if it is only for me, but object-type sounds very Java
site/docs/view-spec.md
Outdated
|
|
||
| The view version metadata file has the following fields: | ||
|
|
||
| | Required/Optional | Field Name | Description | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to have some consistency with the table spec tables: https://github.com/apache/iceberg/blob/master/docs/common/format/spec.md?plain=1#L508, can we make some changes to the tables in the view spec:
- use v1 instead of
Required/Optionalhas header - make column size consistent for everything before description
- lower case and bold
_required_or_optional_for required or optional values
site/docs/view-spec.md
Outdated
| | Required | format-version | Json format version number for the view metadata spec. The view metadata spec and the corresponding format-version is independent of table spec. Starts with 1 and is incremented when there is a breaking change to view metadata. | | ||
| | Required | object-type | Type of object this metadata file is for: "table" or "view". It must be set to "view" for all objects covered in this spec. | | ||
| | Required | location | Location of the view metadata files | | ||
| | Required | current-version-id | Current version of the view. Set to ‘1’ when the view is first created. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean by an empty view? A view without a SQL definition?
site/docs/view-spec.md
Outdated
| | Required | sql | A string representing SQL definition of the view as input | | ||
| | Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. | | ||
| | Optional | session-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables in the view definition. | | ||
| | Optional | session-namespace | An array of strings indicating namespace at the time view was created / replaced. Used similar to ‘session-catalog’ above. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"An array of strings indicating namespace" made me thought it was a list of namespaces until I then realized namespace is a list of string. I think people who don't have context would get confused here. I think it's more clear to say "Namespace to use when table or view references do not contain an explicit namespace", and then say namespace is serialized as a list of string.
site/docs/view-spec.md
Outdated
| | Required | type | It must be set to "sql" | | ||
| | Required | sql | A string representing SQL definition of the view as input | | ||
| | Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. | | ||
| | Optional | session-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables in the view definition. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
session-catalog as the name does not sound very intuitive to me, as Ryan says the user might be using a different session when evaluating this view, and will still use the catalog defined here to get the referenced table/view information by name. What about default-catalog?
site/docs/view-spec.md
Outdated
|
|
||
| | Required/Optional | Field Name | Description | | ||
| |-------------------|------------|-------------| | ||
| | Required | type | It must be set to "sql" | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already say "The rest of the fields are interpreted by the type of representation." above, I think we don't need type again in this table?
site/docs/view-spec.md
Outdated
|
|
||
| | Required/Optional | Field Name | Description | | ||
| |-------------------|------------|-------------| | ||
| | Required | type | A string indicating the type of representation. The only valid choice is "sql". The rest of the fields are interpreted by the type of representation. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saying The only valid choice is "sql" might make people doubt the necessity of the field. What about saying The format used to store this representation of the view, currently the only supported type is "sql"
site/docs/view-spec.md
Outdated
| | Optional | session-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables in the view definition. | | ||
| | Optional | session-namespace | An array of strings indicating namespace at the time view was created / replaced. Used similar to ‘session-catalog’ above. | | ||
| | Optional | field-aliases | A list of strings of field aliases E.g. a list of alias_name info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` | | ||
| | Optional | field-docs | A list of strings of field comments E.g. a list of ‘comment’ info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are field aliases not a part of schemas?
site/docs/view-spec.md
Outdated
| The path is intentionally similar to the path for iceberg tables and contains a ‘metadata’ directory. (`METASTORE_WAREHOUSE_DIR/<dbname>.db/<viewname>/metadata`) | ||
|
|
||
| The metadata directory contains View Version Metadata files. The text after '=>' symbols describes the fields. | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite a few places are not up-to-date, I suppose you will fix later?
|
Thanks @rdblue @jackye1995 for the comments! Will take a look shortly. |
Co-authored-by: John Zhuge <[email protected]>
|
This commit rebased to the master and relocated the spec to the new path. Incorporated all comments from @rdblue. |
|
Removed Instead, we can use OOB flag to determine the spec type. For example, HMS catalogs can use table type while Hadoop catalogs can use file extension ".viewmetadata.json" vs ".metadata.json". |
|
|
||
| * A common metadata format for view metadata, similar to how Iceberg supports a common table format for tables. | ||
| * The view metadata format specification | ||
| * Includes storage format as well as APIs to write/read the metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is actually in this spec, and it probably shouldn't be. Should we remove this?
| ## Overview | ||
|
|
||
| View metadata storage mirrors how Iceberg table metadata is stored and retrieved. View metadata is maintained in metadata files. All changes to view state create a new view metadata file and completely replace the old metadata using an atomic swap. Like Iceberg tables, this atomic swap is delegated to the metastore that tracks tables and/or views by name. The view metadata file tracks the view schema, custom properties, current and past versions, as well as other metadata. | ||
| Each metadata file is self-sufficient. It contains the history of the last few operations performed on the view and can be used to roll back the view to a previous version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a little misleading to say the view metadata file "can be used to roll back the view to a previous version". Like tables, the view metadata files should only go forward. You can roll back by updating a pointer to an older version tracked in metadata but we do not want people to think that they can roll back by pointing to an older metadata file.
| | Required/Optional | Field Name | Description | | ||
| |-------------------|------------|-------------| | ||
| | Required | format-version | An integer version number for the view format. Currently, this must be 1. Implementations must throw an exception if the view's version is higher than the supported version. | | ||
| | Required | location | The view's base location. This is used to determine where to store manifest files and view metadata files. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manifest files? Looks like this may be a copy-paste error.
| | Required | location | The view's base location. This is used to determine where to store manifest files and view metadata files. | | ||
| | Required | current-version-id | Current version of the view. Set to ‘1’ when the view is first created. | | ||
| | Optional | properties | A string to string map of view properties. This is used for metadata such as "comment" and for settings that affect view maintenance. This is not intended to be used for arbitrary metadata. | | ||
| | Required | versions | An array of structs describing the last known versions of the view. Controlled by the table property: “version.history.num-entries”. See section [Versions](#versions). | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably don't need "last" in "last known". Just "known versions of the view" should be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the "Controlled by the table property" sentence could be more clear: "The number of versions to retain is controlled ...".
|
|
||
| Each representation is stored as an object with only one common field "type". | ||
| The rest of the fields are interpreted based on the type. | ||
| There is only one type of representation defined in the spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this sentence is needed. Right now there's only one. But we want people to standardize these here.
| | Required/Optional | Field Name | Description | | ||
| |-------------------|------------|-------------| | ||
| | Required | type | A string indicating the type of representation. It is set to "sql" for this type. | | ||
| | Required | sql | A string representing the original view definition in SQL | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we requiring that this is a SELECT statement? I don't think we want this set to CREATE VIEW ... AS SELECT (the full create/replace statement).
| | Optional | field-docs | A list of strings of field comments optionally specified in the create view statement. The list should have the same length as the schema's top level fields. See the example below. | | ||
|
|
||
| For `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT col1, col2, ...`, | ||
| the field aliases are 'alias_name', 'alias_name2', and etc., and the field docs are 'docs', null, and etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to have an example of SQL and the values of the representation fields.
| For `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT col1, col2, ...`, | ||
| the field aliases are 'alias_name', 'alias_name2', and etc., and the field docs are 'docs', null, and etc. | ||
|
|
||
| ## Appendix A: An Example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than an example for interpretation, could we do the same thing as the table spec and have a JSON serialization section with good examples at a field level?
You could put a real SQL CREATE VIEW statement at the top and base the example on that. I think that would be more helpful because the description would be more specific than big pasted blobs of JSON with comments.
| @@ -0,0 +1,20 @@ | |||
| --- | |||
| title: "View Spec" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samredai, is this extra file needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is required. It adds the menu item to the left-nav on the docs site to link to this View Spec page which is built with the landing-page site.
This should be fine, although I think if you're listing in a file system to find views you're already in a bad place! |
|
While there are some minor things to update to make this more clear, I think this spec is good and unambiguous. That means it's time to merge it. We can fix the remaining issues in follow-up PRs and start discussing adoption and implementation. Thanks @jzhuge and @anjalinorwood! |
|
Thanks @rdblue ! |
|
Should I go ahead to create an PR for view implementation based on a subset of Netflix code |
|
Yes, please do! |
No description provided.