Add common view format spec #3188

anjalinorwood · 2021-09-27T17:31:53Z

No description provided.

rdblue · 2021-09-28T00:07:23Z

FYI @jacques-n. If you want to help review this to coordinate or deconflict with the Substrait effort, that would be really helpful!

jackye1995 · 2021-09-30T01:01:13Z

site/docs/view-spec.md

+
+### Metadata Location Management
+
+The view is created in the default warehouse location with database name and view name in the suffix. An example path is: 


I think we don't need to mention these defaults, because the default root location is different across different catalog implementations and should not be relied on. If it's managed exactly like table location, we should just say that instead.

jackye1995 · 2021-09-30T01:01:30Z

site/docs/view-spec.md

+
+Imagine the following sequence of operations: 
+
+* `create table base_tab(c1 int, c2 varchar);`


nit capitalize SQL keywords

jackye1995 · 2021-09-30T01:03:12Z

site/docs/view-spec.md

+* Propose a common metadata format for view metadata, similar to how Iceberg supports a common table format for tables. The common metadata format specification Includes storage format as well as APIs to write/read the metadata
+* Support versioning of views to track how a view evolved over time
+
+## Proposal


are we going to display this in the website? Should we just make it official documentation for the view spec under the Format section instead of saying it's a proposal? @rdblue

Yes, at this point I think we should edit this doc as though it is a pending spec, not a proposal.

jackye1995 · 2021-09-30T01:06:42Z

site/docs/view-spec.md

+| Required | version-id | Monotonically increasing id indicating the version of the view. Starts with “1”. |
+| Required | timestamp-ms | Timestamp expressed in ms since epoch at which the version of the view was created. |
+| Required | summary | A string to string map of view properties to track version metadata. This field can be used by engines to store any necessary properties. Two currently required properties are described below. |
+| Required | sql | A string representing SQL definition of the view as input |


Given the development of Substrait, I feel we can use a more generic word (like "content", "representation"?), and then in the future we can make "substrait" a dialect? Would that work? @jacques-n

@jacques-n could you comment?

@rdblue and I both had comments below. In general, it seems like @rdblue is leaning towards a collection of representations with at least one defined. I interpreted that as something like this (pseudo-json this time):

representations: [ {"type": "sql", "dialect": "trino", "version": "361", "text": "CREATE VIEW v1 as SELECT * from foo"} {"type": "sql", "dialect": "hive", "version": "3.1.2", "text": "CREATE VIEW v1 as SELECT * from foo"} {"type": "substrait", "version": "1", "text": "..."} {"type": "substrait", "version": "1", "binary": "..."} ]

Sounds good to me. Representation field "type" can be used to interpret the rest of the fields.

This version of spec will only spell out "sql" type.

After giving it some thought, I'd suggest adding representations later. Here are a few reasons:

For the majority of the use cases, we just need to store the sql dialect when the view is created.

Current spec does not spell how to convert and store other sql dialect. When that day comes, we might decide to convert on the fly without storing converted sql in json.

When a concrete representation is proposed, we can discuss how to extend the json. The field sql and dialect might be a fallback for the representations list.

In short, field sql and dialect seem to be enough for now. Let's keep it simple.

@jacques-n How about this trade-off?

Support a list of representations

Types should be well defined

Only allow one object per type in the list

Only define one type "OriginalSql" in this spec

Need to go through future discussions to add more types

I agree @jzhuge that we shouldn't define a ton of representations up front. Though I am not sure what the downside of @jacques-n proposal is. Specifically: adding sql type and dialect. It is likely that all initial implementations would only contain 1 element in the list and only the dialect that the view was created by. But thats ok, we could add features to extend that list later: dialect translation or substrait etc.

I am not sure what the OriginalSql above would be? To me we have to specify a dialect so others know if they can read the sql or not right?

losipiuk · 2021-09-30T12:01:21Z

site/docs/view-spec.md

+Each compute engine stores the metadata of the view in its proprietary format in the metastore of choice. Thus, views created from one engine can not be read or altered easily from another engine even when engines share the metastore as well as the storage system. This document proposes standardizing the view metadata for ease of sharing the views across engines.
+
+### Goals
+* Propose a common metadata format for view metadata, similar to how Iceberg supports a common table format for tables. The common metadata format specification Includes storage format as well as APIs to write/read the metadata


bullet point before Includes

jacques-n · 2021-10-04T05:15:41Z

@rdblue, just saw your message. Didn't get a notification for some reason. Will take a look.

jacques-n · 2021-10-04T19:03:30Z

A couple of thoughts:

Was there a discussion of resolved SQL? Most systems will want to store this one way or another. The schema field can potentially help with this but it feels somewhat backwards. It's actually strange that field-aliases is populated for a view statement that has that part of the statement but the schema is used to resolve field names in the case that a * or arbitrary expression (e.g. select a+b from foo without field as clause) is used. I'm inclined to think that the we should include both resolved and unresolved sql to be explicit about such things. Really, the only reason to have unresolved is so people can get back to what they typed. Engines need resolved to be reliably consistent.
In the context of Substrait: I think there are two distinct use cases. The first is SQL view with substrait plan for reliable consumption cross engine. The second is a substrait plan-based view that doesn't have associated SQL. The latter exists when the FE that produces the plan isn't sql-based (for example, Pandas could be the producer of the Substrait plan). It seems like there probably should be some kind of union such as the following (using pseudo protobuf since I'm in that world right now):

message ProcessingInstruction {

  oneof processing_type {
    Sql sql = 1;
    SubstraitPlan plan = 2; // to be added in future.
  }

   message Sql {
    string text = 1;
    string dialect = 2;
    SubstraitPlan plan = 3; // to be added in future.
  }
  
  // to be added in future.
  message SubstraitPlan {
    int32 substrait_version = 1;
    binary plan = 2;
  }
}

jacques-n · 2021-10-04T23:50:36Z

To further the point on resolved sql: there is also something weird when working with table references and it not being clear if a reference is absolute or relative. For example, if I create a view when my current schema is <root>.s1 and the sql for the view is something along the lines of select * from t1. Imagine in this situation that initially there is a <root>.t1 but not a <root>.s1.t1. In this case, the view will initially bind to <root>.t1. Let's then imagine I create table (or view) called <root>.s1.t1. If you apply resolution rules at view consumption times, it will resolve to <root>.s1.t1 even though that was not the intention. If you had saved the resolved sql at view creation time, it would remain point to <root>.t1.

What is expected behavior in the case of a Iceberg view?

(note that I'm using <root> here to mean the top of the namespace, not an actual name. Just seemed easier to include it explicitly to clarify the scenario since typically resolution is first the current schema followed by the root schema)

anjalinorwood · 2021-10-07T18:21:20Z

I take it that we are driving towards consensus. @jzhuge will take this forward with the PR for implementation (as well address any outstanding comments).

rdblue · 2021-10-07T23:24:34Z

@jacques-n, thanks for the example. I was going to ask for one to clarify about SQL resolution but you added one first!

What is expected behavior in the case of a Iceberg view?

This is a case I hadn't considered much because I thought that table resolution would not change based on the existence of tables.

The resolution in Trino never has this case because it always needs 3 parts:

If there are 3 parts, they are (catalog, schema, table)
If there are 2 parts, they are (schema, table) and the current catalog is used
If there is 1 part, it is the table and the current catalog and schema are used

We also specifically avoided this ambiguity in Spark when building the multi-part identifier support. We created rules so that there is only ever one way to resolve a table reference:

If the name is a single identifier, use it as the table name, use the current catalog, and current namespace from the session.
(Otherwise, the name is multiple parts)
If the first part is a catalog, use that catalog. Use the last identifier part as the table name and any remaining parts as the namespace.
If the first part is not a catalog, use the current catalog from the session. Use the last identifier part as the table name and the remaining parts (including the first that was not a catalog) as the namespace.

For a fixed set of catalogs, these rules provide an unambiguous table reference. There is some ambiguity if the set of catalogs changes, but that's a rare event because it is something done primarily by admins rather than individual users (like creating tables). We decided against the idea of falling back to trying the other branch if using rule 2 or rule 3 failed to load a table.

For the view spec, we have tried to keep enough information that resolution is always done the same way. That's why the current catalog and namespace are stored as part of the view definition. We could also store the set of catalogs defined, but if there are some SQL systems that don't have an unambiguous resolution (meaning not based on the existence of tables) then it might not matter and we should decide how to handle it.

The argument against requiring a resolved plan is that it may not be possible to produce one. I think Spark in particular used to store resolved plans but gave up on being able to produce them at all. They're also much longer and tended to cause views to break by being truncated in a database column even faster. And if we have a substrait plan, we probably don't need the resolved SQL.

I think my preference is to allow storing multiple optional representations. We know that we will need the original SQL for SQL-configured views. And we know that storing a standard query plan (like Substrait) is preferred in the long term over even a resolved SQL string because of dialect problems. Since even the SQL may be optional for views created directly as portable plans, I think it makes sense to make all of these representations possible and optional.

This is looking more like @jackye1995's original suggestion to be able to store different SQL for a given dialect. Maybe we should reconsider that.

jacques-n · 2021-10-20T02:03:06Z

@rdblue , I think your comments are mostly focused on table identifiers. I'll respond to those below.

Field names are probably the more important of the two. We should specify clearly what it means for a view to be "invalid". For example, if the SQL parses but the column names or column types are different, I would expect this to fail with an invalid view definition error. Additionally, I think we should probably cleanup the weird dual column definition pattern (if a user expressed column name aliases in the create view versus the population of derived * column names in the schema). Also, if we're capturing what the user wrote, we should probably capture the CREATE VIEW ... part of the command (not just the inner sql). I think this is consistent with how other tools expose the data.

WRT to table identifiers, I struggle with being opinionated about resolution given that end users may work in a system that already has a particular resolution behavior.

jzhuge · 2021-10-27T22:14:49Z

Also, if we're capturing what the user wrote, we should probably capture the CREATE VIEW ... part of the command (not just the inner sql). I think this is consistent with how other tools expose the data.

@jacques-n Unfortunately it will make it harder for other engine to consume the sql text, e.g., Presto doesn't understand Spark's syntax with column alias. To make it work, we might have to divide CREATE VIEW text into 2 parts, 1 for sql text, 1 for the rest.

However, since CREATE VIEW syntax is fairly stable for years, field sql and optional field-* should be enough to re-create the CREATE VIEW statement in different engines. We can add more optional field to support new syntax down the road.

rymurr · 2021-11-12T13:32:21Z

site/docs/view-spec.md

+
+### Storage Format
+
+The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on S3 for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view. 


Rather than refer to a specific technology I suggest a generic storage name

Suggested change

The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on S3 for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.

The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.

rymurr · 2021-11-12T13:52:00Z

site/docs/view-spec.md

+| Required/Optional | Field Name | Description |
+|-------------------|------------|-------------|
+| Required | format-version | Json format version number for the view metadata spec. The view metadata spec and the corresponding format-version is independent of table spec. Starts with 1 and is incremented when there is a breaking change to view metadata. |
+| Required | object-type    | String identifying the type of object this metadata file is for. ‘View’ for all objects covered in this spec. |


does the table spec need to be updated to add this field also?

I think so, as it is necessary to differentiate view from table for catalogs without metastore which typically specifies the object type.

rymurr · 2021-11-12T13:54:02Z

site/docs/view-spec.md

+| Optional | current-schema-id | ID of the current schema of the view |
+
+
+‘schemas’ is an array of structs with a field as shown below:


dumb question, why is this optional?

Some catalogs may decide to analyze the schema when using the view thus there is no need to store the schema when creating the view.

Can it be made non-optional later? I think optional is fine as long as views are dialect specific. Eg EngineX knows it computes schema at run time so it doesn't write it and view creation time. But we expect to add dialect free views in later versions, in which case I think schema would no longer be optional. That sounds like a breaking change in this spec, maybe easier to make it required now?

Fine with me as I haven't seen a use case where schema is not available (too expensive to obtain?) at creation. Anyone object to making it required?

rymurr · 2021-11-26T20:28:27Z

site/docs/view-spec.md

+| Required | version-id | Monotonically increasing id indicating the version of the view. Starts with “1”. |
+| Required | timestamp-ms | Timestamp expressed in ms since epoch at which the version of the view was created. |
+| Required | summary | A string to string map of view properties to track version metadata. This field can be used by engines to store any necessary properties. Two currently required properties are described below. |
+| Required | sql | A string representing SQL definition of the view as input |


I agree @jzhuge that we shouldn't define a ton of representations up front. Though I am not sure what the downside of @jacques-n proposal is. Specifically: adding sql type and dialect. It is likely that all initial implementations would only contain 1 element in the list and only the dialect that the view was created by. But thats ok, we could add features to extend that list later: dialect translation or substrait etc.

I am not sure what the OriginalSql above would be? To me we have to specify a dialect so others know if they can read the sql or not right?

rymurr · 2021-11-26T20:39:19Z

site/docs/view-spec.md

+| Required | summary | A string to string map of view properties to track version metadata. This field can be used by engines to store any necessary properties. Two currently required properties are described below. |
+| Required | sql | A string representing SQL definition of the view as input |
+| Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. |
+| Required | default-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables used in the view definition. |


This was brought up on the original proposal as well, these two fields seem a bit specific to certain types of engines. Maybe it would be helpful to me to see some examples from different engines and how views would be represented in them.

A few more questions:

what if the catalog is set up differently or with a different name, can these be overriden?

why are they required?

how are these used at run time. If I have a query select * from baz and the default-catalog=foo and default-namespace=bar are we doing string replacement to then execute select * from foo.bar.baz or is it assumed that the user has to run a USE CATALOG foo;USE NAMESPACE bar statement first? String replacement sounds iffy as it can't really be done w/o understanding the sql (which we don't want to do yet)

Catalog setup has to be consistent (e.g., no rename) between view creation and usage.

Not required. Fixed. Also change the field names to session-catalog and session-namespace, because "default" is a little confusing.

Fully qualify identifiers in the parsed plan. Don't do string replacement which is iffy as you pointed out.

re 3. how do we get "Fully qualify identifiers in the parsed plan. " It still isn't clear to me how these values should be used at runtime.

For example, view sql is select * from tbl1, session-catalog is cat1, and session-namespace is ns1. When the view is accessed, after CTE substitution, we will perform view substitution which looks for unresolved relations in the parsed plan for view sql. Here we find tbl which does not have namespace and catalog, we will convert it to cat1.ns1.tbl1.

So an engine has to use these fields in its planning stage? That makes sense, sorry for the misunderstanding. I suppose this is potentially iterative. Eg search for tbl1 and not found, search for ns1.tbl1 not found, search for cat1.ns1.tbl1 found?

Or is it assumed that only iceberg tables are in an iceberg view so the lookup can be done from inside iceberg code w/o engines having to change?

No iterative search. If the identifier has 1 part, tbl -> cat1.ns1; if the identifier has 2 parts, ns.tbl -> cat1.ns.tbl. Ultimately the purpose of these 2 fields is to help query engines resolve identifiers without ambiguity. Of course, engines can choose to send us sql text with all identifiers qualified, then these 2 fields are optional.

rymurr · 2021-11-26T20:41:18Z

site/docs/view-spec.md

+| Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. |
+| Required | default-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables used in the view definition. |
+| Required | default-namespace | An array of strings indicating namespace at the time view was created / replaced. Used similar to ‘default-catalog’ above. |
+| Optional | field-aliases | A list of strings of field aliases E.g. a list of alias_name info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` |


I am not sure I understand the purpose of this field. Could you please clarify? eg how would this alias be different from aalias in select a as aalias from foo

Aliasing the columns in select star sql. This is supported in Spark, Hive, and a few other engines.

sorry @jzhuge I am not sure I understand your response.

Here is the syntax:

CREATE VIEW v1 (alias1 COMMENT 'comment for col1', alias2) AS SELECT * FROM tbl

Table tbl has 2 columns: col1 and col2.

You are not the first one who find this feature hard to grasp. Presto does not support it. I'd suggest removing this field for this PR and have a follow-up discussion about the view column alias feature.

ah I get it now. Interesting. Yeah, I would prefer to defer this as it sounds rather engine specific.

This syntax is actually quite common. It's reasonable to not include it right now to get this in, but I think we'll want to add it fairly soon so that we can store the whole set of CREATE VIEW metadata.

rymurr · 2021-11-26T20:42:15Z

site/docs/view-spec.md

+| Required/Optional | Key | Value |
+|-------------------|-----|-------|
+| Required | operation |A string value indicating the view operation that caused this metadata to be created. Allowed values are “CREATE” and “REPLACE” |
+| Required | engine-version | A string value indicating the version of the engine that performed the operation (create / replace) |


I guess this field is redundant if we go with the suggestion above for representations?

You are right

jzhuge · 2022-01-03T22:47:52Z

Hi @rymurr @jackye1995 @jacques-n @losipiuk @rdblue , the spec has been updated with all review comments addressed. The main addition is the "representations" list. There are also a few minor reorgs and cleanups. Please take another look.

rymurr · 2022-01-17T09:08:27Z

this looks great @jzhuge! Thanks!

nastra

looks really great

sumedhsakdeo · 2022-01-21T17:53:26Z

site/docs/view-spec.md

+          "type" : "string",
+          "doc" : ""
+        } ]
+      },


In future, is there a plan to parse the SQL and capture referenced tables/views by the view as part of the View Metadata?

Not planned. What use case do you have in mind?

Sure, use-cases are warehouse management related. For example, table-level audit logging access and ACLs enforcements for tables queried through views.

Sounds interesting. For this pull request, we are trying to keep things minimal, but we can definitely start a separate thread for discussion.

As a workaround, you can store the information as view properties.

Agree, let's discuss the expansion separately. Great PR. 👍

jzhuge · 2022-01-25T16:27:37Z

@rdblue @anjalinorwood @jacques-n @jackye1995 Are we ready to merge this PR?

anjalinorwood · 2022-02-10T21:37:44Z

Hi John, LGTM! Thank you for picking up this PR to drive it to completion.

rdblue · 2022-02-14T00:06:08Z

site/docs/spec.md

 | _required_ | _required_ | `partition-spec`    | JSON fields representation of the partition spec used to write the manifest  |
 | _optional_ | _required_ | `partition-spec-id` | ID of the partition spec used to write the manifest as a string              |
 | _optional_ | _required_ | `format-version`    | Table format version number of the manifest as a string                      |
+| _optional_ | _optional_ | `object-type`       | Type of object this metadata file is for: "table" or "view". Default is "table". |


I would state that "if missing, assumed to be "table"` rather than "default is table". Using "default" doesn't quite seem correct to me.

would it be better to say spec-type? Not sure if it is only for me, but object-type sounds very Java

Do not like object-type either. How about manifest-type over spec-type?

Type of this manifest file, if missing, assumed to be "table".

We need to be careful about using the term "manifest". That refers to manifest files in the table spec and not these metadata files.

I'm okay with "spec-type" or "object-type". I don't think we should worry too much about this.

rdblue · 2022-02-14T00:06:37Z

site/docs/spec.md

 |Metadata field|JSON representation|Example|
 |--- |--- |--- |
 |**`format-version`**|`JSON int`|`1`|
+|**`object-type`**|`JSON string`|`table`|


Nit: There should be double quotes around table because the example should be a JSON string, like the other strings below.

rdblue · 2022-02-14T00:08:24Z

site/docs/view-spec.md

+
+## Background and Motivation
+
+Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace.


Is it necessary to say "commonly known as views"?

We already say "A view is a logical table" later, so we can save the first sentence to only say "most compute engines (e.g. Trino and Apache Spark) support views".

rdblue · 2022-02-14T00:09:38Z

site/docs/view-spec.md

+
+## Background and Motivation
+
+Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace.


The last sentence, "Views and tables occupy the same namespace" doesn't seem necessary. That is a catalog choice, right? It doesn't affect this spec if someone uses it for catalogs that do not mix views and tables so that they occupy separate namespaces.

rdblue · 2022-02-14T00:12:09Z

site/docs/view-spec.md

+## Background and Motivation
+
+Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace.
+Each compute engine stores the metadata of the view in its proprietary format in the metastore of choice. Thus, views created from one engine can not be read or altered easily from another engine even when engines share the metastore as well as the storage system. This document standardizes the view metadata for ease of sharing the views across engines.


Is this intended to be a separate paragraph? If so, I think you need a newline between this line and the previous one.

rdblue · 2022-02-14T00:17:15Z

site/docs/view-spec.md

+
+## Overview
+
+The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.


I think it is good to call out that the metadata model matches tables, but this goes a bit too far and states requirements for the metastore, like tracking "database name, owner, create time, last access time, ..."

I think this should be more similar to the table spec and bring in a lot of similar wording:

View metadata storage mirrors how Iceberg table metadata is stored and retrieved. View metadata is maintained in metadata files. All changes to view state create a new view metadata file and completely replace the old metadata using an atomic swap. Like Iceberg tables, this atomic swap is delegated to the metastore that tracks tables and/or views by name. The view metadata file tracks the view schema, custom properties, current and past versions, as well as other metadata.

Looks great. Thanks!

rdblue · 2022-02-14T00:18:26Z

site/docs/view-spec.md

+
+The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.
+
+Each ‘CREATE OR REPLACE VIEW’ statement creates a new view version metadata file for that view.


Did you intend to use backticks and fixed-width font for CREATE OR REPLACE VIEW?

Is it necessary to state this in terms of SQL? Above, I suggested "all changes to view state create a new metadata file and completely replace the old metadata ..." That seems sufficient to me.

Removed. Combined 2 paragraphs.

rdblue · 2022-02-14T00:20:18Z

site/docs/view-spec.md

+The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.
+
+Each ‘CREATE OR REPLACE VIEW’ statement creates a new view version metadata file for that view.
+Each metadata file is self-sufficient. It contains the history of the last few operations performed on the view and can be used to roll back the view to a previous version.


I think this is a good thing to call out: "Each metadata file is self-sufficient".

I'm not sure what you mean by "and can be used to roll back the view to a previous version". It sounds like this is a suggestion that rolling back the pointer to an old metadata file is a possibility, which is not a good idea. Metadata files should always roll forward. Otherwise, structures like a current version log are not maintained.

I think "roll back" here does not mean the metadata file, instead, it means rolling back the "version". I'd expect the metadata file to be rolled "forward" in this case.

rdblue · 2022-02-14T00:52:57Z

site/docs/view-spec.md

+| Required | versions | An array of structs describing the last few versions of the view. Controlled by the table property: “version.history.num_entries”. See more below. |
+| Required | version-log | An array of structs describing the log of created versions. See more below. |
+| Optional | schemas | A list of schemas, the same as the ‘schemas’ field from Iceberg table spec. |
+| Optional | current-schema-id | ID of the current schema of the view |


Should this be stored with each version rather than here? I think this is required to match the current version's schema, right?

Table's current-schema-id is stored in the similar place. I think it tracks the schema id of the current version, right?

Should we add schema-id to Version struct?

Actually added schema-id to SQL representation.

rdblue · 2022-02-14T00:54:15Z

site/docs/view-spec.md

+| Optional | field-aliases | A list of strings of field aliases E.g. a list of alias_name info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` |
+| Optional | field-docs | A list of strings of field comments E.g. a list of ‘comment’ info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` |
+
+“version-log” is an array of structs describing the log of the versions created. The struct has the following fields:


This describes when each version was considered "current". Creation is different and is stored in each version's metadata. This allows you to reconstruct what someone would have seen at some point in time. If the view has been updated and rolled back, this will show it.

rdblue · 2022-02-14T02:56:16Z

site/docs/view-spec.md

+
+| Required/Optional | Field Name | Description |
+|-------------------|------------|-------------|
+| Required | timestamp-ms | Timestamp expressed as ms since epoch for when the version of the view indicated in version-id field below was created |


This isn't correct. This timestamp should be when the referenced version was made the current version.

rdblue · 2022-02-14T02:57:38Z

site/docs/view-spec.md

+    "representations" : [ { => SQL metadata of the view
+      "type" : "sql",
+      "sql" : "SELECT *\nFROM\n  base_tab\n", => original view SQL
+      "schema" : { => Schema of the view expressed in Iceberg types


Shouldn't this be in schemas and referenced by ID either here or in versions?

You are right. Updated.

jackye1995

With the new docs website and repository, please move the spec changes to docs/common/format, thank you!

jackye1995 · 2022-02-14T02:25:21Z

site/docs/view-spec.md

+
+## Background and Motivation
+
+Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace.


We already say "A view is a logical table" later, so we can save the first sentence to only say "most compute engines (e.g. Trino and Apache Spark) support views".

jackye1995 · 2022-02-14T02:27:04Z

site/docs/spec.md

 | _required_ | _required_ | `partition-spec`    | JSON fields representation of the partition spec used to write the manifest  |
 | _optional_ | _required_ | `partition-spec-id` | ID of the partition spec used to write the manifest as a string              |
 | _optional_ | _required_ | `format-version`    | Table format version number of the manifest as a string                      |
+| _optional_ | _optional_ | `object-type`       | Type of object this metadata file is for: "table" or "view". Default is "table". |


would it be better to say spec-type? Not sure if it is only for me, but object-type sounds very Java

jackye1995 · 2022-02-14T02:39:55Z

site/docs/view-spec.md

+
+The view version metadata file has the following fields:
+
+| Required/Optional | Field Name | Description |


to have some consistency with the table spec tables: https://github.com/apache/iceberg/blob/master/docs/common/format/spec.md?plain=1#L508, can we make some changes to the tables in the view spec:

use v1 instead of Required/Optional has header

make column size consistent for everything before description

lower case and bold _required_ or _optional_ for required or optional values

jackye1995 · 2022-02-14T02:40:28Z

site/docs/view-spec.md

+| Required | format-version | Json format version number for the view metadata spec. The view metadata spec and the corresponding format-version is independent of table spec. Starts with 1 and is incremented when there is a breaking change to view metadata. |
+| Required | object-type    | Type of object this metadata file is for: "table" or "view". It must be set to "view" for all objects covered in this spec. |
+| Required | location | Location of the view metadata files |
+| Required | current-version-id | Current version of the view. Set to ‘1’ when the view is first created. |


what do you mean by an empty view? A view without a SQL definition?

jackye1995 · 2022-02-14T02:49:15Z

site/docs/view-spec.md

+| Required | sql | A string representing SQL definition of the view as input |
+| Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. |
+| Optional | session-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables in the view definition. |
+| Optional | session-namespace | An array of strings indicating namespace at the time view was created / replaced. Used similar to ‘session-catalog’ above. |


"An array of strings indicating namespace" made me thought it was a list of namespaces until I then realized namespace is a list of string. I think people who don't have context would get confused here. I think it's more clear to say "Namespace to use when table or view references do not contain an explicit namespace", and then say namespace is serialized as a list of string.

jackye1995 · 2022-02-14T02:55:36Z

site/docs/view-spec.md

+| Required | type | It must be set to "sql" |
+| Required | sql | A string representing SQL definition of the view as input |
+| Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. |
+| Optional | session-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables in the view definition. |


session-catalog as the name does not sound very intuitive to me, as Ryan says the user might be using a different session when evaluating this view, and will still use the catalog defined here to get the referenced table/view information by name. What about default-catalog?

jackye1995 · 2022-02-14T03:03:43Z

site/docs/view-spec.md

+
+| Required/Optional | Field Name | Description |
+|-------------------|------------|-------------|
+| Required | type | It must be set to "sql" |


we already say "The rest of the fields are interpreted by the type of representation." above, I think we don't need type again in this table?

jackye1995 · 2022-02-14T03:04:36Z

site/docs/view-spec.md

+
+| Required/Optional | Field Name | Description |
+|-------------------|------------|-------------|
+| Required | type | A string indicating the type of representation. The only valid choice is "sql". The rest of the fields are interpreted by the type of representation. |


Saying The only valid choice is "sql" might make people doubt the necessity of the field. What about saying The format used to store this representation of the view, currently the only supported type is "sql"

jackye1995 · 2022-02-14T03:05:58Z

site/docs/view-spec.md

+| Optional | session-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables in the view definition. |
+| Optional | session-namespace | An array of strings indicating namespace at the time view was created / replaced. Used similar to ‘session-catalog’ above. |
+| Optional | field-aliases | A list of strings of field aliases E.g. a list of alias_name info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` |
+| Optional | field-docs | A list of strings of field comments E.g. a list of ‘comment’ info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` |


why are field aliases not a part of schemas?

jackye1995 · 2022-02-14T03:06:53Z

site/docs/view-spec.md

+The path is intentionally similar to the path for iceberg tables and contains a ‘metadata’ directory. (`METASTORE_WAREHOUSE_DIR/<dbname>.db/<viewname>/metadata`)
+
+The metadata directory contains View Version Metadata files. The text after '=>' symbols describes the fields.
+```


Quite a few places are not up-to-date, I suppose you will fix later?

jzhuge · 2022-02-15T16:43:05Z

Thanks @rdblue @jackye1995 for the comments! Will take a look shortly.

Co-authored-by: John Zhuge <[email protected]>

jzhuge · 2022-03-31T18:49:35Z

This commit rebased to the master and relocated the spec to the new path.

Incorporated all comments from @rdblue.
Working on @jackye1995 comments.

jzhuge · 2022-03-31T18:53:33Z

Removed object-type from the proposal because listing views only or tables only may be very inefficient as we need to read each json file and determine its object/manifest type.

Instead, we can use OOB flag to determine the spec type. For example, HMS catalogs can use table type while Hadoop catalogs can use file extension ".viewmetadata.json" vs ".metadata.json".

rdblue · 2022-04-03T23:39:54Z

format/view-spec.md

+
+* A common metadata format for view metadata, similar to how Iceberg supports a common table format for tables.
+* The view metadata format specification
+  * Includes storage format as well as APIs to write/read the metadata.


I don't think this is actually in this spec, and it probably shouldn't be. Should we remove this?

rdblue · 2022-04-03T23:42:02Z

format/view-spec.md

+## Overview
+
+View metadata storage mirrors how Iceberg table metadata is stored and retrieved. View metadata is maintained in metadata files. All changes to view state create a new view metadata file and completely replace the old metadata using an atomic swap. Like Iceberg tables, this atomic swap is delegated to the metastore that tracks tables and/or views by name. The view metadata file tracks the view schema, custom properties, current and past versions, as well as other metadata.
+Each metadata file is self-sufficient. It contains the history of the last few operations performed on the view and can be used to roll back the view to a previous version.


I think it is a little misleading to say the view metadata file "can be used to roll back the view to a previous version". Like tables, the view metadata files should only go forward. You can roll back by updating a pointer to an older version tracked in metadata but we do not want people to think that they can roll back by pointing to an older metadata file.

rdblue · 2022-04-03T23:42:56Z

format/view-spec.md

+| Required/Optional | Field Name | Description |
+|-------------------|------------|-------------|
+| Required | format-version | An integer version number for the view format. Currently, this must be 1. Implementations must throw an exception if the view's version is higher than the supported version. |
+| Required | location | The view's base location. This is used to determine where to store manifest files and view metadata files. |


Manifest files? Looks like this may be a copy-paste error.

rdblue · 2022-04-03T23:43:59Z

format/view-spec.md

+| Required | location | The view's base location. This is used to determine where to store manifest files and view metadata files. |
+| Required | current-version-id | Current version of the view. Set to ‘1’ when the view is first created. |
+| Optional | properties | A string to string map of view properties. This is used for metadata such as "comment" and for settings that affect view maintenance. This is not intended to be used for arbitrary metadata. |
+| Required | versions | An array of structs describing the last known versions of the view. Controlled by the table property: “version.history.num-entries”. See section [Versions](#versions). |


Probably don't need "last" in "last known". Just "known versions of the view" should be sufficient.

I think the "Controlled by the table property" sentence could be more clear: "The number of versions to retain is controlled ...".

rdblue · 2022-04-03T23:46:58Z

format/view-spec.md

+
+Each representation is stored as an object with only one common field "type".
+The rest of the fields are interpreted based on the type.
+There is only one type of representation defined in the spec.


I don't think this sentence is needed. Right now there's only one. But we want people to standardize these here.

rdblue · 2022-04-03T23:48:10Z

format/view-spec.md

+| Required/Optional | Field Name | Description |
+|-------------------|------------|-------------|
+| Required | type | A string indicating the type of representation. It is set to "sql" for this type. |
+| Required | sql | A string representing the original view definition in SQL |


Are we requiring that this is a SELECT statement? I don't think we want this set to CREATE VIEW ... AS SELECT (the full create/replace statement).

rdblue · 2022-04-03T23:50:17Z

format/view-spec.md

+| Optional | field-docs | A list of strings of field comments optionally specified in the create view statement. The list should have the same length as the schema's top level fields. See the example below. |
+
+For `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT col1, col2, ...`,
+the field aliases are 'alias_name', 'alias_name2', and etc., and the field docs are 'docs', null, and etc.


I think it would be helpful to have an example of SQL and the values of the representation fields.

rdblue · 2022-04-03T23:52:36Z

format/view-spec.md

+For `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT col1, col2, ...`,
+the field aliases are 'alias_name', 'alias_name2', and etc., and the field docs are 'docs', null, and etc.
+
+## Appendix A: An Example


Rather than an example for interpretation, could we do the same thing as the table spec and have a JSON serialization section with good examples at a field level?

You could put a real SQL CREATE VIEW statement at the top and base the example on that. I think that would be more helpful because the description would be more specific than big pasted blobs of JSON with comments.

rdblue · 2022-04-03T23:52:59Z

docs/format/view-spec.md

@@ -0,0 +1,20 @@
+---
+title: "View Spec"


@samredai, is this extra file needed?

Yes this is required. It adds the menu item to the left-nav on the docs site to link to this View Spec page which is built with the landing-page site.

rdblue · 2022-04-03T23:54:19Z

Removed object-type from the proposal because listing views only or tables only may be very inefficient as we need to read each json file and determine its object/manifest type.

This should be fine, although I think if you're listing in a file system to find views you're already in a bad place!

rdblue · 2022-04-03T23:56:21Z

While there are some minor things to update to make this more clear, I think this spec is good and unambiguous. That means it's time to merge it. We can fix the remaining issues in follow-up PRs and start discussing adoption and implementation. Thanks @jzhuge and @anjalinorwood!

jzhuge · 2022-04-05T17:07:10Z

Thanks @rdblue !

jzhuge · 2022-04-05T17:08:52Z

Should I go ahead to create an PR for view implementation based on a subset of Netflix code

rdblue · 2022-04-05T18:07:50Z

Yes, please do!

github-actions bot added the docs label Sep 27, 2021

jackye1995 reviewed Sep 30, 2021

View reviewed changes

losipiuk reviewed Sep 30, 2021

View reviewed changes

rymurr reviewed Nov 26, 2021

View reviewed changes

rymurr approved these changes Jan 17, 2022

View reviewed changes

nastra approved these changes Jan 21, 2022

View reviewed changes

sumedhsakdeo reviewed Jan 21, 2022

View reviewed changes

rdblue reviewed Feb 14, 2022

View reviewed changes

jackye1995 reviewed Feb 14, 2022

View reviewed changes

Spec: Add common view format spec

e09a094

Co-authored-by: John Zhuge <[email protected]>

jzhuge force-pushed the view_spec branch from ac6469c to e09a094 Compare March 31, 2022 18:46

rdblue reviewed Apr 3, 2022

View reviewed changes

rdblue approved these changes Apr 3, 2022

View reviewed changes

rdblue merged commit d10383b into apache:master Apr 3, 2022

jzhuge mentioned this pull request Apr 28, 2022

[WIP] API/Core: View support #4657

Closed

jzhuge mentioned this pull request Jun 7, 2022

API: Add view interfaces #4925

Merged

ajantha-bhat mentioned this pull request Nov 18, 2022

Docs: Update the view spec #6214

Merged

amogh-jahagirdar mentioned this pull request Apr 9, 2023

Core: View metadata implementation #6559

Closed


		### Metadata Location Management

		The view is created in the default warehouse location with database name and view name in the suffix. An example path is:


		Imagine the following sequence of operations:

		* `create table base_tab(c1 int, c2 varchar);`


		### Storage Format

		The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on S3 for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.

		\| Optional \| current-schema-id \| ID of the current schema of the view \|


		‘schemas’ is an array of structs with a field as shown below:


		## Background and Motivation

		Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace.


		## Overview

		The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.


		The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.

		Each ‘CREATE OR REPLACE VIEW’ statement creates a new view version metadata file for that view.


		The view version metadata file has the following fields:

		\| Required/Optional \| Field Name \| Description \|

Add common view format spec #3188

Add common view format spec #3188

Uh oh!

Conversation

anjalinorwood commented Sep 27, 2021

Uh oh!

rdblue commented Sep 28, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jacques-n commented Oct 4, 2021

Uh oh!

jacques-n commented Oct 4, 2021

Uh oh!

jacques-n commented Oct 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anjalinorwood commented Oct 7, 2021

Uh oh!

rdblue commented Oct 7, 2021

Uh oh!

jacques-n commented Oct 20, 2021

Uh oh!

jzhuge commented Oct 27, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

jacques-n commented Oct 4, 2021 •

edited

Loading

rymurr Dec 8, 2021 •

edited

Loading

jzhuge commented Jan 3, 2022 •

edited

Loading

sumedhsakdeo Jan 21, 2022 •

edited

Loading