Skip to content

Conversation

@anjalinorwood
Copy link
Contributor

No description provided.

@github-actions github-actions bot added the docs label Sep 27, 2021
@rdblue
Copy link
Contributor

rdblue commented Sep 28, 2021

FYI @jacques-n. If you want to help review this to coordinate or deconflict with the Substrait effort, that would be really helpful!


### Metadata Location Management

The view is created in the default warehouse location with database name and view name in the suffix. An example path is:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to mention these defaults, because the default root location is different across different catalog implementations and should not be relied on. If it's managed exactly like table location, we should just say that instead.


Imagine the following sequence of operations:

* `create table base_tab(c1 int, c2 varchar);`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit capitalize SQL keywords

* Propose a common metadata format for view metadata, similar to how Iceberg supports a common table format for tables. The common metadata format specification Includes storage format as well as APIs to write/read the metadata
* Support versioning of views to track how a view evolved over time

## Proposal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we going to display this in the website? Should we just make it official documentation for the view spec under the Format section instead of saying it's a proposal? @rdblue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at this point I think we should edit this doc as though it is a pending spec, not a proposal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

| Required | version-id | Monotonically increasing id indicating the version of the view. Starts with “1”. |
| Required | timestamp-ms | Timestamp expressed in ms since epoch at which the version of the view was created. |
| Required | summary | A string to string map of view properties to track version metadata. This field can be used by engines to store any necessary properties. Two currently required properties are described below. |
| Required | sql | A string representing SQL definition of the view as input |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the development of Substrait, I feel we can use a more generic word (like "content", "representation"?), and then in the future we can make "substrait" a dialect? Would that work? @jacques-n

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacques-n could you comment?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue and I both had comments below. In general, it seems like @rdblue is leaning towards a collection of representations with at least one defined. I interpreted that as something like this (pseudo-json this time):

representations: [
  {"type": "sql", "dialect": "trino", "version": "361", "text": "CREATE VIEW v1 as SELECT * from foo"}
  {"type": "sql", "dialect": "hive", "version": "3.1.2", "text": "CREATE VIEW v1 as SELECT * from foo"}
  {"type": "substrait", "version": "1", "text": "..."}
  {"type": "substrait", "version": "1", "binary": "..."}
]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. Representation field "type" can be used to interpret the rest of the fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version of spec will only spell out "sql" type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After giving it some thought, I'd suggest adding representations later. Here are a few reasons:

  • For the majority of the use cases, we just need to store the sql dialect when the view is created.
  • Current spec does not spell how to convert and store other sql dialect. When that day comes, we might decide to convert on the fly without storing converted sql in json.
  • When a concrete representation is proposed, we can discuss how to extend the json. The field sql and dialect might be a fallback for the representations list.

In short, field sql and dialect seem to be enough for now. Let's keep it simple.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacques-n How about this trade-off?

  • Support a list of representations
  • Types should be well defined
  • Only allow one object per type in the list
  • Only define one type "OriginalSql" in this spec
  • Need to go through future discussions to add more types

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree @jzhuge that we shouldn't define a ton of representations up front. Though I am not sure what the downside of @jacques-n proposal is. Specifically: adding sql type and dialect. It is likely that all initial implementations would only contain 1 element in the list and only the dialect that the view was created by. But thats ok, we could add features to extend that list later: dialect translation or substrait etc.

I am not sure what the OriginalSql above would be? To me we have to specify a dialect so others know if they can read the sql or not right?

Each compute engine stores the metadata of the view in its proprietary format in the metastore of choice. Thus, views created from one engine can not be read or altered easily from another engine even when engines share the metastore as well as the storage system. This document proposes standardizing the view metadata for ease of sharing the views across engines.

### Goals
* Propose a common metadata format for view metadata, similar to how Iceberg supports a common table format for tables. The common metadata format specification Includes storage format as well as APIs to write/read the metadata

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bullet point before Includes

@jacques-n
Copy link

@rdblue, just saw your message. Didn't get a notification for some reason. Will take a look.

@jacques-n
Copy link

A couple of thoughts:

  • Was there a discussion of resolved SQL? Most systems will want to store this one way or another. The schema field can potentially help with this but it feels somewhat backwards. It's actually strange that field-aliases is populated for a view statement that has that part of the statement but the schema is used to resolve field names in the case that a * or arbitrary expression (e.g. select a+b from foo without field as clause) is used. I'm inclined to think that the we should include both resolved and unresolved sql to be explicit about such things. Really, the only reason to have unresolved is so people can get back to what they typed. Engines need resolved to be reliably consistent.
  • In the context of Substrait: I think there are two distinct use cases. The first is SQL view with substrait plan for reliable consumption cross engine. The second is a substrait plan-based view that doesn't have associated SQL. The latter exists when the FE that produces the plan isn't sql-based (for example, Pandas could be the producer of the Substrait plan). It seems like there probably should be some kind of union such as the following (using pseudo protobuf since I'm in that world right now):
message ProcessingInstruction {

  oneof processing_type {
    Sql sql = 1;
    SubstraitPlan plan = 2; // to be added in future.
  }

   message Sql {
    string text = 1;
    string dialect = 2;
    SubstraitPlan plan = 3; // to be added in future.
  }
  
  // to be added in future.
  message SubstraitPlan {
    int32 substrait_version = 1;
    binary plan = 2;
  }
}

@jacques-n
Copy link

jacques-n commented Oct 4, 2021

To further the point on resolved sql: there is also something weird when working with table references and it not being clear if a reference is absolute or relative. For example, if I create a view when my current schema is <root>.s1 and the sql for the view is something along the lines of select * from t1. Imagine in this situation that initially there is a <root>.t1 but not a <root>.s1.t1. In this case, the view will initially bind to <root>.t1. Let's then imagine I create table (or view) called <root>.s1.t1. If you apply resolution rules at view consumption times, it will resolve to <root>.s1.t1 even though that was not the intention. If you had saved the resolved sql at view creation time, it would remain point to <root>.t1.

What is expected behavior in the case of a Iceberg view?

(note that I'm using <root> here to mean the top of the namespace, not an actual name. Just seemed easier to include it explicitly to clarify the scenario since typically resolution is first the current schema followed by the root schema)

@anjalinorwood
Copy link
Contributor Author

I take it that we are driving towards consensus. @jzhuge will take this forward with the PR for implementation (as well address any outstanding comments).

@rdblue
Copy link
Contributor

rdblue commented Oct 7, 2021

@jacques-n, thanks for the example. I was going to ask for one to clarify about SQL resolution but you added one first!

What is expected behavior in the case of a Iceberg view?

This is a case I hadn't considered much because I thought that table resolution would not change based on the existence of tables.

The resolution in Trino never has this case because it always needs 3 parts:

  • If there are 3 parts, they are (catalog, schema, table)
  • If there are 2 parts, they are (schema, table) and the current catalog is used
  • If there is 1 part, it is the table and the current catalog and schema are used

We also specifically avoided this ambiguity in Spark when building the multi-part identifier support. We created rules so that there is only ever one way to resolve a table reference:

  1. If the name is a single identifier, use it as the table name, use the current catalog, and current namespace from the session.
    (Otherwise, the name is multiple parts)
  2. If the first part is a catalog, use that catalog. Use the last identifier part as the table name and any remaining parts as the namespace.
  3. If the first part is not a catalog, use the current catalog from the session. Use the last identifier part as the table name and the remaining parts (including the first that was not a catalog) as the namespace.

For a fixed set of catalogs, these rules provide an unambiguous table reference. There is some ambiguity if the set of catalogs changes, but that's a rare event because it is something done primarily by admins rather than individual users (like creating tables). We decided against the idea of falling back to trying the other branch if using rule 2 or rule 3 failed to load a table.

For the view spec, we have tried to keep enough information that resolution is always done the same way. That's why the current catalog and namespace are stored as part of the view definition. We could also store the set of catalogs defined, but if there are some SQL systems that don't have an unambiguous resolution (meaning not based on the existence of tables) then it might not matter and we should decide how to handle it.

The argument against requiring a resolved plan is that it may not be possible to produce one. I think Spark in particular used to store resolved plans but gave up on being able to produce them at all. They're also much longer and tended to cause views to break by being truncated in a database column even faster. And if we have a substrait plan, we probably don't need the resolved SQL.

I think my preference is to allow storing multiple optional representations. We know that we will need the original SQL for SQL-configured views. And we know that storing a standard query plan (like Substrait) is preferred in the long term over even a resolved SQL string because of dialect problems. Since even the SQL may be optional for views created directly as portable plans, I think it makes sense to make all of these representations possible and optional.

This is looking more like @jackye1995's original suggestion to be able to store different SQL for a given dialect. Maybe we should reconsider that.

@jacques-n
Copy link

@rdblue , I think your comments are mostly focused on table identifiers. I'll respond to those below.

Field names are probably the more important of the two. We should specify clearly what it means for a view to be "invalid". For example, if the SQL parses but the column names or column types are different, I would expect this to fail with an invalid view definition error. Additionally, I think we should probably cleanup the weird dual column definition pattern (if a user expressed column name aliases in the create view versus the population of derived * column names in the schema). Also, if we're capturing what the user wrote, we should probably capture the CREATE VIEW ... part of the command (not just the inner sql). I think this is consistent with how other tools expose the data.

WRT to table identifiers, I struggle with being opinionated about resolution given that end users may work in a system that already has a particular resolution behavior.

@jzhuge
Copy link
Member

jzhuge commented Oct 27, 2021

Also, if we're capturing what the user wrote, we should probably capture the CREATE VIEW ... part of the command (not just the inner sql). I think this is consistent with how other tools expose the data.

@jacques-n Unfortunately it will make it harder for other engine to consume the sql text, e.g., Presto doesn't understand Spark's syntax with column alias. To make it work, we might have to divide CREATE VIEW text into 2 parts, 1 for sql text, 1 for the rest.

However, since CREATE VIEW syntax is fairly stable for years, field sql and optional field-* should be enough to re-create the CREATE VIEW statement in different engines. We can add more optional field to support new syntax down the road.


### Storage Format

The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on S3 for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than refer to a specific technology I suggest a generic storage name

Suggested change
The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on S3 for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.
The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.

| Required/Optional | Field Name | Description |
|-------------------|------------|-------------|
| Required | format-version | Json format version number for the view metadata spec. The view metadata spec and the corresponding format-version is independent of table spec. Starts with 1 and is incremented when there is a breaking change to view metadata. |
| Required | object-type | String identifying the type of object this metadata file is for. ‘View’ for all objects covered in this spec. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the table spec need to be updated to add this field also?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, as it is necessary to differentiate view from table for catalogs without metastore which typically specifies the object type.

| Optional | current-schema-id | ID of the current schema of the view |


‘schemas’ is an array of structs with a field as shown below:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dumb question, why is this optional?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some catalogs may decide to analyze the schema when using the view thus there is no need to store the schema when creating the view.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be made non-optional later? I think optional is fine as long as views are dialect specific. Eg EngineX knows it computes schema at run time so it doesn't write it and view creation time. But we expect to add dialect free views in later versions, in which case I think schema would no longer be optional. That sounds like a breaking change in this spec, maybe easier to make it required now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me as I haven't seen a use case where schema is not available (too expensive to obtain?) at creation. Anyone object to making it required?

| Required | version-id | Monotonically increasing id indicating the version of the view. Starts with “1”. |
| Required | timestamp-ms | Timestamp expressed in ms since epoch at which the version of the view was created. |
| Required | summary | A string to string map of view properties to track version metadata. This field can be used by engines to store any necessary properties. Two currently required properties are described below. |
| Required | sql | A string representing SQL definition of the view as input |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree @jzhuge that we shouldn't define a ton of representations up front. Though I am not sure what the downside of @jacques-n proposal is. Specifically: adding sql type and dialect. It is likely that all initial implementations would only contain 1 element in the list and only the dialect that the view was created by. But thats ok, we could add features to extend that list later: dialect translation or substrait etc.

I am not sure what the OriginalSql above would be? To me we have to specify a dialect so others know if they can read the sql or not right?

| Required | summary | A string to string map of view properties to track version metadata. This field can be used by engines to store any necessary properties. Two currently required properties are described below. |
| Required | sql | A string representing SQL definition of the view as input |
| Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. |
| Required | default-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables used in the view definition. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was brought up on the original proposal as well, these two fields seem a bit specific to certain types of engines. Maybe it would be helpful to me to see some examples from different engines and how views would be represented in them.

A few more questions:

  1. what if the catalog is set up differently or with a different name, can these be overriden?
  2. why are they required?
  3. how are these used at run time. If I have a query select * from baz and the default-catalog=foo and default-namespace=bar are we doing string replacement to then execute select * from foo.bar.baz or is it assumed that the user has to run a USE CATALOG foo;USE NAMESPACE bar statement first? String replacement sounds iffy as it can't really be done w/o understanding the sql (which we don't want to do yet)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Catalog setup has to be consistent (e.g., no rename) between view creation and usage.
  2. Not required. Fixed. Also change the field names to session-catalog and session-namespace, because "default" is a little confusing.
  3. Fully qualify identifiers in the parsed plan. Don't do string replacement which is iffy as you pointed out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re 3. how do we get "Fully qualify identifiers in the parsed plan. " It still isn't clear to me how these values should be used at runtime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, view sql is select * from tbl1, session-catalog is cat1, and session-namespace is ns1. When the view is accessed, after CTE substitution, we will perform view substitution which looks for unresolved relations in the parsed plan for view sql. Here we find tbl which does not have namespace and catalog, we will convert it to cat1.ns1.tbl1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So an engine has to use these fields in its planning stage? That makes sense, sorry for the misunderstanding. I suppose this is potentially iterative. Eg search for tbl1 and not found, search for ns1.tbl1 not found, search for cat1.ns1.tbl1 found?

Or is it assumed that only iceberg tables are in an iceberg view so the lookup can be done from inside iceberg code w/o engines having to change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No iterative search. If the identifier has 1 part, tbl -> cat1.ns1; if the identifier has 2 parts, ns.tbl -> cat1.ns.tbl. Ultimately the purpose of these 2 fields is to help query engines resolve identifiers without ambiguity. Of course, engines can choose to send us sql text with all identifiers qualified, then these 2 fields are optional.

| Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. |
| Required | default-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables used in the view definition. |
| Required | default-namespace | An array of strings indicating namespace at the time view was created / replaced. Used similar to ‘default-catalog’ above. |
| Optional | field-aliases | A list of strings of field aliases E.g. a list of alias_name info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand the purpose of this field. Could you please clarify? eg how would this alias be different from aalias in select a as aalias from foo

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aliasing the columns in select star sql. This is supported in Spark, Hive, and a few other engines.

Copy link
Contributor

@rymurr rymurr Dec 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry @jzhuge I am not sure I understand your response.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the syntax:

CREATE VIEW v1  (alias1 COMMENT 'comment for col1', alias2) AS
SELECT *
FROM tbl

Table tbl has 2 columns: col1 and col2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are not the first one who find this feature hard to grasp. Presto does not support it. I'd suggest removing this field for this PR and have a follow-up discussion about the view column alias feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah I get it now. Interesting. Yeah, I would prefer to defer this as it sounds rather engine specific.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This syntax is actually quite common. It's reasonable to not include it right now to get this in, but I think we'll want to add it fairly soon so that we can store the whole set of CREATE VIEW metadata.

| Required/Optional | Key | Value |
|-------------------|-----|-------|
| Required | operation |A string value indicating the view operation that caused this metadata to be created. Allowed values are “CREATE” and “REPLACE” |
| Required | engine-version | A string value indicating the version of the engine that performed the operation (create / replace) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this field is redundant if we go with the suggestion above for representations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right

@jzhuge
Copy link
Member

jzhuge commented Jan 3, 2022

Hi @rymurr @jackye1995 @jacques-n @losipiuk @rdblue , the spec has been updated with all review comments addressed. The main addition is the "representations" list. There are also a few minor reorgs and cleanups. Please take another look.

@rymurr
Copy link
Contributor

rymurr commented Jan 17, 2022

this looks great @jzhuge! Thanks!

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really great

"type" : "string",
"doc" : ""
} ]
},
Copy link

@sumedhsakdeo sumedhsakdeo Jan 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future, is there a plan to parse the SQL and capture referenced tables/views by the view as part of the View Metadata?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not planned. What use case do you have in mind?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, use-cases are warehouse management related. For example, table-level audit logging access and ACLs enforcements for tables queried through views.

Copy link
Member

@jzhuge jzhuge Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds interesting. For this pull request, we are trying to keep things minimal, but we can definitely start a separate thread for discussion.

As a workaround, you can store the information as view properties.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, let's discuss the expansion separately. Great PR. 👍

@jzhuge
Copy link
Member

jzhuge commented Jan 25, 2022

@rdblue @anjalinorwood @jacques-n @jackye1995 Are we ready to merge this PR?

@anjalinorwood
Copy link
Contributor Author

Hi John, LGTM! Thank you for picking up this PR to drive it to completion.

| _required_ | _required_ | `partition-spec` | JSON fields representation of the partition spec used to write the manifest |
| _optional_ | _required_ | `partition-spec-id` | ID of the partition spec used to write the manifest as a string |
| _optional_ | _required_ | `format-version` | Table format version number of the manifest as a string |
| _optional_ | _optional_ | `object-type` | Type of object this metadata file is for: "table" or "view". Default is "table". |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would state that "if missing, assumed to be "table"` rather than "default is table". Using "default" doesn't quite seem correct to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be better to say spec-type? Not sure if it is only for me, but object-type sounds very Java

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not like object-type either. How about manifest-type over spec-type?

Type of this manifest file, if missing, assumed to be "table".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be careful about using the term "manifest". That refers to manifest files in the table spec and not these metadata files.

I'm okay with "spec-type" or "object-type". I don't think we should worry too much about this.

|Metadata field|JSON representation|Example|
|--- |--- |--- |
|**`format-version`**|`JSON int`|`1`|
|**`object-type`**|`JSON string`|`table`|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: There should be double quotes around table because the example should be a JSON string, like the other strings below.


## Background and Motivation

Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to say "commonly known as views"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already say "A view is a logical table" later, so we can save the first sentence to only say "most compute engines (e.g. Trino and Apache Spark) support views".


## Background and Motivation

Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last sentence, "Views and tables occupy the same namespace" doesn't seem necessary. That is a catalog choice, right? It doesn't affect this spec if someone uses it for catalogs that do not mix views and tables so that they occupy separate namespaces.

## Background and Motivation

Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace.
Each compute engine stores the metadata of the view in its proprietary format in the metastore of choice. Thus, views created from one engine can not be read or altered easily from another engine even when engines share the metastore as well as the storage system. This document standardizes the view metadata for ease of sharing the views across engines.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intended to be a separate paragraph? If so, I think you need a newline between this line and the previous one.


## Overview

The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.
Copy link
Contributor

@rdblue rdblue Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is good to call out that the metadata model matches tables, but this goes a bit too far and states requirements for the metastore, like tracking "database name, owner, create time, last access time, ..."

I think this should be more similar to the table spec and bring in a lot of similar wording:

View metadata storage mirrors how Iceberg table metadata is stored and retrieved. View metadata is maintained in metadata files. All changes to view state create a new view metadata file and completely replace the old metadata using an atomic swap. Like Iceberg tables, this atomic swap is delegated to the metastore that tracks tables and/or views by name. The view metadata file tracks the view schema, custom properties, current and past versions, as well as other metadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks!


The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.

Each ‘CREATE OR REPLACE VIEW’ statement creates a new view version metadata file for that view.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you intend to use backticks and fixed-width font for CREATE OR REPLACE VIEW?

Is it necessary to state this in terms of SQL? Above, I suggested "all changes to view state create a new metadata file and completely replace the old metadata ..." That seems sufficient to me.

Copy link
Member

@jzhuge jzhuge Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. Combined 2 paragraphs.

The view metadata storage and retrieval mirrors how Iceberg table metadata is stored and retrieved. The view metadata is stored in a JSON file on object storage for ease of tracking the evolution of the view. Metastore continues to hold the view object with some properties such as database name, owner, create time, last access time and an indication that the object is a view.

Each ‘CREATE OR REPLACE VIEW’ statement creates a new view version metadata file for that view.
Each metadata file is self-sufficient. It contains the history of the last few operations performed on the view and can be used to roll back the view to a previous version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good thing to call out: "Each metadata file is self-sufficient".

I'm not sure what you mean by "and can be used to roll back the view to a previous version". It sounds like this is a suggestion that rolling back the pointer to an old metadata file is a possibility, which is not a good idea. Metadata files should always roll forward. Otherwise, structures like a current version log are not maintained.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "roll back" here does not mean the metadata file, instead, it means rolling back the "version". I'd expect the metadata file to be rolled "forward" in this case.

| Required | versions | An array of structs describing the last few versions of the view. Controlled by the table property: “version.history.num_entries”. See more below. |
| Required | version-log | An array of structs describing the log of created versions. See more below. |
| Optional | schemas | A list of schemas, the same as the ‘schemas’ field from Iceberg table spec. |
| Optional | current-schema-id | ID of the current schema of the view |
Copy link
Contributor

@rdblue rdblue Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be stored with each version rather than here? I think this is required to match the current version's schema, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Table's current-schema-id is stored in the similar place. I think it tracks the schema id of the current version, right?

Should we add schema-id to Version struct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually added schema-id to SQL representation.

| Optional | field-aliases | A list of strings of field aliases E.g. a list of alias_name info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` |
| Optional | field-docs | A list of strings of field comments E.g. a list of ‘comment’ info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` |

“version-log” is an array of structs describing the log of the versions created. The struct has the following fields:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This describes when each version was considered "current". Creation is different and is stored in each version's metadata. This allows you to reconstruct what someone would have seen at some point in time. If the view has been updated and rolled back, this will show it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated


| Required/Optional | Field Name | Description |
|-------------------|------------|-------------|
| Required | timestamp-ms | Timestamp expressed as ms since epoch for when the version of the view indicated in version-id field below was created |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't correct. This timestamp should be when the referenced version was made the current version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

"representations" : [ { => SQL metadata of the view
"type" : "sql",
"sql" : "SELECT *\nFROM\n base_tab\n", => original view SQL
"schema" : { => Schema of the view expressed in Iceberg types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be in schemas and referenced by ID either here or in versions?

Copy link
Member

@jzhuge jzhuge Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Updated.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new docs website and repository, please move the spec changes to docs/common/format, thank you!


## Background and Motivation

Most compute engines (e.g. Trino and Apache Spark) support logical views, commonly known as ‘views’. A view is a logical table that can be referenced by future queries. Views do not contain any data. Instead, the query stored by the view is executed every time the view is referenced by another query. Views and tables occupy the same namespace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already say "A view is a logical table" later, so we can save the first sentence to only say "most compute engines (e.g. Trino and Apache Spark) support views".

| _required_ | _required_ | `partition-spec` | JSON fields representation of the partition spec used to write the manifest |
| _optional_ | _required_ | `partition-spec-id` | ID of the partition spec used to write the manifest as a string |
| _optional_ | _required_ | `format-version` | Table format version number of the manifest as a string |
| _optional_ | _optional_ | `object-type` | Type of object this metadata file is for: "table" or "view". Default is "table". |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be better to say spec-type? Not sure if it is only for me, but object-type sounds very Java


The view version metadata file has the following fields:

| Required/Optional | Field Name | Description |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to have some consistency with the table spec tables: https://github.com/apache/iceberg/blob/master/docs/common/format/spec.md?plain=1#L508, can we make some changes to the tables in the view spec:

  1. use v1 instead of Required/Optional has header
  2. make column size consistent for everything before description
  3. lower case and bold _required_ or _optional_ for required or optional values

| Required | format-version | Json format version number for the view metadata spec. The view metadata spec and the corresponding format-version is independent of table spec. Starts with 1 and is incremented when there is a breaking change to view metadata. |
| Required | object-type | Type of object this metadata file is for: "table" or "view". It must be set to "view" for all objects covered in this spec. |
| Required | location | Location of the view metadata files |
| Required | current-version-id | Current version of the view. Set to ‘1’ when the view is first created. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by an empty view? A view without a SQL definition?

| Required | sql | A string representing SQL definition of the view as input |
| Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. |
| Optional | session-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables in the view definition. |
| Optional | session-namespace | An array of strings indicating namespace at the time view was created / replaced. Used similar to ‘session-catalog’ above. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"An array of strings indicating namespace" made me thought it was a list of namespaces until I then realized namespace is a list of string. I think people who don't have context would get confused here. I think it's more clear to say "Namespace to use when table or view references do not contain an explicit namespace", and then say namespace is serialized as a list of string.

| Required | type | It must be set to "sql" |
| Required | sql | A string representing SQL definition of the view as input |
| Required | dialect | A string specifying the dialect of the ‘sql’ field above. Used by engines to perform necessary translations to the SQL dialect supported by the engine. |
| Optional | session-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables in the view definition. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

session-catalog as the name does not sound very intuitive to me, as Ryan says the user might be using a different session when evaluating this view, and will still use the catalog defined here to get the referenced table/view information by name. What about default-catalog?


| Required/Optional | Field Name | Description |
|-------------------|------------|-------------|
| Required | type | It must be set to "sql" |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already say "The rest of the fields are interpreted by the type of representation." above, I think we don't need type again in this table?


| Required/Optional | Field Name | Description |
|-------------------|------------|-------------|
| Required | type | A string indicating the type of representation. The only valid choice is "sql". The rest of the fields are interpreted by the type of representation. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saying The only valid choice is "sql" might make people doubt the necessity of the field. What about saying The format used to store this representation of the view, currently the only supported type is "sql"

| Optional | session-catalog | A string that specifies the catalog of the user session when the view was created / replaced. Used to resolve the tables in the view definition. |
| Optional | session-namespace | An array of strings indicating namespace at the time view was created / replaced. Used similar to ‘session-catalog’ above. |
| Optional | field-aliases | A list of strings of field aliases E.g. a list of alias_name info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` |
| Optional | field-docs | A list of strings of field comments E.g. a list of ‘comment’ info specified in the following create view statement. `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT ...` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are field aliases not a part of schemas?

The path is intentionally similar to the path for iceberg tables and contains a ‘metadata’ directory. (`METASTORE_WAREHOUSE_DIR/<dbname>.db/<viewname>/metadata`)

The metadata directory contains View Version Metadata files. The text after '=>' symbols describes the fields.
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a few places are not up-to-date, I suppose you will fix later?

@jzhuge
Copy link
Member

jzhuge commented Feb 15, 2022

Thanks @rdblue @jackye1995 for the comments! Will take a look shortly.

@jzhuge
Copy link
Member

jzhuge commented Mar 31, 2022

This commit rebased to the master and relocated the spec to the new path.

Incorporated all comments from @rdblue.
Working on @jackye1995 comments.

@jzhuge
Copy link
Member

jzhuge commented Mar 31, 2022

Removed object-type from the proposal because listing views only or tables only may be very inefficient as we need to read each json file and determine its object/manifest type.

Instead, we can use OOB flag to determine the spec type. For example, HMS catalogs can use table type while Hadoop catalogs can use file extension ".viewmetadata.json" vs ".metadata.json".


* A common metadata format for view metadata, similar to how Iceberg supports a common table format for tables.
* The view metadata format specification
* Includes storage format as well as APIs to write/read the metadata.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is actually in this spec, and it probably shouldn't be. Should we remove this?

## Overview

View metadata storage mirrors how Iceberg table metadata is stored and retrieved. View metadata is maintained in metadata files. All changes to view state create a new view metadata file and completely replace the old metadata using an atomic swap. Like Iceberg tables, this atomic swap is delegated to the metastore that tracks tables and/or views by name. The view metadata file tracks the view schema, custom properties, current and past versions, as well as other metadata.
Each metadata file is self-sufficient. It contains the history of the last few operations performed on the view and can be used to roll back the view to a previous version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a little misleading to say the view metadata file "can be used to roll back the view to a previous version". Like tables, the view metadata files should only go forward. You can roll back by updating a pointer to an older version tracked in metadata but we do not want people to think that they can roll back by pointing to an older metadata file.

| Required/Optional | Field Name | Description |
|-------------------|------------|-------------|
| Required | format-version | An integer version number for the view format. Currently, this must be 1. Implementations must throw an exception if the view's version is higher than the supported version. |
| Required | location | The view's base location. This is used to determine where to store manifest files and view metadata files. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manifest files? Looks like this may be a copy-paste error.

| Required | location | The view's base location. This is used to determine where to store manifest files and view metadata files. |
| Required | current-version-id | Current version of the view. Set to ‘1’ when the view is first created. |
| Optional | properties | A string to string map of view properties. This is used for metadata such as "comment" and for settings that affect view maintenance. This is not intended to be used for arbitrary metadata. |
| Required | versions | An array of structs describing the last known versions of the view. Controlled by the table property: “version.history.num-entries”. See section [Versions](#versions). |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't need "last" in "last known". Just "known versions of the view" should be sufficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "Controlled by the table property" sentence could be more clear: "The number of versions to retain is controlled ...".


Each representation is stored as an object with only one common field "type".
The rest of the fields are interpreted based on the type.
There is only one type of representation defined in the spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this sentence is needed. Right now there's only one. But we want people to standardize these here.

| Required/Optional | Field Name | Description |
|-------------------|------------|-------------|
| Required | type | A string indicating the type of representation. It is set to "sql" for this type. |
| Required | sql | A string representing the original view definition in SQL |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we requiring that this is a SELECT statement? I don't think we want this set to CREATE VIEW ... AS SELECT (the full create/replace statement).

| Optional | field-docs | A list of strings of field comments optionally specified in the create view statement. The list should have the same length as the schema's top level fields. See the example below. |

For `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT col1, col2, ...`,
the field aliases are 'alias_name', 'alias_name2', and etc., and the field docs are 'docs', null, and etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful to have an example of SQL and the values of the representation fields.

For `CREATE VIEW v (alias_name COMMENT 'docs', alias_name2, ...) AS SELECT col1, col2, ...`,
the field aliases are 'alias_name', 'alias_name2', and etc., and the field docs are 'docs', null, and etc.

## Appendix A: An Example
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than an example for interpretation, could we do the same thing as the table spec and have a JSON serialization section with good examples at a field level?

You could put a real SQL CREATE VIEW statement at the top and base the example on that. I think that would be more helpful because the description would be more specific than big pasted blobs of JSON with comments.

@@ -0,0 +1,20 @@
---
title: "View Spec"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samredai, is this extra file needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is required. It adds the menu item to the left-nav on the docs site to link to this View Spec page which is built with the landing-page site.

@rdblue
Copy link
Contributor

rdblue commented Apr 3, 2022

Removed object-type from the proposal because listing views only or tables only may be very inefficient as we need to read each json file and determine its object/manifest type.

This should be fine, although I think if you're listing in a file system to find views you're already in a bad place!

@rdblue
Copy link
Contributor

rdblue commented Apr 3, 2022

While there are some minor things to update to make this more clear, I think this spec is good and unambiguous. That means it's time to merge it. We can fix the remaining issues in follow-up PRs and start discussing adoption and implementation. Thanks @jzhuge and @anjalinorwood!

@rdblue rdblue merged commit d10383b into apache:master Apr 3, 2022
@jzhuge
Copy link
Member

jzhuge commented Apr 5, 2022

Thanks @rdblue !

@jzhuge
Copy link
Member

jzhuge commented Apr 5, 2022

Should I go ahead to create an PR for view implementation based on a subset of Netflix code

@rdblue
Copy link
Contributor

rdblue commented Apr 5, 2022

Yes, please do!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants