[#2039] Support default value semantics - API changes #2496
Conversation
The linkedin/iceberg version of this patch can be found at: linkedin/iceberg#63.
Are complex types allowed to have defaults? Just wondering how complicated this can get.
I can't remember the details of this, but I remember a few syncs back we were discussing this behavior for the general case and were worried about Spark losing field properties... I think. Does anyone else remember the details?
I agree w/ @RussellSpitzer on complex types. How would a deeply nested structure look w/ default values? I can definitely see the value in default values, but I am having a hard time figuring out all the downstream effects: how does Spark handle these, how do they work in the Arrow-based vectorised readers, etc. Some more context on the effects on other systems would be useful for me to understand this change better.
Thanks for the question @rymurr. I see your concern. In this change, we are adding default values as optional, so that different readers don't have to worry about handling them. I plan to handle default values in Spark's Avro, ORC and Parquet readers; we can tackle them one by one. I don't foresee any potential complications, except maybe for the ser/deser of complex types. For the Spark Avro reader, which I will start with, here is how default values can be handled:
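A minimal sketch of the idea, using Avro's generic `Schema.Field`/`GenericData` APIs; this is illustrative only, not the actual implementation in this PR:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class DefaultValueReadSketch {
  // Resolve a field of the read (expected) schema against a record decoded
  // with the file's (written) schema.
  static Object readField(GenericRecord fileRecord, Schema.Field expectedField) {
    Schema.Field fileField = fileRecord.getSchema().getField(expectedField.name());
    if (fileField != null) {
      // The column is materialized in the data file: return what was
      // written, even if that value is null.
      return fileRecord.get(fileField.pos());
    }
    if (expectedField.hasDefaultValue()) {
      // The column is missing from the file: fill in the schema's default.
      return GenericData.get().getDefaultValue(expectedField);
    }
    // Missing with no default: only acceptable for an optional field.
    return null;
  }
}
```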
Hi @RussellSpitzer, @rymurr, @shardulm94, I worked on a prototype to verify that the planned changes will fix the issue, and I created a draft PR to share for reference. This prototype gives an idea of how things will look. I also tested complex types in this prototype. Here it is: linkedin/iceberg#70. I have just pushed a new commit to this PR after addressing the comments; it also reflects the commit we merged to the linkedin branch. Please take a look.
(force-pushed from d7f7fa2 to d61638d)
Hi @RussellSpitzer, @rymurr, the LinkedIn/Iceberg PRs related to supporting non-null default values:
I think the first step to getting default value semantics in would be changing the spec so that fields are allowed to have "default values". We should also outline expected behaviors in the case of changing defaults and of modifying rows with default values.
@RussellSpitzer are you referring to this or something else?
(force-pushed from d61638d to b23a00a)
I think the behaviors need to be elaborated a bit more than that, and we also need information on the JSON serialization (Appendix C). For example, does a writer, upon rewriting a file with a default value set, materialize that value? If the default for a column is changed, do we change the value? I think "manifested" needs explanation as well. If defaults are used on writers, I think we probably need to explain that in the writer section.

So, to list all of the things:

- A field is required and has a default value: what does this mean on writing/rewriting and reading? What happens if this default is changed in the schema but the column id remains the same?
- A field is optional and has a default value: what does this mean on writing/rewriting and reading? What happens if this default is changed?
- Can a default-valued field be applied to files where the column id of the field is not present? I believe this is part of the idea here, but I'm not sure how it would be defined. Is the idea that, if the column id is not present in a given data file, we always just return the default of the current schema?
- Do defaults exist retroactively, or are they always forward-looking? An example: say I have file A' without column id 3, which has a default in my spec of "foo". If I read rows out of file A' with this schema, do I return foo? If I later read the table where the spec has a default for column 3 of "bar", do I read bar? If I rewrite the data files in between the schema changes, do I get foo, bar, or null?
Thanks @RussellSpitzer for the great questions. I took time to make sure I am thinking with a general Iceberg mindset; please find my responses below. If they make sense, I will go ahead and update the spec accordingly.

- The default value is part of the table's schema/spec, so it lives in the Iceberg metadata file. Whenever the schema is updated, leading to a new schema ID, subsequent reads will read the new value. If, however, an older snapshot is being read, i.e., using the older snapshot schema, the older default value saved in that older schema will be used.
- No. Default values belong to the table's spec/schema, and are used only when reading and the column is missing (not materialized).
- Agreed, I will use "materialized" instead. Also, defaults are not used in writes (of data files). However, it is worth mentioning that we might consider adding a DDL to ALTER a column, adding/changing/dropping default values, if it fits. Such DDLs would create a new schema in the metadata file, and a new snapshot.
- Default values are used only during reading data. When a data row is missing a column id that has a default value, the default value (from the schema) is read/used.
- This is the default case, right? The opposite case is interesting: if the column id is changed, I am not sure what that even means. Is the column deleted and a new one with the same column name introduced? In any case, the default value defined for column X in the schema will only be used with X's column id.
- Optional vs. required makes a difference while reading data only if no default value is defined. If a default value is defined and the column id is missing, the default value is used in both cases (i.e., for both optional and required fields). If no default value is defined, then an exception is thrown only if the field is required. For reference: https://github.com/linkedin/iceberg/pull/72/files#diff-40083c166e284232643fa343534c626bca09d488537c226bb324be6169cab571R109
- If I read this correctly, the question is: if a table had columns A and B and datafile1 was written, and we later add column X with default value d_x, can we read the default value while reading datafile1? The answer is yes. In fact, this support for non-null default values was motivated to address this scenario in particular.
- Correct (where current = the schema of the snapshot we are reading).
- Defaults can exist retroactively, since we can read older data files using newer snapshots' schemata.
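To make the snapshot-schema rule concrete, here is a hypothetical timeline (column and value names are made up for illustration):

- Schema v1: the table has columns A and B; datafile1 is written.
- Schema v2: column X is added with default d_x. Reading datafile1 with schema v2 returns d_x for X.
- Schema v3: the default of X is changed to d_y. Reading datafile1 through the current table state now returns d_y, while time-traveling to a snapshot that used schema v2 still returns d_x.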
I think this may be an issue, since all current implementations of rewrite start by reading the current state of the data and then writing that output to new files. Consider: I have two files, both missing column A, for which I have set a default value of 1. Say my optimize rewrite command touches one of these files and rewrites it. On read, it will return rows with a=1. The replacement data file is now filled in with a=1; the column is no longer missing, so no default will be applied on read. Now if I change the default to 2, a row in the unoptimized file will return 2 (the new default for a data file missing a), while rows in the optimized file will return 1 (since that was the value read while rewriting). I think this would be a pretty strange behavior, and we should probably figure out how to eliminate it.
> A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.
>
> Iceberg supports default-value semantics for fields of nested types (i.e., struct, list and map). Specifically, a field of a nested type field can have a default value that will be returned upon reading this field, if it is not manifested.
What does "manifested" mean? I believe that we want default values to be filled in if the column does not exist in a data file. If the column does exist in a data file and is null, then the written value is null and Iceberg should return null.
That is correct. Will use "materialized" instead of "manifested".
> For the representations of these types in Avro, ORC, and Parquet file formats, see Appendix A.
>
> Default values for fields are supported, see Neted Types below.
Typo: "Neted" should be "Nested"
> Iceberg supports default-value semantics for fields of nested types (i.e., struct, list and map). Specifically, a field of a nested type field can have a default value that will be returned upon reading this field, if it is not manifested.
What does it mean for a list element to have a default value? Similarly, what does it mean for a map value to have a default?
I don't think that list elements or map values are places where we should allow default values. I'm not aware of a case where there is a file that contains a map, but the value column is missing. And I think that's when we would fill in default values.
The intention here is to provide default values for map, array, and struct fields, not the building blocks of those types (e.g., map keys, map values, or array elements). Not sure if this addresses Ryan's concern, but I think it is reasonable (and in our internal case required) to provide default values for fields of such data types. Pre-set default values (e.g., empty list or map) may not suit all use cases.
Yes, I just want to make sure that's the goal. No default elements, key/value pairs, or custom lookup results (like getOrDefault). Just default values for whole maps or lists, like an empty map or empty list. Maybe a specific map or list that is non-empty?
> The default value can be defined with both required and optional fields. Null default values are allowed with optional fields only, and its behavior is identical to optional fields with no default value; that is, null is returned upon reading this field when it is not manifested.
What are the rules for setting default values? The behavior required by the SQL spec is that default values are handled as though they are written into data files. That is, if I add an int field, x, with a default value of 0, write a row, update the default value to 1, then the row that was written must have x=0. I think that this implies that default values can't be changed unless we know there are no data files without the default, but it would be good to get more clarity here and some quotes from the SQL spec to inform this discussion.
From the discussion with @RussellSpitzer, I think some of those statements are in conflict with the SQL spec. Here's Postgres behavior (verified by SQLFiddle):

```sql
create table default_test (id int);
insert into default_test values (1), (2);
alter table default_test add column data int default 0;
alter table default_test alter column data set default 1000;
insert into default_test (id) values (3);
insert into default_test values (4, null);
alter table default_test alter column data set default 2000;
select * from default_test;
```
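If I'm reading the Postgres semantics correctly, that final select should return data = 0 for ids 1 and 2 (the default in effect when the column was added), 1000 for id 3, and null for id 4; the later change of the default to 2000 affects none of the existing rows.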
I think the default issue could be addressed by scanning through the schemas created since the file was created, and only using the first available default. This seems kind of expensive to me, though. The general algorithm is:
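A rough sketch of that scan; `getDefaultValue()` is the hypothetical accessor this PR would add, and the list of historical schemas is assumed to be available from table metadata:

```java
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

class FirstAvailableDefault {
  // Walk the table's schemas starting from the one in effect when the data
  // file was written, forward in time, and use the first default ever
  // defined for the missing column.
  static Object resolveDefault(List<Schema> schemasSinceFileWrite, int columnId) {
    for (Schema schema : schemasSinceFileWrite) {
      Types.NestedField field = schema.findField(columnId);
      // getDefaultValue() is hypothetical: the default accessor proposed here.
      if (field != null && field.getDefaultValue() != null) {
        return field.getDefaultValue();
      }
    }
    return null; // the column never had a default
  }
}
```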
That would do. Or, if we can detect that we are coming from the rewriting path, we skip reading the default value for the missing/unmaterialized columns. Looking into that option.
I think this would be tricky, as every engine would have to customize its rewrite pathway to accommodate it. Additionally, what would you do when rewriting a row with a required column that you defaulted previously? You cannot write a new file where only some records are missing an entry for a column (the null vs. empty issue again).
I doubt that it is worth the complexity of an algorithm like finding the default that should be applied for a file. We could do that by keeping track of when defaults are added with sequence numbers. But I think another reasonable way to fix that problem is just to state that once a default is set, it is an incompatible change to update it to another value. By making it an incompatible change, we could allow modifying the value but make it clear that it is the user's responsibility. We do this when adding a required field to a table. You can add a field by calling
I don't think that maps or lists can have default map values or list elements. Maybe an entire map or list can be defaulted? It would be easier if you could assume an empty map or empty list default. We could allow empty defaults with a simple flag, or we could allow expressing defaults like

We have a similar choice for structs. A struct could be defaulted using a flag and nested default values, so the struct is non-null and all its columns have default values. Or we could allow setting a specific default struct. I'm again not sure about the utility of the specific default struct. It is way more complicated for us to store the default values. The choice here may come down to the strategy used to store default values. If we use simple JSON structures, then we can definitely store the nested defaults. I'm interested to hear a proposal for this.
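For example, one purely illustrative JSON shape for a whole-value default on a list field, extending the field JSON of Appendix C with an assumed `default` key (layout and key name are assumptions, not an agreed proposal):

```json
{
  "id": 7,
  "name": "tags",
  "required": false,
  "type": {
    "type": "list",
    "element-id": 8,
    "element": "string",
    "element-required": true
  },
  "default": ["a", "b"]
}
```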
This is interesting @rdblue! Hive, on the other hand, does not seem to materialize the default value upon schema evolution. Oracle 10g behaves like PostgreSQL, but with a performance enhancement in Oracle 11g (referred to as fast column update), the default value lives in the schema definition and is not materialized. So this seems to me an implementation detail. I checked SQL 1992 (11.5 default clause > 2), and it states that the default value is derived from the
I meant, without any rewriting implementation having to change or customize anything, to have the reader check the current call stack and detect whether the caller is a rewriter; then we can read the unmaterialized defaulted columns properly.
I do not think it is ideal to base the decision on the call stack, but that could probably be substituted by a parameter (if the proper solution is indeed to customize the rewrite behavior).
I'm not sure I understood the part about the call stack, but it seems that the SQL behavior is to act as though the value is written into data files. I'm okay bending the rules a bit by making it an incompatible change, but I think that rewrites should write the default value into data files.
I'm not sure how this would work. Given my example again: if I choose Row(1, 3), then File2 will get a different value if the default is changed. (Forbidding changing the default sounds like an OK solution for this, but I'm not a huge fan of table changes that have this kind of non-controlled behavior. A user choosing to alter the default even after a warning would have no real way to tell which rows got the new default and which didn't.) If I choose Row(1, null), then for required B I don't have a problem, since I can just read the default; for optional I have a problem, because now this column is null and not missing. So I'd say I'm
Regarding the default value semantics: I think this discussion can be simplified if we distinguish between two use cases of default values: (1) when they are used with an

To start, we can discuss the schema evolution use case as the sole one, to simplify the discussion (we will relax this assumption later). In fact, unlike traditional database engines, which allow

So putting the "

Now, to cover the
@wmoustafa, I think I mostly agree with you. That's probably how it should work. The only problem I have is with point 3: Iceberg should not allow changing default values, because it breaks the expected SQL behavior. That said, I would allow changing default values only behind the
I am okay with that @rdblue. I hope that in the future changing default values becomes a compatible change (for example, if we trigger a rewrite once a new column with a default value is added, which would hardcode the default value), but it is okay to start more restrictive and relax the restrictions later.
Closing this in favor of #4301. |
Summary
The Iceberg schema currently does not support default values, which imposes a challenge on reading Hive tables that are written in Avro format and have default values (see issue #2039). Specifically, if a field has a non-null default value, it is mapped to a required field, with no default value, in Iceberg. Thus, upon reading rows where this field is not manifested, an IllegalArgumentException (Missing required field) will be thrown. Furthermore, default values of nullable fields are lost silently. That is because nullable fields with default values are mapped to optional fields with no default values, and thus null is returned when the field is absent, instead of the default value. This document describes how to support default value semantics in the Iceberg schema to resolve these issues.
Problem
Default values are lost
Default values are specified using the Avro schema keyword "default". For example, the following is an Avro string field with default value "unknown":
{"name": "filedName", "type": "string", "default": “unknown”}Also, a nullable (optional) avro field can define default value as follows:
{"name": "fieldName", "type": ["null", "string"], "default": null}Please note that nullability is specified via UNION type (i.e., the [“null’, “string”]) and the default value’s type must match the first type in the union. In other words, the following are invalid types:
That is, if the default value is null, the first type of the field must be "null"; otherwise, if the default value is of type string, the first type of the field must be "string".
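For example, both of the following are invalid, because the default's type does not match the first branch of the union:

```json
{"name": "fieldName", "type": ["string", "null"], "default": null}
{"name": "fieldName", "type": ["null", "string"], "default": "unknown"}
```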
When converting an Avro schema to Iceberg types, we have two cases. If the field is nullable, it maps to an optional NestedField Iceberg type, while if the field is non-nullable, it is mapped to a required NestedField Iceberg type with no default value, since NestedField does not support default values. In both cases, the default value is lost, which leads to wrong handling of the data when read: in the case of non-nullable fields, an error is thrown if the field is not present, whereas in the nullable case, the field silently passes as an optional field.
Where this breaks in the code
When reading Avro records, AvroSchemaUtils::buildAvroProjections() is invoked, which invokes BuildAvroProjection.record() to construct Iceberg's record. When reading rows with default values (i.e., the field is not present in the data file), the code path goes on to check whether this field is optional (here). If the field has a null default value, it is nullable and is therefore mapped to an optional field, and the field is skipped; if instead it has a non-null default value, this check throws an exception.
Solution
Overview
The fix is simply to add the default value to the NestedField, and to add the relevant APIs to copy the default value over when converting from the Avro schema, and to use it while reading. In the case of non-nullable fields with default values, these obviously need to be modeled as required fields with default values. The default value here can be used in schema evolution, for example if a required field is added; in this case, while reading older data/partitions, the default value will be returned. For nullable fields, with similar reasoning about schema evolution, we should model these fields as optional with default values, and use the default value instead of just using null for optional fields. This includes two cases: (a) the default value is null, as in

```json
{"name": "fieldName", "type": ["null", "string"], "default": null}
```

and (b) a non-null default value, as in

```json
{"name": "fieldName", "type": ["string", "null"], "default": "defValue"}
```
ORC and Parquet

While Avro supports default-value semantics, and Avro libraries can be used to read fields with default values, neither the ORC nor the Parquet format supports default value semantics. It is required, though, to provide consistent behavior and semantics across the different file formats; therefore, once default semantics are enabled in the Iceberg schema, the ORC and Parquet readers should be modified to handle this properly. Specifically, when reading a field that is not manifested but has a default value, the default value should be used.
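As a purely illustrative sketch (not tied to Iceberg's actual reader interfaces), the projection over a file that lacks the column could substitute a reader that emits the schema's default for every row:

```java
// Illustrative only: a stand-in "column reader" that never consults the
// file, because the column is not materialized there; every row gets the
// schema's default value.
class ConstantColumnReader<T> {
  private final T defaultValue;

  ConstantColumnReader(T defaultValue) {
    this.defaultValue = defaultValue;
  }

  T read() {
    return defaultValue;
  }
}
```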
Planned Code Changes