[#2039] Support default value semantics - API changes #2496
Conversation
The linkedin/iceberg version of this patch can be found at: linkedin/iceberg#63.
Are complex types allowed to have defaults? Just wondering how complicated this can get.
I can't remember the details of this, but I remember a few syncs back we were discussing this behavior for the general case and were worried about Spark losing field properties... I think. Does anyone else remember the details?
I agree w/ @RussellSpitzer on complex types. How would a deeply nested structure look w/ default values? I can definitely see the value in default values, but I am having a hard time figuring out all the downstream effects: how does Spark handle these, how do they work in the Arrow-based vectorised readers, etc. Some more context on the effects on other systems would be useful for me to understand this change better.
Thanks for the question @rymurr. I see your concern. In this change, we are adding default values as optional, so that different readers don't have to worry about handling them. I plan to handle default values in Spark's Avro, ORC and Parquet readers; we can tackle them one by one. I don't foresee any potential complications, except maybe for the ser/deser of complex types. For the Spark Avro reader, which I will start with, here is how default values can be handled:
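A minimal sketch of the idea, using Avro's generic `Schema.Field`/`GenericData` APIs; this is illustrative only, not the actual implementation in this PR:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class DefaultValueReadSketch {
  // Resolve a field of the read (expected) schema against a record decoded
  // with the file's (written) schema.
  static Object readField(GenericRecord fileRecord, Schema.Field expectedField) {
    Schema.Field fileField = fileRecord.getSchema().getField(expectedField.name());
    if (fileField != null) {
      // The column is materialized in the data file: return what was
      // written, even if that value is null.
      return fileRecord.get(fileField.pos());
    }
    if (expectedField.hasDefaultValue()) {
      // The column is missing from the file: fill in the schema's default.
      return GenericData.get().getDefaultValue(expectedField);
    }
    // Missing with no default: only acceptable for an optional field.
    return null;
  }
}
```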
Hi @RussellSpitzer, @rymurr, @shardulm94, I worked on a prototype to verify that the planned changes will fix the issue, and I created a draft PR to share for reference. This prototype gives an idea of how things will look. I also tested complex types in this prototype. Here it is: linkedin/iceberg#70. I have just pushed a new commit to this PR after addressing the comments; it also reflects the commit we merged to the linkedin branch. Please take a look.
(force-pushed from d7f7fa2 to d61638d)
Hi @RussellSpitzer, @rymurr, the LinkedIn/Iceberg PRs related to supporting non-null default values:
I think the first step to getting default value semantics in would be changing the spec so that fields are allowed to have "default values". We should also outline expected behaviors in the case of changing defaults and of modifying rows with default values.
@RussellSpitzer are you referring to this or something else?
(force-pushed from d61638d to b23a00a)
I think the behaviors need to be elaborated a bit more than that, and we also need information on the JSON serialization (Appendix C). For example, does a writer, upon rewriting a file with a default value set, materialize that value? If the default for a column is changed, do we change the value? I think "manifested" needs explanation as well. If defaults are used on writers, I think we probably need to explain that in the writer section.

So, to list all of the things:

- A field is required and has a default value: what does this mean on writing/rewriting and reading? What happens if this default is changed in the schema but the column id remains the same?
- A field is optional and has a default value: what does this mean on writing/rewriting and reading? What happens if this default is changed?
- Can a default-valued field be applied to files where the column id of the field is not present? I believe this is part of the idea here, but I'm not sure how it would be defined. Is the idea that, if the column id is not present in a given data file, we always just return the default of the current schema?
- Do defaults exist retroactively, or are they always forward-looking? An example: say I have file A' without column id 3, which has a default in my spec of "foo". If I read rows out of file A' with this schema, do I return foo? If I later read the table where the spec has a default for column 3 of "bar", do I read bar? If I rewrite the data files in between the schema changes, do I get foo, bar, or null?
Thanks @RussellSpitzer for the great questions. I took time to make sure I am thinking with a general Iceberg mindset; please find my responses below. If they make sense, I will go ahead and update the spec accordingly.

- The default value is part of the table's schema/spec, so it lives in the Iceberg metadata file. Whenever the schema is updated, leading to a new schema ID, subsequent reads will read the new value. If, however, an older snapshot is being read, i.e., using the older snapshot schema, the older default value saved in that older schema will be used.
- No. Default values belong to the table's spec/schema, and are used only when reading and the column is missing (not materialized).
- Agreed, I will use "materialized" instead. Also, defaults are not used in writes (of data files). However, it is worth mentioning that we might consider adding a DDL to ALTER a column, adding/changing/dropping default values, if it fits. Such DDLs would create a new schema in the metadata file, and a new snapshot.
- Default values are used only during reading data. When a data row is missing a column id that has a default value, the default value (from the schema) is read/used.
- This is the default case, right? The opposite case is interesting: if the column id is changed, I am not sure what that even means. Is the column deleted and a new one with the same column name introduced? In any case, the default value defined for column X in the schema will only be used with X's column id.
- Optional vs. required makes a difference while reading data only if no default value is defined. If a default value is defined and the column id is missing, the default value is used in both cases (i.e., for both optional and required fields). If no default value is defined, then an exception is thrown only if the field is required. For reference: https://github.com/linkedin/iceberg/pull/72/files#diff-40083c166e284232643fa343534c626bca09d488537c226bb324be6169cab571R109
- If I read this correctly, the question is: if a table had columns A and B and datafile1 was written, and we later add column X with default value d_x, can we read the default value while reading datafile1? The answer is yes. In fact, this support for non-null default values was motivated to address this scenario in particular.
- Correct (where current = the schema of the snapshot we are reading).
- Defaults can exist retroactively, since we can read older data files using newer snapshots' schemata.
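To make the snapshot-schema rule concrete, here is a hypothetical timeline (column and value names are made up for illustration):

- Schema v1: the table has columns A and B; datafile1 is written.
- Schema v2: column X is added with default d_x. Reading datafile1 with schema v2 returns d_x for X.
- Schema v3: the default of X is changed to d_y. Reading datafile1 through the current table state now returns d_y, while time-traveling to a snapshot that used schema v2 still returns d_x.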
I think this may be an issue, since all current implementations of rewrite start by reading the current state of the data and then writing that output to new files. Consider: I have two files, both missing column A, for which I have set a default value of 1. Say my optimize rewrite command touches one of these files and rewrites it. On read, it will return rows with a=1. The replacement data file is now filled in with a=1; the column is no longer missing, so no default will be applied on read. Now if I change the default to 2, a row in the unoptimized file will return 2 (the new default for a data file missing a), while rows in the optimized file will return 1 (since that was the value read while rewriting). I think this would be a pretty strange behavior, and we should probably figure out how to eliminate it.
> A **`map`** is a collection of key-value pairs with a key type and a value type. Both the key field and value field each have an integer id that is unique in the table schema. Map keys are required and map values can be either optional or required. Both map keys and map values may be any type, including nested types.
>
> Iceberg supports default-value semantics for fields of nested types (i.e., struct, list and map). Specifically, a field of a nested type field can have a default value that will be returned upon reading this field, if it is not manifested.
What does "manifested" mean? I believe that we want default values to be filled in if the column does not exist in a data file. If the column does exist in a data file and is null, then the written value is null and Iceberg should return null.
That is correct. Will use "materialized" instead of "manifested".
> For the representations of these types in Avro, ORC, and Parquet file formats, see Appendix A.
>
> Default values for fields are supported, see Neted Types below.
Typo: "Neted" should be "Nested"
> Iceberg supports default-value semantics for fields of nested types (i.e., struct, list and map). Specifically, a field of a nested type field can have a default value that will be returned upon reading this field, if it is not manifested.
What does it mean for a list element to have a default value? Similarly, what does it mean for a map value to have a default?
I don't think that list elements or map values are places where we should allow default values. I'm not aware of a case where there is a file that contains a map, but the value column is missing. And I think that's when we would fill in default values.
The intention here is to provide default values for map, array, and struct fields, not the building blocks of those types (e.g., map keys, map values, or array elements). Not sure if this addresses Ryan's concern, but I think it is reasonable (and in our internal case required) to provide default values for fields of such data types. Pre-set default values (e.g., empty list or map) may not suit all use cases.
Yes, I just want to make sure that's the goal. No default elements, key/value pairs, or custom lookup results (like getOrDefault). Just default values for whole maps or lists, like an empty map or empty list. Maybe a specific map or list that is non-empty?
> The default value can be defined with both required and optional fields. Null default values are allowed with optional fields only, and its behavior is identical to optional fields with no default value; that is, null is returned upon reading this field when it is not manifested.
What are the rules for setting default values? The behavior required by the SQL spec is that default values are handled as though they are written into data files. That is, if I add an int field, x, with a default value of 0, write a row, update the default value to 1, then the row that was written must have x=0. I think that this implies that default values can't be changed unless we know there are no data files without the default, but it would be good to get more clarity here and some quotes from the SQL spec to inform this discussion.
From the discussion with @RussellSpitzer, I think some of those statements are in conflict with the SQL spec. Here's Postgres behavior (verified by SQLFiddle):

```sql
create table default_test (id int);
insert into default_test values (1), (2);
alter table default_test add column data int default 0;
alter table default_test alter column data set default 1000;
insert into default_test (id) values (3);
insert into default_test values (4, null);
alter table default_test alter column data set default 2000;
select * from default_test;
```
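If I'm reading the Postgres semantics correctly, that final select should return data = 0 for ids 1 and 2 (the default in effect when the column was added), 1000 for id 3, and null for id 4; the later change of the default to 2000 affects none of the existing rows.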
I think the default issue could be addressed by scanning through the schemas created since the file was created, and only using the first available default. This seems kind of expensive to me, though. The general algorithm is:
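A rough sketch of that scan; `getDefaultValue()` is the hypothetical accessor this PR would add, and the list of historical schemas is assumed to be available from table metadata:

```java
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

class FirstAvailableDefault {
  // Walk the table's schemas starting from the one in effect when the data
  // file was written, forward in time, and use the first default ever
  // defined for the missing column.
  static Object resolveDefault(List<Schema> schemasSinceFileWrite, int columnId) {
    for (Schema schema : schemasSinceFileWrite) {
      Types.NestedField field = schema.findField(columnId);
      // getDefaultValue() is hypothetical: the default accessor proposed here.
      if (field != null && field.getDefaultValue() != null) {
        return field.getDefaultValue();
      }
    }
    return null; // the column never had a default
  }
}
```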
That would do. Or, if we can detect that we are coming from the rewriting path, we skip reading the default value for the missing/unmaterialized columns. Looking into that option.
I think this would be tricky, as every engine would have to customize its rewrite pathway to accommodate it. Additionally, what would you do when rewriting a row with a required column that you defaulted previously? You cannot write a new file where only some records are missing an entry for a column (the null vs. empty issue again).
I doubt that it is worth the complexity of an algorithm like finding the default that should be applied for a file. We could do that by keeping track of when defaults are added with sequence numbers. But I think another reasonable way to fix that problem is just to state that once a default is set, it is an incompatible change to update it to another value. By making it an incompatible change, we could allow modifying the value but make it clear that it is the user's responsibility. We do this when adding a required field to a table. You can add a field by calling
I don't think that maps or lists can have default map values or list elements. Maybe an entire map or list can be defaulted? It would be easier if you could assume an empty map or empty list default. We could allow empty defaults with a simple flag, or we could allow expressing defaults like

We have a similar choice for structs. A struct could be defaulted using a flag and nested default values, so the struct is non-null and all its columns have default values. Or we could allow setting a specific default struct. I'm again not sure about the utility of the specific default struct. It is way more complicated for us to store the default values. The choice here may come down to the strategy used to store default values. If we use simple JSON structures, then we can definitely store the nested defaults. I'm interested to hear a proposal for this.
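For example, one purely illustrative JSON shape for a whole-value default on a list field, extending the field JSON of Appendix C with an assumed `default` key (layout and key name are assumptions, not an agreed proposal):

```json
{
  "id": 7,
  "name": "tags",
  "required": false,
  "type": {
    "type": "list",
    "element-id": 8,
    "element": "string",
    "element-required": true
  },
  "default": ["a", "b"]
}
```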
This is interesting @rdblue! Hive, on the other hand, does not seem to materialize the default value upon schema evolution. Oracle 10g behaves like PostgreSQL, but with a performance enhancement in Oracle 11g (referred to as fast column update), the default value lives in the schema definition and is not materialized. So this seems to me an implementation detail. I checked SQL 1992 (11.5 default clause > 2), and it states that the default value is derived from the
I meant, without any rewriting implementation having to change or customize anything, to have the reader check the current call stack and detect whether the caller is a rewriter; then we can read the unmaterialized defaulted columns properly.
I do not think it is ideal to base the decision on the call stack, but that could probably be substituted by a parameter (if the proper solution is indeed to customize the rewrite behavior).
I'm not sure I understood the part about the call stack, but it seems that the SQL behavior is to act as though the value is written into data files. I'm okay bending the rules a bit by making it an incompatible change, but I think that rewrites should write the default value into data files.
I'm not sure how this would work. Given my example again: if I choose Row(1, 3), then File2 will get a different value if the default is changed. (Forbidding changing the default sounds like an OK solution for this, but I'm not a huge fan of table changes that have this kind of non-controlled behavior. A user choosing to alter the default even after a warning would have no real way to tell which rows got the new default and which didn't.) If I choose Row(1, null), then for required B I don't have a problem, since I can just read the default; for optional I have a problem, because now this column is null and not missing. So I'd say I'm
Regarding the default value semantics: I think this discussion can be simplified if we distinguish between two use cases of default values: (1) when they are used with an

To start, we can discuss the schema evolution use case as the sole one, to simplify the discussion (we will relax this assumption later). In fact, unlike traditional database engines, which allow

So putting the "

Now, to cover the
@wmoustafa, I think I mostly agree with you. That's probably how it should work. The only problem I have is with point 3: Iceberg should not allow changing default values, because it breaks the expected SQL behavior. That said, I would allow changing default values only behind the
I am okay with that @rdblue. I hope that in the future changing default values becomes a compatible change (for example, if we trigger a rewrite once a new column with a default value is added, which would hardcode the default value), but it is okay to start more restrictive and relax the restrictions later.
Closing this in favor of #4301. |
Summary
The Iceberg schema currently does not support default values, which imposes a challenge on reading Hive tables that are written in Avro format and have default values (see issue #2039). Specifically, if a field has a non-null default value, it is mapped to a required field, with no default value, in Iceberg. Thus, upon reading rows where this field is not manifested, an IllegalArgumentException (Missing required field) will be thrown. Furthermore, default values of nullable fields are lost silently. That is because nullable fields with default values are mapped to optional fields with no default values, and thus null is returned when the field is absent, instead of the default value. This document describes how to support default value semantics in the Iceberg schema to resolve these issues.
Problem
Default values are lost
Default values are specified using the Avro schema keyword "default". For example, the following is an Avro string field with default value "unknown":
{"name": "filedName", "type": "string", "default": “unknown”}Also, a nullable (optional) avro field can define default value as follows:
{"name": "fieldName", "type": ["null", "string"], "default": null}Please note that nullability is specified via UNION type (i.e., the [“null’, “string”]) and the default value’s type must match the first type in the union. In other words, the following are invalid types:
That is, if the default value is null, the first type of the field must be "null"; otherwise, if the default value is of type string, the first type of the field must be "string".
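For example, both of the following are invalid, because the default's type does not match the first branch of the union:

```json
{"name": "fieldName", "type": ["string", "null"], "default": null}
{"name": "fieldName", "type": ["null", "string"], "default": "unknown"}
```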
When converting an Avro schema to Iceberg types, we have two cases. If the field is nullable, it maps to an optional NestedField Iceberg type, while if the field is non-nullable, it is mapped to a required NestedField Iceberg type with no default value, since NestedField does not support default values. In both cases, the default value is lost, which leads to wrong handling of the data when read: in the case of non-nullable fields, an error is thrown if the field is not present, whereas in the nullable case, the field silently passes as an optional field.
Where this breaks in the code
When reading Avro records, AvroSchemaUtils::buildAvroProjections() is invoked, which invokes BuildAvroProjection.record() to construct Iceberg's record. When reading rows with default values (i.e., the field is not present in the data file), the code path goes on to check whether this field is optional (here). If the field has a null default value, it is nullable and is therefore mapped to an optional field, and the field is skipped; if instead it has a non-null default value, this check throws an exception.
Solution
Overview
The fix is simply to add the default value to the NestedField, and to add the relevant APIs to copy the default value over when converting from the Avro schema, and to use it while reading. In the case of non-nullable fields with default values, these obviously need to be modeled as required fields with default values. The default value here can be used in schema evolution, for example if a required field is added; in this case, while reading older data/partitions, the default value will be returned. For nullable fields, with similar reasoning about schema evolution, we should model these fields as optional with default values, and use the default value instead of just using null for optional fields. This includes two cases: (a) the default value is null, as in

```json
{"name": "fieldName", "type": ["null", "string"], "default": null}
```

and (b) a non-null default value, as in

```json
{"name": "fieldName", "type": ["string", "null"], "default": "defValue"}
```
ORC and Parquet

While Avro supports default-value semantics, and Avro libraries can be used to read fields with default values, neither the ORC nor the Parquet format supports default value semantics. It is required, though, to provide consistent behavior and semantics across the different file formats; therefore, once default semantics are enabled in the Iceberg schema, the ORC and Parquet readers should be modified to handle this properly. Specifically, when reading a field that is not manifested but has a default value, the default value should be used.
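As a purely illustrative sketch (not tied to Iceberg's actual reader interfaces), the projection over a file that lacks the column could substitute a reader that emits the schema's default for every row:

```java
// Illustrative only: a stand-in "column reader" that never consults the
// file, because the column is not materialized there; every row gets the
// schema's default value.
class ConstantColumnReader<T> {
  private final T defaultValue;

  ConstantColumnReader(T defaultValue) {
    this.defaultValue = defaultValue;
  }

  T read() {
    return defaultValue;
  }
}
```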
Planned Code Changes