@rzhang10 rzhang10 commented Nov 9, 2021

This change updates the Avro Spark reader code path so that a union data type of [int, string] is read into a Spark schema/data of [tag, field0, field1] instead of the previous [tag_0, tag_1].
It also fixes an issue: when the Avro union value is null, the reader should return a plain null instead of [0, null, null] or [null, null] (as the previous schema did).
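
To make the new layout concrete, here is a minimal sketch of the [tag, field0, field1] shape for an [int, string] union. It is an illustration under assumptions, not the reader code from this PR: a plain Object[] stands in for Spark's InternalRow, and the names UnionRowLayoutSketch and toRow are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration only: Object[] stands in for Spark's InternalRow.
final class UnionRowLayoutSketch {

  // Builds [tag, field0, ..., fieldN-1] for a union with branchCount non-null branches.
  // A null union value maps to a plain null rather than a row filled with nulls.
  static Object[] toRow(int branchCount, Integer branchIndex, Object value) {
    if (branchIndex == null) {
      return null; // the union value itself is null
    }
    Object[] row = new Object[branchCount + 1];
    row[0] = branchIndex;         // tag: which branch holds the value
    row[branchIndex + 1] = value; // only the selected branch is populated
    return row;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(toRow(2, 0, 42)));      // prints [0, 42, null]
    System.out.println(Arrays.toString(toRow(2, 1, "abc")));   // prints [1, null, abc]
    System.out.println(Arrays.toString(toRow(2, null, null))); // prints null
  }
}
```

The tag is stored once at position 0, so a consumer can find the populated field at position tag + 1 without scanning the row.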

@rzhang10 rzhang10 force-pushed the refactor_union_schema_from_hive_to_trino branch from ce38cb9 to d078702 on November 9, 2021 19:01
@funcheetah

Thanks @rzhang10 for the PR! Could you please add a test to make sure the new schema works?

@@ -108,22 +92,4 @@ public void testOptionSchema() {

Assert.assertEquals(expectedIcebergSchema, icebergSchema.toString());
}

Is this considered invalid at all? Or what's the reason to get rid of this case?

Member Author

Yes, this schema shouldn't appear in the first place; users shouldn't define this kind of schema.

}

- static class UnionReader implements ValueReader<InternalRow> {
+ private static class UnionReader implements ValueReader<Object> {

Is this modifier change necessary? It might make it harder for us to contribute back in the future.

Do you also need to do the same for the vectorized ORC reader?

Member Author

This matches the access modifier used for the similar members in this class, SparkValueReaders. The corresponding change has been made in SparkOrcValueReaders in PR #85.

What about the modifier change? Is that really needed?

Member Author

Yeah, if you look at the other nested classes in this file, you will see:
private static class MapReader
private static class EnumReader
private static class ArrayReader

So I think it's best to keep the modifiers consistent.

return options.get(0);
}
} else {
// Complex union

Is there an official definition of what is complex and what is simple? If we aim to contribute this back, it may be better to clarify that through Javadoc. If this is merely for internal usage, I am fine with what we have now.

Member Author

A simple union is [sometype, null], while a complex union is [sometype1, sometype2, ...], where there are at least 2 non-null types in the union. I think we can add Javadoc to AvroSchemaUtil.isOptionSchema, which is an existing method in upstream Iceberg.
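
The distinction can be sketched as a small classifier. This is only an illustration of the definition given in this comment, not the logic of AvroSchemaUtil.isOptionSchema; branch types are passed as Avro type names, and the names UnionKindSketch, Kind and classify are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: branch types are given as Avro type names.
final class UnionKindSketch {

  enum Kind { SIMPLE, COMPLEX, OTHER }

  static Kind classify(List<String> branchTypes) {
    long nonNullCount = branchTypes.stream().filter(t -> !"null".equals(t)).count();
    boolean hasNull = nonNullCount < branchTypes.size();
    if (nonNullCount >= 2) {
      return Kind.COMPLEX; // e.g. [int, string] or [null, int, string]
    }
    if (hasNull && nonNullCount == 1) {
      return Kind.SIMPLE; // e.g. [int, null], an optional value
    }
    return Kind.OTHER; // e.g. [int] or [null]; not covered by the definition above
  }

  public static void main(String[] args) {
    System.out.println(classify(Arrays.asList("int", "null")));           // prints SIMPLE
    System.out.println(classify(Arrays.asList("null", "int", "string"))); // prints COMPLEX
    System.out.println(classify(Arrays.asList("int")));                   // prints OTHER
  }
}
```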

} else if (index < nullIndex) {
struct.update(index + 1, value);
struct.setInt(0, index);
} else {

Not sure I follow this. What does the relative position of the value index and the null index have to do with how we assign the value in the struct?

Member Author

Because in Avro the nullability of a union type is represented by the presence of a NULL type among the union alternatives, and the NULL type can occur at any position in the union. In our converted struct, however, there is no field corresponding to that NULL type, so the struct has one fewer field (hence the -1).

I guess what I didn't know earlier was this: if a NULL type appears in the union, it seems it has to be in the first position in the alternative list? But the Avro spec says, "Thus, for unions containing "null", the "null" is usually listed first."

Will this lead to a silent failure if you don't check this first?

Member Author

@rzhang10 rzhang10 Dec 6, 2021

I think that's just a recommendation from the Avro spec, not a mandate. In fact, in our ecosystem there are many Avro schemas that do not put null as the first element of the union.

That's why I'm specifically computing the null index here and branching on it.
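
A minimal sketch of the index arithmetic discussed in this thread follows. The `index < nullIndex` case mirrors the diff line `struct.update(index + 1, value)`; the case where the null branch comes first is not visible in the excerpt, so its handling here is an assumption. The names UnionFieldPositionSketch, nullIndex and structPosition are hypothetical.

```java
// Hypothetical illustration only; not the reader code from this PR.
final class UnionFieldPositionSketch {

  // Returns the position of the first "null" branch, or -1 if the union has none.
  static int nullIndex(String[] branchTypes) {
    for (int i = 0; i < branchTypes.length; i++) {
      if ("null".equals(branchTypes[i])) {
        return i;
      }
    }
    return -1;
  }

  // Maps an Avro branch index to its position in [tag, field0, field1, ...].
  static int structPosition(int branchIndex, int nullIndex) {
    if (branchIndex == nullIndex) {
      throw new IllegalArgumentException("the null branch has no struct field");
    }
    if (nullIndex < 0 || branchIndex < nullIndex) {
      return branchIndex + 1; // +1 skips the tag at position 0
    }
    // Assumption: -1 for the null branch that has no field, +1 for the tag.
    return branchIndex;
  }

  public static void main(String[] args) {
    String[] union = {"string", "null", "int"};
    int nullAt = nullIndex(union);
    System.out.println(structPosition(0, nullAt)); // prints 1 (string -> field0)
    System.out.println(structPosition(2, nullAt)); // prints 2 (int -> field1)
  }
}
```

Computing the null index once, instead of assuming null is listed first, keeps the mapping correct for schemas that place null anywhere in the union.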


@Override
- public InternalRow read(Decoder decoder, Object reuse) throws IOException {
+ public Object read(Decoder decoder, Object reuse) throws IOException {

Why is this needed if you are still returning struct, which is an InternalRow?

Member Author

See line 323: if the data is null, we return null directly. So this method returns either null or an InternalRow, and the return type should be Object. Please see the same method signature in ValueReaders.

So why not return null directly at line 323 if nullIndex is hit?

Member Author

readers[index].read(decoder, reuse) will return null; I just feel that writing it this way makes the code more consistent with the later part.

Hmm, not sure that's worth the cost of changing the method signature... But I don't have a strong opinion on this either.

Member Author

I think using InternalRow is better, let me make the change accordingly. Thanks for spotting this!

@rzhang10 rzhang10 changed the title from "Avro: Refactor union schema from hive to trino" to "Avro: Change union read schema from hive to trino" on Nov 15, 2021
@autumnust

A couple more comments... @rzhang10

@autumnust autumnust left a comment

LGTM. Most of the conversations above are just questions. I still believe we shouldn't change the signature of the union read method from InternalRow to Object, since a return type of Object doesn't seem right if we can make it more specific. But I have no strong opinion on that, so don't take it as a blocker.

@rzhang10 rzhang10 merged commit e43bc6e into linkedin:li-0.11.x Dec 7, 2021
rzhang10 added a commit to rzhang10/iceberg that referenced this pull request Oct 11, 2022
* [LI] Avro: Refactor union-to-struct schema - Part 1. changes to support reading Avro
rzhang10 added a commit that referenced this pull request Oct 21, 2022
* Hive Catalog: Add a hive catalog that does not override existing Hive metadata (#10)

Add custom hive catalog to not override existing Hive metadata

Fail early with a proper exception if the metadata file is not existing

Simplify CustomHiveCatalog (#22)

* Shading: Add a iceberg-runtime shaded module (#12)

* ORC: Add test for reading files without Iceberg IDs (#16)

* Hive Metadata Scan: Support reading tables with only Hive metadata (#23, #24, #25, #26)

- Support for non string partition columns (#24)
- Support for Hive tables without avro.schema.literal (#25)
- Hive Metadata Scan: Notify ScanEvent listeners on planning (#35)
- Hive Metadata Scan: Do not use table snapshot summary for estimating statistics (#37)
- Hive Metadata Scan: Return empty statistics (#49)
- Hive Metadata Scan: Do not throw an exception on dangling partitions; log warning message (#50)
- Hive Metadata Scan: Fix pushdown of non-partition predicates within NOT (#51)

Co-authored-by: Ratandeep Ratti <[email protected]>
Co-authored-by: Kuai Yu <[email protected]>
Co-authored-by: Walaa Eldin Moustafa <[email protected]>

* Row level filtering: Allow table scans to pass a row level filter for ORC files

- ORC: Support NameMapping with row-level filtering (#53)

* Hive: Made Predicate Pushdown dynamic based on the Hive Version

* Hive: Fix uppercase bug and determine catalog from table properties (#38)

* Hive: Return lowercase fieldname from IcebergRecordStructField
* Hive: Determine catalog from table property

* Hive: Fix schema not forwarded to SerDe on MR jobs (#45) (#47)

* Hive: Use Hive table location in HiveIcebergSplit
* Hive: Fix schema not passed to Serde
* Hive: Refactor tests for tables with unqualified location URI

Co-authored-by: Shardul Mahadik <[email protected]>

* Hive Metadata Scan: Support case insensitive name mapping (#52)

* Hive Metadata Scan: Merge Hive and Avro schemas to fix datatype inconsistencies (#57)

Hive Metadata Scan: Fix Hive primitive to Avro logical type conversion (#58)

Hive Metadata Scan: Fix support for Hive timestamp type (#61)

Co-authored-by: Raymond Zhang <[email protected]>
Co-authored-by: Shardul Mahadik <[email protected]>

Fix HasDuplicateLowercaseColumnNames's visit method to use a new visi… (#67)

* Fix HasDuplicateLowercaseColumnNames's visit method to use a new visitor instance every time

* Trigger CI

(cherry picked from commit b90e838)

* Stop using serdeToFileFormat to unblock formats other than Avro or Orc (#64)

* Stop using serdeToFileFormat to unblock formats other than Avro or Orc

* Fix style check

* Do not delete metadata location when HMS has been successfully updated (#68)

(cherry picked from commit 766407e)

* Support reading Avro complex union types (#73)

Co-authored-by: Wenye Zhang <[email protected]>

* [#2039] Support default value semantic for AVRO (#75)

(cherry picked from commit c18f4c4)

* Support hive non string partition cols (#78)

* Support non-string hive type partition columns in LegacyHiveTableScan

* Leverage eval against partition filter expression to filter non-string columns

* Support default value read for ORC format in spark (#76)

* Support default value read for ORC format in spark

* Refactor common code for ReadBuilder for both non-vectorized and vectorized read

* Fix code style issue

* Add special handling of ROW_POSITION metadata column

* Add corner case check for partition field

* Use BaseDataReader.convertConstant to convert constants, and expand its functionality to support nested-type contants such as array/map/struct

* Support nested type default value for vectorized read

* Support deeply nested type default value for vectorized read

* Support reading ORC complex union types (#74)

* Support reading orc complex union types

* add more tests

* support union in VectorizedSparkOrcReaders and improve tests

* support union in VectorizedSparkOrcReaders and improve tests - continued

* fix checkstyle

Co-authored-by: Wenye Zhang <[email protected]>

* Support avro.schema.literal/hive union types in Hive legacy table to Iceberg conversion (#80)

* Fix ORC schema visitors to support reading ORC files with deeply nest… (#81)

* Fix ORC schema visitors to support reading ORC files with deeply nested union type schema

* Added test for vectorized read

* Disable avro validation for default values

Co-authored-by: Shenoda Guirguis <[email protected]>

* Fix spark avro reader reading union schema data (#83)

* Fix spark avro reader to read correctly structured nested data values

* Make sure field-id mapping is correctly maintained given arbitrary nested schema that contains union

* Avro: Change union read schema from hive to trino (#84)

* [LI] Avro: Refactor union-to-struct schema - Part 1. changes to support reading Avro

* ORC: Change union read schema from hive to trino (#85)

* [LI] ORC: Refactor union-to-struct schema - Part 2. changes to support reading ORC

* Change Hive type to Iceberg type conversion for union

* Recorder hive table properties to align the avro.schema.literal placement contract (#86)

* [#2039] Support default value semantic for AVRO

(cherry picked from commit c18f4c4)

* reverting commits 2c59857 and f362aed (#88)

Co-authored-by: Shenoda Guirguis <[email protected]>

* logically patching PR 2328 on HiveMetadataPreservingTableOperations

* Support timestamp as partition type (#91)

* Support timestamp in partition types

* Address comment

* Separate classes under hive legacy package to new hivelink module (#87)

* separate class under legacy to new hiveberg module

* fix build

* remove hiveberg dependency in iceberg-spark2 module

* Revert "remove hiveberg dependency in iceberg-spark2 module"

This reverts commit 2e8b743.

* rename hiveberg module to hivelink

Co-authored-by: Wenye Zhang <[email protected]>

* [LI] Align default value validation align with avro semantics in terms of nullable (nested) fields (#92)

* Align default value validation align with avro semantics in terms of nullable (nested) fields

* Allow setting null as default value for nested fields in record default

* [LI][Spark][Avro] read avro union using decoder instead of directly returning v… (#94)

* [LI][Spark] read avro union using decoder instead of directly returning value

* Add a comment for the schema

* Improve the logging when the deserailzed index is invalid to read the symbol from enum (#96)

* Move custom hive catalog to hivelink-core (#99)

* Handle non-nullable union of single type for Avro (#98)

* Handle non-nullable union of single type

Co-authored-by: Wenye Zhang <[email protected]>

* Handle null default in nested type default value situations (#100)

* Move 'Hive Metadata Scan: Support case insensitive name mapping' (PR 52) to hivelink-core (#102)

* Remove activeSparkSession (#103)

* Disable default value preserving (#106)

* Disable default value preserving

* [LI][Avro] Do not reorder elements inside a Avro union schema (#93)

* handle single type union properly in AvroSchemaVisitor for deep nested schema (#107)

* Handle non-nullable union of single type for ORC spark non-vectorized reader (#104)

* Handle single type union for non-vectorized reader

* [Avro] Retain the type of field while copying the default values. (#109)

* Retain the type of field while copying the default values.

* [Hivelink] Refactor support hive non string partition cols to rid of … (#110)

* [Hivelink] Refactor support hive non string partition cols to rid of Iceberg-oss code changes

* Release automation overhaul: Sonatype Nexus, Shipkit and GH Actions (#101)

* Add scm and developer info (#111)

* [Core] Fix and refactor schema parser (#112)

* [Core] Fix/Refactor SchemaParser to fix multiple bugs

* Enhance the UT for testing required fields with default values (#113)

* Enhance the UT for testing required fields with default values

* Addressed review comments

* Addressed review comment

* Support single type union for ORC-vectorization reader (#114)

* Support single type union for ORC-vectorization reader

* Support single type union for ORC-vectorization reader

Co-authored-by: Yiqiang Ding <[email protected]>

* Refactor HMS code upon cherry-pick

* Check for schema corruption and fix it on commit (#117)

* Check for schema corruption and fix it on commit

* ORC: Handle query where select and filter only uses default value col… (#118)

* ORC: Handle query where select and filter only use default value columns

* Set ORC columns and fix case-sensitivity issue with schema check (#119)

* Hive: Return null for currentSnapshot() (#121)

* Hive: Return null for currentSnapshot()

* Handle snapshots()

* Fix MergeHiveSchemaWithAvro to make it copy full Avro schema attributes (#120)

* Fix MergeHiveSchemaWithAvro to make it copy full Avro schema attributes

* Add logic to derive partition column id from partition.column.ids pro… (#122)

* Add logic to derive partition column id from partition.column.ids property

* Do not push down filter to ORC for union type schema (#123)

* Bug fix: MergeHiveSchemaWithAvro should retain avro properties for li… (#125)

* Bug fix: MergeHiveSchemaWithAvro should retain avro properties for list and map when they are nullable

* LinkedIn rebase draft

* Refactor hivelink 1

* Make hivelink module test all pass

* Make spark 2.4 module work

* Fix mr module

* Make spark 3.1 module work

* Fix TestSparkMetadataColumns

* Minor fix for spark 2.4

* Update default spark version to 3.1

* Update java ci to only run spark 2.4 and 3.1

* Minor fix HiveTableOperations

* Adapt github CI to 0.14.x branch

* Fix mr module checkstyle

* Fix checkstyle for orc module

* Fix spark2.4 checkstyle

* Refactor catalog loading logic using CatalogUtil

* Minor change to CI/release

Co-authored-by: Shardul Mahadik <[email protected]>
Co-authored-by: Ratandeep Ratti <[email protected]>
Co-authored-by: Shardul Mahadik <[email protected]>
Co-authored-by: Kuai Yu <[email protected]>
Co-authored-by: Walaa Eldin Moustafa <[email protected]>
Co-authored-by: Sushant Raikar <[email protected]>
Co-authored-by: ZihanLi58 <[email protected]>
Co-authored-by: Wenye Zhang <[email protected]>
Co-authored-by: Wenye Zhang <[email protected]>
Co-authored-by: Shenoda Guirguis <[email protected]>
Co-authored-by: Shenoda Guirguis <[email protected]>
Co-authored-by: Shenoda Guirguis <[email protected]>
Co-authored-by: Lei Sun <[email protected]>
Co-authored-by: Jiefan <[email protected]>
Co-authored-by: yiqiangin <[email protected]>
Co-authored-by: Malini Mahalakshmi Venkatachari <[email protected]>
Co-authored-by: Yiqiang Ding <[email protected]>
Co-authored-by: Yiqiang Ding <[email protected]>
Co-authored-by: Jack Moseley <[email protected]>
rzhang10 added a commit to rzhang10/iceberg that referenced this pull request Nov 4, 2022
rzhang10 added a commit to rzhang10/iceberg that referenced this pull request Nov 4, 2022
* Hive Catalog: Add a hive catalog that does not override existing Hive metadata (linkedin#10)

Add custom hive catalog to not override existing Hive metadata

Fail early with a proper exception if the metadata file is not existing

Simplify CustomHiveCatalog (linkedin#22)

* Shading: Add a iceberg-runtime shaded module (linkedin#12)

* ORC: Add test for reading files without Iceberg IDs (linkedin#16)

* Hive Metadata Scan: Support reading tables with only Hive metadata (linkedin#23, linkedin#24, linkedin#25, linkedin#26)

- Support for non string partition columns (linkedin#24)
- Support for Hive tables without avro.schema.literal (linkedin#25)
- Hive Metadata Scan: Notify ScanEvent listeners on planning (linkedin#35)
- Hive Metadata Scan: Do not use table snapshot summary for estimating statistics (linkedin#37)
- Hive Metadata Scan: Return empty statistics (linkedin#49)
- Hive Metadata Scan: Do not throw an exception on dangling partitions; log warning message (linkedin#50)
- Hive Metadata Scan: Fix pushdown of non-partition predicates within NOT (linkedin#51)

Co-authored-by: Ratandeep Ratti <[email protected]>
Co-authored-by: Kuai Yu <[email protected]>
Co-authored-by: Walaa Eldin Moustafa <[email protected]>

* Row level filtering: Allow table scans to pass a row level filter for ORC files

- ORC: Support NameMapping with row-level filtering (linkedin#53)

* Hive: Made Predicate Pushdown dynamic based on the Hive Version

* Hive: Fix uppercase bug and determine catalog from table properties (linkedin#38)

* Hive: Return lowercase fieldname from IcebergRecordStructField
* Hive: Determine catalog from table property

* Hive: Fix schema not forwarded to SerDe on MR jobs (linkedin#45) (linkedin#47)

* Hive: Use Hive table location in HiveIcebergSplit
* Hive: Fix schema not passed to Serde
* Hive: Refactor tests for tables with unqualified location URI

Co-authored-by: Shardul Mahadik <[email protected]>

* Hive Metadata Scan: Support case insensitive name mapping (linkedin#52)

* Hive Metadata Scan: Merge Hive and Avro schemas to fix datatype inconsistencies (linkedin#57)

Hive Metadata Scan: Fix Hive primitive to Avro logical type conversion (linkedin#58)

Hive Metadata Scan: Fix support for Hive timestamp type (linkedin#61)

Co-authored-by: Raymond Zhang <[email protected]>
Co-authored-by: Shardul Mahadik <[email protected]>

* Fix HasDuplicateLowercaseColumnNames's visit method to use a new visi… (linkedin#67)

* Fix HasDuplicateLowercaseColumnNames's visit method to use a new visitor instance every time

* Trigger CI

(cherry picked from commit b90e838)

* Stop using serdeToFileFormat to unblock formats other than Avro or Orc (linkedin#64)

* Stop using serdeToFileFormat to unblock formats other than Avro or Orc

* Fix style check

* Do not delete metadata location when HMS has been successfully updated (linkedin#68)

(cherry picked from commit 766407e)

* Support reading Avro complex union types (linkedin#73)

Co-authored-by: Wenye Zhang <[email protected]>

* [#2039] Support default value semantic for AVRO (linkedin#75)

(cherry picked from commit c18f4c4)

* Support hive non string partition cols (linkedin#78)

* Support non-string hive type partition columns in LegacyHiveTableScan

* Leverage eval against partition filter expression to filter non-string columns

* Support default value read for ORC format in spark (linkedin#76)

* Support default value read for ORC format in spark

* Refactor common code for ReadBuilder for both non-vectorized and vectorized read

* Fix code style issue

* Add special handling of ROW_POSITION metadata column

* Add corner case check for partition field

* Use BaseDataReader.convertConstant to convert constants, and expand its functionality to support nested-type constants such as array/map/struct

* Support nested type default value for vectorized read

* Support deeply nested type default value for vectorized read

* Support reading ORC complex union types (linkedin#74)

* Support reading orc complex union types

* add more tests

* support union in VectorizedSparkOrcReaders and improve tests

* support union in VectorizedSparkOrcReaders and improve tests - continued

* fix checkstyle

Co-authored-by: Wenye Zhang <[email protected]>

* Support avro.schema.literal/hive union types in Hive legacy table to Iceberg conversion (linkedin#80)

* Fix ORC schema visitors to support reading ORC files with deeply nest… (linkedin#81)

* Fix ORC schema visitors to support reading ORC files with deeply nested union type schema

* Added test for vectorized read

* Disable avro validation for default values

Co-authored-by: Shenoda Guirguis <[email protected]>

* Fix spark avro reader reading union schema data (linkedin#83)

* Fix spark avro reader to read correctly structured nested data values

* Make sure field-id mapping is correctly maintained given arbitrary nested schema that contains union

* Avro: Change union read schema from hive to trino (linkedin#84)

* [LI] Avro: Refactor union-to-struct schema - Part 1. changes to support reading Avro

* ORC: Change union read schema from hive to trino (linkedin#85)

* [LI] ORC: Refactor union-to-struct schema - Part 2. changes to support reading ORC

* Change Hive type to Iceberg type conversion for union

* Reorder hive table properties to align the avro.schema.literal placement contract (linkedin#86)

* [#2039] Support default value semantic for AVRO

(cherry picked from commit c18f4c4)

* reverting commits 2c59857 and f362aed (linkedin#88)

Co-authored-by: Shenoda Guirguis <[email protected]>

* logically patching PR 2328 on HiveMetadataPreservingTableOperations

* Support timestamp as partition type (linkedin#91)

* Support timestamp in partition types

* Address comment

* Separate classes under hive legacy package to new hivelink module (linkedin#87)

* separate class under legacy to new hiveberg module

* fix build

* remove hiveberg dependency in iceberg-spark2 module

* Revert "remove hiveberg dependency in iceberg-spark2 module"

This reverts commit 2e8b743.

* rename hiveberg module to hivelink

Co-authored-by: Wenye Zhang <[email protected]>

* [LI] Align default value validation with avro semantics in terms of nullable (nested) fields (linkedin#92)

* Align default value validation with avro semantics in terms of nullable (nested) fields

* Allow setting null as default value for nested fields in record default

* [LI][Spark][Avro] read avro union using decoder instead of directly returning v… (linkedin#94)

* [LI][Spark] read avro union using decoder instead of directly returning value

* Add a comment for the schema

* Improve the logging when the deserialized index is invalid for reading the symbol from an enum (linkedin#96)

* Move custom hive catalog to hivelink-core (linkedin#99)

* Handle non-nullable union of single type for Avro (linkedin#98)

* Handle non-nullable union of single type

Co-authored-by: Wenye Zhang <[email protected]>

* Handle null default in nested type default value situations (linkedin#100)

* Move 'Hive Metadata Scan: Support case insensitive name mapping' (PR 52) to hivelink-core (linkedin#102)

* Remove activeSparkSession (linkedin#103)

* Disable default value preserving (linkedin#106)

* Disable default value preserving

* [LI][Avro] Do not reorder elements inside an Avro union schema (linkedin#93)

* handle single type union properly in AvroSchemaVisitor for deep nested schema (linkedin#107)

* Handle non-nullable union of single type for ORC spark non-vectorized reader (linkedin#104)

* Handle single type union for non-vectorized reader

* [Avro] Retain the type of field while copying the default values. (linkedin#109)

* Retain the type of field while copying the default values.

* [Hivelink] Refactor support hive non string partition cols to get rid of … (linkedin#110)

* [Hivelink] Refactor support hive non string partition cols to get rid of Iceberg-oss code changes

* Release automation overhaul: Sonatype Nexus, Shipkit and GH Actions (linkedin#101)

* Add scm and developer info (linkedin#111)

* [Core] Fix and refactor schema parser (linkedin#112)

* [Core] Fix/Refactor SchemaParser to fix multiple bugs

* Enhance the UT for testing required fields with default values (linkedin#113)

* Enhance the UT for testing required fields with default values

* Addressed review comments

* Addressed review comment

* Support single type union for ORC-vectorization reader (linkedin#114)

* Support single type union for ORC-vectorization reader

* Support single type union for ORC-vectorization reader

Co-authored-by: Yiqiang Ding <[email protected]>

* Refactor HMS code upon cherry-pick

* Check for schema corruption and fix it on commit (linkedin#117)

* Check for schema corruption and fix it on commit

* ORC: Handle query where select and filter only use default value col… (linkedin#118)

* ORC: Handle query where select and filter only use default value columns

* Set ORC columns and fix case-sensitivity issue with schema check (linkedin#119)

* Hive: Return null for currentSnapshot() (linkedin#121)

* Hive: Return null for currentSnapshot()

* Handle snapshots()

* Fix MergeHiveSchemaWithAvro to make it copy full Avro schema attributes (linkedin#120)

* Fix MergeHiveSchemaWithAvro to make it copy full Avro schema attributes

* Add logic to derive partition column id from partition.column.ids pro… (linkedin#122)

* Add logic to derive partition column id from partition.column.ids property

* Do not push down filter to ORC for union type schema (linkedin#123)

* Bug fix: MergeHiveSchemaWithAvro should retain avro properties for li… (linkedin#125)

* Bug fix: MergeHiveSchemaWithAvro should retain avro properties for list and map when they are nullable

* LinkedIn rebase draft

* Refactor hivelink 1

* Make hivelink module test all pass

* Make spark 2.4 module work

* Fix mr module

* Make spark 3.1 module work

* Fix TestSparkMetadataColumns

* Minor fix for spark 2.4

* Update default spark version to 3.1

* Update java ci to only run spark 2.4 and 3.1

* Minor fix HiveTableOperations

* Adapt github CI to 0.14.x branch

* Fix mr module checkstyle

* Fix checkstyle for orc module

* Fix spark2.4 checkstyle

* Refactor catalog loading logic using CatalogUtil

* Minor change to CI/release

Co-authored-by: Shardul Mahadik <[email protected]>
Co-authored-by: Ratandeep Ratti <[email protected]>
Co-authored-by: Kuai Yu <[email protected]>
Co-authored-by: Walaa Eldin Moustafa <[email protected]>
Co-authored-by: Sushant Raikar <[email protected]>
Co-authored-by: ZihanLi58 <[email protected]>
Co-authored-by: Wenye Zhang <[email protected]>
Co-authored-by: Shenoda Guirguis <[email protected]>
Co-authored-by: Lei Sun <[email protected]>
Co-authored-by: Jiefan <[email protected]>
Co-authored-by: yiqiangin <[email protected]>
Co-authored-by: Malini Mahalakshmi Venkatachari <[email protected]>
Co-authored-by: Yiqiang Ding <[email protected]>
Co-authored-by: Jack Moseley <[email protected]>
rzhang10 added a commit to rzhang10/iceberg that referenced this pull request Nov 4, 2022
rzhang10 added a commit that referenced this pull request Dec 17, 2022
* Rebase LI-Iceberg changes on top of Apache Iceberg 1.0.0 release

* Add flink 1.14 artifacts for release

Co-authored-by: Shardul Mahadik <[email protected]>
Co-authored-by: Ratandeep Ratti <[email protected]>
Co-authored-by: Kuai Yu <[email protected]>
Co-authored-by: Walaa Eldin Moustafa <[email protected]>
Co-authored-by: Sushant Raikar <[email protected]>
Co-authored-by: ZihanLi58 <[email protected]>
Co-authored-by: Wenye Zhang <[email protected]>
Co-authored-by: Shenoda Guirguis <[email protected]>
Co-authored-by: Lei Sun <[email protected]>
Co-authored-by: Jiefan <[email protected]>
Co-authored-by: yiqiangin <[email protected]>
Co-authored-by: Malini Mahalakshmi Venkatachari <[email protected]>
Co-authored-by: Yiqiang Ding <[email protected]>
Co-authored-by: Jack Moseley <[email protected]>