Rebase LI-Iceberg changes on top of Apache Iceberg 0.14.1 release #126

rzhang10 · 2022-10-11T20:06:23Z

Rebase all LinkedIn-specific changes on top of the Apache Iceberg 0.14.1 release: https://github.com/apache/iceberg/releases/tag/apache-iceberg-0.14.1

All modules that carry LinkedIn changes currently build and pass all unit tests:

core
api
hivelink-core
orc
mr
spark-2.4
spark-3.1

Also adapted the GitHub CI for the li-0.14.x branch to check and test the modules and builds we need for LI-specific use (e.g. spark-2.4, spark-3.1).

…inkedin#23, linkedin#24, linkedin#25, linkedin#26)
- Support for non string partition columns (linkedin#24)
- Support for Hive tables without avro.schema.literal (linkedin#25)
- Hive Metadata Scan: Notify ScanEvent listeners on planning (linkedin#35)
- Hive Metadata Scan: Do not use table snapshot summary for estimating statistics (linkedin#37)
- Hive Metadata Scan: Return empty statistics (linkedin#49)
- Hive Metadata Scan: Do not throw an exception on dangling partitions; log warning message (linkedin#50)
- Hive Metadata Scan: Fix pushdown of non-partition predicates within NOT (linkedin#51)
Co-authored-by: Ratandeep Ratti <[email protected]>
Co-authored-by: Kuai Yu <[email protected]>
Co-authored-by: Walaa Eldin Moustafa <[email protected]>

…kedin#47)
* Hive: Use Hive table location in HiveIcebergSplit
* Hive: Fix schema not passed to Serde
* Hive: Refactor tests for tables with unqualified location URI
Co-authored-by: Shardul Mahadik <[email protected]>

…sistencies (linkedin#57)
Hive Metadata Scan: Fix Hive primitive to Avro logical type conversion (linkedin#58)
Hive Metadata Scan: Fix support for Hive timestamp type (linkedin#61)
Co-authored-by: Raymond Zhang <[email protected]>
Co-authored-by: Shardul Mahadik <[email protected]>
Fix HasDuplicateLowercaseColumnNames's visit method to use a new visi… (linkedin#67)
* Fix HasDuplicateLowercaseColumnNames's visit method to use a new visitor instance every time
* Trigger CI
(cherry picked from commit b90e838)
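For readers unfamiliar with why linkedin#67 creates a new visitor instance on every call, here is a minimal Python sketch of the general pitfall. It is not the LI-Iceberg code: the class and function names are hypothetical, and the failure mode shown (state leaking between calls through a shared visitor) is an assumption about the kind of bug such a fix usually addresses.

```python
class DuplicateLowercaseNameVisitor:
    """Hypothetical stateful visitor: remembers every lowercased name it has seen."""

    def __init__(self):
        self.seen = set()
        self.has_duplicate = False

    def visit(self, column_names):
        # Flag a duplicate when two names collide after lowercasing.
        for name in column_names:
            lowered = name.lower()
            if lowered in self.seen:
                self.has_duplicate = True
            self.seen.add(lowered)
        return self.has_duplicate


# Assumed buggy variant: one visitor shared by all calls, so names from
# earlier schemas stay in `seen` and pollute later checks.
_SHARED_VISITOR = DuplicateLowercaseNameVisitor()


def has_duplicates_shared(column_names):
    return _SHARED_VISITOR.visit(column_names)


# Fixed variant: a fresh visitor per call, so each schema is checked in isolation.
def has_duplicates_fresh(column_names):
    return DuplicateLowercaseNameVisitor().visit(column_names)
```

With the shared instance, checking `["id"]` twice reports a duplicate on the second call even though the schema has a single column; the fresh-instance variant answers the same for every call.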

* Support default value read for ORC format in spark
* Refactor common code for ReadBuilder for both non-vectorized and vectorized read
* Fix code style issue
* Add special handling of ROW_POSITION metadata column
* Add corner case check for partition field
* Use BaseDataReader.convertConstant to convert constants, and expand its functionality to support nested-type constants such as array/map/struct
* Support nested type default value for vectorized read
* Support deeply nested type default value for vectorized read

* Support reading orc complex union types
* add more tests
* support union in VectorizedSparkOrcReaders and improve tests
* support union in VectorizedSparkOrcReaders and improve tests - continued
* fix checkstyle
Co-authored-by: Wenye Zhang <[email protected]>

…nkedin#87)
* separate class under legacy to new hiveberg module
* fix build
* remove hiveberg dependency in iceberg-spark2 module
* Revert "remove hiveberg dependency in iceberg-spark2 module" This reverts commit 2e8b743.
* rename hiveberg module to hivelink
Co-authored-by: Wenye Zhang <[email protected]>

…s of nullable (nested) fields (linkedin#92)
* Align default value validation with avro semantics in terms of nullable (nested) fields
* Allow setting null as default value for nested fields in record default

yiqiangin

LGTM

shardulm94 and others added 30 commits September 28, 2022 08:42

Hive Catalog: Add a hive catalog that does not override existing Hive…

09cd709

… metadata (linkedin#10)
Add custom hive catalog to not override existing Hive metadata
Fail early with a proper exception if the metadata file is not existing
Simplify CustomHiveCatalog (linkedin#22)

Shading: Add a iceberg-runtime shaded module (linkedin#12)

1dd5c14

ORC: Add test for reading files without Iceberg IDs (linkedin#16)

21c0276

Row level filtering: Allow table scans to pass a row level filter for…

80e6317

… ORC files - ORC: Support NameMapping with row-level filtering (linkedin#53)

Hive: Made Predicate Pushdown dynamic based on the Hive Version

d2e4f4b

Hive: Fix uppercase bug and determine catalog from table properties (l…

d959c40

…inkedin#38)
* Hive: Return lowercase fieldname from IcebergRecordStructField
* Hive: Determine catalog from table property

Hive Metadata Scan: Support case insensitive name mapping (linkedin#52)

5a5adcd

Stop using serdeToFileFormat to unblock formats other than Avro or Orc (

eb26358

linkedin#64)
* Stop using serdeToFileFormat to unblock formats other than Avro or Orc
* Fix style check

Do not delete metadata location when HMS has been successfully updated (

37c9ae2

linkedin#68) (cherry picked from commit 766407e)

Support reading Avro complex union types (linkedin#73)

4bdb0c5

Co-authored-by: Wenye Zhang <[email protected]>

[#2039] Support default value semantic for AVRO (linkedin#75)

15df78f

(cherry picked from commit c18f4c4)

Support hive non string partition cols (linkedin#78)

1e1b4b9

* Support non-string hive type partition columns in LegacyHiveTableScan
* Leverage eval against partition filter expression to filter non-string columns

Support avro.schema.literal/hive union types in Hive legacy table to …

52ec9b9

…Iceberg conversion (linkedin#80)

Fix ORC schema visitors to support reading ORC files with deeply nest… (

fc51e32

linkedin#81)
* Fix ORC schema visitors to support reading ORC files with deeply nested union type schema
* Added test for vectorized read

Disable avro validation for default values

1757af7

Co-authored-by: Shenoda Guirguis <[email protected]>

Fix spark avro reader reading union schema data (linkedin#83)

1f63d32

* Fix spark avro reader to read correctly structured nested data values
* Make sure field-id mapping is correctly maintained given arbitrary nested schema that contains union

Avro: Change union read schema from hive to trino (linkedin#84)

973e3dc

* [LI] Avro: Refactor union-to-struct schema - Part 1. changes to support reading Avro

ORC: Change union read schema from hive to trino (linkedin#85)

611b256

* [LI] ORC: Refactor union-to-struct schema - Part 2. changes to support reading ORC
* Change Hive type to Iceberg type conversion for union
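As a rough illustration of what a union-to-struct read schema can look like, the Python sketch below assumes the layout is a struct with an integer `tag` field plus one `field<i>` slot per union branch. That layout, the type names, and both helper functions are assumptions made for illustration; the commits above do not spell out the exact schema they produce.

```python
def union_to_struct_fields(branch_types):
    """Return (name, type) pairs for a struct standing in for a union.

    Assumed layout: a leading integer 'tag', then one 'field<i>' per branch.
    """
    fields = [("tag", "int")]
    for index, branch_type in enumerate(branch_types):
        fields.append((f"field{index}", branch_type))
    return fields


def encode_union_value(branch_index, value, branch_count):
    """Encode one union value as a row: the tag selects the populated slot."""
    if not 0 <= branch_index < branch_count:
        raise ValueError("branch_index out of range")
    row = {"tag": branch_index}
    for index in range(branch_count):
        # Only the slot matching the tag carries the value; the rest are null.
        row[f"field{index}"] = value if index == branch_index else None
    return row
```

Under these assumptions, a union of `int` and `string` would be read as a struct with fields `tag`, `field0`, and `field1`, and a string value would populate only `field1` with `tag` set to 1.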

Recorder hive table properties to align the avro.schema.literal place…

987eb1d

…ment contract (linkedin#86)

[#2039] Support default value semantic for AVRO

8cc2711

(cherry picked from commit c18f4c4)

reverting commits 2c59857 and f362aed (linkedin#88)

64ed521

Co-authored-by: Shenoda Guirguis <[email protected]>

logically patching PR 2328 on HiveMetadataPreservingTableOperations

7d41f3e

Support timestamp as partition type (linkedin#91)

33bd0da

* Support timestamp in partition types
* Address comment

rzhang10 force-pushed the li-0.14.x branch from fed806b to 71d918e Compare October 11, 2022 20:20

rzhang10 changed the title ~~Update LI-Iceberg with Apache Iceberg 0.14.1~~ Rebase LI-Iceberg changes on top of Apache Iceberg 0.14.1 Oct 11, 2022

rzhang10 changed the title ~~Rebase LI-Iceberg changes on top of Apache Iceberg 0.14.1~~ Rebase LI-Iceberg changes on top of Apache Iceberg 0.14.1 release Oct 11, 2022

Fix mr module

47c32df

github-actions bot added SPARK and removed AWS DOCS PARQUET NESSIE PIG FLINK PYTHON COMMON ARROW labels Oct 11, 2022

rzhang10 added 10 commits October 11, 2022 17:40

Make spark 3.1 module work

3f35222

Fix TestSparkMetadataColumns

3d955d8

Minor fix for spark 2.4

656c692

Update default spark version to 3.1

343fd1c

Update java ci to only run spark 2.4 and 3.1

794a46b

Minor fix HiveTableOperations

d46aaae

Adapt github CI to 0.14.x branch

8c6c8e2

Fix mr module checkstyle

ca81b95

Fix checkstyle for orc module

d63a059

Fix spark2.4 checkstyle

555259a

yiqiangin approved these changes Oct 13, 2022

rzhang10 added 2 commits October 21, 2022 13:50

Refactor catalog loading logic using CatalogUtil

35c4650

Minor change to CI/release

f158233

yiqiangin approved these changes Oct 21, 2022

rzhang10 merged commit 211a8c9 into linkedin:li-0.14.x Oct 21, 2022

rzhang10 mentioned this pull request Nov 4, 2022

Rebase LI-Iceberg changes on top of Apache Iceberg 1.0.0 release #131

Merged

11 participants