
Conversation

@kbendick (Contributor) commented Jun 3, 2022

This patch upgrades Parquet from 1.12.2 to 1.12.3.

The change-log between the two can be found here: apache/parquet-java@apache-parquet-1.12.2...apache-parquet-1.12.3

A few notes of particular interest:

@kbendick kbendick changed the title Core - Upgrade Parquet to 1.12.3 to get Zstd Buffer Pool by default [TEST] Core - Upgrade Parquet to 1.12.3 to get Zstd Buffer Pool by default Jun 5, 2022
@rdblue (Contributor) commented Jun 29, 2022

@kbendick, what's the status of this? Should Iceberg set parquet.compression.codec.zstd.bufferPool.enabled=true for all Parquet files? That sounds like a good thing to me.

@rdblue rdblue added this to the Iceberg 0.14.0 Release milestone Jun 29, 2022
@kbendick (Contributor, Author) replied:

> @kbendick, what's the status of this? Should Iceberg set parquet.compression.codec.zstd.bufferPool.enabled=true for all Parquet files? That sounds like a good thing to me.

Yes, I believe we should set that configuration by default, either by upgrading the Parquet patch version or by setting it ourselves.

If we choose to set it ourselves and not upgrade the Parquet library, I'll do a quick pass over the PRs merged between 1.12.2 and 1.12.3 to make sure it's not somehow unsafe to add.

In the longer term, I think upgrading the Parquet patch version would likely be the better option.
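For reference, the property discussed above is a plain Parquet writer option. A minimal sketch of where such a key would be set; the java.util.Properties object here is only a stand-in for a Hadoop Configuration (or an engine's write-option map), used so the snippet runs without a Hadoop dependency. The key name is the one from this thread:

```java
import java.util.Properties;

public class ZstdBufferPoolOption {
    public static void main(String[] args) {
        // Stand-in for the Hadoop Configuration that Parquet writers read
        // their codec options from; the key itself is what this PR discusses.
        Properties writeOptions = new Properties();
        writeOptions.setProperty(
            "parquet.compression.codec.zstd.bufferPool.enabled", "true");
        System.out.println(
            writeOptions.getProperty(
                "parquet.compression.codec.zstd.bufferPool.enabled"));
    }
}
```

The appeal of upgrading Parquet instead is that the library then applies its own default, rather than Iceberg hard-coding the key in every write path.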

@kbendick (Contributor, Author) commented:

We might also want to bump the Avro version to match what’s used in parquet 1.12.3.

@rdblue (Contributor) commented Jun 29, 2022

> We might also want to bump the Avro version to match what’s used in parquet 1.12.3.

Is that safe? What is the version change?

@kbendick kbendick force-pushed the kb-bump-parquet-patch-version branch from 2ae2c7c to 6896510 Compare June 29, 2022 19:49
@kbendick kbendick changed the title [TEST] Core - Upgrade Parquet to 1.12.3 to get Zstd Buffer Pool by default Core - Upgrade Parquet to 1.12.3 Jun 29, 2022
@kbendick (Contributor, Author) commented Jun 29, 2022

> > We might also want to bump the Avro version to match what’s used in parquet 1.12.3.
>
> Is that safe? What is the version change?

The version change is from 1.10.1 to 1.10.2: apache/parquet-java@d96b19b

We are already on 1.10.1, so we would be following the same upgrade path. I'm not sure which of our other dependencies rely on Avro (I imagine many of them do).

@kbendick kbendick marked this pull request as ready for review June 29, 2022 19:59
@kbendick (Contributor, Author) replied:

> @kbendick, what's the status of this? Should Iceberg set parquet.compression.codec.zstd.bufferPool.enabled=true for all Parquet files? That sounds like a good thing to me.

I have spoken to some users with very wide tables (potentially over 1,000 columns) who told me that enabling the buffer pool via this configuration resolved OOMs for them, so I believe it will benefit all users.

@kbendick (Contributor, Author) commented Jun 29, 2022

The Hive tests are failing with a NoSuchMethodError:

    java.lang.NoSuchMethodError: org.apache.parquet.format.Util.writePageHeader(Lorg/apache/parquet/format/PageHeader;Ljava/io/OutputStream;Lorg/apache/parquet/format/BlockCipher$Encryptor;[B)V
        at org.apache.parquet.format.converter.ParquetMetadataConverter.writeDataPageV1Header(ParquetMetadataConverter.java:1880)
        at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:186)
        at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:59)
        at org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:387)
        at org.apache.parquet.column.impl.ColumnWriteStoreBase.flush(ColumnWriteStoreBase.java:186)
        at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:29)
        at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:203)
        at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:236)
        at org.apache.iceberg.data.GenericAppenderHelper.appendToLocalFile(GenericAppenderHelper.java:102)
        at org.apache.iceberg.data.GenericAppenderHelper.writeFile(GenericAppenderHelper.java:85)
        at org.apache.iceberg.mr.TestHelper.writeFile(TestHelper.java:114)
        at org.apache.iceberg.mr.hive.TestTables.appendIcebergTable(TestTables.java:295)
        at org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithMultipleCatalogs.createAndAddRecords(TestHiveIcebergStorageHandlerWithMultipleCatalogs.java:140)
        at org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithMultipleCatalogs.testJoinTablesFromDifferentCatalogs(TestHiveIcebergStorageHandlerWithMultipleCatalogs.java:118)

@rdblue (Contributor) commented Jun 29, 2022

Looks like Hive might be bringing in a different copy of Parquet and that is conflicting in the test. We should be able to exclude Hive's Parquet version to work around this.
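A sketch of that workaround, assuming a Gradle build; the hive-exec coordinates and configuration name here are illustrative, and the actual module names in the Iceberg build may differ:

```groovy
dependencies {
    // Hypothetical: drop the Parquet artifacts the Hive test dependency
    // brings in, so the Parquet version Iceberg declares (1.12.3) is the
    // only one on the test classpath.
    testImplementation("org.apache.hive:hive-exec") {
        exclude group: "org.apache.parquet"
    }
}
```

The NoSuchMethodError above is the classic symptom of two Parquet versions on the classpath: the calling class was compiled against the new writePageHeader signature, but an older parquet-format jar is loaded first.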

@kbendick (Contributor, Author) replied:

> Looks like Hive might be bringing in a different copy of Parquet and that is conflicting in the test. We should be able to exclude Hive's Parquet version to work around this.

I'll give that a try.

@kbendick (Contributor, Author) commented Jul 6, 2022

Closed in favor of #5188, which gets this working.

@kbendick kbendick closed this Jul 6, 2022