
Conversation

@kbendick (Contributor) commented Jun 3, 2022

This patch upgrades Parquet from 1.12.2 to 1.12.3.

The change-log between the two can be found here: apache/parquet-java@apache-parquet-1.12.2...apache-parquet-1.12.3

A few notes of particular interest:

@kbendick kbendick changed the title Core - Upgrade Parquet to 1.12.3 to get Zstd Buffer Pool by default [TEST] Core - Upgrade Parquet to 1.12.3 to get Zstd Buffer Pool by default Jun 5, 2022
@rdblue (Contributor) commented Jun 29, 2022

@kbendick, what's the status of this? Should Iceberg set parquet.compression.codec.zstd.bufferPool.enabled=true for all Parquet files? That sounds like a good thing to me.

@rdblue rdblue added this to the Iceberg 0.14.0 Release milestone Jun 29, 2022
@kbendick (Contributor, Author) replied:

> @kbendick, what's the status of this? Should Iceberg set parquet.compression.codec.zstd.bufferPool.enabled=true for all Parquet files? That sounds like a good thing to me.

Yes, I believe we should set that configuration by default, either by upgrading the Parquet patch version or by setting it ourselves.

If we choose to set it ourselves and not upgrade the Parquet library, I'll do a quick pass over the PRs merged between 1.12.2 and 1.12.3 to make sure it's not somehow unsafe to add.

In the longer term, I think upgrading the Parquet patch version would likely be the better option.
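For reference, the property discussed above is a plain Parquet writer option. A minimal sketch of where such a key would be set; the java.util.Properties object here is only a stand-in for a Hadoop Configuration (or an engine's write-option map), used so the snippet runs without a Hadoop dependency. The key name is the one from this thread:

```java
import java.util.Properties;

public class ZstdBufferPoolOption {
    public static void main(String[] args) {
        // Stand-in for the Hadoop Configuration that Parquet writers read
        // their codec options from; the key itself is what this PR discusses.
        Properties writeOptions = new Properties();
        writeOptions.setProperty(
            "parquet.compression.codec.zstd.bufferPool.enabled", "true");
        System.out.println(
            writeOptions.getProperty(
                "parquet.compression.codec.zstd.bufferPool.enabled"));
    }
}
```

The appeal of upgrading Parquet instead is that the library then applies its own default, rather than Iceberg hard-coding the key in every write path.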

@kbendick (Contributor, Author) commented:

We might also want to bump the Avro version to match what’s used in parquet 1.12.3.

@rdblue (Contributor) commented Jun 29, 2022

> We might also want to bump the Avro version to match what’s used in parquet 1.12.3.

Is that safe? What is the version change?

@kbendick kbendick force-pushed the kb-bump-parquet-patch-version branch from 2ae2c7c to 6896510 Compare June 29, 2022 19:49
@kbendick kbendick changed the title [TEST] Core - Upgrade Parquet to 1.12.3 to get Zstd Buffer Pool by default Core - Upgrade Parquet to 1.12.3 Jun 29, 2022
@kbendick (Contributor, Author) commented Jun 29, 2022

> > We might also want to bump the Avro version to match what’s used in parquet 1.12.3.
>
> Is that safe? What is the version change?

The version change is from 1.10.1 to 1.10.2: apache/parquet-java@d96b19b

We are already on 1.10.1, so we would be following the same upgrade path. I'm not sure which of our other dependencies rely on Avro (I imagine many of them do).

@kbendick kbendick marked this pull request as ready for review June 29, 2022 19:59
@kbendick (Contributor, Author) replied:

> @kbendick, what's the status of this? Should Iceberg set parquet.compression.codec.zstd.bufferPool.enabled=true for all Parquet files? That sounds like a good thing to me.

I have spoken to some users with very wide tables (potentially over 1,000 columns) who told me that enabling the buffer pool via this configuration resolved OOMs for them, so I believe it will benefit all users.

@kbendick (Contributor, Author) commented Jun 29, 2022

The Hive tests are failing with a NoSuchMethodError:

    java.lang.NoSuchMethodError: org.apache.parquet.format.Util.writePageHeader(Lorg/apache/parquet/format/PageHeader;Ljava/io/OutputStream;Lorg/apache/parquet/format/BlockCipher$Encryptor;[B)V
        at org.apache.parquet.format.converter.ParquetMetadataConverter.writeDataPageV1Header(ParquetMetadataConverter.java:1880)
        at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:186)
        at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:59)
        at org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:387)
        at org.apache.parquet.column.impl.ColumnWriteStoreBase.flush(ColumnWriteStoreBase.java:186)
        at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:29)
        at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:203)
        at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:236)
        at org.apache.iceberg.data.GenericAppenderHelper.appendToLocalFile(GenericAppenderHelper.java:102)
        at org.apache.iceberg.data.GenericAppenderHelper.writeFile(GenericAppenderHelper.java:85)
        at org.apache.iceberg.mr.TestHelper.writeFile(TestHelper.java:114)
        at org.apache.iceberg.mr.hive.TestTables.appendIcebergTable(TestTables.java:295)
        at org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithMultipleCatalogs.createAndAddRecords(TestHiveIcebergStorageHandlerWithMultipleCatalogs.java:140)
        at org.apache.iceberg.mr.hive.TestHiveIcebergStorageHandlerWithMultipleCatalogs.testJoinTablesFromDifferentCatalogs(TestHiveIcebergStorageHandlerWithMultipleCatalogs.java:118)

@rdblue (Contributor) commented Jun 29, 2022

Looks like Hive might be bringing in a different copy of Parquet and that is conflicting in the test. We should be able to exclude Hive's Parquet version to work around this.
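A sketch of that workaround, assuming a Gradle build; the hive-exec coordinates and configuration name here are illustrative, and the actual module names in the Iceberg build may differ:

```groovy
dependencies {
    // Hypothetical: drop the Parquet artifacts the Hive test dependency
    // brings in, so the Parquet version Iceberg declares (1.12.3) is the
    // only one on the test classpath.
    testImplementation("org.apache.hive:hive-exec") {
        exclude group: "org.apache.parquet"
    }
}
```

The NoSuchMethodError above is the classic symptom of two Parquet versions on the classpath: the calling class was compiled against the new writePageHeader signature, but an older parquet-format jar is loaded first.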

@kbendick (Contributor, Author) replied:

> Looks like Hive might be bringing in a different copy of Parquet and that is conflicting in the test. We should be able to exclude Hive's Parquet version to work around this.

I'll give that a try.

@kbendick (Contributor, Author) commented Jul 6, 2022

Closed in favor of #5188, which gets this working.

@kbendick kbendick closed this Jul 6, 2022