
Implement verification for optimized parquet writer#13246

Merged
raunaqmorarka merged 4 commits into trinodb:master from raunaqmorarka:pq-verify
Sep 13, 2022

Conversation


@raunaqmorarka raunaqmorarka commented Jul 20, 2022

Description

Implements verification of file footer, row count and
checksum of columns.
Added a config property parquet.optimized-writer.validation-percentage and
a session property in the hive connector to control the percentage of
written files that will be verified.
By default, we will verify 5% of written files.
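For reference, a minimal sketch of the configuration (the property name is taken from this PR description; the value is the percentage of written files to validate):

```properties
# hive.properties: validate 5% of files written by the optimized
# parquet writer (this is the default, so the line is optional)
parquet.optimized-writer.validation-percentage=5.0
```

Setting the value to 0 disables validation, and 100 validates every written file (with the performance cost noted in the benchmark below).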

Is this change a fix, improvement, new feature, refactoring, or other?

new feature

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

optimized parquet writer

How would you describe this change to a non-technical end user or system administrator?

Implements verification of files written by the optimized parquet writer in the hive connector.

Fixes #5356

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Hive
* Implement verification of files written by optimized parquet writer. ({issue}`13246`)


@sopel39 sopel39 left a comment


@sopel39 sopel39 requested a review from lukasz-stec August 2, 2022 15:29

@skrzypo987 skrzypo987 left a comment


lgtm % I'm not an expert in ORC format

Member


Consider adding test cases for both the mismatch and the matching cases, for this and the other methods.

Member


ping


@lukasz-stec lukasz-stec left a comment


I added some comments.
I also have two high-level objections.

  1. There seems to be a lot of code copied between the orc, rcfile and parquet write validation. It would be a lot cleaner to have it extracted to a common place.
  2. This change modifies ParquetReader and related classes significantly, which in theory could be avoided (though it may be hard in practice).

Member


Why is the implementation different from io.trino.orc.ValidationHash#mapSkipNullKeysHash?

Member Author


I'm not sure why there is even an isNull check in mapSkipNullKeysHash, given that the null case is already handled by the hash method.
Maybe @dain knows the reason behind that?
The xor was avoided so that the function does not give the same result when the key and value are flipped.
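The point about xor can be illustrated with a toy sketch (hypothetical combiner functions, not the actual ValidationHash code): xor-ing key and value hashes is symmetric, so an entry whose key and value were swapped during a buggy write would still produce the same checksum, while an order-sensitive combination detects the swap.

```java
public class MapEntryHashDemo
{
    // Hypothetical entry-hash combiners for illustration only
    static long xorCombine(long keyHash, long valueHash)
    {
        return keyHash ^ valueHash; // symmetric: swapping key and value is undetectable
    }

    static long orderedCombine(long keyHash, long valueHash)
    {
        return 31 * keyHash + valueHash; // order-sensitive: a swap changes the result
    }

    public static void main(String[] args)
    {
        System.out.println(xorCombine(3, 7) == xorCombine(7, 3));         // true
        System.out.println(orderedCombine(3, 7) == orderedCombine(7, 3)); // false
    }
}
```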


@sopel39 sopel39 left a comment


lgtm % comments

Member


ping

Added ParquetBlockFactory along similar lines as OrcBlockFactory to
handle creation of lazy blocks and addition of connector specific
error codes to exceptions.
This change makes it possible for the writer to perform validation
without having to rely on ConnectorPageSource.
The logic here does not have a dependency on hive, and it is needed
here to allow ParquetWriter to create a ParquetReader for
verification of the written file.
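The lazy-block idea mentioned in the commit message can be sketched generically (this is not the Trino ParquetBlockFactory API, just an illustration of deferring column materialization until first access):

```java
import java.util.function.Supplier;

// Generic memoizing lazy loader, illustrating the idea behind lazy blocks:
// the expensive decode runs only when the value is first requested.
class Lazy<T>
{
    private Supplier<T> loader;
    private T value;

    Lazy(Supplier<T> loader)
    {
        this.loader = loader;
    }

    T get()
    {
        if (loader != null) {
            value = loader.get();
            loader = null; // drop the loader; later calls reuse the cached value
        }
        return value;
    }
}

public class LazyBlockDemo
{
    public static void main(String[] args)
    {
        Lazy<int[]> column = new Lazy<>(() -> {
            System.out.println("decoding column"); // runs once, on first access
            return new int[] {1, 2, 3};
        });
        System.out.println(column.get().length); // triggers decoding, prints 3
        System.out.println(column.get().length); // reuses the cached value, prints 3
    }
}
```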
@raunaqmorarka
Member Author

Optimized parquet writer verification inserts benchmark.pdf
Perf impact with 5% verification (the current default) is around 2-3%.
Perf impact with 100% verification would be around 45%.

Implements verification of file footer, row count, nulls count and
checksum of columns.
Added a config parquet.optimized-writer.validation-percentage and
session property in hive connector to control the percentage of
written files that will be verified.
@raunaqmorarka raunaqmorarka merged commit 489bc1e into trinodb:master Sep 13, 2022
@raunaqmorarka raunaqmorarka deleted the pq-verify branch September 13, 2022 02:50
@github-actions github-actions bot added this to the 396 milestone Sep 13, 2022

findepi commented Sep 13, 2022

Per #14047 (comment):
is this enabled in the Hive connector only, so that Iceberg/Delta (which also use the optimized writer) do not run the verification?

@raunaqmorarka
Member Author

Per #14047 (comment):
is this enabled in the Hive connector only, so that Iceberg/Delta (which also use the optimized writer) do not run the verification?

Right, this PR implements parquet writer verification only for the hive connector. It should be straightforward to add it to delta lake and iceberg as well, but I don't know if we actually want to run it there, given that we've already been using the new writer without verification in those connectors.

{
    int batchSize = nextBatch();
    if (batchSize <= 0) {
        return null;
Member


under what circumstances is this possible?

btw maybe we should throw here?
returning null will cause ParquetPageSource to close and not read anything anymore:

if (closed || page == null) {
    close();
    return null;

cc @alexjo2144 @homar

Member Author


This is just preserving the old code's behaviour

int batchSize = parquetReader.nextBatch();
if (closed || batchSize <= 0) {
    close();
    return null;
}

batchSize can be -1 on reaching the end of the list of row groups to read, which is okay.
batchSize can be 0 on encountering empty row groups. That should happen only for an empty file. It's possible that some buggy writer writes empty row groups in the middle of a file with data; in that case we will stop early.
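The stop-early behaviour can be sketched with a stand-in reader (FakeReader is hypothetical; per the reply above, nextBatch() returns the next batch size, 0 for an empty row group, and -1 when the row groups are exhausted):

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for ParquetReader.nextBatch()
class FakeReader
{
    private final Iterator<Integer> sizes;

    FakeReader(List<Integer> sizes)
    {
        this.sizes = sizes.iterator();
    }

    int nextBatch()
    {
        return sizes.hasNext() ? sizes.next() : -1;
    }
}

public class BatchLoopDemo
{
    public static void main(String[] args)
    {
        // A 100-row group, then a (buggy) empty row group, then a 50-row group
        FakeReader reader = new FakeReader(List.of(100, 0, 50));
        int total = 0;
        int batchSize;
        // Mirrors the page source: reading stops at the first batchSize <= 0,
        // so the 50-row group after the empty one is never read.
        while ((batchSize = reader.nextBatch()) > 0) {
            total += batchSize;
        }
        System.out.println(total); // prints 100
    }
}
```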


Development

Successfully merging this pull request may close these issues.

Add verification to new Parquet writer

5 participants