
Implement verification for optimized parquet writer#13246

Merged
raunaqmorarka merged 4 commits into trinodb:master from raunaqmorarka:pq-verify
Sep 13, 2022

Conversation


@raunaqmorarka raunaqmorarka commented Jul 20, 2022

Description

Implements verification of file footer, row count and
checksum of columns.
Added a config property parquet.optimized-writer.validation-percentage and
a session property in the hive connector to control the percentage of
written files that will be verified.
By default, we will verify 5% of written files.
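For reference, a minimal sketch of the configuration (the property name is taken from this PR description; the value is the percentage of written files to validate):

```properties
# hive.properties: validate 5% of files written by the optimized
# parquet writer (this is the default, so the line is optional)
parquet.optimized-writer.validation-percentage=5.0
```

Setting the value to 0 disables validation, and 100 validates every written file (with the performance cost noted in the benchmark below).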

Is this change a fix, improvement, new feature, refactoring, or other?

new feature

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

optimized parquet writer

How would you describe this change to a non-technical end user or system administrator?

Implements verification of files written by the optimized parquet writer in the hive connector.

Fixes #5356

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Hive
* Implement verification of files written by optimized parquet writer. ({issue}`13246`)


@sopel39 sopel39 left a comment


@sopel39 sopel39 requested a review from lukasz-stec August 2, 2022 15:29

@skrzypo987 skrzypo987 left a comment


lgtm % I'm not an expert in ORC format

Member


Consider adding test cases for both the mismatch and the matching cases, for this and the other methods.

Member


ping


@lukasz-stec lukasz-stec left a comment


I added some comments.
I also have two high-level objections.

  1. There seems to be a lot of code copied between the orc, rcfile and parquet write validation. It would be a lot cleaner to have it extracted to a common place.
  2. This change modifies ParquetReader and related classes significantly, which in theory could be avoided (though it may be hard in practice).

Member


Why is the implementation different from io.trino.orc.ValidationHash#mapSkipNullKeysHash?

Member Author


I'm not sure why there is even an isNull check in mapSkipNullKeysHash, given that the null case is already handled by the hash method.
Maybe @dain knows the reason behind that?
The xor was avoided so that the function does not give the same result when the key and value are flipped.
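The point about xor can be illustrated with a toy sketch (hypothetical combiner functions, not the actual ValidationHash code): xor-ing key and value hashes is symmetric, so an entry whose key and value were swapped during a buggy write would still produce the same checksum, while an order-sensitive combination detects the swap.

```java
public class MapEntryHashDemo
{
    // Hypothetical entry-hash combiners for illustration only
    static long xorCombine(long keyHash, long valueHash)
    {
        return keyHash ^ valueHash; // symmetric: swapping key and value is undetectable
    }

    static long orderedCombine(long keyHash, long valueHash)
    {
        return 31 * keyHash + valueHash; // order-sensitive: a swap changes the result
    }

    public static void main(String[] args)
    {
        System.out.println(xorCombine(3, 7) == xorCombine(7, 3));         // true
        System.out.println(orderedCombine(3, 7) == orderedCombine(7, 3)); // false
    }
}
```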


@sopel39 sopel39 left a comment


lgtm % comments

Member


ping

Added ParquetBlockFactory along similar lines as OrcBlockFactory to
handle creation of lazy blocks and addition of connector specific
error codes to exceptions.
This change makes it possible for the writer to perform validation
without having to rely on ConnectorPageSource.
The logic here does not have a dependency on hive, and it is needed
here to allow ParquetWriter to create a ParquetReader for
verification of the written file.
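The lazy-block idea mentioned in the commit message can be sketched generically (this is not the Trino ParquetBlockFactory API, just an illustration of deferring column materialization until first access):

```java
import java.util.function.Supplier;

// Generic memoizing lazy loader, illustrating the idea behind lazy blocks:
// the expensive decode runs only when the value is first requested.
class Lazy<T>
{
    private Supplier<T> loader;
    private T value;

    Lazy(Supplier<T> loader)
    {
        this.loader = loader;
    }

    T get()
    {
        if (loader != null) {
            value = loader.get();
            loader = null; // drop the loader; later calls reuse the cached value
        }
        return value;
    }
}

public class LazyBlockDemo
{
    public static void main(String[] args)
    {
        Lazy<int[]> column = new Lazy<>(() -> {
            System.out.println("decoding column"); // runs once, on first access
            return new int[] {1, 2, 3};
        });
        System.out.println(column.get().length); // triggers decoding, prints 3
        System.out.println(column.get().length); // reuses the cached value, prints 3
    }
}
```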
@raunaqmorarka
Member Author

Optimized parquet writer verification inserts benchmark.pdf
Perf impact with 5% verification (the current default) is around 2-3%.
Perf impact with 100% verification would be around 45%.

Implements verification of file footer, row count, nulls count and
checksum of columns.
Added a config parquet.optimized-writer.validation-percentage and
session property in hive connector to control the percentage of
written files that will be verified.
@raunaqmorarka raunaqmorarka merged commit 489bc1e into trinodb:master Sep 13, 2022
@raunaqmorarka raunaqmorarka deleted the pq-verify branch September 13, 2022 02:50
@github-actions github-actions bot added this to the 396 milestone Sep 13, 2022

findepi commented Sep 13, 2022

Per #14047 (comment):
is this enabled in the Hive connector only, so that Iceberg/Delta (which also use the optimized writer) do not run the verification?

@raunaqmorarka
Member Author

Per #14047 (comment):
is this enabled in the Hive connector only, so that Iceberg/Delta (which also use the optimized writer) do not run the verification?

Right, this PR implements parquet writer verification only for the hive connector. It should be straightforward to add it to delta lake and iceberg as well, but I don't know if we actually want to run it there, given that we've already been using the new writer without verification in those connectors.

{
    int batchSize = nextBatch();
    if (batchSize <= 0) {
        return null;
Member


under what circumstances is this possible?

btw maybe we should throw here?
returning null will cause ParquetPageSource to close and not read anything anymore:

if (closed || page == null) {
    close();
    return null;

cc @alexjo2144 @homar

Member Author


This is just preserving the old code's behaviour

int batchSize = parquetReader.nextBatch();
if (closed || batchSize <= 0) {
    close();
    return null;
}

batchSize can be -1 on reaching the end of the list of row groups to read, which is okay.
batchSize can be 0 on encountering empty row groups. That should happen only for an empty file. It's possible that some buggy writer writes empty row groups in the middle of a file with data; in that case we will stop early.
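The stop-early behaviour can be sketched with a stand-in reader (FakeReader is hypothetical; per the reply above, nextBatch() returns the next batch size, 0 for an empty row group, and -1 when the row groups are exhausted):

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for ParquetReader.nextBatch()
class FakeReader
{
    private final Iterator<Integer> sizes;

    FakeReader(List<Integer> sizes)
    {
        this.sizes = sizes.iterator();
    }

    int nextBatch()
    {
        return sizes.hasNext() ? sizes.next() : -1;
    }
}

public class BatchLoopDemo
{
    public static void main(String[] args)
    {
        // A 100-row group, then a (buggy) empty row group, then a 50-row group
        FakeReader reader = new FakeReader(List.of(100, 0, 50));
        int total = 0;
        int batchSize;
        // Mirrors the page source: reading stops at the first batchSize <= 0,
        // so the 50-row group after the empty one is never read.
        while ((batchSize = reader.nextBatch()) > 0) {
            total += batchSize;
        }
        System.out.println(total); // prints 100
    }
}
```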


Development

Successfully merging this pull request may close these issues.

Add verification to new Parquet writer

5 participants