Implement verification for optimized parquet writer #13246
raunaqmorarka merged 4 commits into trinodb:master
skrzypo987
left a comment
lgtm % I'm not an expert in ORC format
consider adding test cases for mismatch and equal cases for this and other methods
lukasz-stec
left a comment
I added some comments.
I also have two high-level objections.
- There seems to be a lot of code copied between ORC, RCFile and Parquet write validation. It would be a lot cleaner to extract it to a common place.
- This change modifies ParquetReader and related classes significantly, which in theory could be avoided (may be hard in practice though).
Why is the implementation different from io.trino.orc.ValidationHash#mapSkipNullKeysHash?
I'm not sure why there is even an isNull check in mapSkipNullKeysHash, given that the null case is already handled by the hash method.
Maybe @dain knows the reason behind that?
The xor was dropped so that the function does not give the same result when the key and value are flipped.
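To illustrate the point about xor, here is a minimal sketch of the difference between a symmetric and an order-sensitive combination of a key hash and a value hash. The class name, method names and the multiplier 31 are assumptions made for illustration; this is not the code from ValidationHash.

```java
// Hypothetical helper, not part of Trino: contrasts two ways of
// combining a key hash with a value hash for a map entry.
final class MapEntryHashSketch
{
    private MapEntryHashSketch() {}

    // Symmetric: swapping the key hash and the value hash gives the same result,
    // so an entry (a -> b) cannot be told apart from (b -> a).
    static long xorCombine(long keyHash, long valueHash)
    {
        return keyHash ^ valueHash;
    }

    // Order-sensitive: the key hash is scaled before the value hash is added,
    // so flipping key and value generally changes the result.
    static long orderedCombine(long keyHash, long valueHash)
    {
        return keyHash * 31 + valueHash;
    }
}
```

With key hash 1 and value hash 2, xorCombine returns 3 in both orders, while orderedCombine returns 33 for (1, 2) and 63 for (2, 1).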
Added ParquetBlockFactory along similar lines as OrcBlockFactory to handle creation of lazy blocks and addition of connector specific error codes to exceptions. This change makes it possible for the writer to perform validation without having to rely on ConnectorPageSource.
The logic here has no dependency on hive, and it is needed here so that ParquetWriter can create a ParquetReader to verify the written file.
Optimized parquet writer verification inserts benchmark.pdf
Implements verification of file footer, row count, nulls count and checksum of columns. Added a config parquet.optimized-writer.validation-percentage and session property in hive connector to control the percentage of written files that will be verified.
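As a rough sketch of what checking row count, null count and a column checksum can look like, the following accumulates the three values for one column and compares the written side against the read-back side. All names here (WriteValidationSketch, ColumnStats, statsOf, validate) and the checksum formula are assumptions for illustration; the actual logic lives in ParquetWriteValidation and is not reproduced here.

```java
// Hypothetical sketch, not Trino code: per-column statistics gathered while
// writing are compared with the same statistics gathered while reading back.
final class WriteValidationSketch
{
    private WriteValidationSketch() {}

    static final class ColumnStats
    {
        long rowCount;
        long nullCount;
        long checksum;

        void add(Long value)
        {
            rowCount++;
            if (value == null) {
                nullCount++;
                return;
            }
            // Order-sensitive running checksum over non-null values
            checksum = checksum * 31 + Long.hashCode(value);
        }
    }

    static ColumnStats statsOf(Long... values)
    {
        ColumnStats stats = new ColumnStats();
        for (Long value : values) {
            stats.add(value);
        }
        return stats;
    }

    // Throws if the data read back does not match what was written
    static void validate(ColumnStats written, ColumnStats readBack)
    {
        if (written.rowCount != readBack.rowCount) {
            throw new IllegalStateException("Row count mismatch");
        }
        if (written.nullCount != readBack.nullCount) {
            throw new IllegalStateException("Null count mismatch");
        }
        if (written.checksum != readBack.checksum) {
            throw new IllegalStateException("Checksum mismatch");
        }
    }
}
```

Calling validate(statsOf(1L, null, 3L), statsOf(1L, null, 3L)) returns normally, while comparing against statsOf(1L, null, 4L) throws because the checksums differ.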
Per #14047 (comment)
Right, this PR implements parquet writer verification only for hive connector. It should be straightforward to add it to delta lake and iceberg as well, but I don't know if we actually want to run it there given that we've already been using the new writer without verification in those connectors.
    {
        int batchSize = nextBatch();
        if (batchSize <= 0) {
            return null;
under what circumstances is this possible?
btw maybe we should throw here?
returning null will cause ParquetPageSource to close and not read anything anymore:
This is just preserving the old code's behaviour
    int batchSize = parquetReader.nextBatch();
    if (closed || batchSize <= 0) {
        close();
        return null;
    }
batchSize can be -1 on reaching end of list of row groups to read, which is okay.
batchSize can be 0 on encountering empty row groups. This should be encountered only for an empty file. It's possible that some buggy writer writes empty row groups in the middle of a file with data, in that case we will stop early.
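The early-stop behaviour described above can be sketched with a small loop. BatchLoopSketch and readAll are hypothetical names, and the IntSupplier stands in for the reader's nextBatch(); this is only an illustration of the control flow, not the ParquetReader code.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch, not Trino code: a read loop that treats any
// non-positive batch size as the end of the data.
final class BatchLoopSketch
{
    private BatchLoopSketch() {}

    // Returns the number of rows consumed before the first batch size <= 0
    static long readAll(IntSupplier nextBatch)
    {
        long rows = 0;
        while (true) {
            int batchSize = nextBatch.getAsInt();
            if (batchSize <= 0) {
                // -1 signals no more row groups, 0 signals an empty row group;
                // both stop the read, so data after an empty row group is skipped
                return rows;
            }
            rows += batchSize;
        }
    }
}
```

Feeding the batch sizes 3, 5, -1 yields 8 rows, whereas 3, 0, 5, -1 yields only 3 rows, which is the "stop early" case for a file with an empty row group in the middle.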
Description
Implements verification of file footer, row count and checksum of columns.
Added a config parquet.optimized-writer.validation-percentage and session property in hive connector to control the percentage of written files that will be verified.
By default, we will verify 5% of written files.
new feature
optimized parquet writer
Implements verification of files written by optimized parquet writer in hive connector.
Fixes #5356
Documentation
( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text: