Reinitialize the column writers when flushing a row group#10418
Reinitialize the column writers when flushing a row group#10418findepi merged 2 commits intotrinodb:masterfrom
Conversation
3d8d389 to
89667af
Compare
89667af to
7cadf0f
Compare
There was a problem hiding this comment.
This bugfix follows the same strategy as in parquet-mr project:
After the row group has been flushed, the column writers get reinitialized from scratch.
The FallbackValuesWriter doesn't reset both of its writers within the #reset() method and this behavior leads to the unwanted behavior documented in #5518 (comment)
One concern which I have with this approach is that the reinitialisation of the writers may affect the performance. I'd need in this regard feedback.
There was a problem hiding this comment.
Comparing https://github.com/trinodb/trino/runs/4650149952?check_suite_focus=true (a test run of this branch) and
https://github.com/trinodb/trino/runs/4649720577?check_suite_focus=true (a build on master) on TestFullParquetReader i see differences of about 20% in favor of the master implementation compared to the current bugfix implementation (for thetrino-hive, trino-parquet tests).
I ran again the tests https://github.com/trinodb/trino/runs/4652707273?check_suite_focus=true and found out no major difference between the runs. Actually in most of the cases, the tests were running faster on the branch bugfix/parquet-9749 than on master
The test TestFullParquetReader.testNestedStructs took 1m / 2m on the bugfix branch and 8 seconds on master. I'm not sure whether this is relevant or is it linked to the github build runner.
There was a problem hiding this comment.
The FallbackValuesWriter doesn't reset both of its writers within the #reset() method and this behavior leads to the unwanted behavior documented in #5518 (comment)
When does the FallbackValuesWriter get used?
One concern which I have with this approach is that the reinitialisation of the writers may affect the performance. I'd need in this regard feedback.
That should not be much of a problem. It happens once per row group and it doesn't seem to be a very expensive operation.
There was a problem hiding this comment.
When does the FallbackValuesWriter get used?
It gets used as a fallback for the situation where the dictionary encoding grows too big.
7cadf0f to
7b4e5e0
Compare
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergParquetConnectorTest.java
Outdated
Show resolved
Hide resolved
23bf3a8 to
e1b6c09
Compare
|
What about changing the top three The rest of the fields should reset/clear fine, it's just the ValueWriters that seem to have a problem. If that's not worth it, I think the |
@alexjo2144 I'm not sure that I follow what you mean. |
|
Sorry, my thought was can we either:
or if not
|
e1b6c09 to
682eb69
Compare
@alexjo2144 Indeed, this method is now no longer needed. I'm still investigating also the refactoring possibility:
Taken from the Parquet documentation https://parquet.apache.org/documentation/latest/
I'm thinking that in such cases there is no really a big overhead of creating again from scratch the writers in case of dealing with a new row group. |
skrzypo987
left a comment
There was a problem hiding this comment.
This looks ok.
However, my knowledge about writing Parquet files is non-existent so I can't be the one to approve anything here.
|
Thanks @skrzypo987 for taking a look |
Reinitialize the column writers when flushing a row group
The previous strategy involved in doing the bugfix was:
However, this strategy collides with #5518 (comment) reason why now the column writers get reinitialized from scratch.
Fixes #9749