Fix writing bloom filter for string columns in orc #11982
Is there a way to test it?
How was it broken? (Please improve the commit message.)
trino/lib/trino-orc/src/main/java/io/trino/orc/OrcWriter.java
Lines 386 to 406 in 2df56a0
SliceDictionaryColumnWriter#tryConvertToDirect clears rowGroups at the end of the dictionary compression optimization. The index streams are built from rowGroups after that optimization, so the bloom filter streams are lost.
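The failure mode described here can be illustrated with a small toy model. This is only a sketch: `ToyDictionaryWriter`, `RowGroup`, and the `preserveBloomFilters` flag are hypothetical names invented for illustration and are not the Trino ORC writer classes. The "preserve" branch shows one possible way to keep the filters and is not necessarily how this PR fixes the problem.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: every class here is hypothetical, not Trino code.
public class BloomFilterFallbackSketch
{
    // Stand-in for a row group; the "bloom filter" is just the set of values seen.
    static final class RowGroup
    {
        final Set<String> bloomFilter = new HashSet<>();
    }

    static final class ToyDictionaryWriter
    {
        private final boolean preserveBloomFilters;
        private final List<RowGroup> rowGroups = new ArrayList<>();
        private final List<Set<String>> savedBloomFilters = new ArrayList<>();

        ToyDictionaryWriter(boolean preserveBloomFilters)
        {
            this.preserveBloomFilters = preserveBloomFilters;
        }

        void writeRowGroup(String... values)
        {
            RowGroup group = new RowGroup();
            for (String value : values) {
                group.bloomFilter.add(value);
            }
            rowGroups.add(group);
        }

        // Models the fallback to direct encoding: row groups are cleared at the end.
        void tryConvertToDirect()
        {
            if (preserveBloomFilters) {
                // Assumed remedy: copy the filters out before the row groups are dropped.
                for (RowGroup group : rowGroups) {
                    savedBloomFilters.add(group.bloomFilter);
                }
            }
            rowGroups.clear();
        }

        // Models building the bloom filter streams after the conversion.
        List<Set<String>> getBloomFilters()
        {
            List<Set<String>> result = new ArrayList<>(savedBloomFilters);
            for (RowGroup group : rowGroups) {
                result.add(group.bloomFilter);
            }
            return result;
        }
    }

    // Writes two row groups, forces the fallback, and counts the surviving filters.
    static int bloomFilterCountAfterFallback(boolean preserveBloomFilters)
    {
        ToyDictionaryWriter writer = new ToyDictionaryWriter(preserveBloomFilters);
        writer.writeRowGroup("a", "b");
        writer.writeRowGroup("c");
        writer.tryConvertToDirect();
        return writer.getBloomFilters().size();
    }

    public static void main(String[] args)
    {
        System.out.println(bloomFilterCountAfterFallback(false)); // filters lost
        System.out.println(bloomFilterCountAfterFallback(true));  // filters kept
    }
}
```

In this toy model, reading the filters only from the cleared row groups yields none, while copying them out before the clear keeps one filter per row group.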
@dain could you review this change? I don't have enough context here.
Praveen2112
left a comment
Can we add a test to ensure that a bloom filter is written for the supported data types?
@ans76 Can you please apply the comments?
lib/trino-orc/src/test/java/io/trino/orc/TestWriteBloomFilter.java
DataProvider can be below the test method
Can we inline this method?
Use a static import for OrcWriteValidationMode.
Do we need to check it for all validation modes?
Can we use a specific name for the column instead of test?
@alexjo2144 and @raunaqmorarka Can you PTAL?
raunaqmorarka
left a comment
Please update commit title to
Fix writing bloom filter for string columns in orc
Commit message
Ensures that bloom filters are written when the writer falls back to
direct encoding due to dictionary becoming too large.
Also, please squash your commits.
The triggering conditions for the bug seem to be the dictionary becoming too large and the writer falling back to non-dictionary encoding. But I don't see any tweaks to the ORC writer config (like reducing hive.orc.writer.dictionary-max-memory) that would guarantee that the conditions of the bug are reproduced.
It seems to me that it would be simpler to test this in TestSliceDictionaryColumnWriter for the CHAR, VARCHAR and STRING types by calling tryConvertToDirect and getBloomFilters directly in the unit test.
3840d20 to 495b045
3747c21 to 338d973
raunaqmorarka
left a comment
Please squash your commits; LGTM otherwise.
Ensures that bloom filters are written when the writer falls back to direct encoding due to dictionary becoming too large

Co-Authored-By: Raunaq Morarka <raunaqmorarka@users.noreply.github.com>
338d973 to aa9b8dd
Merged! Thanks for fixing this.
Description
Documentation
( ) No documentation is needed.
Release notes