Skip to content

Fix writing bloom filter for string columns in orc#11982

Merged
Praveen2112 merged 1 commit intotrinodb:masterfrom
ans76:orc-slice-direct-use-dict-row-group-bloom-filter
Jun 22, 2022
Merged

Fix writing bloom filter for string columns in orc#11982
Praveen2112 merged 1 commit intotrinodb:masterfrom
ans76:orc-slice-direct-use-dict-row-group-bloom-filter

Conversation

@ans76
Copy link
Copy Markdown
Contributor

@ans76 ans76 commented Apr 15, 2022

Description

Is this change a fix, improvement, new feature, refactoring, or other?
fix #11757

Documentation

( ) No documentation is needed.

Release notes

  • Fix writing bloom filter for String columns in ORC

@cla-bot cla-bot bot added the cla-signed label Apr 15, 2022
@findepi findepi requested review from Praveen2112, dain and sopel39 April 15, 2022 12:53
@sopel39 sopel39 requested a review from skrzypo987 April 15, 2022 13:44
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to test it?

How it was broken? (improve commie message)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// convert any dictionary encoded column with a low compression ratio to direct
dictionaryCompressionOptimizer.finalOptimize(bufferedBytes);
columnWriters.forEach(ColumnWriter::close);
List<OrcDataOutput> outputData = new ArrayList<>();
List<Stream> allStreams = new ArrayList<>(columnWriters.size() * 3);
// get index streams
long indexLength = 0;
for (ColumnWriter columnWriter : columnWriters) {
for (StreamDataOutput indexStream : columnWriter.getIndexStreams(metadataWriter)) {
// The ordering is critical because the stream only contain a length with no offset.
outputData.add(indexStream);
allStreams.add(indexStream.getStream());
indexLength += indexStream.size();
}
for (StreamDataOutput bloomFilter : columnWriter.getBloomFilters(metadataWriter)) {
outputData.add(bloomFilter);
allStreams.add(bloomFilter.getStream());
indexLength += bloomFilter.size();

SliceDictionaryColumnWriter#tryConvertToDirect will clear rowGroups finally in dictionary compression optimization, index streams use the rowGroups after optimization that result in losing bloom filter streams

@sopel39 sopel39 requested review from sopel39 and removed request for sopel39 April 20, 2022 08:50
@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Apr 20, 2022

@dain could you review this change. I don't have enough context here

Copy link
Copy Markdown
Member

@Praveen2112 Praveen2112 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add test to ensure that bloom filter is written for supported datatypes ?

@Praveen2112
Copy link
Copy Markdown
Member

@ans76 Can you please apply the comments.

@ans76 ans76 requested a review from Praveen2112 May 25, 2022 02:45
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataProvider can be below the test method

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we inline this method ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static import for OrcWriteValidationMode

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check it for all validation modes ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we specific name for column instead of test

@Praveen2112
Copy link
Copy Markdown
Member

@alexjo2144 and @raunaqmorarka Can you PTAL ?

Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update commit title to
Fix writing bloom filter for string columns in orc
Commit message

Ensures that bloom filters are written when the writer falls back to 
direct encoding due to dictionary becoming too large.

Also, please squash your commits.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The triggering conditions for the bug seems to be dictionary becoming too large and the writer falling back to non-dictionary encoding. But I don't see any tweaks to orc writer config (like reducing hive.orc.writer.dictionary-max-memory) which would guarantee that the conditions of the bug are reproduced.
It seems to me that it would be simpler to just test this in TestSliceDictionaryColumnWriter for CHAR, VARCHAR and STRING types and just call tryConvertToDirect and getBloomFilters directly in the unit test.

@ans76 ans76 changed the title Fix use DictionaryRowGroup statistics if convert to direct writer Fix writing bloom filter for string columns in orc Jun 7, 2022
@ans76 ans76 force-pushed the orc-slice-direct-use-dict-row-group-bloom-filter branch 2 times, most recently from 3840d20 to 495b045 Compare June 7, 2022 10:12
@ans76 ans76 force-pushed the orc-slice-direct-use-dict-row-group-bloom-filter branch 4 times, most recently from 3747c21 to 338d973 Compare June 21, 2022 14:43
Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please squash your commits, lgtm otherwise

Ensures that bloom filters are written when the writer falls
back to direct encoding due to dictionary becoming too large

Co-Authored-By: Raunaq Morarka <raunaqmorarka@users.noreply.github.com>
@ans76 ans76 force-pushed the orc-slice-direct-use-dict-row-group-bloom-filter branch from 338d973 to aa9b8dd Compare June 22, 2022 07:14
@Praveen2112 Praveen2112 merged commit 2e050f2 into trinodb:master Jun 22, 2022
@Praveen2112
Copy link
Copy Markdown
Member

Merged !! Thanks for fixing this

@github-actions github-actions bot added this to the 387 milestone Jun 22, 2022
@ans76 ans76 deleted the orc-slice-direct-use-dict-row-group-bloom-filter branch June 22, 2022 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Does trino support bloom filter for varchar columns for ORC

4 participants