
Improve performance of getRegionLogicalSizeInBytes#15843

Merged
highker merged 1 commit into prestodb:master from arunthirupathi:convert_string_dictionary_slowness on Mar 19, 2021

Conversation


@arunthirupathi arunthirupathi commented Mar 17, 2021

DictionaryBlock getRegionLogicalSizeInBytes is slow when abandoning
StringDictionaryEncoding. When abandoning string dictionary encoding,
the dictionary has a large number of keys (millions) and is converted
in segments of at most 1024 positions. getRegionLogicalSizeInBytes
allocates a size array as large as the number of keys as a cache.

With this change, the cached code path is used only when the number
of keys is less than the region length. The conversion ratio can be
tuned if required.
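The cached-versus-direct split described above can be sketched as follows. This is a minimal illustration under assumptions, not the actual Presto source: the class, method, and the per-entry size array `entrySizes` (standing in for `dictionary.getRegionLogicalSizeInBytes(id, 1)` calls) are hypothetical names.

```java
import java.util.Arrays;

public class RegionSizeSketch
{
    // Sketch: entrySizes[id] stands in for dictionary.getRegionLogicalSizeInBytes(id, 1).
    static long regionLogicalSize(int[] ids, long[] entrySizes, int offset, int length)
    {
        int dictionarySize = entrySizes.length;
        long sizeInBytes = 0;
        if (dictionarySize <= length) {
            // Cached path: the per-entry size is computed once and reused on repeats.
            long[] seenSizes = new long[dictionarySize];
            Arrays.fill(seenSizes, -1);
            for (int i = offset; i < offset + length; i++) {
                int id = ids[i];
                if (seenSizes[id] < 0) {
                    seenSizes[id] = entrySizes[id];
                }
                sizeInBytes += seenSizes[id];
            }
        }
        else {
            // Direct path: a small region over a huge dictionary (e.g. 1024 positions
            // over millions of keys) would waste the cache allocation, so skip it.
            for (int i = offset; i < offset + length; i++) {
                sizeInBytes += entrySizes[ids[i]];
            }
        }
        return sizeInBytes;
    }

    public static void main(String[] args)
    {
        long[] entrySizes = {10, 20, 30};
        int[] ids = {0, 1, 1, 2};
        long size = regionLogicalSize(ids, entrySizes, 0, 4);
        if (size != 80) {
            throw new AssertionError("expected 80, got " + size);
        }
        System.out.println(size);
    }
}
```

The direct path trades repeated lookups for avoiding an allocation proportional to the dictionary size, which is the win in the abandon-dictionary case.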

Before this change, abandoning a dictionary with 10 million keys
used to be 150 times slower than direct encoding. Now it is only
50 times slower (similar to dictionary encoding).

fixes #15506

Test plan: added a unit test for the non-cached code path.

== RELEASE NOTES ==

General Changes
* Improve Dictionary Block getRegionLogicalSizeInBytes performance

Collaborator

@sdruzkin sdruzkin Mar 18, 2021


tl;dr: so the idea is that it doesn't make sense to create a seenSizes cache if the cached seenSize value is not going to be reused, right?

Author


That is another way of looking at it, and that framing probably makes this simpler.

I was looking at it from the angle that calling getRegionLogicalSizeInBytes on a block has a cost, and maintaining the cache has a cost too.

@arunthirupathi arunthirupathi force-pushed the convert_string_dictionary_slowness branch 3 times, most recently from 6d86ae3 to dcf3302 Compare March 18, 2021 01:36

@highker highker left a comment


Block is too critical for Presto. Maybe worth some profiling given it's on the critical path.

Comment on lines 268 to 269


Let's maybe have a generic comment. Presto main engine doesn't have the concept of files.

Author


Removed the reference to files.



getRegionLogicalSizeInBytes is a very expensive call. How likely/useful would it be to have a lightweight hash map:

long sizeInBytes = 0;
int[] seenIds = new int[length];
long[] seenSize = new long[length];
Arrays.fill(seenIds, -1);
for (int i = positionOffset; i < positionOffset + length; i++) {
    int id = getId(i);
    int index = id % length;
    if (seenIds[index] == id) {
        sizeInBytes += seenSize[index];
    }
    else if (seenIds[index] == -1) {
        seenIds[index] = id;
        seenSize[index] = dictionary.getRegionLogicalSizeInBytes(id, 1);
        sizeInBytes += seenSize[index];
    }
    else {
        sizeInBytes += dictionary.getRegionLogicalSizeInBytes(getId(i), 1);
    }
}

Frankly, I don't know if this approach is going to be better or worse. Maybe worth profiling?

Author


The seenIds would need to be a regular HashMap. The dictionary ids could be 1, 4, 7 with length 3, in which case all of them collide on the same slot and the cache degenerates into direct calls. Adding an actual hash map will make any implementation much worse.

I looked at a few different implementations of getRegionLogicalSizeInBytes (VariableWidthBlock is cheap, but MapBlock is costly), so it is very hard to come up with a cost model, or even to profile this meaningfully.

What I have in mind is: instead of taking the in-place code path when length <= dictionarySize, we can take it when length * 2 <= dictionarySize, which assumes a cache hit is half as expensive as calling the actual method. I am OK with changing the condition to length * 5 <= dictionarySize, so the in-place code path is triggered only when the cache is not going to yield any benefit.

Another approach is to use a regular HashMap in the in-place code path and let it grow dynamically.

Author


Another approach I coded and discarded: change the dictionary writer to not call getRegionLogicalSizeInBytes and do the in-place evaluation there, since the use case is well understood. But I believe the fix here is better, as it improves the performance of Presto overall. I will push that change as well, just in case.

@arunthirupathi arunthirupathi force-pushed the convert_string_dictionary_slowness branch 2 times, most recently from 594af9d to 61dd293 Compare March 18, 2021 06:19
@arunthirupathi arunthirupathi requested a review from highker March 18, 2021 06:20
@arunthirupathi arunthirupathi force-pushed the convert_string_dictionary_slowness branch from 61dd293 to 0fc9cce Compare March 18, 2021 14:21
@arunthirupathi arunthirupathi force-pushed the convert_string_dictionary_slowness branch from 0fc9cce to e38594f Compare March 18, 2021 14:28
if (seenSizes[position] < 0) {
    seenSizes[position] = dictionary.getRegionLogicalSizeInBytes(position, 1);
}
sizeInBytes += seenSizes[position];
Collaborator

@sdruzkin sdruzkin Mar 18, 2021


Q regarding the original implementation: this is a dictionary, so do we really need to multiply the entry size by the number of times the entry appears?

If the answer is no, we can try to collect the used dictionary positions into an IntOpenHashSet from fastutil.

If yes, instead of a regular hash map you can use Int2IntOpenHashMap. It would really depend on the relation between the dictionary size and the length, and would still likely be more expensive.
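As a sketch of the map-based alternative, using java.util.HashMap as a plain-JDK stand-in for fastutil's primitive maps (the class, method, and `entrySizes` array are hypothetical names, not Presto code):

```java
import java.util.HashMap;
import java.util.Map;

public class MapCacheSketch
{
    // Sketch: entrySizes[id] stands in for dictionary.getRegionLogicalSizeInBytes(id, 1).
    static long regionLogicalSize(int[] ids, long[] entrySizes, int offset, int length)
    {
        // The map grows dynamically: memory is bounded by the number of
        // distinct ids seen in the region, not by the dictionary size.
        Map<Integer, Long> seenSizes = new HashMap<>();
        long sizeInBytes = 0;
        for (int i = offset; i < offset + length; i++) {
            int id = ids[i];
            sizeInBytes += seenSizes.computeIfAbsent(id, k -> entrySizes[k]);
        }
        return sizeInBytes;
    }

    public static void main(String[] args)
    {
        long[] entrySizes = {5, 7, 11};
        int[] ids = {2, 2, 0};
        long size = regionLogicalSize(ids, entrySizes, 0, 3);
        if (size != 27) {
            throw new AssertionError("expected 27, got " + size);
        }
        System.out.println(size);
    }
}
```

A primitive-specialized map such as fastutil's would avoid the boxing shown here, at the cost of an extra dependency; whether it beats the direct path still depends on the hit rate.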

Author


The logical size is meant to capture the underlying size of the data, which involves multiplying each entry size by the number of times it appears. The logical size is used in a few places where data is converted from dictionary encoding to direct encoding; one such place is abandoning dictionary encoding and switching to direct encoding.

I agree that fastutil's collections are better than java.util's in terms of CPU/memory. But in my opinion the code is simpler if, when the likelihood of a cache hit is very low, we take a non-cached path and avoid the cache completely.
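A tiny worked example of the distinction discussed above, with hypothetical numbers: the logical size counts every reference (what direct encoding would occupy), while a set of used positions would only count distinct entries.

```java
import java.util.HashSet;
import java.util.Set;

public class LogicalSizeExample
{
    public static void main(String[] args)
    {
        long[] entrySizes = {100, 200};  // two dictionary entries, hypothetical sizes
        int[] ids = {0, 0, 0, 1};        // entry 0 is referenced three times

        // Logical size: each reference counts, since this is what the data
        // would occupy after conversion to direct encoding.
        long logical = 0;
        for (int id : ids) {
            logical += entrySizes[id];
        }

        // Set-based size: only distinct entries count.
        Set<Integer> used = new HashSet<>();
        for (int id : ids) {
            used.add(id);
        }
        long distinct = 0;
        for (int id : used) {
            distinct += entrySizes[id];
        }

        if (logical != 500) {
            throw new AssertionError("expected 500, got " + logical);
        }
        if (distinct != 300) {
            throw new AssertionError("expected 300, got " + distinct);
        }
        System.out.println(logical + " " + distinct);
    }
}
```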

@highker highker merged commit 7c7930e into prestodb:master Mar 19, 2021
@varungajjala varungajjala mentioned this pull request Mar 23, 2021
6 tasks


Successfully merging this pull request may close these issues.

Abandoning String Dictionary is very slow
