Improve performance of getRegionLogicalSizeInBytes #15843
highker merged 1 commit into prestodb:master
Conversation
tl;dr: so the idea is that it doesn't make sense to create a seenSizes cache if the cached seenSize value is not going to be reused, right?
That is another way of looking at it, and that framing probably makes this simpler.
I was looking at it from the perspective that there is a cost to calling getRegionLogicalSizeInBytes on a block, and there is also a cost to the cache maintenance.
presto-main/src/test/java/com/facebook/presto/block/TestDictionaryBlock.java
Force-pushed 6d86ae3 to dcf3302
highker left a comment
Block is too critical for Presto. Maybe worth some profiling given it's on the critical path.
Let's maybe have a generic comment. Presto main engine doesn't have the concept of files.
Remove the reference to file
getRegionLogicalSizeInBytes is a very expensive call. How likely/useful would it be to have a light-weight hashmap:

int[] seenIds = new int[length];
long[] seenSize = new long[length];
Arrays.fill(seenIds, -1);
for (int i = positionOffset; i < positionOffset + length; i++) {
    int id = getId(i);
    int index = id % length;
    if (seenIds[index] == id) {
        sizeInBytes += seenSize[index];
    }
    else if (seenIds[index] == -1) {
        seenIds[index] = id;
        seenSize[index] = dictionary.getRegionLogicalSizeInBytes(id, 1);
        sizeInBytes += seenSize[index];
    }
    else {
        sizeInBytes += dictionary.getRegionLogicalSizeInBytes(getId(i), 1);
    }
}

Frankly, I don't know if this approach is going to be better or worse. Maybe worth profiling?
The seenIds would need to be a regular HashMap. The dictionary indexes could be 1, 4, 7 with length 3, in which case all of them will collide and produce wrong results. Adding an actual hash map will make any implementation much worse.
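To make the collision concern concrete, here is a minimal standalone sketch (not Presto code) showing that the proposed `id % length` slot assignment maps ids 1, 4, and 7 to the same slot when the region length is 3:

```java
public class SeenIdsCollisionDemo {
    public static void main(String[] args) {
        // Dictionary ids referenced by a region of length 3.
        int[] ids = {1, 4, 7};
        int length = ids.length;
        for (int id : ids) {
            // id % length is the proposed slot; all three ids land in slot 1.
            System.out.println("id " + id + " -> slot " + (id % length));
        }
        // Only the first id to claim slot 1 would be cached; the other two
        // would fall through to the expensive getRegionLogicalSizeInBytes
        // call on every occurrence.
    }
}
```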
I looked at a few different implementations of getRegionLogicalSizeInBytes (VariableWidthBlock is cheap, but MapBlock is costly), so it is very hard to come up with a cost model, or even to profile.
What I have in mind is: instead of taking the in-place code path when length <= dictionarySize, we can take the in-place code path when length * 2 <= dictionarySize, which assumes that a cache hit is half as expensive as calling the actual method. I am also fine changing the condition to length * 5 <= dictionarySize, so the in-place code path is triggered only when the cache is not going to yield any benefit.
Another approach would be to use a regular HashMap in the in-place code path and let it grow dynamically.
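The proposed threshold could be sketched as a small helper. This is illustrative, not the PR's actual code; the class and method names are invented, and the factor of 2 is the assumption from the comment above (a cache hit costing half a direct call), which is tunable:

```java
public class InPlacePathHeuristic {
    // Assumed cost ratio: a cache hit is treated as half the cost of a
    // direct getRegionLogicalSizeInBytes call. Tunable if required.
    static final int CACHE_BENEFIT_FACTOR = 2;

    // Take the non-cached ("in place") path when the region is small
    // relative to the dictionary, i.e. most cached sizes would never
    // be reused.
    static boolean useInPlacePath(int length, int dictionarySize) {
        return (long) length * CACHE_BENEFIT_FACTOR <= dictionarySize;
    }

    public static void main(String[] args) {
        // Abandoning-dictionary case: 1024-row segment, 10M-key dictionary.
        System.out.println(useInPlacePath(1024, 10_000_000)); // true
        // Small dictionary reused across a comparable region: cache pays off.
        System.out.println(useInPlacePath(1024, 100)); // false
    }
}
```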
Another approach I coded up and then discarded: I could change the dictionary writer to not call getRegionLogicalSizeInBytes and do the in-place evaluation there, since that use case is well understood. But I believe the fix here is better, as it will improve the performance of Presto overall. I will push that change as well, just in case.
Force-pushed 594af9d to 61dd293
Force-pushed 61dd293 to 0fc9cce
Force-pushed 0fc9cce to e38594f
if (seenSizes[position] < 0) {
    seenSizes[position] = dictionary.getRegionLogicalSizeInBytes(position, 1);
}
sizeInBytes += seenSizes[position];
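For context, the cached path quoted above can be sketched as a self-contained toy (the fixed 8-byte entry size stands in for the real dictionary.getRegionLogicalSizeInBytes(position, 1) call, which varies by block type):

```java
import java.util.Arrays;

public class CachedRegionSizeDemo {
    public static void main(String[] args) {
        int dictionarySize = 5;
        long entrySize = 8;          // toy: pretend every entry is 8 bytes
        int[] ids = {0, 2, 2, 4, 0}; // dictionary positions for a region of length 5

        // seenSizes caches per-entry size, keyed by dictionary position;
        // note the allocation is proportional to the dictionary size.
        long[] seenSizes = new long[dictionarySize];
        Arrays.fill(seenSizes, -1);

        long sizeInBytes = 0;
        for (int position : ids) {
            if (seenSizes[position] < 0) {
                // stands in for dictionary.getRegionLogicalSizeInBytes(position, 1)
                seenSizes[position] = entrySize;
            }
            sizeInBytes += seenSizes[position];
        }
        System.out.println(sizeInBytes); // 5 positions * 8 bytes = 40
    }
}
```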
Q regarding the original implementation. This is a dictionary, do we really need to multiply the entry size by the number of times this entry appears?
If the answer is no, we can try to collect the used dictionary positions into an IntOpenHashSet from fastutil.
If yes, instead of a regular hash map you could use Int2IntOpenHashMap. It would really depend on the relation between the dictionary size and the length, and would still likely be more expensive.
The logical size is meant to capture the underlying size of the data, and that does involve multiplying the entry size by the number of times it appears. The logical size is used in a few places where data is converted from dictionary encoding to direct encoding; one such place is abandoning dictionary encoding and switching to direct encoding.
I agree that fastutil's collections are better than java.util's in terms of CPU/memory. But in my opinion, since the likelihood of a cache hit is very low here, the code is simpler if we take a non-cached path and avoid the cache completely.
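The non-cached path described above amounts to summing each position's size directly, with no per-dictionary allocation at all. A hedged toy sketch (entrySizeInBytes is an invented stand-in for dictionary.getRegionLogicalSizeInBytes(position, 1)):

```java
public class DirectRegionSizeDemo {
    // Toy stand-in for dictionary.getRegionLogicalSizeInBytes(position, 1):
    // pretend every dictionary entry is 8 bytes.
    static long entrySizeInBytes(int position) {
        return 8;
    }

    public static void main(String[] args) {
        // A 1024-max segment typically touches few of the millions of keys,
        // so almost no cached size would ever be reused.
        int[] ids = {3, 3, 9, 3};
        long sizeInBytes = 0;
        for (int id : ids) {
            sizeInBytes += entrySizeInBytes(id); // no seenSizes array allocated
        }
        System.out.println(sizeInBytes); // 4 positions * 8 bytes = 32
    }
}
```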
DictionaryBlock getRegionLogicalSizeInBytes is slow when abandoning
StringDictionaryEncoding. When abandoning string dictionary encoding,
the dictionary has a large number of keys (millions) and it is converted
in segments of maximum 1024. getRegionLogicalSizeInBytes allocates
a size array sized to the number of keys as a cache.
With this change, the code uses the cached code path only
when the number of keys is less than the length. The conversion
ratio can be tuned if required.
Before this change, abandoning a dictionary with 10 million keys
used to be 150 times slower than direct encoding. Now it is only
50 times worse (similar to dictionary encoding).
fixes #15506
Test plan -
Added unit test for the non cached code path.