Optimize statistics aggregation for wide tables #11558
arhimondr wants to merge 1 commit into prestodb:master from
Conversation
Is it safe?
Force-pushed from ee50c38 to acc3074
Yes. The partial aggregation memory limit (if set) is accounted as system memory usage, since it is a fixed chunk of memory dedicated to pre-aggregation. After this patch, if the pre-aggregation memory limit is not set, the code allocates the memory in the user memory pool, as if it were a final aggregation.
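A minimal sketch of that accounting rule (hypothetical names, not Presto's actual operator code): a bounded pre-aggregation buffer is charged to system memory, while an unbounded partial aggregation is charged to the user pool, the same way a final aggregation would be.

```java
import java.util.OptionalLong;
import java.util.function.LongConsumer;

// Sketch only: illustrates the accounting rule described above, not Presto's classes.
class MemoryAccountingSketch {
    private final OptionalLong partialMemoryLimitBytes;
    private final LongConsumer systemMemory; // stands in for a system memory context
    private final LongConsumer userMemory;   // stands in for a user memory context

    MemoryAccountingSketch(OptionalLong partialMemoryLimitBytes, LongConsumer systemMemory, LongConsumer userMemory) {
        this.partialMemoryLimitBytes = partialMemoryLimitBytes;
        this.systemMemory = systemMemory;
        this.userMemory = userMemory;
    }

    void updateMemoryReservation(long builderSizeInBytes) {
        if (partialMemoryLimitBytes.isPresent()) {
            // fixed pre-aggregation chunk -> tracked as system memory
            systemMemory.accept(builderSizeInBytes);
        }
        else {
            // no limit set -> charged to the user pool, as for a final aggregation
            userMemory.accept(builderSizeInBytes);
        }
    }
}
```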
mbasmanova left a comment
@arhimondr Andrii, there is not enough explanation for me to understand this change. Could you describe the problem this change is trying to solve and explain a bit more about why it is a good idea to remove a memory limit?
CC: @nezihyigitbasi
In the case of stats on write, is the final aggregation single-node? I.e., if the final aggregation had enough memory to complete, will the partial aggregation have enough memory?
@arhimondr Andrii, how did you test this change? Would you add a benchmark that shows how much improvement there is?
Partial aggregation is not the same as intermediate aggregation; partial aggregation is an optimization. The trick is that partial aggregation has a fixed amount of memory (16MB by default). Whenever it reaches the limit, it yields the current state to the FINAL or INTERMEDIATE aggregation. The problem is that when you are aggregating over a lot of columns, the 16MB limit is filled up every few rows. It was never a problem before, since no one ever ran 4000 aggregation functions in a single statement. However, the statistics calculation code may easily create that many aggregation functions.
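A back-of-the-envelope sketch of why the default budget is exhausted almost immediately on such a table; the per-aggregator state size here is an assumption for illustration, not a measured Presto number.

```java
public class PartialAggregationBudgetSketch {
    public static void main(String[] args) {
        long partialMemoryLimit = 16L * 1024 * 1024; // 16MB default partial aggregation budget
        int columns = 4_000;                         // extra wide table
        int aggregatorsPerColumn = 4;                // ~4 statistics collected per column
        long assumedStateBytesPerAggregator = 1_024; // assumed; varies by statistic and column type

        long aggregators = (long) columns * aggregatorsPerColumn;              // 16,000 aggregators
        long footprintPerGroup = aggregators * assumedStateBytesPerAggregator; // ~16MB per grouping key
        System.out.printf("aggregators: %d, bytes per group: %d, budget: %d%n",
                aggregators, footprintPerGroup, partialMemoryLimit);
        // Under these assumptions a single grouping key nearly fills the budget, so the
        // builder flushes and must be recreated after only a few rows, and rebuilding
        // those 16,000 aggregators dominates the CPU profile.
    }
}
```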
@mbasmanova It was tested by taking perf reports on a real production cluster.
I know that it works that way. It can be a problem for narrower aggregations as well, when there are many distinct grouping keys (here you have O(number of columns), i.e. "fixed").
@arhimondr Assuming that this number ties to the max number of output partitions, let's make the connection explicit. E.g. add a comment or, better yet, use the same configuration setting to compute this.
I don't think it is worth a configuration property. It is not a hard limit, but the initial size of the grouping hash map; it will grow larger if needed.
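A minimal sketch (plain JDK code, not Presto's builder) of what an "expected entries" hint means in practice: it only pre-sizes the hash table for the anticipated number of groups, and the table still grows if more groups show up.

```java
import java.util.HashMap;
import java.util.Map;

public class ExpectedEntriesSketch {
    public static void main(String[] args) {
        // Pre-size for ~200 groups (roughly the number of written partitions)
        // instead of 10_000; this is only an initial capacity, not a cap.
        int expectedGroups = 200;
        Map<String, Long> rowCountPerPartition = new HashMap<>(capacityFor(expectedGroups));

        // The map still grows past the hint if more partitions appear.
        for (int i = 0; i < 1_000; i++) {
            rowCountPerPartition.merge("partition_" + i, 1L, Long::sum);
        }
        System.out.println(rowCountPerPartition.size()); // 1000
    }

    // Size the backing array so `expected` entries fit under the default load factor.
    private static int capacityFor(int expected) {
        return (int) Math.ceil(expected / 0.75);
    }
}
```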
Could you share these?
dain left a comment
The change looks good to me, with the caveat that my expectation is the statistics memory usage will be c * numberOfColumns * numberOfTableSegmentsWritten. The c can be different for different column types, but should not be affected by the number of rows (it could grow to a reasonable limit). This way the user has influence over this memory, and it can be user memory.
Also, please make sure that any concerns from others are resolved before merging.
Can you add a comment stating that when there is no partial memory limit set, any memory usage is considered user memory?
It is also affected by the number of partitions: 1 partition = 1 grouping key.
I expected that (see
Unfortunately those snapshots are server-wide and may contain traces from other proprietary processes.
Force-pushed from acc3074 to dc32e6c
This patch optimizes statistics aggregation for extra wide tables (1000+ columns). For extra wide tables, creating the InMemoryHashAggregationBuilder can be expensive, as it creates ~4 aggregators (one for every statistic collected) for every column. After each partial results flush, the InMemoryHashAggregationBuilder has to be recreated, which takes far more CPU time than the aggregation itself. As an optimization this patch:
- Removes the partial aggregation memory limit to avoid frequent flushes
- Sets the expected entries size to 200 instead of 10_000
Force-pushed from dc32e6c to 02d75d7
In theory it could. But in practice, we currently group statistics per partition, and by default it is not allowed to insert more than 100 partitions in a single query. In any case, there is no point in flushing the results prematurely; it just results in more single-threaded work at the final aggregation.
mbasmanova left a comment
@arhimondr Looks good to me. Please explain the choice of 200 in the commit message before merging.
Merged