
Optimize statistics aggregation for wide tables#11558

Closed
arhimondr wants to merge 1 commit into prestodb:master from arhimondr:optimize-statistics-aggregation

Conversation

@arhimondr
Member

This patch optimizes statistics aggregation for extra-wide tables (1000+ columns).

For extra-wide tables, creating the InMemoryHashAggregationBuilder can be expensive, as it creates ~4 aggregators (one for every statistic collected) for every column. After each partial-results flush the InMemoryHashAggregationBuilder has to be recreated, which takes far more CPU time than the aggregations themselves.

As an optimization, this patch (see the sketch after this list):

  • Removes the partial aggregation memory limit to avoid frequent flushes
  • Sets the expected number of hash table entries to 200 instead of 10_000
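
For illustration, a minimal sketch of the two settings involved; the constant names below are hypothetical and do not correspond to the actual InMemoryHashAggregationBuilder constructor arguments:

```java
import java.util.OptionalLong;

// Hypothetical illustration of the two changes; these names are not the real builder's API.
class StatisticsAggregationSettings
{
    // Before: a 16MB partial-aggregation memory limit forced a flush -- and a full
    // rebuild of the builder with ~4 aggregators per column -- every few rows.
    static final OptionalLong OLD_PARTIAL_MEMORY_LIMIT_BYTES = OptionalLong.of(16L << 20);
    static final int OLD_EXPECTED_HASH_TABLE_ENTRIES = 10_000;

    // After: no partial limit (memory is accounted against the user pool instead),
    // and an initial hash table sized for the expected number of written partitions.
    static final OptionalLong NEW_PARTIAL_MEMORY_LIMIT_BYTES = OptionalLong.empty();
    static final int NEW_EXPECTED_HASH_TABLE_ENTRIES = 200;
}
```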

@findepi
Contributor

findepi commented Sep 24, 2018

Removes partial aggregation memory limit to avoid frequent flushes

is it safe?

@arhimondr arhimondr force-pushed the optimize-statistics-aggregation branch from ee50c38 to acc3074 on September 24, 2018 14:13
@arhimondr
Member Author

@findepi

Yes. The partial aggregation memory limit (if set) is accounted as system memory usage, as it is a fixed chunk of memory dedicated to pre-aggregations.

After this patch, if the pre-aggregation memory limit is not set, the code will allocate the memory in the user memory pool, as if it were a final aggregation.

https://github.com/prestodb/presto/pull/11558/files#diff-5645751c8a59753e806941d4649be086R330
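
A minimal sketch of the accounting split being described, assuming made-up field and method names rather than the actual operator code at the linked line:

```java
import java.util.OptionalLong;

// Hypothetical sketch of the memory accounting described above; the names are
// illustrative, not the actual HashAggregationOperator implementation.
class AggregationMemoryTracking
{
    private final OptionalLong partialMemoryLimitBytes;
    private long systemMemoryBytes;
    private long userMemoryBytes;

    AggregationMemoryTracking(OptionalLong partialMemoryLimitBytes)
    {
        this.partialMemoryLimitBytes = partialMemoryLimitBytes;
    }

    void updateMemory(long hashTableBytes)
    {
        if (partialMemoryLimitBytes.isPresent()) {
            // Fixed partial-aggregation buffer: tracked as system memory and
            // flushed once the limit is reached.
            systemMemoryBytes = hashTableBytes;
        }
        else {
            // No partial limit (the stats-aggregation case after this patch):
            // memory is charged to the user pool, as for a final aggregation.
            userMemoryBytes = hashTableBytes;
        }
    }
}
```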

Contributor

@mbasmanova mbasmanova left a comment

@arhimondr Andrii, there is not enough explanation for me to understand this change. Could you describe the problem this change is trying to solve and explain a bit more about why it is a good idea to remove a memory limit?

CC: @nezihyigitbasi

Contributor

This needs a comment

Contributor

This needs a comment

@findepi
Contributor

findepi commented Sep 24, 2018

After this patch, if the pre-aggregation memory limit is not set, the code will allocate the memory in the user memory pool, as if it were a final aggregation.

in case of stats on write, is the final aggregation single node? i.e. if the final aggregation had enough memory to complete, will partial aggregation have enough memory?

@mbasmanova
Contributor

@arhimondr Andrii, how did you test this change? Would you add a benchmark that shows how much improvement there is?

@arhimondr
Member Author

in case of stats on write, is the final aggregation single node? i.e. if the final aggregation had enough memory to complete, will partial aggregation have enough memory?

Partial aggregation is not the intermediate aggregation; partial aggregation is an optimization. The trick is that partial aggregation has some fixed amount of memory (16MB by default). Whenever it reaches the limit, it yields the current state to the FINAL or to the INTERMEDIATE aggregation.

The problem is that when you have a lot of columns you are doing aggregations over, the 16MB limit fills up every few rows. It was never a problem before, as no one had ever run 4000 aggregation functions in a single statement.

However, the statistics calculation code may easily create that many aggregation functions.
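
A rough sketch of that flush behaviour, again with made-up names rather than the real operator code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of partial aggregation flushing; not the actual
// HashAggregationOperator / InMemoryHashAggregationBuilder code.
class PartialAggregationSketch
{
    private static final long PARTIAL_LIMIT_BYTES = 16L << 20; // 16MB default

    private final List<Long> flushedStateSizes = new ArrayList<>();
    private long bufferedStateBytes;

    void addRow(long rowStateBytes)
    {
        bufferedStateBytes += rowStateBytes;
        if (bufferedStateBytes >= PARTIAL_LIMIT_BYTES) {
            // Yield the buffered state downstream to the intermediate or final
            // aggregation and start over. With ~4 aggregators per column on a
            // 1000+ column table this triggers every few rows, and recreating
            // the builder costs more CPU than the aggregation work itself.
            flushedStateSizes.add(bufferedStateBytes);
            bufferedStateBytes = 0;
        }
    }
}
```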

@arhimondr
Member Author

Andrii, how did you test this change? Would you add a benchmark that shows how much improvement there is?

@mbasmanova It was tested by taking perf reports on a real production cluster.

@findepi
Contributor

findepi commented Sep 24, 2018

The problem is that when you have a lot of columns you are doing aggregations over, the 16MB limit fills up every few rows. It was never a problem before, as no one had ever run 4000 aggregation functions in a single statement.

I know that it works that way. It can be a problem for narrower aggregations as well, when there are many distinct grouping keys (here you have O(number of columns), i.e. "fixed").
BTW, in #11267 there was a rule addressing this for certain ROLLUP queries.

Contributor

@arhimondr Assuming that this number ties to the max number of output partitions, let's make the connection explicit. E.g. add a comment or, better yet, use the same configuration setting to compute this.

Member Author

I don't think it is worth a configuration property. It is not a hard limit, but the initial size of the grouping hash map; it will grow larger if needed.
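
For illustration (a plain java.util.HashMap stands in for the grouping hash table; the point is only that the expected size is a hint, not a cap):

```java
import java.util.HashMap;
import java.util.Map;

// The expected-entries value merely pre-sizes the table; exceeding it causes a
// rehash rather than an error, which is why 200 is a sizing hint, not a limit.
class ExpectedEntriesDemo
{
    public static void main(String[] args)
    {
        Map<Integer, Long> groups = new HashMap<>(200);
        for (int partition = 0; partition < 1_000; partition++) {
            groups.put(partition, 0L); // grows past the initial capacity as needed
        }
        System.out.println(groups.size()); // prints 1000
    }
}
```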

@mbasmanova
Contributor

@arhimondr

It was tested by taking perf reports on a real production cluster.

Could you share these?

Contributor

@dain dain left a comment

The change looks good to me, with the caveat that my expectation is that the statistics memory usage will be c * numberOfColumns * numberOfTableSegmentsWritten. The c can be different for different column types, but should not be affected by the number of rows (it could grow to a reasonable limit). This way the user has influence over this memory, and it can be user memory.

Also, please make sure that any concerns from others are resolved before merging.
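
A back-of-the-envelope instance of that formula, with purely hypothetical numbers:

```java
// Hypothetical numbers; the per-column constant c and the counts are not measurements.
class StatsMemoryEstimate
{
    public static void main(String[] args)
    {
        long perColumnStateBytes = 4 * 1024;      // c: ~4 aggregators per column, assumed ~1KB each
        long numberOfColumns = 1_000;
        long numberOfTableSegmentsWritten = 100;  // e.g. partitions written by the query

        long estimateBytes = perColumnStateBytes * numberOfColumns * numberOfTableSegmentsWritten;
        System.out.printf("~%d MB of user memory%n", estimateBytes >> 20); // ~390 MB
    }
}
```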

Contributor

Can you add a comment stating that when there is no partial memory limit set, any memory usage is considered user memory?

@arhimondr
Member Author

but should not be affected by the number of rows (it could grow to a reasonable limit).

It is also affected by the number of partitions: 1 partition = 1 grouping key.

@dain
Contributor

dain commented Sep 24, 2018

but should not be affected by the number of rows (it could grow to a reasonable limit).

It is also affected by the number of partitions: 1 partition = 1 grouping key.

I expected that (see numberOfTableSegmentsWritten in my formula), and I think that is acceptable also.

@arhimondr
Member Author

@mbasmanova

Could you share these?

Unfortunately those snapshots are server-wide, and may contain traces from some proprietary processes.

@arhimondr arhimondr force-pushed the optimize-statistics-aggregation branch from acc3074 to dc32e6c on September 25, 2018 04:35
@arhimondr arhimondr force-pushed the optimize-statistics-aggregation branch from dc32e6c to 02d75d7 on September 25, 2018 15:08
@arhimondr
Member Author

@findepi

I know that it works that way. It can be a problem for narrower aggregations as well, when there are many distinct grouping keys (here you have O(number of columns), i.e. "fixed").

In theory it could. But in practice, we currently group statistics per partition, and by default it is not allowed to insert more than 100 partitions in a single query.

And anyhow, there is absolutely no point in flushing the results prematurely. It would just result in more single-threaded work in the final aggregation.
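
For reference, the 100-partition default mentioned here corresponds (assuming the property name is remembered correctly) to the Hive connector setting:

```properties
# etc/catalog/hive.properties -- assumed property name; 100 is the default
hive.max-partitions-per-writers=100
```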

Contributor

@mbasmanova mbasmanova left a comment

@arhimondr Looks good to me. Please, explain the choice of 200 in the commit message before merging.

@arhimondr
Member Author

Merged

@arhimondr arhimondr closed this Sep 25, 2018
@arhimondr arhimondr deleted the optimize-statistics-aggregation branch September 25, 2018 16:18