
Optimize statistics aggregation for wide tables#11558

Closed
arhimondr wants to merge 1 commit into prestodb:master from arhimondr:optimize-statistics-aggregation

Conversation

@arhimondr
Member

This patch optimizes statistics aggregation for extra-wide tables (1000+ columns).

For extra-wide tables, creating the InMemoryHashAggregationBuilder can be expensive, as it creates ~4 aggregators (one for every statistic collected) for every column. After each partial-results flush the InMemoryHashAggregationBuilder has to be recreated, which takes far more CPU time than the aggregations themselves.

As an optimization, this patch (see the sketch after this list):

  • Removes the partial aggregation memory limit to avoid frequent flushes
  • Sets the expected number of hash table entries to 200 instead of 10_000
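
For illustration, a minimal sketch of the two settings involved; the constant names below are hypothetical and do not correspond to the actual InMemoryHashAggregationBuilder constructor arguments:

```java
import java.util.OptionalLong;

// Hypothetical illustration of the two changes; these names are not the real builder's API.
class StatisticsAggregationSettings
{
    // Before: a 16MB partial-aggregation memory limit forced a flush -- and a full
    // rebuild of the builder with ~4 aggregators per column -- every few rows.
    static final OptionalLong OLD_PARTIAL_MEMORY_LIMIT_BYTES = OptionalLong.of(16L << 20);
    static final int OLD_EXPECTED_HASH_TABLE_ENTRIES = 10_000;

    // After: no partial limit (memory is accounted against the user pool instead),
    // and an initial hash table sized for the expected number of written partitions.
    static final OptionalLong NEW_PARTIAL_MEMORY_LIMIT_BYTES = OptionalLong.empty();
    static final int NEW_EXPECTED_HASH_TABLE_ENTRIES = 200;
}
```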

@findepi
Contributor

findepi commented Sep 24, 2018

Removes partial aggregation memory limit to avoid frequent flushes

is it safe?

@arhimondr arhimondr force-pushed the optimize-statistics-aggregation branch from ee50c38 to acc3074 on September 24, 2018 14:13
@arhimondr
Member Author

@findepi

Yes. The partial aggregation memory limit (if set) is accounted as system memory usage, as it is a fixed chunk of memory dedicated to pre-aggregations.

After this patch, if the pre-aggregation memory limit is not set, the code will allocate the memory in the user memory pool, as if it were a final aggregation.

https://github.com/prestodb/presto/pull/11558/files#diff-5645751c8a59753e806941d4649be086R330
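
A minimal sketch of the accounting split being described, assuming made-up field and method names rather than the actual operator code at the linked line:

```java
import java.util.OptionalLong;

// Hypothetical sketch of the memory accounting described above; the names are
// illustrative, not the actual HashAggregationOperator implementation.
class AggregationMemoryTracking
{
    private final OptionalLong partialMemoryLimitBytes;
    private long systemMemoryBytes;
    private long userMemoryBytes;

    AggregationMemoryTracking(OptionalLong partialMemoryLimitBytes)
    {
        this.partialMemoryLimitBytes = partialMemoryLimitBytes;
    }

    void updateMemory(long hashTableBytes)
    {
        if (partialMemoryLimitBytes.isPresent()) {
            // Fixed partial-aggregation buffer: tracked as system memory and
            // flushed once the limit is reached.
            systemMemoryBytes = hashTableBytes;
        }
        else {
            // No partial limit (the stats-aggregation case after this patch):
            // memory is charged to the user pool, as for a final aggregation.
            userMemoryBytes = hashTableBytes;
        }
    }
}
```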

Contributor

@mbasmanova mbasmanova left a comment

@arhimondr Andrii, there is not enough explanation for me to understand this change. Could you describe the problem this change is trying to solve and explain a bit more about why it is a good idea to remove a memory limit?

CC: @nezihyigitbasi

Contributor

This needs a comment

Contributor

This needs a comment

@findepi
Contributor

findepi commented Sep 24, 2018

After this patch, if the pre-aggregation memory limit is not set, the code will allocate the memory in the user memory pool, as if it were a final aggregation.

in case of stats on write, is the final aggregation single node? i.e. if the final aggregation had enough memory to complete, will partial aggregation have enough memory?

@mbasmanova
Contributor

@arhimondr Andrii, how did you test this change? Would you add a benchmark that shows how much improvement there is?

@arhimondr
Member Author

in case of stats on write, is the final aggregation single node? i.e. if the final aggregation had enough memory to complete, will partial aggregation have enough memory?

Partial aggregation is not the intermediate aggregation; partial aggregation is an optimization. The trick is that partial aggregation has some fixed amount of memory (16MB by default). Whenever it reaches the limit, it yields the current state to the FINAL or to the INTERMEDIATE aggregation.

The problem is that when you have a lot of columns you are doing aggregations over, the 16MB limit fills up every few rows. It was never a problem before, as no one had ever run 4000 aggregation functions in a single statement.

However, the statistics calculation code may easily create that many aggregation functions.
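
A rough sketch of that flush behaviour, again with made-up names rather than the real operator code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of partial aggregation flushing; not the actual
// HashAggregationOperator / InMemoryHashAggregationBuilder code.
class PartialAggregationSketch
{
    private static final long PARTIAL_LIMIT_BYTES = 16L << 20; // 16MB default

    private final List<Long> flushedStateSizes = new ArrayList<>();
    private long bufferedStateBytes;

    void addRow(long rowStateBytes)
    {
        bufferedStateBytes += rowStateBytes;
        if (bufferedStateBytes >= PARTIAL_LIMIT_BYTES) {
            // Yield the buffered state downstream to the intermediate or final
            // aggregation and start over. With ~4 aggregators per column on a
            // 1000+ column table this triggers every few rows, and recreating
            // the builder costs more CPU than the aggregation work itself.
            flushedStateSizes.add(bufferedStateBytes);
            bufferedStateBytes = 0;
        }
    }
}
```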

@arhimondr
Member Author

Andrii, how did you test this change? Would you add a benchmark that shows how much improvement there is?

@mbasmanova It was tested by taking perf reports on a real production cluster.

@findepi
Contributor

findepi commented Sep 24, 2018

The problem is that when you have a lot of columns you are doing aggregations over, the 16MB limit fills up every few rows. It was never a problem before, as no one had ever run 4000 aggregation functions in a single statement.

I know that it works that way. It can be a problem for narrower aggregations as well, when there are many distinct grouping keys (here you have O(number of columns), i.e. "fixed").
BTW, in #11267 there was a rule addressing this for certain ROLLUP queries.

Contributor

@arhimondr Assuming that this number ties to the max number of output partitions, let's make the connection explicit. E.g. add a comment or, better yet, use the same configuration setting to compute this.

Member Author

I don't think it is worth a configuration property. It is not a hard limit, but the initial size of the grouping hash map; it will grow larger if needed.
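
For illustration (a plain java.util.HashMap stands in for the grouping hash table; the point is only that the expected size is a hint, not a cap):

```java
import java.util.HashMap;
import java.util.Map;

// The expected-entries value merely pre-sizes the table; exceeding it causes a
// rehash rather than an error, which is why 200 is a sizing hint, not a limit.
class ExpectedEntriesDemo
{
    public static void main(String[] args)
    {
        Map<Integer, Long> groups = new HashMap<>(200);
        for (int partition = 0; partition < 1_000; partition++) {
            groups.put(partition, 0L); // grows past the initial capacity as needed
        }
        System.out.println(groups.size()); // prints 1000
    }
}
```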

@mbasmanova
Contributor

@arhimondr

It was tested by taking perf reports on a real production cluster.

Could you share these?

Contributor

@dain dain left a comment

The change looks good to me, with the caveat that my expectation is that the statistics memory usage will be c * numberOfColumns * numberOfTableSegmentsWritten. The c can be different for different column types, but should not be affected by the number of rows (it could grow to a reasonable limit). This way the user has influence over this memory, and it can be user memory.

Also, please make sure that any concerns from others are resolved before merging.
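
A back-of-the-envelope instance of that formula, with purely hypothetical numbers:

```java
// Hypothetical numbers; the per-column constant c and the counts are not measurements.
class StatsMemoryEstimate
{
    public static void main(String[] args)
    {
        long perColumnStateBytes = 4 * 1024;      // c: ~4 aggregators per column, assumed ~1KB each
        long numberOfColumns = 1_000;
        long numberOfTableSegmentsWritten = 100;  // e.g. partitions written by the query

        long estimateBytes = perColumnStateBytes * numberOfColumns * numberOfTableSegmentsWritten;
        System.out.printf("~%d MB of user memory%n", estimateBytes >> 20); // ~390 MB
    }
}
```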

Contributor

Can you add a comment stating that when there is no partial memory limit set, any memory usage is considered user memory?

@arhimondr
Member Author

but should not be affected by the number of rows (it could grow to a reasonable limit).

It is also affected by the number of partitions: 1 partition = 1 grouping key.

@dain
Contributor

dain commented Sep 24, 2018

but should not be affected by the number of rows (it could grow to a reasonable limit).

It is also affected by the number of partitions: 1 partition = 1 grouping key.

I expected that (see numberOfTableSegmentsWritten in my formula), and I think that is acceptable also.

@arhimondr
Member Author

@mbasmanova

Could you share these?

Unfortunately those snapshots are server-wide, and may contain traces from some proprietary processes.

@arhimondr arhimondr force-pushed the optimize-statistics-aggregation branch from acc3074 to dc32e6c on September 25, 2018 04:35
@arhimondr arhimondr force-pushed the optimize-statistics-aggregation branch from dc32e6c to 02d75d7 on September 25, 2018 15:08
@arhimondr
Member Author

@findepi

I know that it works that way. It can be a problem for narrower aggregations as well, when there are many distinct grouping keys (here you have O(number of columns), i.e. "fixed").

In theory it could. But in practice, we currently group statistics per partition, and by default it is not allowed to insert more than 100 partitions in a single query.

And anyhow, there is absolutely no point in flushing the results prematurely. It would just result in more single-threaded work in the final aggregation.
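
For reference, the 100-partition default mentioned here corresponds (assuming the property name is remembered correctly) to the Hive connector setting:

```properties
# etc/catalog/hive.properties -- assumed property name; 100 is the default
hive.max-partitions-per-writers=100
```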

Contributor

@mbasmanova mbasmanova left a comment

@arhimondr Looks good to me. Please, explain the choice of 200 in the commit message before merging.

@arhimondr
Member Author

Merged

@arhimondr arhimondr closed this Sep 25, 2018
@arhimondr arhimondr deleted the optimize-statistics-aggregation branch September 25, 2018 16:18