Add a quicker, hash-based version of distinct limit #17199
rongrong merged 1 commit into prestodb:master from
Conversation
I'm not sure I understand how this code is removing "the largest one". I think this code either adds "hash" (if it does not already exist) or not. It returns true if added and false if the value already existed.
Sorry, bad comment. Removed it.
@kaikalur My understanding is that the speedup comes from the fact that this implementation is approximate, i.e. it returns a set of values with distinct hashes and may miss some of the values if they happen to have the same hash. It would be good to clarify that in the release notes and, perhaps, use a session / configuration key with a name that clearly signals that the results may not be 100% accurate.
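For illustration only (this is not the PR's actual code), here is a toy sketch of why deduplicating by hash is approximate: with a deliberately collision-prone "hash", two distinct values are conflated into one. Real 64-bit hashes collide astronomically rarely, but the failure mode is the same.

```java
import java.util.HashMap;
import java.util.Map;

public class HashDistinctDemo {
    // Toy 1-bit "hash" that guarantees collisions, to make the effect visible.
    static long badHash(String s) {
        return s.length() % 2;
    }

    public static void main(String[] args) {
        // Keep one representative value per *hash*, like a hash-based distinct.
        Map<Long, String> firstSeen = new HashMap<>();
        for (String v : new String[] {"a", "bb", "c"}) {
            firstSeen.putIfAbsent(badHash(v), v);
        }
        // "a" and "c" collide, so only 2 of the 3 distinct values survive.
        System.out.println(firstSeen.values());
    }
}
```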
ccb241c to
572f5dd
Compare
Yes, that's correct, but two points -
https://stackoverflow.com/questions/22029012/probability-of-64bit-hash-code-collisions
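As a sanity check on the linked discussion, the birthday bound gives the rough probability of any collision among n 64-bit hash values. The class and method names here are made up for this sketch.

```java
public class CollisionBound {
    // Birthday bound: P(collision among n 64-bit hashes) ≈ n*(n-1)/2 / 2^64.
    // This approximation is good while the result is well below 1.
    public static double collisionProbability(long n) {
        return (double) n * (n - 1) / 2.0 / Math.pow(2, 64);
    }

    public static void main(String[] args) {
        // At a 10K threshold, collisions are vanishingly unlikely.
        System.out.printf("n = 10^4:   %.3e%n", collisionProbability(10_000L));
        // Odds only become non-trivial in the billions of distinct values.
        System.out.printf("n = 4*10^9: %.3f%n", collisionProbability(4_000_000_000L));
    }
}
```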
Please keep the class name and function name consistent, so maybe change this to TopkDistinct instead. Or TopKDistinct and top_k_distinct, or name the function k_distinct.
+1
Using the term topk in the name might be confusing, as top-k usually refers to the first k elements of a sorted list, but here the list is not sorted. I'd also consider adding an approx_ prefix to the name to make it crystal clear that results may not be 100% accurate.
Also, I wonder if this logic could be folded into DistinctLimitOperator instead of a function.
The DistinctLimit operator is quite heavy and has partitioning semantics etc., so I didn't want to touch it; it is still needed for the case when this feature is turned off.
And I'm convinced, thanks to my research :), that for the threshold I have (10K) it's actually accurate. We have a FeaturesConfig option, so enabling this for deployments that serve adhoc/analytical queries is fine. It is definitely not approximate at that scale.
nit: static import
Is this function useful as a public function? I'd say mark it as HIDDEN
Yeah, I'm wondering if this could be a useful function for users. We will see.
Kept it hidden for now.
Are you planning to use hash_based_distinct_limit_threshold = 0 to disable this feature? Should we have a separate variable to enable it?
I think it might be cleaner to introduce a top-level property that enables approximate query results, and require that it be enabled for any of the approximate optimizations: approximate results where slow workers are ignored, results produced using metadata only, this optimization, etc.
Yeah - looking into that as well; maybe when we have some more of these we can do that.
Please also add tests and a benchmark, thanks!
Just to be clear, the state in scalar function is shared only within the same driver (same page processor), so using it for partial limit is safe. Generally speaking, using state in scalar is discouraged.
Yes - in fact, I'm trying to see if we can add superfilters (like this one) which basically terminate the scan using three-valued logic: filter in, filter out, stop scan.
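A minimal sketch of what such a three-valued "superfilter" verdict could look like (hypothetical names, not Presto's actual API), using a distinct limit as the example of a filter that can eventually stop the scan:

```java
import java.util.HashSet;
import java.util.Set;

public class Superfilter {
    // Three-valued verdict: keep the row, drop the row, or signal that
    // no future row can pass, so the operator may terminate the scan.
    public enum Verdict { KEEP, DROP, STOP_SCAN }

    private final Set<Long> seen = new HashSet<>();
    private final int limit;

    public Superfilter(int limit) {
        this.limit = limit;
    }

    public Verdict test(long hash) {
        if (seen.size() >= limit) {
            return Verdict.STOP_SCAN; // limit reached: nothing more can pass
        }
        // add() returns true only for a hash not seen before
        return seen.add(hash) ? Verdict.KEEP : Verdict.DROP;
    }
}
```

Here STOP_SCAN is the piece a plain boolean filter cannot express: it lets the scan end early instead of draining the rest of the split.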
(force-pushed 9cd233c to f8509d2)
As for the benchmark, we don't have a way to benchmark distributed queries! Still, I added the benchmark; if/when we add the capability, we can run it.
(force-pushed 6a9527b to 9811f96)
Updated the commit message with some "benchmark" numbers.
@kaikalur Sreeni, what are the wins you are observing with e2e queries?
@kaikalur Sreeni, would you update the Release Notes to document the configuration and session properties that can be used to enable / disable this optimization?
It would be nice to add a comment to the code explaining this, as it is not obvious.
I updated the commit message. Firstly, I'm seeing global memory go from 20GB to < 1MB for the limit query we were playing with, and under simulated load the latency was consistently 39s for this version vs 1m40s for the other one.
(force-pushed 9811f96 to 5387c81)
Done
That's impressive. Do you know why the original version would use so much memory? It sounds like a bug worth fixing.
So, my theory :) As I mentioned in the issue I referenced, the DistinctLimit operator seems to be implemented assuming it will hit the limit fast. It actually uses group-by mechanisms to mark positions to copy out of the page, etc., so my theory is that those pages were being held. On the other hand, this filter function has a fixed LongArraySet of at most 10k entries, so about 80KB total per driver.
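A rough sketch of that filter's behavior (illustrative names; the real code keeps a fixed-size primitive LongArraySet per driver rather than a boxed HashSet):

```java
import java.util.HashSet;
import java.util.Set;

public class KDistinctFilter {
    // Bounded state: at most k hashes, so memory stays fixed per driver.
    private final Set<Long> seen = new HashSet<>();
    private final int k;

    public KDistinctFilter(int k) {
        this.k = k;
    }

    // Returns true for the first k distinct hash values seen by this
    // driver; false for duplicate hashes and for everything after the
    // limit is reached.
    public boolean keep(long hash) {
        if (seen.size() >= k) {
            return false;
        }
        return seen.add(hash);
    }
}
```

With k = 10,000 and 8-byte entries, the raw hash payload is about 80KB per driver, which matches the memory behavior described above.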
It would be nice to dig a bit deeper to get a better understanding of why the current implementation is using so much memory.
(force-pushed 5387c81 to be47862)
mbasmanova left a comment
@kaikalur Overall looks good. A few comments / questions below.
nit: Capitalize 'threshold'
Should we add a check that the threshold is < 10K? I'm concerned about users inadvertently configuring the system in a way that may produce approximate results. Investigating such issues would be extremely hard. I think having a top-level config to opt in to approximate results would be helpful. It may also help to issue a warning whenever a query uses a feature / optimization that may produce approximate results.
This is not a config but rather a query-level session param. So as long as it doesn't OOM, I see no issue (again, the chance of collision only starts to matter at 2bn - 4bn distinct values).
I'm a bit confused here by the statement around the threshold for collision - should we assume this is == 0 or ~= 0 at sufficiently low cardinalities (like the 10K proposed above)?
Yes, that's the idea. 10k is what we are starting with, but if other deployments want to change it, they can.
Would users understand what "Hash based distinct limit" means and what the "threshold" is for it? Should we add some documentation somewhere in https://prestodb.io/docs/current/optimizer.html and/or https://prestodb.io/docs/current/admin/properties.html#optimizer-properties ?
It's intentional. We are not expecting users to mess with this. I'm thinking this will mostly be enabled/disabled by admins for whole clusters - enabled for adhoc/analytical clusters only.
Are there reasons why we wouldn't want to enable this for batch workloads? My thinking is that we are adding the boolean config to control roll-out, but once this is tested, we can actually remove it and always use the threshold (assuming that under certain numbers, say 100, it would always be beneficial).
Sure we can. It's up to whoever wants to use it :) Also, we generally don't see this pattern in batch workloads.
This is function documentation, right? Since the function returns a boolean, perhaps explain what conditions make it return true and false? Perhaps: "Returns true for the first K distinct hash values."
Yeah - I also want to capture the fact that it maintains state. Let me see if I can rephrase it.
nit: Why "key"? Perhaps addHash is clearer.
nit: size is K, right? Would it make sense to name it that, or use the term limit for clarity? It would be nice to add a comment to this method explaining what it does.
Yeah, I was thinking that. Made it k.
nit: functionManager -> functionAndTypeManager is null
These null checks look redundant to me. I'm going to get rid of them.
Any particular reason not to use IterativeOptimizer here?
This actually happens after we do all the optimizations, so the patterns can be really weird. So I prefer this visitor-based optimizer; in general, I also like it better than the iterative optimizer.
Using com.facebook.presto.sql.planner.iterative.Rule would make it a lot simpler and nicer.
Harder to understand. I like the visitor pattern better than regex-like rules.
Is this the case? I believe we have a special optimization for hash-based aggregation on a single BIGINT key.
Yeah, that's why I use that single bigint as the variable.
Just curious, what's the optimization? It doesn't apply to other integer types?
hash is a long, so it looks like it optimizes for the case when the group-by key is a single BIGINT type (which is what we use for all integer types, I thought).
We use Java long for all integer SQL types. So if this is specific to the Java type, it should map to TINYINT, ..., BIGINT.
One more question: I'm not sure how this benchmark works, but would you share the results?
This doesn't work for the actual case here because we need partial/final, but the LocalQueryRunner doesn't do that yet. #17210
sql_distinct_limit: default :: 136.001 cpu ms :: 0B peak memory :: in 15K, 0B, 110K/s, 0B/s :: out 1, 5B, 7/s, 36B/s
sql_distinct_limit: hash based :: 109.253 cpu ms :: 0B peak memory :: in 15K, 0B, 137K/s, 0B/s :: out 1, 5B, 9/s, 45B/s
sql_distinct_limit: default :: 84.350 cpu ms :: 0B peak memory :: in 15K, 0B, 178K/s, 0B/s :: out 1, 5B, 11/s, 59B/s
sql_distinct_limit: hash based :: 71.158 cpu ms :: 0B peak memory :: in 15K, 0B, 211K/s, 0B/s :: out 1, 5B, 14/s, 70B/s
So why is there a perf difference in this benchmark? Why do we include this benchmark here if it doesn't test the feature introduced?
I don't understand that either :) Maybe it's not warmed up enough? Changing it to 20 warmup and 20 test iterations shows better results:
sql_distinct_limit: default :: 77.940 cpu ms :: 0B peak memory :: in 15K, 0B, 192K/s, 0B/s :: out 1, 5B, 12/s, 64B/s
sql_distinct_limit: hash based :: 70.397 cpu ms :: 0B peak memory :: in 15K, 0B, 213K/s, 0B/s :: out 1, 5B, 14/s, 71B/s
sql_distinct_limit: default :: 70.115 cpu ms :: 0B peak memory :: in 15K, 0B, 214K/s, 0B/s :: out 1, 5B, 14/s, 71B/s
sql_distinct_limit: hash based :: 71.953 cpu ms :: 0B peak memory :: in 15K, 0B, 208K/s, 0B/s :: out 1, 5B, 13/s, 69B/s
Removed the test for now.
(force-pushed be47862 to 9afeb69)
(force-pushed 9afeb69 to 2f2a6b6)
Some perf numbers (over 10+ iterations):

Original query - 81 distinct rows out of 702B rows:
------------------------------------------------------------------------
(81 rows)
Query 20220119_013303_37211_k4nb5, FINISHED, 599 nodes
Splits: 262,004 total, 262,004 done (100.00%)
1:38 [702B rows, 484GB] [7.17B rows/s, 4.95GB/s]

With the optimization:
----------------------
(81 rows)
Query 20220118_203652_00269_kynz9, FINISHED, 599 nodes
Splits: 262,019 total, 262,019 done (100.00%)
0:39 [702B rows, 484GB] [18.1B rows/s, 12.5GB/s]
(force-pushed 2f2a6b6 to 6457ec3)
Added a quicker version of partial distinct limit based on hashes to reduce latency. This eliminates the partial DistinctLimit node and instead uses a stateful filter function that keeps the first N distinct hashes and filters out every row once it reaches the limit or whenever a hash has already been seen.
It may be slightly inaccurate when there are hash collisions, but for adhoc quick data exploration this is good, as the user doesn't have to wait hours to see the values.
Fixes #17196
Test plan - Tests already exist