Add a quicker, hash-based version of distinct limit #17199
rongrong merged 1 commit into prestodb:master from
Conversation
I'm not sure I understand how this code is removing "the largest one". I think this code either adds "hash" (if it does not already exist) or not. It returns true if added and false if the value already existed.
Sorry, bad comment. Removed it.
@kaikalur My understanding is that the speedup comes from the fact that this implementation is approximate, i.e. it returns a set of values with distinct hashes and may miss some of the values if they happen to have the same hash. It would be good to clarify that in the release notes and, perhaps, use a session / configuration key with a name that clearly signals that the results may not be 100% accurate.
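For illustration only (this is not the PR's actual code), here is a toy sketch of why deduplicating by hash is approximate: with a deliberately collision-prone "hash", two distinct values are conflated into one. Real 64-bit hashes collide astronomically rarely, but the failure mode is the same.

```java
import java.util.HashMap;
import java.util.Map;

public class HashDistinctDemo {
    // Toy 1-bit "hash" that guarantees collisions, to make the effect visible.
    static long badHash(String s) {
        return s.length() % 2;
    }

    public static void main(String[] args) {
        // Keep one representative value per *hash*, like a hash-based distinct.
        Map<Long, String> firstSeen = new HashMap<>();
        for (String v : new String[] {"a", "bb", "c"}) {
            firstSeen.putIfAbsent(badHash(v), v);
        }
        // "a" and "c" collide, so only 2 of the 3 distinct values survive.
        System.out.println(firstSeen.values());
    }
}
```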
ccb241c to
572f5dd
Compare
Yes, that's correct, but two points -
https://stackoverflow.com/questions/22029012/probability-of-64bit-hash-code-collisions
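As a sanity check on the linked discussion, the birthday bound gives the rough probability of any collision among n 64-bit hash values. The class and method names here are made up for this sketch.

```java
public class CollisionBound {
    // Birthday bound: P(collision among n 64-bit hashes) ≈ n*(n-1)/2 / 2^64.
    // This approximation is good while the result is well below 1.
    public static double collisionProbability(long n) {
        return (double) n * (n - 1) / 2.0 / Math.pow(2, 64);
    }

    public static void main(String[] args) {
        // At a 10K threshold, collisions are vanishingly unlikely.
        System.out.printf("n = 10^4:   %.3e%n", collisionProbability(10_000L));
        // Odds only become non-trivial in the billions of distinct values.
        System.out.printf("n = 4*10^9: %.3f%n", collisionProbability(4_000_000_000L));
    }
}
```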
Please keep the class name and function name consistent, so maybe change this to TopkDistinct instead. Or TopKDistinct and top_k_distinct, or name the function k_distinct.
+1
Using the term topk in the name might be confusing, as top-k usually refers to the first k elements of a sorted list, but here the list is not sorted. I'd also consider adding an approx_ prefix to the name to make it crystal clear that results may not be 100% accurate.
Also, I wonder if this logic could be folded into DistinctLimitOperator instead of a function.
The DistinctLimit operator is quite heavy and has partitioning semantics etc., so I didn't want to touch it; it is still needed for the case when this feature is turned off.
And I'm convinced, thanks to my research :), that for the threshold I have (10K) it's actually accurate. We have a FeaturesConfig option, so enabling this for deployments that serve adhoc/analytical queries is fine. It is definitely not approximate at that scale.
nit: static import
Is this function useful as a public function? I'd say mark it as HIDDEN
Yeah, I'm wondering if this could be a useful function for users. We will see.
Kept it hidden for now.
Are you planning to use hash_based_distinct_limit_threshold = 0 to disable this feature? Should we have a separate variable to enable it?
I think it might be cleaner to introduce a top-level property that enables approximate query results, and require that it be enabled for any of the approximate optimizations: approximate results where slow workers are ignored, results produced using metadata only, this optimization, etc.
Yeah - looking into that as well; maybe when we have some more of these we can do that.
Please also add tests and a benchmark, thanks!
Just to be clear, the state in scalar function is shared only within the same driver (same page processor), so using it for partial limit is safe. Generally speaking, using state in scalar is discouraged.
Yes - in fact, I'm trying to see if we can add superfilters (like this one) which basically terminate the scan using three-valued logic: filter in, filter out, stop scan.
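A minimal sketch of what such a three-valued "superfilter" verdict could look like (hypothetical names, not Presto's actual API), using a distinct limit as the example of a filter that can eventually stop the scan:

```java
import java.util.HashSet;
import java.util.Set;

public class Superfilter {
    // Three-valued verdict: keep the row, drop the row, or signal that
    // no future row can pass, so the operator may terminate the scan.
    public enum Verdict { KEEP, DROP, STOP_SCAN }

    private final Set<Long> seen = new HashSet<>();
    private final int limit;

    public Superfilter(int limit) {
        this.limit = limit;
    }

    public Verdict test(long hash) {
        if (seen.size() >= limit) {
            return Verdict.STOP_SCAN; // limit reached: nothing more can pass
        }
        // add() returns true only for a hash not seen before
        return seen.add(hash) ? Verdict.KEEP : Verdict.DROP;
    }
}
```

Here STOP_SCAN is the piece a plain boolean filter cannot express: it lets the scan end early instead of draining the rest of the split.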
(force-pushed 9cd233c to f8509d2)
As for the benchmark, we don't have a way to benchmark distributed queries! Still, I added the benchmark; if/when we add the capability, we can run it.
(force-pushed 6a9527b to 9811f96)
Updated the commit message with some "benchmark" numbers.
@kaikalur Sreeni, what are the wins you are observing with e2e queries?
@kaikalur Sreeni, would you update the Release Notes to document the configuration and session properties that can be used to enable / disable this optimization?
It would be nice to add a comment to the code explaining this, as it is not obvious.
I updated the commit message. Firstly, I'm seeing global memory go from 20GB to < 1MB for the limit query we were playing with, and under simulated load the latency was consistently 39s for this version vs 1m40s for the other one.
(force-pushed 9811f96 to 5387c81)
Done
That's impressive. Do you know why the original version would use so much memory? It sounds like a bug worth fixing.
So, my theory :) As I mentioned in the issue I referenced, the DistinctLimit operator seems to be implemented assuming it will hit the limit fast. It actually uses group-by mechanisms to mark positions to copy out of the page, etc., so my theory is that those pages were being held. On the other hand, this filter function has a fixed LongArraySet of at most 10k entries, so about 80KB total per driver.
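A rough sketch of that filter's behavior (illustrative names; the real code keeps a fixed-size primitive LongArraySet per driver rather than a boxed HashSet):

```java
import java.util.HashSet;
import java.util.Set;

public class KDistinctFilter {
    // Bounded state: at most k hashes, so memory stays fixed per driver.
    private final Set<Long> seen = new HashSet<>();
    private final int k;

    public KDistinctFilter(int k) {
        this.k = k;
    }

    // Returns true for the first k distinct hash values seen by this
    // driver; false for duplicate hashes and for everything after the
    // limit is reached.
    public boolean keep(long hash) {
        if (seen.size() >= k) {
            return false;
        }
        return seen.add(hash);
    }
}
```

With k = 10,000 and 8-byte entries, the raw hash payload is about 80KB per driver, which matches the memory behavior described above.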
It would be nice to dig a bit deeper to get a better understanding of why the current implementation is using so much memory.
(force-pushed 5387c81 to be47862)
mbasmanova left a comment
@kaikalur Overall looks good. A few comments / questions below.
nit: Capitalize 'threshold'
Should we add a check that the threshold is < 10K? I'm concerned about users inadvertently configuring the system in a way that may produce approximate results. Investigating such issues would be extremely hard. I think having a top-level config to opt in to approximate results would be helpful. It may also help to issue a warning whenever a query uses a feature / optimization that may produce approximate results.
This is not a config but rather a query-level session param. So as long as it doesn't OOM, I see no issue (again, the chance of collision only starts to matter at 2bn - 4bn distinct values).
I'm a bit confused here by the statement around the threshold for collision - should we assume this is == 0 or ~= 0 at sufficiently low cardinalities (like the 10K proposed above)?
Yes, that's the idea. 10k is what we are starting with, but if other deployments want to change it, they can.
Would users understand what "Hash based distinct limit" means and what the "threshold" is for it? Should we add some documentation somewhere in https://prestodb.io/docs/current/optimizer.html and/or https://prestodb.io/docs/current/admin/properties.html#optimizer-properties ?
It's intentional. We are not expecting users to mess with this. I'm thinking this will mostly be enabled/disabled by admins for whole clusters - enabled for adhoc/analytical clusters only.
Are there reasons why we wouldn't want to enable this for batch workloads? My thinking is that we are adding the boolean config to control roll-out, but once this is tested, we can actually remove it and always use the threshold (assuming that under certain numbers, say 100, it would always be beneficial).
Sure we can. It's up to whoever wants to use it :) Also, we generally don't see this pattern in batch workloads.
This is function documentation, right? Since the function returns a boolean, perhaps explain what conditions make it return true and false? Perhaps: "Returns true for the first K distinct hash values."
Yeah - I also want to capture the fact that it maintains state. Let me see if I can rephrase it.
nit: Why "key"? Perhaps addHash is clearer.
nit: size is K, right? Would it make sense to name it that, or use the term limit for clarity? It would be nice to add a comment to this method explaining what it does.
Yeah, I was thinking that. Made it k.
nit: functionManager -> functionAndTypeManager is null
These null checks look redundant to me. I'm going to get rid of them.
Any particular reason not to use IterativeOptimizer here?
This actually happens after we do all the optimizations, so the patterns can be really weird. So I prefer this visitor-based optimizer; in general, I also like it better than the iterative optimizer.
Using com.facebook.presto.sql.planner.iterative.Rule would make it a lot simpler and nicer.
Harder to understand. I like the visitor pattern better than regex-like rules.
Is this the case? I believe we have a special optimization for hash-based aggregation on a single BIGINT key.
Yeah, that's why I use that single bigint as the variable.
Just curious, what's the optimization? It doesn't apply to other integer types?
hash is a long, so it looks like it optimizes for the case when the group-by key is a single BIGINT type (which is what we use for all integer types, I thought).
We use Java long for all integer SQL types. So if this is specific to the Java type, it should map to TINYINT, ..., BIGINT.
One more question: I'm not sure how this benchmark works, but would you share the results?
This doesn't work for the actual case here because we need partial/final, but the LocalQueryRunner doesn't do that yet. #17210
sql_distinct_limit: default :: 136.001 cpu ms :: 0B peak memory :: in 15K, 0B, 110K/s, 0B/s :: out 1, 5B, 7/s, 36B/s
sql_distinct_limit: hash based :: 109.253 cpu ms :: 0B peak memory :: in 15K, 0B, 137K/s, 0B/s :: out 1, 5B, 9/s, 45B/s
sql_distinct_limit: default :: 84.350 cpu ms :: 0B peak memory :: in 15K, 0B, 178K/s, 0B/s :: out 1, 5B, 11/s, 59B/s
sql_distinct_limit: hash based :: 71.158 cpu ms :: 0B peak memory :: in 15K, 0B, 211K/s, 0B/s :: out 1, 5B, 14/s, 70B/s
So why is there a perf difference in this benchmark? Why do we include this benchmark here if it doesn't test the feature introduced?
I don't understand that either :) Maybe it's not warmed up enough? Changing it to 20 warmup and 20 test iterations shows better results:
sql_distinct_limit: default :: 77.940 cpu ms :: 0B peak memory :: in 15K, 0B, 192K/s, 0B/s :: out 1, 5B, 12/s, 64B/s
sql_distinct_limit: hash based :: 70.397 cpu ms :: 0B peak memory :: in 15K, 0B, 213K/s, 0B/s :: out 1, 5B, 14/s, 71B/s
sql_distinct_limit: default :: 70.115 cpu ms :: 0B peak memory :: in 15K, 0B, 214K/s, 0B/s :: out 1, 5B, 14/s, 71B/s
sql_distinct_limit: hash based :: 71.953 cpu ms :: 0B peak memory :: in 15K, 0B, 208K/s, 0B/s :: out 1, 5B, 13/s, 69B/s
Removed the test for now.
(force-pushed be47862 to 9afeb69)
(force-pushed 9afeb69 to 2f2a6b6)
Some perf numbers (over 10+ iterations):

Original query - 81 distinct rows out of 702B rows:
------------------------------------------------------------------------
(81 rows)
Query 20220119_013303_37211_k4nb5, FINISHED, 599 nodes
Splits: 262,004 total, 262,004 done (100.00%)
1:38 [702B rows, 484GB] [7.17B rows/s, 4.95GB/s]

With the optimization:
----------------------
(81 rows)
Query 20220118_203652_00269_kynz9, FINISHED, 599 nodes
Splits: 262,019 total, 262,019 done (100.00%)
0:39 [702B rows, 484GB] [18.1B rows/s, 12.5GB/s]
(force-pushed 2f2a6b6 to 6457ec3)
Added a quicker version of partial distinct limit based on hashes to reduce latency. This eliminates the partial DistinctLimit node and instead uses a stateful filter function that keeps the first N distinct hashes and filters out every row once it reaches the limit or whenever a hash has already been seen.
It may be slightly inaccurate when there are hash collisions, but for adhoc quick data exploration this is good, as the user doesn't have to wait hours to see the values.
Fixes #17196
Test plan - Tests already exist