
Add a quicker, hash-based version of distinct limit#17199

Merged
rongrong merged 1 commit into prestodb:master from kaikalur:optimize_distinct_limit
Jan 24, 2022

Conversation

@kaikalur
Contributor

@kaikalur kaikalur commented Jan 18, 2022

Added a quicker, hash-based version of partial distinct limit to reduce latency. This eliminates the DistinctLimitPartial node type and instead uses a stateful filter function that keeps the first N distinct hashes and filters out a row once the limit is reached or if its hash has already been seen.

It may be slightly inaccurate when there are hash collisions, but for ad hoc, quick data exploration this is good enough, as the user doesn't have to wait hours to see the values.
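
As a minimal sketch of the mechanism described above (the class and method names here are illustrative only; per the discussion below, the actual code uses a fixed-size LongArraySet rather than a boxed HashSet):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a stateful distinct-limit filter: pass a row through
// only if its hash is new and fewer than `limit` distinct hashes have been seen.
public class DistinctLimitHashFilter
{
    private final int limit;
    private final Set<Long> seenHashes = new HashSet<>();

    public DistinctLimitHashFilter(int limit)
    {
        this.limit = limit;
    }

    // Returns true for the first `limit` distinct hash values; false for
    // duplicates and for everything after the limit is reached.
    public boolean filter(long hash)
    {
        if (seenHashes.size() >= limit) {
            return false; // limit reached: filter out all remaining rows
        }
        // Set.add returns true only if the hash was not already present
        return seenHashes.add(hash);
    }
}
```

Distinctness is decided purely on the 64-bit hash, which is why a collision can silently drop a value.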

Fixes #17196

Test plan - Tests already exist

== RELEASE NOTES ==

General Changes
* We now have a faster version of distinct limit N queries for N <= 10000 that uses distinct hashes to return results more quickly in the common situation where a large dataset contains only a few dozen distinct values. This feature can be enabled with the session property hash_based_distinct_limit_enabled, and the limit (default 1000) can be adjusted with hash_based_distinct_limit_threshold.

@kaikalur kaikalur requested review from rongrong and removed request for rongrong January 18, 2022 16:25
Contributor

I'm not sure I understand how this code is removing "the largest one". I think this code either adds "hash" (if not exists already) or not. It returns true if added and false if the value already existed.

Contributor Author

Sorry bad comment. Removed it.

@mbasmanova
Contributor

@kaikalur My understanding is that the speedup comes from the fact that this implementation is approximate, e.g. returns a set of values with distinct hashes and may miss some of the values if they happen to have the same hash. It would be good to clarify that in the release notes and, perhaps, use session / configuration key with the name that clearly signals that the results may not be 100% accurate.

@kaikalur kaikalur force-pushed the optimize_distinct_limit branch from ccb241c to 572f5dd Compare January 18, 2022 18:23
@kaikalur
Contributor Author

kaikalur commented Jan 18, 2022

@kaikalur My understanding is that the speedup comes from the fact that this implementation is approximate, e.g. returns a set of values with distinct hashes and may miss some of the values if they happen to have the same hash. It would be good to clarify that in the release notes and, perhaps, use session / configuration key with the name that clearly signals that the results may not be 100% accurate.

Yes, that's correct, but two points:

1. With a 64-bit hashing algorithm and 10K distinct values (the threshold), the chance of collision is almost 0, according to the internet :)

https://stackoverflow.com/questions/22029012/probability-of-64bit-hash-code-collisions

2. It's inaccurate only in the sense that we may miss some values; all values returned are distinct.
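
The claim in point 1 can be reproduced with the standard birthday bound (a sketch; the class and method names are mine):

```java
// Birthday bound for n values hashed uniformly into 64-bit space:
//   P(collision) ~= 1 - exp(-n * (n - 1) / 2^65)
public class CollisionBound
{
    public static double collisionProbability(long n)
    {
        double pairs = (double) n * (n - 1) / 2.0;
        // -expm1(-x) == 1 - exp(-x), and stays accurate for tiny x
        return -Math.expm1(-pairs / Math.pow(2, 64));
    }

    public static void main(String[] args)
    {
        // On the order of 1e-12 at the 10K threshold; the probability only
        // becomes noticeable around a few billion distinct values.
        System.out.println(collisionProbability(10_000L));
        System.out.println(collisionProbability(4_000_000_000L));
    }
}
```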

@kaikalur kaikalur changed the title Add a quicker, hash-based version of distinct limit [WIP] Add a quicker, hash-based version of distinct limit Jan 18, 2022
Contributor

Please keep the class name and function name consistent, so maybe change this to TopkDistinct instead. Or TopKDistinct and top_k_distinct, or name the function k_distinct.

Contributor

+1

Using the term topk in the name might be confusing, as top-k usually refers to the first k elements of a sorted list, but here the list is not sorted. I'd also consider adding an approx_ prefix to the name to make it crystal clear that results may not be 100% accurate.

Also, I wonder if this logic could be folded into DistinctLimitOperator instead of a function.

Contributor Author

The DistinctLimit operator is quite heavy, and it has partitioning semantics, etc. I didn't want to touch it because it is still needed for the case when this feature is turned off.

And I'm convinced, due to my research :), that for the threshold I have (10K) it's actually accurate. We have a FeaturesConfig, so for deployments that serve adhoc/analytical queries this is fine. It is definitely not approximate.

Contributor

nits: Static import
Is this function useful as a public function? I'd say mark it as HIDDEN

Contributor Author

Yeah I'm wondering if this could be a useful function for users. We will see.

Contributor Author

Kept it hidden for now

Contributor

Are you planning to use hash_based_distinct_limit_threshold = 0 to disable this feature? Should we have a separate variable to enable this?

Contributor

I think it might be cleaner to introduce a top-level property that enables approximate query results and require that it is enabled for any of the approximate optimizations including approximate results where slow workers are ignored, results produced using metadata only, this optimization, etc.

Contributor Author

@kaikalur kaikalur Jan 19, 2022

Yeah - looking into that as well maybe when we have some more we can do that.

@rongrong
Contributor

Please also add tests and benchmark, thanks!

@rongrong
Contributor

Just to be clear, the state in a scalar function is shared only within the same driver (same page processor), so using it for partial limit is safe. Generally speaking, using state in scalar functions is discouraged.

@kaikalur
Contributor Author

kaikalur commented Jan 18, 2022

Just to be clear, the state in scalar function is shared only within the same driver (same page processor), so using it for partial limit is safe. Generally speaking, using state in scalar is discouraged.

Yes. In fact, I'm trying to see if we can add superfilters (like this one) that basically terminate the scan using three-valued logic: filter in, filter out, stop scan.
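
That superfilter idea could be sketched as a filter with a three-valued verdict (purely illustrative; none of these names exist in Presto, and this is not part of the PR):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical "superfilter": a distinct-limit filter that can also tell the
// scan to stop once no further rows could possibly pass.
public class SuperFilterSketch
{
    public enum Verdict
    {
        FILTER_IN,   // keep the row
        FILTER_OUT,  // drop the row (duplicate hash)
        STOP_SCAN    // limit reached: nothing else can pass, abort the scan
    }

    private final Set<Long> seen = new HashSet<>();
    private final int limit;

    public SuperFilterSketch(int limit)
    {
        this.limit = limit;
    }

    public Verdict apply(long hash)
    {
        if (seen.size() >= limit) {
            return Verdict.STOP_SCAN;
        }
        return seen.add(hash) ? Verdict.FILTER_IN : Verdict.FILTER_OUT;
    }
}
```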

@kaikalur kaikalur force-pushed the optimize_distinct_limit branch 5 times, most recently from 9cd233c to f8509d2 Compare January 19, 2022 18:47
@kaikalur
Contributor Author

As for the benchmark: we don't have a way to benchmark distributed queries! Still, I added the benchmark; if/when we add the capability, we can run it.

@kaikalur kaikalur force-pushed the optimize_distinct_limit branch 3 times, most recently from 6a9527b to 9811f96 Compare January 19, 2022 19:11
@kaikalur
Contributor Author

Updated the commit message with some "benchmark" numbers

@mbasmanova
Contributor

@kaikalur Sreeni, what are the wins you are observing with e2e queries?

@mbasmanova
Contributor

@kaikalur Sreeni, would you update Release Notes to document configuration and session properties that can be used to enable / disable this optimization.

And I'm convinced here due to my research :) that for the threshold I have (10K) it's actually accurate.

It would be nice to add a comment explaining this to the code as this is not obvious.

@kaikalur
Contributor Author

@kaikalur Sreeni, what are the wins you are observing with e2e queries?

I updated the commit message. Firstly, I'm seeing global memory go from 20GB to < 1MB for the limit query we were playing with; and secondly, under simulated load, the latency was consistently 39s for this version vs 1m40s for the other one.

@kaikalur kaikalur force-pushed the optimize_distinct_limit branch from 9811f96 to 5387c81 Compare January 19, 2022 19:47
@kaikalur
Contributor Author

@kaikalur Sreeni, would you update Release Notes to document configuration and session properties that can be used to enable / disable this optimization.

And I'm convinced here due to my research :) that for the threshold I have (10K) it's actually accurate.

It would be nice to add a comment explaining this to the code as this is not obvious.

Done

@mbasmanova
Contributor

I'm seeing firstly global memory go from 20G to < 1MB for the limit query we were playing with and the latency under load was consistently 39s for this version vs 1m40s for the other one under simulated load.

That's impressive. Do you know why the original version would use so much memory? It sounds like a bug worth fixing.

@kaikalur
Contributor Author

I'm seeing firstly global memory go from 20G to < 1MB for the limit query we were playing with and the latency under load was consistently 39s for this version vs 1m40s for the other one under simulated load.

That's impressive. Do you know why would the original version use so much memory? It sounds like a bug worth fixing.

So, my theory :) Like I mentioned in the issue I referenced, the DistinctLimit operator seems to be implemented assuming it will hit the limit fast. It actually uses group-by mechanisms to mark positions to copy out of the page, etc., so my theory is that those pages were being held. On the other hand, this filter function has a fixed LongArraySet of at most 10K entries, so 80KB total per driver.
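
The 80KB-per-driver figure follows from simple arithmetic (a sketch under the assumption that the set's own bookkeeping overhead is ignored; the class and method names are mine):

```java
// Rough per-driver memory for a fixed set of at most `maxEntries` 64-bit hashes.
public class DriverMemoryEstimate
{
    public static long fixedHashSetBytes(int maxEntries)
    {
        // 8 bytes per 64-bit hash; set overhead deliberately ignored
        return (long) maxEntries * Long.BYTES;
    }

    public static void main(String[] args)
    {
        System.out.println(fixedHashSetBytes(10_000)); // 80000 bytes, i.e. ~80KB per driver
    }
}
```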

@mbasmanova
Contributor

It would be nice to dig a bit deeper to get a better understanding of why the current implementation is using so much memory.

@kaikalur kaikalur changed the title [WIP] Add a quicker, hash-based version of distinct limit Add a quicker, hash-based version of distinct limit Jan 19, 2022
@kaikalur kaikalur force-pushed the optimize_distinct_limit branch from 5387c81 to be47862 Compare January 19, 2022 23:33
Contributor

@mbasmanova mbasmanova left a comment

@kaikalur Overall looks good. A few comments / questions below.

Contributor

nit: Capitalize 'threshold'

Contributor Author

Done

Contributor

Should we add a check that the threshold is < 10K? I'm concerned about users inadvertently configuring the system in a way that may produce approximate results. Investigating such issues would be extremely hard. I think having a top-level config to opt into approximate results would be helpful. It may also be helpful to issue a warning whenever a query uses a feature / optimization that may produce approximate results.

Contributor Author

This is not a config but rather a query-level session param. So as long as it doesn't OOM, I see no issue (again, the linked answer says the chance of collision only starts to matter around 2bn-4bn distinct values).

I'm a bit confused here by the statement around threshold for collision - should we assume this is == 0 or ~= 0 at sufficiently low cardinalities (like the 10K proposed above)?

Contributor Author

I'm a bit confused here by the statement around threshold for collision - should we assume this is == 0 or ~= 0 at sufficiently low cardinalities (like the 10K proposed above)?

Yes that's the idea. 10k is what we are starting with but if other deployments want to change it, they can do it.

Contributor

Would users understand what "Hash based distinct limit" means and what is "threshold" for that? Should we add some documentation somewhere in https://prestodb.io/docs/current/optimizer.html and/or https://prestodb.io/docs/current/admin/properties.html#optimizer-properties ?

Contributor Author

It's intentional. We are not expecting users to mess with this. I'm thinking this will mostly be enabled/disabled by admins for whole clusters: enable it for adhoc/analytical clusters only.

Contributor

Are there reasons why we don't want to enable this for batch workloads? My thinking is that we are adding the boolean config to control the rollout, but once this is tested we can actually remove it and always use the threshold (assuming that under a certain number, say 100, it would always be beneficial).

Contributor Author

Sure, we can. It's up to whoever wants to use it :) Also, we generally don't see this pattern in batch workloads.

Contributor

This is a function documentation, right? Since the function returns a boolean, perhaps, explain what conditions make it return true and false? Perhaps, "Returns true for the first K distinct hash values".

Contributor Author

Yeah - I also want to capture the fact that it maintains state. Let me see if I can rephrase it.

Contributor

nit: Why "key"? Perhaps, addHash is clearer.

nit: size is K, right? would it make sense to name it that or use the term limit for clarity? It would be nice to add a comment to this method to explain what it does.

Contributor Author

yeah I was thinking that. Made it k

Contributor

nit: functionManager -> functionAndTypeManager is null

Contributor Author

These null checks look redundant to me. I'm going to get rid of them.

Contributor

nit: static import

Contributor Author

Done

Contributor

Any particular reason not to use IterativeOptimizer here?

Contributor Author

This actually happens after we do all the other optimizations, so the patterns can be really weird. I prefer this visitor-based optimizer here; in general I also like it better than the iterative optimizer.

Contributor

Using com.facebook.presto.sql.planner.iterative.Rule would make it a lot simpler and nicer.

Contributor Author

Harder to understand. I like the visitor pattern better than regex-like rules.

Contributor

nits: static import

Contributor Author

Done

Contributor

Is this the case? I believe we have a special optimization for hash-based aggregation on a single BIGINT key.

Contributor Author

Yeah that's why I use that single bigint as the variable.

Contributor

Just curious, what's the optimization? It doesn't apply to other integer types?

Contributor Author

hash is a long, so it looks like it optimizes for the case when the group-by key is a single BIGINT type (which is what we use for all integer types, I thought).

Contributor

We use a Java long for all integer SQL types. So if this is specific to the Java type, it should map to TINYINT, ... BIGINT.

Contributor

One more question. I'm not sure how this benchmark works, but would you share the results?

Contributor Author

This doesn't work for the actual case here because we need partial/final but the LocalQueryRunner doesn't do that yet. #17210

Contributor Author

        sql_distinct_limit: default ::  136.001 cpu ms ::    0B peak memory :: in   15K,      0B,    110K/s,      0B/s :: out     1,      5B,       7/s,     36B/s
     sql_distinct_limit: hash based ::  109.253 cpu ms ::    0B peak memory :: in   15K,      0B,    137K/s,      0B/s :: out     1,      5B,       9/s,     45B/s

        sql_distinct_limit: default ::   84.350 cpu ms ::    0B peak memory :: in   15K,      0B,    178K/s,      0B/s :: out     1,      5B,      11/s,     59B/s
     sql_distinct_limit: hash based ::   71.158 cpu ms ::    0B peak memory :: in   15K,      0B,    211K/s,      0B/s :: out     1,      5B,      14/s,     70B/s

Contributor

So why is there a perf difference in this benchmark? And why include this benchmark here if it doesn't test the feature being introduced?

Contributor Author

I don't understand that either :) Maybe it's not warmed up enough? Changing it to 20 warmup and 20 test iterations looks better:

        sql_distinct_limit: default ::   77.940 cpu ms ::    0B peak memory :: in   15K,      0B,    192K/s,	  0B/s :: out     1,	  5B,	   12/s,     64B/s
     sql_distinct_limit: hash based ::   70.397 cpu ms ::    0B peak memory :: in   15K,      0B,    213K/s,	  0B/s :: out     1,	  5B,	   14/s,     71B/s
        sql_distinct_limit: default ::   70.115 cpu ms ::    0B peak memory :: in   15K,      0B,    214K/s,	  0B/s :: out     1,	  5B,	   14/s,     71B/s
     sql_distinct_limit: hash based ::   71.953 cpu ms ::    0B peak memory :: in   15K,      0B,    208K/s,      0B/s :: out     1,      5B,      13/s,     69B/s

Contributor Author

Removed the test for now

@kaikalur kaikalur force-pushed the optimize_distinct_limit branch from be47862 to 9afeb69 Compare January 20, 2022 16:44
Contributor

@mbasmanova mbasmanova left a comment

Thank you, Sreeni.

@kaikalur kaikalur force-pushed the optimize_distinct_limit branch from 9afeb69 to 2f2a6b6 Compare January 21, 2022 03:58
Some perf numbers:

Original query - 81 distinct rows out of 702B rows (over 10+ iterations):
------------------------------------------------------------------------
(81 rows)

Query 20220119_013303_37211_k4nb5, FINISHED, 599 nodes
Splits: 262,004 total, 262,004 done (100.00%)
1:38 [702B rows, 484GB] [7.17B rows/s, 4.95GB/s]

With the optimization:
----------------------

(81 rows)

Query 20220118_203652_00269_kynz9, FINISHED, 599 nodes
Splits: 262,019 total, 262,019 done (100.00%)
0:39 [702B rows, 484GB] [18.1B rows/s, 12.5GB/s]
@kaikalur kaikalur force-pushed the optimize_distinct_limit branch from 2f2a6b6 to 6457ec3 Compare January 21, 2022 21:00
@rongrong rongrong merged commit e666700 into prestodb:master Jan 24, 2022
@neeradsomanchi neeradsomanchi mentioned this pull request Feb 8, 2022
@kaikalur kaikalur mentioned this pull request Dec 1, 2022

Development

Successfully merging this pull request may close these issues.

DistinctLimitPartial doesn't work well for low number of distinct values
