
Add KLL Sketch Functions #21568

Merged
tdcmeehan merged 1 commit into prestodb:master from ZacBlanco:upstream-kll-sketch on Mar 4, 2024

Conversation

@ZacBlanco (Contributor) commented Dec 18, 2023

Description

This change adds KLL Sketch support from the Apache DataSketches library. One of the benefits of KLL sketches is that the implementation supports more data types than the existing quantile digest and T-Digest implementations. It also has provably bounded error, unlike the T-Digest.

The following data type is added:

  • kllsketch(T): essentially a wrapper around varbinary, similar to the tdigest and qdigest types

The following functions are added:

  • sketch_kll: computes a KLL sketch
  • sketch_kll_with_k: computes a KLL sketch with the given value for k
  • sketch_kll_quantile: queries the sketch for a particular quantile
  • sketch_kll_rank: queries the sketch for a particular rank (the inverse of the quantile function)
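Since sketch_kll_quantile and sketch_kll_rank are inverses of each other, their intended semantics can be illustrated with a naive, exact reference model (a Python sketch with hypothetical helper names; the real functions answer these queries approximately from a compact KLL sample, not from the full data):

```python
# Illustrative only: an exact model of the quantile/rank semantics that
# sketch_kll_quantile and sketch_kll_rank approximate. Helper names are
# hypothetical; the real implementation uses Apache DataSketches.
from bisect import bisect_left

def exact_quantile(values, rank):
    """Smallest value at or above the given normalized rank (0.0-1.0)."""
    ordered = sorted(values)
    index = min(int(rank * len(ordered)), len(ordered) - 1)
    return ordered[index]

def exact_rank(values, item):
    """Fraction of values strictly less than `item` (inverse of the above)."""
    ordered = sorted(values)
    return bisect_left(ordered, item) / len(ordered)

data = list(range(1, 101))              # 1..100
median = exact_quantile(data, 0.5)      # 51 for this data
assert exact_rank(data, median) == 0.5  # rank is the inverse of quantile
```

A KLL sketch answers the same two queries to within a provable rank-error bound while retaining only a small sample of the input.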

Motivation and Context

Eventual use by the optimizer after #21236. It is also good to have because HMS (the Hive Metastore) is releasing support for these sketches soon.

Impact

Two new aggregation functions, two new scalar functions

Test Plan

Standard aggregation unit tests and scalar function unit tests

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* New support for Apache DataSketches KLL sketch with the sketch_kll and related family of functions

@ZacBlanco ZacBlanco requested a review from a team as a code owner December 18, 2023 23:17
@github-actions bot commented Dec 18, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff 0907cd8...4df29ad.

Notify file(s) for @steveburnett:
  • presto-docs/src/main/sphinx/functions/sketch.rst
  • presto-docs/src/main/sphinx/language/types.rst

@ZacBlanco ZacBlanco changed the title Add KLL Sketch Support Add KLL Sketch Functions Dec 18, 2023
@ZacBlanco ZacBlanco force-pushed the upstream-kll-sketch branch 2 times, most recently from 543459e to 74c92d1 Compare December 19, 2023 00:34
@steveburnett (Contributor) left a comment

Only a very tiny nit, otherwise LGTM!

@Yuhta (Contributor) commented Jan 4, 2024

There are some pitfalls (mainly memory-wise) we hit using KLL in presto_cpp; I am not sure whether they apply to this Java implementation:

  1. The sketch is initialized with a fairly large pre-allocated array, sized according to the error bound. This causes large memory usage when the number of groups is high and each group accumulates only a few items.
  2. The serialized form is not very compact and can occupy more memory than it needs. This is OK on the partial aggregation side but can cause memory issues on the final aggregation side.
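Pitfall 1 can be made concrete with rough numbers (all figures below are assumed for illustration, not measurements from presto_cpp):

```python
# Illustrative arithmetic only (assumed figures): if each group's sketch
# pre-allocates a fixed array sized for its error bound, memory is wasted
# when many groups each hold only a handful of items.
preallocated_slots = 600     # assumed pre-allocated capacity per sketch
bytes_per_item = 8           # doubles
groups = 1_000_000
items_per_group = 5          # each group accumulates very few items

allocated = groups * preallocated_slots * bytes_per_item
needed = groups * items_per_group * bytes_per_item
overhead = allocated // needed
print(f"allocated {allocated / 1e9:.1f} GB for {needed / 1e6:.0f} MB of data "
      f"({overhead}x overhead)")
```

Growing the bottom level lazily, the fix suggested later in this thread, would remove most of this overhead for sparse groups.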

@ZacBlanco (Contributor Author) replied:

Thanks for bringing up these points! Was there anything you did to mitigate them?

re: 1. I haven't checked the code, but I didn't consider performance with a high number of groups, since the main target use case for introducing these functions is providing table- and partition-level statistics for the optimizer. If you have any thoughts on optimization, I would happily discuss them.

re: 2. I haven't measured the serialized size, but according to the charts on the Apache DataSketches website, a "standard" k value of 200 produces sketches in the range of 2-4 KB serialized with floats after consuming 1M+ items. Generously assuming doubles take twice the space, that is about 8 KB.

[Chart from the DataSketches website: KLL sketch serialized size vs. number of items]

Do you know which parameters affect the number of partial aggregations sent to the final aggregation? I assumed it's the size of the cluster, but I could be wrong. Even if one node coalesces 1,000 KLL sketches, that should only consume about 8 KB * 1000 = ~8 MB of memory, which shouldn't pose any issue. But it really depends on how many sketches are sent to the final aggregation step.
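The back-of-envelope estimate can be checked directly (both figures are assumptions: ~8 KB per serialized double-valued sketch at k = 200, taken from the DataSketches size charts, and a hypothetical fan-in of 1,000 sketches):

```python
# Sanity check of the estimate above. Both inputs are assumed figures,
# not measurements: per-sketch serialized size and final-agg fan-in.
sketch_bytes = 8 * 1024
sketches_coalesced = 1000
total_mb = sketch_bytes * sketches_coalesced / (1024 * 1024)
print(f"{total_mb:.1f} MB at the final aggregation")  # prints "7.8 MB ..."
```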

Contributor:

@Yuhta what sort of memory issues were observed?

Contributor:

For 1, it's easy to fix in the sketch itself: just make the sketch grow the bottom layer while the number of elements it keeps is small. You may need to submit a PR to the DataSketches project to fix it, though, or copy the code over.

For 2, the serialized size also depends on what the input data looks like. The worst case compared to Q-Digest/T-Digest comes from many identical inputs: Q-Digest does not increase its memory usage in that case, but KLL does, because it keeps samples of the inputs instead of a distribution. Multiplied by the number of groups and the fan-in factor of the final aggregation, this sometimes blows up memory. One mitigation we applied is to compact the sketch before serialization; it does not resolve the problem completely, but we have been fine on production data since then.

Both issues can be fixed later; I just mention them here so we are not surprised when they show up.

@ZacBlanco ZacBlanco force-pushed the upstream-kll-sketch branch 3 times, most recently from 756aeb7 to 6a40813 Compare January 10, 2024 17:43
@tdcmeehan tdcmeehan self-assigned this Jan 10, 2024
steveburnett previously approved these changes Jan 10, 2024

@steveburnett (Contributor) left a comment

LGTM! (docs)

New pull of branch and new local build of docs.

Contributor:

Do we also need to add documentation for sketch_kll_with_k ?

Contributor:

I noticed new content had been added addressing this, so I pulled the branch and made a local build of the docs just now. Here's a screenshot of the newly-added sketch_kll_with_k entry:
[Screenshot: the newly added sketch_kll_with_k documentation entry]

@ZacBlanco ZacBlanco force-pushed the upstream-kll-sketch branch from 6a40813 to 905e17b Compare January 23, 2024 19:08
@vivek-bharathan (Contributor) left a comment

LGTM, just nits from my side. Good work on the memory accounting!

Contributor:

Is this a real risk? Should this be user-configurable in that case?

Contributor Author (@ZacBlanco):

I could see a case where making it configurable might be desirable, but that's why we have the sketch_kll_with_k function: users can pass any value they want. I put this constant here to (1) extract the constant value from all of the call sites and (2) provide a stable default in case the library changes its default after an upgrade. A different value affects the storage size of the sketch, so I wanted something stable within the Presto code base.

Contributor:

These two addMemoryUsage calls should cancel each other out, right?

Contributor Author (@ZacBlanco):

They might not: since state is mutable, line 67 creates a new sketch which may have a different size than the result of state.getSketch() at that time. So we need to subtract the current state's size (line 68), set the new state (line 69), and re-calculate the size (line 70).
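The accounting pattern being described can be sketched as follows (Python with hypothetical names; the actual code is Java inside Presto's aggregation state, where addMemoryUsage reports deltas to the memory tracker):

```python
# A minimal model of the pattern: subtract the old sketch's size, swap in
# the merged sketch, then add the new size. The two addMemoryUsage calls
# do not cancel because the merged sketch may be larger or smaller.
class SketchState:
    def __init__(self):
        self.sketch = b""
        self.memory_used = 0

    def add_memory_usage(self, delta):
        self.memory_used += delta     # reported to the memory tracker

    def set_sketch(self, merged):
        self.add_memory_usage(-len(self.sketch))  # retire old size
        self.sketch = merged                      # install merged sketch
        self.add_memory_usage(len(self.sketch))   # account for new size

state = SketchState()
state.set_sketch(b"x" * 100)
state.set_sketch(b"y" * 250)   # grew: memory_used is now 250, not 350
assert state.memory_used == 250
```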

aaneja previously approved these changes Jan 24, 2024
@ZacBlanco ZacBlanco force-pushed the upstream-kll-sketch branch 2 times, most recently from c026de4 to d817bec Compare January 24, 2024 17:27
This change adds KLL Sketch support from the Apache DataSketches library.
One of the benefits of KLL sketches is that the implementation supports
more datatypes than the existing quantile digest and T-Digest implementations.

The following functions are added

- `sketch_kll`: computes a KLL sketch
- `sketch_kll_with_k`: computes a KLL sketch with the given value for k
- `sketch_kll_quantile`: queries the sketch for a particular quantile
- `sketch_kll_rank`: queries the sketch for a particular rank (inverse of the quantile function)
@ZacBlanco ZacBlanco force-pushed the upstream-kll-sketch branch from d817bec to 4df29ad Compare January 24, 2024 18:47
@ZacBlanco ZacBlanco requested a review from tdcmeehan February 1, 2024 18:58
@tdcmeehan (Contributor) commented:

I think we need to document the serialized binary format, similar to T-Digest and QDigest. Otherwise, they're completely unreadable by any other system.

@ZacBlanco (Contributor Author) replied:

> I think we need to document the serialized binary format, similar to T-Digest and QDigest.

The serialized binary format is specified by the Apache Datasketches library. This is already noted in the docs changes.

https://github.com/prestodb/presto/pull/21568/files#diff-33bc3be2c487c146b213ef7b64ddc7c91956aa6e8b5d6207b217061312286b37R47-R48

If this is confusing I can try to update the phrasing

@tdcmeehan (Contributor):

I can't find where they explicitly document the binary format. Is there a link for that?

@ZacBlanco (Contributor Author) commented Feb 13, 2024

The binary format is not documented, unfortunately. However, all versions (C++, Java, Go, etc.) of the DataSketches libraries can serialize and deserialize the sketches; e.g., Java can deserialize C++-serialized sketches and vice versa. AFAIK this is the case for all sketches coming from the DataSketches library.

@tdcmeehan (Contributor) left a comment

Just one last thing

@tdcmeehan tdcmeehan merged commit e0e9a59 into prestodb:master Mar 4, 2024
@ZacBlanco (Contributor Author):

Just for some continuity on the discussion of sketch serialized-format compatibility: we brought this to the DataSketches community in hopes of clarifying the guarantees the maintainers provide. This resulted in some changes to the DataSketches documentation in apache/datasketches-website#162 and subsequently apache/datasketches-website#163.

@tdcmeehan (Contributor) commented Mar 4, 2024

Possible follow up: if KLL has a provable error bound, could we use it to replace Q-Digest inside of approx_distinct? The only reason approx_distinct uses Q-Digest under the hood, in spite of its poor performance compared to T-Digest, is it has a provable error bound.

@ZacBlanco (Contributor Author) commented Mar 4, 2024

I'm not familiar with the code for approx_distinct, but if the KLL sketch answers the same question as the Q-Digest (e.g., into which percentile does value X fall, or the inverse), then we should be able to replace it. Another option is to replace the approx_distinct implementation entirely with a Theta sketch, though I can't speak to how a Theta sketch's accuracy compares to the current approx_distinct implementation.

This could be a good project/experiment for a new Presto contributor

@jmalkin commented Mar 4, 2024

The binary format is defined here: https://github.com/apache/datasketches-java/blob/master/src/main/java/org/apache/datasketches/kll/KllPreambleUtil.java

Obviously the format for items cannot be defined for arbitrary type T, and the exact size of the levels array is not fixed. But the serialized header is defined, as is the order of the non-fixed-size/format data.

@leerho commented Mar 4, 2024

@Yuhta,
I don't understand your comment above (Jan 4th) "The serialized form is not very compact and can occupy more memory than it needs."

The KLL sketch always serializes (toByteArray()) into a compact form, which has no wasted space, and grows very gradually based on the number of items fed to and retained by the sketch. You can see the effect of this in the plot copied from our website above. The blue curve starts very small (8 bytes) and grows gradually until it reaches ~2.5K bytes at about 1M items.

It is true that if the sketch image were compressed it would become even smaller, but that can be costly in terms of time. That is not an option we currently offer as part of the sketch, but as long as one is using lossless compression algorithms, it is something you could do yourself.

If you could elaborate on your concern a bit more, it would help us understand the issue.

Thanks!

@wanglinsong wanglinsong mentioned this pull request May 1, 2024