Skip to content

Add noisy_approx_distinct_sfm and related functions#21290

Merged
arhimondr merged 1 commit intoprestodb:masterfrom
jonhehir:noisy-approx-distinct
Nov 29, 2023
Merged

Add noisy_approx_distinct_sfm and related functions#21290
arhimondr merged 1 commit intoprestodb:masterfrom
jonhehir:noisy-approx-distinct

Conversation

@jonhehir
Copy link
Contributor

@jonhehir jonhehir commented Oct 31, 2023

Description

The new SfmSketch type corresponds to the "Sketch-Flip-Merge" data structure of http://proceedings.mlr.press/v202/hehir23a/hehir23a.pdf. In addition to the type, the following functions have been added to support noisy cardinality estimation:
cardinality(SfmSketch), merge(SfmSketch), merge_sfm(Array<SfmSketch>), noisy_approx_distinct_sfm, noisy_approx_set_sfm, noisy_empty_approx_set_sfm

These functions are essentially equivalent to the HyperLogLog functions such as approx_distinct, approx_set, etc., but using the SFM sketch in lieu of HLL for extra noise.

Documentation and release notes will be forthcoming in a separate PR.

Motivation and Context

These functions are added as part of the noisyaggregation suite. A notable difference between the theoretical SFM sketch and this implementation is that the aggregators (e.g., noisy_approx_set_sfm) return NULL when aggregating empty sets. Strictly speaking, this is a violation of the differential privacy guarantee of the theoretical SFM sketch, though this may be patched with use of the noisy_empty_approx_set_sfm function.

Impact

This does not affect any existing functionality.

Test Plan

Unit tests have been added for the SfmSketch structure itself and the corresponding Presto type and functions.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@jonhehir
Copy link
Contributor Author

cc @duykienvp

@jonhehir jonhehir requested a review from mlyublena October 31, 2023 23:00
jonhehir added a commit to jonhehir/airlift that referenced this pull request Nov 10, 2023
Reverting previous commit. The corresponding code will land in
prestodb/presto instead as part of
prestodb/presto#21290.

This reverts commit 277184d.
@jonhehir jonhehir force-pushed the noisy-approx-distinct branch 2 times, most recently from 19276ce to 053ddc0 Compare November 10, 2023 21:18
@jonhehir jonhehir marked this pull request as ready for review November 10, 2023 21:19
@jonhehir jonhehir requested a review from a team as a code owner November 10, 2023 21:19
@jonhehir jonhehir requested a review from presto-oss November 10, 2023 21:19
@jonhehir
Copy link
Contributor Author

Updated the PR to include the changes from prestodb/airlift#65 (which are being reverted in prestodb/airlift#66) and to rebase onto a fresh copy of master. This is ready to review now.

cc @gcormode

Copy link

@gcormode gcormode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me in this update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if Bitmaps are useful in other parts of the code (and in that case perhaps the class belongs in a different location)
but I guess for now we can leave them here

Copy link
Contributor

@mlyublena mlyublena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from Presto's POV

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I remember it tended to be slow (since the entropy is obtained from the OS). Have you considered Random with a secure seed? Does it have to provide cryptographic security?

Copy link
Contributor Author

@jonhehir jonhehir Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per https://arxiv.org/pdf/2002.04049.pdf, we should use a cryptographically secure PRNG. I was under the impression that SecureRandom would essentially use a secure seed (potentially blocking to set the seed), then deterministically generate from a PRNG, though on closer inspection, that appears to depend on the exact implementation being used. Since I'm not an expert here, I'm open to suggestions on how best to proceed here to balance security and performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NativePRNGNonBlocking is used for secure_random UDF in Presto: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/MathFunctions.java#L102

Here's a description on how it operates: https://www.synopsys.com/blogs/software-security/proper-use-of-javas-securerandom.html

In short it initializes the seed from /dev/urandom what doesn't block if the OS experiencing lack of entropy. The algorithm is based on SHA1PRNG which is considered cryptographically secure.

From what i understand the paper states that:

“Random numbers” should be generated by a cryptographically secure pseudo-random number generator (CSPRNG).

It looks like pseudo-random should be fine, as long as it is cryptographically secure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding serialize(DynamicSliceOutput) for efficiency (to avoid extra copies)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: castToVarbinary

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to account for an instance of Random? (may not be necessary if ThreadLocalRandom is used)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need to be accounted for if it's static? I believe in a previous code review, the suggestion was to make it static to avoid having to figure out the memory usage of a Random object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually I'm not sure if it's safe to have it static. It is not thread safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this method has to be exposed? Or should updating memory accounting be responsibility of the setSketch method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! This was done to mimic the design in HyperLogLog-related classes. I like the idea of deferring the responsibility to setSketch(), though I'd need to also account for memory when merging a sketch in state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input.readBytes can be more efficient

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't! Thanks for the link. In this case, I'm not sure a compressed bitmap will offer much value here, since in the end we scatter random bits throughout the bitmap. It'd be worth confirming in the future, but for now, I think we can just keep the uncompressed bitmap.

That said, the Bitmap class should probably extend Java's BitSet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, the Bitmap class now wraps BitSet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: requireNonNull

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: requireNonNull

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How difficuilt would it be to refactor the tests and make these two fields final?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making randomizedResponseProbability final would require returning copies rather than mutating in enablePrivacy and mergeWith. mergeWith in particular was designed to mutate to match the functionality in HyperLogLog, QuantileDigest, etc.

bitmap seems like it should be final, but I'll take a closer look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bitmap is now final.

@jonhehir jonhehir force-pushed the noisy-approx-distinct branch from 053ddc0 to e013275 Compare November 17, 2023 23:43
@jonhehir
Copy link
Contributor Author

@arhimondr, I pushed changes to address most of these concerns. I believe the outstanding questions surround the use (and memory accounting) of SecureRandom. (See responses above.) Let me know how you'd like to proceed, thanks!

@jonhehir jonhehir requested a review from arhimondr November 17, 2023 23:57
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: sizeOfLongArray(bitSet.size() / Long.SIZE) to account for the long[] object headers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually I'm not sure if it's safe to have it static. It is not thread safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NativePRNGNonBlocking is used for secure_random UDF in Presto: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/MathFunctions.java#L102

Here's a description on how it operates: https://www.synopsys.com/blogs/software-security/proper-use-of-javas-securerandom.html

In short it initializes the seed from /dev/urandom what doesn't block if the OS experiencing lack of entropy. The algorithm is based on SHA1PRNG which is considered cryptographically secure.

From what i understand the paper states that:

“Random numbers” should be generated by a cryptographically secure pseudo-random number generator (CSPRNG).

It looks like pseudo-random should be fine, as long as it is cryptographically secure.

@jonhehir jonhehir force-pushed the noisy-approx-distinct branch from e013275 to 49a4544 Compare November 28, 2023 20:17
The new SfmSketch type corresponds to the Sketch-Flip-Merge summary
(http://proceedings.mlr.press/v202/hehir23a/hehir23a.pdf).
In addition to the type, the following functions have been added
to support noisy cardinality estimation:
cardinality(SfmSketch), merge(SfmSketch),
merge_sfm(Array<SfmSketch>), noisy_approx_distinct_sfm,
noisy_approx_set_sfm, noisy_empty_approx_set_sfm

These functions are essentially equivalent to the HyperLogLog
functions such as approx_distinct, approx_set, etc., but using
the SFM sketch in lieu of HLL for extra noise.

These functions are added as part of the noisyaggregation suite.
A notable difference between the theoretical SFM sketch and this
implementation is that the aggregators (e.g., noisy_approx_set_sfm)
return NULL when aggregating empty sets. Strictly speaking, this
is a violation of the differential privacy guarantee of the
theoretical SFM sketch, though this may be patched with use of the
noisy_empty_approx_set_sfm function.
@jonhehir jonhehir force-pushed the noisy-approx-distinct branch from 49a4544 to 5892e64 Compare November 29, 2023 01:00
@jonhehir jonhehir requested a review from arhimondr November 29, 2023 01:00
@jonhehir
Copy link
Contributor Author

jonhehir commented Nov 29, 2023

Thanks once again for your thoughtful comments, Andrii! In the latest commit, we're using the same SecureRandom providers as in MathFunctions (centralized this logic into com.facebook.presto.util), and the static property in SecureRandomizationStrategy is gone. To avoid the need for memory estimation of the SecureRandom instance, we're no longer storing the randomizationStrategy property in SfmSketch itself, but passing it as a method to the two methods that depend on a source of randomness. Let me know if you have any further questions/concerns!

@arhimondr
Copy link
Member

That sounds good to me, approved. Thank you!

@arhimondr arhimondr merged commit 91e00f5 into prestodb:master Nov 29, 2023
jonhehir added a commit to jonhehir/presto that referenced this pull request Nov 30, 2023
As a follow-up to prestodb#21290,
this one-line change ensures that the remaining noisy
aggregators (noisy_count_gaussian, noisy_count_if_gaussian,
noisy_avg_gaussian, noisy_sum_gaussian) all use a SecureRandom
provider that will not block and cause performance degradation.
arhimondr pushed a commit that referenced this pull request Dec 1, 2023
As a follow-up to #21290,
this one-line change ensures that the remaining noisy
aggregators (noisy_count_gaussian, noisy_count_if_gaussian,
noisy_avg_gaussian, noisy_sum_gaussian) all use a SecureRandom
provider that will not block and cause performance degradation.
jonhehir added a commit to jonhehir/airlift that referenced this pull request Mar 14, 2024
Reverting previous commit. The corresponding code will land in
prestodb/presto instead as part of
prestodb/presto#21290.

This reverts commit 277184d.
NikhilCollooru pushed a commit to prestodb/airlift that referenced this pull request Mar 14, 2024
Reverting previous commit. The corresponding code will land in
prestodb/presto instead as part of
prestodb/presto#21290.

This reverts commit 277184d.
kaikalur pushed a commit to kaikalur/presto that referenced this pull request Mar 14, 2024
As a follow-up to prestodb#21290,
this one-line change ensures that the remaining noisy
aggregators (noisy_count_gaussian, noisy_count_if_gaussian,
noisy_avg_gaussian, noisy_sum_gaussian) all use a SecureRandom
provider that will not block and cause performance degradation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants