Add noisy_approx_distinct_sfm and related functions #21290
arhimondr merged 1 commit into prestodb:master
cc @duykienvp
Reverting previous commit. The corresponding code will land in prestodb/presto instead as part of prestodb/presto#21290. This reverts commit 277184d.
19276ce to 053ddc0
Updated the PR to include the changes from prestodb/airlift#65 (which are being reverted in prestodb/airlift#66) and to rebase onto a fresh copy of
cc @gcormode
gcormode left a comment
Everything looks good to me in this update.
I'm wondering if Bitmaps are useful in other parts of the code (in which case the class perhaps belongs in a different location), but I guess for now we can leave them here.
mlyublena left a comment
Looks good from Presto's POV
From what I remember it tended to be slow (since the entropy is obtained from the OS). Have you considered Random with a secure seed? Does it have to provide cryptographic security?
Per https://arxiv.org/pdf/2002.04049.pdf, we should use a cryptographically secure PRNG. I was under the impression that SecureRandom would essentially use a secure seed (potentially blocking to set the seed), then deterministically generate from a PRNG, though on closer inspection, that appears to depend on the exact implementation being used. Since I'm not an expert here, I'm open to suggestions on how best to proceed here to balance security and performance.
NativePRNGNonBlocking is used for secure_random UDF in Presto: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/MathFunctions.java#L102
Here's a description on how it operates: https://www.synopsys.com/blogs/software-security/proper-use-of-javas-securerandom.html
In short, it initializes the seed from /dev/urandom, which doesn't block if the OS is experiencing a lack of entropy. The algorithm is based on SHA1PRNG, which is considered cryptographically secure.
From what I understand, the paper states that:
“Random numbers” should be generated by a cryptographically secure pseudo-random number generator (CSPRNG).
It looks like pseudo-random should be fine, as long as it is cryptographically secure.
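For illustration, a minimal sketch of requesting the non-blocking provider discussed above. The class name `NonBlockingSecureRandomExample` and the fallback to the platform default are assumptions made for this example, not code from the PR; `NativePRNGNonBlocking` is typically unavailable on Windows, which is why the fallback is included.

```java
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Hypothetical helper, not part of the PR.
final class NonBlockingSecureRandomExample
{
    private NonBlockingSecureRandomExample() {}

    // Ask for the non-blocking native PRNG; fall back to the platform
    // default SecureRandom when that algorithm is not installed.
    static SecureRandom create()
    {
        try {
            return SecureRandom.getInstance("NativePRNGNonBlocking");
        }
        catch (NoSuchAlgorithmException e) {
            return new SecureRandom();
        }
    }

    public static void main(String[] args)
    {
        SecureRandom random = create();
        // Prints the algorithm that was actually selected and one sample in [0, 1).
        System.out.println(random.getAlgorithm() + " -> " + random.nextDouble());
    }
}
```

Whether a silent fallback is acceptable depends on the deployment; failing fast instead would make a missing provider visible at startup.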
Consider adding serialize(DynamicSliceOutput) for efficiency (to avoid extra copies)
Do we need to account for an instance of Random? (may not be necessary if ThreadLocalRandom is used)
Does it need to be accounted for if it's static? I believe in a previous code review, the suggestion was to make it static to avoid having to figure out the memory usage of a Random object.
Hmm, actually I'm not sure if it's safe to have it static. It is not thread safe.
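One possible way to sidestep both the sharing question and per-instance memory accounting is a per-thread generator held in a static `ThreadLocal`. This is only a sketch: `PerThreadRandomExample` and `current()` are hypothetical names, and this is not necessarily what the PR ended up doing. For context, the JDK documentation describes `java.util.Random` instances as thread-safe while noting that concurrent use of a single instance can cause contention.

```java
import java.security.SecureRandom;

// Hypothetical holder, not part of the PR.
final class PerThreadRandomExample
{
    // Each thread lazily creates its own generator on first use,
    // so no generator instance is shared between threads.
    private static final ThreadLocal<SecureRandom> RANDOM =
            ThreadLocal.withInitial(SecureRandom::new);

    private PerThreadRandomExample() {}

    static SecureRandom current()
    {
        return RANDOM.get();
    }

    public static void main(String[] args)
            throws InterruptedException
    {
        SecureRandom mine = current();
        Thread worker = new Thread(() ->
                System.out.println("same instance on worker thread: " + (current() == mine)));
        worker.start();
        worker.join();
        System.out.println("same instance on main thread: " + (current() == mine));
    }
}
```

Because the holder is static, the sketch or state object itself carries no reference to a generator, so there is nothing extra to count in its retained size.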
I wonder if this method has to be exposed? Or should updating memory accounting be the responsibility of the setSketch method?
Good question! This was done to mimic the design in HyperLogLog-related classes. I like the idea of deferring the responsibility to setSketch(), though I'd need to also account for memory when merging a sketch in state.
input.readBytes can be more efficient
Have you considered https://github.com/RoaringBitmap/RoaringBitmap?
I hadn't! Thanks for the link. In this case, I'm not sure a compressed bitmap will offer much value here, since in the end we scatter random bits throughout the bitmap. It'd be worth confirming in the future, but for now, I think we can just keep the uncompressed bitmap.
That said, the Bitmap class should probably extend Java's BitSet.
As mentioned above, the Bitmap class now wraps BitSet.
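To make the design under discussion concrete, here is a rough sketch of a fixed-length bitmap that wraps `java.util.BitSet` and flips each bit independently with a given probability, which is the "scatter random bits" step mentioned above. The class `BitSetBackedBitmap` and its method names are hypothetical; the actual Bitmap class in the PR is not shown here and may differ.

```java
import java.security.SecureRandom;
import java.util.BitSet;
import java.util.Random;

// Hypothetical illustration, not the PR's Bitmap class.
final class BitSetBackedBitmap
{
    private final BitSet bits;
    private final int length;

    BitSetBackedBitmap(int length)
    {
        this.bits = new BitSet(length);
        this.length = length;
    }

    boolean get(int index)
    {
        return bits.get(index);
    }

    void set(int index, boolean value)
    {
        bits.set(index, value);
    }

    int cardinality()
    {
        return bits.cardinality();
    }

    // Randomized response: flip every bit independently with the given probability.
    void flipAll(double probability, Random random)
    {
        if (probability < 0 || probability > 1) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        for (int i = 0; i < length; i++) {
            if (random.nextDouble() < probability) {
                bits.flip(i);
            }
        }
    }

    public static void main(String[] args)
    {
        BitSetBackedBitmap bitmap = new BitSetBackedBitmap(4096);
        bitmap.set(7, true);
        bitmap.flipAll(0.25, new SecureRandom());
        System.out.println("bits set after flipping: " + bitmap.cardinality());
    }
}
```

Since roughly a `probability` fraction of positions changes at random, the set bits end up spread across the whole range, which is consistent with the doubt expressed above about how much a compressed bitmap would help.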
How difficult would it be to refactor the tests and make these two fields final?
Making randomizedResponseProbability final would require returning copies rather than mutating in enablePrivacy and mergeWith. mergeWith in particular was designed to mutate to match the functionality in HyperLogLog, QuantileDigest, etc.
bitmap seems like it should be final, but I'll take a closer look.
bitmap is now final.
053ddc0 to e013275
@arhimondr, I pushed changes to address most of these concerns. I believe the outstanding questions surround the use (and memory accounting) of
nit: sizeOfLongArray(bitSet.size() / Long.SIZE) to account for the long[] object headers.
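As a rough illustration of what the suggested expression measures: `BitSet.size()` reports the number of bits of space in use by the backing `long[]`, so dividing by `Long.SIZE` gives the array length, and the array's footprint is its header plus 8 bytes per element. The sketch below hard-codes a 16-byte header, which is an assumption (typical for 64-bit HotSpot, but JVM-dependent); `BitSetMemoryEstimate` is a hypothetical name, and real code would rely on the `sizeOfLongArray` helper named in the comment instead.

```java
import java.util.BitSet;

// Hypothetical illustration of the arithmetic behind the nit above.
final class BitSetMemoryEstimate
{
    // Assumption: 16-byte array header; the real value depends on the JVM.
    private static final long ASSUMED_LONG_ARRAY_HEADER_BYTES = 16;

    private BitSetMemoryEstimate() {}

    static long estimatedBackingArrayBytes(BitSet bitSet)
    {
        // size() is the capacity in bits, always a multiple of 64.
        int words = bitSet.size() / Long.SIZE;
        return ASSUMED_LONG_ARRAY_HEADER_BYTES + (long) words * Long.BYTES;
    }

    public static void main(String[] args)
    {
        BitSet bitSet = new BitSet(4096);
        System.out.println("estimated bytes: " + estimatedBackingArrayBytes(bitSet));
    }
}
```

Counting only `size() / 8` bytes would omit the header, which is the gap the nit points out.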
e013275 to 49a4544
The new SfmSketch type corresponds to the Sketch-Flip-Merge summary (http://proceedings.mlr.press/v202/hehir23a/hehir23a.pdf). In addition to the type, the following functions have been added to support noisy cardinality estimation: cardinality(SfmSketch), merge(SfmSketch), merge_sfm(Array<SfmSketch>), noisy_approx_distinct_sfm, noisy_approx_set_sfm, noisy_empty_approx_set_sfm.
These functions are essentially equivalent to the HyperLogLog functions such as approx_distinct, approx_set, etc., but using the SFM sketch in lieu of HLL for extra noise.
These functions are added as part of the noisyaggregation suite. A notable difference between the theoretical SFM sketch and this implementation is that the aggregators (e.g., noisy_approx_set_sfm) return NULL when aggregating empty sets. Strictly speaking, this is a violation of the differential privacy guarantee of the theoretical SFM sketch, though this may be patched with use of the noisy_empty_approx_set_sfm function.
49a4544 to 5892e64
Thanks once again for your thoughtful comments, Andrii! In the latest commit, we're using the same
That sounds good to me, approved. Thank you!
As a follow-up to prestodb#21290, this one-line change ensures that the remaining noisy aggregators (noisy_count_gaussian, noisy_count_if_gaussian, noisy_avg_gaussian, noisy_sum_gaussian) all use a SecureRandom provider that will not block and cause performance degradation.
Description
The new SfmSketch type corresponds to the "Sketch-Flip-Merge" data structure of http://proceedings.mlr.press/v202/hehir23a/hehir23a.pdf. In addition to the type, the following functions have been added to support noisy cardinality estimation: cardinality(SfmSketch), merge(SfmSketch), merge_sfm(Array<SfmSketch>), noisy_approx_distinct_sfm, noisy_approx_set_sfm, noisy_empty_approx_set_sfm.
These functions are essentially equivalent to the HyperLogLog functions such as approx_distinct, approx_set, etc., but using the SFM sketch in lieu of HLL for extra noise.
Documentation and release notes will be forthcoming in a separate PR.
Motivation and Context
These functions are added as part of the noisyaggregation suite. A notable difference between the theoretical SFM sketch and this implementation is that the aggregators (e.g., noisy_approx_set_sfm) return NULL when aggregating empty sets. Strictly speaking, this is a violation of the differential privacy guarantee of the theoretical SFM sketch, though this may be patched with use of the noisy_empty_approx_set_sfm function.
Impact
This does not affect any existing functionality.
Test Plan
Unit tests have been added for the SfmSketch structure itself and the corresponding Presto type and functions.
Contributor checklist
Release Notes