Add bloom_filter_agg Spark aggregate function #4028
jinchengchenghh wants to merge 11 commits into facebookincubator:main
Conversation
From my understanding:
- The expected result is at least 36 bytes, not just "\u0004".
- You can use `return StringView(pointer_to_data, data_size);` instead of `return "\u0004";` to avoid the risk of truncation. See Enhance BloomFilter to serialize and memory track #3861 (comment).
- Please reconfirm the correctness of the expected result.
Thanks, I tried to fix it as you suggested, but it failed; see #4028 (comment).
Can you give me more suggestions?
Spark fuzzer test raises an exception, can you help fix this? @duanmeng
Can you help review this one? @mbasmanova Thanks!
mbasmanova left a comment
#4633 is fixed. Would you re-enable fuzzer test?
Curious how this function is used in Spark to reduce the amount of shuffle data. Is there something I can read about this?
Why is this change? Let's remove it. Feel free to open a separate PR with this change if necessary.
mbasmanova left a comment
@jinchengchenghh Some comments.
naming: struct members do not have trailing underscore
What is this check for? Would you add a test case where some of the input data is null?
The Spark bloom filter aggregate suite only tests the empty input. https://github.com/apache/spark/blob/branch-3.3/sql/core/src/test/scala/org/apache/spark/sql/BloomFilterAggregateQuerySuite.scala#L196
I added the test:
test("Test that bloom_filter_agg errors null") {
  spark.sql(
    """
      |SELECT bloom_filter_agg(null)"""
      .stripMargin)
}
It throws an exception:
[DATATYPE_MISMATCH.BLOOM_FILTER_WRONG_TYPE] Cannot resolve "bloom_filter_agg(NULL, 1000000, 8388608)" due to data type mismatch: Input to function `bloom_filter_agg` should have been "BINARY" followed by value with "BIGINT", but it's ["VOID", "BIGINT", "BIGINT"].; line 2 pos 7;
In Spark, its first argument is xxhash(table_col), so it won't be null.
The Velox BloomFilter accepts uint64_t while xxhash() returns int64_t, so we need to use folly to hash twice.
In Spark, its first argument is xxhash(table_col), so it won't be null.
I think it is totally possible that table_col is null for some or all rows.
SELECT bloom_filter_agg(null)
Try changing this to something like this:
SELECT bloom_filter_agg(cast(null as varbinary))
I changed the test to
test("Test that bloom_filter_agg errors null") {
  spark.sql(
    """
      |SELECT bloom_filter_agg(cast(null as binary))"""
      .stripMargin)
}
A different exception:
[DATATYPE_MISMATCH.BLOOM_FILTER_WRONG_TYPE] Cannot resolve "bloom_filter_agg(CAST(NULL AS BINARY), 1000000, 8388608)" due to data type mismatch: Input to function `bloom_filter_agg` should have been "BINARY" followed by value with "BIGINT", but it's ["BINARY", "BIGINT", "BIGINT"].; line 2 pos 7;
'Aggregate [unresolvedalias(bloom_filter_agg(cast(null as binary), 1000000, 8388608, 0, 0), None)]
+- OneRowRelation
This is a Spark internal function; it is invoked by the planner. For the case where table_col is null, it will be
bloom_filter_agg(xxhash64(null)), and xxhash64(null) is 42.
This is a Spark internal function; it is invoked by the planner. For the case where table_col is null, it will be
bloom_filter_agg(xxhash64(null)), and xxhash64(null) is 42.
Interesting. So the input to bloom_filter_agg is not a value, but a hash of the value. Let's clarify this in the documentation. What's the type of the input? Is it VARBINARY or BIGINT?
val rowCount = filterCreationSidePlan.stats.rowCount
val bloomFilterAgg = new BloomFilterAggregate(new XxHash64(Seq(filterCreationSideExp)), rowCount.get.longValue)
It is BIGINT
It would be more efficient to compute total bytes needed for the whole result and call getBufferWithSpace once.
naming: kDefaultExpe...
Use 1'000'000 for readability
Should the documentation match?
@mbasmanova Spark Runtime Filters, apache/spark#35789, https://docs.google.com/document/d/16IEuyLeQlubQkH8YuVuXWKo2-grVIoDJqQpHZrE7q04/edit#heading=h.4v65wq7vzy4q
@mbasmanova In short, a big table A joins with a small table B with a filter. It generates the bloom filter for table B after the filter, then broadcasts this bloom filter to the A side, and uses
Spark aggregate fuzzer tests passed.
I received a core dump, but I don't think it is caused by my PR.
Looks like this is tracked in #4652
mbasmanova left a comment
@jinchengchenghh Some follow up comments.
nit: this can be done once before the loop
This variable is not used. Let's remove.
It is used in the following `flatResult->setNoCopy(i, serialized);`
Any particular reason this cannot be part of accumulator->serializedSize()? It could return zero in this case.
accumulator->serializedSize() never returns 0; it is
uint32_t serializedSize() const {
return 1 /* version */
+ 4 /* number of bits */
+ bits_.size() * 8;
}
nit: perhaps, getPreAllocatedBufferSize -> getTotalSize
nit: should we add a method for this for consistency?
accumulator->initialized()
do not abbreviate: currentValue and newValue
This function is small and used only once. Consider folding this logic into the caller for readability.
Do we need both originalEstimatedNumItems_ and estimatedNumItems_ member variables? Looks like just one would be sufficient.
Yes, because the constant originalEstimatedNumItems_ is the value from the vector; it is compared with the max value to get estimatedNumItems_, which is used in the function.
Are you saying that estimatedNumItems_ can be lower than originalEstimatedNumItems_ if input value is too large?
nit: perhaps,
Creates bloom filter from values of 'x' and returns it serialized into VARBINARY.
``estimatedNumItems`` provides an estimate of the number of unique values of ``x``. Value is capped at 716,800.
``numBits`` specifies the max capacity of the bloom filter, which allows trading accuracy for memory.
Change "the number of unique values" to "the number of values"; this is Spark's intended meaning.
@jinchengchenghh I'm not sure I understand why would we want to specify the estimate of the total number of input values. Would you clarify?
new BloomFilterImpl(optimalNumOfHashFunctions(expectedNumItems, numBits), numBits);
/**
* Computes the optimal k (number of hashes per item inserted in Bloom filter), given the
* expected insertions and total number of bits in the Bloom filter.
*
* See http://en.wikipedia.org/wiki/File:Bloom_filter_fp_probability.svg for the formula.
*
* @param n expected insertions (must be positive)
* @param m total number of bits in Bloom filter (must be positive)
*/
private static int optimalNumOfHashFunctions(long n, long m) {
// (m / n) * log(2), but avoid truncation due to division!
return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}
The Spark BloomFilter implementation uses this value to compute the number of hash functions; there is an optimal value according to the theory. The Velox implementation does not use a variable number of hash functions; it uses a constant 4.
@jinchengchenghh Are you saying Velox's implementation doesn't use the estimatedNumItems argument? If so, should we remove it? Otherwise, should we document that this argument is not used and remove the logic for capping its value and initializing the estimatedNumItems_ member variable?
For the final solution, I may implement a native Spark BloomFilter in Velox; it will have better performance, and we can switch to it then.
And if we do not specify numBits, we will use estimatedNumItems to estimate numBits. The BloomFilter implementation doesn't use the estimatedNumItems argument, but it is used in bloom_filter_agg.
Got it. Let's clarify all this in the documentation. It is not obvious.
Do we also need to document the difference from Spark, or just clarify the usage in Velox?
Let's document the difference with Spark as well. By default, the assumption is that Velox functions match semantics of the original engine.
Ok, updated
from values of hashed value 'x'
It is a bit cryptic. Perhaps, rename x to hash and say something like
.. spark:function:: bloom_filter_agg(hash, estimatedNumItems, numBits) -> varbinary
Creates bloom filter from input hashes and returns it serialized into VARBINARY. The caller is expected to apply xxhash64 function to input data before calling bloom_filter_agg.
For example,
bloom_filter_agg(xxhash64(x), 1000000, 1024)
typos:
In Spark implementation, ``estimatedNumItems`` and ``numBits`` are used to decide the number of hash functions and bloom filter capacity.
In Velox implementation, ``estimatedNumItems`` is not used.
Let's remove since it is not used.
typos:
In Spark, the value of``numBits`` is automatically capped at 67,108,864.
In Velxo, the value of``numBits`` is automatically capped at 716,800.
Let's also mention that x / hash cannot be null.
First argument value
Users may not understand what this refers to. How about,
First argument of bloom_filter_agg cannot be null
However, !decodedRaw_.mayHaveNulls() is a very strong check. It may return false even if there are no nulls.
I'm a bit confused about "It may return false even if there are no nulls."
My understanding is: it returns false only when there are definitely no nulls, and it may return either false or true when nulls are present or absent, so a false result guarantees there are no nulls.
Other code follows this rule.
https://github.com/facebookincubator/velox/blob/main/velox/connectors/hive/HiveConnector.cpp#L633
Scalar function document:
https://facebookincubator.github.io/velox/develop/scalar-functions.html
bool mayHaveNulls() : constant time check on the underlying vector nullity. When it returns false, there are definitely no nulls, a true does not guarantee null existence.
Can you check this comment?
mbasmanova left a comment
@jinchengchenghh Thank you for iterating on this PR. There are a lot of tricky details to get right. Some follow-up comments.
Would you generate the docs locally and verify they get formatted nicely? It seems to me that we need some new lines or something around the example.
Sorry about the documentation; I didn't know how to check the rendered document before.
Now I know how to convert it to HTML and preview it in VS Code, so I will check the HTML formatting. Thanks for your kind review.
typo: Velxo -> Velox
is automatically capped at 716,800.
Let's update PR description to explain where this limitation comes from. CC: @xiaoxmeng
This description needs to be revised.
A version of ``bloom_filter_agg`` that uses ``numBits`` computed as ``estimatedNumItems`` * 8.
``estimatedNumItems`` provides an estimate of the number of values of <TBD: fill in; y seems wrong>.
Value of ``estimatedNumItems`` is capped at 4,000,000.
Does 4M cap come from Spark? If so, let's clarify
Value of ``estimatedNumItems`` is capped at 4,000,000 to match Spark's implementation.
This is Spark implementation https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L58
It matches Spark.
It looks like defaults are configurable in Spark. Should these be configurable in Velox as well?
val RUNTIME_BLOOM_FILTER_EXPECTED_NUM_ITEMS =
buildConf("spark.sql.optimizer.runtime.bloomFilter.expectedNumItems")
.doc("The default number of expected items for the runtime bloomfilter")
.version("3.3.0")
.longConf
.createWithDefault(1000000L)
val RUNTIME_BLOOM_FILTER_MAX_NUM_ITEMS =
buildConf("spark.sql.optimizer.runtime.bloomFilter.maxNumItems")
.doc("The max allowed number of expected items for the runtime bloom filter")
.version("3.3.0")
.longConf
.createWithDefault(4000000L)
val RUNTIME_BLOOM_FILTER_NUM_BITS =
buildConf("spark.sql.optimizer.runtime.bloomFilter.numBits")
.doc("The default number of bits to use for the runtime bloom filter")
.version("3.3.0")
.longConf
.createWithDefault(8388608L)
val RUNTIME_BLOOM_FILTER_MAX_NUM_BITS =
buildConf("spark.sql.optimizer.runtime.bloomFilter.maxNumBits")
.doc("The max number of bits to use for the runtime bloom filter")
.version("3.3.0")
.longConf
.createWithDefault(67108864L)
Currently we cannot get the configuration when we implement aggregate functions, because we cannot access the queryContext config here.
If we need this feature, we should store the config in GroupingSet when we initialize it in HashAggregation.cpp.
Currently only Spiller::Config exists in GroupingSet.
And we would need to add a new argument (config or context) to functions such as addRawInput, which would change the input arguments of all aggregate functions.
Or we could store the config in Aggregate, but I don't suggest this way; it changes less code, but Aggregate is created from the FunctionRegistry factory, so we cannot use a constructor to create it with the config.
If you think it is needed, I can implement it in another PR.
Thank you for pointing this out. Let's create a GitHub issue to explain this use case and discuss how best to implement it. For this PR, let's just mention in the documentation that Spark allows for changing the defaults, but Velox does not.
Same.
A version of ``bloom_filter_agg`` that uses 8,000,000 as ``numBits``.
Would you confirm that this matches Spark's implementation?
This works, but is quite a bit of code and easy to get wrong. For example, by forgetting to call buffer->setSize or forgetting to add buffer_.size() when initializing bufferPtr.
Consider introducing new method:
char* rawBuffer = flatResult->getRawStringBufferWithSpace(totalSize);
This method would return the pointer to the first "writable" byte and update the size of the 'buffer' to include totalSize.
Please, check this comment.
Would this happen if we run a masked aggregation and all rows for a given group are masked out? Would you add a test case to verify this code path?
Can you explain a bit more about BloomFilterAggAggregateTest.emptyInput?
The current tests cover this path, and the Gluten unit tests run into this path too.
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from BloomFilterAggAggregateTest
[ RUN ] BloomFilterAggAggregateTest.basic
not init
[ OK ] BloomFilterAggAggregateTest.basic (40 ms)
[ RUN ] BloomFilterAggAggregateTest.bloomFilterAggArgument
not init
not init
[ OK ] BloomFilterAggAggregateTest.bloomFilterAggArgument (160 ms)
[ RUN ] BloomFilterAggAggregateTest.emptyInput
not init
not init
not init
not init
not init
not init
not init
not init
not init
not init
[ OK ] BloomFilterAggAggregateTest.emptyInput (30 ms)
do not abbreviate: vec -> vector or decoded
This failure happens again, can you help check it? @mbasmanova
Do you have further comments? @mbasmanova
@jinchengchenghh The CI is red. Would you rebase the PR and make sure CI is green?
The CI passed. @mbasmanova
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
velox/vector/FlatVector.h
linter pointed out that this semi-colon is not needed; let's remove
I'm also seeing that this method is lacking documentation and tests. Would you submit a separate PR to introduce this method, document it clearly and add a test?
Fixed the linter warning; can it be imported? @mbasmanova
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mbasmanova merged this pull request in 86137eb.
Conbench analyzed the 1 benchmark run on commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.



This function is used in Spark Runtime Filters: apache/spark#35789
https://docs.google.com/document/d/16IEuyLeQlubQkH8YuVuXWKo2-grVIoDJqQpHZrE7q04/edit#heading=h.4v65wq7vzy4q
The BloomFilter implementation in Velox is different from Spark's; hence, the serialized BloomFilter is different.
Velox has a memory limit for contiguous memory buffers; hence, the BloomFilter capacity is lower than in Spark when numBits is large. See #4713 (comment)
Spark allows for changing the defaults while Velox does not.
See also #3342
Fixes #3694