fix(docs): Khyperloglog uniqueness_distribution default behavior#26957
fix(docs): Khyperloglog uniqueness_distribution default behavior#26957HeidiHan0000 merged 2 commits intoprestodb:masterfrom
Conversation
Summary:
The docs mention that a default value of 256 will be used as the default value for histogramSize if omitted. The KHyperLogLog.java class correctly reflects this behavior. However, in KHyperLogLogFunctions.java, in the one parameter version of the function uniqueness_distribution, we see that the size fo the minhash is passed in, so the default value in the class itself would never actually get used.
For instance, a query like
```
WITH test_data AS (
SELECT
KHYPERLOGLOG_AGG(join_key, uii) AS khll
FROM (
VALUES
(5, 100),
(1, 200)
) AS t(join_key, uii)
)
SELECT
UNIQUENESS_DISTRIBUTION(khll) AS dist_map
FROM test_data
```
returns
```
{
"1": 1,
"2": 0
}
```
and not
```
{
"1": 1,
"2": 0,
"3": 0,
"4": 0,
...}
```
Updating the doc to represent the actual behavior.
Alternatively, we can change the function implementation to use the default value.
Differential Revision: D90611985
Reviewer's guide (collapsed on small PRs)Reviewer's GuideUpdates KHyperLogLog documentation to align the described default behavior of UNIQUENESS_DISTRIBUTION with its actual implementation, which does not apply the 256 default histogram size when called with a single argument. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `presto-docs/src/main/sphinx/functions/khyperloglog.rst:59` </location>
<code_context>
- Returns the uniqueness histogram with the given amount of buckets. If omitted,
- the value defaults to 256. All ``uniqueness`` values greater than ``histogramSize`` are
- accumulated in the last bucket.
+ Returns the uniqueness histogram with the given amount of buckets, ``histogramSize``.
+ All ``uniqueness`` values greater than ``histogramSize`` are accumulated
+ in the last bucket.
</code_context>
<issue_to_address>
**nitpick (typo):** Consider changing "amount of buckets" to "number of buckets" for grammatical correctness.
Since "buckets" are countable, "number of buckets" is the grammatically correct phrasing here.
```suggestion
Returns the uniqueness histogram with the given number of buckets, ``histogramSize``.
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
steveburnett
left a comment
There was a problem hiding this comment.
Suggest accepting the sourcery-ai suggestion to change "the given amount of buckets" to "the given number of buckets". Nothing else, looks good in a local doc build. Thank you!
|
If you have time, consider adding with a row of triple `s above and below, to the PR description to pass the (not-required) failing CI check. |
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
steveburnett
left a comment
There was a problem hiding this comment.
LGTM! (docs)
Pull updated branch, new local doc build, looks good. Thanks!
Summary:
The docs mention that a default value of 256 will be used as the default value for histogramSize if omitted. The KHyperLogLog.java class correctly reflects this behavior. However, in KHyperLogLogFunctions.java, in the one parameter version of the function uniqueness_distribution, we see that the size fo the minhash is passed in, so the default value in the class itself would never actually get used.
For instance, a query like
returns
and not
Updating the doc to represent the actual behavior.
Alternatively, we can change the function implementation to use the default value.
Differential Revision: D90611985