
Conversation

@nikhil-zlai (Contributor) commented Feb 21, 2025

Summary

The existing aggregations configure the items sketch incorrectly. Split it into two: one that works purely with skewed data, and one that tries to best-effort collect the most frequent items.

Checklist

  • [x] Added Unit Tests
  • [x] Covered by existing CI
  • [ ] Integration tested
  • [ ] Documentation update

Summary by CodeRabbit

  • New Features

    • Introduced new utility functions to streamline expression composition and cleanup.
    • Enhanced aggregation descriptions for clearer operation choices.
    • Added new aggregation types for improved data analysis.
  • Refactor

    • Revamped frequency analysis logic with improved error handling and optimized sizing.
    • Replaced legacy histogram approaches with a more robust frequent item detection mechanism.
  • Tests

    • Added tests to validate heavy hitter detection and skewed data scenarios, while removing obsolete histogram tests.
    • Updated existing tests to reflect changes in aggregation parameters.
  • Chores

    • Removed deprecated interactive modules for a leaner deployment.
  • Configuration

    • Adjusted default aggregation parameters for more consistent processing, including changes to the k value in multiple configurations.

coderabbitai bot (Contributor) commented Feb 21, 2025

Walkthrough

The changes update the core aggregation logic. In Scala, the FrequentItems class now defaults to ErrorType.NO_FALSE_NEGATIVES and recalculates its sketch size using an effective map size. The deprecated ApproxHistogram class and its tests have been removed. In addition, helper methods for type conversion were added in ColumnAggregator. The Python API now includes expanded Operation attributes with revised defaults and removes obsolete interactive files. Configuration files and Spark tests have been updated to use a lower default aggregation parameter (k from 128 to 20).
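
To make the split concrete, here is a minimal sketch of how the two modes described above might be instantiated. The constructor shape, parameter names, and imports are assumptions (not copied from the diff); the actual signature in ai.chronon.aggregator.base.FrequentItems may differ.

```scala
import ai.chronon.aggregator.base.FrequentItems        // class updated in this PR
import org.apache.datasketches.frequencies.ErrorType   // DataSketches error-type enum

// Best-effort "most frequent" mode: never drop a truly frequent item,
// at the cost of possibly reporting a few borderline ones.
// Matches the new default of ErrorType.NO_FALSE_NEGATIVES.
val mostFrequent = new FrequentItems[java.lang.Long](mapSize = 10)

// Skew / heavy-hitter mode: only report items that are provably frequent,
// accepting that borderline items may be dropped.
val heavyHitters =
  new FrequentItems[java.lang.Long](mapSize = 10, errorType = ErrorType.NO_FALSE_POSITIVES)
```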

Changes

| File(s) | Change Summary |
| --- | --- |
| aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala | Updated FrequentItems: default error type changed, sketch size calculation modified, and removed ApproxHistogram functionality. |
| aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala | Added toJLong and toJDouble methods; updated operation mapping from APPROX_HISTOGRAM_K to APPROX_FREQUENT_K. |
| aggregator/src/test/scala/ai/chronon/aggregator/test/(ApproxHistogramTest.scala, FrequentItemsTest.scala) | Removed ApproxHistogramTest; added tests for skewed data and heavy hitters in FrequentItemsTest. |
| api/py/ai/chronon/group_by.py, api/thrift/api.thrift | Added new operation attributes; changed APPROX_PERCENTILE default from 128 to 20; replaced APPROX_HISTOGRAM_K with APPROX_FREQUENT_K/APPROX_HEAVY_HITTERS_K. |
| api/py/ai/chronon/repo/(interactive.md, interactive.py) | Deleted interactive runner documentation and utility file. |
| api/py/ai/chronon/utils.py | Added new compose and clean_expression functions with detailed documentation. |
| api/py/test/sample/production/(group_bys/*.v1, joins/*, test_group_by.py) | Changed aggregation parameter k value from "128" to "20" and updated groupBy configurations and dependencies. |
| api/py/test/test_utils.py | Added tests for the new compose and clean_expression functions. |
| spark/src/test/scala/ai/chronon/spark/test/(fetcher/FetcherTest.scala, groupby/GroupByTest.scala) | Updated aggregation operations from APPROX_HISTOGRAM_K to APPROX_FREQUENT_K. |

Sequence Diagram(s)

sequenceDiagram
    participant CA as ColumnAggregator
    participant FI as FrequentItems
    participant DS as DataStructure
    CA->>FI: Instantiate with mapSize & errorType
    FI->>DS: Compute effectiveMapSize & sketchSize
    DS-->>FI: Return sketch dimensions
    FI->>CA: Return aggregation result

Possibly related PRs

Suggested reviewers

  • piyush-zlai
  • varant-zlai
  • ken-zlai

Poem

In the code, new logic starts to bloom,
Frequent items now find their room.
Debug lines sleep, insights remain,
Aggregations realigned again.
From histograms past, a brighter path is drawn! 🚀

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (1)

441-445: Remove commented debug code. Helps keep the file clean.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 428b77c and fca60e5.

📒 Files selected for processing (18)
  • aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (3 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala (0 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (2 hunks)
  • api/py/ai/chronon/group_by.py (1 hunks)
  • api/py/ai/chronon/repo/interactive.md (0 hunks)
  • api/py/ai/chronon/repo/interactive.py (0 hunks)
  • api/py/ai/chronon/utils.py (1 hunks)
  • api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_label_join.v1 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_online_join.v1 (1 hunks)
  • api/py/test/test_group_by.py (1 hunks)
  • api/py/test/test_utils.py (1 hunks)
  • api/thrift/api.thrift (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (4 hunks)
💤 Files with no reviewable changes (3)
  • api/py/ai/chronon/repo/interactive.py
  • api/py/ai/chronon/repo/interactive.md
  • aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala
⏰ Context from checks skipped due to timeout of 90000ms (16)
  • GitHub Check: streaming_tests
  • GitHub Check: streaming_tests
  • GitHub Check: join_tests
  • GitHub Check: groupby_tests
  • GitHub Check: spark_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: join_tests
  • GitHub Check: groupby_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: non_spark_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: spark_tests
  • GitHub Check: non_spark_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (26)
api/py/test/sample/production/joins/sample_team/sample_label_join.v1 (1)

163-163: Decreased aggregator capacity. Verify smaller k won't degrade coverage too much.

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2)

394-394: Default errorType changed. Confirm that NO_FALSE_NEGATIVES is correct for your use case.


398-406: Sketch size logic. The power-of-two approach looks correct.
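
A rough illustration of what that sizing could look like, as a sketch under the assumption that the items sketch purges about half of its entries when full; the exact headroom multiplier and helper names in SimpleAggregators.scala may differ.

```scala
// Smallest power of two >= x (for x >= 1).
def nextPowerOfTwo(x: Int): Int =
  if (x <= 1) 1 else Integer.highestOneBit(x - 1) << 1

// The items sketch purges roughly half of its entries when it fills up, so to
// reliably retain `mapSize` items we give it headroom and round up to a power of two.
def sketchSizeFor(mapSize: Int): Int = {
  val effectiveMapSize = 2 * mapSize // assumed headroom factor
  nextPowerOfTwo(effectiveMapSize)
}

// e.g. sketchSizeFor(10) == 32, sketchSizeFor(20) == 64
```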

api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1 (1)

61-61: Lowered k value. Ensure it still meets precision needs.

aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (3)

166-182: LGTM! Well-structured test data generation.

The createSkewedData method effectively simulates real-world data distribution with both long tail and heavy hitters.


184-195: LGTM! Good test coverage for MostFrequentK.

Test validates the exact k elements requirement and heavy hitter presence.


197-210: LGTM! Comprehensive test for HeavyHittersK.

Test ensures only heavy hitters are returned using NO_FALSE_POSITIVES error type.

api/py/test/sample/production/joins/sample_team/sample_online_join.v1 (1)

227-228: Verify impact of reduced k parameter.

Reducing k from 128 to 20 will decrease memory usage but may affect accuracy.

Please confirm this change aligns with your accuracy requirements.

api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1 (1)

163-165: LGTM! Consistent parameter update.

The k parameter change matches other configurations.

api/py/test/test_group_by.py (1)

179-179: LGTM! Test updated for new default.

Test assertion correctly reflects the new k parameter value.

api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2 (1)

228-229:

❓ Verification inconclusive

Verify the impact of reduced k value on percentile accuracy.

Reducing k from 128 to 20 will decrease memory usage but may affect percentile accuracy.

Run this script to analyze the accuracy impact:


🏁 Script executed:

#!/bin/bash
# Find all uses of operation 12 with k=128 to assess migration impact
rg -A 5 '"operation": 12.*"k": "128"'

Length of output: 39


Attention: Verify the Impact of Reduced k

While the initial script found no instances of operation 12 using k="128", this absence suggests the change is likely complete in the intended file. However, the impact on percentile accuracy remains a key concern. Please manually verify that all aggregation operations have correctly migrated to k="20" and that the reduced k value does not adversely affect accuracy.

  • Confirm that no remnants of k="128" exist in any operation 12 usage.
  • Manually test the approximate percentile calculations to ensure their accuracy meets requirements.
api/py/test/test_utils.py (2)

327-346: LGTM! Comprehensive test for compose function.

Test verifies complex nested function composition with proper formatting.


349-356: LGTM! Good test coverage for expression cleaning.

Test verifies whitespace handling in expressions.

api/thrift/api.thrift (1)

173-174: LGTM! Clear distinction between frequency and heavy hitter operations.

The split provides better semantic clarity between frequency counting and skew detection.

aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (3)

137-138: LGTM! Clean helper methods for type conversion.

Encapsulates Java type conversion logic.
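
A plausible shape for those helpers, with names and signatures assumed rather than taken from the diff: they normalize arbitrary numeric inputs into the boxed Java types the sketches consume.

```scala
// Hypothetical sketch of toJLong / toJDouble; the real methods in
// ColumnAggregator.scala may accept narrower types or handle nulls differently.
def toJLong(v: Any): java.lang.Long = v match {
  case n: Number => java.lang.Long.valueOf(n.longValue())
  case other     => throw new IllegalArgumentException(s"not a number: $other")
}

def toJDouble(v: Any): java.lang.Double = v match {
  case n: Number => java.lang.Double.valueOf(n.doubleValue())
  case other     => throw new IllegalArgumentException(s"not a number: $other")
}
```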


267-277: LGTM! Well-structured implementation of APPROX_FREQUENT_K.

Comprehensive type handling with appropriate conversions for each numeric type.


281-289: LGTM! Proper implementation of APPROX_HEAVY_HITTERS_K.

Uses NO_FALSE_POSITIVES error type for accurate skew detection.

api/py/ai/chronon/utils.py (2)

579-627: LGTM! Well-documented utility function.

The compose function is well-implemented with clear examples and proper indentation handling.
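
As a hypothetical illustration of the idea (the actual Python helper in utils.py likely differs in signature and formatting details), here is a Scala sketch of wrapping an expression in a chain of function calls with per-level indentation:

```scala
// Illustrative only: wrap `expr` in each function name in turn, indenting inner levels.
def compose(expr: String, wrappers: String*): String =
  wrappers.foldLeft(expr) { (inner, fn) =>
    val indented = inner.split("\n").map("  " + _).mkString("\n")
    s"$fn(\n$indented\n)"
  }

// compose("user_id", "approx_distinct", "to_json") produces:
// to_json(
//   approx_distinct(
//     user_id
//   )
// )
```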


629-633: LGTM! Clean utility function.

The clean_expression function is concise and handles whitespace cleanup effectively.

api/py/ai/chronon/group_by.py (4)

111-117: LGTM! Clear warning about performance impact.

The docstring properly warns about high cardinality impact.


118-123: LGTM! Clear documentation.

The docstring clearly explains the approximate nature of the operation.


124-131: LGTM! Well-documented distinction.

The docstring clearly explains how this differs from APPROX_FREQUENT_K.


144-148:

❓ Verification inconclusive

Verify the impact of reducing k from 128 to 20.

The default accuracy parameter k has been reduced significantly.


🌐 Web query:

What is the recommended value for k in approximate percentile calculations?

💡 Result:

The recommended value for error tolerance (k) in approximate percentile calculations depends on balancing speed and accuracy requirements:

  1. Default Recommendation:

    • The standard error tolerance is 0.01 (1%) [4][7]. This provides a near-exact estimate (±1% deviation from true percentiles) while maintaining computational efficiency [2][4].
  2. Large Datasets:

    • For datasets exceeding 100k rows, APPROX_PERCENTILE achieves high accuracy (±0.01 range) even with default settings [2].
    • Example: A true median of 100 would yield an approximate value between 99–101 [2].
  3. Trading Accuracy for Speed:

    • Increase tolerance to 0.05 (5%) for faster results at reduced precision [4].
    • Higher tolerances reduce computational overhead but widen error margins (e.g., ±5% deviation).
  4. Extreme Percentiles:

    • Accuracy improves naturally near distribution edges (e.g., 1st or 99th percentile) due to data sparsity [2][9].

Key Formula:

For uniform distributions spanning $[0, N]$, an approximate percentile at $p$ will lie within
$$\text{Approx}_p \in \left[\, p - k \cdot N,\; p + k \cdot N \,\right]$$
where $k$ is the error tolerance [2][4].

Summary:

  • Use k = 0.01 by default for most cases [4].
  • Adjust to k = 0.05 only if computational speed is prioritized over precision [2][4].



Action: Confirm the impact of reducing k on percentile accuracy.

  • In api/py/ai/chronon/group_by.py (Lines 144-148), the k value has been reduced from 128 to 20.
  • Because k directly affects the accuracy performance of the approximate percentile calculation, please ensure that this lower k maintains acceptable error tolerance in your tests. Note that while external guidelines suggest error tolerance values of around 0.01 (or 1%) for similar algorithms, our implementation uses an integer parameter—validate that k=20 provides the expected trade-off between accuracy and speed.
spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (1)

491-517: LGTM! Consistent operation changes.

The test has been updated to use APPROX_FREQUENT_K consistently for all three input columns.

spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (2)

322-324: LGTM! Consistent operation change.

The test has been updated to use APPROX_FREQUENT_K for txn_types.


547-550: LGTM! Consistent operation change.

The test has been updated to use APPROX_FREQUENT_K for rating.
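
For reference, a sketch of what such an aggregation definition looks like in the Scala tests. Parameter names follow the Chronon Builders test helpers and the k value is illustrative, so treat this as an assumption rather than a copy of the diff.

```scala
import ai.chronon.api.{Builders, Operation}

// Best-effort top-k of transaction types; borderline items may be included.
val txnTypeFrequent = Builders.Aggregation(
  operation = Operation.APPROX_FREQUENT_K,
  inputColumn = "txn_types",
  argMap = Map("k" -> "8")
)

// Only provably frequent ratings; suited to skew / hot-key detection.
val ratingHeavyHitters = Builders.Aggregation(
  operation = Operation.APPROX_HEAVY_HITTERS_K,
  inputColumn = "rating",
  argMap = Map("k" -> "8")
)
```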

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (4)

184-195: Add assertions for frequency values.

Consider verifying that the frequencies of heavy hitters are higher than the frequencies of other items.

 topHistogram.size() shouldBe k
 heavyHitterElems.foreach(elem => topHistogram.containsKey(elem.toString))
+// Verify frequencies
+val heavyHitterFreqs = heavyHitterElems.map(elem => topHistogram.get(elem.toString))
+val otherFreqs = topHistogram.values.asScala.filterNot(heavyHitterFreqs.contains)
+heavyHitterFreqs.foreach(freq => otherFreqs.foreach(otherFreq => freq should be > otherFreq))

197-210: Add assertions for non-heavy hitters.

Verify that non-heavy hitter elements are not included in the result.

 heavyHitterResult.size() shouldBe heavyHitterElems.size
 heavyHitterElems.foreach(elem => heavyHitterResult.containsKey(elem.toString))
+// Verify non-heavy hitters are excluded
+val nonHeavyHitters = (1 to 100).map(_.toString)
+nonHeavyHitters.foreach(elem => heavyHitterResult.containsKey(elem) shouldBe false)

184-195: Add more assertions to strengthen the test.

While the test verifies the size and presence of heavy hitters, it could be enhanced by:

  • Verifying the order of elements (most frequent first)
  • Checking the actual frequency counts
 topHistogram.size() shouldBe k
 heavyHitterElems.foreach(elem => topHistogram.containsKey(elem.toString))
+// Verify frequencies are in descending order
+val frequencies = topHistogram.asScala.values.toSeq
+frequencies.zip(frequencies.tail).foreach { case (prev, next) =>
+  prev should be >= next
+}

197-210: Add frequency threshold verification.

The test should verify that returned items exceed a minimum frequency threshold.

 heavyHitterResult.size() shouldBe heavyHitterElems.size
 heavyHitterElems.foreach(elem => heavyHitterResult.containsKey(elem.toString))
+// Verify all items exceed minimum frequency
+val minFrequency = 1000 // Each heavy hitter appears 1000 times
+heavyHitterResult.values().forEach(freq => freq should be >= minFrequency)
aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (1)

441-445: Consider using structured logging instead of comments.

Replace debug comments with proper logging statements.

-    // useful with debugger on - keep around
-    // val outputSketchSize = ir.sketch.getNumActiveItems
-    // val serializer = implicitly[FrequentItemsFriendly[T]].serializer
-    // val outputSketchBytes = ir.sketch.toByteArray(serializer).length
+    if (logger.isDebugEnabled) {
+      val outputSketchSize = ir.sketch.getNumActiveItems
+      val serializer = implicitly[FrequentItemsFriendly[T]].serializer
+      val outputSketchBytes = ir.sketch.toByteArray(serializer).length
+      logger.debug(s"Sketch stats: size=$outputSketchSize, bytes=$outputSketchBytes")
+    }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between c3e6172 and 0ee71a4.

📒 Files selected for processing (20)
  • aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (3 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala (0 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (2 hunks)
  • api/py/ai/chronon/group_by.py (1 hunks)
  • api/py/ai/chronon/repo/interactive.md (0 hunks)
  • api/py/ai/chronon/repo/interactive.py (0 hunks)
  • api/py/ai/chronon/utils.py (1 hunks)
  • api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_label_join.v1 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_online_join.v1 (1 hunks)
  • api/py/test/test_group_by.py (1 hunks)
  • api/py/test/test_utils.py (1 hunks)
  • api/thrift/api.thrift (1 hunks)
  • docs/source/authoring_features/GroupBy.md (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (5 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (1 hunks)
💤 Files with no reviewable changes (3)
  • api/py/ai/chronon/repo/interactive.md
  • api/py/ai/chronon/repo/interactive.py
  • aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala
🚧 Files skipped from review as they are similar to previous changes (10)
  • api/py/test/test_group_by.py
  • api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1
  • api/py/ai/chronon/utils.py
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala
  • api/py/test/test_utils.py
  • api/py/test/sample/production/joins/sample_team/sample_label_join.v1
  • api/py/test/sample/production/joins/sample_team/sample_online_join.v1
  • api/thrift/api.thrift
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala
  • api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: streaming_tests
  • GitHub Check: spark_tests
  • GitHub Check: join_tests
  • GitHub Check: groupby_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: non_spark_tests
  • GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (19)
aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (4)

12-14: LGTM: Required imports added for new test cases.


166-182: LGTM: Well-structured test data generation.

The createSkewedData helper creates a good mix of frequent (4 elements × 1000) and infrequent (100 elements × 100) items.


12-14: LGTM! Required imports added.

The imports are necessary for the new test cases.


166-182: Well-structured test data generator.

The createSkewedData method effectively creates a dataset with both long tail and heavy hitter elements, perfect for testing frequency-based aggregations.

api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2 (2)

228-229: LGTM: Reduced k parameter for better efficiency.

The reduction from 128 to 20 aligns with the PR's goal to fix aggregations.


228-229:

❓ Verification inconclusive

Verify impact of reduced k value.

Reducing k from 128 to 20 significantly lowers the number of items tracked. Ensure this doesn't affect accuracy requirements.


🏁 Script executed:

#!/bin/bash
# Search for other k values in configuration files
rg -g '*.v[0-9]' '"k":\s*"(128|20)"'

Length of output: 523


Reduced k value impact

All affected config files now use "k": "20" (e.g., sample_join_bootstrap.v2, group_by_with_kwargs.v1, sample_label_join.v1, sample_online_join.v1, sample_label_join_with_agg.v1). This significant drop from 128 may reduce tracking precision. Please confirm that accuracy requirements remain met.

spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (2)

147-148:

❓ Verification inconclusive

Investigate root cause before relaxing threshold.

Increasing the null ratio threshold from 0.2 to 0.3 appears to be a workaround. The TODO comment suggests this needs investigation.


🏁 Script executed:

#!/bin/bash
# Search for related test failures or issues
rg -A 5 "summary.*null.*ratio|null.*ratio.*threshold" 

Length of output: 55


Investigate underlying issue before settling for a relaxed threshold.

  • The threshold change from 0.2 to 0.3 appears to be a temporary workaround.
  • Please manually verify the impact on test behavior and address the root cause rather than patching.

147-148: Document rationale for threshold change.

The null ratio threshold has been relaxed from 0.2 to 0.3. The TODO comment suggests unexpected behavior.

Please clarify:

  1. Why is the test "acting up"?
  2. Is this a temporary workaround?
  3. What's the plan to investigate and fix the root cause?
aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (6)

137-138: LGTM: Improved type conversion helpers.

Extracted common conversion logic into helper methods.


141-159: LGTM: Consistent type casting.

Using the new helper methods makes the code more uniform.


267-289: LGTM: Improved frequency estimation.

  • Split into APPROX_FREQUENT_K and APPROX_HEAVY_HITTERS_K for better accuracy
  • Added ErrorType.NO_FALSE_POSITIVES for heavy hitters

137-138: LGTM! Type conversion helpers added.

The new helper methods improve type safety and readability.


267-278: LGTM! New aggregation operation added.

The APPROX_FREQUENT_K operation is well-implemented with appropriate type handling.


279-290: LGTM! Heavy hitters implementation updated.

The implementation now uses FrequentItems with NO_FALSE_POSITIVES error type.

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2)

394-394: LGTM: Improved accuracy by preventing false negatives.

The change to NO_FALSE_NEGATIVES ensures we don't miss any frequent items.


398-406: LGTM: Better sketch size calculation.

The new calculation using effectiveMapSize properly accounts for internal map purge behavior.

api/py/ai/chronon/group_by.py (2)

111-131: LGTM: Added new aggregation operations with clear documentation.

The new operations FREQUENT_K, APPROX_FREQUENT_K, and APPROX_HEAVY_HITTERS_K are well documented with their differences clearly explained.


144-146:

❓ Verification inconclusive

Verify the impact of reducing default k value.

The change from k=128 to k=20 for APPROX_PERCENTILE might affect accuracy.


🌐 Web query:

What is the recommended k value for KLL sketch in Apache DataSketches for percentile approximation?

💡 Result:

The recommended k value for KLL sketches in Apache DataSketches depends on balancing accuracy requirements with storage constraints:

Default Recommendation

  • Apache DataSketches: Defaults to k=200 [1][5][9], providing an epsilon (error bound) of ~1.33% for quantile estimates and 1.65% for PMF/CDF calculations[1].

Key Considerations

| Factor | Impact |
| --- | --- |
| Accuracy | Higher k reduces error bounds (e.g., k=128 gives ~1.725% rank error [4], k=200 ~1.33% [1]) |
| Storage | Sketch size grows linearly with k (e.g., k=200 uses ~432 bytes vs k=128 ~276 bytes) [1][4] |

Implementation-Specific Notes

  • Cloudera Impala/Druid: Supports k=8 to k=65535, defaulting to 200[1][9]
  • Imply Polaris: Uses power-of-2 k values (2–32768), defaulting to 128[4]
  • Microsoft SQL Server: Uses KLL internally but doesn’t expose k tuning[10]

For most use cases, k=200 strikes a practical balance between accuracy and resource usage[1][5]. Increase k when:

  1. You require sub-1% error bounds
  2. Your data distribution has heavy tails/extreme outliers
  3. Storage costs are not a primary concern

Decrease k when prioritizing sketch compactness over precision[4][9].



Action: Confirm k Value Impact on Accuracy

The current k value of 20 is significantly lower than both our previous default (128) and the Apache DataSketches-recommended value (≈200). This reduction could notably degrade the accuracy of percentile approximations. Please verify that the lower precision is acceptable for our use case and add documentation or tests to justify the trade-off.

  • File: api/py/ai/chronon/group_by.py, Lines 144–146
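
As a rough, back-of-envelope check (not from this PR): DataSketches reports KLL single-sided rank error scaling approximately as $\varepsilon(k) \approx c / k^{0.9}$. Calibrating $c$ against the figure quoted above of about 1.33% at $k = 200$ gives $c \approx 1.57$, so

$$\varepsilon(128) \approx \frac{1.57}{128^{0.9}} \approx 2.0\%, \qquad \varepsilon(20) \approx \frac{1.57}{20^{0.9}} \approx 10.6\%$$

If the Chronon k maps directly onto the KLL parameter, dropping from 128 to 20 trades roughly a 5x larger rank-error bound for a much smaller sketch, which is consistent with the memory-versus-accuracy concern raised above.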
docs/source/authoring_features/GroupBy.md (1)

136-151: LGTM: Documentation accurately reflects code changes.

The table of properties correctly documents the new aggregation types and their characteristics.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (1)

441-445: Consider removing commented debug code.

These debugging statements should be removed or moved to a logging framework.

-    // useful with debugger on - keep around
-    // val outputSketchSize = ir.sketch.getNumActiveItems
-    // val serializer = implicitly[FrequentItemsFriendly[T]].serializer
-    // val outputSketchBytes = ir.sketch.toByteArray(serializer).length
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 0ee71a4 and 1558ca8.

📒 Files selected for processing (20)
  • aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2 hunks)
  • aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (3 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala (0 hunks)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (2 hunks)
  • api/py/ai/chronon/group_by.py (1 hunks)
  • api/py/ai/chronon/repo/interactive.md (0 hunks)
  • api/py/ai/chronon/repo/interactive.py (0 hunks)
  • api/py/ai/chronon/utils.py (1 hunks)
  • api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_label_join.v1 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1 (1 hunks)
  • api/py/test/sample/production/joins/sample_team/sample_online_join.v1 (1 hunks)
  • api/py/test/test_group_by.py (1 hunks)
  • api/py/test/test_utils.py (1 hunks)
  • api/thrift/api.thrift (1 hunks)
  • docs/source/authoring_features/GroupBy.md (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (5 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (1 hunks)
💤 Files with no reviewable changes (3)
  • api/py/ai/chronon/repo/interactive.py
  • api/py/ai/chronon/repo/interactive.md
  • aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala
🚧 Files skipped from review as they are similar to previous changes (11)
  • spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala
  • api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1
  • api/py/test/sample/production/joins/sample_team/sample_online_join.v1
  • api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2
  • api/py/test/sample/production/joins/sample_team/sample_label_join.v1
  • api/py/test/test_group_by.py
  • api/py/test/test_utils.py
  • api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1
  • api/thrift/api.thrift
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala
⏰ Context from checks skipped due to timeout of 90000ms (16)
  • GitHub Check: streaming_tests
  • GitHub Check: join_tests
  • GitHub Check: streaming_tests
  • GitHub Check: groupby_tests
  • GitHub Check: groupby_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: non_spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: spark_tests
  • GitHub Check: spark_tests
  • GitHub Check: non_spark_tests
  • GitHub Check: join_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (15)
api/py/ai/chronon/utils.py (2)

579-626: Well-implemented utility function for method composition.

The compose function elegantly handles nested method calls with proper indentation and argument handling.


629-633: Clean utility for expression formatting.

Simple and effective implementation for removing excess whitespace.

docs/source/authoring_features/GroupBy.md (1)

136-150: Documentation properly updated to reflect code changes.

Table has been reformatted for better clarity and now accurately describes the new aggregation types (approx_frequent_k and approx_heavy_hitter_k) that replace the deprecated approx_histogram_k.

aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (4)

137-139: Good helper methods for type conversion.

Encapsulating the conversion logic improves readability and maintainability.


140-160: Refactored type casting using helper methods.

Now using the helper methods consistently throughout type casting operations.


267-277: Replaced APPROX_HISTOGRAM_K with APPROX_FREQUENT_K implementation.

Now uses FrequentItems with default error type and handles all data types properly.


278-290: Updated APPROX_HEAVY_HITTERS_K implementation.

Now uses FrequentItems with ErrorType.NO_FALSE_POSITIVES for optimized heavy hitter detection.

aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (3)

166-182: Good test data generation utility.

Creates a realistic skewed distribution with both long tail and heavy hitter elements.


184-195: Effective test for MostFrequentK behavior.

Ensures the implementation returns exactly k items with heavy hitters included.


197-210: Thorough test for HeavyHittersK behavior.

Verifies the implementation correctly identifies only true heavy hitters.

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2)

394-394: Changed default error type to improve frequency analysis.

Default error type now favors including all frequent items (with potential false positives) rather than missing any.


398-406: Improved sketch size calculation to maintain effective capacity.

The new calculation properly accounts for internal map purge behavior, ensuring mapSize is properly maintained.

api/py/ai/chronon/group_by.py (3)

62-131: Improved documentation and semantics of operations.

Added clear docstrings and replaced vague histogram operations with more specific frequent item operations that better distinguish between exact calculation, approximation, and heavy hitter detection.


147-147: Improved docstring for APPROX_PERCENTILE.

Added documentation that explains the configurable accuracy parameter.


145-145:

✅ Verification successful

Reduced default aggregation parameter from 128 to 20.

The k parameter for APPROX_PERCENTILE now defaults to 20 instead of 128, which reduces memory usage but may impact accuracy.


🏁 Script executed:

#!/bin/bash
# Check if this parameter change is reflected in tests
rg -A 3 "APPROX_PERCENTILE" --glob "api/py/test/*.py"

Length of output: 804


Aggregation default k parameter change verified

Tests in api/py/test/test_group_by.py confirm that k=20 is correctly set for APPROX_PERCENTILE, matching the intended change. No further action is required.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (2)

166-182: Consider documenting distribution characteristics.

The skewed data method is well-implemented but dropping 1000 random elements could affect test reliability.

  private def createSkewedData(): Array[Long] = {
    // 10k elements - each repeating 100 times
    val longTail = (1 to 100).flatMap(_ => 1 to 100)

    // 4 elements - each repeating 1000 times
    val heavyHitters = (1 to 1000).flatMap(_ => heavyHitterElems)

-    // all of them together and shuffled
+    // Combine and shuffle all data, then drop 1000 random items
+    // This creates a distribution with ~10% heavy hitters (4000 out of ~13000 elements)
    Random
      .shuffle(longTail ++ heavyHitters)
      .iterator
      .map(_.toLong)
      .drop(1000) // delete a few random items to produce noise
      .toArray
  }

184-195: Test verifies size but not frequencies.

Test checks for expected elements but could verify frequency values too.

  "MostFrequentK" should "always produce nearly k elements when cardinality is > k" in {
    val k = 10
    val topFrequentItems = new FrequentItems[java.lang.Long](k)
    val frequentItemsIr = topFrequentItems.prepare(0)

    createSkewedData().foreach(i => topFrequentItems.update(frequentItemsIr, i))

    val topHistogram = topFrequentItems.finalize(frequentItemsIr)

    math.abs(topHistogram.size() - k) <= 2 shouldBe true
    heavyHitterElems.foreach(elem => topHistogram.containsKey(elem.toString))
+   // Verify frequencies of heavy hitters are high
+   heavyHitterElems.foreach(elem => 
+     topHistogram.get(elem.toString) should be >= 800L)
  }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 1558ca8 and 9d6dc98.

📒 Files selected for processing (1)
  • aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (17)
  • GitHub Check: streaming_tests
  • GitHub Check: groupby_tests
  • GitHub Check: streaming_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: spark_tests
  • GitHub Check: join_tests
  • GitHub Check: join_tests
  • GitHub Check: groupby_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: spark_tests
  • GitHub Check: python_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: non_spark_tests
  • GitHub Check: non_spark_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (3)
aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (3)

12-15: Good imports.

The added imports support new test functionality.


83-87: Updated sketch sizes match implementation changes.

Sketch size expectations correctly reflect new internal sizing logic.


197-210: Good error type specification.

Test correctly uses NO_FALSE_POSITIVES to test heavy hitters functionality.

@nikhil-zlai nikhil-zlai merged commit 748efe0 into main Feb 25, 2025
20 checks passed
@nikhil-zlai nikhil-zlai deleted the eval_limit branch February 25, 2025 06:56
kumar-zlai pushed a commit that referenced this pull request Feb 26, 2025
## Summary
The existing aggregations configure the items sketch incorrectly. Split
it into two: one that works purely with skewed data, and one that tries
to best-effort collect the most frequent items.

## Checklist
- [x] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Introduced new utility functions to streamline expression composition
and cleanup.
  - Enhanced aggregation descriptions for clearer operation choices.
  - Added new aggregation types for improved data analysis.

- **Refactor**
- Revamped frequency analysis logic with improved error handling and
optimized sizing.
- Replaced legacy histogram approaches with a more robust frequent item
detection mechanism.

- **Tests**
- Added tests to validate heavy hitter detection and skewed data
scenarios, while removing obsolete histogram tests.
  - Updated existing tests to reflect changes in aggregation parameters.

- **Chores**
  - Removed deprecated interactive modules for a leaner deployment.

- **Configuration**
- Adjusted default aggregation parameters for more consistent
processing, including changes to the `k` value in multiple
configurations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kumar-zlai pushed a commit that referenced this pull request Apr 25, 2025
kumar-zlai pushed a commit that referenced this pull request Apr 29, 2025
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
@coderabbitai coderabbitai bot mentioned this pull request May 15, 2025
4 tasks
chewy-zlai pushed a commit that referenced this pull request May 16, 2025