fix: Fix approx_histogram_k aggregations #411
Conversation
Walkthrough

The changes update the core aggregation logic in Scala.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CA as ColumnAggregator
    participant FI as FrequentItems
    participant DS as DataStructure
    CA->>FI: Instantiate with mapSize & errorType
    FI->>DS: Compute effectiveMapSize & sketchSize
    DS-->>FI: Return sketch dimensions
    FI->>CA: Return aggregation result
```
Actionable comments posted: 0
🧹 Nitpick comments (1)
aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (1)
441-445: Remove commented debug code. Helps keep the file clean.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (18)
- aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2 hunks)
- aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (3 hunks)
- aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala (0 hunks)
- aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (2 hunks)
- api/py/ai/chronon/group_by.py (1 hunks)
- api/py/ai/chronon/repo/interactive.md (0 hunks)
- api/py/ai/chronon/repo/interactive.py (0 hunks)
- api/py/ai/chronon/utils.py (1 hunks)
- api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_label_join.v1 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_online_join.v1 (1 hunks)
- api/py/test/test_group_by.py (1 hunks)
- api/py/test/test_utils.py (1 hunks)
- api/thrift/api.thrift (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (4 hunks)
💤 Files with no reviewable changes (3)
- api/py/ai/chronon/repo/interactive.py
- api/py/ai/chronon/repo/interactive.md
- aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala
⏰ Context from checks skipped due to timeout of 90000ms (16)
- GitHub Check: streaming_tests
- GitHub Check: streaming_tests
- GitHub Check: join_tests
- GitHub Check: groupby_tests
- GitHub Check: spark_tests
- GitHub Check: analyzer_tests
- GitHub Check: join_tests
- GitHub Check: groupby_tests
- GitHub Check: fetcher_tests
- GitHub Check: fetcher_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: non_spark_tests
- GitHub Check: analyzer_tests
- GitHub Check: spark_tests
- GitHub Check: non_spark_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (26)
api/py/test/sample/production/joins/sample_team/sample_label_join.v1 (1)
163-163: Decreased aggregator capacity. Verify smaller k won't degrade coverage too much.

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2)
394-394: Default errorType changed. Confirm that `NO_FALSE_NEGATIVES` is correct for your use case.
398-406: Sketch size logic. The power-of-two approach looks correct.
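For readers following along, here is a minimal sketch of the sizing idea under review. The helper names are illustrative assumptions, not the actual code in SimpleAggregators.scala: the DataSketches items sketch requires a power-of-two map size and purges roughly half of its entries when the map fills, so reliably holding `mapSize` items takes a map of about twice that, rounded up.

```scala
// Illustrative sketch only - assumed names, not the project's exact code.
// The items sketch evicts ~50% of entries on purge, so to reliably hold
// `mapSize` items we double it, then round up to the power of two the
// sketch's constructor requires.
def nextPowerOfTwo(n: Int): Int =
  if (n <= 1) 1 else Integer.highestOneBit(n - 1) << 1

def sketchSize(mapSize: Int): Int = {
  val effectiveMapSize = 2 * mapSize
  nextPowerOfTwo(effectiveMapSize)
}

// e.g. sketchSize(20) == 64: doubled to 40, then rounded up to 64.
```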
api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1 (1)

61-61: Lowered k value. Ensure it still meets precision needs.
166-182: LGTM! Well-structured test data generation. The `createSkewedData` method effectively simulates real-world data distribution with both long tail and heavy hitters.

184-195: LGTM! Good test coverage for MostFrequentK. Test validates the exact k elements requirement and heavy hitter presence.

197-210: LGTM! Comprehensive test for HeavyHittersK. Test ensures only heavy hitters are returned using NO_FALSE_POSITIVES error type.
api/py/test/sample/production/joins/sample_team/sample_online_join.v1 (1)
227-228: Verify impact of reduced k parameter. Reducing k from 128 to 20 will decrease memory usage but may affect accuracy.
Please confirm this change aligns with your accuracy requirements.
api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1 (1)
163-165: LGTM! Consistent parameter update. The k parameter change matches other configurations.
api/py/test/test_group_by.py (1)
179-179: LGTM! Test updated for new default. Test assertion correctly reflects the new k parameter value.
api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2 (1)
228-229: ❓ Verification inconclusive
Verify the impact of reduced k value on percentile accuracy.
Reducing k from 128 to 20 will decrease memory usage but may affect percentile accuracy.
Run this script to analyze the accuracy impact:
🏁 Script executed:
```bash
#!/bin/bash
# Find all uses of operation 12 with k=128 to assess migration impact
rg -A 5 '"operation": 12.*"k": "128"'
```

Length of output: 39
Attention: Verify the Impact of Reduced k
While the initial script found no instances of operation 12 using k="128", this absence suggests the change is likely complete in the intended file. However, the impact on percentile accuracy remains a key concern. Please manually verify that all aggregation operations have correctly migrated to k="20" and that the reduced k value does not adversely affect accuracy.
- Confirm that no remnants of k="128" exist in any operation 12 usage.
- Manually test the approximate percentile calculations to ensure their accuracy meets requirements.
api/py/test/test_utils.py (2)
327-346: LGTM! Comprehensive test for compose function. Test verifies complex nested function composition with proper formatting.

349-356: LGTM! Good test coverage for expression cleaning. Test verifies whitespace handling in expressions.
api/thrift/api.thrift (1)
173-174: LGTM! Clear distinction between frequency and heavy hitter operations. The split provides better semantic clarity between frequency counting and skew detection.
aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (3)
137-138: LGTM! Clean helper methods for type conversion. Encapsulates Java type conversion logic.

267-277: LGTM! Well-structured implementation of APPROX_FREQUENT_K. Comprehensive type handling with appropriate conversions for each numeric type.

281-289: LGTM! Proper implementation of APPROX_HEAVY_HITTERS_K. Uses NO_FALSE_POSITIVES error type for accurate skew detection.
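As a rough standalone illustration of the two error modes this split relies on, here is a minimal Scala sketch assuming the Apache DataSketches `org.apache.datasketches.frequencies` API; the data and the 64-entry map size are made up, and this is not the project's actual ColumnAggregator wiring:

```scala
import org.apache.datasketches.frequencies.{ErrorType, ItemsSketch}

// Build a small sketch over skewed data: one heavy hitter plus a long tail.
val sketch = new ItemsSketch[String](64) // map size must be a power of two
(1 to 1000).foreach(_ => sketch.update("heavy"))
(1 to 100).foreach(i => sketch.update(s"tail-$i"))

// NO_FALSE_NEGATIVES: returns every truly frequent item, possibly with a few
// infrequent extras - the best-effort mode behind APPROX_FREQUENT_K.
val frequent = sketch.getFrequentItems(ErrorType.NO_FALSE_NEGATIVES)

// NO_FALSE_POSITIVES: returns only items guaranteed to be frequent, possibly
// missing borderline ones - the skew-detection mode behind APPROX_HEAVY_HITTERS_K.
val heavyHitters = sketch.getFrequentItems(ErrorType.NO_FALSE_POSITIVES)

frequent.foreach(row => println(s"${row.getItem}: ~${row.getEstimate}"))
```

The same sketch state supports both queries; only the guarantee requested at read time differs, which is presumably why one operation could be split into two without duplicating aggregation state.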
api/py/ai/chronon/utils.py (2)
579-627: LGTM! Well-documented utility function. The `compose` function is well-implemented with clear examples and proper indentation handling.

629-633: LGTM! Clean utility function. The `clean_expression` function is concise and handles whitespace cleanup effectively.

api/py/ai/chronon/group_by.py (4)
111-117: LGTM! Clear warning about performance impact. The docstring properly warns about high cardinality impact.

118-123: LGTM! Clear documentation. The docstring clearly explains the approximate nature of the operation.

124-131: LGTM! Well-documented distinction. The docstring clearly explains how this differs from APPROX_FREQUENT_K.
144-148: ❓ Verification inconclusive
Verify the impact of reducing k from 128 to 20.
The default accuracy parameter k has been reduced significantly.
🌐 Web query:
What is the recommended value for k in approximate percentile calculations?

💡 Result:
The recommended value for error tolerance (`k`) in approximate percentile calculations depends on balancing speed and accuracy requirements:

Default Recommendation:
- The standard error tolerance is `0.01` (1%) [4][7]. This provides a near-exact estimate (±1% deviation from true percentiles) while maintaining computational efficiency [2][4].

Large Datasets:
- For datasets exceeding 100k rows, `APPROX_PERCENTILE` achieves high accuracy (±0.01 range) even with default settings [2].
- Example: A true median of 100 would yield an approximate value between 99–101 [2].

Trading Accuracy for Speed:
- Increase tolerance to `0.05` (5%) for faster results at reduced precision [4].
- Higher tolerances reduce computational overhead but widen error margins (e.g., ±5% deviation).
Extreme Percentiles:
- Accuracy improves naturally near distribution edges (e.g., 1st or 99th percentile) due to data sparsity [2][9].
Key Formula:
For uniform distributions spanning [0, N], an approximate percentile at p will lie within:

$$\text{Approx}_p \in \left[\, p - k \cdot N,\ p + k \cdot N \,\right]$$

where $k$ is the error tolerance [2][4].

Summary:
- Use `k = 0.01` by default for most cases [4].
- Adjust to `k = 0.05` only if computational speed is prioritized over precision [2][4].

Citations:
- 1: https://www.indeed.com/career-advice/career-development/how-to-calculate-percentile
- 2: https://www.singlestore.com/blog/fast-and-accurate-percentiles-with-approx-percentile-in-singlestore/
- 3: https://statisticsbyjim.com/basics/percentiles/
- 4: https://docs.singlestore.com/cloud/reference/sql-reference/aggregate-functions/approx-percentile/
- 5: https://www.dummies.com/article/academics-the-arts/math/statistics/how-to-calculate-percentiles-in-statistics-169783/
- 6: https://www.youtube.com/watch?v=VcPWzNXs41Y
- 7: https://courses.lumenlearning.com/introstatscorequisite/chapter/measures-of-the-location-of-the-data-2/
- 8: https://opentextbc.ca/businesstechnicalmath/chapter/8-1-percentiles-and-quartiles/
- 9: https://en.wikipedia.org/wiki/Percentile
- 10: https://www.khanacademy.org/math/ap-statistics/density-curves-normal-distribution-ap/percentiles-cumulative-relative-frequency/v/calculating-percentile
Action: Confirm the impact of reducing k on percentile accuracy.
- In api/py/ai/chronon/group_by.py (Lines 144-148), the k value has been reduced from 128 to 20.
- Because k directly affects the accuracy of the approximate percentile calculation, please ensure that this lower k maintains acceptable error tolerance in your tests. Note that while external guidelines suggest error tolerance values of around 0.01 (1%) for similar algorithms, our implementation uses an integer parameter; validate that k=20 provides the expected trade-off between accuracy and speed.
spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (1)
491-517: LGTM! Consistent operation changes. The test has been updated to use APPROX_FREQUENT_K consistently for all three input columns.
spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (2)
322-324: LGTM! Consistent operation change. The test has been updated to use APPROX_FREQUENT_K for txn_types.

547-550: LGTM! Consistent operation change. The test has been updated to use APPROX_FREQUENT_K for rating.
Force-pushed from 7bcf9ca to 0ee71a4
Actionable comments posted: 0
🧹 Nitpick comments (5)
aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (4)
184-195: Add assertions for frequency values. Consider verifying that the frequencies of heavy hitters are higher than the frequencies of other items.

```diff
 topHistogram.size() shouldBe k
 heavyHitterElems.foreach(elem => topHistogram.containsKey(elem.toString))
+// Verify frequencies
+val heavyHitterFreqs = heavyHitterElems.map(elem => topHistogram.get(elem.toString))
+val otherFreqs = topHistogram.values.asScala.filterNot(heavyHitterFreqs.contains)
+heavyHitterFreqs.foreach(freq => otherFreqs.foreach(otherFreq => freq should be > otherFreq))
```
197-210: Add assertions for non-heavy hitters. Verify that non-heavy hitter elements are not included in the result.

```diff
 heavyHitterResult.size() shouldBe heavyHitterElems.size
 heavyHitterElems.foreach(elem => heavyHitterResult.containsKey(elem.toString))
+// Verify non-heavy hitters are excluded
+val nonHeavyHitters = (1 to 100).map(_.toString)
+nonHeavyHitters.foreach(elem => heavyHitterResult.containsKey(elem) shouldBe false)
```
184-195: Add more assertions to strengthen the test. While the test verifies the size and presence of heavy hitters, it could be enhanced by:
- Verifying the order of elements (most frequent first)
- Checking the actual frequency counts
```diff
 topHistogram.size() shouldBe k
 heavyHitterElems.foreach(elem => topHistogram.containsKey(elem.toString))
+// Verify frequencies are in descending order
+val frequencies = topHistogram.asScala.values.toSeq
+frequencies.zip(frequencies.tail).foreach { case (prev, next) =>
+  prev should be >= next
+}
```
197-210: Add frequency threshold verification. The test should verify that returned items exceed a minimum frequency threshold.

```diff
 heavyHitterResult.size() shouldBe heavyHitterElems.size
 heavyHitterElems.foreach(elem => heavyHitterResult.containsKey(elem.toString))
+// Verify all items exceed minimum frequency
+val minFrequency = 1000 // Each heavy hitter appears 1000 times
+heavyHitterResult.values().forEach(freq => freq should be >= minFrequency)
```

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (1)
441-445: Consider using structured logging instead of comments. Replace debug comments with proper logging statements.

```diff
-    // useful with debugger on - keep around
-    // val outputSketchSize = ir.sketch.getNumActiveItems
-    // val serializer = implicitly[FrequentItemsFriendly[T]].serializer
-    // val outputSketchBytes = ir.sketch.toByteArray(serializer).length
+    if (logger.isDebugEnabled) {
+      val outputSketchSize = ir.sketch.getNumActiveItems
+      val serializer = implicitly[FrequentItemsFriendly[T]].serializer
+      val outputSketchBytes = ir.sketch.toByteArray(serializer).length
+      logger.debug(s"Sketch stats: size=$outputSketchSize, bytes=$outputSketchBytes")
+    }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (20)
- aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2 hunks)
- aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (3 hunks)
- aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala (0 hunks)
- aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (2 hunks)
- api/py/ai/chronon/group_by.py (1 hunks)
- api/py/ai/chronon/repo/interactive.md (0 hunks)
- api/py/ai/chronon/repo/interactive.py (0 hunks)
- api/py/ai/chronon/utils.py (1 hunks)
- api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_label_join.v1 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_online_join.v1 (1 hunks)
- api/py/test/test_group_by.py (1 hunks)
- api/py/test/test_utils.py (1 hunks)
- api/thrift/api.thrift (1 hunks)
- docs/source/authoring_features/GroupBy.md (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (5 hunks)
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (1 hunks)
💤 Files with no reviewable changes (3)
- api/py/ai/chronon/repo/interactive.md
- api/py/ai/chronon/repo/interactive.py
- aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala
🚧 Files skipped from review as they are similar to previous changes (10)
- api/py/test/test_group_by.py
- api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1
- api/py/ai/chronon/utils.py
- spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala
- api/py/test/test_utils.py
- api/py/test/sample/production/joins/sample_team/sample_label_join.v1
- api/py/test/sample/production/joins/sample_team/sample_online_join.v1
- api/thrift/api.thrift
- spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala
- api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: streaming_tests
- GitHub Check: spark_tests
- GitHub Check: join_tests
- GitHub Check: groupby_tests
- GitHub Check: fetcher_tests
- GitHub Check: analyzer_tests
- GitHub Check: non_spark_tests
- GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (19)
aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (4)
12-14: LGTM: Required imports added for new test cases.
166-182: LGTM: Well-structured test data generation. The `createSkewedData` helper creates a good mix of frequent (4 elements × 1000) and infrequent (100 elements × 100) items.

12-14: LGTM! Required imports added. The imports are necessary for the new test cases.

166-182: Well-structured test data generator. The `createSkewedData` method effectively creates a dataset with both long tail and heavy hitter elements, perfect for testing frequency-based aggregations.

api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2 (2)
228-229: LGTM: Reduced k parameter for better efficiency. The reduction from 128 to 20 aligns with the PR's goal to fix aggregations.

228-229: ❓ Verification inconclusive
Verify impact of reduced k value.
Reducing k from 128 to 20 significantly lowers the number of items tracked. Ensure this doesn't affect accuracy requirements.
🏁 Script executed:
```bash
#!/bin/bash
# Search for other k values in configuration files
rg -g '*.v[0-9]' '"k":\s*"(128|20)"'
```

Length of output: 523
Reduced k value impact
All affected config files now use "k": "20" (e.g., sample_join_bootstrap.v2, group_by_with_kwargs.v1, sample_label_join.v1, sample_online_join.v1, sample_label_join_with_agg.v1). This significant drop from 128 may reduce tracking precision. Please confirm that accuracy requirements remain met.
spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (2)
147-148: ❓ Verification inconclusive
Investigate root cause before relaxing threshold.
Increasing the null ratio threshold from 0.2 to 0.3 appears to be a workaround. The TODO comment suggests this needs investigation.
🏁 Script executed:
```bash
#!/bin/bash
# Search for related test failures or issues
rg -A 5 "summary.*null.*ratio|null.*ratio.*threshold"
```

Length of output: 55
Investigate underlying issue before settling for a relaxed threshold.
- The threshold change from 0.2 to 0.3 appears to be a temporary workaround.
- Please manually verify the impact on test behavior and address the root cause rather than patching.
147-148: Document rationale for threshold change. The null ratio threshold has been relaxed from 0.2 to 0.3. The TODO comment suggests unexpected behavior.
Please clarify:
- Why is the test "acting up"?
- Is this a temporary workaround?
- What's the plan to investigate and fix the root cause?
aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (6)
137-138: LGTM: Improved type conversion helpers. Extracted common conversion logic into helper methods.

141-159: LGTM: Consistent type casting. Using the new helper methods makes the code more uniform.
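For context, here is a minimal Scala sketch of what such conversion helpers typically look like. The names and the supported cases are hypothetical, not the actual methods in ColumnAggregator.scala:

```scala
// Hypothetical helpers in the spirit of the reviewed change: normalize
// Chronon's numeric input types to the boxed Java types a sketch expects.
private def toJavaLong(value: Any): java.lang.Long = value match {
  case b: Byte  => b.toLong
  case s: Short => s.toLong
  case i: Int   => i.toLong
  case l: Long  => l
  case other    => throw new IllegalArgumentException(s"unsupported type: $other")
}

private def toJavaDouble(value: Any): java.lang.Double = value match {
  case f: Float  => f.toDouble
  case d: Double => d
  case other     => throw new IllegalArgumentException(s"unsupported type: $other")
}
```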
267-289: LGTM: Improved frequency estimation.
- Split into APPROX_FREQUENT_K and APPROX_HEAVY_HITTERS_K for better accuracy
- Added ErrorType.NO_FALSE_POSITIVES for heavy hitters
137-138: LGTM! Type conversion helpers added. The new helper methods improve type safety and readability.

267-278: LGTM! New aggregation operation added. The APPROX_FREQUENT_K operation is well-implemented with appropriate type handling.

279-290: LGTM! Heavy hitters implementation updated. The implementation now uses FrequentItems with NO_FALSE_POSITIVES error type.
aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2)
394-394: LGTM: Improved accuracy by preventing false negatives. The change to `NO_FALSE_NEGATIVES` ensures we don't miss any frequent items.

398-406: LGTM: Better sketch size calculation. The new calculation using `effectiveMapSize` properly accounts for internal map purge behavior.

api/py/ai/chronon/group_by.py (4)
111-131: LGTM: Added new aggregation operations with clear documentation. The new operations `FREQUENT_K`, `APPROX_FREQUENT_K`, and `APPROX_HEAVY_HITTERS_K` are well documented with their differences clearly explained.

144-146: ❓ Verification inconclusive
Verify the impact of reducing default k value.
The change from k=128 to k=20 for APPROX_PERCENTILE might affect accuracy.
🌐 Web query:
What is the recommended k value for KLL sketch in Apache DataSketches for percentile approximation?

💡 Result:
The recommended `k` value for KLL sketches in Apache DataSketches depends on balancing accuracy requirements with storage constraints:

Default Recommendation
- Apache DataSketches: Defaults to `k=200` [1][5][9], providing an epsilon (error bound) of ~1.33% for quantile estimates and 1.65% for PMF/CDF calculations [1].

Key Considerations

| Factor   | Impact                                                                                           |
|----------|--------------------------------------------------------------------------------------------------|
| Accuracy | Higher `k` reduces error bounds (e.g., `k=128` gives ~1.725% rank error [4], `k=200` ~1.33% [1])  |
| Storage  | Sketch size grows linearly with `k` (e.g., `k=200` uses ~432 bytes vs `k=128` ~276 bytes) [1][4]  |

Implementation-Specific Notes
- Cloudera Impala/Druid: Supports `k=8` to `k=65535`, defaulting to 200 [1][9]
- Imply Polaris: Uses power-of-2 `k` values (2–32768), defaulting to 128 [4]
- Microsoft SQL Server: Uses KLL internally but doesn't expose `k` tuning [10]

For most use cases, `k=200` strikes a practical balance between accuracy and resource usage [1][5]. Increase `k` when:
- You require sub-1% error bounds
- Your data distribution has heavy tails/extreme outliers
- Storage costs are not a primary concern

Decrease `k` when prioritizing sketch compactness over precision [4][9].

Citations:
- 1: https://docs.cloudera.com/cdw-runtime/cloud/manage-impala/topics/impala-datasketches-kll.html
- 2: https://datasketches.apache.org/docs/KLL/KLLSketch.html
- 3: https://apache.github.io/datasketches-java/6.1.0/org/apache/datasketches/kll/KllSketch.html
- 4: https://docs.imply.io/polaris/sketches-quantiles/
- 5: https://github.com/apache/datasketches-cpp/blob/master/kll/include/kll_sketch.hpp
- 6: https://www.youtube.com/watch?v=-lXDBtpg_aQ
- 7: https://github.com/apache/datasketches-python
- 8: https://apache.github.io/datasketches-python/5.1.0/quantiles/quantiles_depr.html
- 9: https://druid.apache.org/docs/latest/development/extensions-core/datasketches-kll/
- 10: https://learn.microsoft.com/en-us/sql/t-sql/functions/approx-percentile-disc-transact-sql?view=sql-server-ver16
Action: Confirm k Value Impact on Accuracy
The current k value of 20 is significantly lower than both our previous default (128) and the Apache DataSketches-recommended value (≈200). This reduction could notably degrade the accuracy of percentile approximations. Please verify that the lower precision is acceptable for our use case and add documentation or tests to justify the trade-off.
- File: api/py/ai/chronon/group_by.py, Lines 144–148
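To make the trade-off concrete, here is a minimal standalone Scala sketch against the Apache DataSketches KLL API (assuming `KllFloatsSketch.newHeapInstance` and `getSerializedSizeBytes` from datasketches-java; the uniform input stream is illustrative, not from the repository):

```scala
import org.apache.datasketches.kll.KllFloatsSketch

// Compare the two k values under discussion: a smaller k yields a more
// compact sketch with a wider rank-error bound; k=200 is the library default.
val coarse = KllFloatsSketch.newHeapInstance(20)
val fine   = KllFloatsSketch.newHeapInstance(200)

(1 to 100000).foreach { i =>
  coarse.update(i.toFloat)
  fine.update(i.toFloat)
}

// Query the median (rank 0.5) from both sketches; the coarse one may
// deviate further from the true value of ~50000.
println(s"k=20  median estimate: ${coarse.getQuantile(0.5)}")
println(s"k=200 median estimate: ${fine.getQuantile(0.5)}")
println(s"bytes: k=20 -> ${coarse.getSerializedSizeBytes}, k=200 -> ${fine.getSerializedSizeBytes}")
```

On a smooth uniform stream like this, both estimates land near the true median, but the k=20 sketch carries a noticeably wider worst-case rank-error bound in exchange for a smaller footprint.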
docs/source/authoring_features/GroupBy.md (1)

136-151: LGTM: Documentation accurately reflects code changes. The table of properties correctly documents the new aggregation types and their characteristics.
Force-pushed from 0ee71a4 to 1558ca8
Actionable comments posted: 0
🧹 Nitpick comments (1)
aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (1)
441-445: Consider removing commented debug code. These debugging statements should be removed or moved to a logging framework.

```diff
-    // useful with debugger on - keep around
-    // val outputSketchSize = ir.sketch.getNumActiveItems
-    // val serializer = implicitly[FrequentItemsFriendly[T]].serializer
-    // val outputSketchBytes = ir.sketch.toByteArray(serializer).length
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (20)
- aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2 hunks)
- aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (3 hunks)
- aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala (0 hunks)
- aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (2 hunks)
- api/py/ai/chronon/group_by.py (1 hunks)
- api/py/ai/chronon/repo/interactive.md (0 hunks)
- api/py/ai/chronon/repo/interactive.py (0 hunks)
- api/py/ai/chronon/utils.py (1 hunks)
- api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_label_join.v1 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1 (1 hunks)
- api/py/test/sample/production/joins/sample_team/sample_online_join.v1 (1 hunks)
- api/py/test/test_group_by.py (1 hunks)
- api/py/test/test_utils.py (1 hunks)
- api/thrift/api.thrift (1 hunks)
- docs/source/authoring_features/GroupBy.md (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (5 hunks)
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala (1 hunks)
💤 Files with no reviewable changes (3)
- api/py/ai/chronon/repo/interactive.py
- api/py/ai/chronon/repo/interactive.md
- aggregator/src/test/scala/ai/chronon/aggregator/test/ApproxHistogramTest.scala
🚧 Files skipped from review as they are similar to previous changes (11)
- spark/src/test/scala/ai/chronon/spark/test/stats/drift/DriftTest.scala
- api/py/test/sample/production/group_bys/sample_team/group_by_with_kwargs.v1
- api/py/test/sample/production/joins/sample_team/sample_online_join.v1
- api/py/test/sample/production/joins/sample_team/sample_join_bootstrap.v2
- api/py/test/sample/production/joins/sample_team/sample_label_join.v1
- api/py/test/test_group_by.py
- api/py/test/test_utils.py
- api/py/test/sample/production/joins/sample_team/sample_label_join_with_agg.v1
- api/thrift/api.thrift
- spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala
- spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala
⏰ Context from checks skipped due to timeout of 90000ms (16)
- GitHub Check: streaming_tests
- GitHub Check: join_tests
- GitHub Check: streaming_tests
- GitHub Check: groupby_tests
- GitHub Check: groupby_tests
- GitHub Check: fetcher_tests
- GitHub Check: analyzer_tests
- GitHub Check: analyzer_tests
- GitHub Check: fetcher_tests
- GitHub Check: non_spark_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: spark_tests
- GitHub Check: spark_tests
- GitHub Check: non_spark_tests
- GitHub Check: join_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (15)
api/py/ai/chronon/utils.py (2)
579-626: Well-implemented utility function for method composition. The `compose` function elegantly handles nested method calls with proper indentation and argument handling.

629-633: Clean utility for expression formatting. Simple and effective implementation for removing excess whitespace.
docs/source/authoring_features/GroupBy.md (1)
136-150: Documentation properly updated to reflect code changes. Table has been reformatted for better clarity and now accurately describes the new aggregation types (`approx_frequent_k` and `approx_heavy_hitter_k`) that replace the deprecated `approx_histogram_k`.

aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala (4)
137-139: Good helper methods for type conversion. Encapsulating the conversion logic improves readability and maintainability.

140-160: Refactored type casting using helper methods. Now using the helper methods consistently throughout type casting operations.

267-277: Replaced APPROX_HISTOGRAM_K with APPROX_FREQUENT_K implementation. Now uses FrequentItems with default error type and handles all data types properly.

278-290: Updated APPROX_HEAVY_HITTERS_K implementation. Now uses FrequentItems with ErrorType.NO_FALSE_POSITIVES for optimized heavy hitter detection.
aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (3)
166-182: Good test data generation utility. Creates a realistic skewed distribution with both long tail and heavy hitter elements.

184-195: Effective test for MostFrequentK behavior. Ensures the implementation returns exactly k items with heavy hitters included.

197-210: Thorough test for HeavyHittersK behavior. Verifies the implementation correctly identifies only true heavy hitters.
aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala (2)
394-394: Changed default error type to improve frequency analysis. Default error type now favors including all frequent items (with potential false positives) rather than missing any.

398-406: Improved sketch size calculation to maintain effective capacity. The new calculation properly accounts for internal map purge behavior, ensuring mapSize is properly maintained.
api/py/ai/chronon/group_by.py (3)
62-131: Improved documentation and semantics of operations. Added clear docstrings and replaced vague histogram operations with more specific frequent item operations that better distinguish between exact calculation, approximation, and heavy hitter detection.

147-147: Improved docstring for APPROX_PERCENTILE. Added documentation that explains the configurable accuracy parameter.

145-145: ✅ Verification successful
Reduced default aggregation parameter from 128 to 20.
The k parameter for APPROX_PERCENTILE now defaults to 20 instead of 128, which reduces memory usage but may impact accuracy.
🏁 Script executed:
```bash
#!/bin/bash
# Check if this parameter change is reflected in tests
rg -A 3 "APPROX_PERCENTILE" --glob "api/py/test/*.py"
```

Length of output: 804
Aggregation default k parameter change verified
Tests in `api/py/test/test_group_by.py` confirm that `k=20` is correctly set for `APPROX_PERCENTILE`, matching the intended change. No further action is required.
Actionable comments posted: 0
🧹 Nitpick comments (2)
aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (2)
166-182: Consider documenting distribution characteristics. The skewed data method is well-implemented but dropping 1000 random elements could affect test reliability.

```diff
 private def createSkewedData(): Array[Long] = {
   // 10k elements - each repeating 100 times
   val longTail = (1 to 100).flatMap(_ => 1 to 100)
   // 4 elements - each repeating 1000 times
   val heavyHitters = (1 to 1000).flatMap(_ => heavyHitterElems)
-  // all of them together and shuffled
+  // Combine and shuffle all data, then drop 1000 random items
+  // This creates a distribution with ~10% heavy hitters (4000 out of ~13000 elements)
   Random
     .shuffle(longTail ++ heavyHitters)
     .iterator
     .map(_.toLong)
     .drop(1000) // delete a few random items to produce noise
     .toArray
 }
```
184-195: Test verifies size but not frequencies. Test checks for expected elements but could verify frequency values too.

```diff
 "MostFrequentK" should "always produce nearly k elements when cardinality is > k" in {
   val k = 10
   val topFrequentItems = new FrequentItems[java.lang.Long](k)
   val frequentItemsIr = topFrequentItems.prepare(0)
   createSkewedData().foreach(i => topFrequentItems.update(frequentItemsIr, i))
   val topHistogram = topFrequentItems.finalize(frequentItemsIr)
   math.abs(topHistogram.size() - k) <= 2 shouldBe true
   heavyHitterElems.foreach(elem => topHistogram.containsKey(elem.toString))
+  // Verify frequencies of heavy hitters are high
+  heavyHitterElems.foreach(elem =>
+    topHistogram.get(elem.toString) should be >= 800L)
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (1)
aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (17)
- GitHub Check: streaming_tests
- GitHub Check: groupby_tests
- GitHub Check: streaming_tests
- GitHub Check: analyzer_tests
- GitHub Check: spark_tests
- GitHub Check: join_tests
- GitHub Check: join_tests
- GitHub Check: groupby_tests
- GitHub Check: fetcher_tests
- GitHub Check: fetcher_tests
- GitHub Check: spark_tests
- GitHub Check: python_tests
- GitHub Check: analyzer_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: non_spark_tests
- GitHub Check: non_spark_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (3)
aggregator/src/test/scala/ai/chronon/aggregator/test/FrequentItemsTest.scala (3)
12-15: Good imports. The added imports support new test functionality.

83-87: Updated sketch sizes match implementation changes. Sketch size expectations correctly reflect new internal sizing logic.

197-210: Good error type specification. Test correctly uses NO_FALSE_POSITIVES to test heavy hitters functionality.
## Summary

The existing aggregations configure the items sketch incorrectly. Split it into two: one that works purely with skewed data, and one that tries to best-effort collect the most frequent items.

## Checklist

- [x] Added Unit Tests
- [x] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit

- **New Features**
  - Introduced new utility functions to streamline expression composition and cleanup.
  - Enhanced aggregation descriptions for clearer operation choices.
  - Added new aggregation types for improved data analysis.
- **Refactor**
  - Revamped frequency analysis logic with improved error handling and optimized sizing.
  - Replaced legacy histogram approaches with a more robust frequent item detection mechanism.
- **Tests**
  - Added tests to validate heavy hitter detection and skewed data scenarios, while removing obsolete histogram tests.
  - Updated existing tests to reflect changes in aggregation parameters.
- **Chores**
  - Removed deprecated interactive modules for a leaner deployment.
- **Configuration**
  - Adjusted default aggregation parameters for more consistent processing, including changes to the `k` value in multiple configurations.