
Conversation

Contributor

@piyush-zlai piyush-zlai commented Feb 14, 2025

Summary

While trying to read the updated beacon top topic, we hit issues because the number of Avro fields exceeds Spark's default codegen field limit of 100. As a result, the whole-stage codegen output is incorrect and we end up with either segfaults (unit tests) or garbled events (prod Flink jobs). This PR bumps the limit so we can read beacon top (374 fields), and adds an assert in CatalystUtil's whole-stage codegen path so we fail fast if we ever encounter a schema with more fields than the bumped limit.
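
A minimal sketch of the two pieces described above, assuming the limit in question is Spark's spark.sql.codegen.maxFields setting (default 100); the object name and wiring are illustrative, not the exact CatalystUtil diff:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object CodegenFieldLimit {
  // Headroom above beacon top's 374 fields.
  val MaxFields: Int = 1000

  // Raise Spark's whole-stage codegen field limit (default 100) on the session.
  def configure(spark: SparkSession): Unit =
    spark.conf.set("spark.sql.codegen.maxFields", MaxFields.toString)

  // Fail fast instead of generating incorrect codegen output for oversized schemas.
  def validate(schema: StructType): Unit =
    require(
      schema.fields.length <= MaxFields,
      s"Too many fields in input schema. We support a max of: $MaxFields. Schema: ${schema.simpleString}"
    )
}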

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Enhanced data processing robustness with improved handling and early error detection for large schemas.
    • Refined SQL query formatting for clearer logical conditions.
  • Tests

    • Added a new validation for large schema deserialization.
    • Updated test definitions to improve structure and readability.

Contributor

coderabbitai bot commented Feb 14, 2025

Walkthrough

This pull request enhances Avro deserialization testing and CatalystUtil functionality. A new test validates the handling of large Avro schemas using added Spark SQL and Avro-related imports. In CatalystUtil, a constant (MaxFields = 1000) is introduced with an early field count check and updated SQL clause formatting. Additionally, CatalystUtil test cases are refactored to utilize the ScalaTest DSL style.

Changes

  • flink/.../AvroDeSerializationSupportSpec.scala: Added new imports (AvroConversions, CatalystUtil, Row, ExpressionEncoder) and a test case "successfully deser large avro schemas" validating large Avro schema deserialization and SQL filtering.
  • online/.../CatalystUtil.scala, online/.../CatalystUtilTest.scala: Introduced MaxFields = 1000 to enforce field limits with early checks; modified SQL clause formatting in CatalystUtil; refactored test cases in CatalystUtilTest to the ScalaTest DSL style.
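
For readers unfamiliar with the ScalaTest DSL style mentioned above, a generic illustration follows; the suite, subject, and assertions are hypothetical, not the actual CatalystUtilTest cases:

import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class ExampleDslSpec extends AnyFlatSpec with Matchers {
  private val maxFields = 1000

  "A schema field-count check" should "accept schemas under the limit" in {
    374 should be <= maxFields
  }

  it should "flag schemas over the limit" in {
    (1200 <= maxFields) shouldBe false
  }
}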

Sequence Diagram(s)

sequenceDiagram
    participant T as Test Framework
    participant ADS as AvroDeserializationSupport
    participant CU as CatalystUtil
    T->>ADS: Invoke build(schema, payload)
    ADS->>CU: Forward Spark row for processing
    CU-->>T: Return processed/filtered row result


Suggested reviewers

  • nikhil-zlai
  • varant-zlai

Poem

In our code the tests now gleam,
Avro and Catalyst join the team.
Limits check fields with utmost care,
SQL clauses wrapped, precise and fair.
Code leaps ahead with spark and theme 🚀.

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (1)

74-103: Test could be more comprehensive.

Consider:

  1. Adding positive test case where event matches the where clause.
  2. Moving schema and payload to test resources for better maintainability.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between a493d35 and 4fb28d3.

📒 Files selected for processing (3)
  • flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/CatalystUtil.scala (5 hunks)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: non_spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (3)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (3)

58-59: LGTM! Good practice defining MaxFields as a constant.

The value of 1000 provides sufficient headroom above the current 374 fields in beacon top.

Also applies to: 75-75


206-211: LGTM! Early validation prevents runtime issues.

Good error message that includes the schema details for debugging.


143-144: LGTM! Wrapping clauses in parentheses prevents operator precedence issues.
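
A quick illustration, not from the PR, of the precedence issue the wrapping guards against: in SQL, AND binds tighter than OR, so composing user-supplied clauses without parentheses can change their meaning. The clause strings below are hypothetical.

// Hypothetical clauses, for illustration only.
val clauseA = "status = 'active' OR status = 'trial'"
val clauseB = "country = 'US'"

// Without parentheses: parsed as  status = 'active' OR (status = 'trial' AND country = 'US')
val unsafeWhere = s"$clauseA AND $clauseB"

// With parentheses: the intended  (status = 'active' OR status = 'trial') AND country = 'US'
val safeWhere = s"($clauseA) AND ($clauseB)"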

Collaborator

@tchow-zlai tchow-zlai left a comment


awesome!

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🔭 Outside diff range comments (1)
spark/src/test/scala/ai/chronon/spark/test/join/LabelJoinTest.scala (1)

399-401: 🛠️ Refactor suggestion

Fix potential flaky test by using a fixed timestamp.

Using System.currentTimeMillis() can make tests non-deterministic. Consider using a fixed timestamp for reproducible tests.

-    val now = System.currentTimeMillis()
-    val today = tableUtils.partitionSpec.at(now)
+    val fixedTimestamp = 1677628800000L  // e.g., 2023-03-01 00:00:00 UTC
+    val today = tableUtils.partitionSpec.at(fixedTimestamp)
🧹 Nitpick comments (15)
spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (1)

65-71: Consider removing redundant partition column usage.

The partition column is used twice: once in column creation and again in the gen call, which might lead to confusion.

-    gen(spark, columns :+ Column(partitionColumnString, StringType, partitions), count, partitionColumn)
+    gen(spark, columns :+ Column(partitionColumnString, StringType, partitions), count)
spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (2)

55-86: Add more descriptive assertion messages.

The assertion messages in these test cases could be more descriptive to help debug failures.

-    assertEquals(0, diff.count())
+    assertEquals(0, diff.count(), s"Found ${diff.count()} mismatched rows between actual and expected results")

Also applies to: 88-137, 139-206, 207-267


397-417: Consider adding randomization to test data.

The test data generation in createTestSource could benefit from more randomization to catch edge cases.

-    val sourceSchema = List(
-      Column("user", StringType, 10000),
-      Column("item", StringType, 100),
-      Column("time_spent_ms", LongType, 5000),
-      Column("price", DoubleType, 100)
-    )
+    val sourceSchema = List(
+      Column("user", StringType, scala.util.Random.nextInt(20000) + 1000),
+      Column("item", StringType, scala.util.Random.nextInt(200) + 50),
+      Column("time_spent_ms", LongType, scala.util.Random.nextInt(10000) + 1000),
+      Column("price", DoubleType, scala.util.Random.nextInt(200) + 50)
+    )
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (1)

206-211: Consider enhancing the error message.

The validation logic is correct, but the error message could be more actionable.

-          s"Too many fields in input schema. We support a max of: ${CatalystUtil.MaxFields}. Schema: ${inputSparkSchema.simpleString}"
+          s"Input schema exceeds maximum field limit of ${CatalystUtil.MaxFields}. Consider splitting the schema. Current schema: ${inputSparkSchema.simpleString}"
flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (1)

77-103: Consider adding more assertions.

The test validates the core functionality but could be more thorough.

Add assertions to verify:

  1. Schema field count
  2. Successful deserialization of specific fields
  3. CatalystUtil's handling of the large schema
 assert(result.isEmpty) // no rows should be returned as the event is not in the where clause
+assert(chrononSchema.fields.length > 100, "Should be a large schema")
+assert(sparkRow != null, "Should deserialize successfully")
+assert(catalystInternalRow != null, "Should process through CatalystUtil")
spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (1)

215-294: Add unit tests for SQL helper methods.

Complex SQL queries in computeSimpleAverageThroughSql and computeLastThroughSql should be unit tested to ensure correctness.

Would you like me to help create unit tests for these SQL helper methods?

Also applies to: 300-409

spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2)

98-111: Prefix logic
Use stripSuffix to simplify trailing slash handling.

-    } else if (barePrefix.endsWith("/")) {
-      Some(barePrefix)
-    } else {
-      Some(barePrefix + "/")
-    }
+    Some(barePrefix.stripSuffix("/") + "/")

817-818: Temporary fix
Offer final approach later.

spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (2)

27-28: Consider using a more efficient fetcher initialization.

Lazy transient fields could be replaced with a factory method for better testability.


38-42: Deliberate key mistyping needs documentation.

Add a comment explaining why keys are deliberately mistyped and the expected behavior.

spark/BUILD.bazel (1)

110-118: Consider adding size tags to test suites.

Like the "large" tag in the main test suite, other suites might benefit from size tags.

spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala (1)

67-69: Consider extracting partition column renaming logic.

Repeated renaming operations could be moved to a utility method.

+  private def withDefaultPartition(df: DataFrame, partitionCol: String): DataFrame = {
+    if (partitionCol == "ds") df else df.withColumnRenamed(partitionCol, "ds")
+  }
spark/src/main/scala/ai/chronon/spark/Extensions.scala (1)

330-331: Consider handling null partition columns more explicitly.

The partitionBy(null: _*) usage could be made more explicit about its intention.

Consider adding a comment explaining why null partitioning is used, or create a helper method with a descriptive name:

+  private def withoutPartitioning[T](writer: DataFrameWriter[T]): DataFrameWriter[T] = {
+    writer.partitionBy(null: _*)
+  }

Also applies to: 340-341

.github/workflows/test_scala_spark.yaml (2)

176-176: Add newline at end of file.

Add a newline character at the end of the file to comply with POSIX standards.

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 176-176: no new line character at the end of file

(new-line-at-end-of-file)


48-176: Consider using job templates to reduce duplication.

The new test jobs share identical configuration. Consider using job templates or composite actions to reduce duplication.

🧰 Tools
🪛 actionlint (1.7.4)

49-49, 75-75, 101-101, 127-127, 153-153: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🪛 YAMLlint (1.35.1)

[error] 176-176: no new line character at the end of file

(new-line-at-end-of-file)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 4fb28d3 and 2dedcc2.

📒 Files selected for processing (36)
  • .bazelrc (1 hunks)
  • .github/workflows/test_scala_spark.yaml (1 hunks)
  • api/py/test/sample/staging_queries/quickstart/purchases.py (1 hunks)
  • api/src/main/scala/ai/chronon/api/Builders.scala (2 hunks)
  • api/thrift/api.thrift (2 hunks)
  • flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/CatalystUtil.scala (5 hunks)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala (2 hunks)
  • spark/BUILD.bazel (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/Analyzer.scala (4 hunks)
  • spark/src/main/scala/ai/chronon/spark/BootstrapInfo.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala (4 hunks)
  • spark/src/main/scala/ai/chronon/spark/GroupBy.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/Join.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinBase.scala (5 hunks)
  • spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/TableUtils.scala (14 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/TaggedFilterSuite.scala (0 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/AnalyzerTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala (3 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/ChainingFetcherTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/JavaFetcherTest.java (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByUploadTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/FeatureWithLabelJoinTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/JoinTest.scala (30 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/JoinUtilsTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/LabelJoinTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/streaming/AvroTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/streaming/KafkaStreamBuilderTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/streaming/StreamingTest.scala (3 hunks)
💤 Files with no reviewable changes (1)
  • spark/src/test/scala/ai/chronon/spark/test/TaggedFilterSuite.scala
✅ Files skipped from review due to trivial changes (9)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/JavaFetcherTest.java
  • spark/src/test/scala/ai/chronon/spark/test/streaming/KafkaStreamBuilderTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/FeatureWithLabelJoinTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/streaming/AvroTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/ChainingFetcherTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByUploadTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/AnalyzerTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/JoinUtilsTest.scala
🚧 Files skipped from review as they are similar to previous changes (1)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala

⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: streaming_tests
  • GitHub Check: groupby_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: spark_tests
  • GitHub Check: join_tests
  • GitHub Check: bazel_config_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: fetcher_tests
  • GitHub Check: non_spark_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (76)
spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (3)

19-20: LGTM!

Import consolidation improves code organization.


195-201: LGTM!

The partition column drop prevents duplicated column errors during joins.


268-269: LGTM!

Variable renaming enhances code clarity.

spark/src/test/scala/ai/chronon/spark/test/streaming/StreamingTest.scala (3)

17-17: LGTM!

Package move to streaming-specific location improves organization.


22-22: LGTM!

Import changes maintain clean dependencies while preserving access to required functionality.

Also applies to: 25-25


54-54: LGTM!

Consistent use of StreamingTest.buildInMemoryKvStore() improves clarity.

Also applies to: 113-113

spark/src/test/scala/ai/chronon/spark/test/join/LabelJoinTest.scala (1)

17-22: LGTM! Package organization and import consolidation.

The package move and import consolidation improve code organization.

spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (2)

41-45: LGTM! Clean addition of partition column parameter.

Maintains backward compatibility while adding flexibility.


52-61: LGTM! Consistent partition column handling.

Aligns with gen method changes.

spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (1)

17-43: LGTM! Well-organized imports.

Package and import statements are well-structured and properly consolidated.

online/src/main/scala/ai/chronon/online/CatalystUtil.scala (3)

36-36: LGTM!

Import required for accessing Spark SQL configuration settings.


58-59: LGTM!

Constant value (1000) provides sufficient headroom above Beacon top's 374 fields.


75-75: LGTM!

Properly configures Spark's codegen field limit.

flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2)

3-5: LGTM!

All imports are required for the new test functionality.

Also applies to: 13-14


74-76: LGTM!

Clear explanation of test's purpose and scope.

spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (3)

17-17: LGTM! Package organization and import consolidation.

Package moved to streaming subpackage and imports consolidated for better organization.

Also applies to: 22-27


432-859: LGTM! Comprehensive test coverage.

Test cases thoroughly cover various scenarios with clear documentation and assertions.


36-36:

❓ Verification inconclusive

Verify impact of removing TaggedFilterSuite.

Removal of TaggedFilterSuite mixin might affect test filtering capabilities.


🏁 Script executed:

#!/bin/bash
# Check if TaggedFilterSuite is used elsewhere and if there's a replacement
rg -A 5 "TaggedFilterSuite"

Length of output: 1451


Impact of TaggedFilterSuite Removal

  • MutationsTest now extends only AnyFlatSpec while other tests (e.g., in CatalystUtilHiveUDFTest.scala) still use TaggedFilterSuite.
  • Confirm that excluding TaggedFilterSuite in the streaming tests is intentional and that any required test filtering is handled appropriately.
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (24)

29-34: Imports for creation statuses
Looks good.


36-36: Additional imports
Usage seems correct.


82-82: Small mode cutoff change
No issues.


202-204: Partition column parameter
Flexibility is improved.


208-208: Dynamic partition columns
Implementation looks good.


302-303: autoExpand parameter
Returning TableCreationStatus is clear.


306-306: Creation status assignment
Logic is consistent.


310-313: Generate table builder
Valid approach.


321-321: TableAlreadyExists
Gracefully handled.


327-328: Check existing table
Redundant but harmless.


331-342: Handling table props & autoExpand
Pattern matching is neat.


356-356: Blank line
No concerns.


359-360: Column rearrangement
Keeps partition columns last.


364-364: createTable invocation
Fine.


383-386: No operation for WithInitialData
Confirm if omission is intended.


434-434: Whitespace
No issues.


440-440: Another createTable call
Seems consistent.


442-445: No action on WithInitialData
Check if skipping is correct.


486-486: Partition columns param
Helpful extension.


835-835: New parameter in scanDf
Straightforward.


837-838: Effective partition column
Works neatly.


844-844: scanDfBase usage
Looks correct.


846-850: Conditional rename
Smart fallback.


863-866: New table creation statuses
Clear usage.

.bazelrc (1)

7-7: Test timeout raised
Validate necessity.

api/py/test/sample/staging_queries/quickstart/purchases.py (4)

1-14: License header
Approved.


15-15: Imports
Looks fine.


19-29: Query definition
Placeholders look correct.


31-38: StagingQuery
Neatly structured.

spark/BUILD.bazel (1)

98-107: LGTM! Well-organized test suite structure.

Clear separation of test categories improves maintainability.

spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala (1)

47-48: LGTM! Good default value for partition column.

Maintains backward compatibility while adding flexibility.

api/src/main/scala/ai/chronon/api/Builders.scala (1)

49-63: LGTM! Clean addition of partition column support.

The implementation maintains backward compatibility and follows the builder pattern consistently.

spark/src/main/scala/ai/chronon/spark/BootstrapInfo.scala (1)

187-191: LGTM! Improved partition column handling.

The change to use effectivePartitionColumn aligns with the flexible partition column support.

spark/src/main/scala/ai/chronon/spark/Extensions.scala (2)

387-392: LGTM! Added SourceSparkOps for partition column handling.

Clean implementation of partition column retrieval with fallback to tableUtils.


394-397: LGTM! Added QuerySparkOps for effective partition column.

Well-structured implementation with proper null handling.

api/thrift/api.thrift (2)

19-20: LGTM! Added partitionColumn to Query struct.

Clean addition of optional field for partition column support.


53-57: LGTM! Added partitionColumn to StagingQuery struct.

Well-documented field with clear purpose.

spark/src/main/scala/ai/chronon/spark/JoinBase.scala (3)

54-54: LGTM! Added implicit TableUtils.

Clean addition of implicit value for better code organization.


173-175: LGTM! Updated unfilledRanges with partition column.

Consistent use of effectivePartitionColumn across join operations.

Also applies to: 488-490


287-288: LGTM! Improved variable naming.

Renamed variable for better clarity without changing functionality.

spark/src/main/scala/ai/chronon/spark/Join.scala (1)

518-518: LGTM! Using query-specific partition column.

The change correctly uses the effective partition column from the query instead of the default partition column.

spark/src/main/scala/ai/chronon/spark/Analyzer.scala (1)

316-319: LGTM! Using query-specific partition column.

The change correctly uses the effective partition column from the query for unfilled ranges.

spark/src/main/scala/ai/chronon/spark/GroupBy.scala (1)

629-629: LGTM! Using query-specific partition column.

The change correctly uses the partition column from the query for meta columns.

spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (1)

62-62: Verify why this test is being ignored.

The test for metadata store is being disabled. Please ensure there's a tracking issue to re-enable it.

spark/src/test/scala/ai/chronon/spark/test/join/JoinTest.scala (12)

125-126: LGTM! Reduced test data size.

Reducing entity count from 1000 to 100 improves test execution time while maintaining test coverage.

Also applies to: 173-174


438-440: LGTM! Added partition column support.

The addition of partition column support enhances test coverage for different partitioning scenarios.


429-551: LGTM! Added test for different partition columns.

New test case validates join behavior with different partition columns.


125-126: LGTM! Reduced test data size and added partition column support.

The changes reduce test data size from 1000 to 100 records and add partition column support, which aligns with addressing the Avro field limit issues.

Also applies to: 173-174, 318-319, 340-341, 356-357, 439-441, 561-562, 578-579, 614-615, 684-686, 708-709, 1034-1035, 1070-1071, 1136-1137, 1156-1157, 1171-1172, 1322-1323, 1335-1336


17-17: LGTM! Package organization improvement.

Moving join tests to a dedicated package improves code organization.

Also applies to: 33-33, 34-37


125-126: LGTM! Test data size optimization.

Reducing test data size improves test execution time while maintaining test coverage.


173-174: LGTM! Improved partition column handling.

Making partition column configurable and consistent improves flexibility.

Also applies to: 225-232


429-551: LGTM! Comprehensive partition column test coverage.

New test case thoroughly validates different partition column scenarios.


318-318: LGTM! Consistent test data optimization.

Test data size reduction is consistently applied across all test cases.

Also applies to: 340-340, 463-463


859-859: LGTM! Consistent test data optimization.

Test data size reduction is consistently applied to name and user test cases.

Also applies to: 875-875, 883-883


614-614: LGTM! Consistent test data optimization.

Test data size reduction is consistently applied to view events test cases.

Also applies to: 635-635, 685-685, 708-708, 1034-1034, 1070-1070, 1136-1136, 1156-1156, 1171-1171


1322-1322: LGTM! Consistent test data optimization.

Test data size reduction is consistently applied to selected join parts test.

Also applies to: 1335-1335

.github/workflows/test_scala_spark.yaml (2)

71-72: LGTM! Increased memory allocation.

Increased JVM memory settings (-Xmx16G -Xms8G) for component-specific test jobs will help prevent out-of-memory errors.

Also applies to: 97-98, 123-124, 149-150, 175-176


71-71: LGTM! Increased memory limits for test jobs.

The increased memory limits (-Xmx16G -Xms8G) will help handle large Avro schemas.

Also applies to: 97-97, 123-123, 149-149, 175-175

Comment on lines 65 to 70
val result = Await.result(future, Duration(10000, SECONDS)) // todo: change back to millis
val latency = System.currentTimeMillis() - start
latencySum += latency
latencyCount += 1
result
}

💡 Verification agent

🧩 Analysis chain

Verify timeout duration.

The TODO comment suggests changing from seconds to milliseconds.


🏁 Script executed:

#!/bin/bash
# Search for similar timeout patterns in tests
rg -A 2 'Duration\([0-9]+.*SECONDS\)' 

Length of output: 1767


Update Timeout Duration Unit:
In spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (lines 65–70), the timeout is specified as Duration(10000, SECONDS) with a TODO to revert to milliseconds. Since other tests (e.g., in ExternalSourcesTest.scala) use a 10‑second timeout, please verify if the intended behavior is 10 seconds (i.e. 10000 ms) and update accordingly (e.g. change to Duration(10000, MILLISECONDS)).
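
If the intent is indeed a 10-second timeout, a minimal sketch of the corrected call follows; the Future here is a placeholder for the actual fetcher call:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.{Duration, MILLISECONDS}

val future: Future[Int] = Future(42) // placeholder for the fetch request
val result = Await.result(future, Duration(10000, MILLISECONDS)) // 10 s, expressed in ms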

 def analyzeJoin(joinConf: api.Join,
                 skewDetection: Boolean = false,
-                validateTablePermission: Boolean = true,
+                validateTablePermission: Boolean = false,

💡 Verification agent

🧩 Analysis chain

Verify if table permission validation can be disabled.

Changing the default value of validateTablePermission from true to false might affect security if table permission validation is needed.


🌐 Web query:

What are the security implications of disabling table permission validation in Apache Spark?

💡 Result:

Disabling table permission validation in Apache Spark introduces significant security risks by bypassing critical access control mechanisms. Here's a breakdown of the implications based on industry practices and documented vulnerabilities:

1. Unauthorized Data Access

  • Full Exposure of Sensitive Data: Without permission checks, any user could query/modify tables containing PII, financial records, or proprietary data[2][6][7].
  • Bypass of Row/Column-Level Security: Tools like Apache Ranger and Spark-Authorizer enforce granular controls (e.g., masking phone numbers), which become ineffective[2][7].

2. Privilege Escalation Risks

  • Exploitable Code Execution: Attackers could inject malicious UDFs or modify query plans to execute arbitrary commands[6][10].
  • Credential Theft: Executors with unrestricted access might leak credentials stored in memory or configurations[6].

3. Compliance Violations

  • GDPR/HIPAA Non-Compliance: Loss of audit trails for data access makes regulatory compliance impossible[7][10].
  • Insecure Defaults: Folders/files created without inherited permissions (e.g., 777 on ADLS Gen1) expose data broadly[5].

4. System Integrity Compromises

  • SQL Injection Vulnerabilities: Malicious actors could alter query logic via unvalidated inputs[8].
  • UI Exploits: Disabled ACLs enable unauthorized users to view/kill applications via Spark UI[4][10].

Key Mitigations from Search Results

  • Enable spark.acls.enable: Enforce UI/API access controls [4][10]
  • Use Apache Ranger integration: Centralize row/column policies [2][7]
  • Apply GRANT SELECT/MODIFY: Restrict file/system operations [3][5]
  • Validate UDFs/code execution: Prevent privilege escalation [6][10]

Disabling permission validation fundamentally undermines Spark's security architecture, making systems vulnerable to both internal misuse and external attacks[6][10]. Organizations should instead strengthen validation through granular ACLs and integration with enterprise security frameworks like Ranger[2][7].



Security Issue: Table Permission Disabled

  • Disabling table permission validation can expose sensitive data, allow privilege escalation, and violate compliance requirements.
  • Ensure that this configuration is intentionally chosen and that adequate compensatory controls are in place.
  • If table access control is required, reset the default to true.

File: spark/src/main/scala/ai/chronon/spark/Analyzer.scala (Line 278)

                  validateTablePermission: Boolean = false,
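
As a side note on the mitigation list above, a minimal sketch of enabling Spark ACLs; the view and modify ACL values are placeholders:

import org.apache.spark.SparkConf

// Enforce UI/API access controls instead of leaving access unvalidated.
val conf = new SparkConf()
  .set("spark.acls.enable", "true")
  .set("spark.ui.view.acls", "data-readers") // placeholder view ACL
  .set("spark.modify.acls", "data-admins")   // placeholder modify ACL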

Comment on lines 48 to 176
  fetcher_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}

    steps:
      - uses: actions/checkout@v4

      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Fetcher tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:fetcher_test
  join_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}

    steps:
      - uses: actions/checkout@v4

      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Join tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:join_test
  groupby_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}

    steps:
      - uses: actions/checkout@v4

      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run GroupBy tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:groupby_test
  analyzer_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}

    steps:
      - uses: actions/checkout@v4

      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Analyzer tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:analyzer_test
  streaming_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}

    steps:
      - uses: actions/checkout@v4

      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Streaming tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:streaming_test

⚠️ Potential issue

Use standard GitHub Actions runner labels.

The runner label ubuntu-8_cores-32_gb is not a standard GitHub-hosted runner label. Consider using standard labels like ubuntu-latest-8-cores.

Apply this diff to fix the runner labels:

-    runs-on: ubuntu-8_cores-32_gb
+    runs-on: ubuntu-latest-8-cores


🛠️ Refactor suggestion

Add newline at end of file and document runner labels.

The workflow changes look good but need some improvements:

  1. Add a newline at the end of the file
  2. Document the custom runner label ubuntu-8_cores-32_gb in actionlint.yaml

Apply this diff to add the newline:

            //spark:streaming_test
+
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fetcher_tests:
  runs-on: ubuntu-8_cores-32_gb
  container:
    image: ghcr.io/${{ github.repository }}-ci:latest
    credentials:
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
  defaults:
    run:
      working-directory: ${{ github.workspace }}
  steps:
    - uses: actions/checkout@v4
    - name: Setup Bazel cache credentials
      run: |
        echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
    - name: Run Fetcher tests
      run: |
        bazel test \
          --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
          --google_credentials=bazel-cache-key.json \
          --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
          //spark:fetcher_test
join_tests:
  runs-on: ubuntu-8_cores-32_gb
  container:
    image: ghcr.io/${{ github.repository }}-ci:latest
    credentials:
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
  defaults:
    run:
      working-directory: ${{ github.workspace }}
  steps:
    - uses: actions/checkout@v4
    - name: Setup Bazel cache credentials
      run: |
        echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
    - name: Run Join tests
      run: |
        bazel test \
          --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
          --google_credentials=bazel-cache-key.json \
          --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
          //spark:join_test
groupby_tests:
  runs-on: ubuntu-8_cores-32_gb
  container:
    image: ghcr.io/${{ github.repository }}-ci:latest
    credentials:
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
  defaults:
    run:
      working-directory: ${{ github.workspace }}
  steps:
    - uses: actions/checkout@v4
    - name: Setup Bazel cache credentials
      run: |
        echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
    - name: Run GroupBy tests
      run: |
        bazel test \
          --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
          --google_credentials=bazel-cache-key.json \
          --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
          //spark:groupby_test
analyzer_tests:
  runs-on: ubuntu-8_cores-32_gb
  container:
    image: ghcr.io/${{ github.repository }}-ci:latest
    credentials:
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
  defaults:
    run:
      working-directory: ${{ github.workspace }}
  steps:
    - uses: actions/checkout@v4
    - name: Setup Bazel cache credentials
      run: |
        echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
    - name: Run Analyzer tests
      run: |
        bazel test \
          --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
          --google_credentials=bazel-cache-key.json \
          --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
          //spark:analyzer_test
streaming_tests:
  runs-on: ubuntu-8_cores-32_gb
  container:
    image: ghcr.io/${{ github.repository }}-ci:latest
    credentials:
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
  defaults:
    run:
      working-directory: ${{ github.workspace }}
  steps:
    - uses: actions/checkout@v4
    - name: Setup Bazel cache credentials
      run: |
        echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
    - name: Run Streaming tests
      run: |
        bazel test \
          --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
          --google_credentials=bazel-cache-key.json \
          --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
          //spark:streaming_test
🧰 Tools
🪛 actionlint (1.7.4)

49-49: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


75-75: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


101-101: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


127-127: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


153-153: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🪛 YAMLlint (1.35.1)

[error] 176-176: no new line character at the end of file

(new-line-at-end-of-file)

@piyush-zlai piyush-zlai force-pushed the piyush/support_beacon_top_deser branch from 2dedcc2 to deb87ba on February 14, 2025 20:07
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (1)

58-60: Verify MaxFields value aligns with Beacon top requirements.

The PR objective mentions 374 fields in Beacon top. While 1000 is sufficient, consider setting it closer to the actual requirement.

Also applies to: 75-75
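
For reference, a rough sketch of how such a limit and the accompanying codegen bump could be wired up is shown below. The object and method names are illustrative assumptions; spark.sql.codegen.maxFields (default 100), exposed as SQLConf.WHOLESTAGE_MAX_NUM_FIELDS, is the Spark setting being raised.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

object CatalystUtilSketch {
  // Upper bound on fields allowed through whole-stage codegen.
  // Beacon top currently has 374 fields; 1000 leaves headroom for growth.
  val MaxFields: Int = 1000

  // Build a local session with the codegen field limit raised from the
  // default of 100 up to MaxFields.
  def buildSession(): SparkSession =
    SparkSession
      .builder()
      .master("local[*]")
      .config(SQLConf.WHOLESTAGE_MAX_NUM_FIELDS.key, MaxFields.toString)
      .getOrCreate()
}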

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 2dedcc2 and deb87ba.

📒 Files selected for processing (3)
  • flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/CatalystUtil.scala (5 hunks)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala
  • flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: non_spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (3)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (3)

36-36: LGTM: SQLConf import added.

Required for accessing SQL configuration settings.


143-144: LGTM: Improved SQL clause readability.

Wrapping clauses in parentheses enhances SQL readability.
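
A small illustration of why the wrapping matters when clauses are joined programmatically (the clause strings are hypothetical, not taken from this diff):

// Wrapping each clause before joining with AND avoids precedence surprises
// when an individual clause itself contains OR.
val clauses = Seq("user_id IS NOT NULL", "event = 'click' OR event = 'view'")
val whereSql = clauses.map(c => s"($c)").mkString(" AND ")
// => "(user_id IS NOT NULL) AND (event = 'click' OR event = 'view')"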


206-211: LGTM: Early validation prevents codegen issues.

Fail-fast approach prevents incorrect code generation for large schemas.

Run this to verify the validation:

#!/bin/bash
# Search for other schema size validations
rg -A 2 "Too many fields"
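
As a hedged illustration of the kind of fail-fast guard being praised here (the actual method name, error message, and where the field count comes from in CatalystUtil may differ):

import org.apache.spark.sql.types.StructType

// Reject schemas that exceed the bumped codegen limit up front, instead of
// letting whole-stage codegen silently emit incorrect code.
def assertCodegenFieldLimit(schema: StructType, maxFields: Int = 1000): Unit =
  require(
    schema.fields.length <= maxFields,
    s"Too many fields (${schema.fields.length}) for whole-stage codegen; limit is $maxFields"
  )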

@piyush-zlai piyush-zlai merged commit c2bfcdc into main Feb 14, 2025
6 checks passed
@piyush-zlai piyush-zlai deleted the piyush/support_beacon_top_deser branch February 14, 2025 20:25
kumar-zlai pushed a commit that referenced this pull request Apr 25, 2025
… top (#380)

## Summary
While trying to read the updated beacon top topic we hit issues as the
number of avro fields is greater than the Spark codegen limit default of
100. Thanks to this the wholestage codegen code is incorrect and we
either end up with segfaults (unit tests) or garbled events (prod flink
jobs). This PR bumps the limit to allow us to read beacon top (374
fields) as well as adds an assert in Catalyst util's whole stage code
gen code to fail if we encounter this again in the future for a higher
number of fields than our current bumped limit.

## Checklist
- [X] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced data processing robustness with improved handling and early
error detection for large schemas.
  - Refined SQL query formatting for clearer logical conditions.

- **Tests**
  - Added a new validation for large schema deserialization.
  - Updated test definitions to improve structure and readability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kumar-zlai pushed a commit that referenced this pull request Apr 29, 2025
… top (#380)

chewy-zlai pushed a commit that referenced this pull request May 15, 2025
… top (#380)

chewy-zlai pushed a commit that referenced this pull request May 15, 2025
… top (#380)

chewy-zlai pushed a commit that referenced this pull request May 16, 2025
… top (#380)
