
Conversation

Contributor

@piyush-zlai piyush-zlai commented Feb 14, 2025

Summary

While trying to read the updated beacon top topic, we hit issues because the number of Avro fields exceeds Spark's default codegen field limit of 100. As a result, the whole-stage codegen output is incorrect and we end up with either segfaults (unit tests) or garbled events (prod Flink jobs). This PR bumps the limit so we can read beacon top (374 fields), and adds an assert in CatalystUtil's whole-stage codegen path so we fail fast if we ever encounter a schema with more fields than the bumped limit.
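
A minimal sketch of the two pieces described above, assuming the limit in question is Spark's spark.sql.codegen.maxFields setting (default 100); the object name and wiring are illustrative, not the exact CatalystUtil diff:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object CodegenFieldLimit {
  // Headroom above beacon top's 374 fields.
  val MaxFields: Int = 1000

  // Raise Spark's whole-stage codegen field limit (default 100) on the session.
  def configure(spark: SparkSession): Unit =
    spark.conf.set("spark.sql.codegen.maxFields", MaxFields.toString)

  // Fail fast instead of generating incorrect codegen output for oversized schemas.
  def validate(schema: StructType): Unit =
    require(
      schema.fields.length <= MaxFields,
      s"Too many fields in input schema. We support a max of: $MaxFields. Schema: ${schema.simpleString}"
    )
}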

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Enhanced data processing robustness with improved handling and early error detection for large schemas.
    • Refined SQL query formatting for clearer logical conditions.
  • Tests

    • Added a new validation for large schema deserialization.
    • Updated test definitions to improve structure and readability.

Contributor

coderabbitai bot commented Feb 14, 2025

Walkthrough

This pull request enhances Avro deserialization testing and CatalystUtil functionality. A new test validates the handling of large Avro schemas using added Spark SQL and Avro-related imports. In CatalystUtil, a constant (MaxFields = 1000) is introduced with an early field count check and updated SQL clause formatting. Additionally, CatalystUtil test cases are refactored to utilize the ScalaTest DSL style.

Changes

  • flink/.../AvroDeSerializationSupportSpec.scala: Added new imports (AvroConversions, CatalystUtil, Row, ExpressionEncoder) and a test case "successfully deser large avro schemas" validating large Avro schema deserialization and SQL filtering.
  • online/.../CatalystUtil.scala, online/.../CatalystUtilTest.scala: Introduced MaxFields = 1000 to enforce field limits with early checks; modified SQL clause formatting in CatalystUtil; refactored test cases in CatalystUtilTest to the ScalaTest DSL style.
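
For readers unfamiliar with the ScalaTest DSL style mentioned above, a generic illustration follows; the suite, subject, and assertions are hypothetical, not the actual CatalystUtilTest cases:

import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class ExampleDslSpec extends AnyFlatSpec with Matchers {
  private val maxFields = 1000

  "A schema field-count check" should "accept schemas under the limit" in {
    374 should be <= maxFields
  }

  it should "flag schemas over the limit" in {
    (1200 <= maxFields) shouldBe false
  }
}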

Sequence Diagram(s)

sequenceDiagram
    participant T as Test Framework
    participant ADS as AvroDeserializationSupport
    participant CU as CatalystUtil
    T->>ADS: Invoke build(schema, payload)
    ADS->>CU: Forward Spark row for processing
    CU-->>T: Return processed/filtered row result


Suggested reviewers

  • nikhil-zlai
  • varant-zlai

Poem

In our code the tests now gleam,
Avro and Catalyst join the team.
Limits check fields with utmost care,
SQL clauses wrapped, precise and fair.
Code leaps ahead with spark and theme 🚀.

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (1)

74-103: Test could be more comprehensive.

Consider:

  1. Adding positive test case where event matches the where clause.
  2. Moving schema and payload to test resources for better maintainability.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between a493d35 and 4fb28d3.

📒 Files selected for processing (3)
  • flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/CatalystUtil.scala (5 hunks)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: non_spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (3)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (3)

58-59: LGTM! Good practice defining MaxFields as a constant.

The value of 1000 provides sufficient headroom above the current 374 fields in beacon top.

Also applies to: 75-75


206-211: LGTM! Early validation prevents runtime issues.

Good error message that includes the schema details for debugging.


143-144: LGTM! Wrapping clauses in parentheses prevents operator precedence issues.
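
A quick illustration, not from the PR, of the precedence issue the wrapping guards against: in SQL, AND binds tighter than OR, so composing user-supplied clauses without parentheses can change their meaning. The clause strings below are hypothetical.

// Hypothetical clauses, for illustration only.
val clauseA = "status = 'active' OR status = 'trial'"
val clauseB = "country = 'US'"

// Without parentheses: parsed as  status = 'active' OR (status = 'trial' AND country = 'US')
val unsafeWhere = s"$clauseA AND $clauseB"

// With parentheses: the intended  (status = 'active' OR status = 'trial') AND country = 'US'
val safeWhere = s"($clauseA) AND ($clauseB)"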

Collaborator

@tchow-zlai tchow-zlai left a comment


awesome!

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🔭 Outside diff range comments (1)
spark/src/test/scala/ai/chronon/spark/test/join/LabelJoinTest.scala (1)

399-401: 🛠️ Refactor suggestion

Fix potential flaky test by using a fixed timestamp.

Using System.currentTimeMillis() can make tests non-deterministic. Consider using a fixed timestamp for reproducible tests.

-    val now = System.currentTimeMillis()
-    val today = tableUtils.partitionSpec.at(now)
+    val fixedTimestamp = 1677628800000L  // e.g., 2023-03-01 00:00:00 UTC
+    val today = tableUtils.partitionSpec.at(fixedTimestamp)
🧹 Nitpick comments (15)
spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (1)

65-71: Consider removing redundant partition column usage.

The partition column is used twice: once in column creation and again in the gen call, which might lead to confusion.

-    gen(spark, columns :+ Column(partitionColumnString, StringType, partitions), count, partitionColumn)
+    gen(spark, columns :+ Column(partitionColumnString, StringType, partitions), count)
spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (2)

55-86: Add more descriptive assertion messages.

The assertion messages in these test cases could be more descriptive to help debug failures.

-    assertEquals(0, diff.count())
+    assertEquals(0, diff.count(), s"Found ${diff.count()} mismatched rows between actual and expected results")

Also applies to: 88-137, 139-206, 207-267


397-417: Consider adding randomization to test data.

The test data generation in createTestSource could benefit from more randomization to catch edge cases.

-    val sourceSchema = List(
-      Column("user", StringType, 10000),
-      Column("item", StringType, 100),
-      Column("time_spent_ms", LongType, 5000),
-      Column("price", DoubleType, 100)
-    )
+    val sourceSchema = List(
+      Column("user", StringType, scala.util.Random.nextInt(20000) + 1000),
+      Column("item", StringType, scala.util.Random.nextInt(200) + 50),
+      Column("time_spent_ms", LongType, scala.util.Random.nextInt(10000) + 1000),
+      Column("price", DoubleType, scala.util.Random.nextInt(200) + 50)
+    )
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (1)

206-211: Consider enhancing the error message.

The validation logic is correct, but the error message could be more actionable.

-          s"Too many fields in input schema. We support a max of: ${CatalystUtil.MaxFields}. Schema: ${inputSparkSchema.simpleString}"
+          s"Input schema exceeds maximum field limit of ${CatalystUtil.MaxFields}. Consider splitting the schema. Current schema: ${inputSparkSchema.simpleString}"
flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (1)

77-103: Consider adding more assertions.

The test validates the core functionality but could be more thorough.

Add assertions to verify:

  1. Schema field count
  2. Successful deserialization of specific fields
  3. CatalystUtil's handling of the large schema
 assert(result.isEmpty) // no rows should be returned as the event is not in the where clause
+assert(chrononSchema.fields.length > 100, "Should be a large schema")
+assert(sparkRow != null, "Should deserialize successfully")
+assert(catalystInternalRow != null, "Should process through CatalystUtil")
spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (1)

215-294: Add unit tests for SQL helper methods.

Complex SQL queries in computeSimpleAverageThroughSql and computeLastThroughSql should be unit tested to ensure correctness.

Would you like me to help create unit tests for these SQL helper methods?

Also applies to: 300-409

spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2)

98-111: Prefix logic
Use stripSuffix to simplify trailing slash handling.

-    } else if (barePrefix.endsWith("/")) {
-      Some(barePrefix)
-    } else {
-      Some(barePrefix + "/")
-    }
+    Some(barePrefix.stripSuffix("/") + "/")

817-818: Temporary fix
Offer final approach later.

spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (2)

27-28: Consider using a more efficient fetcher initialization.

Lazy transient fields could be replaced with a factory method for better testability.


38-42: Deliberate key mistyping needs documentation.

Add a comment explaining why keys are deliberately mistyped and the expected behavior.

spark/BUILD.bazel (1)

110-118: Consider adding size tags to test suites.

Like the "large" tag in the main test suite, other suites might benefit from size tags.

spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala (1)

67-69: Consider extracting partition column renaming logic.

Repeated renaming operations could be moved to a utility method.

+  private def withDefaultPartition(df: DataFrame, partitionCol: String): DataFrame = {
+    if (partitionCol == "ds") df else df.withColumnRenamed(partitionCol, "ds")
+  }
spark/src/main/scala/ai/chronon/spark/Extensions.scala (1)

330-331: Consider handling null partition columns more explicitly.

The partitionBy(null: _*) usage could be made more explicit about its intention.

Consider adding a comment explaining why null partitioning is used, or create a helper method with a descriptive name:

+  private def withoutPartitioning[T](writer: DataFrameWriter[T]): DataFrameWriter[T] = {
+    writer.partitionBy(null: _*)
+  }

Also applies to: 340-341

.github/workflows/test_scala_spark.yaml (2)

176-176: Add newline at end of file.

Add a newline character at the end of the file to comply with POSIX standards.

🧰 Tools
🪛 YAMLlint (1.35.1)

[error] 176-176: no new line character at the end of file

(new-line-at-end-of-file)


48-176: Consider using job templates to reduce duplication.

The new test jobs share identical configuration. Consider using job templates or composite actions to reduce duplication.

🧰 Tools
🪛 actionlint (1.7.4)

49-49, 75-75, 101-101, 127-127, 153-153: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🪛 YAMLlint (1.35.1)

[error] 176-176: no new line character at the end of file

(new-line-at-end-of-file)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 4fb28d3 and 2dedcc2.

📒 Files selected for processing (36)
  • .bazelrc (1 hunks)
  • .github/workflows/test_scala_spark.yaml (1 hunks)
  • api/py/test/sample/staging_queries/quickstart/purchases.py (1 hunks)
  • api/src/main/scala/ai/chronon/api/Builders.scala (2 hunks)
  • api/thrift/api.thrift (2 hunks)
  • flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/CatalystUtil.scala (5 hunks)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala (2 hunks)
  • spark/BUILD.bazel (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/Analyzer.scala (4 hunks)
  • spark/src/main/scala/ai/chronon/spark/BootstrapInfo.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala (4 hunks)
  • spark/src/main/scala/ai/chronon/spark/GroupBy.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/Join.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinBase.scala (5 hunks)
  • spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/TableUtils.scala (14 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/TaggedFilterSuite.scala (0 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/AnalyzerTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala (3 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/ChainingFetcherTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/JavaFetcherTest.java (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByUploadTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/FeatureWithLabelJoinTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/JoinTest.scala (30 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/JoinUtilsTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/LabelJoinTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/streaming/AvroTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/streaming/KafkaStreamBuilderTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/streaming/StreamingTest.scala (3 hunks)
💤 Files with no reviewable changes (1)
  • spark/src/test/scala/ai/chronon/spark/test/TaggedFilterSuite.scala
✅ Files skipped from review due to trivial changes (9)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/JavaFetcherTest.java
  • spark/src/test/scala/ai/chronon/spark/test/streaming/KafkaStreamBuilderTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/FeatureWithLabelJoinTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/streaming/AvroTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/ChainingFetcherTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByUploadTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/AnalyzerTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/JoinUtilsTest.scala
🚧 Files skipped from review as they are similar to previous changes (1)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala

⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: streaming_tests
  • GitHub Check: groupby_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: spark_tests
  • GitHub Check: join_tests
  • GitHub Check: bazel_config_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: fetcher_tests
  • GitHub Check: non_spark_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (76)
spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (3)

19-20: LGTM!

Import consolidation improves code organization.


195-201: LGTM!

The partition column drop prevents duplicated column errors during joins.


268-269: LGTM!

Variable renaming enhances code clarity.

spark/src/test/scala/ai/chronon/spark/test/streaming/StreamingTest.scala (3)

17-17: LGTM!

Package move to streaming-specific location improves organization.


22-22: LGTM!

Import changes maintain clean dependencies while preserving access to required functionality.

Also applies to: 25-25


54-54: LGTM!

Consistent use of StreamingTest.buildInMemoryKvStore() improves clarity.

Also applies to: 113-113

spark/src/test/scala/ai/chronon/spark/test/join/LabelJoinTest.scala (1)

17-22: LGTM! Package organization and import consolidation.

The package move and import consolidation improve code organization.

spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (2)

41-45: LGTM! Clean addition of partition column parameter.

Maintains backward compatibility while adding flexibility.


52-61: LGTM! Consistent partition column handling.

Aligns with gen method changes.

spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (1)

17-43: LGTM! Well-organized imports.

Package and import statements are well-structured and properly consolidated.

online/src/main/scala/ai/chronon/online/CatalystUtil.scala (3)

36-36: LGTM!

Import required for accessing Spark SQL configuration settings.


58-59: LGTM!

Constant value (1000) provides sufficient headroom above Beacon top's 374 fields.


75-75: LGTM!

Properly configures Spark's codegen field limit.

flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2)

3-5: LGTM!

All imports are required for the new test functionality.

Also applies to: 13-14


74-76: LGTM!

Clear explanation of test's purpose and scope.

spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (3)

17-17: LGTM! Package organization and import consolidation.

Package moved to streaming subpackage and imports consolidated for better organization.

Also applies to: 22-27


432-859: LGTM! Comprehensive test coverage.

Test cases thoroughly cover various scenarios with clear documentation and assertions.


36-36:

❓ Verification inconclusive

Verify impact of removing TaggedFilterSuite.

Removal of TaggedFilterSuite mixin might affect test filtering capabilities.


🏁 Script executed:

#!/bin/bash
# Check if TaggedFilterSuite is used elsewhere and if there's a replacement
rg -A 5 "TaggedFilterSuite"

Length of output: 1451


Impact of TaggedFilterSuite Removal

  • MutationsTest now extends only AnyFlatSpec while other tests (e.g., in CatalystUtilHiveUDFTest.scala) still use TaggedFilterSuite.
  • Confirm that excluding TaggedFilterSuite in the streaming tests is intentional and that any required test filtering is handled appropriately.
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (24)

29-34: Imports for creation statuses
Looks good.


36-36: Additional imports
Usage seems correct.


82-82: Small mode cutoff change
No issues.


202-204: Partition column parameter
Flexibility is improved.


208-208: Dynamic partition columns
Implementation looks good.


302-303: autoExpand parameter
Returning TableCreationStatus is clear.


306-306: Creation status assignment
Logic is consistent.


310-313: Generate table builder
Valid approach.


321-321: TableAlreadyExists
Gracefully handled.


327-328: Check existing table
Redundant but harmless.


331-342: Handling table props & autoExpand
Pattern matching is neat.


356-356: Blank line
No concerns.


359-360: Column rearrangement
Keeps partition columns last.


364-364: createTable invocation
Fine.


383-386: No operation for WithInitialData
Confirm if omission is intended.


434-434: Whitespace
No issues.


440-440: Another createTable call
Seems consistent.


442-445: No action on WithInitialData
Check if skipping is correct.


486-486: Partition columns param
Helpful extension.


835-835: New parameter in scanDf
Straightforward.


837-838: Effective partition column
Works neatly.


844-844: scanDfBase usage
Looks correct.


846-850: Conditional rename
Smart fallback.


863-866: New table creation statuses
Clear usage.

.bazelrc (1)

7-7: Test timeout raised
Validate necessity.

api/py/test/sample/staging_queries/quickstart/purchases.py (4)

1-14: License header
Approved.


15-15: Imports
Looks fine.


19-29: Query definition
Placeholders look correct.


31-38: StagingQuery
Neatly structured.

spark/BUILD.bazel (1)

98-107: LGTM! Well-organized test suite structure.

Clear separation of test categories improves maintainability.

spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala (1)

47-48: LGTM! Good default value for partition column.

Maintains backward compatibility while adding flexibility.

api/src/main/scala/ai/chronon/api/Builders.scala (1)

49-63: LGTM! Clean addition of partition column support.

The implementation maintains backward compatibility and follows the builder pattern consistently.

spark/src/main/scala/ai/chronon/spark/BootstrapInfo.scala (1)

187-191: LGTM! Improved partition column handling.

The change to use effectivePartitionColumn aligns with the flexible partition column support.

spark/src/main/scala/ai/chronon/spark/Extensions.scala (2)

387-392: LGTM! Added SourceSparkOps for partition column handling.

Clean implementation of partition column retrieval with fallback to tableUtils.


394-397: LGTM! Added QuerySparkOps for effective partition column.

Well-structured implementation with proper null handling.

api/thrift/api.thrift (2)

19-20: LGTM! Added partitionColumn to Query struct.

Clean addition of optional field for partition column support.


53-57: LGTM! Added partitionColumn to StagingQuery struct.

Well-documented field with clear purpose.

spark/src/main/scala/ai/chronon/spark/JoinBase.scala (3)

54-54: LGTM! Added implicit TableUtils.

Clean addition of implicit value for better code organization.


173-175: LGTM! Updated unfilledRanges with partition column.

Consistent use of effectivePartitionColumn across join operations.

Also applies to: 488-490


287-288: LGTM! Improved variable naming.

Renamed variable for better clarity without changing functionality.

spark/src/main/scala/ai/chronon/spark/Join.scala (1)

518-518: LGTM! Using query-specific partition column.

The change correctly uses the effective partition column from the query instead of the default partition column.

spark/src/main/scala/ai/chronon/spark/Analyzer.scala (1)

316-319: LGTM! Using query-specific partition column.

The change correctly uses the effective partition column from the query for unfilled ranges.

spark/src/main/scala/ai/chronon/spark/GroupBy.scala (1)

629-629: LGTM! Using query-specific partition column.

The change correctly uses the partition column from the query for meta columns.

spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (1)

62-62: Verify why this test is being ignored.

The test for metadata store is being disabled. Please ensure there's a tracking issue to re-enable it.

spark/src/test/scala/ai/chronon/spark/test/join/JoinTest.scala (12)

125-126: LGTM! Reduced test data size.

Reducing entity count from 1000 to 100 improves test execution time while maintaining test coverage.

Also applies to: 173-174


438-440: LGTM! Added partition column support.

The addition of partition column support enhances test coverage for different partitioning scenarios.


429-551: LGTM! Added test for different partition columns.

New test case validates join behavior with different partition columns.


125-126: LGTM! Reduced test data size and added partition column support.

The changes reduce test data size from 1000 to 100 records and add partition column support, which aligns with addressing the Avro field limit issues.

Also applies to: 173-174, 318-319, 340-341, 356-357, 439-441, 561-562, 578-579, 614-615, 684-686, 708-709, 1034-1035, 1070-1071, 1136-1137, 1156-1157, 1171-1172, 1322-1323, 1335-1336


17-17: LGTM! Package organization improvement.

Moving join tests to a dedicated package improves code organization.

Also applies to: 33-33, 34-37


125-126: LGTM! Test data size optimization.

Reducing test data size improves test execution time while maintaining test coverage.


173-174: LGTM! Improved partition column handling.

Making partition column configurable and consistent improves flexibility.

Also applies to: 225-232


429-551: LGTM! Comprehensive partition column test coverage.

New test case thoroughly validates different partition column scenarios.


318-318: LGTM! Consistent test data optimization.

Test data size reduction is consistently applied across all test cases.

Also applies to: 340-340, 463-463


859-859: LGTM! Consistent test data optimization.

Test data size reduction is consistently applied to name and user test cases.

Also applies to: 875-875, 883-883


614-614: LGTM! Consistent test data optimization.

Test data size reduction is consistently applied to view events test cases.

Also applies to: 635-635, 685-685, 708-708, 1034-1034, 1070-1070, 1136-1136, 1156-1156, 1171-1171


1322-1322: LGTM! Consistent test data optimization.

Test data size reduction is consistently applied to selected join parts test.

Also applies to: 1335-1335

.github/workflows/test_scala_spark.yaml (2)

71-72: LGTM! Increased memory allocation.

Increased JVM memory settings (-Xmx16G -Xms8G) for component-specific test jobs will help prevent out-of-memory errors.

Also applies to: 97-98, 123-124, 149-150, 175-176


71-71: LGTM! Increased memory limits for test jobs.

The increased memory limits (-Xmx16G -Xms8G) will help handle large Avro schemas.

Also applies to: 97-97, 123-123, 149-149, 175-175

Comment on lines 65 to 70
val result = Await.result(future, Duration(10000, SECONDS)) // todo: change back to millis
val latency = System.currentTimeMillis() - start
latencySum += latency
latencyCount += 1
result
}

💡 Verification agent

🧩 Analysis chain

Verify timeout duration.

The TODO comment suggests changing from seconds to milliseconds.


🏁 Script executed:

#!/bin/bash
# Search for similar timeout patterns in tests
rg -A 2 'Duration\([0-9]+.*SECONDS\)' 

Length of output: 1767


Update Timeout Duration Unit:
In spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (lines 65–70), the timeout is specified as Duration(10000, SECONDS) with a TODO to revert to milliseconds. Since other tests (e.g., in ExternalSourcesTest.scala) use a 10‑second timeout, please verify if the intended behavior is 10 seconds (i.e. 10000 ms) and update accordingly (e.g. change to Duration(10000, MILLISECONDS)).
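
If the intent is indeed a 10-second timeout, a minimal sketch of the corrected call follows; the Future here is a placeholder for the actual fetcher call:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.{Duration, MILLISECONDS}

val future: Future[Int] = Future(42) // placeholder for the fetch request
val result = Await.result(future, Duration(10000, MILLISECONDS)) // 10 s, expressed in ms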

 def analyzeJoin(joinConf: api.Join,
                 skewDetection: Boolean = false,
-                validateTablePermission: Boolean = true,
+                validateTablePermission: Boolean = false,

💡 Verification agent

🧩 Analysis chain

Verify if table permission validation can be disabled.

Changing the default value of validateTablePermission from true to false might affect security if table permission validation is needed.


🌐 Web query:

What are the security implications of disabling table permission validation in Apache Spark?

💡 Result:

Disabling table permission validation in Apache Spark introduces significant security risks by bypassing critical access control mechanisms. Here's a breakdown of the implications based on industry practices and documented vulnerabilities:

1. Unauthorized Data Access

  • Full Exposure of Sensitive Data: Without permission checks, any user could query/modify tables containing PII, financial records, or proprietary data[2][6][7].
  • Bypass of Row/Column-Level Security: Tools like Apache Ranger and Spark-Authorizer enforce granular controls (e.g., masking phone numbers), which become ineffective[2][7].

2. Privilege Escalation Risks

  • Exploitable Code Execution: Attackers could inject malicious UDFs or modify query plans to execute arbitrary commands[6][10].
  • Credential Theft: Executors with unrestricted access might leak credentials stored in memory or configurations[6].

3. Compliance Violations

  • GDPR/HIPAA Non-Compliance: Loss of audit trails for data access makes regulatory compliance impossible[7][10].
  • Insecure Defaults: Folders/files created without inherited permissions (e.g., 777 on ADLS Gen1) expose data broadly[5].

4. System Integrity Compromises

  • SQL Injection Vulnerabilities: Malicious actors could alter query logic via unvalidated inputs[8].
  • UI Exploits: Disabled ACLs enable unauthorized users to view/kill applications via Spark UI[4][10].

Key Mitigations from Search Results

  • Enable spark.acls.enable: Enforce UI/API access controls [4][10]
  • Use Apache Ranger integration: Centralize row/column policies [2][7]
  • Apply GRANT SELECT/MODIFY: Restrict file/system operations [3][5]
  • Validate UDFs/code execution: Prevent privilege escalation [6][10]

Disabling permission validation fundamentally undermines Spark's security architecture, making systems vulnerable to both internal misuse and external attacks[6][10]. Organizations should instead strengthen validation through granular ACLs and integration with enterprise security frameworks like Ranger[2][7].



Security Issue: Table Permission Disabled

  • Disabling table permission validation can expose sensitive data, allow privilege escalation, and violate compliance requirements.
  • Ensure that this configuration is intentionally chosen and that adequate compensatory controls are in place.
  • If table access control is required, reset the default to true.

File: spark/src/main/scala/ai/chronon/spark/Analyzer.scala (Line 278)

                  validateTablePermission: Boolean = false,
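
As a side note on the mitigation list above, a minimal sketch of enabling Spark ACLs; the view and modify ACL values are placeholders:

import org.apache.spark.SparkConf

// Enforce UI/API access controls instead of leaving access unvalidated.
val conf = new SparkConf()
  .set("spark.acls.enable", "true")
  .set("spark.ui.view.acls", "data-readers") // placeholder view ACL
  .set("spark.modify.acls", "data-admins")   // placeholder modify ACL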

Comment on lines 48 to 176
  fetcher_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}

    steps:
      - uses: actions/checkout@v4

      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Fetcher tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:fetcher_test
  join_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}

    steps:
      - uses: actions/checkout@v4

      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Join tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:join_test
  groupby_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}

    steps:
      - uses: actions/checkout@v4

      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run GroupBy tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:groupby_test
  analyzer_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}

    steps:
      - uses: actions/checkout@v4

      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Analyzer tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:analyzer_test
  streaming_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}

    steps:
      - uses: actions/checkout@v4

      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Streaming tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:streaming_test

⚠️ Potential issue

Use standard GitHub Actions runner labels.

The runner label ubuntu-8_cores-32_gb is not a standard GitHub-hosted runner label. Consider using standard labels like ubuntu-latest-8-cores.

Apply this diff to fix the runner labels:

-    runs-on: ubuntu-8_cores-32_gb
+    runs-on: ubuntu-latest-8-cores


🛠️ Refactor suggestion

Add newline at end of file and document runner labels.

The workflow changes look good but need some improvements:

  1. Add a newline at the end of the file
  2. Document the custom runner label ubuntu-8_cores-32_gb in actionlint.yaml

Apply this diff to add the newline:

            //spark:streaming_test
+
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fetcher_tests:
  runs-on: ubuntu-8_cores-32_gb
  container:
    image: ghcr.io/${{ github.repository }}-ci:latest
    credentials:
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
  defaults:
    run:
      working-directory: ${{ github.workspace }}
  steps:
    - uses: actions/checkout@v4
    - name: Setup Bazel cache credentials
      run: |
        echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
    - name: Run Fetcher tests
      run: |
        bazel test \
          --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
          --google_credentials=bazel-cache-key.json \
          --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
          //spark:fetcher_test
join_tests:
  runs-on: ubuntu-8_cores-32_gb
  container:
    image: ghcr.io/${{ github.repository }}-ci:latest
    credentials:
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
  defaults:
    run:
      working-directory: ${{ github.workspace }}
  steps:
    - uses: actions/checkout@v4
    - name: Setup Bazel cache credentials
      run: |
        echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
    - name: Run Join tests
      run: |
        bazel test \
          --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
          --google_credentials=bazel-cache-key.json \
          --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
          //spark:join_test
groupby_tests:
  runs-on: ubuntu-8_cores-32_gb
  container:
    image: ghcr.io/${{ github.repository }}-ci:latest
    credentials:
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
  defaults:
    run:
      working-directory: ${{ github.workspace }}
  steps:
    - uses: actions/checkout@v4
    - name: Setup Bazel cache credentials
      run: |
        echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
    - name: Run GroupBy tests
      run: |
        bazel test \
          --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
          --google_credentials=bazel-cache-key.json \
          --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
          //spark:groupby_test
analyzer_tests:
  runs-on: ubuntu-8_cores-32_gb
  container:
    image: ghcr.io/${{ github.repository }}-ci:latest
    credentials:
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
  defaults:
    run:
      working-directory: ${{ github.workspace }}
  steps:
    - uses: actions/checkout@v4
    - name: Setup Bazel cache credentials
      run: |
        echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
    - name: Run Analyzer tests
      run: |
        bazel test \
          --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
          --google_credentials=bazel-cache-key.json \
          --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
          //spark:analyzer_test
streaming_tests:
  runs-on: ubuntu-8_cores-32_gb
  container:
    image: ghcr.io/${{ github.repository }}-ci:latest
    credentials:
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
  defaults:
    run:
      working-directory: ${{ github.workspace }}
  steps:
    - uses: actions/checkout@v4
    - name: Setup Bazel cache credentials
      run: |
        echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
    - name: Run Streaming tests
      run: |
        bazel test \
          --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
          --google_credentials=bazel-cache-key.json \
          --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
          //spark:streaming_test
🧰 Tools
🪛 actionlint (1.7.4)

49-49: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


75-75: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


101-101: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


127-127: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)


153-153: label "ubuntu-8_cores-32_gb" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file

(runner-label)

🪛 YAMLlint (1.35.1)

[error] 176-176: no new line character at the end of file

(new-line-at-end-of-file)

@piyush-zlai piyush-zlai force-pushed the piyush/support_beacon_top_deser branch from 2dedcc2 to deb87ba on February 14, 2025 20:07
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (1)

58-60: Verify MaxFields value aligns with Beacon top requirements.

The PR objective mentions 374 fields in Beacon top. While 1000 is sufficient, consider setting it closer to the actual requirement.

Also applies to: 75-75
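
For reference, a rough sketch of how such a limit and the accompanying codegen bump could be wired up is shown below. The object and method names are illustrative assumptions; spark.sql.codegen.maxFields (default 100), exposed as SQLConf.WHOLESTAGE_MAX_NUM_FIELDS, is the Spark setting being raised.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

object CatalystUtilSketch {
  // Upper bound on fields allowed through whole-stage codegen.
  // Beacon top currently has 374 fields; 1000 leaves headroom for growth.
  val MaxFields: Int = 1000

  // Build a local session with the codegen field limit raised from the
  // default of 100 up to MaxFields.
  def buildSession(): SparkSession =
    SparkSession
      .builder()
      .master("local[*]")
      .config(SQLConf.WHOLESTAGE_MAX_NUM_FIELDS.key, MaxFields.toString)
      .getOrCreate()
}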

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 2dedcc2 and deb87ba.

📒 Files selected for processing (3)
  • flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2 hunks)
  • online/src/main/scala/ai/chronon/online/CatalystUtil.scala (5 hunks)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala
  • flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: non_spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (3)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (3)

36-36: LGTM: SQLConf import added.

Required for accessing SQL configuration settings.


143-144: LGTM: Improved SQL clause readability.

Wrapping clauses in parentheses enhances SQL readability.
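
A small illustration of why the wrapping matters when clauses are joined programmatically (the clause strings are hypothetical, not taken from this diff):

// Wrapping each clause before joining with AND avoids precedence surprises
// when an individual clause itself contains OR.
val clauses = Seq("user_id IS NOT NULL", "event = 'click' OR event = 'view'")
val whereSql = clauses.map(c => s"($c)").mkString(" AND ")
// => "(user_id IS NOT NULL) AND (event = 'click' OR event = 'view')"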


206-211: LGTM: Early validation prevents codegen issues.

Fail-fast approach prevents incorrect code generation for large schemas.

Run this to verify the validation:

#!/bin/bash
# Search for other schema size validations
rg -A 2 "Too many fields"
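
As a hedged illustration of the kind of fail-fast guard being praised here (the actual method name, error message, and where the field count comes from in CatalystUtil may differ):

import org.apache.spark.sql.types.StructType

// Reject schemas that exceed the bumped codegen limit up front, instead of
// letting whole-stage codegen silently emit incorrect code.
def assertCodegenFieldLimit(schema: StructType, maxFields: Int = 1000): Unit =
  require(
    schema.fields.length <= maxFields,
    s"Too many fields (${schema.fields.length}) for whole-stage codegen; limit is $maxFields"
  )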

@piyush-zlai piyush-zlai merged commit c2bfcdc into main Feb 14, 2025
6 checks passed
@piyush-zlai piyush-zlai deleted the piyush/support_beacon_top_deser branch February 14, 2025 20:25
kumar-zlai pushed a commit that referenced this pull request Apr 25, 2025
… top (#380)

## Summary
While trying to read the updated beacon top topic we hit issues as the
number of avro fields is greater than the Spark codegen limit default of
100. Thanks to this the wholestage codegen code is incorrect and we
either end up with segfaults (unit tests) or garbled events (prod flink
jobs). This PR bumps the limit to allow us to read beacon top (374
fields) as well as adds an assert in Catalyst util's whole stage code
gen code to fail if we encounter this again in the future for a higher
number of fields than our current bumped limit.

## Checklist
- [X] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update



<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Enhanced data processing robustness with improved handling and early
error detection for large schemas.
  - Refined SQL query formatting for clearer logical conditions.

- **Tests**
  - Added a new validation for large schema deserialization.
  - Updated test definitions to improve structure and readability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kumar-zlai pushed a commit that referenced this pull request Apr 29, 2025
… top (#380)

chewy-zlai pushed a commit that referenced this pull request May 15, 2025
… top (#380)

chewy-zlai pushed a commit that referenced this pull request May 15, 2025
… top (#380)

chewy-zlai pushed a commit that referenced this pull request May 16, 2025
… top (#380)
