Update Catalyst util to bump up whole stage codegen limit for Beacon top #380
Conversation
Walkthrough
This pull request enhances Avro deserialization testing and CatalystUtil functionality. A new test validates the handling of large Avro schemas using added Spark SQL and Avro-related imports. In CatalystUtil, a constant (MaxFields, set to 1000) raises the whole-stage codegen field limit.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant T as Test Framework
    participant ADS as AvroDeserializationSupport
    participant CU as CatalystUtil
    T->>ADS: Invoke build(schema, payload)
    ADS->>CU: Forward Spark row for processing
    CU-->>T: Return processed/filtered row result
```
Actionable comments posted: 0
🧹 Nitpick comments (1)
flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (1)
74-103: Test could be more comprehensive. Consider:
- Adding positive test case where event matches the where clause.
- Moving schema and payload to test resources for better maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (3)
- flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2 hunks)
- online/src/main/scala/ai/chronon/online/CatalystUtil.scala (5 hunks)
- online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: non_spark_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (3)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (3)
58-59: LGTM! Good practice defining MaxFields as a constant. The value of 1000 provides sufficient headroom above the current 374 fields in beacon top. (A configuration sketch follows this list.)
Also applies to: 75-75
206-211: LGTM! Early validation prevents runtime issues. Good error message that includes the schema details for debugging.
143-144: LGTM! Wrapping clauses in parentheses prevents operator precedence issues.
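To make the codegen limit concrete, here is a minimal, self-contained sketch (assumed setup, not the PR's actual code) of raising Spark's whole-stage codegen field cap through the `spark.sql.codegen.maxFields` conf, which is the knob a MaxFields-style constant would feed:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch (assumed setup, not the PR's code): raise the
// whole-stage codegen field cap so wide rows (e.g. beacon top's ~374
// fields) still compile into a single generated stage.
object CodegenLimitSketch {
  val MaxFields = 1000 // mirrors the constant discussed above

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("codegen-limit-sketch")
      .config("spark.sql.codegen.maxFields", MaxFields.toString)
      .getOrCreate()

    // Confirm the setting took effect.
    println(spark.conf.get("spark.sql.codegen.maxFields")) // 1000
    spark.stop()
  }
}
```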
tchow-zlai left a comment
awesome!
Force-pushed from 4fb28d3 to 2dedcc2 (Compare)
Actionable comments posted: 4
🔭 Outside diff range comments (1)
spark/src/test/scala/ai/chronon/spark/test/join/LabelJoinTest.scala (1)
399-401: 🛠️ Refactor suggestion: Fix potential flaky test by using a fixed timestamp.
Using `System.currentTimeMillis()` can make tests non-deterministic. Consider using a fixed timestamp for reproducible tests.

```diff
- val now = System.currentTimeMillis()
- val today = tableUtils.partitionSpec.at(now)
+ val fixedTimestamp = 1677628800000L // e.g., 2023-03-01 00:00:00 UTC
+ val today = tableUtils.partitionSpec.at(fixedTimestamp)
```
🧹 Nitpick comments (15)
spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (1)
65-71: Consider removing redundant partition column usage. The partition column is used twice: in column creation and in the `gen` call. This might lead to confusion.

```diff
- gen(spark, columns :+ Column(partitionColumnString, StringType, partitions), count, partitionColumn)
+ gen(spark, columns :+ Column(partitionColumnString, StringType, partitions), count)
```

spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (2)
55-86: Add more descriptive assertion messages. The assertion messages in these test cases could be more descriptive to help debug failures.

```diff
- assertEquals(0, diff.count())
+ assertEquals(0, diff.count(), s"Found ${diff.count()} mismatched rows between actual and expected results")
```

Also applies to: 88-137, 139-206, 207-267
397-417: Consider adding randomization to test data. The test data generation in `createTestSource` could benefit from more randomization to catch edge cases.

```diff
- val sourceSchema = List(
-   Column("user", StringType, 10000),
-   Column("item", StringType, 100),
-   Column("time_spent_ms", LongType, 5000),
-   Column("price", DoubleType, 100)
- )
+ val sourceSchema = List(
+   Column("user", StringType, scala.util.Random.nextInt(20000) + 1000),
+   Column("item", StringType, scala.util.Random.nextInt(200) + 50),
+   Column("time_spent_ms", LongType, scala.util.Random.nextInt(10000) + 1000),
+   Column("price", DoubleType, scala.util.Random.nextInt(200) + 50)
+ )
```

online/src/main/scala/ai/chronon/online/CatalystUtil.scala (1)
206-211: Consider enhancing the error message. The validation logic is correct, but the error message could be more actionable.

```diff
- s"Too many fields in input schema. We support a max of: ${CatalystUtil.MaxFields}. Schema: ${inputSparkSchema.simpleString}"
+ s"Input schema exceeds maximum field limit of ${CatalystUtil.MaxFields}. Consider splitting the schema. Current schema: ${inputSparkSchema.simpleString}"
```

A self-contained sketch of this width check follows.
flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (1)

77-103: Consider adding more assertions. The test validates the core functionality but could be more thorough.
Add assertions to verify:
- Schema field count
- Successful deserialization of specific fields
- CatalystUtil's handling of the large schema
```diff
 assert(result.isEmpty) // no rows should be returned as the event is not in the where clause
+assert(chrononSchema.fields.length > 100, "Should be a large schema")
+assert(sparkRow != null, "Should deserialize successfully")
+assert(catalystInternalRow != null, "Should process through CatalystUtil")
```

spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (1)
215-294: Add unit tests for SQL helper methods. Complex SQL queries in `computeSimpleAverageThroughSql` and `computeLastThroughSql` should be unit tested to ensure correctness. Would you like me to help create unit tests for these SQL helper methods?
Also applies to: 300-409
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2)
98-111: Prefix logic
Use stripSuffix to simplify trailing slash handling.

```diff
- } else if (barePrefix.endsWith("/")) {
-   Some(barePrefix)
- } else {
-   Some(barePrefix + "/")
- }
+ Some(barePrefix.stripSuffix("/") + "/")
```
817-818: Temporary fix
Offer final approach later.

spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (2)
27-28: Consider using a more efficient fetcher initialization. Lazy transient fields could be replaced with a factory method for better testability.

38-42: Deliberate key mistyping needs documentation. Add a comment explaining why keys are deliberately mistyped and the expected behavior.
spark/BUILD.bazel (1)
110-118: Consider adding size tags to test suites. Like the "large" tag in the main test suite, other suites might benefit from size tags.
spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala (1)
67-69: Consider extracting partition column renaming logic. Repeated renaming operations could be moved to a utility method.

```diff
+ private def withDefaultPartition(df: DataFrame, partitionCol: String): DataFrame = {
+   if (partitionCol == "ds") df else df.withColumnRenamed(partitionCol, "ds")
+ }
```
330-331: Consider handling null partition columns more explicitly.The
partitionBy(null: _*)usage could be made more explicit about its intention.Consider adding a comment explaining why null partitioning is used, or create a helper method with a descriptive name:
+ private def withoutPartitioning[T](writer: DataFrameWriter[T]): DataFrameWriter[T] = { + writer.partitionBy(null: _*) + }Also applies to: 340-341
.github/workflows/test_scala_spark.yaml (2)
176-176: Add newline at end of file. Add a newline character at the end of the file to comply with POSIX standards.
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 176-176: no new line character at the end of file
(new-line-at-end-of-file)
48-176: Consider using job templates to reduce duplication. The new test jobs share identical configuration. Consider using job templates or composite actions to reduce duplication.
🧰 Tools
🪛 actionlint (1.7.4)
49-49, 75-75, 101-101, 127-127, 153-153: label "ubuntu-8_cores-32_gb" is unknown. Available labels include "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", the windows-* and macos-* runners, and "self-hosted". If it is a custom label for a self-hosted runner, set the list of labels in an actionlint.yaml config file.

(runner-label)
🪛 YAMLlint (1.35.1)
[error] 176-176: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (36)
- .bazelrc (1 hunks)
- .github/workflows/test_scala_spark.yaml (1 hunks)
- api/py/test/sample/staging_queries/quickstart/purchases.py (1 hunks)
- api/src/main/scala/ai/chronon/api/Builders.scala (2 hunks)
- api/thrift/api.thrift (2 hunks)
- flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2 hunks)
- online/src/main/scala/ai/chronon/online/CatalystUtil.scala (5 hunks)
- online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala (2 hunks)
- spark/BUILD.bazel (1 hunks)
- spark/src/main/scala/ai/chronon/spark/Analyzer.scala (4 hunks)
- spark/src/main/scala/ai/chronon/spark/BootstrapInfo.scala (3 hunks)
- spark/src/main/scala/ai/chronon/spark/Extensions.scala (4 hunks)
- spark/src/main/scala/ai/chronon/spark/GroupBy.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/Join.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/JoinBase.scala (5 hunks)
- spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (3 hunks)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala (14 hunks)
- spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/TaggedFilterSuite.scala (0 hunks)
- spark/src/test/scala/ai/chronon/spark/test/analyzer/AnalyzerTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala (3 hunks)
- spark/src/test/scala/ai/chronon/spark/test/fetcher/ChainingFetcherTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/fetcher/JavaFetcherTest.java (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByUploadTest.scala (2 hunks)
- spark/src/test/scala/ai/chronon/spark/test/join/FeatureWithLabelJoinTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/join/JoinTest.scala (30 hunks)
- spark/src/test/scala/ai/chronon/spark/test/join/JoinUtilsTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/join/LabelJoinTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/streaming/AvroTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/streaming/KafkaStreamBuilderTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (1 hunks)
- spark/src/test/scala/ai/chronon/spark/test/streaming/StreamingTest.scala (3 hunks)
💤 Files with no reviewable changes (1)
- spark/src/test/scala/ai/chronon/spark/test/TaggedFilterSuite.scala
✅ Files skipped from review due to trivial changes (9)
- spark/src/test/scala/ai/chronon/spark/test/fetcher/JavaFetcherTest.java
- spark/src/test/scala/ai/chronon/spark/test/streaming/KafkaStreamBuilderTest.scala
- spark/src/test/scala/ai/chronon/spark/test/join/FeatureWithLabelJoinTest.scala
- spark/src/test/scala/ai/chronon/spark/test/streaming/AvroTest.scala
- spark/src/test/scala/ai/chronon/spark/test/fetcher/ChainingFetcherTest.scala
- spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByUploadTest.scala
- spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationTest.scala
- spark/src/test/scala/ai/chronon/spark/test/analyzer/AnalyzerTest.scala
- spark/src/test/scala/ai/chronon/spark/test/join/JoinUtilsTest.scala
🚧 Files skipped from review as they are similar to previous changes (1)
- online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala
🧰 Additional context used
🪛 actionlint (1.7.4)
.github/workflows/test_scala_spark.yaml
49-49, 75-75, 101-101, 127-127, 153-153: label "ubuntu-8_cores-32_gb" is unknown. Available labels include "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", the windows-* and macos-* runners, and "self-hosted". If it is a custom label for a self-hosted runner, set the list of labels in an actionlint.yaml config file.

(runner-label)
🪛 YAMLlint (1.35.1)
.github/workflows/test_scala_spark.yaml
[error] 176-176: no new line character at the end of file
(new-line-at-end-of-file)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: streaming_tests
- GitHub Check: groupby_tests
- GitHub Check: analyzer_tests
- GitHub Check: spark_tests
- GitHub Check: join_tests
- GitHub Check: bazel_config_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: fetcher_tests
- GitHub Check: non_spark_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (76)
spark/src/main/scala/ai/chronon/spark/LabelJoin.scala (3)
19-20: LGTM! Import consolidation improves code organization.

195-201: LGTM! The partition column drop prevents duplicated column errors during joins.

268-269: LGTM! Variable renaming enhances code clarity.
spark/src/test/scala/ai/chronon/spark/test/streaming/StreamingTest.scala (3)
17-17: LGTM! Package move to streaming-specific location improves organization.

22-22: LGTM! Import changes maintain clean dependencies while preserving access to required functionality.
Also applies to: 25-25

54-54: LGTM! Consistent use of `StreamingTest.buildInMemoryKvStore()` improves clarity. Also applies to: 113-113
spark/src/test/scala/ai/chronon/spark/test/join/LabelJoinTest.scala (1)
17-22: LGTM! Package organization and import consolidation. The package move and import consolidation improve code organization.
spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (2)
41-45: LGTM! Clean addition of partition column parameter. Maintains backward compatibility while adding flexibility.

52-61: LGTM! Consistent partition column handling. Aligns with `gen` method changes.

spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (2)

17-43: LGTM! Well-organized imports. Package and import statements are well-structured and properly consolidated.
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (3)
36-36: LGTM! Import required for accessing Spark SQL configuration settings.

58-59: LGTM! Constant value (1000) provides sufficient headroom above Beacon top's 374 fields.

75-75: LGTM! Properly configures Spark's codegen field limit.
flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2)
3-5: LGTM! All imports are required for the new test functionality.
Also applies to: 13-14

74-76: LGTM! Clear explanation of test's purpose and scope.
spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (3)
17-17: LGTM! Package organization and import consolidation. Package moved to streaming subpackage and imports consolidated for better organization.
Also applies to: 22-27

432-859: LGTM! Comprehensive test coverage. Test cases thoroughly cover various scenarios with clear documentation and assertions.
36-36: ❓ Verification inconclusive
Verify impact of removing TaggedFilterSuite.
Removal of TaggedFilterSuite mixin might affect test filtering capabilities.
🏁 Script executed:

```bash
#!/bin/bash
# Check if TaggedFilterSuite is used elsewhere and if there's a replacement
rg -A 5 "TaggedFilterSuite"
```

Length of output: 1451
Impact of TaggedFilterSuite Removal
- `MutationsTest` now extends only `AnyFlatSpec` while other tests (e.g., in `CatalystUtilHiveUDFTest.scala`) still use `TaggedFilterSuite`.
- Confirm that excluding `TaggedFilterSuite` in the streaming tests is intentional and that any required test filtering is handled appropriately (a hypothetical sketch of such a mixin follows).
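For context, a hypothetical sketch of what a tag-based filter mixin in the spirit of `TaggedFilterSuite` typically looks like in ScalaTest; the removed trait's actual body is not shown in this PR, so the name and shape below are assumptions:

```scala
import org.scalatest.{Args, Filter, Status, Suite}

// Hypothetical sketch only: narrows a suite's run to tests carrying a
// given tag by tightening the Filter before delegating to the real run.
trait TaggedFilterSketch extends Suite {
  def tagName: String // only tests tagged with this name should execute

  abstract override def run(testName: Option[String], args: Args): Status = {
    val onlyTagged = Filter(tagsToInclude = Some(Set(tagName)))
    super.run(testName, args.copy(filter = onlyTagged))
  }
}
```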
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (24)

29-34: Imports for creation statuses
Looks good.
36-36: Additional imports
Usage seems correct.
82-82: Small mode cutoff change
No issues.
202-204: Partition column parameter
Flexibility is improved.
208-208: Dynamic partition columns
Implementation looks good.
302-303: autoExpand parameter
Returning TableCreationStatus is clear.
306-306: Creation status assignment
Logic is consistent.
310-313: Generate table builder
Valid approach.
321-321: TableAlreadyExists
Gracefully handled.
327-328: Check existing table
Redundant but harmless.
331-342: Handling table props & autoExpand
Pattern matching is neat.
356-356: Blank line
No concerns.
359-360: Column rearrangement
Keeps partition columns last.
364-364: createTable invocation
Fine.
383-386: No operation for WithInitialData
Confirm if omission is intended.
434-434: Whitespace
No issues.
440-440: Another createTable call
Seems consistent.
442-445: No action on WithInitialData
Check if skipping is correct.
486-486: Partition columns param
Helpful extension.
835-835: New parameter in scanDf
Straightforward.
837-838: Effective partition column
Works neatly.
844-844: scanDfBase usage
Looks correct.
846-850: Conditional rename
Smart fallback.
863-866: New table creation statuses
Clear usage.

.bazelrc (1)

7-7: Test timeout raised
Validate necessity.

api/py/test/sample/staging_queries/quickstart/purchases.py (4)
1-14: License header
Approved.
15-15: Imports
Looks fine.
19-29: Query definition
Placeholders look correct.
31-38: StagingQuery
Neatly structured.

spark/BUILD.bazel (1)

98-107: LGTM! Well-organized test suite structure. Clear separation of test categories improves maintainability.
spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala (1)
47-48: LGTM! Good default value for partition column. Maintains backward compatibility while adding flexibility.
api/src/main/scala/ai/chronon/api/Builders.scala (1)
49-63: LGTM! Clean addition of partition column support. The implementation maintains backward compatibility and follows the builder pattern consistently (a hedged sketch of the pattern follows).
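A hedged sketch of that pattern with stand-in names (not the actual Builders.scala signatures): appending an optional parameter with a default keeps every existing call site source-compatible.

```scala
// Stand-in names, not the actual Builders.scala signatures: an appended
// optional parameter with a default keeps old call sites compiling as-is.
object QueryBuilderSketch {
  final case class Query(selects: Map[String, String],
                         wheres: Seq[String],
                         partitionColumn: Option[String])

  def query(selects: Map[String, String] = Map.empty,
            wheres: Seq[String] = Seq.empty,
            partitionColumn: String = null): Query =
    Query(selects, wheres, Option(partitionColumn))

  def main(args: Array[String]): Unit = {
    println(query(selects = Map("amount" -> "amount")))                      // old call site, unchanged
    println(query(wheres = Seq("amount > 0"), partitionColumn = "date_key")) // new knob
  }
}
```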
spark/src/main/scala/ai/chronon/spark/BootstrapInfo.scala (1)
187-191: LGTM! Improved partition column handling. The change to use `effectivePartitionColumn` aligns with the flexible partition column support.

spark/src/main/scala/ai/chronon/spark/Extensions.scala (2)

387-392: LGTM! Added SourceSparkOps for partition column handling. Clean implementation of partition column retrieval with fallback to tableUtils.

394-397: LGTM! Added QuerySparkOps for effective partition column. Well-structured implementation with proper null handling; a minimal sketch of the fallback follows.
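A minimal, self-contained sketch of that null-safe fallback with stand-in types (the real `Query`, `TableUtils`, and extension live in the ai.chronon packages):

```scala
// Stand-in types; only the Option-based fallback is the point.
object PartitionColumnSketch {
  final case class Query(partitionColumn: String) // null when unset (Thrift optional)
  final case class TableUtils(partitionColumn: String)

  implicit class QueryOps(private val query: Query) extends AnyVal {
    // Prefer the query-level partition column; fall back to the table default.
    def effectivePartitionColumn(implicit tu: TableUtils): String =
      Option(query.partitionColumn).getOrElse(tu.partitionColumn)
  }

  def main(args: Array[String]): Unit = {
    implicit val tu: TableUtils = TableUtils("ds")
    println(Query(null).effectivePartitionColumn)       // ds (fallback)
    println(Query("date_key").effectivePartitionColumn) // date_key (override wins)
  }
}
```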
api/thrift/api.thrift (2)
19-20: LGTM! Added partitionColumn to Query struct. Clean addition of optional field for partition column support.

53-57: LGTM! Added partitionColumn to StagingQuery struct. Well-documented field with clear purpose.
spark/src/main/scala/ai/chronon/spark/JoinBase.scala (3)
54-54: LGTM! Added implicit TableUtils. Clean addition of implicit value for better code organization.

173-175: LGTM! Updated unfilledRanges with partition column. Consistent use of effectivePartitionColumn across join operations.
Also applies to: 488-490

287-288: LGTM! Improved variable naming. Renamed variable for better clarity without changing functionality.
spark/src/main/scala/ai/chronon/spark/Join.scala (1)
518-518: LGTM! Using query-specific partition column. The change correctly uses the effective partition column from the query instead of the default partition column.
spark/src/main/scala/ai/chronon/spark/Analyzer.scala (1)
316-319: LGTM! Using query-specific partition column. The change correctly uses the effective partition column from the query for unfilled ranges.
spark/src/main/scala/ai/chronon/spark/GroupBy.scala (1)
629-629: LGTM! Using query-specific partition column. The change correctly uses the partition column from the query for meta columns.
spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTest.scala (1)
62-62: Verify why this test is being ignored. The test for metadata store is being disabled. Please ensure there's a tracking issue to re-enable it.
spark/src/test/scala/ai/chronon/spark/test/join/JoinTest.scala (12)
125-126: LGTM! Reduced test data size. Reducing entity count from 1000 to 100 improves test execution time while maintaining test coverage.
Also applies to: 173-174
438-440: LGTM! Added partition column support. The addition of partition column support enhances test coverage for different partitioning scenarios.

429-551: LGTM! Added test for different partition columns. New test case validates join behavior with different partition columns.

125-126: LGTM! Reduced test data size and added partition column support. The changes reduce test data size from 1000 to 100 records and add partition column support, which aligns with addressing the Avro field limit issues.
Also applies to: 173-174, 318-319, 340-341, 356-357, 439-441, 561-562, 578-579, 614-615, 684-686, 708-709, 1034-1035, 1070-1071, 1136-1137, 1156-1157, 1171-1172, 1322-1323, 1335-1336
17-17: LGTM! Package organization improvement. Moving join tests to a dedicated package improves code organization.
Also applies to: 33-33, 34-37
125-126: LGTM! Test data size optimization. Reducing test data size improves test execution time while maintaining test coverage.

173-174: LGTM! Improved partition column handling. Making partition column configurable and consistent improves flexibility.
Also applies to: 225-232
429-551: LGTM! Comprehensive partition column test coverage. New test case thoroughly validates different partition column scenarios.

318-318: LGTM! Consistent test data optimization. Test data size reduction is consistently applied across all test cases.
Also applies to: 340-340, 463-463
859-859: LGTM! Consistent test data optimization. Test data size reduction is consistently applied to name and user test cases.
Also applies to: 875-875, 883-883
614-614: LGTM! Consistent test data optimization. Test data size reduction is consistently applied to view events test cases.
Also applies to: 635-635, 685-685, 708-708, 1034-1034, 1070-1070, 1136-1136, 1156-1156, 1171-1171
1322-1322: LGTM! Consistent test data optimization. Test data size reduction is consistently applied to selected join parts test.
Also applies to: 1335-1335
.github/workflows/test_scala_spark.yaml (2)
71-72: LGTM! Increased memory allocation. Increased JVM memory settings (`-Xmx16G -Xms8G`) for component-specific test jobs will help prevent out-of-memory errors. Also applies to: 97-98, 123-124, 149-150, 175-176

71-71: LGTM! Increased memory limits for test jobs. The increased memory limits (`-Xmx16G -Xms8G`) will help handle large Avro schemas. Also applies to: 97-97, 123-123, 149-149, 175-175
```scala
val result = Await.result(future, Duration(10000, SECONDS)) // todo: change back to millis
val latency = System.currentTimeMillis() - start
latencySum += latency
latencyCount += 1
result
}
```
💡 Verification agent
🧩 Analysis chain
Verify timeout duration.
The TODO comment suggests changing from seconds to milliseconds.
🏁 Script executed:

```bash
#!/bin/bash
# Search for similar timeout patterns in tests
rg -A 2 'Duration\([0-9]+.*SECONDS\)'
```

Length of output: 1767
Update Timeout Duration Unit:
In spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (lines 65–70), the timeout is specified as Duration(10000, SECONDS) with a TODO to revert to milliseconds. Since other tests (e.g., in ExternalSourcesTest.scala) use a 10‑second timeout, please verify if the intended behavior is 10 seconds (i.e. 10000 ms) and update accordingly (e.g. change to Duration(10000, MILLISECONDS)).
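A tiny runnable sketch of the unit difference (illustrative only, not the project's test code): `Duration(10000, SECONDS)` waits roughly 2.7 hours, while the intended 10-second timeout can be spelled either way below.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.{Duration, MILLISECONDS, SECONDS}

// Illustrative only: two equivalent 10-second spellings, in contrast to
// the accidental ~2.7-hour timeout flagged above.
object TimeoutSketch {
  def main(args: Array[String]): Unit = {
    val future = Future(42)
    val viaMillis  = Await.result(future, Duration(10000, MILLISECONDS)) // 10 s
    val viaSeconds = Await.result(future, Duration(10, SECONDS))         // same 10 s
    println((viaMillis, viaSeconds)) // Duration(10000, SECONDS) would allow ~2.7 h
  }
}
```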
```diff
 def analyzeJoin(joinConf: api.Join,
                 skewDetection: Boolean = false,
-                validateTablePermission: Boolean = true,
+                validateTablePermission: Boolean = false,
```
💡 Verification agent
🧩 Analysis chain
Verify if table permission validation can be disabled.
Changing the default value of validateTablePermission from true to false might affect security if table permission validation is needed.
🌐 Web query:
What are the security implications of disabling table permission validation in Apache Spark?
💡 Result:
Disabling table permission validation in Apache Spark introduces significant security risks by bypassing critical access control mechanisms. Here's a breakdown of the implications based on industry practices and documented vulnerabilities:
1. Unauthorized Data Access
- Full Exposure of Sensitive Data: Without permission checks, any user could query/modify tables containing PII, financial records, or proprietary data[2][6][7].
- Bypass of Row/Column-Level Security: Tools like Apache Ranger and Spark-Authorizer enforce granular controls (e.g., masking phone numbers), which become ineffective[2][7].
2. Privilege Escalation Risks
- Exploitable Code Execution: Attackers could inject malicious UDFs or modify query plans to execute arbitrary commands[6][10].
- Credential Theft: Executors with unrestricted access might leak credentials stored in memory or configurations[6].
3. Compliance Violations
- GDPR/HIPAA Non-Compliance: Loss of audit trails for data access makes regulatory compliance impossible[7][10].
- Insecure Defaults: Folders/files created without inherited permissions (e.g., 777 on ADLS Gen1) expose data broadly[5].
4. System Integrity Compromises
- SQL Injection Vulnerabilities: Malicious actors could alter query logic via unvalidated inputs[8].
- UI Exploits: Disabled ACLs enable unauthorized users to view/kill applications via Spark UI[4][10].
Key Mitigations from Search Results
| Security Measure | Purpose | Source |
|---|---|---|
| Enable `spark.acls.enable` | Enforce UI/API access controls | [4][10] |
| Use Apache Ranger Integration | Centralize row/column policies | [2][7] |
| Apply `GRANT SELECT/MODIFY` | Restrict file/system operations | [3][5] |
| Validate UDFs/Code Execution | Prevent privilege escalation | [6][10] |
Disabling permission validation fundamentally undermines Spark's security architecture, making systems vulnerable to both internal misuse and external attacks[6][10]. Organizations should instead strengthen validation through granular ACLs and integration with enterprise security frameworks like Ranger[2][7].
Citations:
- 1: https://docs.datastax.com/en/dse/6.9/securing/spark-authentication.html
- 2: https://yaooqinn.github.io/spark-authorizer/docs/spark_sql_authorization.html
- 3: https://kb.databricks.com/security/table-create-security-exception
- 4: https://spark.apache.org/security.html
- 5: https://kb.databricks.com/data/spark-default-perms-adls-gen1
- 6: https://assets.amazon.science/91/50/72bd4f594d85bdfe330ee059bd52/membrane-safe-and-performant-data-access-controls-in-apache-spark-in-the-presence-of-imperative-code.pdf
- 7: https://community.cloudera.com/t5/Community-Articles/Row-Column-level-Security-in-SQL-for-Apache-Spark-2-1-1/ta-p/245738
- 8: https://dzone.com/articles/many-ways-of-validating-data-in-a-spark-dataframe
- 9: https://community.databricks.com/t5/forums/filteredbylabelpage/board-id/data-engineering/label-name/table%20access%20control
- 10: https://granulate.io/blog/spark-security-top-vulnerabilities-6-ways-to-secure-your-spark/
Security Issue: Table Permission Disabled
- Disabling table permission validation can expose sensitive data, allow privilege escalation, and violate compliance requirements.
- Ensure that this configuration is intentionally chosen and that adequate compensatory controls are in place.
- If table access control is required, reset the default to `true`.
File: spark/src/main/scala/ai/chronon/spark/Analyzer.scala (Line 278)
`validateTablePermission: Boolean = false,`

.github/workflows/test_scala_spark.yaml (the new jobs under review):

```yaml
  fetcher_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Fetcher tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:fetcher_test

  join_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Join tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:join_test

  groupby_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run GroupBy tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:groupby_test

  analyzer_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Analyzer tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:analyzer_test

  streaming_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Streaming tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:streaming_test
```
Use standard GitHub Actions runner labels.
The runner label ubuntu-8_cores-32_gb is not a standard GitHub-hosted runner label. Consider using standard labels like ubuntu-latest-8-cores.
Apply this diff to fix the runner labels:
```diff
-    runs-on: ubuntu-8_cores-32_gb
+    runs-on: ubuntu-latest-8-cores
```
🛠️ Refactor suggestion
Add newline at end of file and document runner labels. The workflow changes look good but need some improvements:
- Add a newline at the end of the file
- Document the custom runner label `ubuntu-8_cores-32_gb` in actionlint.yaml

Apply this diff to add the newline:

```diff
   //spark:streaming_test
+
```
+📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
  fetcher_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Fetcher tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:fetcher_test
  join_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Join tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:join_test
  groupby_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run GroupBy tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:groupby_test
  analyzer_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Analyzer tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:analyzer_test
  streaming_tests:
    runs-on: ubuntu-8_cores-32_gb
    container:
      image: ghcr.io/${{ github.repository }}-ci:latest
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    defaults:
      run:
        working-directory: ${{ github.workspace }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup Bazel cache credentials
        run: |
          echo "${{ secrets.BAZEL_CACHE_CREDENTIALS }}" | base64 -d > bazel-cache-key.json
      - name: Run Streaming tests
        run: |
          bazel test \
            --remote_cache=https://storage.googleapis.com/zipline-bazel-cache \
            --google_credentials=bazel-cache-key.json \
            --test_env=JAVA_OPTS="-Xmx16G -Xms8G" \
            //spark:streaming_test
```
🧰 Tools
🪛 actionlint (1.7.4)
49-49, 75-75, 101-101, 127-127, 153-153: label "ubuntu-8_cores-32_gb" is unknown. Available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". If it is a custom label for a self-hosted runner, set the list of labels in the actionlint.yaml config file. (Same finding reported at each of the five runs-on lines.)
(runner-label)
🪛 YAMLlint (1.35.1)
[error] 176-176: no new line character at the end of file
(new-line-at-end-of-file)
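As the actionlint hint above suggests, custom self-hosted runner labels can be declared in an actionlint.yaml at the repository root so the linter accepts them. A minimal sketch, with the label taken from the workflow above (file placement and any other settings are assumptions):

```yaml
# actionlint.yaml — declare custom runner labels so actionlint
# stops flagging them as unknown
self-hosted-runner:
  labels:
    - ubuntu-8_cores-32_gb
```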
Force-pushed from 2dedcc2 to deb87ba (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (1)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (1)
58-60: Verify the MaxFields value aligns with Beacon top requirements. The PR objective mentions 374 fields in Beacon top; while 1000 is sufficient, consider setting it closer to the actual requirement (a configuration sketch follows below).
Also applies to: 75-75
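For context, here is a minimal sketch of how a bumped field limit might be wired into a Spark session. The MaxFields name mirrors the constant this PR introduces, but the session setup and the use of the public spark.sql.codegen.maxFields key are illustrative assumptions, not CatalystUtil's actual wiring:

```scala
import org.apache.spark.sql.SparkSession

object CodegenLimitSketch {
  // Headroom above beacon top's 374 fields (the value 1000 is taken from the PR).
  val MaxFields: Int = 1000

  // Minimal local session with the whole-stage codegen field limit raised
  // from Spark's default of 100 to MaxFields.
  def buildSession(): SparkSession =
    SparkSession
      .builder()
      .master("local[*]")
      .appName("catalyst-util-sketch")
      .config("spark.sql.codegen.maxFields", MaxFields.toString)
      .getOrCreate()
}
```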
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (3)
- flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala (2 hunks)
- online/src/main/scala/ai/chronon/online/CatalystUtil.scala (5 hunks)
- online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- online/src/test/scala/ai/chronon/online/test/CatalystUtilTest.scala
- flink/src/test/scala/org/apache/spark/sql/avro/AvroDeSerializationSupportSpec.scala
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: non_spark_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (3)
online/src/main/scala/ai/chronon/online/CatalystUtil.scala (3)
36-36: LGTM: SQLConf import added. Required for accessing SQL configuration settings.
143-144: LGTM: Improved SQL clause handling. Wrapping clauses in parentheses avoids operator-precedence issues and reads more clearly.
206-211: LGTM: Early validation prevents codegen issues. The fail-fast approach prevents incorrect code generation for large schemas.
Run this to verify the validation:

```bash
#!/bin/bash
# Search for other schema size validations
rg -A 2 "Too many fields"
```
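The guard referenced above (lines 206-211) isn't reproduced in this thread; here is a minimal sketch of what such a fail-fast check could look like, assuming a MaxFields constant and a Spark StructType schema (the names are illustrative, not CatalystUtil's actual API):

```scala
import org.apache.spark.sql.types.StructType

object SchemaGuardSketch {
  val MaxFields: Int = 1000

  // Fail fast before the schema reaches whole-stage codegen, so an oversized
  // schema surfaces as a clear error instead of miscompiled generated code.
  def validate(schema: StructType): Unit =
    require(
      schema.fields.length <= MaxFields,
      s"Too many fields in schema (${schema.fields.length} > $MaxFields):\n${schema.treeString}"
    )
}
```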
Summary
While trying to read the updated beacon top topic we hit issues because the number of Avro fields exceeds Spark's default codegen limit of 100. As a result, the whole-stage codegen output is incorrect and we end up with either segfaults (unit tests) or garbled events (prod Flink jobs). This PR bumps the limit so we can read beacon top (374 fields), and adds an assert in CatalystUtil's whole-stage codegen path that fails fast if we ever hit a field count above the new, higher limit.
Checklist

- [X] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

Summary by CodeRabbit

New Features

- Enhanced data processing robustness with improved handling and early error detection for large schemas.
- Refined SQL query formatting for clearer logical conditions.

Tests

- Added a new validation for large schema deserialization.
- Updated test definitions to improve structure and readability.
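The new large-schema test itself isn't reproduced in this thread; below is a sketch of how such a test might construct a wide Avro schema programmatically with Avro's SchemaBuilder. The record name, field shape, and default field count are assumptions for illustration; only the 374-field figure comes from the PR:

```scala
import org.apache.avro.{Schema, SchemaBuilder}

object LargeAvroSchemaSketch {
  // Build a record wider than Spark's default codegen limit of 100 fields,
  // approximating the 374-field beacon top schema.
  def wideSchema(numFields: Int = 374): Schema = {
    var fields = SchemaBuilder.record("BeaconTopLike").namespace("test").fields()
    (1 to numFields).foreach { i =>
      fields = fields.name(s"field_$i").`type`().nullable().stringType().noDefault()
    }
    fields.endRecord()
  }
}
```

Feeding a payload conforming to this schema through the deserialization path would then exercise the raised codegen limit end to end.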