Conversation

@tchow-zlai (Collaborator) commented Jun 13, 2025

Summary

  • Symlink the Python compiled directory and check it in.

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Added a new test suite to verify join planner functionality.
    • Added multiple JSON configuration files defining data processing jobs, aggregations, and training sets for AWS and GCP environments.
    • Introduced new team metadata configurations for AWS, GCP, default, and test environments.
    • Added resource files supporting test data and environment configurations.
  • Bug Fixes

    • Improved error handling for JSON parsing by replacing assertions with precondition checks.

coderabbitai bot commented Jun 13, 2025

Walkthrough

A new Scala test suite verifies that join planner configurations parse and plan without exceptions. Test resources, including a new resource file under the Spark test tree, were added and wired into the Bazel build. An assertion was replaced by a precondition check in JSON deserialization, and the MonolithJoinPlanner class was converted to a case class. Numerous JSON configuration files for data pipelines, joins, and team metadata were added under api/python/test/canary/compiled/. The .gitignore was updated to drop stale entries and add a new ignore for GCP compiled resources.

Changes

File(s) Change Summary
api/BUILD.bazel Added data attribute with test resource dependency to scala_test_suite named tests.
api/src/test/scala/ai/chronon/api/test/planner/PlannerTest.scala Added PlannerTest verifying join planner builds plans without exceptions.
spark/src/test/resources/canary/compiled Added file referencing compiled resource path for tests.
api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala Replaced assert with require in fromJsonStr method for JSON equivalence check.
api/src/main/scala/ai/chronon/api/planner/MonolithJoinPlanner.scala Converted MonolithJoinPlanner from class to case class; safely handle optional finalNode name.
api/python/test/canary/compiled/group_bys/aws/purchases.v1_dev Added JSON config for AWS purchases data aggregation job.
api/python/test/canary/compiled/group_bys/aws/purchases.v1_test Added JSON test config for AWS purchases group-by job.
api/python/test/canary/compiled/group_bys/gcp/item_event_canary.actions_pubsub_v2 Added JSON config for GCP item event canary Pub/Sub aggregation job.
api/python/test/canary/compiled/group_bys/gcp/item_event_canary.actions_v1 Added JSON config for GCP item event canary Kafka aggregation job.
api/python/test/canary/compiled/group_bys/gcp/purchases.v1_dev Added JSON config for GCP purchases data aggregation job.
api/python/test/canary/compiled/group_bys/gcp/purchases.v1_dev_notds Added JSON config for GCP purchases aggregation job on non-TDS source.
api/python/test/canary/compiled/group_bys/gcp/purchases.v1_test Added JSON test config for GCP purchases data pipeline.
api/python/test/canary/compiled/group_bys/gcp/purchases.v1_test_notds Added JSON config for GCP purchases test job on non-TDS source.
api/python/test/canary/compiled/group_bys/gcp/purchases.v1_view_dev Added JSON config for GCP purchases native view aggregation job.
api/python/test/canary/compiled/group_bys/gcp/purchases.v1_view_test Added JSON test config for GCP purchases native view aggregation job.
api/python/test/canary/compiled/joins/gcp/training_set.v1_dev Added JSON config for GCP training set join pipeline.
api/python/test/canary/compiled/joins/gcp/training_set.v1_dev_notds Added JSON config for GCP training set join pipeline on non-TDS source.
api/python/test/canary/compiled/joins/gcp/training_set.v1_test Added JSON test config for GCP training set join pipeline.
api/python/test/canary/compiled/joins/gcp/training_set.v1_test_notds Added JSON config for GCP training set test join pipeline on non-TDS source.
api/python/test/canary/compiled/teams_metadata/aws/aws_team_metadata Added JSON config defining AWS team execution environment and config metadata.
api/python/test/canary/compiled/teams_metadata/default/default_team_metadata Added JSON config defining default team execution environment and config metadata.
api/python/test/canary/compiled/teams_metadata/gcp/gcp_team_metadata Added JSON config defining GCP team execution environment, Spark, and cluster configs.
api/python/test/canary/compiled/teams_metadata/test/test_team_metadata Added JSON config defining test team execution environment and config settings.
.gitignore Removed old ignored paths; added new ignore for api/python/ai/chronon/resources/gcp/compiled/.

Sequence Diagram(s)

sequenceDiagram
  participant TestRunner as PlannerTest
  participant Env as Environment
  participant FileSystem as File System
  participant LocalRunner
  participant MonolithJoinPlanner

  TestRunner->>Env: Read RUNFILES_DIR
  TestRunner->>FileSystem: Locate compiled join configs
  TestRunner->>LocalRunner: Parse configs into Join objects
  loop For each Join object
    TestRunner->>MonolithJoinPlanner: Create planner
    TestRunner->>MonolithJoinPlanner: buildPlan()
    MonolithJoinPlanner-->>TestRunner: No exception thrown
  end
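
Reading the diagram alongside the test snippet quoted later in this review, the suite plausibly looks like the sketch below. The RUNFILES_DIR sub-path, the LocalRunner.parseConfs helper, and the imports are assumptions inferred from the walkthrough, not code confirmed by the diff:

import ai.chronon.api.{Join, PartitionSpec}        // package paths assumed from the file layout
import ai.chronon.api.planner.MonolithJoinPlanner
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class PlannerTest extends AnyFlatSpec with Matchers {

  private implicit val testPartitionSpec: PartitionSpec = PartitionSpec.daily

  "MonolithJoinPlanner" should "build plans for every compiled join config" in {
    // RUNFILES_DIR is provided by Bazel at test time; the sub-path is an assumption.
    val runfiles = sys.env("RUNFILES_DIR")
    val joinDir  = s"$runfiles/api/python/test/canary/compiled/joins"

    // parseConfs is a hypothetical LocalRunner helper; the diff does not show its signature.
    val joins: Seq[Join] = LocalRunner.parseConfs[Join](joinDir)

    // The case-class conversion lets planners be constructed without `new`.
    val joinPlanners = joins.map(MonolithJoinPlanner(_))

    joinPlanners.foreach { planner =>
      noException should be thrownBy planner.buildPlan
    }
  }
}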

Suggested reviewers

  • nikhil-zlai

Poem

In test we trust, with joins to parse,
A planner’s path, both near and far.
Resources linked, the configs found,
No errors rise, no faults abound.
With Bazel’s aid and Scala’s might,
Our plans are built, the future bright!
🛠️✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81f928c and 4f71651.

📒 Files selected for processing (1)
  • api/src/main/scala/ai/chronon/api/planner/MonolithJoinPlanner.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/src/main/scala/ai/chronon/api/planner/MonolithJoinPlanner.scala
⏰ Context from checks skipped due to timeout of 90000ms (18)
  • GitHub Check: streaming_tests
  • GitHub Check: join_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: groupby_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: batch_tests
  • GitHub Check: spark_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: service_tests
  • GitHub Check: flink_tests
  • GitHub Check: api_tests
  • GitHub Check: online_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: python_tests
  • GitHub Check: enforce_triggered_workflows


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
api/src/test/scala/ai/chronon/api/test/planner/PlannerTest.scala (1)

12-12: Mark implicit explicitly.

Add the type for clarity:

-private implicit val testPartitionSpec = PartitionSpec.daily
+private implicit val testPartitionSpec: PartitionSpec = PartitionSpec.daily
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44b0958 and 18095da.

⛔ Files ignored due to path filters (1)
  • spark/src/test/resources/local_data_csv/test_table_1_data.csv is excluded by !**/*.csv
📒 Files selected for processing (9)
  • api/BUILD.bazel (1 hunks)
  • api/src/test/scala/ai/chronon/api/test/planner/PlannerTest.scala (1 hunks)
  • spark/src/test/resources/canary/compiled (1 hunks)
  • spark/src/test/resources/group_bys/team/example_group_by.v1 (0 hunks)
  • spark/src/test/resources/group_bys/team/purchases.v1 (0 hunks)
  • spark/src/test/resources/group_bys/team/purchases_only_conf_common.v1 (0 hunks)
  • spark/src/test/resources/joins/team/example_join.v1 (0 hunks)
  • spark/src/test/resources/joins/team/example_join_failure.v1 (0 hunks)
  • spark/src/test/resources/teams_metadata/default_team_metadata (0 hunks)
💤 Files with no reviewable changes (6)
  • spark/src/test/resources/teams_metadata/default_team_metadata
  • spark/src/test/resources/group_bys/team/example_group_by.v1
  • spark/src/test/resources/joins/team/example_join.v1
  • spark/src/test/resources/group_bys/team/purchases.v1
  • spark/src/test/resources/group_bys/team/purchases_only_conf_common.v1
  • spark/src/test/resources/joins/team/example_join_failure.v1
⏰ Context from checks skipped due to timeout of 90000ms (16)
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: streaming_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: join_tests
  • GitHub Check: batch_tests
  • GitHub Check: spark_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: service_tests
  • GitHub Check: api_tests
  • GitHub Check: online_tests
  • GitHub Check: flink_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (1)
api/BUILD.bazel (1)

64-66: Ensure the test-resources target exists & is propagated.

If //spark/src/test/resources:test-resources isn’t defined with visibility = ["//visibility:public"] the test will break at runtime. Double-check the BUILD under spark/src/test/resources (or add one) so Bazel can copy the files.

Comment on lines 24 to 29
joinPlanners
  .foreach { planner =>
    noException should be thrownBy {
      planner.buildPlan
    }
  }

🛠️ Refactor suggestion

Assert on result, not just “no exception”.

A test that only checks for absence of exceptions can silently pass on wrong output. Capture the plan and at least assert size > 0.

-    joinPlanners
-      .foreach { planner =>
-        noException should be thrownBy {
-          planner.buildPlan
-        }
-      }
+    joinPlanners.foreach { planner =>
+      val plan = planner.buildPlan
+      plan.nodes.size should be > 0 // or a more specific assertion
+    }
🤖 Prompt for AI Agents
In api/src/test/scala/ai/chronon/api/test/planner/PlannerTest.scala around lines
24 to 29, the test currently only asserts that no exception is thrown when
calling planner.buildPlan. Modify the test to capture the result of buildPlan
and add an assertion that the resulting plan's size is greater than zero to
ensure the output is valid and not empty.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (1)

108-113: 🛠️ Refactor suggestion

Make the assert/require choice match the failure type

Switching to require works, but note the semantic shift:
require signals bad caller input (throws IllegalArgumentException), whereas this check validates an internal invariant (round-trip equality). For internal sanity checks, assert (or Preconditions.checkState) is more idiomatic. If you keep require, rename the check parameter to something like validateRoundTrip to avoid confusion.
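
For reference, a minimal sketch of the three failure types being contrasted (illustrative names, not the codec's actual code):

import com.google.common.base.Preconditions

object FailureSemantics {
  def demo(parsed: String, roundTripped: String): Unit = {
    // require: signals bad caller input -> IllegalArgumentException
    require(parsed.nonEmpty, "input JSON must be non-empty")
    // assert: internal invariant -> AssertionError, and elided entirely under -Xdisable-assertions
    assert(parsed == roundTripped, "round-trip mismatch")
    // Preconditions.checkState: internal state check -> IllegalStateException
    Preconditions.checkState(parsed == roundTripped, "round-trip mismatch for: %s", parsed)
  }
}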

🧹 Nitpick comments (1)
api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (1)

101-104: Reuse a single ObjectMapper

Creating a new ObjectMapper() per call is fairly heavy, and the mapper is thread-safe once configured, so it can be shared. Consider:

-    val mapper = new ObjectMapper()
-    mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
+    val mapper = ThriftJsonCodec.sharedMapper

and add somewhere in the object:

private val sharedMapper: ObjectMapper = new ObjectMapper()
  .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

Cuts GC churn in hot paths.
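
Put together, one self-contained shape for this could be the following sketch; jsonEquivalent is an illustrative name, not ThriftJsonCodec's real API:

import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}

object JsonCompare {
  // Created once; ObjectMapper is thread-safe after configuration.
  private val sharedMapper: ObjectMapper = new ObjectMapper()
    .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

  // Round-trip equivalence in the spirit of fromJsonStr's check:
  // readTree yields JsonNode trees whose equality is deep and field-order-insensitive.
  def jsonEquivalent(original: String, reSerialized: String): Boolean =
    sharedMapper.readTree(original) == sharedMapper.readTree(reSerialized)
}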

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 18095da and cb0aef6.

📒 Files selected for processing (2)
  • api/src/main/scala/ai/chronon/api/ThriftJsonCodec.scala (1 hunks)
  • api/src/test/scala/ai/chronon/api/test/planner/PlannerTest.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/src/test/scala/ai/chronon/api/test/planner/PlannerTest.scala
⏰ Context from checks skipped due to timeout of 90000ms (17)
  • GitHub Check: online_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: flink_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: service_tests
  • GitHub Check: api_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: streaming_tests
  • GitHub Check: batch_tests
  • GitHub Check: join_tests
  • GitHub Check: spark_tests
  • GitHub Check: groupby_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: enforce_triggered_workflows

@tchow-zlai force-pushed the tchow/align-confs branch 2 times, most recently from 880a999 to 54345bd, June 13, 2025 20:55
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
api/python/ai/chronon/resources/gcp/compiled/joins/test/data.v1 (1)

65-69: Leftover angle-bracket placeholders
Values like <customer_id> or <partition-column-name> remain. If this file is meant for real execution (not just test fixtures), replace them to avoid runtime NPEs during env-var interpolation.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54345bd and 82a0a42.

📒 Files selected for processing (5)
  • api/python/ai/chronon/resources/gcp/compiled/group_bys/test/data.group_by_v1 (1 hunks)
  • api/python/ai/chronon/resources/gcp/compiled/joins/test/data.v1 (1 hunks)
  • api/python/ai/chronon/resources/gcp/compiled/teams_metadata/default/default_team_metadata (1 hunks)
  • api/python/ai/chronon/resources/gcp/compiled/teams_metadata/team_conf/team_conf_team_metadata (1 hunks)
  • api/python/ai/chronon/resources/gcp/compiled/teams_metadata/test/test_team_metadata (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • api/python/ai/chronon/resources/gcp/compiled/teams_metadata/test/test_team_metadata
  • api/python/ai/chronon/resources/gcp/compiled/teams_metadata/default/default_team_metadata
  • api/python/ai/chronon/resources/gcp/compiled/teams_metadata/team_conf/team_conf_team_metadata
  • api/python/ai/chronon/resources/gcp/compiled/group_bys/test/data.group_by_v1
⏰ Context from checks skipped due to timeout of 90000ms (17)
  • GitHub Check: streaming_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: join_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: batch_tests
  • GitHub Check: spark_tests
  • GitHub Check: service_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: online_tests
  • GitHub Check: api_tests
  • GitHub Check: flink_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: python_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (1)
api/python/ai/chronon/resources/gcp/compiled/joins/test/data.v1 (1)

6-8: Online-flag mismatch between parent join & child groupBy
metaData.online is 0 for the join, yet the nested group-by advertises online = 1. Planners that propagate the top-level flag may silently drop the online artifact or double-publish. Please confirm the intended behaviour.

Also applies to: 94-99

Comment on lines 178 to 182
"operation": 13,
"argMap": {
"k": "10"
}
}

⚠️ Potential issue

k should be an int, not a quoted string
"k": "10" is parsed as a string. The aggregation codepath expects an integer and will fail JSON → protobuf validation.

-              "k": "10"
+              "k": 10
🤖 Prompt for AI Agents
In api/python/ai/chronon/resources/gcp/compiled/joins/test/data.v1 around lines
178 to 182, the value of "k" is currently a quoted string "10", but it should be
an integer. Remove the quotes around 10 to make it a numeric value, ensuring the
JSON to protobuf validation succeeds and the aggregation codepath works
correctly.

@tchow-zlai changed the title from "add planner tests" to "feat: add planner tests" Jun 13, 2025
@@ -0,0 +1 @@
+api/python/test/canary/compiled/
\ No newline at end of file
@tchow-zlai (Collaborator, Author) commented:

symlinking to canary test files.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 14

♻️ Duplicate comments (3)
api/python/test/canary/compiled/joins/gcp/training_set.v1_test (3)

16-18: Placeholder values unresolved

Same TODO placeholders; fill before merging.

Also applies to: 34-36


6-7: online flag inconsistent

Root = 0, inner = 1; confirm correctness.

Also applies to: 141-141


225-227: Numeric literal for k

-              "k": "10"
+              "k": 10
🧹 Nitpick comments (12)
api/python/test/canary/compiled/group_bys/gcp/purchases.v1_view_test (1)

112-114: Inline Dataproc JSON is bulky

Consider moving the long JSON into a separate file or templated helper to keep this spec readable.

api/python/test/canary/compiled/group_bys/gcp/purchases.v1_dev (2)

72-75: Hard-coded tiny parallelism

coalesce.factor, default.parallelism, and shuffle.partitions are fixed at 10; risky for larger volumes. Consider param-driven sizing.

Also applies to: 102-104


112-114: Huge Dataproc spec embedded as string

Inline 1-line JSON is unreadable and un-diffable. Externalize to a separate file or multi-line format.

api/python/test/canary/compiled/joins/gcp/training_set.v1_test_notds (1)

227-229: k should be numeric

JSON number preferred over quoted string.

-              "k": "10"
+              "k": 10
api/python/test/canary/compiled/joins/gcp/training_set.v1_dev_notds (3)

9-10: Nested JSON-as-string hurts readability

customJson is double-encoded. Prefer embedding this structure directly or moving to a separate file to avoid escape-hell.


50-78: Heavy duplication across mode configs

env.common and modeEnvironments.upload plus corresponding spark.* blocks are effectively copies. Consolidate in a shared template to cut drift.

Also applies to: 80-109


115-116: Inline Dataproc spec is unwieldy

Storing a full Dataproc cluster JSON inside a single field makes reviews and updates painful. Consider referencing an external YAML/JSON resource instead.

api/python/test/canary/compiled/joins/gcp/training_set.v1_dev (4)

124-131: Make partitioning explicit

The left event omits partitionColumn; relying on defaults can bite later. Add "partitionColumn": "ds" for clarity.


9-10: Avoid JSON-in-JSON

Same escape issue as other file – consider externalising airflowDependencies.


50-78: Reduce config duplication

Duplicate spark.* blocks across modes; template or inherit to ease maintenance.

Also applies to: 80-109


115-116: Inline Dataproc definition

Long JSON string again; external file or shared reference recommended.

api/python/test/canary/compiled/group_bys/gcp/item_event_canary.actions_v1 (1)

25-26: ENABLE_PUBSUB set but source is Kafka
Flag may be misleading; consider a Kafka-specific toggle.

Also applies to: 125-126

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a520b5 and 7b4d6b0.

📒 Files selected for processing (19)
  • .gitignore (1 hunks)
  • api/python/test/canary/compiled/group_bys/aws/purchases.v1_dev (1 hunks)
  • api/python/test/canary/compiled/group_bys/aws/purchases.v1_test (1 hunks)
  • api/python/test/canary/compiled/group_bys/gcp/item_event_canary.actions_pubsub_v2 (1 hunks)
  • api/python/test/canary/compiled/group_bys/gcp/item_event_canary.actions_v1 (1 hunks)
  • api/python/test/canary/compiled/group_bys/gcp/purchases.v1_dev (1 hunks)
  • api/python/test/canary/compiled/group_bys/gcp/purchases.v1_dev_notds (1 hunks)
  • api/python/test/canary/compiled/group_bys/gcp/purchases.v1_test (1 hunks)
  • api/python/test/canary/compiled/group_bys/gcp/purchases.v1_test_notds (1 hunks)
  • api/python/test/canary/compiled/group_bys/gcp/purchases.v1_view_dev (1 hunks)
  • api/python/test/canary/compiled/group_bys/gcp/purchases.v1_view_test (1 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_dev (1 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_dev_notds (1 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_test (1 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_test_notds (1 hunks)
  • api/python/test/canary/compiled/teams_metadata/aws/aws_team_metadata (1 hunks)
  • api/python/test/canary/compiled/teams_metadata/default/default_team_metadata (1 hunks)
  • api/python/test/canary/compiled/teams_metadata/gcp/gcp_team_metadata (1 hunks)
  • api/python/test/canary/compiled/teams_metadata/test/test_team_metadata (1 hunks)
✅ Files skipped from review due to trivial changes (8)
  • .gitignore
  • api/python/test/canary/compiled/teams_metadata/aws/aws_team_metadata
  • api/python/test/canary/compiled/teams_metadata/default/default_team_metadata
  • api/python/test/canary/compiled/group_bys/aws/purchases.v1_dev
  • api/python/test/canary/compiled/group_bys/aws/purchases.v1_test
  • api/python/test/canary/compiled/group_bys/gcp/purchases.v1_dev_notds
  • api/python/test/canary/compiled/teams_metadata/gcp/gcp_team_metadata
  • api/python/test/canary/compiled/teams_metadata/test/test_team_metadata
⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: service_tests
  • GitHub Check: online_tests
  • GitHub Check: api_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: flink_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: python_tests
  • GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (5)
api/python/test/canary/compiled/group_bys/gcp/purchases.v1_dev (2)

30-31: JOB_MODE set to local[*]

Local mode defeats cluster scaling; confirm this is intentional for canary.


140-156: Verify aggregation op codes

Ops 6/7/8 must exist in planner enum; otherwise parsing will fail.

Also applies to: 160-176, 178-194

api/python/test/canary/compiled/joins/gcp/training_set.v1_test_notds (2)

16-18: Placeholder values unresolved

[STREAMING-TODO] / [ONLINE-TODO] placeholders remain; pipeline will error at runtime.

Also applies to: 34-36


6-7: online flag inconsistent

Top-level job marked online: 0, but nested groupBy is online: 1. Verify intent.

Also applies to: 142-142

api/python/test/canary/compiled/group_bys/gcp/purchases.v1_test (1)

17-19: Unreferenced partition column ds.
Env/Spark expect ds, but the query lacks partitionColumn; ensure table is actually partitioned or drop these settings.

Also applies to: 50-53, 124-131

Comment on lines +20 to +26
"GCP_PROJECT_ID": "canary-443022",
"GCP_REGION": "us-central1",
"GCP_DATAPROC_CLUSTER_NAME": "zipline-canary-cluster",
"GCP_BIGTABLE_INSTANCE_ID": "zipline-canary-instance",
"FLINK_STATE_URI": "gs://zipline-warehouse-canary/flink-state",
"CLOUD_PROVIDER": "gcp",
"ENABLE_PUBSUB": "true"

🛠️ Refactor suggestion

Hard-coded project/cluster IDs

Values like canary-443022 & zipline-canary-cluster will break when reused in another environment. Externalize to vars.

Also applies to: 38-45

🤖 Prompt for AI Agents
In api/python/test/canary/compiled/group_bys/gcp/purchases.v1_view_test around
lines 20 to 26 and 38 to 45, the project and cluster IDs such as "canary-443022"
and "zipline-canary-cluster" are hard-coded, which will cause failures in
different environments. Replace these hard-coded strings with environment
variables or configuration parameters to externalize the values, allowing the
code to adapt to different deployment contexts without modification.

Comment on lines +14 to +18
"HADOOP_DIR": "[STREAMING-TODO]/path/to/folder/containing",
"CHRONON_ONLINE_CLASS": "[ONLINE-TODO]your.online.class",
"CHRONON_ONLINE_ARGS": "[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>",
"PARTITION_COLUMN": "ds",
"PARTITION_FORMAT": "yyyy-MM-dd",

⚠️ Potential issue

TODO placeholders still present

HADOOP_DIR, CHRONON_ONLINE_CLASS, CHRONON_ONLINE_ARGS carry [STREAMING-TODO]/[ONLINE-TODO]. Replace or purge before shipping.

Also applies to: 32-37

🤖 Prompt for AI Agents
In api/python/test/canary/compiled/group_bys/gcp/purchases.v1_view_test around
lines 14 to 18 and 32 to 37, the configuration values for HADOOP_DIR,
CHRONON_ONLINE_CLASS, and CHRONON_ONLINE_ARGS still contain placeholder tags
like [STREAMING-TODO] and [ONLINE-TODO]. Replace these placeholders with actual
valid paths, class names, and argument strings relevant to your environment or
remove these entries if they are not needed before finalizing the code.

Comment on lines +14 to +16
"HADOOP_DIR": "[STREAMING-TODO]/path/to/folder/containing",
"CHRONON_ONLINE_CLASS": "[ONLINE-TODO]your.online.class",
"CHRONON_ONLINE_ARGS": "[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>",

⚠️ Potential issue

Remove TODO placeholders before prod

HADOOP_DIR, CHRONON_ONLINE_CLASS, and CHRONON_ONLINE_ARGS still have TODO placeholders. Leaving these will break runtime discovery.

Also applies to: 32-34

🤖 Prompt for AI Agents
In api/python/test/canary/compiled/group_bys/gcp/purchases.v1_dev at lines 14-16
and 32-34, the environment variables HADOOP_DIR, CHRONON_ONLINE_CLASS, and
CHRONON_ONLINE_ARGS contain placeholder TODO values. Replace these placeholders
with the actual production values required for runtime discovery to function
correctly, ensuring no placeholder text remains.

Comment on lines +198 to +201
"operation": 13,
"argMap": {
"k": "10"
}

⚠️ Potential issue

k value should be int

"k": "10" is a string; parsers may expect number.

🤖 Prompt for AI Agents
In api/python/test/canary/compiled/group_bys/gcp/purchases.v1_dev around lines
198 to 201, the value of "k" is currently a string "10" but it should be an
integer. Change the value of "k" from the string "10" to the integer 10 to
ensure parsers expecting a number handle it correctly.

Comment on lines +17 to +19
"PARTITION_COLUMN": "ds",
"PARTITION_FORMAT": "yyyy-MM-dd",
"CUSTOMER_ID": "dev",

⚠️ Potential issue

Partition column mismatch (ds vs notds).
PARTITION_COLUMN + Spark conf use ds, but the source and Airflow spec use notds. This will break partition pruning & scheduling.

-          "PARTITION_COLUMN": "ds",
+          "PARTITION_COLUMN": "notds",
@@
-          "spark.chronon.partition.column": "ds",
+          "spark.chronon.partition.column": "notds",
@@
-          "partitionColumn": "notds"
+          "partitionColumn": "notds"   # keep, now consistent

Also applies to: 50-53, 123-132

🤖 Prompt for AI Agents
In api/python/test/canary/compiled/group_bys/gcp/purchases.v1_test_notds around
lines 17 to 19 (and also lines 50-53, 123-132), the PARTITION_COLUMN is set to
"ds" while the source and Airflow spec use "notds", causing a partition column
mismatch. Update the PARTITION_COLUMN and any related Spark configuration to use
"notds" consistently to ensure partition pruning and scheduling work correctly.

"HADOOP_DIR": "[STREAMING-TODO]/path/to/folder/containing",
"CHRONON_ONLINE_CLASS": "[ONLINE-TODO]your.online.class",
"CHRONON_ONLINE_ARGS": "[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>",
"PARTITION_COLUMN": "ds",

🛠️ Refactor suggestion

PARTITION_COLUMN mismatch
Env sets ds, Spark conf expects _DATE; pick one to avoid empty partitions.

Also applies to: 50-51

🤖 Prompt for AI Agents
In
api/python/test/canary/compiled/group_bys/gcp/item_event_canary.actions_pubsub_v2
at line 17 and also lines 50-51, the PARTITION_COLUMN is set to "ds" in the
environment but Spark configuration expects "_DATE". To fix this, ensure that
the PARTITION_COLUMN value matches the expected partition column name in Spark
configuration by changing "ds" to "_DATE" or vice versa, so that partitioning
works correctly and does not result in empty partitions.

Comment on lines +14 to +16
"HADOOP_DIR": "[STREAMING-TODO]/path/to/folder/containing",
"CHRONON_ONLINE_CLASS": "[ONLINE-TODO]your.online.class",
"CHRONON_ONLINE_ARGS": "[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>",

⚠️ Potential issue

Same TODO placeholders as above
Replace before CI triggers planner.

Also applies to: 32-34

🤖 Prompt for AI Agents
In api/python/test/canary/compiled/group_bys/gcp/item_event_canary.actions_v1 at
lines 14-16 and 32-34, replace the placeholder TODO strings in the environment
variable values with actual valid paths, class names, and arguments required for
the Hadoop directory and Chronon online class configuration before running CI.
Ensure these values are correctly set to avoid CI failures.

"HADOOP_DIR": "[STREAMING-TODO]/path/to/folder/containing",
"CHRONON_ONLINE_CLASS": "[ONLINE-TODO]your.online.class",
"CHRONON_ONLINE_ARGS": "[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>",
"PARTITION_COLUMN": "ds",

🛠️ Refactor suggestion

Partition column disparity
ds vs _DATE—align to prevent write/read skew.

Also applies to: 50-51

🤖 Prompt for AI Agents
In api/python/test/canary/compiled/group_bys/gcp/item_event_canary.actions_v1 at
line 17 and also lines 50-51, the partition column is set to "ds" which is
inconsistent with the expected "_DATE". To fix this, update the
"PARTITION_COLUMN" value from "ds" to "_DATE" in all these locations to ensure
consistent partitioning and prevent write/read skew.

Comment on lines +14 to +16
"HADOOP_DIR": "[STREAMING-TODO]/path/to/folder/containing",
"CHRONON_ONLINE_CLASS": "[ONLINE-TODO]your.online.class",
"CHRONON_ONLINE_ARGS": "[ONLINE-TODO]args prefixed with -Z become constructor map for your implementation of ai.chronon.online.Api, -Zkv-host=<YOUR_HOST> -Zkv-port=<YOUR_PORT>",

⚠️ Potential issue

TODO placeholders must be resolved before execution

HADOOP_DIR, CHRONON_ONLINE_CLASS, and CHRONON_ONLINE_ARGS still contain placeholder text ([STREAMING-TODO], [ONLINE-TODO]). If this config ever leaves the test fixture path, the job will blow up at runtime. Replace or override these values, or explicitly document why they are safe to remain.

Also applies to: 32-34

🤖 Prompt for AI Agents
In api/python/test/canary/compiled/group_bys/gcp/purchases.v1_view_dev around
lines 14 to 16 and 32 to 34, the environment variables HADOOP_DIR,
CHRONON_ONLINE_CLASS, and CHRONON_ONLINE_ARGS contain unresolved placeholder
text like [STREAMING-TODO] and [ONLINE-TODO]. Replace these placeholders with
actual valid paths, class names, and argument strings appropriate for the
environment, or override them with correct values before execution to prevent
runtime failures. If these placeholders must remain, add explicit comments
explaining why they are safe in this context.

Comment on lines +198 to +201
"operation": 13,
"argMap": {
"k": "10"
}

💡 Verification agent

❓ Verification inconclusive

k should be numeric, not a quoted string

Down-stream aggregation code usually deserialises argMap values as numbers. Keep it an int to avoid parsing hiccups.

-        "k": "10"
+        "k": 10

• File: api/python/test/canary/compiled/group_bys/gcp/purchases.v1_view_dev
  Lines: 198–201
🤖 Prompt for AI Agents
In api/python/test/canary/compiled/group_bys/gcp/purchases.v1_view_dev around
lines 198 to 201, the value of "k" in argMap is currently a string "10". Change
this value to an integer 10 by removing the quotes to ensure downstream
aggregation code correctly interprets it as a number and avoids parsing issues.

@tchow-zlai force-pushed the tchow/align-confs branch 2 times, most recently from 9de75a3 to 23a3d4b, June 13, 2025 22:57
Comment on lines 29 to 37
val terminalNodeNames = Option(finalNode)
  .map(_.metaData)
  .map(_.name)
  .map((name) =>
    Map(
      planner.Mode.BACKFILL -> name
    ))
  .getOrElse(Map.empty)
  .asJava

for comprehension?
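
A sketch of what that might look like (note it also Option-wraps the intermediate steps, which the .map chain above would NPE on if metaData came back null):

val terminalNodeNames = (for {
  node <- Option(finalNode)
  meta <- Option(node.metaData)
  name <- Option(meta.name)
} yield Map(planner.Mode.BACKFILL -> name))
  .getOrElse(Map.empty)
  .asJava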

@tchow-zlai merged commit e72566e into main Jun 14, 2025
48 of 58 checks passed
@tchow-zlai deleted the tchow/align-confs branch June 14, 2025 02:25