-
Notifications
You must be signed in to change notification settings - Fork 8
feat: complex stagingquery canary #1003
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughMultiple new JSON configuration files for GCP staging queries ( Changes
Sequence Diagram(s)sequenceDiagram
participant Scheduler
participant StagingQuery_v1
participant StagingQuery_v2
participant StagingQuery_v3
participant StagingQuery_v4
participant StagingQuery_v5
participant StagingQuery_v6
participant Terminal_v1
Scheduler->>StagingQuery_v1: Trigger (daily)
Scheduler->>StagingQuery_v2: Trigger (daily)
Scheduler->>StagingQuery_v3: Trigger (daily)
Scheduler->>StagingQuery_v4: Trigger (daily)
Scheduler->>StagingQuery_v5: Trigger (daily)
Scheduler->>StagingQuery_v6: Trigger (daily)
StagingQuery_v1-->>Terminal_v1: Output table
StagingQuery_v2-->>Terminal_v1: Output table
StagingQuery_v3-->>Terminal_v1: Output table
StagingQuery_v4-->>Terminal_v1: Output table
StagingQuery_v5-->>Terminal_v1: Output table
StagingQuery_v6-->>Terminal_v1: Output table
Scheduler->>Terminal_v1: Trigger after dependencies
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
Poem
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (8)
✅ Files skipped from review due to trivial changes (3)
🚧 Files skipped from review as they are similar to previous changes (5)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
✨ Finishing Touches
🧪 Generate unit tests
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (3)
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v1_hub__0 (1)
17-18: Same TODO placeholders as v4These environment placeholders are duplicated across versions and need resolution.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v6__0 (1)
17-18: TODO placeholders replicated across all versionsAddress these placeholders systematically across all staging query versions.
api/python/test/canary/compiled/joins/gcp/training_set.v1_hub__0 (1)
31-33: TODO placeholders in training set configSame environment placeholder issues exist in this training set configuration.
🧹 Nitpick comments (5)
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v4__0 (2)
17-18: TODO placeholders need resolutionEnvironment variables contain placeholder values that will likely cause runtime failures.
54-81: Extensive configuration duplicationCommon and backfill configs are nearly identical. Consider extracting shared configuration.
Also applies to: 83-111
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v2__0 (3)
59-63: Deduplicate bucket path config
temporary_gcs_bucket,connector_output_dataset, etc. re-appear verbatim in every mode. Move to one shared place to avoid drift.
82-111: Backfill config duplicates “common”The block repeats ~30 keys unchanged. If identical, omit and inherit from
common.
118-118: Inline Dataproc JSON hurts readabilityA >600-char escaped string is painful to diff. Store cluster spec in its own file and reference it.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
api/python/test/canary/compiled/joins/gcp/training_set.v1_hub__0(1 hunks)api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.terminal_v1__0(1 hunks)api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v1__0(1 hunks)api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v1_hub__0(1 hunks)api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v2__0(1 hunks)api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v3__0(1 hunks)api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v4__0(1 hunks)api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v5__0(1 hunks)api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v6__0(1 hunks)api/python/test/canary/compiled/teams_metadata/test/test_team_metadata(3 hunks)api/python/test/canary/staging_queries/gcp/sample_staging_query.py(2 hunks)
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: chewy-zlai
PR: zipline-ai/chronon#30
File: api/py/test/sample/production/group_bys/risk/transaction_events.txn_group_by_user:60-61
Timestamp: 2024-10-08T16:18:45.669Z
Learning: The JSON files in 'api/py/test/sample/production/' are automatically generated; avoid suggesting manual changes to them.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#789
File: api/python/ai/chronon/repo/cluster.py:16-16
Timestamp: 2025-05-23T22:52:27.605Z
Learning: The project uses terraform to provision the required "dataproc@${project_id}.iam.gserviceaccount.com" service account, making hardcoded service account patterns in Dataproc configurations reliable and acceptable.
api/python/test/canary/compiled/teams_metadata/test/test_team_metadata (3)
Learnt from: david-zlai
PR: #222
File: cloud_gcp/src/main/resources/additional-confs.yaml:4-5
Timestamp: 2025-01-15T21:00:45.383Z
Learning: Security suggestions about using environment variables for project IDs and datasets in additional-confs.yaml were deemed non-critical by the team, as these values may be temporarily hardcoded for development purposes.
Learnt from: david-zlai
PR: #222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse" should remain in the main additional-confs.yaml file, not in dev-specific configs.
Learnt from: piyush-zlai
PR: #135
File: quickstart/README.md:11-13
Timestamp: 2024-12-18T16:08:20.688Z
Learning: In the “quickstart” Docker environment for Bigtable, GCP credentials aren’t required because the system is using a Bigtable emulator.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.terminal_v1__0 (1)
Learnt from: david-zlai
PR: #222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse" should remain in the main additional-confs.yaml file, not in dev-specific configs.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v5__0 (2)
Learnt from: david-zlai
PR: #222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse" should remain in the main additional-confs.yaml file, not in dev-specific configs.
Learnt from: chewy-zlai
PR: #30
File: api/py/test/sample/production/group_bys/risk/transaction_events.txn_group_by_user:60-61
Timestamp: 2024-10-08T16:18:45.669Z
Learning: The JSON files in 'api/py/test/sample/production/' are automatically generated; avoid suggesting manual changes to them.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v4__0 (2)
Learnt from: david-zlai
PR: #222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse" should remain in the main additional-confs.yaml file, not in dev-specific configs.
Learnt from: chewy-zlai
PR: #30
File: api/py/test/sample/production/group_bys/risk/transaction_events.txn_group_by_user:60-61
Timestamp: 2024-10-08T16:18:45.669Z
Learning: The JSON files in 'api/py/test/sample/production/' are automatically generated; avoid suggesting manual changes to them.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v6__0 (1)
Learnt from: david-zlai
PR: #222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse" should remain in the main additional-confs.yaml file, not in dev-specific configs.
api/python/test/canary/compiled/joins/gcp/training_set.v1_hub__0 (1)
Learnt from: david-zlai
PR: #222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse" should remain in the main additional-confs.yaml file, not in dev-specific configs.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v3__0 (2)
Learnt from: david-zlai
PR: #222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse" should remain in the main additional-confs.yaml file, not in dev-specific configs.
Learnt from: chewy-zlai
PR: #30
File: api/py/test/sample/production/group_bys/risk/transaction_events.txn_group_by_user:60-61
Timestamp: 2024-10-08T16:18:45.669Z
Learning: The JSON files in 'api/py/test/sample/production/' are automatically generated; avoid suggesting manual changes to them.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v1_hub__0 (1)
Learnt from: david-zlai
PR: #222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse" should remain in the main additional-confs.yaml file, not in dev-specific configs.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v2__0 (2)
Learnt from: david-zlai
PR: #222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse" should remain in the main additional-confs.yaml file, not in dev-specific configs.
Learnt from: chewy-zlai
PR: #30
File: api/py/test/sample/production/group_bys/risk/transaction_events.txn_group_by_user:60-61
Timestamp: 2024-10-08T16:18:45.669Z
Learning: The JSON files in 'api/py/test/sample/production/' are automatically generated; avoid suggesting manual changes to them.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: cloud_gcp_tests
- GitHub Check: cloud_aws_tests
- GitHub Check: streaming_tests
- GitHub Check: analyzer_tests
- GitHub Check: service_commons_tests
- GitHub Check: groupby_tests
- GitHub Check: service_tests
- GitHub Check: api_tests
- GitHub Check: aggregator_tests
- GitHub Check: online_tests
- GitHub Check: batch_tests
- GitHub Check: flink_tests
- GitHub Check: join_tests
- GitHub Check: fetcher_tests
- GitHub Check: spark_tests
- GitHub Check: python_tests
- GitHub Check: enforce_triggered_workflows
🔇 Additional comments (10)
api/python/test/canary/compiled/teams_metadata/test/test_team_metadata (1)
20-56: LGTM - Resource allocation swap makes sense.Environment mode swap aligns backfill with higher resource requirements and upload with lighter configuration.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v1__0 (1)
6-6: LGTM - Namespace update maintains consistency.Output namespace aligned with other staging query versions.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v3__0 (1)
1-143: LGTM - Well-structured staging query configuration.Comprehensive GCP configuration follows established patterns for the v3 staging query.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v5__0 (1)
1-143: LGTM - Consistent v5 staging query configuration.Configuration maintains pattern consistency across staging query versions.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.terminal_v1__0 (2)
11-11: LGTM - Comprehensive dependency orchestration.Airflow dependencies properly reference all 6 staging query versions.
127-212: LGTM - Well-structured dependency chain.Table dependencies correctly reference all staging query outputs with proper offsets.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v4__0 (1)
118-118: Service account pattern is acceptableThe hardcoded service account follows the expected terraform-provisioned pattern for this project.
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v1_hub__0 (1)
6-6: Output namespace differs from other versionsThis uses "data" while v4 uses "sample_namespace". Verify this is intentional.
api/python/test/canary/compiled/joins/gcp/training_set.v1_hub__0 (2)
198-261: Well-structured aggregation pipelineThe windowed aggregations (sum, count, average) over multiple time periods and last-k operation are properly configured.
7-19: Ignore identical column hashes in canary outputThe hashing algorithm only fingerprints the aggregation logic and input column, not the window parameter—so sum/count/average over 3d/14d/30d all collapse to the same hash by design. No change needed.
Likely an incorrect or invalid review comment.
| def get_staging_query(): | ||
| return StagingQuery( | ||
| query=query, | ||
| start_partition="2020-03-01", | ||
| name="sample_staging_query", | ||
| output_namespace="sample_namespace", | ||
| table_properties={"sample_config_json": """{"sample_key": "sample value"}"""}, | ||
| dependencies=[ | ||
| TableDependency(table=get_join_output_table_name(training_set.v1_test, True), partition_column="ds", offset=1) | ||
| ], | ||
| version=0, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Factory function creates identical queries
All staging queries are identical. Consider parameterizing the factory or clarify the purpose of multiple identical queries.
🤖 Prompt for AI Agents
In api/python/test/canary/staging_queries/gcp/sample_staging_query.py around
lines 13 to 24, the factory function get_staging_query creates identical staging
queries without variation. To fix this, modify the function to accept parameters
that customize key attributes such as query, start_partition, name,
output_namespace, or dependencies, allowing creation of distinct queries.
Alternatively, add comments clarifying why multiple identical queries are needed
if parameterization is not intended.
| v1 = get_staging_query() | ||
| v2 = get_staging_query() | ||
| v3 = get_staging_query() | ||
| v4 = get_staging_query() | ||
| v5 = get_staging_query() | ||
| v6 = get_staging_query() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Six identical staging query instances
Creating multiple identical instances may not provide the intended canary testing benefits.
🏁 Script executed:
#!/bin/bash
# Verify if there are differences in the compiled JSON configs for v1-v6
for i in {1..6}; do
echo "=== v${i} differences ==="
if [ $i -gt 1 ]; then
diff api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v1__0 \
api/python/test/canary/compiled/staging_queries/gcp/sample_staging_query.v${i}__0 || echo "Files differ"
fi
doneLength of output: 2065
Differentiate staging query definitions
Shell verification shows each get_staging_query() call yields identical JSON except for the "name" attribute. To leverage canary testing effectively:
- Refactor get_staging_query() to accept parameters (filters, partitions, payloads, etc.) so that each query truly differs.
- Otherwise, if only unique names are required, consolidate to a single call.
Pinpointed file:
• api/python/test/canary/staging_queries/gcp/sample_staging_query.py (lines 26–31)
🤖 Prompt for AI Agents
In api/python/test/canary/staging_queries/gcp/sample_staging_query.py around
lines 26 to 31, multiple calls to get_staging_query() produce nearly identical
JSON except for the "name" attribute, which limits the effectiveness of canary
testing. Refactor get_staging_query() to accept parameters such as filters,
partitions, or payloads to generate truly distinct queries for each call.
Alternatively, if only unique names are necessary, replace the multiple calls
with a single call and adjust the code accordingly.
| TableDependency(table=get_staging_query_output_table_name(v1, True), partition_column="ds", offset=1), | ||
| TableDependency(table=get_staging_query_output_table_name(v2, True), partition_column="ds", offset=1), | ||
| TableDependency(table=get_staging_query_output_table_name(v3, True), partition_column="ds", offset=1), | ||
| TableDependency(table=get_staging_query_output_table_name(v4, True), partition_column="ds", offset=1), | ||
| TableDependency(table=get_staging_query_output_table_name(v5, True), partition_column="ds", offset=1), | ||
| TableDependency(table=get_staging_query_output_table_name(v6, True), partition_column="ds", offset=1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Terminal query dependency explosion
Changed from 1 dependency to 6 identical dependencies. This increases complexity without clear benefit.
🤖 Prompt for AI Agents
In api/python/test/canary/staging_queries/gcp/sample_staging_query.py around
lines 40 to 45, the code adds six identical TableDependency entries, which
unnecessarily increases complexity. Refactor by consolidating these dependencies
into a single TableDependency instance or reduce the number of dependencies to
only those that are distinct and necessary, avoiding duplication.
sean-zlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @thomaschow. Here is a thread after running 11/1 - 11/30 and seeing some anomalies. @david-zlai and @kumar-zlai are investigating (not likely config related but more likely workflow/API).
13f3d59 to
34d36e8
Compare
Summary
Checklist
Summary by CodeRabbit