
Conversation

@varant-zlai (Collaborator) commented Jul 20, 2025

Summary

Draft of how storage-partitioned joins (SPJ) can work in the modular flow, now that we have row_ids set on the join.

Requesting comments.

Checklist

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Documentation update

Summary by CodeRabbit

  • New Features

    • Introduced a standardized internal row ID column (row_id) across data sources, joins, and test cases.
    • Added support for optional bucketing by the internal row ID during table creation and data partitioning to improve join performance.
    • Enhanced join logic to consistently include and handle the internal row ID in key selection and output schemas.
    • Added a new test suite verifying bucketed join behavior with no shuffle in Spark using Iceberg and Hive metastore.
  • Bug Fixes

    • Added validation to ensure the internal row ID column is present in required data sources before saving or joining.
  • Refactor

    • Centralized join logic related to the row ID column and moved join utility methods for better maintainability.
    • Updated method signatures and configurations to support the new row ID paradigm and removed redundant parameters.
  • Tests

    • Updated all relevant tests and test data to include the internal row ID column, adjust schemas, and drop this column from final outputs where appropriate.
    • Enhanced test configurations to support bucketing and row ID handling.
  • Chores

    • Added new dependencies to support bucketing features in testing environments.
    • Updated sample and canary configurations to align with the new row ID requirements.
  • Documentation

    • Adjusted function and parameter documentation to reflect changes in required and optional fields, especially regarding row ID handling.

coderabbitai bot commented Jul 20, 2025

Walkthrough

This change introduces a required internal row ID column (row_id) across the codebase, updating core logic, APIs, and tests to include, propagate, and validate this column in all relevant data flows. It also implements optional bucketing by row ID in Spark table utilities, revises join logic to use this ID, and removes explicit row ID configuration from join parameters.

Changes

File(s) / Path(s) Change Summary
api/src/main/scala/ai/chronon/api/Constants.scala Add constant RowIDColumn = "row_id".
spark/src/main/scala/ai/chronon/spark/Extensions.scala, .../catalog/TableUtils.scala, CreationUtils.scala Add optional bucketByRowId parameter to table creation/insertion; propagate to SQL; add constants for row ID and bucket count.
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala, MergeJob.scala, SourceJob.scala Ensure inclusion, validation, and bucketing of internal row ID column in DataFrames; refactor join logic for row ID handling.
spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala, JoinUtils.scala, Join.scala Add/modify methods to include row ID in join keys and deduplication; move join logic to JoinUtils; apply bucketing hints.
spark/src/main/scala/ai/chronon/spark/JoinBase.scala Remove joinWithLeft method (moved to JoinUtils).
spark/src/main/scala/ai/chronon/spark/stats/CompareJob.scala, ConsistencyJob.scala Always include row ID in join keys and selects.
online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala Prepend row ID field to join codec key schema.
api/python/ai/chronon/join.py, query.py Make row_ids optional in Join; require selects in Query; validate row ID presence in selects.
api/python/test/canary/compiled/joins/..., test/sample/joins/..., test/sample/sources/test_sources.py Add "row_id" to selects and column hashes in test configs; remove rowIds from join configs.
spark/src/test/scala/ai/chronon/spark/test/... (many files) Update test data, schemas, queries, and assertions to include and drop internal row ID column as needed.
spark/BUILD.bazel Add Iceberg Spark runtime dependency for testing.
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala Add new test for bucketed joins with Iceberg and Hive metastore; verify no shuffle in physical plan.
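A rough usage sketch of the new flag described in the table above (the exact save signature is an assumption based on the Extensions entry; only bucketByRowId is confirmed by this PR):

  // write the table bucketed by the internal row ID to enable storage-partitioned joins
  df.save(outputTable, tableProperties, bucketByRowId = true)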

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DataGen
    participant SparkJob
    participant TableUtils
    participant Table
    User->>DataGen: Generate DataFrame (with row_id)
    DataGen->>SparkJob: Pass DataFrame (row_id included)
    SparkJob->>TableUtils: save(df, ..., bucketByRowId=true)
    TableUtils->>Table: Create/Insert (bucketed by row_id)
    Table->>SparkJob: Data available (bucketed, with row_id)
    SparkJob->>SparkJob: Join DataFrames (using row_id in keys)
    SparkJob->>User: Output DataFrame (row_id dropped if needed)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • tchow-zlai
  • nikhil-zlai

Poem

A row ID now in every row,
Like a secret code where data flows.
Buckets and joins, all know its name,
Ensuring each record’s never the same.
Tests and configs join the parade,
With “row_id” in every charade!
🎉🗃️



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4369b61 and 74b49b2.

📒 Files selected for processing (1)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala
⏰ Context from checks skipped due to the 90000ms timeout (18 GitHub checks; the timeout is configurable up to 15 minutes in the CodeRabbit configuration).


coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e3fd79a and b644167.

📒 Files selected for processing (8)
  • api/src/main/scala/ai/chronon/api/Extensions.scala (1 hunks)
  • api/thrift/planner.thrift (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/Driver.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/SourceJob.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (3 hunks)
🧰 Additional context used
🧠 Learnings (6)
spark/src/main/scala/ai/chronon/spark/Driver.scala (1)
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import `scala.util.ScalaVersionSpecificCollectionsConverter` in `service/src/main/java/ai/chronon/service/ApiProvider.java` is correct and should not be flagged in future reviews.
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (2)
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import `scala.util.ScalaVersionSpecificCollectionsConverter` in `service/src/main/java/ai/chronon/service/ApiProvider.java` is correct and should not be flagged in future reviews.
Learnt from: tchow-zlai
PR: zipline-ai/chronon#263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-01-24T23:55:40.650Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.
spark/src/main/scala/ai/chronon/spark/batch/SourceJob.scala (1)
Learnt from: tchow-zlai
PR: zipline-ai/chronon#263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-01-24T23:55:40.650Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.
spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala (1)
Learnt from: tchow-zlai
PR: zipline-ai/chronon#263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-01-24T23:55:40.650Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.
spark/src/main/scala/ai/chronon/spark/Extensions.scala (3)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: online/src/main/scala/ai/chronon/online/Api.scala:69-69
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In the `KVStore` trait located at `online/src/main/scala/ai/chronon/online/Api.scala`, the default implementation of the `create` method (`def create(dataset: String, props: Map[String, Any]): Unit = create(dataset)`) doesn't leverage the `props` parameter, but subclasses like `DynamoDBKVStoreImpl` use the `props` parameter in their overridden implementations.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: online/src/main/scala/ai/chronon/online/Api.scala:69-69
Timestamp: 2024-10-07T15:21:50.787Z
Learning: In the `KVStore` trait located at `online/src/main/scala/ai/chronon/online/Api.scala`, the default implementation of the `create` method (`def create(dataset: String, props: Map[String, Any]): Unit = create(dataset)`) doesn't leverage the `props` parameter, but subclasses like `DynamoDBKVStoreImpl` use the `props` parameter in their overridden implementations.
Learnt from: tchow-zlai
PR: zipline-ai/chronon#263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-01-24T23:55:40.650Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.
spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (8)
Learnt from: piyush-zlai
PR: zipline-ai/chronon#33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In `DynamoDBKVStoreImpl.scala`, refactoring methods like `extractTimedValues` and `extractListValues` to eliminate code duplication is discouraged if it would make the code more convoluted.
Learnt from: tchow-zlai
PR: zipline-ai/chronon#263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-01-24T23:55:40.650Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#50
File: spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala:19-28
Timestamp: 2024-10-31T18:29:45.027Z
Learning: In `MockKVStore` located at `spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala`, the `multiPut` method is intended to be a simple implementation without dataset existence validation, duplicate validation logic elimination, or actual storage of key-value pairs for verification.
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import `scala.util.ScalaVersionSpecificCollectionsConverter` in `service/src/main/java/ai/chronon/service/ApiProvider.java` is correct and should not be flagged in future reviews.
Learnt from: david-zlai
PR: zipline-ai/chronon#222
File: cloud_gcp/src/main/resources/additional-confs.yaml:3-3
Timestamp: 2025-01-15T21:00:35.574Z
Learning: The GCS bucket configuration `spark.chronon.table.gcs.temporary_gcs_bucket: "zl-warehouse"` should remain in the main `additional-confs.yaml` file, not in dev-specific configs.
Learnt from: chewy-zlai
PR: zipline-ai/chronon#62
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:9-10
Timestamp: 2024-11-06T21:54:56.160Z
Learning: In Spark applications, when defining serializable classes, passing an implicit `ExecutionContext` parameter can cause serialization issues. In such cases, it's acceptable to use `scala.concurrent.ExecutionContext.Implicits.global`.
Learnt from: piyush-zlai
PR: zipline-ai/chronon#53
File: hub/app/controllers/TimeSeriesController.scala:224-224
Timestamp: 2024-10-29T15:21:58.102Z
Learning: In the mocked data implementation in `hub/app/controllers/TimeSeriesController.scala`, potential `NumberFormatException` exceptions due to parsing errors (e.g., when using `val featureId = name.split("_").last.toInt`) are acceptable and will be addressed when adding the concrete backend.
Learnt from: tchow-zlai
PR: zipline-ai/chronon#156
File: spark/src/main/scala/ai/chronon/spark/stats/CompareJob.scala:85-85
Timestamp: 2024-12-24T07:16:11.896Z
Learning: The `saveUnPartitioned` method uses `SaveMode.Overwrite` by default, replacing existing data.
🧬 Code Graph Analysis (1)
api/src/main/scala/ai/chronon/api/Extensions.scala (1)
api/src/main/scala/ai/chronon/api/ScalaJavaConversions.scala (6)
  • toScala (16-22)
  • toScala (32-38)
  • toScala (41-43)
  • toScala (52-54)
  • toScala (62-68)
  • toScala (80-86)
⏰ Context from checks skipped due to the 90000ms timeout (16 GitHub checks).
🔇 Additional comments (14)
spark/src/main/scala/ai/chronon/spark/Driver.scala (1)

842-846: LGTM! Row ID expression setup looks correct.

The conditional check and method call are properly implemented for the storage partition join feature.

api/thrift/planner.thrift (1)

10-10: LGTM! Clean Thrift field addition.

The optional rowIdExpression field properly supports the new bucketing feature.

api/src/main/scala/ai/chronon/api/Extensions.scala (1)

933-935: LGTM! Solid row ID expression implementation.

The concat_ws with coalesce properly handles null values and creates a reliable composite key.
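For illustration, with hypothetical rowIds of user_id and event_id, the generated expression would be:

  concat_ws('__', coalesce(user_id, ''), coalesce(event_id, ''))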

spark/src/main/scala/ai/chronon/spark/batch/SourceJob.scala (3)

9-9: Good addition of logging import.


19-19: Good addition of logger for debugging bucketing operations.


68-79: LGTM! Proper enforcement and implementation of row ID bucketing.

The validation ensures rowIdExpression is present, and the bucketing setup looks correct for the storage partition join feature.

spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (2)

47-47: LGTM!

Row ID column correctly included in scan.


99-105: Good defensive programming.

Bucketing only applied when row ID column exists.

spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala (3)

138-139: LGTM!

Bucketing on row ID for output table.


188-188: Correct join key addition.

Row ID ensures proper join alignment.


232-236: Smart optimization.

Bucket hints leverage physical bucketing for efficient joins.

spark/src/main/scala/ai/chronon/spark/Extensions.scala (2)

144-152: Clean API extension.

Backward compatible with default empty bucket columns.


155-157: Flexible row ID addition.

Uses SQL expressions for computed columns.

spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (1)

81-82: Good naming convention.

Double underscores prevent column conflicts.
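For reference, a sketch of the constants being discussed (the exact internal column name is an assumption; the bucket count of 100 appears in the snippet below):

  // assumed names/values, per the naming-convention note above
  val internalRowIdColumnName: String = "__row_id"
  val rowIdClusterNumber: Int = 100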

Comment on lines 288 to 297
val dataFrameWriter = finalizedDf.write.mode(saveMode)

// Apply bucketing if bucket columns are specified
val bucketedWriter = if (bucketColumns.nonEmpty) {
  logger.info(s"Applying bucketing on columns: ${bucketColumns.mkString(", ")} with ${rowIdClusterNumber} buckets")
  dataFrameWriter.bucketBy(rowIdClusterNumber, bucketColumns.head, bucketColumns.tail: _*)
} else {
  dataFrameWriter
}


💡 Verification agent

🧩 Analysis chain

Clean bucketing implementation.

Proper conditional application with logging.

Verify 100 buckets is optimal for your workload:


🏁 Script executed:

#!/bin/bash
# Check if bucket count is configurable
rg -A 2 "rowIdClusterNumber|bucket.*100" --type scala

Length of output: 2340


Make bucket count configurable

The bucket count (rowIdClusterNumber) is currently hard-coded to 100. Expose this via Spark config so you can tune it per workload:

  • File: spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala

    object TableUtils {
    - val rowIdClusterNumber: Int = 100
    + val rowIdClusterNumber: Int =
    +   sparkSession.conf.getInt("chronon.bucketCount", 100)
    }

Test with different bucketCount values and document the chosen default.

📝 Committable suggestion


Suggested change
val dataFrameWriter = finalizedDf.write.mode(saveMode)

// Apply bucketing if bucket columns are specified
val bucketedWriter = if (bucketColumns.nonEmpty) {
  logger.info(s"Applying bucketing on columns: ${bucketColumns.mkString(", ")} with ${rowIdClusterNumber} buckets")
  dataFrameWriter.bucketBy(rowIdClusterNumber, bucketColumns.head, bucketColumns.tail: _*)
} else {
  dataFrameWriter
}

object TableUtils {
  // … other members …
  // Make bucket count configurable via Spark conf (default 100)
  val rowIdClusterNumber: Int =
    sparkSession.conf.getInt("chronon.bucketCount", 100)
  // … rest of TableUtils …
}
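As a usage note, if the conf route is taken, callers could tune the count per workload before writing (key name per the suggestion above):

  sparkSession.conf.set("chronon.bucketCount", "200")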

@nikhil-zlai left a comment

Have a few comments; will let tchow comment on the bucketing logic.

}

def rowIdExpression: String = {
  s"concat_ws('__', ${join.rowIds.toScala.map(c => s"coalesce($c, '')").mkString(", ")})"

If there is only one field, we should simply cast it as a string.
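A minimal sketch of that special case, reusing the expression from the excerpt above (the single-field branch is the suggested change, not current code):

  def rowIdExpression: String = {
    val ids = join.rowIds.toScala
    if (ids.size == 1) s"cast(${ids.head} as string)" // one field: a plain string cast suffices
    else s"concat_ws('__', ${ids.map(c => s"coalesce($c, '')").mkString(", ")})"
  }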

// Set the rowIdExpression if rowIds are defined on the join
if (join.rowIds != null && !join.rowIds.isEmpty) {
  import ai.chronon.api.Extensions.JoinOps
  sourceWithFilterNode.setRowIdExpression(join.rowIdExpression)

I would push this into the compiler to set _row_id as a select. That way all JoinPart jobs will also see it directly from the left side.

struct SourceWithFilterNode {
  2: optional api.Source source
  3: optional map<string,list<string>> excludeKeys
  4: optional string rowIdExpression
@tchow-zlai (Collaborator) commented Jul 20, 2025

Do we need any checks on the join itself (maybe in compile) to ensure that the row keys are at least a subset of the rowIds? I'm thinking that if the bucketing isn't aligned with the join key, the SPJ would not activate.
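A hypothetical compile-time check along these lines (all accessor names here are assumptions, for illustration only):

  // fail fast if bucketing cannot align with the join keys, since SPJ would not kick in
  val rowIds   = join.rowIds.toScala.toSet
  val joinKeys = leftKeyColumns.toSet // hypothetical: the join's left-side key columns
  require(joinKeys.subsetOf(rowIds),
    s"Join keys $joinKeys must be a subset of rowIds $rowIds for SPJ to activate")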

coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b644167 and 629c798.

📒 Files selected for processing (8)
  • api/python/ai/chronon/join.py (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (5 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/SourceJob.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala
  • spark/src/main/scala/ai/chronon/spark/batch/SourceJob.scala
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala
🧰 Additional context used
🧠 Learnings (3)
spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala (1)

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (3)

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: piyush-zlai
PR: #726
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala:456-461
Timestamp: 2025-05-02T16:19:11.001Z
Learning: When using Map-based tags with metrics reporting in Scala, values that need to be evaluated (like object properties or method calls) should not be enclosed in quotes to ensure the actual value is used rather than the literal string.

Learnt from: ken-zlai
PR: #160
File: frontend/src/routes/joins/[slug]/services/joins.service.ts:34-34
Timestamp: 2025-01-17T00:33:14.792Z
Learning: The join timeseries API in frontend/src/routes/joins/[slug]/services/joins.service.ts specifically requires 'drift' as the metric type, regardless of the metricType parameter passed to the function.

spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (1)

Learnt from: chewy-zlai
PR: #62
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:9-10
Timestamp: 2024-11-06T21:54:56.160Z
Learning: In Spark applications, when defining serializable classes, passing an implicit ExecutionContext parameter can cause serialization issues. In such cases, it's acceptable to use scala.concurrent.ExecutionContext.Implicits.global.

🧬 Code Graph Analysis (3)
spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala (2)
api/src/main/scala/ai/chronon/api/Constants.scala (1)
  • Constants (23-100)
spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (1)
  • leftDf (69-108)
api/python/ai/chronon/join.py (3)
api/src/main/scala/ai/chronon/api/Builders.scala (3)
  • events (121-130)
  • entities (107-119)
  • joinSource (132-139)
api/src/main/scala/ai/chronon/api/Extensions.scala (3)
  • query (388-396)
  • keys (607-617)
  • keys (893-906)
api/python/ai/chronon/query.py (1)
  • selects (103-126)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (5)
spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (4)
  • JoinUtils (38-550)
  • coalescedJoin (158-189)
  • joinWithLeft (191-262)
  • leftDf (69-108)
spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (2)
  • TableUtils (42-637)
  • TableUtils (639-641)
spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala (2)
  • UnionJoin (18-229)
  • computeJoinPart (112-193)
spark/src/main/scala/ai/chronon/spark/Extensions.scala (2)
  • save (141-153)
  • pretty (39-51)
api/src/main/scala/ai/chronon/api/Constants.scala (1)
  • Constants (23-100)
⏰ Context from checks skipped due to the 90000ms timeout (17 GitHub checks).
🔇 Additional comments (4)
spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala (1)

127-132: LGTM!

Adding the internal row ID column to keys ensures proper deduplication.
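A one-line sketch of the approved change (the surrounding variable names are assumptions):

  // key on the internal row ID as well, so each left row stays distinct during dedup
  val keys = baseKeys :+ Constants.RowIDColumn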

spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (1)

191-262: Well-implemented join logic with bucketing optimization.

The method correctly handles key mapping, column prefixing, and partition adjustments.

spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (2)

102-107: Good bucketing optimization.

Bucketing on row ID column will improve join performance downstream.


232-235: Nice optimization for skew-free mode.

Using UnionJoin avoids unnecessary join back to left.

@varant-zlai force-pushed the vz/row_id_optimizations branch from 05f3a13 to ae5d110 on July 21, 2025 at 18:58
coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (1)

276-346: Comprehensive join implementation with good logging.

Handles complex join scenarios correctly but consider breaking into smaller methods for maintainability.

Consider extracting key computation and date adjustment logic into separate helper methods:

+  private def computeJoinKeys(leftDataModel: DataModel, accuracy: Accuracy): Seq[String] = {
+    // Extract key computation logic
+  }
+  
+  private def adjustPartitionDates(df: DataFrame, needsAdjustment: Boolean): DataFrame = {
+    // Extract date adjustment logic
+  }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c71215f and 60a6872.

📒 Files selected for processing (8)
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (5 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/SourceJob.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/catalog/CreationUtils.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/ModularJoinTest.scala (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala
  • spark/src/main/scala/ai/chronon/spark/batch/SourceJob.scala
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala
🧰 Additional context used
🧠 Learnings (4)
spark/src/main/scala/ai/chronon/spark/catalog/CreationUtils.scala (5)

Learnt from: tchow-zlai
PR: #263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-01-24T23:55:40.650Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.

Learnt from: piyush-zlai
PR: #33
File: online/src/main/scala/ai/chronon/online/Api.scala:69-69
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In the KVStore trait located at online/src/main/scala/ai/chronon/online/Api.scala, the default implementation of the create method (def create(dataset: String, props: Map[String, Any]): Unit = create(dataset)) doesn't leverage the props parameter, but subclasses like DynamoDBKVStoreImpl use the props parameter in their overridden implementations.

Learnt from: piyush-zlai
PR: #33
File: online/src/main/scala/ai/chronon/online/Api.scala:69-69
Timestamp: 2024-10-07T15:21:50.787Z
Learning: In the KVStore trait located at online/src/main/scala/ai/chronon/online/Api.scala, the default implementation of the create method (def create(dataset: String, props: Map[String, Any]): Unit = create(dataset)) doesn't leverage the props parameter, but subclasses like DynamoDBKVStoreImpl use the props parameter in their overridden implementations.

Learnt from: chewy-zlai
PR: #47
File: online/src/main/scala/ai/chronon/online/MetadataStore.scala:232-0
Timestamp: 2024-10-17T00:12:09.763Z
Learning: In the KVStore trait located at online/src/main/scala/ai/chronon/online/KVStore.scala, there are two create methods: def create(dataset: String): Unit and def create(dataset: String, props: Map[String, Any]): Unit. The version with props ignores the props parameter, and the simpler version without props is appropriate when props are not needed.

Learnt from: tchow-zlai
PR: #263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:29-60
Timestamp: 2025-01-24T23:55:30.256Z
Learning: In BigQuery integration, table existence check is performed outside the BigQueryFormat.createTable method, at a higher level in TableUtils.createTable.

spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (3)

Learnt from: piyush-zlai
PR: #43
File: hub/app/controllers/TimeSeriesController.scala:320-320
Timestamp: 2024-10-14T18:44:24.599Z
Learning: In hub/app/controllers/TimeSeriesController.scala, the generateMockTimeSeriesPercentilePoints method contains placeholder code that will be replaced with the actual implementation soon.

Learnt from: piyush-zlai
PR: #53
File: hub/app/controllers/TimeSeriesController.scala:224-224
Timestamp: 2024-10-29T15:21:58.102Z
Learning: In the mocked data implementation in hub/app/controllers/TimeSeriesController.scala, potential NumberFormatException exceptions due to parsing errors (e.g., when using val featureId = name.split("_").last.toInt) are acceptable and will be addressed when adding the concrete backend.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

spark/src/test/scala/ai/chronon/spark/test/batch/ModularJoinTest.scala (7)

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

Learnt from: piyush-zlai
PR: #44
File: hub/app/controllers/ModelController.scala:15-18
Timestamp: 2024-10-17T19:46:42.629Z
Learning: References to MockDataService in hub/test/controllers/SearchControllerSpec.scala and hub/test/controllers/ModelControllerSpec.scala are needed for tests and should not be removed.

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: piyush-zlai
PR: #53
File: hub/app/controllers/TimeSeriesController.scala:224-224
Timestamp: 2024-10-29T15:21:58.102Z
Learning: In the mocked data implementation in hub/app/controllers/TimeSeriesController.scala, potential NumberFormatException exceptions due to parsing errors (e.g., when using val featureId = name.split("_").last.toInt) are acceptable and will be addressed when adding the concrete backend.

Learnt from: piyush-zlai
PR: #43
File: hub/app/controllers/TimeSeriesController.scala:320-320
Timestamp: 2024-10-14T18:44:24.599Z
Learning: In hub/app/controllers/TimeSeriesController.scala, the generateMockTimeSeriesPercentilePoints method contains placeholder code that will be replaced with the actual implementation soon.

Learnt from: chewy-zlai
PR: #50
File: spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala:19-28
Timestamp: 2024-10-31T18:29:45.027Z
Learning: In MockKVStore located at spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala, the multiPut method is intended to be a simple implementation without dataset existence validation, duplicate validation logic elimination, or actual storage of key-value pairs for verification.

Learnt from: chewy-zlai
PR: #62
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:9-10
Timestamp: 2024-11-06T21:54:56.160Z
Learning: In Spark applications, when defining serializable classes, passing an implicit ExecutionContext parameter can cause serialization issues. In such cases, it's acceptable to use scala.concurrent.ExecutionContext.Implicits.global.

spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (3)

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: piyush-zlai
PR: #726
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala:456-461
Timestamp: 2025-05-02T16:19:11.001Z
Learning: When using Map-based tags with metrics reporting in Scala, values that need to be evaluated (like object properties or method calls) should not be enclosed in quotes to ensure the actual value is used rather than the literal string.

Learnt from: ken-zlai
PR: #160
File: frontend/src/routes/joins/[slug]/services/joins.service.ts:34-34
Timestamp: 2025-01-17T00:33:14.792Z
Learning: The join timeseries API in frontend/src/routes/joins/[slug]/services/joins.service.ts specifically requires 'drift' as the metric type, regardless of the metricType parameter passed to the function.

⏰ Context from checks skipped due to the 90000ms timeout (18 GitHub checks).
🔇 Additional comments (15)
spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (2)

67-68: Clean API addition for row ID support.

The optional parameter with sensible default maintains backward compatibility.


78-82: Correct implementation of conditional row ID generation.

Using tableUtils.internalRowIdColumnName and uuid() ensures consistency and uniqueness.
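A minimal sketch of that conditional generation (the flag name is an assumption; uuid() is invoked as a SQL expression):

  import org.apache.spark.sql.functions.expr
  // append a unique internal row ID only when requested
  val withRowId =
    if (includeRowId) df.withColumn(tableUtils.internalRowIdColumnName, expr("uuid()"))
    else df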

spark/src/main/scala/ai/chronon/spark/catalog/CreationUtils.scala (3)

14-16: Clean addition of optional bucketing parameters.

Option types with None defaults maintain backward compatibility.


34-51: Logical separation of partitions and bucketing.

The refactored condition correctly handles bucketing-only scenarios.


53-70: Approve CLUSTERED BY generation
The CLUSTERED BY (<column>) INTO <n> BUCKETS fragment is valid Spark SQL. Please confirm your target metastore or engine supports this syntax.

  • File: spark/src/main/scala/ai/chronon/spark/catalog/CreationUtils.scala (Lines 53–57)
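For reference, a DDL sketch with the fragment in context (hypothetical table and columns; default bucket count of 100):

  spark.sql("""
    CREATE TABLE db.events (user_id STRING, row_id STRING, ds STRING)
    USING PARQUET
    PARTITIONED BY (ds)
    CLUSTERED BY (row_id) INTO 100 BUCKETS
  """)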
spark/src/test/scala/ai/chronon/spark/test/batch/ModularJoinTest.scala (4)

21-21: Verify rationale for disabling Hive support.

Ensure this change is intentional and doesn't break existing functionality.


25-25: Reduced test time window for focused testing.

35-day window should be sufficient for join logic validation.


119-119: Correct enablement of row ID for SPJ testing.

Test data now includes the internal row ID column as required.


261-261: Updated expected schema includes row ID column.

Test assertions now correctly account for the new internal row ID.

spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (6)

11-11: Required imports for enhanced join functionality.

New imports support date manipulation and coalesced join logic.

Also applies to: 16-16


48-52: Correct inclusion of internal row ID in left columns.

Row ID column is now properly propagated through the join pipeline.


101-103: Smart conditional bucketing on row ID column.

Bucketing is appropriately applied only when the row ID column exists.


210-246: Sophisticated optimization of right DataFrame computation.

The shouldJoinToLeft flag elegantly avoids unnecessary joins. Skew-free mode provides performance benefits.


248-252: Efficient conditional join implementation.

Avoids unnecessary joins when required columns are already present.


254-266: Consistent variable naming in derivations logic.

Updated to use rightDfWithAllCols maintaining correct derivations processing.

@varant-zlai force-pushed the vz/row_id_optimizations branch from 5957929 to c1c5d77 on July 22, 2025 at 00:21
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (1)

276-337: Consider breaking down complex method for maintainability.

The joinWithLeft method handles multiple scenarios with complex conditional logic. Consider extracting helper methods for key computation, DataFrame adjustments, and join key alignment.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 160a1e4 and 48be74a.

📒 Files selected for processing (1)
  • spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (5 hunks)
🧰 Additional context used
🧠 Learnings (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (3)

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: piyush-zlai
PR: #726
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala:456-461
Timestamp: 2025-05-02T16:19:11.001Z
Learning: When using Map-based tags with metrics reporting in Scala, values that need to be evaluated (like object properties or method calls) should not be enclosed in quotes to ensure the actual value is used rather than the literal string.

Learnt from: ken-zlai
PR: #160
File: frontend/src/routes/joins/[slug]/services/joins.service.ts:34-34
Timestamp: 2025-01-17T00:33:14.792Z
Learning: The join timeseries API in frontend/src/routes/joins/[slug]/services/joins.service.ts specifically requires 'drift' as the metric type, regardless of the metricType parameter passed to the function.

⏰ Context from checks skipped due to the 90000ms timeout (18 GitHub checks).
🔇 Additional comments (5)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (5)

11-11: Import additions support new functionality correctly.

Also applies to: 16-16


48-49: Row ID column inclusion is essential for SPJ.


101-103: Conditional bucketing by row ID enhances SPJ performance.


248-252: Conditional join logic optimizes performance appropriately.


210-246: Ensure consistent output schema across all join paths.

Verify that the DataFrame returned when produceFinalJoinOutput = false still includes internalRowIdColumnName, so that downstream logic using shouldJoinToLeft = false remains valid.

• In JoinPartJob.scala (lines ~228–235): confirm that the UnionJoin branch indeed emits the row ID column.
• In UnionJoin.scala ( computeJoinPart implementation): inspect the code path for produceFinalJoinOutput = false and ensure tableUtils.internalRowIdColumnName is preserved.
• Revisit the TODOs in the traditional temporal branches—once those methods include the row ID, you can safely set shouldJoinToLeft = false.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala (1)

249-251: Internal row ID field addition looks good

Consider addressing the TODO comment to move the column name constant to Constants.scala for better code organization.
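A sketch of what the prepend might look like (using Chronon's api StructField/StringType; the surrounding names are assumptions):

  // put the row ID first in the join codec's key schema
  val keyFields = StructField(Constants.RowIDColumn, StringType) +: baseKeySchema.fields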

spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobAnalyzeReuseTest.scala (1)

230-230: Consider removing debug statement

The leftDf.show() call appears to be for debugging purposes and could be removed for cleaner test code.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d2ea4d and 6b1323b.

📒 Files selected for processing (10)
  • online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala (5 hunks)
  • spark/src/main/scala/ai/chronon/spark/stats/ConsistencyJob.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/TestUtils.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationBootstrapTest.scala (3 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobAnalyzeReuseTest.scala (8 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobVersioningTest.scala (6 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/NoAggTest.scala (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • spark/src/test/scala/ai/chronon/spark/test/join/NoAggTest.scala
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala
🧰 Additional context used
🧠 Learnings (8)
spark/src/main/scala/ai/chronon/spark/stats/ConsistencyJob.scala (1)

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

spark/src/test/scala/ai/chronon/spark/test/TestUtils.scala (5)

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: chewy-zlai
PR: #50
File: spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala:19-28
Timestamp: 2024-10-31T18:29:45.027Z
Learning: In MockKVStore located at spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala, the multiPut method is intended to be a simple implementation without dataset existence validation, duplicate validation logic elimination, or actual storage of key-value pairs for verification.

Learnt from: nikhil-zlai
PR: #793
File: spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala:95-106
Timestamp: 2025-05-25T15:57:30.687Z
Learning: Spark SQL's array_sort function requires INT casting in comparator expressions, even for timestamp differences. LONG casting is not supported in this context despite potential overflow concerns.

Learnt from: piyush-zlai
PR: #53
File: hub/app/controllers/TimeSeriesController.scala:224-224
Timestamp: 2024-10-29T15:21:58.102Z
Learning: In the mocked data implementation in hub/app/controllers/TimeSeriesController.scala, potential NumberFormatException exceptions due to parsing errors (e.g., when using val featureId = name.split("_").last.toInt) are acceptable and will be addressed when adding the concrete backend.

spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobAnalyzeReuseTest.scala (5)

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: piyush-zlai
PR: #44
File: hub/test/store/DynamoDBMonitoringStoreTest.scala:69-86
Timestamp: 2024-10-15T15:33:22.265Z
Learning: In hub/test/store/DynamoDBMonitoringStoreTest.scala, the current implementation of the generateListResponse method is acceptable as-is, and changes for resource handling and error management are not necessary at this time.

Learnt from: piyush-zlai
PR: #53
File: hub/app/controllers/TimeSeriesController.scala:224-224
Timestamp: 2024-10-29T15:21:58.102Z
Learning: In the mocked data implementation in hub/app/controllers/TimeSeriesController.scala, potential NumberFormatException exceptions due to parsing errors (e.g., when using val featureId = name.split("_").last.toInt) are acceptable and will be addressed when adding the concrete backend.

Learnt from: nikhil-zlai
PR: #793
File: spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala:95-106
Timestamp: 2025-05-25T15:57:30.687Z
Learning: Spark SQL's array_sort function requires INT casting in comparator expressions, even for timestamp differences. LONG casting is not supported in this context despite potential overflow concerns.

spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationBootstrapTest.scala (2)

Learnt from: piyush-zlai
PR: #44
File: hub/app/controllers/ModelController.scala:15-18
Timestamp: 2024-10-17T19:46:42.629Z
Learning: References to MockDataService in hub/test/controllers/SearchControllerSpec.scala and hub/test/controllers/ModelControllerSpec.scala are needed for tests and should not be removed.

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala (7)

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In DynamoDBKVStoreImpl.scala, refactoring methods like extractTimedValues and extractListValues to eliminate code duplication is discouraged if it would make the code more convoluted.

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: chewy-zlai
PR: #50
File: spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala:19-28
Timestamp: 2024-10-31T18:29:45.027Z
Learning: In MockKVStore located at spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala, the multiPut method is intended to be a simple implementation without dataset existence validation, duplicate validation logic elimination, or actual storage of key-value pairs for verification.

Learnt from: piyush-zlai
PR: #33
File: online/src/main/scala/ai/chronon/online/Api.scala:69-69
Timestamp: 2024-10-07T15:21:50.787Z
Learning: In the KVStore trait located at online/src/main/scala/ai/chronon/online/Api.scala, the default implementation of the create method (def create(dataset: String, props: Map[String, Any]): Unit = create(dataset)) doesn't leverage the props parameter, but subclasses like DynamoDBKVStoreImpl use the props parameter in their overridden implementations.

Learnt from: piyush-zlai
PR: #33
File: online/src/main/scala/ai/chronon/online/Api.scala:69-69
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In the KVStore trait located at online/src/main/scala/ai/chronon/online/Api.scala, the default implementation of the create method (def create(dataset: String, props: Map[String, Any]): Unit = create(dataset)) doesn't leverage the props parameter, but subclasses like DynamoDBKVStoreImpl use the props parameter in their overridden implementations.

Learnt from: chewy-zlai
PR: #47
File: online/src/main/scala/ai/chronon/online/MetadataStore.scala:232-0
Timestamp: 2024-10-17T00:12:09.763Z
Learning: In the KVStore trait located at online/src/main/scala/ai/chronon/online/KVStore.scala, there are two create methods: def create(dataset: String): Unit and def create(dataset: String, props: Map[String, Any]): Unit. The version with props ignores the props parameter, and the simpler version without props is appropriate when props are not needed.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (2)

Learnt from: nikhil-zlai
PR: #50
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:19-47
Timestamp: 2024-11-03T14:51:40.825Z
Learning: In Scala, the grouped method on collections returns an iterator, allowing for efficient batch processing without accumulating all records in memory.

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (8)

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: piyush-zlai
PR: #43
File: hub/app/controllers/TimeSeriesController.scala:320-320
Timestamp: 2024-10-14T18:44:24.599Z
Learning: In hub/app/controllers/TimeSeriesController.scala, the generateMockTimeSeriesPercentilePoints method contains placeholder code that will be replaced with the actual implementation soon.

Learnt from: piyush-zlai
PR: #53
File: hub/app/controllers/TimeSeriesController.scala:224-224
Timestamp: 2024-10-29T15:21:58.102Z
Learning: In the mocked data implementation in hub/app/controllers/TimeSeriesController.scala, potential NumberFormatException exceptions due to parsing errors (e.g., when using val featureId = name.split("_").last.toInt) are acceptable and will be addressed when adding the concrete backend.

Learnt from: piyush-zlai
PR: #44
File: hub/test/store/DynamoDBMonitoringStoreTest.scala:69-86
Timestamp: 2024-10-15T15:33:22.265Z
Learning: In hub/test/store/DynamoDBMonitoringStoreTest.scala, the current implementation of the generateListResponse method is acceptable as-is, and changes for resource handling and error management are not necessary at this time.

Learnt from: chewy-zlai
PR: #50
File: spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala:19-28
Timestamp: 2024-10-31T18:29:45.027Z
Learning: In MockKVStore located at spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala, the multiPut method is intended to be a simple implementation without dataset existence validation, duplicate validation logic elimination, or actual storage of key-value pairs for verification.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In DynamoDBKVStoreImpl.scala, refactoring methods like extractTimedValues and extractListValues to eliminate code duplication is discouraged if it would make the code more convoluted.

Learnt from: piyush-zlai
PR: #44
File: hub/app/controllers/ModelController.scala:15-18
Timestamp: 2024-10-17T19:46:42.629Z
Learning: References to MockDataService in hub/test/controllers/SearchControllerSpec.scala and hub/test/controllers/ModelControllerSpec.scala are needed for tests and should not be removed.

spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobVersioningTest.scala (8)

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

Learnt from: chewy-zlai
PR: #50
File: spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala:19-28
Timestamp: 2024-10-31T18:29:45.027Z
Learning: In MockKVStore located at spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala, the multiPut method is intended to be a simple implementation without dataset existence validation, duplicate validation logic elimination, or actual storage of key-value pairs for verification.

Learnt from: piyush-zlai
PR: #44
File: hub/test/store/DynamoDBMonitoringStoreTest.scala:69-86
Timestamp: 2024-10-15T15:33:22.265Z
Learning: In hub/test/store/DynamoDBMonitoringStoreTest.scala, the current implementation of the generateListResponse method is acceptable as-is, and changes for resource handling and error management are not necessary at this time.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In DynamoDBKVStoreImpl.scala, refactoring methods like extractTimedValues and extractListValues to eliminate code duplication is discouraged if it would make the code more convoluted.

Learnt from: piyush-zlai
PR: #726
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala:456-461
Timestamp: 2025-05-02T16:19:11.001Z
Learning: When using Map-based tags with metrics reporting in Scala, values that need to be evaluated (like object properties or method calls) should not be enclosed in quotes to ensure the actual value is used rather than the literal string.

Learnt from: nikhil-zlai
PR: #793
File: spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala:95-106
Timestamp: 2025-05-25T15:57:30.687Z
Learning: Spark SQL's array_sort function requires INT casting in comparator expressions, even for timestamp differences. LONG casting is not supported in this context despite potential overflow concerns.

Learnt from: piyush-zlai
PR: #53
File: hub/app/controllers/TimeSeriesController.scala:224-224
Timestamp: 2024-10-29T15:21:58.102Z
Learning: In the mocked data implementation in hub/app/controllers/TimeSeriesController.scala, potential NumberFormatException exceptions due to parsing errors (e.g., when using val featureId = name.split("_").last.toInt) are acceptable and will be addressed when adding the concrete backend.

🧬 Code Graph Analysis (3)
spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobAnalyzeReuseTest.scala (1)
spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (1)
  • leftDf (69-108)
spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationBootstrapTest.scala (1)
spark/src/main/scala/ai/chronon/spark/JoinBase.scala (1)
  • computeJoin (170-172)
online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala (1)
api/src/main/scala/ai/chronon/api/DataType.scala (2)
  • StructField (218-218)
  • StringType (207-207)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: analyzer_tests
  • GitHub Check: groupby_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: join_tests
  • GitHub Check: batch_tests
  • GitHub Check: streaming_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: service_tests
  • GitHub Check: spark_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: api_tests
  • GitHub Check: online_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: flink_tests
  • GitHub Check: python_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (16)
spark/src/main/scala/ai/chronon/spark/stats/ConsistencyJob.scala (1)

52-52: Clean addition of internal row ID to query selects

Good consistency with the broader internal row ID implementation across the codebase.

spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (1)

604-610: Proper handling of internal row ID in test

Good pattern: include the internal column for processing, then drop it before assertions (sketched below).
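
A minimal sketch of the pattern, with the column name as an assumption:

    import org.apache.spark.sql.DataFrame

    // Keep row_id through the pipeline, but strip it before expected/actual
    // comparisons so assertions stay focused on business columns.
    def comparable(df: DataFrame, rowIdCol: String = "row_id"): DataFrame =
      if (df.columns.contains(rowIdCol)) df.drop(rowIdCol) else df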

spark/src/test/scala/ai/chronon/spark/test/TestUtils.scala (2)

29-29: Import added for UUID generation

Clean addition for generating unique row identifiers.


418-430: Good integration of internal row ID in test data

Using uuid() for unique identifiers and properly including it in the join query selects; a standalone sketch of the idea is below.
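
For reference, a self-contained sketch of the idea outside the Chronon builders; the table and column names are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    val spark = SparkSession.builder().master("local[*]").appName("row-id-demo").getOrCreate()
    import spark.implicits._

    // Mirror the "row_id" -> "uuid()" select: every row gets a unique identifier.
    val events = Seq(("u1", 1L), ("u2", 2L)).toDF("user", "ts")
    val withRowId = events.withColumn("row_id", expr("uuid()"))
    withRowId.show(false)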

spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationBootstrapTest.scala (3)

141-141: Consistent internal row ID handling in tests

Properly drops internal column before assertions, following established test pattern.


251-251: Consistent pattern maintained

Good continuation of dropping internal row ID before test comparisons.


355-355: Pattern consistently applied

Third test case correctly handles internal row ID column removal.

spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobVersioningTest.scala (3)

140-140: LGTM - Internal row ID column added to production metadata


262-262: LGTM - Consistent internal row ID in new join metadata


380-380: LGTM - Consistent internal row ID additions across all test scenarios

All tests properly include the internal row ID column with appropriate hash values for their specific scenarios.

Also applies to: 472-472, 670-670, 746-746

spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobAnalyzeReuseTest.scala (4)

149-150: LGTM - Production schema properly extended with internal row ID


155-156: LGTM - Test data properly includes internal row ID values


170-171: LGTM - Column hashes properly include new columns


249-249: LGTM - Consistent internal row ID metadata across all test cases

All column hash maps properly include the internal row ID column with consistent naming.

Also applies to: 374-375, 453-454, 579-580, 635-636

spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (2)

195-195: LGTM - Internal row ID properly included in join keys


737-749: LGTM - Left source explicitly selects internal row ID column

The explicit column selection ensures the internal row ID is available for join operations.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (1)

296-301: Route schema output through the logger.

The bare `.pretty` expressions evaluate the schema strings and discard the result; emitting them via logger.debug keeps the job's logging consistent.

-    leftDf.schema.pretty
-    rightDf.schema.pretty
-
-    val allLeftCols = keys ++ additionalLeftColumnsToInclude
+    logger.debug(s"Left schema: ${leftDf.schema.pretty}")
+    logger.debug(s"Right schema: ${rightDf.schema.pretty}")
+
+    val allLeftCols = keys ++ additionalLeftColumnsToInclude
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d07686b and 7428a27.

📒 Files selected for processing (5)
  • online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala (2 hunks)
  • spark/BUILD.bazel (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (5 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala (7 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (8 hunks)
✅ Files skipped from review due to trivial changes (1)
  • spark/BUILD.bazel
🚧 Files skipped from review as they are similar to previous changes (3)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala
  • online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala
🧰 Additional context used
🧠 Learnings (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (5)

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In DynamoDBKVStoreImpl.scala, refactoring methods like extractTimedValues and extractListValues to eliminate code duplication is discouraged if it would make the code more convoluted.

Learnt from: tchow-zlai
PR: #192
File: spark/src/main/scala/ai/chronon/spark/GroupBy.scala:296-299
Timestamp: 2025-01-09T17:57:34.451Z
Learning: In Spark SQL date handling:

  • date_format() converts dates to strings (used for partition columns which need string format)
  • to_date() converts strings to DateType (used when date operations are needed)
    These are opposites and should not be standardized to use the same function.

Learnt from: nikhil-zlai
PR: #793
File: spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala:95-106
Timestamp: 2025-05-25T15:57:30.687Z
Learning: Spark SQL's array_sort function requires INT casting in comparator expressions, even for timestamp differences. LONG casting is not supported in this context despite potential overflow concerns.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: groupby_tests
  • GitHub Check: streaming_tests
  • GitHub Check: join_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: batch_tests
  • GitHub Check: spark_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: service_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: api_tests
  • GitHub Check: online_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: flink_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (5)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (5)

8-8: Import additions look correct.

New imports support the row ID functionality and coalesced join operations needed for SPJ.

Also applies to: 12-12, 17-17


49-50: Row ID column properly included in left column selection.

Ensures the internal row ID is available for downstream join operations when no explicit context is provided.


102-104: Conditional bucketing implementation is sound.

Only applies row ID bucketing when the column exists in the DataFrame, preventing runtime errors. A sketch of the guard is below.
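
A minimal sketch of this guard around a bucketed write; the bucket count and the direct bucketBy/saveAsTable call are illustrative, since the PR's actual plumbing goes through the table utils:

    import org.apache.spark.sql.DataFrame

    // Request bucketing only when the row ID column is actually present.
    def writeMaybeBucketed(df: DataFrame, table: String, numBuckets: Int = 8): Unit = {
      val writer = df.write.mode("overwrite")
      if (df.columns.contains("row_id")) writer.bucketBy(numBuckets, "row_id").saveAsTable(table)
      else writer.saveAsTable(table)
    }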


310-321: Date manipulation logic appears correct.

The join key adjustment properly handles snapshot accuracy by incrementing the partition date to align with left side timestamps.


277-336: New joinWithLeft method centralizes join logic effectively.

This method properly handles different join scenarios and provides detailed logging. The coalesced join approach should improve performance; a rough sketch of the idea follows.
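
For readers unfamiliar with the pattern, a rough, self-contained sketch of what a coalesced left join does; this is an illustration, not the JoinUtils.coalescedJoin implementation, and the preference for the right side is an assumption:

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.{coalesce, col}

    // Left-outer join where columns shared by both sides collapse into one,
    // preferring the right (freshly computed) value and falling back to the left.
    def coalescedLeftJoin(left: DataFrame, right: DataFrame, keys: Seq[String]): DataFrame = {
      val joined = left.alias("l").join(right.alias("r"), keys, "left_outer")
      val shared = left.columns.toSet.intersect(right.columns.toSet).diff(keys.toSet)
      val leftOnly = left.columns.filterNot(c => shared(c) || keys.contains(c))
      val rightOnly = right.columns.filterNot(c => shared(c) || keys.contains(c))
      val cols: Seq[Column] =
        keys.map(col) ++
          shared.toSeq.map(c => coalesce(col(s"r.$c"), col(s"l.$c")).alias(c)) ++
          leftOnly.map(c => col(s"l.$c")) ++
          rightOnly.map(c => col(s"r.$c"))
      joined.select(cols: _*)
    }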

Comment on lines +303 to +307
    if (node.leftDataModel == DataModel.EVENTS && !leftDf.columns.contains(Constants.TimePartitionColumn)) {
      leftDf.withTimeBasedColumn(Constants.TimePartitionColumn)
    } else {
      leftDf
    }.select(allLeftCols.map(column): _*)

🛠️ Refactor suggestion

TimePartitionColumn generation needs a precondition.

Before calling withTimeBasedColumn, validate that Constants.TimeColumn is present; the derived partition column cannot be built without it.

   val leftDfWithRelevantCols =
     if (node.leftDataModel == DataModel.EVENTS && !leftDf.columns.contains(Constants.TimePartitionColumn)) {
-      leftDf.withTimeBasedColumn(Constants.TimePartitionColumn)
+      if (leftDf.columns.contains(Constants.TimeColumn)) {
+        leftDf.withTimeBasedColumn(Constants.TimePartitionColumn)
+      } else {
+        throw new IllegalStateException(s"Missing ${Constants.TimeColumn} for time-based column generation")
+      }
     } else {
       leftDf
     }.select(allLeftCols.map(column): _*)
🤖 Prompt for AI Agents
In spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala around lines
303 to 307, the code derives TimePartitionColumn via withTimeBasedColumn without
first confirming that Constants.TimeColumn is present. Add a presence check for
Constants.TimeColumn and throw a descriptive IllegalStateException when it is
missing, so the failure surfaces early with a clear message.

@varant-zlai varant-zlai force-pushed the vz/row_id_optimizations branch from 1efac31 to 9bfe306 on July 25, 2025 00:11
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (2)

211-247: Complex join logic needs test coverage verification.

The new conditional logic handles different data model combinations but introduces significant complexity. The skewFree mode path bypasses traditional join-back logic.

Add tests for all model/accuracy paths (incl. skewFreeMode).


303-307: TimePartitionColumn logic needs validation.

The conditional column existence check should validate required columns before using withTimeBasedColumn.

Add validation to ensure Constants.TimeColumn exists before generating time-based column.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1efac31 and 9bfe306.

📒 Files selected for processing (69)
  • api/python/ai/chronon/join.py (3 hunks)
  • api/python/ai/chronon/query.py (1 hunks)
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_batch_v1__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_combined_v1__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_streaming_v1__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_dev__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_dev_notds__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_test__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_test_notds__0 (2 hunks)
  • api/python/test/canary/joins/gcp/item_event_join.py (1 hunks)
  • api/python/test/canary/joins/gcp/training_set.py (2 hunks)
  • api/python/test/sample/joins/kaggle/outbrain.py (1 hunks)
  • api/python/test/sample/joins/quickstart/training_set.py (1 hunks)
  • api/python/test/sample/joins/risk/user_transactions.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_backfill_mutation_join.py (0 hunks)
  • api/python/test/sample/joins/sample_team/sample_chaining_join.py (0 hunks)
  • api/python/test/sample/joins/sample_team/sample_chaining_join_parent.py (0 hunks)
  • api/python/test/sample/joins/sample_team/sample_join.py (0 hunks)
  • api/python/test/sample/sources/test_sources.py (7 hunks)
  • api/src/main/scala/ai/chronon/api/Constants.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala (2 hunks)
  • spark/BUILD.bazel (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/Join.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinBase.scala (0 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (5 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala (7 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/SourceJob.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/catalog/CreationUtils.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (4 hunks)
  • spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/stats/CompareJob.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/stats/ConsistencyJob.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/MigrationCompareTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/TestUtils.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationBootstrapTest.scala (3 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/EvalTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobAnalyzeReuseTest.scala (9 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobVersioningTest.scala (14 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/ModularJoinTest.scala (10 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/ShortNamesTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/ChainingFetcherTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (13 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByUploadTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/BaseJoinTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/DifferentPartitionColumnsTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EndPartitionJoinTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EntitiesEntitiesTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEntitiesSnapshotTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsCumulativeTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsSnapshotTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsTemporalTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsTemporalWithGBDerivation.scala (5 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/HeterogeneousPartitionColumnsTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/KeyMappingOverlappingFieldsTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/NoAggTest.scala (3 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/NoHistoricalBackfillTest.scala (3 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/SelectedJoinPartsTest.scala (5 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/SkipBloomFilterJoinBackfillTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/StructJoinTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/UnionJoinTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/VersioningTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (4 hunks)
💤 Files with no reviewable changes (5)
  • api/python/test/sample/joins/sample_team/sample_backfill_mutation_join.py
  • api/python/test/sample/joins/sample_team/sample_chaining_join_parent.py
  • api/python/test/sample/joins/sample_team/sample_chaining_join.py
  • api/python/test/sample/joins/sample_team/sample_join.py
  • spark/src/main/scala/ai/chronon/spark/JoinBase.scala
✅ Files skipped from review due to trivial changes (5)
  • spark/BUILD.bazel
  • api/python/ai/chronon/query.py
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationBootstrapTest.scala
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala
🚧 Files skipped from review as they are similar to previous changes (58)
  • spark/src/test/scala/ai/chronon/spark/test/batch/ShortNamesTest.scala
  • api/python/test/sample/joins/risk/user_transactions.py
  • spark/src/test/scala/ai/chronon/spark/test/join/DifferentPartitionColumnsTest.scala
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_test__0
  • spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/NoAggTest.scala
  • api/python/test/canary/joins/gcp/item_event_join.py
  • spark/src/main/scala/ai/chronon/spark/Join.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/SkipBloomFilterJoinBackfillTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByUploadTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/batch/EvalTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/HeterogeneousPartitionColumnsTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/StructJoinTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/KeyMappingOverlappingFieldsTest.scala
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_test_notds__0
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsSnapshotTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsCumulativeTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEntitiesSnapshotTest.scala
  • api/python/test/canary/joins/gcp/training_set.py
  • spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_dev__0
  • spark/src/test/scala/ai/chronon/spark/test/join/EntitiesEntitiesTest.scala
  • api/python/test/sample/joins/kaggle/outbrain.py
  • spark/src/test/scala/ai/chronon/spark/test/TestUtils.scala
  • spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala
  • api/src/main/scala/ai/chronon/api/Constants.scala
  • spark/src/main/scala/ai/chronon/spark/stats/ConsistencyJob.scala
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_dev_notds__0
  • spark/src/test/scala/ai/chronon/spark/test/join/SelectedJoinPartsTest.scala
  • online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala
  • spark/src/test/scala/ai/chronon/spark/test/MigrationCompareTest.scala
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_batch_v1__0
  • spark/src/main/scala/ai/chronon/spark/batch/SourceJob.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsTemporalTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/UnionJoinTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/ChainingFetcherTest.scala
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_streaming_v1__0
  • spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_combined_v1__0
  • spark/src/main/scala/ai/chronon/spark/stats/CompareJob.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/EndPartitionJoinTest.scala
  • api/python/test/sample/joins/quickstart/training_set.py
  • spark/src/test/scala/ai/chronon/spark/test/batch/ModularJoinTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/VersioningTest.scala
  • api/python/test/sample/sources/test_sources.py
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsTemporalWithGBDerivation.scala
  • spark/src/main/scala/ai/chronon/spark/JoinUtils.scala
  • spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobAnalyzeReuseTest.scala
  • api/python/ai/chronon/join.py
  • spark/src/main/scala/ai/chronon/spark/catalog/CreationUtils.scala
  • spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/NoHistoricalBackfillTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/BaseJoinTest.scala
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala
  • spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala
  • spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobVersioningTest.scala
🧰 Additional context used
🧠 Learnings (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (7)

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In DynamoDBKVStoreImpl.scala, refactoring methods like extractTimedValues and extractListValues to eliminate code duplication is discouraged if it would make the code more convoluted.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

Learnt from: piyush-zlai
PR: #43
File: hub/app/controllers/TimeSeriesController.scala:320-320
Timestamp: 2024-10-14T18:44:24.599Z
Learning: In hub/app/controllers/TimeSeriesController.scala, the generateMockTimeSeriesPercentilePoints method contains placeholder code that will be replaced with the actual implementation soon.

Learnt from: piyush-zlai
PR: #53
File: hub/app/controllers/TimeSeriesController.scala:224-224
Timestamp: 2024-10-29T15:21:58.102Z
Learning: In the mocked data implementation in hub/app/controllers/TimeSeriesController.scala, potential NumberFormatException exceptions due to parsing errors (e.g., when using val featureId = name.split("_").last.toInt) are acceptable and will be addressed when adding the concrete backend.

Learnt from: tchow-zlai
PR: #192
File: spark/src/main/scala/ai/chronon/spark/GroupBy.scala:296-299
Timestamp: 2025-01-09T17:57:34.451Z
Learning: In Spark SQL date handling:

  • date_format() converts dates to strings (used for partition columns which need string format)
  • to_date() converts strings to DateType (used when date operations are needed)
    These are opposites and should not be standardized to use the same function.

Learnt from: nikhil-zlai
PR: #793
File: spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala:95-106
Timestamp: 2025-05-25T15:57:30.687Z
Learning: Spark SQL's array_sort function requires INT casting in comparator expressions, even for timestamp differences. LONG casting is not supported in this context despite potential overflow concerns.

🧬 Code Graph Analysis (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (4)
api/src/main/scala/ai/chronon/api/Constants.scala (1)
  • Constants (23-101)
spark/src/main/scala/ai/chronon/spark/Extensions.scala (3)
  • Extensions (36-312)
  • save (141-153)
  • pretty (39-51)
spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (4)
  • JoinUtils (39-508)
  • coalescedJoin (159-190)
  • joinWithLeft (479-506)
  • leftDf (70-109)
spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala (2)
  • UnionJoin (18-229)
  • computeJoinPart (112-193)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: cloud_aws_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: online_tests
  • GitHub Check: flink_tests
  • GitHub Check: join_tests
  • GitHub Check: api_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: streaming_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: service_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: groupby_tests
  • GitHub Check: python_tests
  • GitHub Check: batch_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: spark_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (6)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (6)

8-8: LGTM on imports.

Adding Constants and JoinUtils.coalescedJoin imports support the new row ID handling and join logic.

Also applies to: 12-12


49-53: Row ID inclusion looks correct.

Adding Constants.RowIDColumn to relevant left columns ensures the internal row ID is available for downstream processing and bucketing.


102-104: Conditional bucketing implementation is sound.

Checking for row ID column presence before enabling bucketing prevents errors when the column is missing.
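
A sketch of that guard, assuming the save extension's bucketByRowId flag added in this PR (parameter names are approximations):

// Only request bucketing when the row ID column is actually present.
val bucket = df.columns.contains(Constants.RowIDColumn)
df.save(outputTable, tableProperties, bucketByRowId = bucket)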


249-253: Join-back logic is well-structured.

The conditional join-back based on shouldJoinToLeft flag provides flexibility for different compute modes while maintaining correctness.


258-260: Row ID inclusion in derivations keys is correct.

Ensuring row ID is included in ensureKeys prevents it from being filtered out during derivation processing.
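
Roughly, the idea looks like this; runDerivations and derivations are hypothetical stand-ins for the actual derivation helper and config:

// Keep row_id in the key set handed to the derivation projection so it
// cannot be dropped along with other unselected columns.
val ensureKeys = joinPart.rightToLeft.keys.toSet + Constants.RowIDColumn
val derivedDf = runDerivations(joinedDf, derivations, ensureKeys)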


310-321: Partition alignment logic is complex but necessary.

The date adjustment for TimePartitionColumn correctly aligns snapshot accuracy joins by incrementing one day to match the left side's ts_ds.
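
For intuition, the alignment amounts to something like this sketch (the exact expression in JoinPartJob may differ):

import org.apache.spark.sql.functions.{col, date_add, date_format}

// Shift the snapshot partition forward one day so events on day d join
// against features as of end-of-day d-1, matching the left side's ts_ds.
val aligned = rightDf.withColumn(
  Constants.TimePartitionColumn,
  date_format(date_add(col(tableUtils.partitionColumn), 1), "yyyy-MM-dd")
)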

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (3)

55-55: Consider using a test-specific temporary directory.

The hardcoded /tmp/hive-warehouse path may cause issues in environments where /tmp is not writable.

-        "hive.metastore.warehouse.dir" -> "file:///tmp/hive-warehouse",
+        "hive.metastore.warehouse.dir" -> s"file://${System.getProperty("java.io.tmpdir")}/hive-warehouse-${System.currentTimeMillis()}",

15-15: Use ScalaTest matchers instead of JUnit assertions.

Mix of testing frameworks reduces consistency.

-import org.junit.Assert._
+import org.scalatest.matchers.should.Matchers._
-    assertFalse(
-      s"Physical plan should not contain Exchange (shuffle) for $testName",
-      physicalPlan.contains("Exchange")
-    )
+    physicalPlan should not include "Exchange"

Also applies to: 79-92, 183-183, 211-211, 229-229


183-183: Add stronger assertions for data validation.

Current assertions only check row count > 0.

Consider validating:

  • Expected row counts
  • Schema correctness
  • Join key presence

Also applies to: 211-211
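
For example (a sketch; the expected count and the user_id key are placeholders):

import org.apache.spark.sql.functions.col
import org.junit.Assert._

val expectedRows = 100L // placeholder: derive from the generated test data
assertEquals(expectedRows, joinedDf.count())

// Schema correctness: the internal row ID must survive the join.
assertTrue(joinedDf.columns.contains(Constants.RowIDColumn))

// Join key presence: no null keys should leak through.
assertEquals(0L, joinedDf.filter(col("user_id").isNull).count())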

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9bfe306 and 7b11de1.

📒 Files selected for processing (1)
  • spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (4)

Learnt from: chewy-zlai
PR: #50
File: spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala:19-28
Timestamp: 2024-10-31T18:29:45.027Z
Learning: In MockKVStore located at spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala, the multiPut method is intended to be a simple implementation without dataset existence validation, duplicate validation logic elimination, or actual storage of key-value pairs for verification.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

Learnt from: piyush-zlai
PR: #44
File: hub/test/store/DynamoDBMonitoringStoreTest.scala:69-86
Timestamp: 2024-10-15T15:33:22.265Z
Learning: In hub/test/store/DynamoDBMonitoringStoreTest.scala, the current implementation of the generateListResponse method is acceptable as-is, and changes for resource handling and error management are not necessary at this time.

Learnt from: piyush-zlai
PR: #44
File: hub/app/controllers/ModelController.scala:15-18
Timestamp: 2024-10-17T19:46:42.629Z
Learning: References to MockDataService in hub/test/controllers/SearchControllerSpec.scala and hub/test/controllers/ModelControllerSpec.scala are needed for tests and should not be removed.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: groupby_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: streaming_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: join_tests
  • GitHub Check: batch_tests
  • GitHub Check: spark_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: api_tests
  • GitHub Check: service_tests
  • GitHub Check: online_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: flink_tests
  • GitHub Check: python_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: enforce_triggered_workflows

Comment on lines +66 to +77
private val namespace = "test_namespace_storage_partition_join"
tableUtils.createDatabase(namespace)
Contributor

🛠️ Refactor suggestion

Add test cleanup to prevent state pollution.

Database creation without cleanup may cause issues in repeated test runs.

Add cleanup in a test teardown method:

override def afterAll(): Unit = {
  spark.sql(s"DROP DATABASE IF EXISTS $namespace CASCADE")
  super.afterAll()
}
🤖 Prompt for AI Agents
In
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala
around lines 66 to 67, the test creates a database but lacks cleanup, which can
cause state pollution in repeated runs. Add an override of the afterAll() method
in the test class to drop the created database using spark.sql with "DROP
DATABASE IF EXISTS $namespace CASCADE" and then call super.afterAll() to ensure
proper cleanup after tests complete.

)

val userTransactionTable = s"$namespace.user_transactions"
spark.sql(s"DROP TABLE IF EXISTS $userTransactionTable")
Contributor

🛠️ Refactor suggestion

Clean up test table to prevent state pollution.

Table is dropped but not cleaned up after test.

Add table cleanup in a try-finally block or use scalatest's afterEach.

🤖 Prompt for AI Agents
In
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala
at line 107, the test table is dropped but not properly cleaned up after the
test, risking state pollution. Refactor the test code to ensure the table
cleanup happens reliably by wrapping the test logic in a try-finally block where
the table drop is done in the finally section, or alternatively implement the
cleanup in an overridden afterEach method if using scalatest, to guarantee the
table is dropped after each test execution.
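
One possible shape for the afterEach variant, assuming the suite mixes in ScalaTest's BeforeAndAfterEach:

override def afterEach(): Unit = {
  try {
    spark.sql(s"DROP TABLE IF EXISTS $userTransactionTable")
  } finally {
    super.afterEach()
  }
}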

Comment on lines 48 to 53
    val relevantLeftCols =
-     joinPart.rightToLeft.keys.toArray ++ Seq(tableUtils.partitionColumn) ++ (node.leftDataModel match {
+     joinPart.rightToLeft.keys.toArray ++ Seq(tableUtils.partitionColumn,
+                                              Constants.RowIDColumn) ++ (node.leftDataModel match {
        case ENTITIES => None
        case EVENTS => Some(Constants.TimeColumn)
      })
Contributor

Suggested change

-    val relevantLeftCols =
-      joinPart.rightToLeft.keys.toArray ++ Seq(tableUtils.partitionColumn,
-                                               Constants.RowIDColumn) ++ (node.leftDataModel match {
-        case ENTITIES => None
-        case EVENTS => Some(Constants.TimeColumn)
-      })
+    val entityCols = joinPart.rightToLeft.keys.toArray
+    val additionalCols = Seq(tableUtils.partitionColumn, Constants.RowIDColumn)
+    val timeCol = node.leftDataModel match {
+      case ENTITIES => None
+      case EVENTS => Some(Constants.TimeColumn)
+    }
+    val relevantLeftCols = entityCols ++ additionalCols ++ timeCol

Comment on lines 297 to 298
leftDf.schema.pretty
rightDf.schema.pretty
Contributor

Stray debug lines? These expressions have no effect.

// Assert successful bucketed join - look for SortMergeJoin or similar
val hasBucketedJoin = physicalPlan.contains("SortMergeJoin") ||
  physicalPlan.contains("BroadcastHashJoin") ||
  physicalPlan.contains("ShuffledHashJoin")
Contributor

ShuffledHashJoin should not be accepted here; it still implies a shuffle.


// Assert successful bucketed join - look for SortMergeJoin or similar
val hasBucketedJoin = physicalPlan.contains("SortMergeJoin") ||
  physicalPlan.contains("BroadcastHashJoin") ||
Contributor

Set the broadcast join threshold (spark.sql.autoBroadcastJoinThreshold) to -1, and remove this clause.
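
Concretely, that could look like the following sketch:

// Force Spark to never broadcast, so a broadcast join cannot mask a missing
// storage-partitioned join; only SortMergeJoin then counts as success.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val hasBucketedJoin = physicalPlan.contains("SortMergeJoin")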

val joinedDf = joinedDfTry.get

// Step 5: Analyze physical plan to verify no shuffles
verifyNoBroadcastJoin(joinedDf, "Storage Partition Bucketed Join")
Contributor

Test the counterfactual as well: verify that shuffles do appear when bucketing is absent.

@tchow-zlai tchow-zlai force-pushed the vz/row_id_optimizations branch from c19b874 to 1171d96 on July 25, 2025 03:13
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (4)

75-76: Add test cleanup to prevent state pollution.

Database creation without cleanup may cause issues in repeated test runs.


297-297: Update method call to match renamed method.

-    verifyNoBroadcastJoin(joinedDf, "Storage Partition Bucketed Join")
+    verifyBucketedJoin(joinedDf, "Storage Partition Bucketed Join")

160-301: Add counterfactual test and table cleanup.

Test should verify bucketing effectiveness by comparing with non-bucketed scenario and ensure proper cleanup.

// Add after the successful bucketed join verification:
// Test counterfactual - create non-bucketed tables and verify shuffles occur
val nonBucketedLeft = s"$namespace.non_bucketed_left"
val nonBucketedRight = s"$namespace.non_bucketed_right"

try {
  // Create non-bucketed versions and verify they DO contain shuffles
  // ... test logic for counterfactual
} finally {
  spark.sql(s"DROP TABLE IF EXISTS $nonBucketedLeft")
  spark.sql(s"DROP TABLE IF EXISTS $nonBucketedRight")
}

119-147: Add table cleanup and improve test isolation.

Tables are created but not cleaned up, risking state pollution between test runs.

   it should "test toy example" in {
+    val targetTable = "target"
+    val sourceTable = "source"
+    
+    try {
       val left =
         s"""
-         |CREATE TABLE target (id INT, salary INT, dep STRING)
+         |CREATE TABLE $targetTable (id INT, salary INT, dep STRING)
          |USING iceberg
          |PARTITIONED BY (dep, bucket(4, id))
        """.stripMargin
       val right =
         s"""
-         |CREATE TABLE source (id INT, salary INT, dep STRING)
+         |CREATE TABLE $sourceTable (id INT, salary INT, dep STRING)
          |USING iceberg
          |PARTITIONED BY (dep, bucket(4, id))
          |""".stripMargin
       spark.sql(left)
       spark.sql(right)
       // ... rest of test logic
+    } finally {
+      spark.sql(s"DROP TABLE IF EXISTS $targetTable")
+      spark.sql(s"DROP TABLE IF EXISTS $sourceTable")
+    }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c19b874 and 1171d96.

📒 Files selected for processing (2)
  • spark/src/main/scala/ai/chronon/spark/catalog/CreationUtils.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • spark/src/main/scala/ai/chronon/spark/catalog/CreationUtils.scala
🧰 Additional context used
🧠 Learnings (1)
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (9)

Learnt from: chewy-zlai
PR: #50
File: spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala:19-28
Timestamp: 2024-10-31T18:29:45.027Z
Learning: In MockKVStore located at spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala, the multiPut method is intended to be a simple implementation without dataset existence validation, duplicate validation logic elimination, or actual storage of key-value pairs for verification.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

Learnt from: piyush-zlai
PR: #44
File: hub/app/controllers/ModelController.scala:15-18
Timestamp: 2024-10-17T19:46:42.629Z
Learning: References to MockDataService in hub/test/controllers/SearchControllerSpec.scala and hub/test/controllers/ModelControllerSpec.scala are needed for tests and should not be removed.

Learnt from: piyush-zlai
PR: #44
File: hub/test/store/DynamoDBMonitoringStoreTest.scala:69-86
Timestamp: 2024-10-15T15:33:22.265Z
Learning: In hub/test/store/DynamoDBMonitoringStoreTest.scala, the current implementation of the generateListResponse method is acceptable as-is, and changes for resource handling and error management are not necessary at this time.

Learnt from: chewy-zlai
PR: #50
File: spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala:13-16
Timestamp: 2024-10-31T18:27:44.973Z
Learning: In MockKVStore.scala, the create method should reset the dataset even if the dataset already exists.

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: chewy-zlai
PR: #50
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:95-96
Timestamp: 2024-11-07T00:49:32.230Z
Learning: In DynamoDBKVStoreImpl.scala, the rate limiters should be initialized outside the try block in the create method to ensure they are set up even if the table already exists.

Learnt from: tchow-zlai
PR: #263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-01-24T23:55:40.650Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In DynamoDBKVStoreImpl.scala, refactoring methods like extractTimedValues and extractListValues to eliminate code duplication is discouraged if it would make the code more convoluted.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: service_commons_tests
  • GitHub Check: online_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: api_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: service_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: flink_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: python_tests
  • GitHub Check: streaming_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: groupby_tests
  • GitHub Check: batch_tests
  • GitHub Check: join_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: spark_tests
  • GitHub Check: enforce_triggered_workflows

@varant-zlai varant-zlai force-pushed the vz/row_id_optimizations branch from bd9d208 to b161f14 on July 25, 2025 17:41
@varant-zlai varant-zlai changed the title WIP -- Draft implementation of storage partition join [Spark][Optimization] Modify MergeJob to implement storage partition join using row_id Jul 25, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (3)

210-252: Complex join logic refactor centralizes data model handling.

The shouldJoinToLeft optimization is clever, but this introduces significant complexity with 6+ different code paths. Past reviews noted missing test coverage for these branches.


276-333: New joinWithLeft method centralizes join logic effectively.

Good consolidation of join logic with proper key computation and logging. Past reviews noted opportunities for logging consistency and performance improvements with bucket hints.


210-246: Complex join logic needs comprehensive test coverage.

The refactored logic with shouldJoinToLeft flag handles multiple data model/accuracy combinations but lacks explicit test coverage for all branches, especially the skewFreeMode path.

🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (1)

276-333: Join logic correctly implemented with proper logging.

The method handles different data model/accuracy combinations correctly and includes proper logging. The date adjustment logic for snapshot accuracy is implemented correctly.

Consider adding bucket hints for better performance:

-    val joinedDf = coalescedJoin(leftDfWithRelevantCols, joinableRightDf, keys)
+    // Consider adding bucket hints here for performance if both sides are bucketed
+    val joinedDf = coalescedJoin(leftDfWithRelevantCols, joinableRightDf, keys)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2123e09 and b161f14.

📒 Files selected for processing (71)
  • api/python/ai/chronon/join.py (3 hunks)
  • api/python/ai/chronon/query.py (1 hunks)
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_batch_v1__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_combined_v1__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_streaming_v1__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_dev__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_dev_notds__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_test__0 (2 hunks)
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_test_notds__0 (2 hunks)
  • api/python/test/canary/joins/gcp/item_event_join.py (1 hunks)
  • api/python/test/canary/joins/gcp/training_set.py (2 hunks)
  • api/python/test/sample/joins/kaggle/outbrain.py (1 hunks)
  • api/python/test/sample/joins/quickstart/training_set.py (1 hunks)
  • api/python/test/sample/joins/risk/user_transactions.py (1 hunks)
  • api/python/test/sample/joins/sample_team/sample_backfill_mutation_join.py (0 hunks)
  • api/python/test/sample/joins/sample_team/sample_chaining_join.py (0 hunks)
  • api/python/test/sample/joins/sample_team/sample_chaining_join_parent.py (0 hunks)
  • api/python/test/sample/joins/sample_team/sample_join.py (0 hunks)
  • api/python/test/sample/sources/test_sources.py (7 hunks)
  • api/src/main/scala/ai/chronon/api/Constants.scala (1 hunks)
  • online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala (2 hunks)
  • spark/BUILD.bazel (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/Join.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinBase.scala (0 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (5 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala (7 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/SourceJob.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/catalog/CreationUtils.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (5 hunks)
  • spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/stats/CompareJob.scala (1 hunks)
  • spark/src/main/scala/ai/chronon/spark/stats/ConsistencyJob.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/submission/ChrononKryoRegistrator.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/MigrationCompareTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/TestUtils.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationBootstrapTest.scala (3 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/EvalTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobAnalyzeReuseTest.scala (9 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobVersioningTest.scala (14 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/ModularJoinTest.scala (10 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/ShortNamesTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/ChainingFetcherTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala (13 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByUploadTest.scala (1 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/BaseJoinTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/DifferentPartitionColumnsTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EndPartitionJoinTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EntitiesEntitiesTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEntitiesSnapshotTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsCumulativeTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsSnapshotTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsTemporalTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsTemporalWithGBDerivation.scala (5 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/HeterogeneousPartitionColumnsTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/KeyMappingOverlappingFieldsTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/NoAggTest.scala (3 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/NoHistoricalBackfillTest.scala (3 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/SelectedJoinPartsTest.scala (5 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/SkipBloomFilterJoinBackfillTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/StructJoinTest.scala (2 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/UnionJoinTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/join/VersioningTest.scala (4 hunks)
  • spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala (4 hunks)
💤 Files with no reviewable changes (5)
  • api/python/test/sample/joins/sample_team/sample_chaining_join.py
  • api/python/test/sample/joins/sample_team/sample_backfill_mutation_join.py
  • api/python/test/sample/joins/sample_team/sample_chaining_join_parent.py
  • spark/src/main/scala/ai/chronon/spark/JoinBase.scala
  • api/python/test/sample/joins/sample_team/sample_join.py
✅ Files skipped from review due to trivial changes (4)
  • spark/BUILD.bazel
  • api/python/ai/chronon/query.py
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByTest.scala
  • spark/src/main/scala/ai/chronon/spark/submission/ChrononKryoRegistrator.scala
🚧 Files skipped from review as they are similar to previous changes (60)
  • api/python/test/canary/joins/gcp/item_event_join.py
  • spark/src/test/scala/ai/chronon/spark/test/ExternalSourcesTest.scala
  • spark/src/main/scala/ai/chronon/spark/stats/ConsistencyJob.scala
  • spark/src/test/scala/ai/chronon/spark/test/bootstrap/LogBootstrapTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/DifferentPartitionColumnsTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/SkipBloomFilterJoinBackfillTest.scala
  • api/python/test/sample/joins/risk/user_transactions.py
  • spark/src/test/scala/ai/chronon/spark/test/batch/ShortNamesTest.scala
  • api/python/test/sample/joins/kaggle/outbrain.py
  • spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/StructJoinTest.scala
  • api/python/test/canary/joins/gcp/training_set.py
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_test__0
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_dev_notds__0
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_dev__0
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_batch_v1__0
  • api/src/main/scala/ai/chronon/api/Constants.scala
  • spark/src/test/scala/ai/chronon/spark/test/TestUtils.scala
  • spark/src/main/scala/ai/chronon/spark/Join.scala
  • spark/src/test/scala/ai/chronon/spark/test/groupby/GroupByUploadTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/analyzer/DerivationBootstrapTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/KeyMappingOverlappingFieldsTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/SelectedJoinPartsTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsCumulativeTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsTemporalWithGBDerivation.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsSnapshotTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/ChainingFetcherTest.scala
  • api/python/test/canary/compiled/joins/gcp/training_set.v1_test_notds__0
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_streaming_v1__0
  • spark/src/test/scala/ai/chronon/spark/test/join/NoAggTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/HeterogeneousPartitionColumnsTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/bootstrap/TableBootstrapTest.scala
  • spark/src/main/scala/ai/chronon/spark/JoinUtils.scala
  • spark/src/test/scala/ai/chronon/spark/test/MigrationCompareTest.scala
  • api/python/test/sample/sources/test_sources.py
  • spark/src/test/scala/ai/chronon/spark/test/batch/EvalTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/UnionJoinTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/VersioningTest.scala
  • api/python/test/sample/joins/quickstart/training_set.py
  • online/src/main/scala/ai/chronon/online/fetcher/MetadataStore.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/NoHistoricalBackfillTest.scala
  • spark/src/main/scala/ai/chronon/spark/stats/CompareJob.scala
  • api/python/test/canary/compiled/joins/gcp/item_event_join.canary_combined_v1__0
  • spark/src/main/scala/ai/chronon/spark/batch/SourceJob.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/EndPartitionJoinTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/EntitiesEntitiesTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEntitiesSnapshotTest.scala
  • spark/src/main/scala/ai/chronon/spark/Extensions.scala
  • spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobAnalyzeReuseTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/EventsEventsTemporalTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala
  • spark/src/test/scala/ai/chronon/spark/test/join/BaseJoinTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/streaming/MutationsTest.scala
  • spark/src/main/scala/ai/chronon/spark/catalog/CreationUtils.scala
  • api/python/ai/chronon/join.py
  • spark/src/test/scala/ai/chronon/spark/test/fetcher/FetcherTestUtil.scala
  • spark/src/main/scala/ai/chronon/spark/batch/MergeJob.scala
  • spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala
  • spark/src/test/scala/ai/chronon/spark/test/batch/MergeJobVersioningTest.scala
  • spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala
🧰 Additional context used
🧬 Code Graph Analysis (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (4)
spark/src/main/scala/ai/chronon/spark/GroupBy.scala (7)
  • spark (388-392)
  • GroupBy (48-401)
  • GroupBy (404-812)
  • snapshotEvents (178-188)
  • snapshotEntities (148-155)
  • temporalEvents (286-364)
  • temporalEntities (188-281)
spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (4)
  • JoinUtils (39-505)
  • coalescedJoin (159-191)
  • joinWithLeft (480-503)
  • leftDf (70-109)
api/src/main/scala/ai/chronon/api/Constants.scala (1)
  • Constants (23-101)
spark/src/main/scala/ai/chronon/spark/Extensions.scala (3)
  • save (141-153)
  • withTimeBasedColumn (229-234)
  • pretty (39-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: service_commons_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: service_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: online_tests
  • GitHub Check: api_tests
  • GitHub Check: flink_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: python_tests
  • GitHub Check: join_tests
  • GitHub Check: groupby_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: streaming_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: batch_tests
  • GitHub Check: spark_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (14)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (6)

10-10: Import additions support new functionality.

These imports are necessary for the new join logic and coalescedJoin usage.

Also applies to: 13-13, 15-15


46-53: Row ID column integration looks good.

Clean implementation that properly handles data model differences for column selection.


103-103: Bucketing by row ID will improve join performance.

Good addition that aligns with the bucketing strategy across the codebase.


46-53: Row ID integration implemented correctly.

The conditional column selection properly handles both ENTITIES and EVENTS data models while consistently including the row ID column.


103-103: Bucketing enabled correctly for performance optimization.

Setting bucketByRowId = true aligns with the storage partition join implementation and should improve join performance through bucketing.


248-252: Conditional join logic implemented efficiently.

The shouldJoinToLeft flag correctly avoids unnecessary joins while ensuring required columns (like row ID) are included when needed.

spark/src/test/scala/ai/chronon/spark/test/batch/ModularJoinTest.scala (8)

21-31: Spark configuration properly enables bucketing for testing.

Good alignment with the new bucketing strategy - disabling broadcast joins ensures bucketed joins are tested.


129-129: Row ID addition to test data is necessary.

Proper integration of the row ID column requirement into test data generation.


161-161: Date range adjustments align with temporal scope changes.

Consistent narrowing of time windows to end one month ago instead of current date.

Also applies to: 223-223, 245-245, 290-290, 329-329, 348-348, 380-380


271-271: Schema expectation updated for row ID column.

Necessary change to align test assertions with the new row ID column requirement.


21-31: Test configuration properly enables bucketing features.

The Spark session configuration correctly enables bucketing and disables broadcast joins to test the new storage partition join functionality.


129-129: Row ID correctly added to test data generation.

The addRowID = true parameter ensures test data includes the row ID column required by the new join logic.


161-161: Temporal ranges standardized consistently.

The shift from today to monthAgo across all date ranges provides more stable test behavior and aligns with temporal standardization efforts.

Also applies to: 223-223, 245-245, 290-290, 329-329, 348-348, 380-380


271-271: Schema expectation correctly updated for row ID column.

Adding Constants.RowIDColumn to the expected schema properly reflects the bootstrap output changes.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (1)

22-78: Configuration setup is comprehensive but lacks cleanup.

The extensive SPJ, Iceberg, and Hive configurations look good for testing bucketed joins.

🧹 Nitpick comments (1)
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (1)

183-320: Comprehensive integration test setup covers full SPJ pipeline.

The method correctly sets up SourceJob → JoinPartJob → MergeJob flow with row_id bucketing.

Consider extracting into smaller methods for better readability:

private def setupTestTables(): (String, String) = { ... }
private def runSourceAndJoinPartJobs(): Unit = { ... }
private def createMergeJobAndRunDayStep(): DataFrame = { ... }
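
For orientation, the flow those helpers would wrap is roughly the following; constructor shapes are approximations, not the exact Chronon API:

// Hypothetical signatures, for illustration only.
new SourceJob(sourceNode, metaData, dayRange).run()          // materialize left with row_id
new JoinPartJob(joinPartNode, metaData, dayRange).run()      // compute right side, bucketed by row_id
new MergeJob(mergeNode, metaData, dayRange, joinParts).run() // SPJ merge keyed on row_id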
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b161f14 and 0ce2603.

📒 Files selected for processing (1)
  • spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (9)

Learnt from: chewy-zlai
PR: #50
File: spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala:19-28
Timestamp: 2024-10-31T18:29:45.027Z
Learning: In MockKVStore located at spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala, the multiPut method is intended to be a simple implementation without dataset existence validation, duplicate validation logic elimination, or actual storage of key-value pairs for verification.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

Learnt from: piyush-zlai
PR: #44
File: hub/app/controllers/ModelController.scala:15-18
Timestamp: 2024-10-17T19:46:42.629Z
Learning: References to MockDataService in hub/test/controllers/SearchControllerSpec.scala and hub/test/controllers/ModelControllerSpec.scala are needed for tests and should not be removed.

Learnt from: piyush-zlai
PR: #44
File: hub/test/store/DynamoDBMonitoringStoreTest.scala:69-86
Timestamp: 2024-10-15T15:33:22.265Z
Learning: In hub/test/store/DynamoDBMonitoringStoreTest.scala, the current implementation of the generateListResponse method is acceptable as-is, and changes for resource handling and error management are not necessary at this time.

Learnt from: chewy-zlai
PR: #50
File: spark/src/test/scala/ai/chronon/spark/test/MockKVStore.scala:13-16
Timestamp: 2024-10-31T18:27:44.973Z
Learning: In MockKVStore.scala, the create method should reset the dataset even if the dataset already exists.

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: chewy-zlai
PR: #50
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:95-96
Timestamp: 2024-11-07T00:49:32.230Z
Learning: In DynamoDBKVStoreImpl.scala, the rate limiters should be initialized outside the try block in the create method to ensure they are set up even if the table already exists.

Learnt from: tchow-zlai
PR: #263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-01-24T23:55:40.650Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In DynamoDBKVStoreImpl.scala, refactoring methods like extractTimedValues and extractListValues to eliminate code duplication is discouraged if it would make the code more convoluted.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: service_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: api_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: online_tests
  • GitHub Check: flink_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: analyzer_tests
  • GitHub Check: streaming_tests
  • GitHub Check: join_tests
  • GitHub Check: groupby_tests
  • GitHub Check: python_tests
  • GitHub Check: spark_tests
  • GitHub Check: batch_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (4)
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala (4)

79-127: Utility methods are well-designed for SPJ testing.

The configuration toggle and physical plan verification logic correctly validates bucketed joins.


129-170: Clean toy example for basic SPJ validation.

Simple test with matching bucketing schemes effectively validates the core functionality.


172-181: Clean helper methods for exchange verification.

Well-focused methods for counting and asserting Exchange operators.
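
Such helpers can be quite small; a sketch with hypothetical names (the test's actual implementations may differ):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
import org.junit.Assert.assertEquals

def countExchanges(df: DataFrame): Int =
  df.queryExecution.executedPlan.collect { case e: ShuffleExchangeExec => e }.size

def assertNoExchange(df: DataFrame, testName: String): Unit =
  assertEquals(s"Expected no shuffle Exchange for $testName", 0, countExchanges(df))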


322-333: Clean positive test case for SPJ validation.

Proper test flow with DataFrame analysis and shuffle verification.

Comment on lines +335 to +349
it should "NOT storage partitioned join due to incorrect configuration" in {
setSPJConfigs(false)
val tableUtils: TableTestUtils = TableTestUtils(spark)

val joinedDf = setupAndGetMergeDF(tableUtils)

joinedDf.show()
joinedDf.explain(true)

// Step 5: Analyze physical plan to verify no shuffles
assertThrows[AssertionError](verifyNoShuffle(joinedDf, "Storage Partition Bucketed Join"))

println("✓ Shuffles detected as expected when incorrect configs are used")
setSPJConfigs(true)
}
Contributor

🛠️ Refactor suggestion

Good negative test but needs config cleanup protection.

The test correctly validates configuration dependency but config reset should be in finally block.

it should "NOT storage partitioned join due to incorrect configuration" in {
  setSPJConfigs(false)
  try {
    val tableUtils: TableTestUtils = TableTestUtils(spark)
    val joinedDf = setupAndGetMergeDF(tableUtils)
    joinedDf.show()
    joinedDf.explain(true)
    assertThrows[AssertionError](verifyNoShuffle(joinedDf, "Storage Partition Bucketed Join"))
    println("✓ Shuffles detected as expected when incorrect configs are used")
  } finally {
    setSPJConfigs(true)
  }
}
🤖 Prompt for AI Agents
In
spark/src/test/scala/ai/chronon/spark/test/batch/StoragePartitionJoinTest.scala
around lines 335 to 349, the test disables storage partition join configs but
resets them only at the end, risking config leakage if an exception occurs. Wrap
the test logic inside a try block and move the config reset call to a finally
block to ensure the configuration is always restored regardless of test success
or failure.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (1)

210-246: Add tests for JoinPartJob branches and complete row ID TODOs.

No existing tests hit the new shouldJoinToLeft logic in JoinPartJob (grep found none). Before merging:

• Implement the two TODOs (lines 234 & 244) so temporalEvents/temporalEntities emit Constants.RowIDColumn and flip shouldJoinToLeft to false, then remove the TODOs.
• Add unit tests for every (leftDataModel, dataModel, inferredAccuracy) case:
– ENTITIES→EVENTS & ENTITIES→ENTITIES
– EVENTS→EVENTS SNAPSHOT
– EVENTS→EVENTS TEMPORAL with skewFree = true/false
– EVENTS→ENTITIES SNAPSHOT
– EVENTS→ENTITIES TEMPORAL

🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (1)

276-333: Well-implemented join method with comprehensive key handling.

The method correctly computes join keys and handles different data model combinations. Good logging for debugging.

Consider null safety for time column check:

  val leftDfWithRelevantCols =
    if (node.leftDataModel == DataModel.EVENTS && !leftDf.columns.contains(Constants.TimePartitionColumn)) {
-     leftDf.withTimeBasedColumn(Constants.TimePartitionColumn)
+     if (leftDf.columns.contains(Constants.TimeColumn)) {
+       leftDf.withTimeBasedColumn(Constants.TimePartitionColumn)
+     } else {
+       throw new IllegalStateException(s"Missing ${Constants.TimeColumn} for time partition generation")
+     }
    } else {
      leftDf
    }.select(allLeftCols.map(column): _*)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b447304 and 7e9dc83.

📒 Files selected for processing (3)
  • spark/src/main/scala/ai/chronon/spark/Join.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (3 hunks)
  • spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • spark/src/main/scala/ai/chronon/spark/Join.scala
  • spark/src/main/scala/ai/chronon/spark/JoinUtils.scala
🧰 Additional context used
🧠 Learnings (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (7)

Learnt from: nikhil-zlai
PR: #70
File: service/src/main/java/ai/chronon/service/ApiProvider.java:6-6
Timestamp: 2024-12-03T04:04:33.809Z
Learning: The import scala.util.ScalaVersionSpecificCollectionsConverter in service/src/main/java/ai/chronon/service/ApiProvider.java is correct and should not be flagged in future reviews.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:245-260
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In DynamoDBKVStoreImpl.scala, refactoring methods like extractTimedValues and extractListValues to eliminate code duplication is discouraged if it would make the code more convoluted.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/test/scala/ai/chronon/integrations/aws/DynamoDBKVStoreTest.scala:175-175
Timestamp: 2024-10-07T15:09:51.567Z
Learning: Hardcoding future timestamps in tests within DynamoDBKVStoreTest.scala is acceptable when data is generated and queried within the same time range, ensuring the tests remain valid over time.

Learnt from: piyush-zlai
PR: #43
File: hub/app/controllers/TimeSeriesController.scala:320-320
Timestamp: 2024-10-14T18:44:24.599Z
Learning: In hub/app/controllers/TimeSeriesController.scala, the generateMockTimeSeriesPercentilePoints method contains placeholder code that will be replaced with the actual implementation soon.

Learnt from: piyush-zlai
PR: #53
File: hub/app/controllers/TimeSeriesController.scala:224-224
Timestamp: 2024-10-29T15:21:58.102Z
Learning: In the mocked data implementation in hub/app/controllers/TimeSeriesController.scala, potential NumberFormatException exceptions due to parsing errors (e.g., when using val featureId = name.split("_").last.toInt) are acceptable and will be addressed when adding the concrete backend.

Learnt from: piyush-zlai
PR: #33
File: cloud_aws/src/main/scala/ai/chronon/integrations/aws/DynamoDBKVStoreImpl.scala:29-30
Timestamp: 2024-10-08T16:18:45.669Z
Learning: In the codebase, the KVStore implementation provides an implicit ExecutionContext in scope, so it's unnecessary to import another.

Learnt from: chewy-zlai
PR: #62
File: spark/src/main/scala/ai/chronon/spark/stats/drift/SummaryUploader.scala:9-10
Timestamp: 2024-11-06T21:54:56.160Z
Learning: In Spark applications, when defining serializable classes, passing an implicit ExecutionContext parameter can cause serialization issues. In such cases, it's acceptable to use scala.concurrent.ExecutionContext.Implicits.global.

🧬 Code Graph Analysis (1)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (6)
spark/src/main/scala/ai/chronon/spark/GroupBy.scala (4)
  • spark (388-392)
  • GroupBy (48-401)
  • GroupBy (404-812)
  • temporalEvents (286-364)
spark/src/main/scala/ai/chronon/spark/JoinUtils.scala (4)
  • JoinUtils (39-504)
  • coalescedJoin (159-191)
  • joinWithLeft (480-502)
  • leftDf (70-109)
spark/src/main/scala/ai/chronon/spark/catalog/TableUtils.scala (3)
  • TableUtils (42-642)
  • TableUtils (644-646)
  • sql (327-355)
spark/src/main/scala/ai/chronon/spark/join/UnionJoin.scala (2)
  • UnionJoin (18-229)
  • computeJoinPart (112-193)
api/src/main/scala/ai/chronon/api/Constants.scala (1)
  • Constants (23-101)
spark/src/main/scala/ai/chronon/spark/Extensions.scala (2)
  • save (141-153)
  • withTimeBasedColumn (229-234)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: streaming_tests
  • GitHub Check: join_tests
  • GitHub Check: fetcher_tests
  • GitHub Check: analyzer_tests
  • GitHub Check: groupby_tests
  • GitHub Check: spark_tests
  • GitHub Check: batch_tests
  • GitHub Check: cloud_aws_tests
  • GitHub Check: cloud_gcp_tests
  • GitHub Check: service_commons_tests
  • GitHub Check: service_tests
  • GitHub Check: api_tests
  • GitHub Check: online_tests
  • GitHub Check: python_tests
  • GitHub Check: aggregator_tests
  • GitHub Check: flink_tests
  • GitHub Check: enforce_triggered_workflows
🔇 Additional comments (4)
spark/src/main/scala/ai/chronon/spark/batch/JoinPartJob.scala (4)

10-10: LGTM - imports support new join functionality.

Also applies to: 13-13, 15-15


46-53: Improved column selection with row_id support.

The changes correctly include Constants.RowIDColumn and use actual join keys instead of hardcoded columns.


102-103: Enables row ID bucketing for SPJ optimization.


248-266: Correct integration of conditional join-back logic.

The changes properly use the shouldJoinToLeft flag and ensure Constants.RowIDColumn is preserved in derivations.

@varant-zlai varant-zlai force-pushed the vz/row_id_optimizations branch from df05713 to 205dd69 on July 28, 2025 17:33
@varant-zlai varant-zlai force-pushed the vz/row_id_optimizations branch from 0c9c8d4 to 4369b61 on July 29, 2025 23:54