Add Driver verb to bulk load GBU data to KV store #172
Conversation
Actionable comments posted: 0
🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/Driver.scala (1)
730-746: Add better error recovery.
Currently, one exception disrupts the entire bulk load. Consider partial or iterative handling to avoid data loss.
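As a rough illustration of what partial or iterative handling could look like, here is a minimal, self-contained sketch; the batch splitting and the `BatchUpload` hook are hypothetical stand-ins for however Driver.scala actually hands rows to the KV store, not the project's real API.

```scala
import scala.util.{Failure, Success, Try}

object BulkLoadRetry {
  // Hypothetical per-batch upload hook; in the real Driver this would be the KV store's bulk-put call.
  type BatchUpload = Seq[String] => Unit

  /** Retry one batch up to `maxAttempts` times before giving up. */
  def withRetries(maxAttempts: Int)(thunk: => Unit): Try[Unit] =
    Try(thunk) match {
      case s @ Success(_)                => s
      case Failure(_) if maxAttempts > 1 => withRetries(maxAttempts - 1)(thunk)
      case f @ Failure(_)                => f
    }

  /** Upload batches independently so one bad batch doesn't abort the whole load. */
  def loadAll(batches: Seq[Seq[String]], upload: BatchUpload): Seq[(Int, Throwable)] =
    batches.zipWithIndex.flatMap { case (batch, idx) =>
      withRetries(maxAttempts = 3)(upload(batch)) match {
        case Success(_) => None
        case Failure(e) => Some(idx -> e) // surface failed batches instead of losing the whole run
      }
    }
}
```

Failed batches are returned to the caller so they can be logged or retried, rather than one exception discarding all progress.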
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (1)
spark/src/main/scala/ai/chronon/spark/Driver.scala (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: table_utils_delta_format_spark_tests
- GitHub Check: other_spark_tests
- GitHub Check: mutation_spark_tests
- GitHub Check: fetcher_spark_tests
- GitHub Check: join_spark_tests
- GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (4)
spark/src/main/scala/ai/chronon/spark/Driver.scala (4)
717-718: Good introduction of a new subcommand.
The naming is consistent with other subcommands.
719-729: Validate the partition format.
Ensure the provided partition string matches “yyyy-MM-dd” to reduce runtime errors.
948-949: Subcommand integration looks good.
Everything aligns with the established pattern.
995-997: Consistent with existing structure.
No issues spotted, subcommand is properly invoked.
```scala
    opt[String](required = true, descr = "Name of the GroupBy that we're triggering this upload for")

  val partitionString: ScallopOption[String] =
    opt[String](required = true, descr = "Partition string (in 'yyyy-MM-dd' format) that we are uploading")
```
Is this string supposed to match whatever's configured at:
chronon/spark/src/main/scala/ai/chronon/spark/TableUtils.scala
Lines 71 to 72 in 3aa7369
```scala
  private val partitionFormat: String =
    sparkSession.conf.get("spark.chronon.partition.format", "yyyy-MM-dd")
```
yeah this needs to match what we're writing out.
okay I think there's some form of this where we can DRY things up but wouldn't block the PR. Thanks for clarifying it!
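One possible way to DRY this up, sketched under the assumption that the CLI has (or can build) a SparkSession: read the same `spark.chronon.partition.format` setting that TableUtils uses and validate against it, instead of hardcoding "yyyy-MM-dd" in the option description. The object and method names here are illustrative, not code from the PR.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.SparkSession
import scala.util.Try

object PartitionFormat {
  // Same key and default that TableUtils reads, so the CLI and the writer stay in sync.
  def fromSession(spark: SparkSession): String =
    spark.conf.get("spark.chronon.partition.format", "yyyy-MM-dd")

  // Illustrative validation helper reusing that single source of truth.
  def isValidPartition(value: String, format: String): Boolean =
    Try(LocalDate.parse(value, DateTimeFormatter.ofPattern(format))).isSuccess
}
```

For example, `PartitionFormat.isValidPartition("2023-11-30", PartitionFormat.fromSession(spark))` would return true under the default format.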
(Branch updated: aca549b to 078d378)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (2)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (2 hunks)
spark/src/main/scala/ai/chronon/spark/Driver.scala (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: table_utils_delta_format_spark_tests
- GitHub Check: other_spark_tests
- GitHub Check: join_spark_tests
- GitHub Check: mutation_spark_tests
- GitHub Check: scala_compile_fmt_fix
- GitHub Check: fetcher_spark_tests
🔇 Additional comments (4)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (2)
336-338: LGTM: Job ID uniqueness improvement. Adding a timestamp to the job ID prevents conflicts for same-partition exports.
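For readers skimming the thread, the idea is roughly the following. This is a hedged sketch; the names and sanitization are illustrative, not the exact code in BigTableKVStoreImpl.

```scala
object ExportJobId {
  // Appending a timestamp keeps job IDs unique across repeated exports of the same partition.
  def forPartition(groupByName: String, partition: String): String = {
    val sanitized = s"$groupByName-$partition".replaceAll("[^a-zA-Z0-9-]", "-").toLowerCase
    s"export-$sanitized-${System.currentTimeMillis()}"
  }
}
```

Repeated exports of `quickstart.purchases.v1` for `2023-11-30` then get distinct job IDs.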
315-315: Verify column family configuration matches TableUtils. The empty field name and "cf" alias should align with the column family configuration in TableUtils.
Also applies to: 323-323
✅ Verification successful
Column family configuration verified ✓
The empty field name and "cf" alias in the export configuration correctly align with BigTableKVStoreImpl's constants:
- ColumnFamilyString = "cf"
- ColumnFamilyQualifierString = "value"
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for column family configuration in TableUtils
rg -A 5 'column.*family'
```
Length of output: 26
Script:
```bash
#!/bin/bash
# Find TableUtils file
fd TableUtils
# Search for cf or columnFamily patterns
rg -i "columnfamily|cf\s*="
# Search for BigTable configuration patterns
rg -i "bigtable.*config"
```
Length of output: 2607
spark/src/main/scala/ai/chronon/spark/Driver.scala (2)
717-747: LGTM: Well-structured bulk load implementation. Good error handling and timing metrics. Clear logging of operations.
948-949: LGTM: Clean CLI integration. New command properly integrated into the CLI framework.
Also applies to: 995-997
```scala
  val srcOfflineTable: ScallopOption[String] =
    opt[String](required = true, descr = "Name of the source GroupBy Upload table")

  val groupbyName: ScallopOption[String] =
    opt[String](required = true, descr = "Name of the GroupBy that we're triggering this upload for")

  val partitionString: ScallopOption[String] =
    opt[String](required = true, descr = "Partition string (in 'yyyy-MM-dd' format) that we are uploading")
```
🛠️ Refactor suggestion
Add validation for partition string format.
Ensure the partition string matches 'yyyy-MM-dd' format to prevent runtime errors.
```diff
 val partitionString: ScallopOption[String] =
   opt[String](required = true, descr = "Partition string (in 'yyyy-MM-dd' format) that we are uploading")
+validate(s => try {
+  java.time.LocalDate.parse(s, java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd"))
+  true
+} catch {
+  case _: Exception => false
+})
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
  val srcOfflineTable: ScallopOption[String] =
    opt[String](required = true, descr = "Name of the source GroupBy Upload table")

  val groupbyName: ScallopOption[String] =
    opt[String](required = true, descr = "Name of the GroupBy that we're triggering this upload for")

  val partitionString: ScallopOption[String] =
    opt[String](required = true, descr = "Partition string (in 'yyyy-MM-dd' format) that we are uploading")

  validate(s => try {
    java.time.LocalDate.parse(s, java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd"))
    true
  } catch {
    case _: Exception => false
  })
```
Actually @piyush-zlai we were chatting earlier today and @david-zlai brought up the topic of how this job is meant to be kicked off. It seems like this will be run straight from the user's laptop, the way it's implemented now. Currently, the way things work is that …
Let's chat about this offline - I'll ping you and David
(Discussed offline - run.py will call the submitter which in turn calls Driver with the right verb + params. We can attempt to use the same rails for this bulk load invocation as well)
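For illustration, here is a rough sketch of the argument wiring that chain implies, based only on the flags exercised in the manual test shown in the PR description below. The `buildDriverArgs` helper and its parameter names are hypothetical; this is not code from run.py or the submitter.

```scala
object BulkLoadArgs {
  // Hypothetical helper assembling the Driver invocation a submitter would hand to Spark.
  def buildDriverArgs(onlineJar: String,
                      onlineClass: String,
                      srcOfflineTable: String,
                      groupByName: String,
                      partition: String): Seq[String] =
    Seq(
      "groupby-upload-bulk-load",
      s"--online-jar=$onlineJar",
      s"--online-class=$onlineClass",
      s"--src-offline-table=$srcOfflineTable",
      s"--groupby-name=$groupByName",
      s"--partition-string=$partition"
    )
}
```

For example, `BulkLoadArgs.buildDriverArgs("cloud_gcp-assembly-0.1.0-SNAPSHOT.jar", "ai.chronon.integrations.cloud_gcp.GcpApiImpl", "data.test_gbu", "quickstart.purchases.v1", "2023-11-30")` reproduces the flags from the manual test.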
Actionable comments posted: 2
🧹 Nitpick comments (3)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GcpApiImpl.scala (2)
24-26: LGTM! Consider enhancing error message. The fallback mechanism is well implemented.
```diff
- .getOrElse(throw new IllegalArgumentException("GCP_PROJECT_ID environment variable not set"))
+ .getOrElse(throw new IllegalArgumentException("GCP_PROJECT_ID not found in environment or configuration"))
```
29-31: Consider extracting the credential retrieval pattern. Duplicate pattern with GCP_PROJECT_ID retrieval.

```scala
private def getGcpConfig(key: String): String = {
  sys.env.get(key)
    .orElse(conf.get(key))
    .getOrElse(throw new IllegalArgumentException(s"$key not found in environment or configuration"))
}
```
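If that helper were adopted, both credential lookups could route through it. A minimal, self-contained illustration follows, with `conf` modeled as a plain `Map` (an assumption; the real class would use whatever configuration object GcpApiImpl already holds) and `GCP_INSTANCE_ID` taken from the manual test in the PR description.

```scala
class GcpConfig(conf: Map[String, String]) {
  // Single lookup path: environment first, then configuration, then a uniform error.
  private def getGcpConfig(key: String): String =
    sys.env.get(key)
      .orElse(conf.get(key))
      .getOrElse(throw new IllegalArgumentException(s"$key not found in environment or configuration"))

  lazy val projectId: String  = getGcpConfig("GCP_PROJECT_ID")
  lazy val instanceId: String = getGcpConfig("GCP_INSTANCE_ID")
}
```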
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala (1)
61-76: Consider adding parameter validation. The test would be more robust with validation of the parameters passed to `submit`. Add assertions to verify (see the sketch after this list):
- Source table format
- GroupBy name format
- Partition string format
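A hedged sketch of what those assertions could look like, using plain JUnit checks on illustrative values; how the test actually captures the arguments handed to `submit` depends on its mocking setup, which isn't shown in this thread.

```scala
import org.junit.Assert.assertTrue

object SubmitterArgChecks {
  // Illustrative format checks for the values handed to submit().
  def checkArgs(srcOfflineTable: String, groupByName: String, partitionString: String): Unit = {
    assertTrue("table should be <dataset>.<table>", srcOfflineTable.matches("""[\w-]+\.[\w-]+"""))
    assertTrue("groupBy name should be dotted", groupByName.matches("""[\w-]+(\.[\w-]+)+"""))
    assertTrue("partition should be yyyy-MM-dd", partitionString.matches("""\d{4}-\d{2}-\d{2}"""))
  }
}
```

For example, `SubmitterArgChecks.checkArgs("data.test_gbu", "quickstart.purchases.v1", "2023-11-30")` passes, while a malformed partition such as `"2023/11/30"` fails the last assertion.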
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (2)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GcpApiImpl.scala (1 hunks)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: table_utils_delta_format_spark_tests
- GitHub Check: fetcher_spark_tests
- GitHub Check: mutation_spark_tests
- GitHub Check: join_spark_tests
- GitHub Check: other_spark_tests
- GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (1)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala (1)
65-65: Verify empty jar list. The empty jar list (`List.empty`) might be unintentional.
```scala
println(submittedJobId)
assertEquals(submittedJobId, "mock-job-id")
```
Fix incorrect assertion.
The test creates a real DataprocSubmitter but expects a mock job ID.
Either:
- Use a mocked submitter like in the first test case (a sketch follows below), or
- Remove the assertion and keep it as a local-only test
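A hedged sketch of the first option using Mockito. The real `DataprocSubmitter` constructor and `submit` signature aren't shown in this thread, so a stand-in `JobSubmitter` trait is used purely to illustrate the shape of the mocked test.

```scala
import org.junit.Assert.assertEquals
import org.mockito.ArgumentMatchers.any
import org.mockito.Mockito.{mock, when}

// Stand-in for the real submitter interface; the actual DataprocSubmitter.submit
// signature may differ, this only illustrates the mocking approach.
trait JobSubmitter {
  def submit(mainClass: String, jars: List[String], args: List[String]): String
}

object MockedSubmitterSketch {
  def main(args: Array[String]): Unit = {
    val submitter = mock(classOf[JobSubmitter])
    when(submitter.submit(any[String], any[List[String]], any[List[String]]))
      .thenReturn("mock-job-id")

    val submittedJobId =
      submitter.submit("ai.chronon.spark.Driver", List.empty, List("groupby-upload-bulk-load"))

    // The assertion now holds because the job id really is mocked.
    assertEquals("mock-job-id", submittedJobId)
  }
}
```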
Was able to test this via a submitter test that invokes the driver with the right verb + params. Merging this.
## Summary

Add a verb to the Driver to allow us to bulk load GBU data to the KV store of choice.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

Tested manually using the dummy gbu table (on bq) I used while testing the BigTable kv store code (data.test_gbu)

```
$ export GCP_INSTANCE_ID="zipline-canary-instance"
$ export GCP_PROJECT_ID="canary-443022"
$ java -cp spark/target/scala-2.12/spark-assembly-0.1.0-SNAPSHOT.jar:/opt/homebrew/Cellar/apache-spark/3.5.4/libexec/jars/* ai.chronon.spark.Driver groupby-upload-bulk-load --online-jar=cloud_gcp/target/scala-2.12/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl --src-offline-table=data.test_gbu --groupby-name=quickstart.purchases.v1 --partition-string=2023-11-30
...
Triggering bulk load for GroupBy: quickstart.purchases.v1 for partition: 2023-11-30 from table: data.test_gbu
Uploaded GroupByUpload data to KV store for GroupBy: quickstart.purchases.v1; partition: 2023-11-30 in 2 seconds
```

Was also able to test via triggering the submitter test. The upload kicks off this [Spark job](https://console.cloud.google.com/dataproc/jobs/0af24968-51b2-45e7-95da-8a890b094837?region=us-central1&hl=en&inv=1&invt=AbmQ-g&project=canary-443022) and the bulk load succeeds.

## Summary by CodeRabbit

- **New Features**
  - Added a new command-line subcommand for bulk loading GroupBy data to a key-value store.
  - Introduced a new option to upload offline GroupBy tables with specified parameters.
- **Improvements**
  - Enhanced the export process for data to BigTable with updated query parameters.
  - Improved job ID generation for uniqueness during data uploads.
  - Updated error handling for environment variable retrieval to allow fallback options.
- **Tests**
  - Added a new test case for the `DataprocSubmitter` class related to GBU bulk loading.

Co-authored-by: Thomas Chow <[email protected]>