
Conversation

@piyush-zlai
Contributor

@piyush-zlai piyush-zlai commented Jan 6, 2025

Summary

Add a verb to the Driver that lets us bulk load GBU (GroupBy Upload) data into the KV store of choice.

Checklist

  • [ ] Added Unit Tests
  • [ ] Covered by existing CI
  • [x] Integration tested
  • [ ] Documentation update

Tested manually using the dummy GBU table (on BigQuery) that I used while testing the BigTable KV store code (data.test_gbu):

$ export GCP_INSTANCE_ID="zipline-canary-instance"
$ export GCP_PROJECT_ID="canary-443022"
$ java -cp spark/target/scala-2.12/spark-assembly-0.1.0-SNAPSHOT.jar:/opt/homebrew/Cellar/apache-spark/3.5.4/libexec/jars/* ai.chronon.spark.Driver groupby-upload-bulk-load --online-jar=cloud_gcp/target/scala-2.12/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl --src-offline-table=data.test_gbu --groupby-name=quickstart.purchases.v1 --partition-string=2023-11-30
...
Triggering bulk load for GroupBy: quickstart.purchases.v1 for partition: 2023-11-30 from table: data.test_gbu
Uploaded GroupByUpload data to KV store for GroupBy: quickstart.purchases.v1; partition: 2023-11-30 in 2 seconds

I was also able to test by triggering the submitter test: the upload kicks off a Dataproc Spark job and the bulk load succeeds.

Summary by CodeRabbit

  • New Features

    • Added a new command-line subcommand for bulk loading GroupBy data to a key-value store.
    • Introduced a new option to upload offline GroupBy tables with specified parameters.
  • Improvements

    • Enhanced the export process for data to BigTable with updated query parameters.
    • Improved job ID generation for uniqueness during data uploads.
    • Updated error handling for environment variable retrieval to allow fallback options.
  • Tests

    • Added a new test case for the DataprocSubmitter class related to GBU bulk loading.

@piyush-zlai piyush-zlai requested a review from david-zlai January 6, 2025 06:08
@coderabbitai
Contributor

coderabbitai bot commented Jan 6, 2025

Walkthrough

The pull request introduces a new GroupByUploadToKVBulkLoad object in the Spark Driver, enabling a bulk load operation for GroupBy uploads. This addition extends the existing command-line interface with a new subcommand that allows users to perform key-value store bulk uploads by specifying source offline table, GroupBy name, and partition details. Additionally, modifications to the bulkPut method in BigTableKVStoreImpl enhance the export process to BigTable. Changes to environment variable handling in GcpApiImpl improve configuration flexibility, while a new test case in DataprocSubmitterTest adds coverage for the bulk load functionality.

Changes

  • spark/src/main/scala/ai/chronon/spark/Driver.scala: Added GroupByUploadToKVBulkLoad object; created nested Args class for the bulk load subcommand; implemented a run method for the bulk upload operation; updated the main method to handle the new subcommand.
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala: Modified the bulkPut method's SQL export query; updated job ID generation logic for uniqueness.
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GcpApiImpl.scala: Enhanced retrieval of GCP_PROJECT_ID and GCP_INSTANCE_ID with a fallback to the conf map.
  • cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala: Added an ignored test case for GBU bulk load functionality.
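For orientation, here is a rough sketch of what the new subcommand plausibly looks like, pieced together from the option names and log lines quoted elsewhere in this PR. The KvStoreLike stand-in, the run signature, and the slf4j logging are assumptions for illustration, not verbatim code from the diff.

```
import org.rogach.scallop.{ScallopOption, Subcommand}
import org.slf4j.LoggerFactory

// Minimal stand-in for the online KV store interface; the real trait lives in chronon-online.
trait KvStoreLike {
  def bulkPut(sourceOfflineTable: String, destinationDataset: String, partition: String): Unit
}

object GroupByUploadToKVBulkLoad {
  private val logger = LoggerFactory.getLogger(getClass)

  // Scallop args, mirroring the option names quoted in the review comments further down.
  class Args extends Subcommand("groupby-upload-bulk-load") {
    val srcOfflineTable: ScallopOption[String] =
      opt[String](required = true, descr = "Name of the source GroupBy Upload table")
    val groupbyName: ScallopOption[String] =
      opt[String](required = true, descr = "Name of the GroupBy that we're triggering this upload for")
    val partitionString: ScallopOption[String] =
      opt[String](required = true, descr = "Partition string (in 'yyyy-MM-dd' format) that we are uploading")
  }

  // The kvStore parameter stands in for whatever the online API hands back in the real Driver.
  def run(args: Args, kvStore: KvStoreLike): Unit = {
    logger.info(s"Triggering bulk load for GroupBy: ${args.groupbyName()} for partition: " +
      s"${args.partitionString()} from table: ${args.srcOfflineTable()}")
    val startMs = System.currentTimeMillis()
    kvStore.bulkPut(args.srcOfflineTable(), args.groupbyName(), args.partitionString())
    val elapsedSec = (System.currentTimeMillis() - startMs) / 1000
    logger.info(s"Uploaded GroupByUpload data to KV store for GroupBy: ${args.groupbyName()}; " +
      s"partition: ${args.partitionString()} in $elapsedSec seconds")
  }
}
```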

Possibly related PRs

  • Summary upload #50: The SummaryUploader class introduced in this PR is related to the new GroupByUploadToKVBulkLoad object in the main PR, as both involve uploading data to a key-value store and include methods for handling data uploads.
  • Driver Summarizer #62: The changes in the Driver.scala file in this PR include enhancements to data summarization and uploading capabilities, which align with the new functionalities introduced in the GroupByUploadToKVBulkLoad object in the main PR.
  • Rework BigTableKV Store & GCP Api #135: The modifications to the BigTableKVStoreImpl class in this PR, particularly regarding the handling of bulk uploads and data management, are relevant to the bulk upload functionality introduced in the GroupByUploadToKVBulkLoad object in the main PR.
  • feat: support providing additional confs as yaml file for Driver.scala #164: The introduction of additional configuration capabilities in this PR may enhance the functionality of the GroupByUploadToKVBulkLoad object by allowing for more flexible configuration options during data uploads.
  • GroupByUploader in Driver should use the top level TableUtils #177: The change to utilize the top-level TableUtils in the GroupByUploader directly relates to the enhancements made in the main PR, ensuring that the new bulk upload functionality operates with the correct configurations.

Suggested reviewers

  • tchow-zlai
  • david-zlai

Poem

🚀 Data flows like a river's might,
Bulk loads dancing in digital light,
Chronon's spark ignites the way,
Transforming bytes without delay,
A symphony of upload's delight! 🔥

Warning

Review ran into problems

🔥 Problems

GitHub Actions: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between a7a8bf3 and 79e703f.

📒 Files selected for processing (1)
  • cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: table_utils_delta_format_spark_tests
  • GitHub Check: other_spark_tests
  • GitHub Check: fetcher_spark_tests
  • GitHub Check: mutation_spark_tests
  • GitHub Check: join_spark_tests
  • GitHub Check: scala_compile_fmt_fix


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
spark/src/main/scala/ai/chronon/spark/Driver.scala (1)

730-746: Add better error recovery.
Currently, one exception disrupts the entire bulk load. Consider partial or iterative handling to avoid data loss.
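A hedged sketch of one way to act on this nitpick: wrap the single bulkPut call in a bounded retry so a transient export failure does not immediately fail the whole upload. The retryBulkPut helper and the KvStoreLike trait (from the sketch above) are illustrative, not part of the PR.

```
// Illustrative retry wrapper around the bulk load; names are hypothetical.
def retryBulkPut(kvStore: KvStoreLike,
                 srcOfflineTable: String,
                 groupByName: String,
                 partition: String,
                 maxAttempts: Int = 3): Unit = {
  var attempt = 1
  var done = false
  while (!done) {
    try {
      kvStore.bulkPut(srcOfflineTable, groupByName, partition)
      done = true
    } catch {
      // Retry with a linearly growing backoff; on the final attempt the
      // exception is not caught here and propagates to the caller.
      case _: Exception if attempt < maxAttempts =>
        Thread.sleep(5000L * attempt)
        attempt += 1
    }
  }
}
```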

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 3aa7369 and bb1f7ec.

📒 Files selected for processing (1)
  • spark/src/main/scala/ai/chronon/spark/Driver.scala (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: table_utils_delta_format_spark_tests
  • GitHub Check: other_spark_tests
  • GitHub Check: mutation_spark_tests
  • GitHub Check: fetcher_spark_tests
  • GitHub Check: join_spark_tests
  • GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (4)
spark/src/main/scala/ai/chronon/spark/Driver.scala (4)

717-718: Good introduction of a new subcommand.
The naming is consistent with other subcommands.


719-729: Validate the partition format.
Ensure the provided partition string matches “yyyy-MM-dd” to reduce runtime errors.


948-949: Subcommand integration looks good.
Everything aligns with the established pattern.


995-997: Consistent with existing structure.
No issues spotted, subcommand is properly invoked.

opt[String](required = true, descr = "Name of the GroupBy that we're triggering this upload for")

val partitionString: ScallopOption[String] =
opt[String](required = true, descr = "Partition string (in 'yyyy-MM-dd' format) that we are uploading")
Collaborator

@tchow-zlai tchow-zlai Jan 6, 2025

Is this string supposed to match whatever's configured at:

private val partitionFormat: String =
sparkSession.conf.get("spark.chronon.partition.format", "yyyy-MM-dd")
?

Contributor Author

yeah this needs to match what we're writing out.

Collaborator

Okay, I think there's some form of this where we can DRY things up, but it wouldn't block the PR. Thanks for clarifying!

@piyush-zlai piyush-zlai force-pushed the piyush/gbu_driver_verb branch from aca549b to 078d378 on January 7, 2025 05:35
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between bb1f7ec and 078d378.

📒 Files selected for processing (2)
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (2 hunks)
  • spark/src/main/scala/ai/chronon/spark/Driver.scala (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: table_utils_delta_format_spark_tests
  • GitHub Check: other_spark_tests
  • GitHub Check: join_spark_tests
  • GitHub Check: mutation_spark_tests
  • GitHub Check: scala_compile_fmt_fix
  • GitHub Check: fetcher_spark_tests
🔇 Additional comments (4)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigTableKVStoreImpl.scala (2)

336-338: LGTM: Job ID uniqueness improvement.

Adding timestamp to job ID prevents conflicts for same-partition exports.
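For illustration, a timestamp-suffixed export job ID could be derived roughly like this; the exact naming in the PR may differ, and the sanitization pattern here is only an assumption about what BigQuery job IDs accept (letters, digits, underscores, dashes).

```
// Hypothetical helper: build a unique, re-runnable job ID for a given GroupBy + partition.
def exportJobId(groupByName: String, partition: String): String = {
  val sanitized = s"$groupByName-$partition".replaceAll("[^a-zA-Z0-9_-]", "-")
  s"export-$sanitized-${System.currentTimeMillis()}"
}

// e.g. exportJobId("quickstart.purchases.v1", "2023-11-30")
//  ==> export-quickstart-purchases-v1-2023-11-30-1736150400000
```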


315-315: Verify column family configuration matches TableUtils.

The empty field name and "cf" alias should align with the column family configuration in TableUtils.

Also applies to: 323-323

✅ Verification successful

Column family configuration verified ✓

The empty field name and "cf" alias in the export configuration correctly align with BigTableKVStoreImpl's constants:

  • ColumnFamilyString = "cf"
  • ColumnFamilyQualifierString = "value"
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for column family configuration in TableUtils
rg -A 5 'column.*family' 

Length of output: 26


Script:

#!/bin/bash
# Find TableUtils file
fd TableUtils

# Search for cf or columnFamily patterns
rg -i "columnfamily|cf\s*=" 

# Search for BigTable configuration patterns
rg -i "bigtable.*config"

Length of output: 2607

spark/src/main/scala/ai/chronon/spark/Driver.scala (2)

717-747: LGTM: Well-structured bulk load implementation.

Good error handling and timing metrics. Clear logging of operations.


948-949: LGTM: Clean CLI integration.

New command properly integrated into the CLI framework.

Also applies to: 995-997

Comment on lines +720 to +727
val srcOfflineTable: ScallopOption[String] =
  opt[String](required = true, descr = "Name of the source GroupBy Upload table")

val groupbyName: ScallopOption[String] =
  opt[String](required = true, descr = "Name of the GroupBy that we're triggering this upload for")

val partitionString: ScallopOption[String] =
  opt[String](required = true, descr = "Partition string (in 'yyyy-MM-dd' format) that we are uploading")
Contributor

🛠️ Refactor suggestion

Add validation for partition string format.

Ensure the partition string matches 'yyyy-MM-dd' format to prevent runtime errors.

 val partitionString: ScallopOption[String] =
   opt[String](required = true, descr = "Partition string (in 'yyyy-MM-dd' format) that we are uploading")
+  validate(s => try {
+    java.time.LocalDate.parse(s, java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd"))
+    true
+  } catch {
+    case _: Exception => false
+  })
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

val srcOfflineTable: ScallopOption[String] =
  opt[String](required = true, descr = "Name of the source GroupBy Upload table")

val groupbyName: ScallopOption[String] =
  opt[String](required = true, descr = "Name of the GroupBy that we're triggering this upload for")

val partitionString: ScallopOption[String] =
  opt[String](required = true, descr = "Partition string (in 'yyyy-MM-dd' format) that we are uploading")

validate(s =>
  try {
    java.time.LocalDate.parse(s, java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd"))
    true
  } catch {
    case _: Exception => false
  })

@tchow-zlai
Collaborator

tchow-zlai commented Jan 7, 2025

Actually @piyush-zlai, we were chatting earlier today and @david-zlai brought up the question of how this job is meant to be kicked off. As implemented, it seems like it will be run straight from the user's laptop. Currently, run.py submits a jar containing Driver.scala to the Dataproc cluster, and Driver.scala gets invoked on that cluster. David mentioned this job doesn't need to run on Dataproc the way it's written, so I guess it would be run from the user's laptop; however, we haven't been working under the assumption that Driver.scala executes in the user's laptop environment. Is this an execution path we should introduce? Or could we piggyback off a Spark driver container to run the export query, so that the submission flow stays consistent?

@piyush-zlai
Contributor Author

Let's chat about this offline - I'll ping you and David

@piyush-zlai
Contributor Author

(Discussed offline - run.py will call the submitter which in turn calls Driver with the right verb + params. We can attempt to use the same rails for this bulk load invocation as well)
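Concretely, the submitter path would just forward the same verb and flags that the manual java -cp invocation in the PR description passes to ai.chronon.spark.Driver; something along these lines (jar path shortened for illustration):

```
// Argument list the submitter would hand to the Driver main class on Dataproc.
val driverArgs = Seq(
  "groupby-upload-bulk-load",
  "--online-jar=cloud_gcp-assembly-0.1.0-SNAPSHOT.jar",
  "--online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl",
  "--src-offline-table=data.test_gbu",
  "--groupby-name=quickstart.purchases.v1",
  "--partition-string=2023-11-30"
)
```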

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GcpApiImpl.scala (2)

24-26: LGTM! Consider enhancing error message.

The fallback mechanism is well implemented.

-      .getOrElse(throw new IllegalArgumentException("GCP_PROJECT_ID environment variable not set"))
+      .getOrElse(throw new IllegalArgumentException("GCP_PROJECT_ID not found in environment or configuration"))

29-31: Consider extracting the credential retrieval pattern.

Duplicate pattern with GCP_PROJECT_ID retrieval.

private def getGcpConfig(key: String): String = {
  sys.env.get(key)
    .orElse(conf.get(key))
    .getOrElse(throw new IllegalArgumentException(s"$key not found in environment or configuration"))
}
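With a helper along those lines, the two call sites would presumably collapse to something like:

```
val projectId  = getGcpConfig("GCP_PROJECT_ID")
val instanceId = getGcpConfig("GCP_INSTANCE_ID")
```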
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala (1)

61-76: Consider adding parameter validation.

The test would be more robust with validation of the parameters passed to submit.

Add assertions to verify (a rough sketch follows the list):

  • Source table format
  • GroupBy name format
  • Partition string format
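A hedged sketch of what those assertions could look like; the values here are hard-coded placeholders, since the real test would capture whatever is actually passed to submit:

```
// Placeholder values; in the real test these would be captured from the submit call.
val srcOfflineTable = "data.test_gbu"
val groupByName     = "quickstart.purchases.v1"
val partition       = "2023-11-30"

assert(srcOfflineTable.matches("""\w+\.\w+"""))       // dataset.table form
assert(groupByName.matches("""[\w.]+"""))             // dotted GroupBy name
assert(partition.matches("""\d{4}-\d{2}-\d{2}"""))    // yyyy-MM-dd partition string
```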
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 078d378 and a7a8bf3.

📒 Files selected for processing (2)
  • cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GcpApiImpl.scala (1 hunks)
  • cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: table_utils_delta_format_spark_tests
  • GitHub Check: fetcher_spark_tests
  • GitHub Check: mutation_spark_tests
  • GitHub Check: join_spark_tests
  • GitHub Check: other_spark_tests
  • GitHub Check: scala_compile_fmt_fix
🔇 Additional comments (1)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/DataprocSubmitterTest.scala (1)

65-65: Verify empty jar list.

The empty jar list (List.empty) might be unintentional.

Comment on lines +74 to +75
println(submittedJobId)
assertEquals(submittedJobId, "mock-job-id")
Contributor

⚠️ Potential issue

Fix incorrect assertion.

The test creates a real DataprocSubmitter but expects a mock job ID.

Either:

  1. Use a mocked submitter like in the first test case (a minimal sketch follows below), or
  2. Remove the assertion and keep it as a local-only test
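A minimal sketch of option 1, assuming Mockito-style mocking; the SubmitterLike trait and the submit signature below are stand-ins, not the real DataprocSubmitter API:

```
import org.junit.Assert.assertEquals
import org.mockito.ArgumentMatchers.any
import org.mockito.Mockito.{mock, when}

// Stand-in for the submitter interface; the real class and signature live in cloud_gcp.
trait SubmitterLike {
  def submit(mainClass: String, jars: List[String], args: List[String]): String
}

val mockSubmitter = mock(classOf[SubmitterLike])
when(mockSubmitter.submit(any(), any(), any())).thenReturn("mock-job-id")

val submittedJobId =
  mockSubmitter.submit("ai.chronon.spark.Driver", List.empty, List("groupby-upload-bulk-load"))
assertEquals("mock-job-id", submittedJobId)
```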

@piyush-zlai
Contributor Author

Was able to test this via a submitter test that invokes the driver with the right verb + params. Merging this.

@piyush-zlai piyush-zlai merged commit 04c9ad7 into main Jan 8, 2025
9 checks passed
@piyush-zlai piyush-zlai deleted the piyush/gbu_driver_verb branch January 8, 2025 08:20
tchow-zlai pushed a commit that referenced this pull request Jan 9, 2025
## Summary
Add a verb to the Driver to allow us to bulk load GBU data to the KV
store of choice.

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [X] Integration tested
- [ ] Documentation update

Tested manually using the dummy gbu table (on bq) I used while testing
the BigTable kv store code (data.test_gbu)

```
$ export GCP_INSTANCE_ID="zipline-canary-instance"
$ export GCP_PROJECT_ID="canary-443022"
$ java -cp spark/target/scala-2.12/spark-assembly-0.1.0-SNAPSHOT.jar:/opt/homebrew/Cellar/apache-spark/3.5.4/libexec/jars/* ai.chronon.spark.Driver groupby-upload-bulk-load --online-jar=cloud_gcp/target/scala-2.12/cloud_gcp-assembly-0.1.0-SNAPSHOT.jar --online-class=ai.chronon.integrations.cloud_gcp.GcpApiImpl --src-offline-table=data.test_gbu --groupby-name=quickstart.purchases.v1 --partition-string=2023-11-30
...
Triggering bulk load for GroupBy: quickstart.purchases.v1 for partition: 2023-11-30 from table: data.test_gbu
Uploaded GroupByUpload data to KV store for GroupBy: quickstart.purchases.v1; partition: 2023-11-30 in 2 seconds
```

Was also able to test via triggering the submitter test. The upload
kicks off this [Spark
job](https://console.cloud.google.com/dataproc/jobs/0af24968-51b2-45e7-95da-8a890b094837?region=us-central1&hl=en&inv=1&invt=AbmQ-g&project=canary-443022)
and the bulk load succeeds.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **New Features**
- Added a new command-line subcommand for bulk loading GroupBy data to a
key-value store.
- Introduced a new option to upload offline GroupBy tables with
specified parameters.

- **Improvements**
- Enhanced the export process for data to BigTable with updated query
parameters.
	- Improved job ID generation for uniqueness during data uploads.
- Updated error handling for environment variable retrieval to allow
fallback options.

- **Tests**
- Added a new test case for the `DataprocSubmitter` class related to GBU
bulk loading.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Co-authored-by: Thomas Chow <[email protected]>
kumar-zlai pushed a commit that referenced this pull request Apr 25, 2025
kumar-zlai pushed a commit that referenced this pull request Apr 29, 2025
chewy-zlai pushed a commit that referenced this pull request May 15, 2025
chewy-zlai pushed a commit that referenced this pull request May 16, 2025