feat: Custom StagingQuery to write parquet #604

tchow-zlai · 2025-04-07T23:53:28Z

Summary

Checklist

Added Unit Tests
Covered by existing CI
Integration tested
Documentation update

Summary by CodeRabbit

New Features
- Introduced support for creating tables with the "PARQUET" provider in addition to "ICEBERG".
- Added a new method to convert identifiers into a string representation for better integration.
- Expanded the set of shared dependencies to include the Spark MLlib library, enhancing machine learning capabilities.
- Updated table creation SQL commands to optionally accept location strings.
Bug Fixes
- Improved handling of location strings in table creation SQL commands.
Tests
- Added a test case for creating external tables in Parquet format, verifying functionality and partition handling.
- Updated tests to include new configuration settings for Spark SQL catalog.
Chores
- Updated artifact management to reflect new dependencies and hash changes in the project configuration.

coderabbitai · 2025-04-07T23:53:38Z

Walkthrough

The changes update table creation logic across Spark and Cloud GCP components. In Spark modules, a new private variable and an optional SQL location parameter are introduced to enhance table creation. The Cloud GCP integrations now support additional table providers via pattern matching in the createTable method and improve table identifier handling. In addition, dependency management is expanded with new Maven artifacts and updated build scripts, and a test case for external Parquet tables has been added.

Changes

| File(s) | Change Summary |
|---|---|
| `spark/.../TableUtils.scala`, `spark/.../CreationUtils.scala` | Added a new `tableWriteWarehouse` variable in TableUtils and updated `createTable` to pass it; enhanced `createTableSql` with an optional `locationString` parameter. |
| `cloud_gcp/.../DelegatingBigQueryMetastoreCatalog.scala`, `cloud_gcp/.../SparkBQUtils.scala` | Modified `createTable` in DelegatingBigQueryMetastoreCatalog to support multiple providers (ICEBERG and PARQUET) and updated table identifier parsing by adding `toTableString` and overloading `toTableId` in SparkBQUtils. |
| `cloud_gcp/BUILD.bazel`, `maven_install.json`, `tools/.../maven_repository.bzl` | Added a new Spark MLlib dependency and several new Maven artifacts, while updating artifact hashes and dependency relationships. |
| `cloud_gcp/.../BigQueryCatalogTest.scala` | Uncommented Spark SQL catalog configurations and added a new test case for creating an external Parquet table. |

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Catalog as DelegatingBigQueryMetastoreCatalog
    participant Iceberg as icebergCatalog
    participant BigQuery as bigQueryClient
    Caller->>Catalog: createTable(ident, schema, partitions, properties)
    alt Provider == ICEBERG
        Catalog->>Iceberg: createTable(...)
    else Provider == PARQUET
        Catalog->>Catalog: Build external table definition\nand validate partitioning
        Catalog->>BigQuery: create(table)
    else
        Catalog-->>Caller: Throw UnsupportedOperationException
    end
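
The branching in the diagram can be sketched in plain Scala. This is a minimal illustration, not the PR's implementation: `ProviderDispatch`, the `Outcome` type and its cases are hypothetical stand-ins for the real calls to `icebergCatalog` and `bigQueryClient`, and the exception message is an assumption.

```scala
object ProviderDispatch {
  // Hypothetical result type standing in for the real catalog calls.
  sealed trait Outcome
  case object DelegatedToIceberg extends Outcome
  case object CreatedExternalParquet extends Outcome

  // Dispatch on the provider name, ignoring case, as the diagram describes.
  def createTable(provider: String): Outcome =
    provider.toUpperCase match {
      case "ICEBERG" => DelegatedToIceberg      // real code: icebergCatalog.createTable(...)
      case "PARQUET" => CreatedExternalParquet  // real code: build external table, then bigQueryClient.create(...)
      case other =>
        throw new UnsupportedOperationException(s"Unsupported provider: $other")
    }
}
```

Any provider other than the two listed falls through to the last case, which matches the `else` branch of the diagram.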

Possibly related PRs

feat: Set iceberg table options through table properties #531: Modifications in the TableUtils class related to table management.
feat: support Format specific createTable #261: Updates to the createTable method's parameters and functionality.
feat: bigquery catalog with iceberg support #393: Enhancements to the createTable method in the DelegatingBigQueryMetastoreCatalog class.

Suggested reviewers

nikhil-zlai
piyush-zlai
varant-zlai
david-zlai

Poem

In code we trust, tonight we write,
Tables awaken with new insight.
Dependencies dance, tests align,
MLlib joins the grand design.
Spark and GCP in joyful flight! ✨🚀

Warning

Review ran into problems

🔥 Problems

GitHub Actions and Pipeline Checks: Resource not accessible by integration - https://docs.github.com/rest/actions/workflow-runs#list-workflow-runs-for-a-repository.

Please grant the required permissions to the CodeRabbit GitHub App under the organization or repository settings.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (5)

cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQueryTest.scala (1)

14-14: Tests are well-structured.
Coverage looks good. Consider adding a case for invalid GCS path schemes.

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQuery.scala (4)

20-22: Optional improvement for configuration.
Consider allowing the bqOptions to be injected for flexibility in testing.

28-45: Use an exception instead of assertion.
Assertions can be disabled in production. Prefer IllegalArgumentException.

58-79: Avoid duplicated Parquet save logic.
Unify unpartitioned/partitioned writes in a helper to reduce redundancy.

81-101: Add a simple log entry.
Helps visibility when creating the table in BigQuery.
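
For the 28-45 nitpick above (exception instead of assertion), one possible shape is sketched here. It assumes the check is about the path scheme; `PathValidation` and `validateGcsPath` are hypothetical names, and the actual validation in GCPStagingQuery may look different. In Scala, `assert` is annotated `@elidable` and can be compiled out, while `require` is not and throws `IllegalArgumentException`.

```scala
object PathValidation {
  // Hypothetical check: the real validation in GCPStagingQuery may differ.
  def validateGcsPath(path: String): String = {
    // require throws IllegalArgumentException and is not removed by
    // -Xdisable-assertions, unlike assert.
    require(path.startsWith("gs://"), s"Expected a gs:// path but got: $path")
    path
  }
}
```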

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 22aa9a8 and f8dd58f.

📒 Files selected for processing (2)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQuery.scala (1 hunks)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQueryTest.scala (1 hunks)

🧰 Additional context used

🧬 Code Definitions (1)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQuery.scala (3)

api/src/main/scala/ai/chronon/api/ScalaJavaConversions.scala (2)

ScalaJavaConversions (6-97)

JListOps (70-78)

spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1)

sql (298-326)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (2)

SparkBQUtils (6-17)

toTableId (8-15)

⏰ Context from checks skipped due to timeout of 90000ms (3)

GitHub Check: non_spark_tests
GitHub Check: non_spark_tests
GitHub Check: enforce_triggered_workflows

🔇 Additional comments (2)

cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQueryTest.scala (2)

30-35: Looks correct.
Checks PARQUET setup and asserts correctly.

57-68: Good coverage of trailing slash scenarios.
Both path forms are validated.
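
The trailing-slash case mentioned above could be handled with a small normalizing helper. This is only a sketch under assumptions: `WarehousePaths` and `tableLocation` are illustrative names that do not appear in the PR, and the real code may build the location differently.

```scala
object WarehousePaths {
  // Hypothetical helper: join a warehouse root and a table name so that
  // "gs://bucket/wh" and "gs://bucket/wh/" yield the same location.
  def tableLocation(warehouse: String, table: String): String =
    s"${warehouse.stripSuffix("/")}/$table"
}
```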

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between c14f43f and 2b4ecd9.

📒 Files selected for processing (1)

spark/src/main/scala/ai/chronon/spark/Driver.scala (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (17)

GitHub Check: streaming_tests
GitHub Check: join_tests
GitHub Check: groupby_tests
GitHub Check: analyzer_tests
GitHub Check: fetcher_tests
GitHub Check: streaming_tests
GitHub Check: spark_tests
GitHub Check: join_tests
GitHub Check: batch_tests
GitHub Check: batch_tests
GitHub Check: groupby_tests
GitHub Check: analyzer_tests
GitHub Check: fetcher_tests
GitHub Check: spark_tests
GitHub Check: non_spark_tests
GitHub Check: scala_compile_fmt_fix
GitHub Check: non_spark_tests

🔇 Additional comments (1)

spark/src/main/scala/ai/chronon/spark/Driver.scala (1)

480-484: Good use of conditional instantiation for GCP integration.

The conditional logic creates appropriate query instances based on environment.

spark/src/main/scala/ai/chronon/spark/Driver.scala

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQuery.scala

spark/src/main/scala/ai/chronon/spark/Driver.scala

Co-authored-by: Thomas Chow <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (1)
174-227:
❓ Verification inconclusive

New PARQUET path creation.
Consider verifying presence of underlying data or gracefully handling missing files.

🏁 Script executed:
#!/bin/bash
# Checking for *.parquet files in the repo
fd -e parquet
Length of output: 15
PARQUET Data Check Note:

The PARQUET path doesn’t verify that .parquet files exist at the specified URI.

Our check (using fd -e parquet) returned no results; if data is expected externally, ensure this case is handled gracefully.

Please manually verify that missing file scenarios are appropriately addressed.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 2b4ecd9 and cdd5a1e.

📒 Files selected for processing (8)

cloud_gcp/BUILD.bazel (1 hunks)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (2 hunks)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (1 hunks)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (2 hunks)
maven_install.json (53 hunks)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2 hunks)
spark/src/main/scala/ai/chronon/spark/format/CreationUtils.scala (2 hunks)
tools/build_rules/dependencies/maven_repository.bzl (1 hunks)

🧰 Additional context used

🧬 Code Definitions (2)

cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (2)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (2)

createTable (169-229)

loadTable (120-167)

spark/src/main/scala/ai/chronon/spark/TableUtils.scala (3)

createTable (210-236)

insertPartitions (238-290)

loadTable (118-120)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (1)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (4)

SparkBQUtils (7-30)

toTableString (9-11)

toTableId (12-15)

toTableId (17-20)

⏰ Context from checks skipped due to timeout of 90000ms (18)

GitHub Check: streaming_tests
GitHub Check: groupby_tests
GitHub Check: fetcher_tests
GitHub Check: batch_tests
GitHub Check: analyzer_tests
GitHub Check: spark_tests
GitHub Check: join_tests
GitHub Check: streaming_tests
GitHub Check: analyzer_tests
GitHub Check: spark_tests
GitHub Check: join_tests
GitHub Check: groupby_tests
GitHub Check: fetcher_tests
GitHub Check: bazel_config_tests
GitHub Check: non_spark_tests
GitHub Check: batch_tests
GitHub Check: scala_compile_fmt_fix
GitHub Check: non_spark_tests

🔇 Additional comments (71)

cloud_gcp/BUILD.bazel (1)

36-36: Added Spark MLlib dependency

tools/build_rules/dependencies/maven_repository.bzl (2)

194-194: Added Spark MLlib for Scala 2.12

201-201: Added Spark MLlib for Scala 2.13

cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (3)

43-52: Uncommented Iceberg catalog configuration

These settings enable the Iceberg catalog integration with BigQuery, providing necessary configuration for warehouse location and GCP project details.

54-55: Added table write configuration

Configuration settings for Parquet format and warehouse location that support the custom StagingQuery functionality.

118-126: Added test for external Parquet table creation

Test validates the core functionality of creating an external Parquet table with proper partitioning.

spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2)

79-79: Added tableWriteWarehouse configuration variable

219-226: Updated createTableSql call to include warehouse location

The method now passes tableWriteWarehouse to CreationUtils.createTableSql, enabling custom location for Parquet tables.

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (5)

5-5: Import is fine.

9-11: Confirm backslash usage.
Potentially use "." to separate namespaces if the backslash is unintentional.

14-15: Good unification of parsing logic.

17-20: Consistent method for Identifier → TableId.

22-23: All clear here.

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (1)

3-4: Imports look good.

Also applies to: 8-9, 12-13, 15-15, 18-18

spark/src/main/scala/ai/chronon/spark/format/CreationUtils.scala (2)

14-15: Location param is handy.

31-31: Conditional LOCATION usage.
Looks concise and correct.
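
A conditional LOCATION clause of this kind might look like the sketch below. The signature is an assumption: the PR only says `createTableSql` gained an optional `locationString`, so the parameter type, the column representation and the `CreateSqlSketch` object are hypothetical.

```scala
object CreateSqlSketch {
  // Hypothetical signature; the real CreationUtils.createTableSql takes other arguments.
  def createTableSql(table: String,
                     columns: Seq[(String, String)],
                     provider: String,
                     locationString: Option[String] = None): String = {
    val cols = columns.map { case (name, tpe) => s"`$name` $tpe" }.mkString(", ")
    val base = s"CREATE TABLE $table ($cols) USING $provider"
    // Append LOCATION only when a location was supplied.
    locationString.map(loc => s"$base LOCATION '$loc'").getOrElse(base)
  }
}
```

Callers that do not pass a location get the same statement as before, which keeps the change backward compatible for existing table formats.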

maven_install.json (55)

3-4: Hashes Updated. Confirm new __INPUT_ARTIFACTS_HASH and __RESOLVED_ARTIFACTS_HASH.

434-440: New Artifact Added. "com.github.wendykierp:JTransforms" v3.1 with its shas.

1225-1231: New Artifact Added. "com.sun.istack:istack-commons-runtime" v3.0.8 with updated shas.

1435-1455: New Netlib Artifacts. "arpack", "blas", "lapack" v3.0.3 added.

2558-2564: New Artifact Added. "net.sourceforge.f2j:arpack_combined_all" v0.1.

3479-3492: New Spark GraphX Artifacts. Both Scala 2.12 and 2.13 set to v3.5.3.

3535-3565: New Spark MLlib & Network. Added Spark MLlib (2.12/2.13) and network-common, all v3.5.3.

4039-4045: Update JAXB Runtime. "jaxb-runtime" now at v2.3.2 with new shas.

4515-4545: New Breeze & Scalatest Artifacts. Breeze macros/core v2.1.0 and scalatest-compatible added.

4837-4850: Typelevel Algebra Versions. Updated to 2.0.1 (2.12) & 2.8.0 (2.13).

4893-4951: New Spire & Snappy. Added spire modules (varying versions) and snappy-java update.

4970-4976: New Artifact Added. "pl.edu.icm:JLargeArrays" v1.5 added.

4977-4979: Dependency Note. Verify "ru.vyarus:generics-resolver" version consistency.

5450-5453: Dep Mapping. "com.github.wendykierp:JTransforms" now maps to commons-math3 and JLargeArrays.

6545-6553: Dep Map. Netlib artifacts now depend on "net.sourceforge.f2j:arpack_combined_all".

8084-8099: Dep Map Update. Spark GraphX now lists required dependencies.

8169-8213: MLlib Dep Map. Updated dependencies for Spark MLlib (Scala 2.12 & 2.13).

8455-8458: JAXB Dep Update. "jaxb-runtime" now includes istack-commons-runtime and jakarta.xml.bind-api.

8603-8632: Breeze Dep Map. Updated dependencies for "breeze_2.12" and "breeze_2.13".

8777-8833: Typelevel Dep Map. Revised mappings for algebra, spire, and cats modules.

9892-9898: JTransforms Dep Map. Now includes DCT, DHT, DST, FFT, and utils.

11351-11355: Dep Map Update. "istack-commons-runtime" now lists additional modules.

11687-11695: Netlib Dep Map Standardized. Now uses dot notation for arpack, blas, and lapack.

13186-13192: Arpack_Combined_All Map. Now includes netlib.arpack, blas, err, lapack, and util.

20891-20906: Spark GraphX Dep Map. Expanded module mapping for GraphX.

20955-21123: Spark MLlib Overhaul. Extensive dependency mapping; please verify correctness.

22421-22447: JAXB Expansion. Full dependency list added for "jaxb-runtime".

23153-23230: Breeze Detailed. Comprehensive dependency list for Breeze (both Scala versions).

23853-23900: Typelevel Algebra Map. Detailed list for Scala 2.12 and 2.13.

24005-24069: Spire Map Update. Comprehensive mapping for all spire modules.

24108-24110: JLargeArrays Map. Now maps to "pl.edu.icm.jlargearrays".

24584-24585: JTransforms Dep Map. Jar sources mapping added.

24809-24810: Istack Commons Mapping. Jar sources mapping added.

24869-24874: Netlib Map Update. Jar sources now mapped for arpack, blas, and lapack.

25194-25194: Arpack_Combined_All. Jar sources mapping updated.

25446-25449: Spark GraphX Mapping. Jar sources added for GraphX.

25462-25469: MLlib Jar Sources. Updated jar sources for Spark MLlib.

25604-25605: JAXB Runtime Mapping. Jar sources added.

25740-25747: Breeze Jar Sources. Jar sources added for Breeze libraries.

25832-25835: Algebra Jar Sources. Jar sources mapping updated.

25848-25871: Comprehensive Jar Mapping. Spire, Snakeyaml, JLargeArrays, and related deps updated.

26063-26064: JTransforms Mapping. Jar sources updated.

26288-26289: Istack Mapping. Jar sources confirmed.

26348-26353: Netlib Update. Jar sources added for arpack, blas, and lapack.

26673-26673: Arpack_Combined_All. Jar sources mapping updated.

26925-26928: Spark GraphX Sources. Jar sources mapping updated.

26941-26948: MLlib Sources. Jar sources updated.

27083-27084: JAXB Runtime. Jar source mapping updated.

27219-27226: Breeze Jar Sources. Updated mapping for macros and core.

27311-27314: Algebra Sources. Jar sources mapping updated.

27327-27342: Spire Jar Sources. Updated for all spire modules.

27349-27350: JLargeArrays Sources. Jar sources mapping added.

32187-32234: Spark MLlib Map. Updated mappings for MLFormatRegister and DataSourceRegister.

33995-34004: JAXB Context Mapping. Consistent mapping for JAXBContext.

34259-34259: Arpack_Combined_All. Jar sources mapping updated.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between cdd5a1e and f4c6c65.

📒 Files selected for processing (8)

cloud_gcp/BUILD.bazel (1 hunks)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (2 hunks)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (1 hunks)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (2 hunks)
maven_install.json (53 hunks)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2 hunks)
spark/src/main/scala/ai/chronon/spark/format/CreationUtils.scala (2 hunks)
tools/build_rules/dependencies/maven_repository.bzl (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (6)

cloud_gcp/BUILD.bazel
tools/build_rules/dependencies/maven_repository.bzl
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala
spark/src/main/scala/ai/chronon/spark/format/CreationUtils.scala
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala
spark/src/main/scala/ai/chronon/spark/TableUtils.scala

⏰ Context from checks skipped due to timeout of 90000ms (16)

GitHub Check: streaming_tests
GitHub Check: streaming_tests
GitHub Check: spark_tests
GitHub Check: join_tests
GitHub Check: join_tests
GitHub Check: groupby_tests
GitHub Check: groupby_tests
GitHub Check: fetcher_tests
GitHub Check: fetcher_tests
GitHub Check: non_spark_tests
GitHub Check: batch_tests
GitHub Check: analyzer_tests
GitHub Check: spark_tests
GitHub Check: analyzer_tests
GitHub Check: batch_tests
GitHub Check: non_spark_tests

🔇 Additional comments (56)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (3)

17-20: LGTM: New overloaded method enhances flexibility

Clean implementation that accepts Identifier directly.

22-28: LGTM: Good extraction of parsing logic

Improves code reuse and maintainability.

12-15: LGTM: Method update uses extracted helper

Good refactoring to use the new helper method.

maven_install.json (53)

1-7: Hashes Updated: Updated artifact hashes look good.

431-440: JTransforms Added: New artifact (v3.1) added.

1222-1231: istack Added: New istack-commons-runtime (v3.0.8) added.

1432-1455: Netlib Artifacts: Added arpack, blas, and lapack (v3.0.3).

2555-2564: arpack_combined_all Added: v0.1 added; sources is null – verify if expected.

3476-3493: Spark GraphX Added: New spark-graphx for both 2.12 and 2.13.

3532-3562: Spark MLlib Update: MLlib and local artifacts updated.

4036-4045: JAXB-runtime Update: Now using version 2.3.2.

4512-4540: Breeze Libraries: breeze-macros and breeze added.

4834-4850: Typelevel Algebra: Versions updated; verify cross-Scala consistency.

4890-4948: Spire Libraries: spire-macros, platform, util, and spire added.

4967-4976: JLargeArrays Added: New artifact (v1.5) added.

5447-5456: JTransforms Mapping: Now maps to commons-math3 and JLargeArrays.

6542-6556: Netlib Mapping Update: Dependencies for arpack, blas, and lapack now target arpack_combined_all.

8081-8099: GraphX Mapping: Dependency mapping for spark-graphx updated.

8166-8214: MLlib Mapping: Spark MLlib-local and MLlib dependencies updated.

8452-8461: Jersey/JAXB Mapping: Dependency mappings updated.

8600-8632: Breeze Mapping: Dependency arrays for breeze libraries updated.

8777-8807: Typelevel/Cats Mapping: Updated dependencies for typelevel algebra and cats-core.

9889-9901: Findbugs/JTransforms Mapping: Mapping updated.

11348-11358: istack Mapping: istack-commons-runtime mapping updated.

11684-11698: Netlib Mapping: arpack, blas, and lapack mapping updated.

13183-13193: f2j arpack Mapping: Dependency mapping updated.

20888-20909: GraphX Expanded: Additional spark-graphx util dependencies added.

20940-21009: MLlib Expanded: Extensive Spark ML mapping updated.

22418-22447: JAXB-runtime Expanded: More submodules added.

23150-23233: Breeze Mapping: Updated for Scala 2.12 and 2.13.

23850-23900: Algebra Mapping: Typelevel algebra dependencies updated.

24002-24020: Spire Mapping: spire-macros, platform, and util updated.

24105-24113: JLargeArrays Mapping: Dependency mapping revised.

24581-24588: JTransforms Mapping: Mapping verified.

24806-24813: Javapoet/istack Mapping: Updated.

24866-24877: Netlib/Dnsjava Mapping: Updated.

25191-25197: Opencsv/ST4 Mapping: Dependency mapping updated.

25443-25452: Spark Core/GraphX Mapping: Updated.

25459-25472: Spark Launcher/MLlib Mapping: Updated.

25601-25608: HK2/JAXB Mapping: Updated.

25737-25750: Scalactic/Breeze Mapping: Updated.

25829-25838: Algebra/Cats-Core Mapping: Updated.

25845-25867: Jawn/Spire Mapping: Updated.

26060-26067: JTransforms Duplicate? Check for redundant mapping.

26285-26292: Javapoet/istack Update: Mapping confirmed.

26345-26356: Netlib/Dnsjava Mapping: Still valid.

26670-26676: Opencsv/ST4 Update: Mapping updated.

26922-26931: Spark Core/GraphX Update: Verified.

26938-26951: Spark Launcher/MLlib Update: Verified.

27080-27087: HK2/JAXB Update: Mapping verified.

27216-27229: Scalactic/Breeze Update: Mapping verified.

27308-27317: Threetenbp/Algebra Update: Mapping verified.

27324-27353: Spire/Snappy Update: Mapping verified.

32184-32237: Spark MLFormat Mapping: Register mappings updated.

33992-34004: HK2/JAXB Update: Mapping verified.

34256-34262: Netty/Curator/Derby Update: Mapping updated.

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala

nikhil-zlai · 2025-04-08T04:37:12Z

cloud_gcp/BUILD.bazel

    maven_artifact("ch.qos.reload4j:reload4j"),
    maven_artifact("org.threeten:threetenbp"),
    maven_artifact("org.apache.kafka:kafka-clients"),
+    maven_artifact_with_suffix("org.apache.spark:spark-mllib"),


sadly required by spark bigquery connector to map types accordingly.

nikhil-zlai · 2025-04-08T04:37:52Z

...cp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala

-    if (provider.toUpperCase != "ICEBERG") {
-      throw new UnsupportedOperationException("Only creating iceberg tables supported.")
+    provider.toUpperCase match {
+      case "ICEBERG" => icebergCatalog.createTable(ident, schema, partitions, properties)


should we make these constants somewhere?

will do, probably a good time now

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala

nikhil-zlai · 2025-04-08T05:42:00Z

tools/build_rules/dependencies/maven_repository.bzl

        "org.apache.spark:spark-hive_2.12:3.5.3",
        "org.apache.spark:spark-streaming_2.12:3.5.3",
        "org.apache.spark:spark-avro_2.12:3.5.3",
+        "org.apache.spark:spark-mllib_2.12:3.5.3",


unfortunately required by the spark bigquery connector which we rely on to map between spark types and bigquery types.

nikhil-zlai

few comments but looks clean otherwise

Co-authored-by: Thomas Chow <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (3)
179-179: Consider extracting hardcoded provider names
+  private val ICEBERG_PROVIDER = "ICEBERG"
+  private val PARQUET_PROVIDER = "PARQUET"

-      case "ICEBERG" => icebergCatalog.createTable(ident, schema, partitions, properties)
+      case ICEBERG_PROVIDER => icebergCatalog.createTable(ident, schema, partitions, properties)
-      case "PARQUET" => {
+      case PARQUET_PROVIDER => {
196-211: Commented partitioning implementation

Assertion for identity partitioning is good, but the extensive commented code could be removed or implemented fully.

Consider either removing the commented code or implementing it if it's needed soon.

213-217: Consider reusing toTableId implementation

This block duplicates logic from SparkBQUtils.toTableId with shadedTableId.
-        val shadedTableId = scala
-          .Option(tableId.getProject)
-          .map(TableId.of(_, tableId.getDataset, tableId.getTable))
-          .getOrElse(TableId.of(tableId.getDataset, tableId.getTable))
+        // tableId is already correctly formatted by SparkBQUtils.toTableId
+        val shadedTableId = tableId

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between f4c6c65 and 6811933.

📒 Files selected for processing (2)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (4 hunks)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (2)

Learnt from: tchow-zlai
PR: zipline-ai/chronon#263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:29-60
Timestamp: 2025-04-02T19:05:37.870Z
Learning: In BigQuery integration, table existence check is performed outside the BigQueryFormat.createTable method, at a higher level in TableUtils.createTable.

Learnt from: tchow-zlai
PR: zipline-ai/chronon#263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-04-02T19:05:37.870Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.

🧬 Code Definitions (1)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (1)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (4)

SparkBQUtils (7-30)

toTableString (9-11)

toTableId (12-15)

toTableId (17-20)

⏰ Context from checks skipped due to timeout of 90000ms (18)

GitHub Check: streaming_tests
GitHub Check: join_tests
GitHub Check: groupby_tests
GitHub Check: fetcher_tests
GitHub Check: scala_compile_fmt_fix
GitHub Check: batch_tests
GitHub Check: streaming_tests
GitHub Check: non_spark_tests
GitHub Check: analyzer_tests
GitHub Check: join_tests
GitHub Check: spark_tests
GitHub Check: bazel_config_tests
GitHub Check: groupby_tests
GitHub Check: analyzer_tests
GitHub Check: spark_tests
GitHub Check: batch_tests
GitHub Check: non_spark_tests
GitHub Check: fetcher_tests

🔇 Additional comments (8)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (4)

42-43: Added optional schema parameter to DelegatingTable

Good addition for supporting custom schemas when creating Parquet tables.

50-50: Schema override implementation

Correctly updates schema() to use the optionally provided schema before falling back to internal table's schema.

178-180: Pattern matching for table provider

Clean implementation to support multiple table providers. Proper use of toUpperCase for case-insensitive comparison.

180-230: Parquet table creation implementation

Complete implementation for creating external Parquet tables in BigQuery. Code handles:

Location URI construction

Schema conversion

Table creation in BigQuery

Returning appropriate table representation

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (4)

9-11: New toTableString method for Identifiers

Clean implementation that joins namespace and name with dots.

14-15: Refactored parseIdentifier to use common logic

Good refactoring to use the new parseBQIdentifier method.

17-20: Added overloaded toTableId for Identifiers

Useful addition that supports direct conversion from Identifiers to TableId objects.

22-28: Extracted common parseBQIdentifier method

Good extraction of common logic to reduce code duplication.
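
The identifier helpers reviewed above can be approximated without the BigQuery client. In this sketch `TableRef` is a hypothetical stand-in for BigQuery's `TableId`, and `parse` is an assumed shape for the shared parsing logic, not the code of `parseBQIdentifier` itself.

```scala
object BqIdentifierSketch {
  // Hypothetical stand-in for TableId: the project part is optional.
  final case class TableRef(project: Option[String], dataset: String, table: String)

  // Join namespace parts and the table name with dots.
  def toTableString(namespace: Seq[String], name: String): String =
    (namespace :+ name).mkString(".")

  // Accept "project.dataset.table" or "dataset.table"; reject anything else.
  def parse(identifier: String): TableRef =
    identifier.split("\\.").toList match {
      case project :: dataset :: table :: Nil => TableRef(Some(project), dataset, table)
      case dataset :: table :: Nil            => TableRef(None, dataset, table)
      case _ =>
        throw new IllegalArgumentException(s"Cannot parse table identifier: $identifier")
    }
}
```

Routing both the string and the Identifier overloads through one parser keeps the two-part and three-part cases consistent.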

Co-authored-by: Thomas Chow <[email protected]>

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 6811933 and 70d5319.

📒 Files selected for processing (2)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (5 hunks)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala

🧰 Additional context used

🧬 Code Graph Analysis (1)

cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (1)

spark/src/main/scala/ai/chronon/spark/TableUtils.scala (4)

tableReachable (105-116)

createTable (210-236)

insertPartitions (238-290)

loadTable (118-120)

⏰ Context from checks skipped due to timeout of 90000ms (18)

GitHub Check: streaming_tests
GitHub Check: join_tests
GitHub Check: streaming_tests
GitHub Check: groupby_tests
GitHub Check: groupby_tests
GitHub Check: analyzer_tests
GitHub Check: analyzer_tests
GitHub Check: spark_tests
GitHub Check: spark_tests
GitHub Check: fetcher_tests
GitHub Check: non_spark_tests
GitHub Check: join_tests
GitHub Check: batch_tests
GitHub Check: non_spark_tests
GitHub Check: fetcher_tests
GitHub Check: bazel_config_tests
GitHub Check: scala_compile_fmt_fix
GitHub Check: batch_tests

🔇 Additional comments (2)

cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (2)

43-50: Iceberg catalog configuration activated

These configuration settings enable the Iceberg catalog with BigQuery as the metastore.

54-55: Supporting Parquet table write format

New configuration properties for specifying Parquet as write format and setting warehouse location.

coderabbitai · 2025-04-10T00:30:18Z

cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala

+  it should "create external parquet table" in {
+    val externalTable = "default_iceberg.data.tchow_external_parquet"
+
+    val testDf = spark.createDataFrame(Seq((1, "2021-01-01"))).toDF("id", "ds")
+    if (!tableUtils.tableReachable(externalTable)) {
+      tableUtils.createTable(testDf, externalTable, List("ds"), Map.empty[String, String], "PARQUET")
+    }
+    tableUtils.insertPartitions(testDf, externalTable, Map.empty[String, String], List("ds"))
+    val roundTripped = tableUtils.loadTable(externalTable)
+    println(roundTripped)
+  }


🛠️ Refactor suggestion

Add assertions to test case

The test creates and populates a Parquet table but only prints the round-tripped DataFrame; it never asserts on the contents.

     tableUtils.insertPartitions(testDf, externalTable, Map.empty[String, String], List("ds"))
     val roundTripped = tableUtils.loadTable(externalTable)
-    println(roundTripped)
+    val result = roundTripped.collect()
+    assertEquals(1, result.length)
+    assertEquals(1, result(0).getAs[Int]("id"))
+    assertEquals("2021-01-01", result(0).getAs[String]("ds"))

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-  it should "create external parquet table" in {
-    val externalTable = "default_iceberg.data.tchow_external_parquet"
-
-    val testDf = spark.createDataFrame(Seq((1, "2021-01-01"))).toDF("id", "ds")
-    if (!tableUtils.tableReachable(externalTable)) {
-      tableUtils.createTable(testDf, externalTable, List("ds"), Map.empty[String, String], "PARQUET")
-    }
-    tableUtils.insertPartitions(testDf, externalTable, Map.empty[String, String], List("ds"))
-    val roundTripped = tableUtils.loadTable(externalTable)
-    println(roundTripped)
-  }
+  it should "create external parquet table" in {
+    val externalTable = "default_iceberg.data.tchow_external_parquet"
+
+    val testDf = spark.createDataFrame(Seq((1, "2021-01-01"))).toDF("id", "ds")
+    if (!tableUtils.tableReachable(externalTable)) {
+      tableUtils.createTable(testDf, externalTable, List("ds"), Map.empty[String, String], "PARQUET")
+    }
+    tableUtils.insertPartitions(testDf, externalTable, Map.empty[String, String], List("ds"))
+    val roundTripped = tableUtils.loadTable(externalTable)
+    val result = roundTripped.collect()
+    assertEquals(1, result.length)
+    assertEquals(1, result(0).getAs[Int]("id"))
+    assertEquals("2021-01-01", result(0).getAs[String]("ds"))
+  }

coderabbitai bot reviewed Apr 7, 2025

coderabbitai bot reviewed Apr 8, 2025

spark/src/main/scala/ai/chronon/spark/Driver.scala (outdated, resolved)

nikhil-zlai reviewed Apr 8, 2025

tchow-zlai force-pushed the tchow/vanilla-parquet branch from 2b4ecd9 to cdd5a1e on April 8, 2025 at 02:53

comment out code

f4c6c65

Co-authored-by: Thomas Chow <[email protected]>

tchow-zlai force-pushed the tchow/vanilla-parquet branch from cdd5a1e to f4c6c65 on April 8, 2025 at 02:56

coderabbitai bot reviewed Apr 8, 2025

cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (resolved)

nikhil-zlai reviewed Apr 8, 2025

nikhil-zlai approved these changes Apr 8, 2025

update

6811933

Co-authored-by: Thomas Chow <[email protected]>

coderabbitai bot reviewed Apr 8, 2025

wip

70d5319

Co-authored-by: Thomas Chow <[email protected]>

coderabbitai bot reviewed Apr 10, 2025

tchow-zlai closed this Apr 12, 2025

tchow-zlai deleted the tchow/vanilla-parquet branch April 12, 2025 16:17
