feat: Custom StagingQuery to write parquet #604
Walkthrough
The changes update table creation logic across Spark and Cloud GCP components. In Spark modules, a new private variable and an optional SQL location parameter are introduced to enhance table creation. The Cloud GCP integrations now support additional table providers via pattern matching in the createTable method and improve table identifier handling. In addition, dependency management is expanded with new Maven artifacts and updated build scripts, and a test case for external Parquet tables has been added.
Sequence Diagram(s)
sequenceDiagram
participant Caller
participant Catalog as DelegatingBigQueryMetastoreCatalog
participant Iceberg as icebergCatalog
participant BigQuery as bigQueryClient
Caller->>Catalog: createTable(ident, schema, partitions, properties)
alt Provider == ICEBERG
Catalog->>Iceberg: createTable(...)
else Provider == PARQUET
Catalog->>Catalog: Build external table definition\nand validate partitioning
Catalog->>BigQuery: create(table)
else
Catalog-->>Caller: Throw UnsupportedOperationException
end
Actionable comments posted: 0
🧹 Nitpick comments (5)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQueryTest.scala (1)
14-14: Tests are well-structured.
Coverage looks good. Consider adding a case for invalid GCS path schemes.
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQuery.scala (4)
20-22: Optional improvement for configuration.
Consider allowing the bqOptions to be injected for flexibility in testing.
28-45: Use an exception instead of assertion.
Assertions can be disabled in production. Prefer IllegalArgumentException.
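To illustrate the point about assertions: in Scala, `assert` can be elided at compile time (for example with `-Xdisable-assertions`), while `require` always runs and throws IllegalArgumentException. The sketch below is not taken from the PR; `PathValidation` and `validateGcsPath` are hypothetical names, and the `gs://` check is an assumption about what the validation might cover.

```scala
// Hypothetical helper, not part of GCPStagingQuery: validates an output
// location with `require` so the check cannot be compiled out.
object PathValidation {
  def validateGcsPath(path: String): String = {
    // `require` throws IllegalArgumentException("requirement failed: ...")
    // when the condition is false, regardless of assertion settings.
    require(path != null && path.startsWith("gs://"), s"Expected a gs:// location, got: $path")
    // Drop one trailing slash so both path forms resolve to the same prefix.
    path.stripSuffix("/")
  }
}
```

A check written this way would also give the suggested "invalid GCS path scheme" test case something concrete to assert on.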
58-79: Avoid duplicated Parquet save logic.
Unify unpartitioned/partitioned writes in a helper to reduce redundancy.
81-101: Add a simple log entry.
Helps visibility when creating the table in BigQuery.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro (Legacy)
📒 Files selected for processing (2)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQuery.scala (1 hunks)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQueryTest.scala (1 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQuery.scala (3)
api/src/main/scala/ai/chronon/api/ScalaJavaConversions.scala (2)
ScalaJavaConversions (6-97), JListOps (70-78)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1)
sql (298-326)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (2)
SparkBQUtils (6-17), toTableId (8-15)
🔇 Additional comments (2)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/GCPStagingQueryTest.scala (2)
30-35: Looks correct.
Checks PARQUET setup and asserts correctly.
57-68: Good coverage of trailing slash scenarios.
Both path forms are validated.
Actionable comments posted: 1
📒 Files selected for processing (1)
spark/src/main/scala/ai/chronon/spark/Driver.scala(2 hunks)
🔇 Additional comments (1)
spark/src/main/scala/ai/chronon/spark/Driver.scala (1)
480-484: Good use of conditional instantiation for GCP integration.
The conditional logic creates appropriate query instances based on environment.
2b4ecd9 to cdd5a1e
Co-authored-by: Thomas Chow <[email protected]>
cdd5a1e to f4c6c65
Actionable comments posted: 0
🧹 Nitpick comments (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (1)
174-227: ❓ Verification inconclusive
New PARQUET path creation.
Consider verifying presence of underlying data or gracefully handling missing files.
🏁 Script executed:
#!/bin/bash
# Checking for *.parquet files in the repo
fd -e parquet
Length of output: 15
PARQUET Data Check Note:
- The PARQUET path doesn’t verify that .parquet files exist at the specified URI.
- Our check (using fd -e parquet) returned no results; if data is expected externally, ensure this case is handled gracefully.
- Please manually verify that missing file scenarios are appropriately addressed.
📒 Files selected for processing (8)
cloud_gcp/BUILD.bazel (1 hunks)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (2 hunks)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (1 hunks)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (2 hunks)
maven_install.json (53 hunks)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2 hunks)
spark/src/main/scala/ai/chronon/spark/format/CreationUtils.scala (2 hunks)
tools/build_rules/dependencies/maven_repository.bzl (1 hunks)
🧰 Additional context used
🧬 Code Definitions (2)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (2)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (2)
createTable (169-229), loadTable (120-167)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (3)
createTable (210-236), insertPartitions (238-290), loadTable (118-120)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (4)
SparkBQUtils (7-30), toTableString (9-11), toTableId (12-15), toTableId (17-20)
🔇 Additional comments (71)
cloud_gcp/BUILD.bazel (1)
36-36: Added Spark MLlib dependency
tools/build_rules/dependencies/maven_repository.bzl (2)
194-194: Added Spark MLlib for Scala 2.12
201-201: Added Spark MLlib for Scala 2.13
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (3)
43-52: Uncommented Iceberg catalog configuration
These settings enable the Iceberg catalog integration with BigQuery, providing necessary configuration for warehouse location and GCP project details.
54-55: Added table write configuration
Configuration settings for Parquet format and warehouse location that support the custom StagingQuery functionality.
118-126: Added test for external Parquet table creation
Test validates the core functionality of creating an external Parquet table with proper partitioning.
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2)
79-79: Added tableWriteWarehouse configuration variable
219-226: Updated createTableSql call to include warehouse location
The method now passes tableWriteWarehouse to CreationUtils.createTableSql, enabling custom location for Parquet tables.
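As a rough illustration of how an optional warehouse location can flow into the generated DDL, here is a minimal sketch. It assumes a simplified, hypothetical signature: the real CreationUtils.createTableSql is not shown in this thread and likely takes different parameters. The object name, the (name, type) column tuples, and the example bucket are made up for illustration.

```scala
// Hypothetical, simplified stand-in for CreationUtils.createTableSql.
object CreateTableSqlSketch {
  def createTableSql(tableName: String,
                     columns: Seq[(String, String)], // (column name, SQL type)
                     partitionColumns: Seq[String],
                     fileFormat: String,
                     location: Option[String] = None): String = {
    def render(cols: Seq[(String, String)]): String =
      cols.map { case (name, dataType) => s"`$name` $dataType" }.mkString(", ")

    // Partition columns are declared in PARTITIONED BY, not in the main column list.
    val (partCols, dataCols) = columns.partition { case (name, _) => partitionColumns.contains(name) }

    val clauses = Seq(
      Some(s"CREATE TABLE $tableName (${render(dataCols)})"),
      if (partCols.nonEmpty) Some(s"PARTITIONED BY (${render(partCols)})") else None,
      Some(s"STORED AS $fileFormat"),
      // The LOCATION clause is emitted only when a location is supplied.
      location.map(loc => s"LOCATION '$loc'")
    ).flatten

    clauses.mkString(" ")
  }
}
```

Modelling the location as an `Option[String]` keeps the default behaviour unchanged for callers that do not set a warehouse.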
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (5)
5-5: Import is fine.
9-11: Confirm backslash usage.
Potentially use "." to separate namespaces if the backslash is unintentional.
14-15: Good unification of parsing logic.
17-20: Consistent method for Identifier → TableId.
22-23: All clear here.
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (1)
3-4: Imports look good.
Also applies to: 8-9, 12-13, 15-15, 18-18
spark/src/main/scala/ai/chronon/spark/format/CreationUtils.scala (2)
14-15: Location param is handy.
31-31: Conditional LOCATION usage.
Looks concise and correct.
maven_install.json (55)
3-4: Hashes Updated. Confirm new __INPUT_ARTIFACTS_HASH and __RESOLVED_ARTIFACTS_HASH.
434-440: New Artifact Added. "com.github.wendykierp:JTransforms" v3.1 with its shas.
1225-1231: New Artifact Added. "com.sun.istack:istack-commons-runtime" v3.0.8 with updated shas.
1435-1455: New Netlib Artifacts. "arpack", "blas", "lapack" v3.0.3 added.
2558-2564: New Artifact Added. "net.sourceforge.f2j:arpack_combined_all" v0.1.
3479-3492: New Spark GraphX Artifacts. Both Scala 2.12 and 2.13 set to v3.5.3.
3535-3565: New Spark MLlib & Network. Added Spark MLlib (2.12/2.13) and network-common, all v3.5.3.
4039-4045: Update JAXB Runtime. "jaxb-runtime" now at v2.3.2 with new shas.
4515-4545: New Breeze & Scalatest Artifacts. Breeze macros/core v2.1.0 and scalatest-compatible added.
4837-4850: Typelevel Algebra Versions. Updated to 2.0.1 (2.12) & 2.8.0 (2.13).
4893-4951: New Spire & Snappy. Added spire modules (varying versions) and snappy-java update.
4970-4976: New Artifact Added. "pl.edu.icm:JLargeArrays" v1.5 added.
4977-4979: Dependency Note. Verify "ru.vyarus:generics-resolver" version consistency.
5450-5453: Dep Mapping. "com.github.wendykierp:JTransforms" now maps to commons-math3 and JLargeArrays.
6545-6553: Dep Map. Netlib artifacts now depend on "net.sourceforge.f2j:arpack_combined_all".
8084-8099: Dep Map Update. Spark GraphX now lists required dependencies.
8169-8213: MLlib Dep Map. Updated dependencies for Spark MLlib (Scala 2.12 & 2.13).
8455-8458: JAXB Dep Update. "jaxb-runtime" now includes istack-commons-runtime and jakarta.xml.bind-api.
8603-8632: Breeze Dep Map. Updated dependencies for "breeze_2.12" and "breeze_2.13".
8777-8833: Typelevel Dep Map. Revised mappings for algebra, spire, and cats modules.
9892-9898: JTransforms Dep Map. Now includes DCT, DHT, DST, FFT, and utils.
11351-11355: Dep Map Update. "istack-commons-runtime" now lists additional modules.
11687-11695: Netlib Dep Map Standardized. Now uses dot notation for arpack, blas, and lapack.
13186-13192: Arpack_Combined_All Map. Now includes netlib.arpack, blas, err, lapack, and util.
20891-20906: Spark GraphX Dep Map. Expanded module mapping for GraphX.
20955-21123: Spark MLlib Overhaul. Extensive dependency mapping; please verify correctness.
22421-22447: JAXB Expansion. Full dependency list added for "jaxb-runtime".
23153-23230: Breeze Detailed. Comprehensive dependency list for Breeze (both Scala versions).
23853-23900: Typelevel Algebra Map. Detailed list for Scala 2.12 and 2.13.
24005-24069: Spire Map Update. Comprehensive mapping for all spire modules.
24108-24110: JLargeArrays Map. Now maps to "pl.edu.icm.jlargearrays".
24584-24585: JTransforms Dep Map. Jar sources mapping added.
24809-24810: istack Commons Mapping. Jar sources mapping added.
24869-24874: Netlib Map Update. Jar sources now mapped for arpack, blas, and lapack.
25194-25194: Arpack_Combined_All. Jar sources mapping updated.
25446-25449: Spark GraphX Mapping. Jar sources added for GraphX.
25462-25469: MLlib Jar Sources. Updated jar sources for Spark MLlib.
25604-25605: JAXB Runtime Mapping. Jar sources added.
25740-25747: Breeze Jar Sources. Jar sources added for Breeze libraries.
25832-25835: Algebra Jar Sources. Jar sources mapping updated.
25848-25871: Comprehensive Jar Mapping. Spire, Snakeyaml, JLargeArrays, and related deps updated.
26063-26064: JTransforms Mapping. Jar sources updated.
26288-26289: istack Mapping. Jar sources confirmed.
26348-26353: Netlib Update. Jar sources added for arpack, blas, and lapack.
26673-26673: Arpack_Combined_All. Jar sources mapping updated.
26925-26928: Spark GraphX Sources. Jar sources mapping updated.
26941-26948: MLlib Sources. Jar sources updated.
27083-27084: JAXB Runtime. Jar source mapping updated.
27219-27226: Breeze Jar Sources. Updated mapping for macros and core.
27311-27314: Algebra Sources. Jar sources mapping updated.
27327-27342: Spire Jar Sources. Updated for all spire modules.
27349-27350: JLargeArrays Sources. Jar sources mapping added.
32187-32234: Spark MLlib Map. Updated mappings for MLFormatRegister and DataSourceRegister.
33995-34004: JAXB Context Mapping. Consistent mapping for JAXBContext.
34259-34259: Arpack_Combined_All. Jar sources mapping updated.
Actionable comments posted: 1
📒 Files selected for processing (8)
cloud_gcp/BUILD.bazel (1 hunks)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (2 hunks)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (1 hunks)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (2 hunks)
maven_install.json (53 hunks)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2 hunks)
spark/src/main/scala/ai/chronon/spark/format/CreationUtils.scala (2 hunks)
tools/build_rules/dependencies/maven_repository.bzl (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- cloud_gcp/BUILD.bazel
- tools/build_rules/dependencies/maven_repository.bzl
- cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala
- spark/src/main/scala/ai/chronon/spark/format/CreationUtils.scala
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala
🔇 Additional comments (56)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (3)
17-20: LGTM: New overloaded method enhances flexibility
Clean implementation that accepts Identifier directly.
22-28: LGTM: Good extraction of parsing logic
Improves code reuse and maintainability.
12-15: LGTM: Method update uses extracted helper
Good refactoring to use the new helper method.
maven_install.json (53)
1-7: Hashes Updated: Updated artifact hashes look good.
431-440: JTransforms Added: New artifact (v3.1) added.
1222-1231: istack Added: New istack-commons-runtime (v3.0.8) added.
1432-1455: Netlib Artifacts: Added arpack, blas, and lapack (v3.0.3).
2555-2564: arpack_combined_all Added: v0.1 added; sources is null – verify if expected.
3476-3493: Spark GraphX Added: New spark-graphx for both 2.12 and 2.13.
3532-3562: Spark MLlib Update: MLLib and local artifacts updated.
4036-4045: JAXB-runtime Update: Now using version 2.3.2.
4512-4540: Breeze Libraries: breeze-macros and breeze added.
4834-4850: Typelevel Algebra: Versions updated; verify cross-Scala consistency.
4890-4948: Spire Libraries: spire-macros, platform, util, and spire added.
4967-4976: JLargeArrays Added: New artifact (v1.5) added.
5447-5456: JTransforms Mapping: Now maps to commons-math3 and JLargeArrays.
6542-6556: Netlib Mapping Update: Dependencies for arpack, blas, and lapack now target arpack_combined_all.
8081-8099: GraphX Mapping: Dependency mapping for spark-graphx updated.
8166-8214: MLlib Mapping: Spark MLlib-local and MLlib dependencies updated.
8452-8461: Jersey/JAXB Mapping: Dependency mappings updated.
8600-8632: Breeze Mapping: Dependency arrays for breeze libraries updated.
8777-8807: Typelevel/Cats Mapping: Updated dependencies for typelevel algebra and cats-core.
9889-9901: Findbugs/JTransforms Mapping: Mapping updated.
11348-11358: istack Mapping: istack-commons-runtime mapping updated.
11684-11698: Netlib Mapping: arpack, blas, and lapack mapping updated.
13183-13193: f2j arpack Mapping: Dependency mapping updated.
20888-20909: GraphX Expanded: Additional spark-graphx util dependencies added.
20940-21009: MLlib Expanded: Extensive Spark ML mapping updated.
22418-22447: JAXB-runtime Expanded: More submodules added.
23150-23233: Breeze Mapping: Updated for Scala 2.12 and 2.13.
23850-23900: Algebra Mapping: Typelevel algebra dependencies updated.
24002-24020: Spire Mapping: spire-macros, platform, and util updated.
24105-24113: JLargeArrays Mapping: Dependency mapping revised.
24581-24588: JTransforms Mapping: Mapping verified.
24806-24813: Javapoet/istack Mapping: Updated.
24866-24877: Netlib/Dnsjava Mapping: Updated.
25191-25197: Opencsv/ST4 Mapping: Dependency mapping updated.
25443-25452: Spark Core/GraphX Mapping: Updated.
25459-25472: Spark Launcher/MLlib Mapping: Updated.
25601-25608: HK2/JAXB Mapping: Updated.
25737-25750: Scalactic/Breeze Mapping: Updated.
25829-25838: Algebra/Cats-Core Mapping: Updated.
25845-25867: Jawn/Spire Mapping: Updated.
26060-26067: JTransforms Duplicate? Check for redundant mapping.
26285-26292: Javapoet/istack Update: Mapping confirmed.
26345-26356: Netlib/Dnsjava Mapping: Still valid.
26670-26676: Opencsv/ST4 Update: Mapping updated.
26922-26931: Spark Core/GraphX Update: Verified.
26938-26951: Spark Launcher/MLlib Update: Verified.
27080-27087: HK2/JAXB Update: Mapping verified.
27216-27229: Scalactic/Breeze Update: Mapping verified.
27308-27317: Threetenbp/Algebra Update: Mapping verified.
27324-27353: Spire/Snappy Update: Mapping verified.
32184-32237: Spark MLFormat Mapping: Register mappings updated.
33992-34004: HK2/JAXB Update: Mapping verified.
34256-34262: Netty/Curator/Derby Update: Mapping updated.
maven_artifact("ch.qos.reload4j:reload4j"),
maven_artifact("org.threeten:threetenbp"),
maven_artifact("org.apache.kafka:kafka-clients"),
maven_artifact_with_suffix("org.apache.spark:spark-mllib"),
wait what!
sadly required by spark bigquery connector to map types accordingly.
- if (provider.toUpperCase != "ICEBERG") {
-   throw new UnsupportedOperationException("Only creating iceberg tables supported.")
+ provider.toUpperCase match {
+   case "ICEBERG" => icebergCatalog.createTable(ident, schema, partitions, properties)
should we make these constants somewhere?
will do, probably a good time now
"org.apache.spark:spark-hive_2.12:3.5.3",
"org.apache.spark:spark-streaming_2.12:3.5.3",
"org.apache.spark:spark-avro_2.12:3.5.3",
"org.apache.spark:spark-mllib_2.12:3.5.3",
why mllib?
unfortunately required by the spark bigquery connector which we rely on to map between spark types and bigquery types.
nikhil-zlai
left a comment
few comments but looks clean otherwise
Co-authored-by: Thomas Chow <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (3)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (3)
179-179: Consider extracting hardcoded provider names
+ private val ICEBERG_PROVIDER = "ICEBERG"
+ private val PARQUET_PROVIDER = "PARQUET"
- case "ICEBERG" => icebergCatalog.createTable(ident, schema, partitions, properties)
+ case ICEBERG_PROVIDER => icebergCatalog.createTable(ident, schema, partitions, properties)
- case "PARQUET" => {
+ case PARQUET_PROVIDER => {
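One detail worth keeping in mind when extracting such constants: Scala only treats an identifier in a `case` as a constant to compare against when it starts with an uppercase letter (or is wrapped in backticks); a lowercase name would instead bind a fresh variable and match everything. The sketch below shows the idea in isolation. `TableProviders` and `describe` are hypothetical names, and the returned strings merely stand in for the real catalog calls.

```scala
import java.util.Locale

// Hypothetical sketch: provider names as constants used in a pattern match.
object TableProviders {
  val Iceberg = "ICEBERG"
  val Parquet = "PARQUET"

  def describe(provider: String): String =
    // Locale.ROOT avoids locale-specific case mappings when normalizing.
    provider.toUpperCase(Locale.ROOT) match {
      // Uppercase-initial identifiers are matched by equality, not bound as variables.
      case Iceberg => "delegate to the Iceberg catalog"
      case Parquet => "create an external Parquet table in BigQuery"
      case other   => throw new UnsupportedOperationException(s"Unsupported table provider: $other")
    }
}
```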
196-211: Commented partitioning implementation
Assertion for identity partitioning is good, but the extensive commented code could be removed or implemented fully.
Consider either removing the commented code or implementing it if it's needed soon.
213-217: Consider reusing toTableId implementation
This block duplicates logic from SparkBQUtils.toTableId with shadedTableId.
- val shadedTableId = scala
-   .Option(tableId.getProject)
-   .map(TableId.of(_, tableId.getDataset, tableId.getTable))
-   .getOrElse(TableId.of(tableId.getDataset, tableId.getTable))
+ // tableId is already correctly formatted by SparkBQUtils.toTableId
+ val shadedTableId = tableId
📒 Files selected for processing (2)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (4 hunks)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (2)
Learnt from: tchow-zlai
PR: zipline-ai/chronon#263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:29-60
Timestamp: 2025-04-02T19:05:37.870Z
Learning: In BigQuery integration, table existence check is performed outside the BigQueryFormat.createTable method, at a higher level in TableUtils.createTable.
Learnt from: tchow-zlai
PR: zipline-ai/chronon#263
File: cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/BigQueryFormat.scala:56-57
Timestamp: 2025-04-02T19:05:37.870Z
Learning: For BigQuery table creation operations in BigQueryFormat.scala, allow exceptions to propagate directly without wrapping them in try-catch blocks, as the original BigQuery exceptions provide sufficient context.
🧬 Code Definitions (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (1)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (4)
SparkBQUtils(7-30)toTableString(9-11)toTableId(12-15)toTableId(17-20)
🔇 Additional comments (8)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (4)
42-43: Added optional schema parameter to DelegatingTable
Good addition for supporting custom schemas when creating Parquet tables.
50-50: Schema override implementation
Correctly updates schema() to use the optionally provided schema before falling back to internal table's schema.
178-180: Pattern matching for table provider
Clean implementation to support multiple table providers. Proper use of toUpperCase for case-insensitive comparison.
180-230: Parquet table creation implementation
Complete implementation for creating external Parquet tables in BigQuery. Code handles:
- Location URI construction
- Schema conversion
- Table creation in BigQuery
- Returning appropriate table representation
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/SparkBQUtils.scala (4)
9-11: New toTableString method for Identifiers
Clean implementation that joins namespace and name with dots.
14-15: Refactored parseIdentifier to use common logic
Good refactoring to use the new parseBQIdentifier method.
17-20: Added overloaded toTableId for Identifiers
Useful addition that supports direct conversion from Identifiers to TableId objects.
22-28: Extracted common parseBQIdentifier method
Good extraction of common logic to reduce code duplication.
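The identifier handling described above can be sketched without any Spark or BigQuery dependencies. This is only an approximation under stated assumptions: `SimpleIdentifier` is a stand-in for Spark's Identifier, the tuple returned by `parse` stands in for a BigQuery TableId, and the parsing rules (an optional project followed by dataset and table) are inferred from the review comments rather than copied from SparkBQUtils.

```scala
// Stand-in for Spark's Identifier (namespace parts plus a name); hypothetical.
final case class SimpleIdentifier(namespace: Seq[String], name: String)

object TableNameSketch {
  // Joins namespace parts and the table name with dots.
  def toTableString(ident: SimpleIdentifier): String =
    (ident.namespace :+ ident.name).mkString(".")

  // Splits "[project.]dataset.table" into (optional project, dataset, table).
  def parse(tableName: String): (Option[String], String, String) =
    tableName.split("\\.").toList match {
      case project :: dataset :: table :: Nil => (Some(project), dataset, table)
      case dataset :: table :: Nil            => (None, dataset, table)
      case _ =>
        throw new IllegalArgumentException(s"Expected [project.]dataset.table, got: $tableName")
    }

  // Overload for identifiers: render to a string first, then reuse the same parser.
  def parse(ident: SimpleIdentifier): (Option[String], String, String) =
    parse(toTableString(ident))
}
```

Routing both overloads through one parser is what keeps the string and Identifier entry points consistent.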
Co-authored-by: Thomas Chow <[email protected]>
Actionable comments posted: 1
📒 Files selected for processing (2)
cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala (5 hunks)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- cloud_gcp/src/main/scala/ai/chronon/integrations/cloud_gcp/DelegatingBigQueryMetastoreCatalog.scala
🧰 Additional context used
🧬 Code Graph Analysis (1)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (1)
spark/src/main/scala/ai/chronon/spark/TableUtils.scala (4)
tableReachable (105-116), createTable (210-236), insertPartitions (238-290), loadTable (118-120)
🔇 Additional comments (2)
cloud_gcp/src/test/scala/ai/chronon/integrations/cloud_gcp/BigQueryCatalogTest.scala (2)
43-50: Iceberg catalog configuration activated
These configuration settings enable the Iceberg catalog with BigQuery as the metastore.
54-55: Supporting Parquet table write format
New configuration properties for specifying Parquet as write format and setting warehouse location.
it should "create external parquet table" in {
  val externalTable = "default_iceberg.data.tchow_external_parquet"

  val testDf = spark.createDataFrame(Seq((1, "2021-01-01"))).toDF("id", "ds")
  if (!tableUtils.tableReachable(externalTable)) {
    tableUtils.createTable(testDf, externalTable, List("ds"), Map.empty[String, String], "PARQUET")
  }
  tableUtils.insertPartitions(testDf, externalTable, Map.empty[String, String], List("ds"))
  val roundTripped = tableUtils.loadTable(externalTable)
  println(roundTripped)
}
🛠️ Refactor suggestion
Add assertions to test case
The test creates and populates a Parquet table but doesn't verify results.
  tableUtils.insertPartitions(testDf, externalTable, Map.empty[String, String], List("ds"))
  val roundTripped = tableUtils.loadTable(externalTable)
- println(roundTripped)
+ val result = roundTripped.collect()
+ assertEquals(1, result.length)
+ assertEquals(1, result(0).getAs[Int]("id"))
+ assertEquals("2021-01-01", result(0).getAs[String]("ds"))
📝 Committable suggestion
it should "create external parquet table" in {
  val externalTable = "default_iceberg.data.tchow_external_parquet"
  val testDf = spark.createDataFrame(Seq((1, "2021-01-01"))).toDF("id", "ds")
  if (!tableUtils.tableReachable(externalTable)) {
    tableUtils.createTable(testDf, externalTable, List("ds"), Map.empty[String, String], "PARQUET")
  }
  tableUtils.insertPartitions(testDf, externalTable, Map.empty[String, String], List("ds"))
  val roundTripped = tableUtils.loadTable(externalTable)
  val result = roundTripped.collect()
  assertEquals(1, result.length)
  assertEquals(1, result(0).getAs[Int]("id"))
  assertEquals("2021-01-01", result(0).getAs[String]("ds"))
}