Add Hudi format #496
Conversation
Walkthrough
The pull request introduces non-functional changes across various modules. It adds a comment noting a potential partition issue in the Python API, updates a JSON customer value from "canary" to "dev", and expands AWS integration mappings for "dev". The changes also reorganize imports and formatting for readability, add a utility method for Hive partition parsing, remove a redundant local method, update Hudi-Spark configurations, extend Kryo registrations, and introduce a new test class for Hudi table operations.
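As a rough illustration of the partition-parsing piece mentioned above, here is a minimal Scala sketch of a Hive-style partition parser; the object and method names are hypothetical and not necessarily the PR's actual API.

```scala
// Minimal sketch, assuming Hive-style partition strings such as "ds=2025-03-12/hr=00".
// `HivePartitionParser.parseHivePartition` is a hypothetical name, not the PR's method.
object HivePartitionParser {
  def parseHivePartition(partitionSpec: String): Map[String, String] =
    partitionSpec
      .split("/")
      .filter(_.nonEmpty)
      .map { kv =>
        val Array(key, value) = kv.split("=", 2)
        key -> value
      }
      .toMap
}

// Example: HivePartitionParser.parseHivePartition("ds=2025-03-12/hr=00")
//   => Map("ds" -> "2025-03-12", "hr" -> "00")
```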
Sequence Diagram(s)
sequenceDiagram
participant T as HudiTableUtilsTest
participant S as SparkSession
participant TU as TableUtils
participant H as Hudi Framework
T->>S: Initialize Spark session with Hudi configs
T->>TU: Create Hudi table with DataFrame
TU->>H: Write data in Hudi format
H-->>TU: Confirm table creation
TU->>T: Verify table exists in Spark catalog
case Success(isHudi) =>
  logger.info(s"Hudi check: Successfully read the format of table: $tableName as Hudi")
  isHudi
Worked here:
2025/03/12 19:17:14 INFO DefaultFormatProvider.scala:40 - Hudi check: Successfully read the format of table: data.plaid_raw as Hudi
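For context on the snippet above, a minimal sketch of how such a Hudi check could be implemented; the helper name `isHudiTable` and the DESCRIBE-based lookup are assumptions, not the PR's exact DefaultFormatProvider code.

```scala
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.SparkSession

// Sketch only: reads the table's provider from DESCRIBE FORMATTED output and
// checks whether it is "hudi". `isHudiTable` is a hypothetical helper.
def isHudiTable(spark: SparkSession, tableName: String): Boolean =
  Try {
    spark
      .sql(s"DESCRIBE FORMATTED $tableName")
      .where("col_name = 'Provider'")
      .select("data_type")
      .head()
      .getString(0)
      .equalsIgnoreCase("hudi")
  } match {
    case Success(isHudi) => isHudi
    case Failure(_)      => false
  }
```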
// Customer specific infra configurations
private val CustomerToSubnetIdMap = Map(
-   "canary" -> "subnet-085b2af531b50db44"
+   "canary" -> "subnet-085b2af531b50db44",
Are we going to need one for Plaid?
Mmm, yes we will. @chewy-zlai, we don't know these because it's on their account, right?
Is this something we can thread through from teams.json? Or, in general, can the cluster configuration be passed in from teams.json?
It can be threaded through.
Oh, yeah. The subnet is going to be something we have to get from Daniel, as they aren't using the VPC we wanted to set up.
    main_class=main_class,
)
- + f" --additional-conf-path=additional-confs.yaml --files={s3_file_args}"
+ + f" --additional-conf-path={EMR_MOUNT_FILE_PREFIX}additional-confs.yaml --files={s3_file_args}"
Doing this for now as a fix. cc @chewy-zlai, this is what you ran into earlier.
In a follow-up PR, I'm going to move this over to spark.files (sketched below).
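As a hedged sketch of that follow-up idea (shipping the additional confs via the spark.files configuration instead of the --files CLI flag), something like the following could work; the app name and S3 path are placeholders, not the job's real values.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: spark.files distributes the listed files to each executor's
// working directory. The path and app name below are placeholders.
val spark = SparkSession
  .builder()
  .appName("chronon-emr-job") // placeholder app name
  .config("spark.files", "s3://<bucket>/confs/additional-confs.yaml") // placeholder path
  .getOrCreate()
```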
| "mode": "backfill", | ||
| "dataproc": False, | ||
| "ds": today, | ||
| "ds": today, # TODO: this breaks if the partition column is not the same as yyyy-MM-dd. |
You mean if the format is not yyyy-MM-dd?
Yeah. Plaid's date format was yyyyMMdd.
When I set the backfill start date to something like 20250216, the ds here was set to today's date as 2025-02-12, so the formats were inconsistent.
Oh, is this something that's controlled with the Spark config?
sparkSession.conf.get("spark.chronon.partition.format", "yyyy-MM-dd")
Or does this happen even earlier, during compilation?
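To illustrate the conf-based approach being discussed, a minimal sketch of rendering ds from that setting; the helper name `renderDs` is hypothetical.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.SparkSession

// Sketch only: reads the partition format from the Spark conf shown above and
// renders a date in that format. `renderDs` is a hypothetical helper.
def renderDs(spark: SparkSession, date: LocalDate = LocalDate.now()): String = {
  val pattern = spark.conf.get("spark.chronon.partition.format", "yyyy-MM-dd")
  date.format(DateTimeFormatter.ofPattern(pattern))
}

// e.g. with spark.chronon.partition.format=yyyyMMdd, renderDs would yield "20250312"
```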
import org.apache.spark.sql.SparkSession

case object Hudi extends Format {
I'm actually not sure if we need a new format for Hudi at all; I think we can just use Hive.
  .getString(1)
assertEquals("hudi", provider)

tableUtils.insertPartitions(sourceDF, tableName)
Fails for me here.
Actionable comments posted: 2
🧹 Nitpick comments (1)
cloud_aws/src/main/resources/hudi_spark_confs.yaml (1)
1-3: Add newline at end of file. Missing newline character at end of file per YAML linting rules.
spark.sql.catalog.spark_catalog: "org.apache.spark.sql.hudi.catalog.HoodieCatalog"
spark.sql.extensions: "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
spark.chronon.table_write.format: "hudi"
🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 3-3: no new line character at the end of file
(new-line-at-end-of-file)
📜 Review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (4)
- cloud_aws/src/main/resources/hudi_spark_confs.yaml (1 hunks)
- cloud_aws/src/test/scala/ai/chronon/integrations/aws/HudiTableUtilsTest.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (2 hunks)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala
🧰 Additional context used
🧠 Learnings (1)
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (1)
Learnt from: nikhil-zlai
PR: zipline-ai/chronon#51
File: spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala:192-200
Timestamp: 2025-03-12T15:28:06.350Z
Learning: Only suggest registering Delta Lake action classes for serialization if they are actually used in the codebase.
🪛 YAMLlint (1.35.1)
cloud_aws/src/main/resources/hudi_spark_confs.yaml
[error] 3-3: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (2)
spark/src/main/scala/ai/chronon/spark/ChrononKryoRegistrator.scala (2)
93-93: Appropriate addition for Kryo serialization. EmptyList registration helps shuffle operations.
215-216: Good Hudi integration enhancement. Required for proper serialization of Hudi payload classes.
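As a hedged illustration of what registering Hudi payload classes with Kryo can look like; the exact class list registered by the PR is an assumption here, not taken from the diff.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Sketch only: registers a couple of common Hudi model classes with Kryo.
// The class names below are assumptions about what the PR registers.
class HudiKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    Seq(
      "org.apache.hudi.common.model.HoodieKey",
      "org.apache.hudi.common.model.DefaultHoodieRecordPayload"
    ).foreach(className => kryo.register(Class.forName(className)))
  }
}
```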
val back = tableUtils.loadTable(tableName)
val backSet = back.select("id").as[Int].collect().toSet
// assertEquals(input, backSet)
Fix type casting issue in test.
String values ("a"-"d") can't be cast to Int.
val back = tableUtils.loadTable(tableName)
- val backSet = back.select("id").as[Int].collect().toSet
+ val backSet = back.select("id").as[String].collect().toSet
  // assertEquals(input, backSet)
val backSet = back.select("id").as[Int].collect().toSet
// assertEquals(input, backSet)
🛠️ Refactor suggestion
Uncomment or remove assertion.
Incomplete test - assertion is commented out.
val backSet = back.select("id").as[String].collect().toSet
- // assertEquals(input, backSet)
+ val expectedSet = Set("a", "b", "c", "d")
+ assertEquals(expectedSet, backSet)
Actionable comments posted: 0
🧹 Nitpick comments (2)
cloud_aws/src/test/scala/ai/chronon/integrations/aws/HudiTableUtilsTest.scala (2)
33-37: Consider diversifying test partition data. All records use the same date; add multiple dates to properly test partitioning.
val source = Seq(
  ("a", "2025-03-12"),
  ("b", "2025-03-12"),
  ("c", "2025-03-12"),
  ("d", "2025-03-12")
)
26-58: Consider adding update/delete tests. The current test only covers basic create/read, while Hudi's key features include update and delete operations (see the sketch below).
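If update/delete coverage is added, a minimal sketch of a Hudi upsert followed by a delete could look like this; the table name, key/precombine fields, and base path are placeholders, not the test's actual values.

```scala
import org.apache.spark.sql.SaveMode

// Sketch only: assumes an existing SparkSession named `spark` configured with the
// Hudi catalog and extensions. Names and paths below are placeholders.
val hudiOpts = Map(
  "hoodie.table.name"                           -> "test_hudi_table",
  "hoodie.datasource.write.recordkey.field"     -> "id",
  "hoodie.datasource.write.partitionpath.field" -> "ds",
  "hoodie.datasource.write.precombine.field"    -> "ds"
)

// Upsert: rewrite the row with id = "a" into a new partition.
val updated = spark.createDataFrame(Seq(("a", "2025-03-13"))).toDF("id", "ds")
updated.write.format("hudi")
  .options(hudiOpts + ("hoodie.datasource.write.operation" -> "upsert"))
  .mode(SaveMode.Append)
  .save("/tmp/test_hudi_table") // placeholder base path

// Delete: remove the row with id = "b".
val toDelete = spark.createDataFrame(Seq(("b", "2025-03-12"))).toDF("id", "ds")
toDelete.write.format("hudi")
  .options(hudiOpts + ("hoodie.datasource.write.operation" -> "delete"))
  .mode(SaveMode.Append)
  .save("/tmp/test_hudi_table")
```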
📜 Review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (2)
- cloud_aws/src/test/scala/ai/chronon/integrations/aws/HudiTableUtilsTest.scala (1 hunks)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- spark/src/main/scala/ai/chronon/spark/TableUtils.scala
🔇 Additional comments (6)
cloud_aws/src/test/scala/ai/chronon/integrations/aws/HudiTableUtilsTest.scala (6)
11-24: Solid test setup. Properly configures the Hudi catalog, extensions, and Kryo registrator.
40-40: Good partitioning setup. Correctly creates a Hudi table with PARQUET format and date-based partitioning (see the DDL sketch after this list).
41-48: Good provider verification. Properly verifies that the table exists and uses the Hudi provider.
50-50: Previously failing section now fixed. This line previously had issues per david-zlai's comment and is now properly implemented.
53-54: Fixed type issue and uncommented assertion. Correctly collects both columns as a (String, String) tuple and verifies against the source data.
55-57: Good cleanup practice. Properly drops the test table in a finally block.
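For illustration, creating a partitioned Hudi table of this shape via Spark SQL could look like the following sketch; the database, table, and column names are placeholders rather than the test's exact DDL.

```scala
// Sketch only: placeholder names; assumes `spark` has the Hudi catalog and
// extensions configured so that `USING hudi` resolves.
spark.sql(
  """CREATE TABLE IF NOT EXISTS test_db.test_hudi_table (
    |  id STRING,
    |  ds STRING
    |) USING hudi
    |PARTITIONED BY (ds)
    |""".stripMargin)
```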
local = true,
additionalConfig = Some(
  Map(
    "spark.sql.catalog.spark_catalog" -> "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
Let's change this to a different catalog, default_hudi, and set that as the default catalog:
"spark.sql.defaultCatalog" -> "default_hudi",
"spark.sql.catalog.default_hudi" -> "org.apache.spark.sql.hudi.catalog.HoodieCatalog"
OK, but what's the reason?
Going to merge and put up a new PR.
## Summary
^^^
Tested on AWS. see below:

## Checklist
- [ ] Added Unit Tests
- [ ] Covered by existing CI
- [ ] Integration tested
- [ ] Documentation update

## Summary by CodeRabbit
- **New Features**
  - Introduced a utility for parsing Hive partition strings, streamlining partition management.
  - Enhanced Apache Hudi integration in Spark SQL with updated configurations for catalog support and table write format.
  - Expanded AWS integration settings to support additional development environments, enabling broader network configuration options.
- **Refactor**
  - Improved underlying serialization handling to bolster data processing reliability.