[HUDI-4142] [RFC-54] New Table APIs and streamline Hudi configs #5667
Conversation
cc @xushiyan
rfc/rfc-54/rfc-54.md (Outdated)
| update | Update rows in a Hudi table that match the given condition with the given update expression |
| drop | Drop the given Hudi table completely |
| truncate | Delete data from the given Hudi table without dropping it |
| restoreToTime | Restore Hudi table to the given older commit time |
It's more idiomatic to call it restoreTo, but if you want to specify the noun, "timestamp" is better than just "time".
+1 on restoreTo(), but can we avoid tying it to a timestamp? This can be a logical time too, per se.
Agree. That was the intention. I will make it clear.
| ------------- | ------------- |
| bootstrap | Create a Hudi table from the given parquet table |
| create | Create a Hudi table with the given configs if it does not exist. |
| update | Update rows in a Hudi table that match the given condition with the given update expression |
Here, will we make use of Spark SQL expressions, or build Hudi expressions and transform Spark SQL expressions into Hudi expressions?
Spark SQL expressions. Please check my comment below: #5667 (comment)
// update Hudi table, add 1 to colA for all records of the current year (2022)
hudiTable.update(
    functions.col("dateCol").gt("2021-12-31"), // filter condition
    functions.col("colA").plus(1)              // update expression
);
What's the execution engine behind these APIs? Do we have configurable/pluggable engine options?
rfc/rfc-54/rfc-54.md (Outdated)
// create Hudi table
HudiTable hudiTable = HudiTable.create(HoodieTableConfig.newBuilder()
Will the HudiTable be used by the driver only? Can we access the HudiTable in a distributed manner?
Can you please elaborate more on "distributed manner", perhaps with an example/use case? As such, HudiTable is just another client. Do you mean distributed in the sense that there can be multiple drivers, and how to keep the state of HudiTable consistent across them?
For example, we have HoodieTable in HoodieIOHandle; currently we can create many IOHandles in the workers. Also, for Flink, we create a writeClient with a HoodieTable object in each subtask (StreamWriteFunction) to do the write workload. Right now the way we use the write client and HoodieTable differs for each compute engine, so I was wondering if we can define an abstraction layer on the Hudi side to provide a unified entry point covering both driver-side write triggers (like Spark) and distributed worker-side write triggers (like Flink), and fully manage consistency there. That way we can adapt to new compute engines faster by reusing the existing pattern.
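To make the suggestion above concrete, here is a rough sketch of what such a unified entry point could look like. All names below are hypothetical and are not existing Hudi classes; it only illustrates the driver-side vs. worker-side split, not a proposed implementation.

```java
// Hypothetical sketch only: neither this interface nor these method names exist in Hudi today.
public interface UnifiedWriteEntryPoint<R> extends AutoCloseable {

  // Driver-side trigger (Spark-style): the engine hands the whole batch to Hudi,
  // which plans, distributes, and commits the write in one call.
  void writeBatch(String instantTime, R records);

  // Worker-side trigger (Flink-style): each parallel subtask opens its own handle
  // and pushes records; the coordinator commits once all handles have flushed.
  TaskWriteHandle openTaskHandle(String instantTime, int taskId);

  interface TaskWriteHandle extends AutoCloseable {
    void write(Object record);  // buffer or write a single record
    void flush();               // emit write statuses back to the coordinator
  }
}
```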
## Abstract

Users configure jobs to write Hudi tables and control the behaviour of their jobs at different levels such as table, write client, datasource, record
How do we use the HudiTable in the write client?
HudiTable is an alternative to using the write client directly. It is not meant to be used within the write client.
ok, sounds like the HudiTable is a new concept and not meant to replace the current HoodieTable. Did I understand correctly?
Yes, you got it.
I thought there is a naming convention in the community: the "hudi" prefix is for the project and its submodules, while "hoodie" is for classes. Maybe it is better not to break this rule and not use HudiTable as a class name?
Link to the hive sync refactor RFC.
Does Hudi plan to implement its own efficient data structure, like Spark's internal row and Flink's row data?
e.g. `spark.write.format("hudi").options(HoodieClusteringConfig.Builder().withXYZ().build())`
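Spelled out a bit more, a hedged sketch of what this proposal could look like in practice. This is not an existing API: `asOptions()` is a made-up bridge to the `Map<String, String>` that `DataFrameWriter.options()` accepts, and the clustering settings are only examples.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.hudi.config.HoodieClusteringConfig;

// Illustration of the proposed config streamlining: pass builder-produced
// configs to the datasource writer instead of loose string options.
void writeWithBuilderConfigs(Dataset<Row> df, String basePath) {
  df.write()
      .format("hudi")
      .options(HoodieClusteringConfig.newBuilder()
          .withInlineClustering(true)           // example clustering setting
          .withInlineClusteringNumCommits(4)    // example clustering setting
          .build()
          .asOptions())                         // hypothetical Map<String, String> bridge
      .mode(SaveMode.Append)
      .save(basePath);
}
```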
### Table APIs
This is a great idea! Many people used to other frameworks (like Delta Lake) would onboard easily. As a user, I just have one concern: are you planning on creating SDKs for other languages supported by Spark, especially Python? I'm asking because at my company we use Hudi successfully with PySpark (even though the Hudi project doesn't have a single line of Python) because of the way it works through configuration. I believe there are many other users who have successfully used Hudi with PySpark for that same reason, so I would think about that and maybe add that support to the roadmap.
This is a good call out. +1 for a Python client.
Happy you liked it, @vinothchandar. Would love to help if you guys need it. I would just need some initial direction :)
| Method Name | Description |
| ------------- | ------------- |
| bootstrap | Create a Hudi table from the given table in parquet and other supported formats |
| create | Create a Hudi table with the given configs if it does not exist. Returns an instance of `HudiTable` for the newly created or an existing Hudi table. |
| update | Update rows in a Hudi table that match the given condition with the given update expression |
| drop | Drop the given Hudi table completely |
| truncate | Delete data from the given Hudi table without dropping it |
| restoreTo | Restore Hudi table to the given older commit time or a logical time. |
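Put together, a usage sketch of these proposed methods might look like the following. The class and method signatures are extrapolated from the table above and the snippets quoted earlier in this thread, so the exact shapes (builder settings, the restoreTo argument) are illustrative rather than final.

```java
import org.apache.spark.sql.functions;

// Illustrative only: HudiTable is the API proposed by this RFC, not an existing class.
// Builder settings and the restoreTo argument are placeholders.
HudiTable hudiTable = HudiTable.create(HoodieTableConfig.newBuilder()
    .build());                                     // table path/name and other configs elided

hudiTable.update(
    functions.col("dateCol").gt("2021-12-31"),     // filter condition
    functions.col("colA").plus(1));                // update expression

hudiTable.restoreTo("20220101000000");             // an older commit time or a logical time

hudiTable.truncate();                              // delete data but keep the table
hudiTable.drop();                                  // drop the table completely
```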
Is this going to be a preferred API rather than Spark DataSource V2 APIs like writeTo(), toTable(), etc.?
Will this support Spark Structured Streaming read/write?
**Phase 1**

Spark will be the execution engine behind these APIs. We will use Spark SQL functions for update expressions.
Are you planning to add merge into, insert, and upsert operations in the future/next phases?
vinothchandar left a comment
High-level direction LGTM. What part of this is already going into 0.12?
Currently, users can create and update a Hudi table in three different ways: [Spark datasource](https://hudi.apache.org/docs/writing_data), [SQL](https://hudi.apache.org/docs/table_management) and [DeltaStreamer](https://hudi.apache.org/docs/hoodie_deltastreamer). Each one
But there is no DeltaStreamer anymore; it was renamed to just Streamer: https://hudi.apache.org/docs/hoodie_streaming_ingestion
Closing due to inactivity
What is the purpose of the pull request
RFC for new table APIs and config changes.
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.