[HUDI-4142] [RFC-54] New Table APIs and streamline Hudi configs #5667
Conversation
cc @xushiyan
rfc/rfc-54/rfc-54.md (Outdated)
| update | Update rows in a Hudi table that match the given condition with the given update expression |
| drop | Drop the given Hudi table completely |
| truncate | Delete data from the given Hudi table without dropping it |
| restoreToTime | Restore Hudi table to the given older commit time |
It's more idiomatic to call it restoreTo, but if you want to specify the noun, "timestamp" is better than just "time".
+1 on restoreTo(), but can we avoid tying it to a timestamp? This can be a logical time too, per se.
Agree. That was the intention. I will make it clear.
| ------------- | ------------- |
| bootstrap | Create a Hudi table from the given parquet table |
| create | Create a Hudi table with the given configs if it does not exist. |
| update | Update rows in a Hudi table that match the given condition with the given update expression |
Here, will we make use of Spark SQL expressions, or build Hudi expressions and transform Spark SQL expressions into Hudi expressions?
Spark SQL expressions. Please check my comment below: #5667 (comment)
// update Hudi table, add 1 to colA for all records of the current year (2022)
hudiTable.update(
    functions.col("dateCol").gt("2021-12-31"), // filter condition
    functions.col("colA").plus(1)              // update expression
);
What's the execution engine behind these APIs? Do we have configurable/pluggable engine options?
rfc/rfc-54/rfc-54.md (Outdated)
// create Hudi table
HudiTable hudiTable = HudiTable.create(HoodieTableConfig.newBuilder()
Will the HudiTable be used by the driver only? Can we access the HudiTable in a distributed manner?
Can you please elaborate more on "distributed manner", perhaps with an example/use case? As such, HudiTable is just another client. Do you mean distributed in the sense that there can be multiple drivers, and how to keep the state of HudiTable consistent across them?
For example, we have HoodieTable in HoodieIOHandle; currently we can create many IOHandles in the workers. Also, for Flink, we create a writeClient with a HoodieTable object in each subtask (StreamWriteFunction) to do the write workload. Right now the way we use the write client and HoodieTable differs for each compute engine, so I was wondering if we can define an abstraction layer on the Hudi side to provide a unified entry point covering both driver-side write triggers (like Spark) and distributed worker-side write triggers (like Flink), and fully manage consistency there. That way we can adapt to new compute engines faster by reusing the existing pattern.
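To make the suggestion above concrete, here is a rough sketch of what such a unified entry point could look like. All names below are hypothetical and are not existing Hudi classes; it only illustrates the driver-side vs. worker-side split, not a proposed implementation.

```java
// Hypothetical sketch only: neither this interface nor these method names exist in Hudi today.
public interface UnifiedWriteEntryPoint<R> extends AutoCloseable {

  // Driver-side trigger (Spark-style): the engine hands the whole batch to Hudi,
  // which plans, distributes, and commits the write in one call.
  void writeBatch(String instantTime, R records);

  // Worker-side trigger (Flink-style): each parallel subtask opens its own handle
  // and pushes records; the coordinator commits once all handles have flushed.
  TaskWriteHandle openTaskHandle(String instantTime, int taskId);

  interface TaskWriteHandle extends AutoCloseable {
    void write(Object record);  // buffer or write a single record
    void flush();               // emit write statuses back to the coordinator
  }
}
```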
## Abstract

Users configure jobs to write Hudi tables and control the behaviour of their jobs at different levels such as table, write client, datasource, record
How do we use the HudiTable in the write client?
HudiTable is an alternative to using the write client directly. It is not meant to be used within the write client.
ok, sounds like the HudiTable is a new concept and not meant to replace the current HoodieTable. Did I understand correctly?
Yes, you got it.
I thought there is a naming convention in the community: the "hudi" prefix is for the project and its submodules, while "hoodie" is for classes. Maybe it is better not to break this rule and not use HudiTable as a class name?
Link to the hive sync refactor RFC.
Does Hudi plan to implement its own efficient data structure, like Spark's internal row and Flink's row data?
e.g. `spark.write.format("hudi").options(HoodieClusteringConfig.Builder().withXYZ().build())`
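Spelled out a bit more, a hedged sketch of what this proposal could look like in practice. This is not an existing API: `asOptions()` is a made-up bridge to the `Map<String, String>` that `DataFrameWriter.options()` accepts, and the clustering settings are only examples.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.hudi.config.HoodieClusteringConfig;

// Illustration of the proposed config streamlining: pass builder-produced
// configs to the datasource writer instead of loose string options.
void writeWithBuilderConfigs(Dataset<Row> df, String basePath) {
  df.write()
      .format("hudi")
      .options(HoodieClusteringConfig.newBuilder()
          .withInlineClustering(true)           // example clustering setting
          .withInlineClusteringNumCommits(4)    // example clustering setting
          .build()
          .asOptions())                         // hypothetical Map<String, String> bridge
      .mode(SaveMode.Append)
      .save(basePath);
}
```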
### Table APIs
This is a great idea! Many people used to other frameworks (like Delta Lake) would onboard easily. As a user, I just have one concern: are you planning on creating SDKs for other languages supported by Spark, especially Python? I'm asking because at my company we use Hudi successfully with PySpark (even though the Hudi project doesn't have a single line of Python) because of the way it works through configuration. I believe there are many other users who have successfully used Hudi with PySpark for that same reason, so I would think about that and maybe add that support to the roadmap.
This is a good call out. +1 for a Python client.
Happy you liked it, @vinothchandar. Would love to help if you guys need it. I would just need some initial direction :)
| Method Name | Description |
| ------------- | ------------- |
| bootstrap | Create a Hudi table from the given table in parquet and other supported formats |
| create | Create a Hudi table with the given configs if it does not exist. Returns an instance of `HudiTable` for the newly created or an existing Hudi table. |
| update | Update rows in a Hudi table that match the given condition with the given update expression |
| drop | Drop the given Hudi table completely |
| truncate | Delete data from the given Hudi table without dropping it |
| restoreTo | Restore Hudi table to the given older commit time or a logical time. |
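Put together, a usage sketch of these proposed methods might look like the following. The class and method signatures are extrapolated from the table above and the snippets quoted earlier in this thread, so the exact shapes (builder settings, the restoreTo argument) are illustrative rather than final.

```java
import org.apache.spark.sql.functions;

// Illustrative only: HudiTable is the API proposed by this RFC, not an existing class.
// Builder settings and the restoreTo argument are placeholders.
HudiTable hudiTable = HudiTable.create(HoodieTableConfig.newBuilder()
    .build());                                     // table path/name and other configs elided

hudiTable.update(
    functions.col("dateCol").gt("2021-12-31"),     // filter condition
    functions.col("colA").plus(1));                // update expression

hudiTable.restoreTo("20220101000000");             // an older commit time or a logical time

hudiTable.truncate();                              // delete data but keep the table
hudiTable.drop();                                  // drop the table completely
```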
Is this going to be a preferred API rather than Spark DataSource V2 APIs like writeTo(), toTable(), etc.?
Will this support Spark Structured Streaming read/write?
**Phase 1**

Spark will be the execution engine behind these APIs. We will use Spark SQL functions for update expressions.
Are you planning to add merge into, insert, and upsert operations in the future/next phases?
vinothchandar left a comment
High-level direction LGTM. What part of this is already going into 0.12?
Currently, users can create and update a Hudi table in three different ways: [Spark datasource](https://hudi.apache.org/docs/writing_data), [SQL](https://hudi.apache.org/docs/table_management) and [DeltaStreamer](https://hudi.apache.org/docs/hoodie_deltastreamer). Each one
But there is no DeltaStreamer anymore; it was renamed to just Streamer: https://hudi.apache.org/docs/hoodie_streaming_ingestion
Closing due to inactivity
What is the purpose of the pull request
RFC for new table APIs and config changes.
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.