
Conversation

@codope (Member) commented May 23, 2022

What is the purpose of the pull request

RFC for new table APIs and config changes.


Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codope (Member, Author) commented May 23, 2022

cc @xushiyan
@fengjian428 I can make you co-author of this RFC and include your hive sync proposals in it, or if you have a draft RFC, let me know; I can link it later in this RFC.

| update | Update rows in a Hudi table that match the given condition with the given update expression |
| drop | Drop the given Hudi table completely |
| truncate | Delete data from the given Hudi table without dropping it |
| restoreToTime | Restore the Hudi table to the given older commit time |
Contributor commented:

It's more idiomatic to call it restoreTo, but if you want to specify the noun, it's better to use "timestamp" than just "time".

Member commented:

+1 on restoreTo(), but can we avoid tying it to a timestamp? This can be a logical time too, per se.

Member (Author) commented:
Agree. That was the intention. I will make it clear.
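To make that concrete, here is a minimal, hypothetical sketch of the renamed call, assuming restoreTo accepts an instant identifier (a commit time or any logical time on the timeline); the signature is illustrative, not taken from the RFC:

```java
// Hypothetical sketch: restoreTo takes an instant identifier, which can be a
// commit instant on the timeline or any logical time the table understands.
hudiTable.restoreTo("20220523093000"); // a Hudi commit instant (logical time)
```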

@vinothchandar added the rfc (Request for comments) label on May 23, 2022
| Method Name | Description |
| ------------- | ------------- |
| bootstrap | Create a Hudi table from the given parquet table |
| create | Create a Hudi table with the given configs if it does not exist. |
| update | Update rows in a Hudi table that match the given condition with the given update expression |
Contributor commented:

Will we make use of Spark SQL expressions here, or build Hudi expressions and transform Spark SQL expressions into Hudi expressions?

Member (Author) commented:

Spark SQL expressions. Please check my comment below: #5667 (comment)

```java
// update Hudi table: add 1 to colA for all records of the current year (2022)
hudiTable.update(
    functions.col("dateCol").gt("2021-12-31"), // filter condition
    functions.col("colA").plus(1)              // update expression
);
```
Contributor commented:

What's the running engine behind these APIs? Do we have configurable/pluggable engine options?

Member (Author) commented:

Good question. For now, we will start with the Spark engine. This assumption also relates to the question raised by @leesf, so we will use Spark SQL expressions. Eventually, we can build Hudi expressions and transformers for different engines.
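As a rough illustration of that eventual direction, here is a hypothetical sketch (none of these types are in the RFC) of an engine-agnostic expression layer with one transformer per engine:

```java
// Hypothetical sketch (not in the RFC): an engine-agnostic expression tree
// that per-engine transformers translate into native expressions.
interface HoodieExpression {
    <T> T accept(ExpressionTransformer<T> transformer);
}

// One transformer per engine; a Spark implementation would produce
// org.apache.spark.sql.Column, a Flink one its own expression type.
interface ExpressionTransformer<T> {
    T column(String name);
    T literal(Object value);
    T greaterThan(T left, T right);
    T plus(T left, T right);
}
```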


```java
// create Hudi table
HudiTable hudiTable = HudiTable.create(HoodieTableConfig.newBuilder()
        // ...
```
Member commented:

Will the HudiTable be used by the driver only? Can we access the HudiTable in a distributed manner?

Member (Author) commented:

Can you please elaborate more on "distributed manner", perhaps with an example/use case? As such, HudiTable is just another client. Do you mean distributed in the sense that there can be multiple drivers, and how to keep the state of HudiTable consistent across them?

Member commented:

For example, we have HoodieTable in HoodieIOHandle; currently we can create many IOHandles in the workers. Also, for Flink, we create a write client with a HoodieTable object in each subtask (StreamWriteFunction) to do the write workload. Right now, the ways we use the write client and HoodieTable differ for each compute engine, so I was wondering if we can define an abstraction layer on the Hudi side to provide a unified entry point, covering both driver-side write triggers (like Spark) and distributed worker-side write triggers (like Flink), to fully manage consistency. That way we can adapt to new compute engines faster by reusing the existing pattern.
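As a purely hypothetical sketch of such an abstraction layer (none of these names appear in the RFC), a unified entry point could hide whether writes are triggered from a driver or from distributed worker subtasks:

```java
// Hypothetical sketch: a unified write entry point that both a Spark
// driver-side trigger and a Flink per-subtask trigger could implement.
interface HoodieWriteEntryPoint<R> extends AutoCloseable {
    void open(String instantTime); // begin a write for the given instant
    void write(R record);          // engine-specific record type
    void commit();                 // finalize; coordination differs per engine
}
```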

## Abstract

Users configure jobs to write Hudi tables and control the behaviour of their
jobs at different levels such as table, write client, datasource, record
Member commented:
How do we use the HudiTable in the write client?

Member (Author) commented:
HudiTable is an alternative to using write client directly. It is not meant to be used within the write client.

Member commented:

OK, sounds like HudiTable is a new concept and not meant to replace the current HoodieTable. Did I understand correctly?

Contributor commented:
Yes, you got it.

Contributor commented:

I thought there is a naming convention in the community: the "hudi" prefix is for the project and its submodules, while "hoodie" is for classes. Maybe it is better not to break this rule and not use HudiTable as a class name?

@fengjian428 (Contributor) commented:

> cc @xushiyan @fengjian428 I can make you co-author of this RFC and include your hive sync proposals in this one, or if you have a draft RFC let me know; I can link it later in this RFC.

Hi @codope, here is RFC 55.

@codope force-pushed the rfc-54-table-api branch from cc2c72f to 0199def on May 26, 2022 16:26
@hudi-bot (Collaborator) commented:

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@xushiyan changed the title from "[HUDI-4142] RFC for new table APIs and config changes" to "[HUDI-4142] [RFC-54] for new table APIs and config changes" on May 30, 2022
@xushiyan changed the title from "[HUDI-4142] [RFC-54] for new table APIs and config changes" to "[HUDI-4142] [RFC-54] New Table APIs and streamline Hudi configs" on May 30, 2022
@danny0405 (Contributor) commented:

Does Hudi plan to implement its own efficient data structure, like Spark's InternalRow and Flink's RowData?

e.g. `spark.write.format("hudi").options(HoodieClusteringConfig.Builder().withXYZ().build())`
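For comparison, here is a small sketch of how the same thing is wired up today with raw string keys; the keys below are real Hudi configs, while the idea that a builder's build() output feeds DataFrameWriter#options(Map) is an assumption about the proposal, not settled API:

```java
import java.util.HashMap;
import java.util.Map;

// Today: raw string keys passed to Spark's existing options(Map) overload.
// The quoted line above proposes replacing these with typed config builders.
Map<String, String> opts = new HashMap<>();
opts.put("hoodie.clustering.inline", "true");          // real Hudi config key
opts.put("hoodie.clustering.inline.max.commits", "4"); // real Hudi config key
df.write().format("hudi").options(opts).mode("append").save(basePath);
```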

### Table APIs

A user commented:

This is a great idea! Many people used to other frameworks (like Delta Lake) would onboard easily. As a user, I just have one concern: are you planning on creating SDKs for other languages supported by Spark, especially Python? I ask because at my company we use Hudi successfully with PySpark (even though the Hudi project doesn't have a single line of Python) because of the way it works through configuration. I believe many other users have successfully used Hudi with PySpark for that same reason, so I would think about that and maybe add that support to the roadmap.

Member commented:

This is a good call out. +1 for a Python client.

The user replied:

Happy you liked it, @vinothchandar. Would love to help if you need it. I would just need some initial direction :)

@nsivabalan added the priority:critical (Production degraded; pipelines stalled) label on Jun 9, 2022
Comment on lines +119 to +126
| Method Name | Description |
| ------------- | ------------- |
| bootstrap | Create a Hudi table from the given table in parquet and other supported formats |
| create | Create a Hudi table with the given configs if it does not exist. Returns an instance of `HudiTable` for the newly created or an existing Hudi table. |
| update | Update rows in a Hudi table that match the given condition with the given update expression |
| drop | Drop the given Hudi table completely |
| truncate | Delete data from the given Hudi table without dropping it |
| restoreTo | Restore the Hudi table to the given older commit time or a logical time. |
Contributor commented:

Is this going to be the preferred API rather than Spark DataSource V2 APIs like writeTo(), toTable(), etc.?

Will this support Spark Structured Streaming read/write?
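For readers skimming the thread, here is a hypothetical end-to-end sketch of the API surface listed in the table above; every signature is assumed from the method names, not specified by the RFC:

```java
// Hypothetical usage sketch; signatures are inferred from the method names in
// the table above and are illustrative only.
HudiTable table = HudiTable.create(HoodieTableConfig.newBuilder()
        .build()); // builder options elided; the RFC does not fix them here
table.update(
        functions.col("dateCol").gt("2021-12-31"), // condition
        functions.col("colA").plus(1));            // update expression
table.restoreTo("20220523093000"); // commit instant or logical time
table.truncate();                  // delete data, keep the table
table.drop();                      // remove the table entirely
```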


**Phase 1**

Spark will be the execution engine behind these APIs. We will use Spark SQL functions for update expressions.
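For instance, a conditional update could lean directly on Spark's functions API; this is a hypothetical illustration, not an example from the RFC:

```java
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

// Hypothetical illustration: build the condition and the update expression
// from Spark SQL functions, then hand both to the proposed update API.
Column condition = functions.col("country").equalTo("US");
Column updateExpr = functions.col("fare").multiply(1.1);
hudiTable.update(condition, updateExpr);
```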
Contributor commented:

Are you planning to add merge into, insert, and upsert operations in the future/next phases?

@vinothchandar (Member) left a comment:

High-level direction LGTM. What part of this is already going into 0.12?

Currently, users can create and update Hudi Table using three different
ways: [Spark datasource](https://hudi.apache.org/docs/writing_data),
[SQL](https://hudi.apache.org/docs/table_management)
and [DeltaStreamer](https://hudi.apache.org/docs/hoodie_deltastreamer). Each one
Contributor commented:

But there is no DeltaStreamer anymore; it was renamed to just Streamer: https://hudi.apache.org/docs/hoodie_streaming_ingestion

@github-actions bot added the size:M (PR with lines of changes in (100, 300]) label on Feb 26, 2024
@vinothchandar removed their assignment on May 23, 2025
@vinothchandar (Member) commented:

Closing due to inactivity


Labels

  • priority:critical: Production degraded; pipelines stalled
  • release-1.0.0
  • rfc: Request for comments
  • size:M: PR with lines of changes in (100, 300]
