
Conversation

@minihippo
Contributor

What is the purpose of the pull request

A new RFC for the Hudi metastore server

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hudi-bot
Collaborator

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@leesf leesf self-assigned this Feb 14, 2022
@nsivabalan nsivabalan removed their assignment Feb 16, 2022
@vinothchandar vinothchandar added the rfc Request for comments label Feb 17, 2022
Member

@vinothchandar vinothchandar left a comment

Thanks for the detailed proposal @minihippo. Reviewing!

Member

@vinothchandar vinothchandar left a comment

I think the direction here is great. Can we also flesh out some of the details? I am happy to push a commit with my edits to the design; you can let me know how you like it, or revert if you don't. Let me know if you are okay with that.


**How Hudi metadata is stored**

Hudi's metadata includes the table location, configuration, and schema; the timeline generated by instants; and the metadata of each commit/instant, which records the files created or updated, the number of new records, and so on for that commit. Besides, the information about the files in a Hudi table is also part of the Hudi metadata.
Member

Would also love to get file listings and column ranges for each file as part of the metastore.

Contributor Author

@minihippo minihippo Feb 18, 2022

File listing is covered by the metastore in the alpha version, supported by the snapshot service part. Do column ranges mean statistics like min/max at the column level? There is a plan to do that in the next version.

Member

Yes. To be precise, I would like to see if the metastore can serve FILES and COL_STATS partitions of the metadata table.


Hive Metastore server is widely used as a metadata center in data warehouses on Hadoop. It stores the metadata of Hive tables, such as their schema, location, and partitions. Currently, almost all storage and computing engines support registering table information to it and discovering and retrieving metadata from it. Meanwhile, cloud service providers like AWS Glue, HUAWEI Cloud, Google Cloud Dataproc, Alibaba Cloud, and ByteDance Volcano Engine all provide an Apache Hive Metastore compatible catalog. Hive Metastore has effectively become a standard in the data warehouse world.

Unlike a traditional table format such as a Hive table, a data lake table not only has schema, partitions, and other Hive metadata, but also has a timeline and snapshots, which are unconventional. Hence, data lake metadata cannot be managed by HMS directly.
Member

I would also add a lock provider mechanism to this list

Contributor Author

@minihippo minihippo Feb 18, 2022

I have a proposal for the lock provider, but it is not in the alpha version. I will add the detailed design of this part to the RFC.

- **Pluggable storage**
- The storage layer is only responsible for metadata persistence. Therefore, it does not matter which storage engine is used to store the data; it can be an RDBMS, a KV store, or a file system.

- **Easy to expand**
Member

Plus one. If we can make it horizontally scalable and highly available like all the standard microservices out there, it will be amazing.
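To make the "pluggable storage" bullet quoted above a bit more concrete, here is a minimal, hypothetical Java sketch of such a storage abstraction; the interface and class names are illustrative only and are not from the RFC or the Hudi codebase.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical persistence abstraction: the metastore service code would only
// talk to this interface, so the backing store (RDBMS, KV store, file system)
// can be swapped without touching the service layer.
interface MetadataStorage {
    void put(String key, byte[] value);
    Optional<byte[]> get(String key);
}

// Trivial in-memory implementation, useful for tests; a MySQL- or
// RocksDB-backed implementation would plug in the same way.
class InMemoryMetadataStorage implements MetadataStorage {
    private final ConcurrentHashMap<String, byte[]> data = new ConcurrentHashMap<>();

    @Override
    public void put(String key, byte[] value) {
        data.put(key, value);
    }

    @Override
    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(data.get(key));
    }
}
```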

- creating or updating APIs cannot be invoked directly; only the completion of a new commit can trigger them.
- dropping a partition not only deletes the partition and its files at the metadata level, but also triggers a clean action that physically deletes the data on the file system.

- **timeline service**
Member

As you know, we have a timeline server already. Can we merge the existing functionality?

Contributor Author

@minihippo minihippo Feb 18, 2022

Yes, agree with you. About the timeline service, there are two options:

  1. Replace the timeline service started at the driver/coordinator of each job with the service in the metastore. All tasks would then connect to the metastore server directly.
  2. Run a timeline service in the metastore server and an embedded timeline server (embedded TS) in each job; tasks connect to the embedded TS started at the job driver, and the embedded TS connects to the metastore server.

Considering concurrent writing scenarios, option 1 would bring consistency problems. I prefer option 2; its architecture is easier to scale out and it reduces access pressure on the metastore server side.
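For illustration, a rough Java sketch of option 2 (all names are hypothetical, not actual Hudi classes): executors talk only to the embedded timeline server in the driver, and only that embedded server talks to the central metastore, which keeps the fan-out on the metastore low.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Common read-only view of the timeline.
interface TimelineView {
    List<String> getInstantsAfter(String tableId, String instantTime);
}

// Client for the central metastore server; in a real system this would be an RPC stub.
class MetastoreTimelineClient implements TimelineView {
    @Override
    public List<String> getInstantsAfter(String tableId, String instantTime) {
        // Placeholder for a remote call to the metastore server.
        return List.of();
    }
}

// Embedded timeline server running in the job driver: caches metastore
// responses so that individual tasks never hit the metastore directly.
class EmbeddedTimelineServer implements TimelineView {
    private final TimelineView metastore;
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    EmbeddedTimelineServer(TimelineView metastore) {
        this.metastore = metastore;
    }

    @Override
    public List<String> getInstantsAfter(String tableId, String instantTime) {
        return cache.computeIfAbsent(tableId + "@" + instantTime,
                k -> metastore.getInstantsAfter(tableId, instantTime));
    }
}
```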

Member

Agree with option 2. It's more scalable. Let's discuss details during implementation.

@vinothchandar
Member

@minihippo Picking this back up again. What are the next steps in our plan here?

@minihippo
Contributor Author

minihippo commented Mar 12, 2022

> @minihippo Picking this back up again. What are the next steps in our plan here?

@vinothchandar Thanks for the review.

  1. Add more details to the RFC.
  2. I will submit a PR for the initial hudi-metastore module, supporting the basic functions, next week.

@vinothchandar
Member

@minihippo Sounds good! We can revisit once you have the basic PR out

@boneanxs
Contributor

@minihippo This is great work 👍. I think it can also solve a problem I recently met: HUDI-3634, as we keep commit instants consistent in the Hudi metastore server.

But I'm curious how the Spark side gets metadata for a Hudi table (stored in the Hudi metastore server) and a Hive table (stored in the HMS) in one query (like a Hudi table joined with a Hive table)? Will we handle this in the HudiCatalog, getting Hudi table metadata from the Hudi metastore server and Hive table metadata from HMS, or will we provide a unified view in the Hudi metastore server and let the Hudi metastore request the HMS server if it's a Hive table?

@zhangyue19921010
Contributor

zhangyue19921010 commented Apr 18, 2022

Very valuable idea!

Further, maybe we can do more interesting things based on this very valuable Hudi metastore server. It would help realize a Hudi Lake Manager that decouples Hudi ingestion from Hudi table services, including cleaning, archival, clustering, compaction, and any table services added in the future.

This lake manager could unify and automatically invoke services such as clean/clustering/compaction/archive (multi-writer and async) based on this metastore server.

Users would only need to care about their own ingestion pipelines and leave all the table services to the manager, which automatically discovers and manages Hudi tables, thereby greatly reducing the operation and maintenance burden and the onboarding cost.

Maybe we could expand this RFC, or raise a new RFC and take this metastore server as an input?

CC @yihua and @nsivabalan

@minihippo
Contributor Author

minihippo commented Apr 25, 2022

> @minihippo This is great work 👍. I think it can also solve a problem I recently met: HUDI-3634, as we keep commit instants consistent in the Hudi metastore server.
>
> But I'm curious how the Spark side gets metadata for a Hudi table (stored in the Hudi metastore server) and a Hive table (stored in the HMS) in one query (like a Hudi table joined with a Hive table)? Will we handle this in the HudiCatalog, getting Hudi table metadata from the Hudi metastore server and Hive table metadata from HMS, or will we provide a unified view in the Hudi metastore server and let the Hudi metastore request the HMS server if it's a Hive table?

@boneanxs In ByteDance's in-house implementation, we do it more like the second way. There is a proxy in front of the Hudi metastore server and the Hive metastore server. The proxy routes requests to the corresponding server according to the table type.
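A minimal sketch of that routing idea, assuming a hypothetical catalog interface; the names below are made up for illustration and do not reflect the actual in-house implementation.

```java
// Hypothetical unified catalog facade that routes each request either to the
// Hudi metastore server or to the Hive Metastore, based on the table type.
interface TableCatalog {
    String getTableLocation(String db, String table);
}

class RoutingCatalogProxy implements TableCatalog {
    private final TableCatalog hudiMetastore;
    private final TableCatalog hiveMetastore;
    private final java.util.function.BiPredicate<String, String> isHudiTable;

    RoutingCatalogProxy(TableCatalog hudiMetastore,
                        TableCatalog hiveMetastore,
                        java.util.function.BiPredicate<String, String> isHudiTable) {
        this.hudiMetastore = hudiMetastore;
        this.hiveMetastore = hiveMetastore;
        this.isHudiTable = isHudiTable;
    }

    @Override
    public String getTableLocation(String db, String table) {
        // Route by table type; the check itself could be a cached lookup
        // of the table's registered format.
        return isHudiTable.test(db, table)
                ? hudiMetastore.getTableLocation(db, table)
                : hiveMetastore.getTableLocation(db, table);
    }
}
```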

@minihippo
Contributor Author

minihippo commented Apr 25, 2022

> Very valuable idea!
>
> Further, maybe we can do more interesting things based on this very valuable Hudi metastore server. It would help realize a Hudi Lake Manager that decouples Hudi ingestion from Hudi table services, including cleaning, archival, clustering, compaction, and any table services added in the future.
>
> This lake manager could unify and automatically invoke services such as clean/clustering/compaction/archive (multi-writer and async) based on this metastore server.
>
> Users would only need to care about their own ingestion pipelines and leave all the table services to the manager, which automatically discovers and manages Hudi tables, thereby greatly reducing the operation and maintenance burden and the onboarding cost.
>
> Maybe we could expand this RFC, or raise a new RFC and take this metastore server as an input?
>
> CC @yihua and @nsivabalan

@zhangyue19921010 #4309 here it is.

@zhangyue19921010
Contributor

Yep, I read the #4309 RFC. What I am thinking is whether we could expand this scope, maybe into a more common infrastructure covering not only clustering/compaction but also clean, archive, and any other service in the future :)

@minihippo
Contributor Author

@zhangyue19921010 Yes, it's on the list. Hi @yuzhaojing, could you supply this part in the RFC?

@vinothchandar
Member

On this RFC, I think the main thing is to decide the first phase scope. IMO, it can be limited to just Hudi tables for now, and depending on whether a hudi.metastore.uris is configured, queries will either use this metaserver or not.
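For illustration only, a tiny sketch of that gating idea, assuming the key is spelled hudi.metastore.uris as in the comment above (the final config name may differ):

```java
import java.util.Properties;

public class MetastoreGateExample {
    public static void main(String[] args) {
        Properties writerProps = new Properties();
        // Hypothetical config key taken from the discussion; if unset, readers and
        // writers would keep using file-system based metadata (the .hoodie folder).
        writerProps.setProperty("hudi.metastore.uris", "thrift://metaserver-host:9090");

        boolean useMetaserver = !writerProps.getProperty("hudi.metastore.uris", "").isEmpty();
        System.out.println(useMetaserver
                ? "Resolve table/timeline metadata via the Hudi metastore server"
                : "Fall back to file-system based metadata");
    }
}
```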

Does the RFC address high availability/sharding of metadata? Have you thought about these? If the metastore will also deal with locks, then the servers will become stateful. Maybe we can phase them as well? @minihippo thoughts?

@minihippo
Contributor Author

@vinothchandar Sorry for replying to the comments so late. When designing the storage schema of the metadata store, tbl_id was put in each storage table so that metadata can be sharded by tbl_id, with all metadata of a table living in one shard. There is no problem with joins across shards.
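A small illustrative sketch of the shard-key idea, assuming a numeric tbl_id column is present in every storage table as described above; the routing function itself is hypothetical.

```java
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    // All rows carrying the same tbl_id map to the same shard, so every join
    // needed to assemble a table's metadata stays within a single shard.
    public int shardFor(long tblId) {
        return (int) Math.floorMod(tblId, (long) numShards);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(8);
        System.out.println("tbl_id 42 -> shard " + router.shardFor(42L));
    }
}
```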

@minihippo
Contributor Author

Short-term plan (target 1.0)

Phase 1

Implement the basic functions:

  1. Database and table store
  2. All actions (e.g. commit, compaction) and operations (e.g. upsert, compact, cluster)
  3. Timeline and instant metadata store
  4. Partition and snapshot store
  5. Spark/Flink read/write based on the metastore
  6. Persistence of table/partition-level parameters (e.g. table config)

Phase 2

Extensions:

  1. Schema store and schema evolution support
  2. Concurrency support (will submit a new RFC)
  3. Hudi catalog


**How Hudi metadata is stored**

Hudi's metadata includes the table location, configuration, and schema; the timeline generated by instants; and the metadata of each commit/instant, which records the files created or updated, the number of new records, and so on for that commit. Besides, the information about the files in a Hudi table is also part of the Hudi metadata.
Contributor

It would be good to list all the metadata stored in the file system today and explicitly call out the metadata that will be in the metadata server as per this design and the ones we still plan to read from the file system. Eventually I think it is good to evolve the design in the direction of abstracting ALL the details from the file system and then having multiple implementations.

List of metadata I can think of

  1. Dataset Partitioning
  2. Partition to File Group IDs mapping
  3. File Group Id to Data files mapping (Base file, log files for a specific version)
  4. Transaction timeline
  5. Schema registry
  6. Physical data statistics (column stats)
  7. Index (bloom, etc)
  8. Lock Manager

Contributor Author

Good suggestions. I will add them and explain how they are stored in the RFC.

Contributor

@pratyakshsharma pratyakshsharma Jul 29, 2022

This discussion brings me to a high-level question. Today, column stats are already stored at the file level in the metadata table. So do we intend to completely replace the metadata table with this new metastore server?
Or do we intend to use the metastore server only to store table-level stats, similar to how the Hive metastore does?

Another possibility I can think of is just exposing endpoints via the metastore service to interact with different partitions of the metadata table, as Vinoth pointed out in another comment.
@minihippo


The RFC-15 metadata table is a proposal that can solve these problems. However, it only manages the metadata of a single table; there is a lack of a unified view.

**The integration of Hive metastore and Hudi metadata lacks a single source of truth.**
Contributor

Agree with this generally. Moving a Hudi user away from Hive Metastore to a Hudi-specific metastore requires a lot more work on establishing trust and working out details. We have to remember that a lot of these companies have existing support contracts with engines, and those engines support Hive Metastore or a cloud-specific metastore; I don't see it as practical for us to push towards replacing this.

In my opinion, the practical way is to abstract all metadata interactions. Design a metaserver that serves metadata much more efficiently than doing it through the file system, plug that into these catalogs (HMS, Glue, etc.), and for any non-standard API provide a thin client that abstracts remote API calls into the Hudi meta server if configured. The Hudi meta server then becomes an invisible implementation that, if configured, makes the Hudi write and read paths much more efficient.

Contributor Author

@minihippo minihippo Jul 20, 2022

For the first part, I think it's a question of how we provide a lake metastore that is Hive compatible, so that the metastore can connect to engines with no extra effort. There is a discussion / simple idea left in the Hive Metastore Adapter part at the end of the RFC page. Maybe the lake metastore is another store that adapts to the existing Hive metastore, with the lake one providing a superset of functions.

Contributor Author

For the second part, I totally agree with you. Those are the features the metastore most urgently needs to support, and they have a higher priority than the Hive metastore adapter.


There are specific requirements for the metastore server in different scenarios. Though the server's storage is pluggable, considering the typical use of disk storage, good read and write performance, and convenience of development, an RDBMS may be the better choice.

#### Storage Schema
Contributor

We need to think about the storage schema version at the highest level, and every evolution of the storage schema needs to ensure strict forward and backward compatibility. Think about scenarios where Hudi is upgraded and then downgraded to an older version in production.

Contributor Author

@minihippo minihippo Jul 20, 2022

It's a good question about compatibility. In my opinion, scalability is what would bring big changes to the storage schema. So at first we thought DB sharding could make the metastore storage scalable; tbl_id is suitable as the shard key, because sharding limits joins so that only relations belonging to the same tbl_id are joined together.

During the metastore's evolution at ByteDance, every other reason for schema change could be handled by adding a new column with a default value to the table, so there is no compatibility problem.

| action | tinyint | instant action, commit, deltacommit and etc. |
| state | tinyint | instant state, requested, inflight and etc. |
| duration | int | for heartbeat |
| start_ts | timestamp | for heartbeat |
Contributor

What is the reason behind having 2 different fields for heartbeat? Can you please elaborate?

Contributor Author

Whether the heartbeat has failed is judged by whether start_ts + duration < current_time.
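A small sketch of that liveness check in Java; the field and method names are illustrative.

```java
import java.time.Instant;

public class HeartbeatCheck {
    // start_ts and duration mirror the columns described above: a writer refreshes
    // start_ts while it is alive, and the heartbeat is considered expired once
    // start_ts + duration has passed.
    static boolean isExpired(long startTsEpochSec, long durationSec, long nowEpochSec) {
        return startTsEpochSec + durationSec < nowEpochSec;
    }

    public static void main(String[] args) {
        long now = Instant.now().getEpochSecond();
        System.out.println(isExpired(now - 120, 60, now));  // true: last beat too old
        System.out.println(isExpired(now - 30, 60, now));   // false: still alive
    }
}
```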

- **Easy to expand**
- The service is stateless, so it can be scaled horizontally to support higher QPS. The storage can be split vertically to store more data.

- **Compatible with multiple computing engines**
Member

You meant compatible with multiple catalog services? HMS is not a computing engine, precisely speaking.

- To clients, it exposes APIs that

- get the latest snapshot of a partition without multiple file versions
- get the incremental files after a specified timestamp, for incremental reads
Member

To be precise, are these incremental files the files for the current incremental query results? In RFC-51, CDC support will enrich the CDC results with the complete changed data. Maybe this API should belong to a "CDC service" that supports both the current basic incremental files and CDC files? The name "snapshot" usually does not imply CDC capabilities.
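To make the two APIs quoted above concrete, here is a hypothetical Java interface sketch; the RFC does not prescribe these names or signatures.

```java
import java.util.List;

// Hypothetical snapshot-service client API, mirroring the two bullets quoted above.
interface SnapshotService {
    // Latest committed file slice per file group in a partition, with no
    // superseded file versions included.
    List<String> getLatestSnapshotFiles(String tableId, String partitionPath);

    // Files added by commits completed strictly after the given instant time,
    // intended for incremental reads.
    List<String> getIncrementalFiles(String tableId, String partitionPath, String afterInstantTime);
}
```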


- No external components are introduced and maintained.

crons:
Member

Suggested change: `crons:` → `cons:`


- **tbl_params**

- is used to record table-level statistics like file count and total size.
Member

If it's meant for statistics, then tbl_stats is more precise; params may lead to confusion with table properties.

| update_time | timestamp | partition updated time |
| is_deleted | tinyint | whether the partition is deleted |

- **partition_params**
Member

ditto.


- **partition_key_val**

- is used to support partition pruning.
Member

A bit confused by "key_val" vs "params". I suppose in "params" we store statistics just like "table params", but don't you also store min/max stats in partition params? If so, then partition pruning should leverage "params" too. Not sure what is planned for "key_val".


[TBD]

## Implementation
Member

Would love to see some plans on when/how to incorporate existing sync features like hive-sync, glue-sync, datahub-sync, and bigquery/snowflake-sync. It would be a good idea to consolidate the sync features.

Member

@minihippo cross-posting a related comment here: #5064 (comment)
clarifying the RFC plans will certainly help expedite the implementation's review process

@xushiyan xushiyan changed the title [HUDI-3345][RFC-36] Proposal for hudi metastore server. [HUDI-3345][RFC-36] Hudi metastore server Aug 10, 2022
@yihua yihua added the priority:critical Production degraded; pipelines stalled label Sep 13, 2022
@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 26, 2024
@osipovgit

Hi! This RFC has not been merged. Can you explain whether the metaserver is still going to be developed, or has it been decided to abandon this idea? @vinothchandar @xushiyan @yihua

@xushiyan
Member

> Hi! This RFC has not been merged. Can you explain whether the metaserver is still going to be developed, or has it been decided to abandon this idea? @vinothchandar @xushiyan @yihua

@osipovgit metaserver was implemented - see https://github.com/apache/hudi/tree/master/hudi-platform-service/hudi-metaserver

The RFC was not merged as there are discussion points to be finalized, and the current implementation does not include every feature proposed or to-be-proposed.

@vinothchandar vinothchandar removed their assignment May 23, 2025