[HUDI-3345][RFC-36] Hudi metastore server #4718
base: master
Conversation
vinothchandar left a comment:
Thanks for the detailed proposal @minihippo. Reviewing!
vinothchandar left a comment:
I think the direction here is great. Can we also flesh out some of the details? I am happy to push a commit with my edits to the design. You can let me know how you like it, or revert if you don't. Let me know if you are okay with it.
| **How Hudi metadata is stored** |
| The metadata of Hudi includes the table location, configuration and schema, the timeline generated by instants, and the metadata of each commit/instant, which records the files created/updated, the number of new records, and so on for that commit. Besides, the information about the files in a Hudi table is also part of the Hudi metadata. |
Would also love to get file listings and column ranges for each file as part of the metastore.
File listing is covered by the metastore in the alpha version, supported by the snapshot service part. Do column ranges mean statistics like min/max at the column level? There is a plan to do that in the next version.
Yes. To be precise, I would like to see if the metastore can serve FILES and COL_STATS partitions of the metadata table.
| Hive metastore server is widely used as a metadata center in the data warehouse on Hadoop. It stores the metadata for Hive tables, like their schema, location and partitions. Currently, almost all storage or computing engines support registering table information to it, and discovering and retrieving metadata from it. Meanwhile, cloud service providers like AWS Glue, HUAWEI Cloud, Google Cloud Dataproc, Alibaba Cloud, and ByteDance Volcano Engine all provide Apache Hive metastore compatible catalogs. It seems that Hive metastore has become a standard in the data warehouse. |
| Different from a traditional table format like the Hive table, a data lake table not only has schema, partitions and other Hive metadata, but also has a timeline and snapshots, which are unconventional. Hence, the metadata of the data lake cannot be managed by HMS directly. |
I would also add a lock provider mechanism to this list
I have a proposal about the lock provider, but not in the alpha version. I will add the detailed design of this part to the RFC.
| - **Pluggable storage** |
| - The storage is only responsible for metadata persistence. Therefore, it doesn't matter which storage engine is used to store the data; it can be an RDBMS, a KV system or a file system. |
| - **Easy to be expanded** |
Plus one. If we can make it horizontally scalable and highly available like all the standard microservices out there, it will be amazing.
| - The creating or updating API cannot be invoked directly; only a new commit completion can trigger it. |
| - Dropping a partition not only deletes the partition and files at the metadata level, but also triggers a clean action that physically deletes the data on the file system. |
| - **timeline service** |
As you know, we have a timeline server already. Can we merge the existing functionality?
Yes, agree with you. About the timeline service, there are two options:
- Replace the timeline service started at the driver/coordinator in each job with the service in the metastore. Then all tasks will connect to the metastore server directly.
- Start a timeline service in the metastore server and an embedded timeline service (embeddedTS) in each job; tasks connect to the embeddedTS started at the job driver, and the embeddedTS connects to the metastore server.

Considering the concurrent-writing scenario, option 1 will bring consistency problems. I prefer option 2: its architecture is easier to expand and it can reduce the access pressure on the metastore server side.
Agree with Option 2. It's more scalable. Let's discuss details during implementation.
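To make option 2 a bit more concrete, here is a minimal sketch of how a per-job embedded timeline service could delegate to the metastore server. All class and method names (`MetastoreClient`, `EmbeddedTimelineService`, etc.) are hypothetical illustrations, not the existing Hudi timeline server API.

```java
// Hypothetical sketch of option 2: an embedded timeline service per job that caches
// timeline state locally and delegates to the remote metastore server, so executor
// tasks never talk to the metastore directly.
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

interface MetastoreClient {
  List<String> listInstants(String tableId);                   // remote call to the metastore server
  void reportCompletedInstant(String tableId, String instant); // notify the server of a completed commit
}

class EmbeddedTimelineService {
  private final MetastoreClient metastore;
  // per-table cache so repeated task requests are served from the driver
  private final ConcurrentHashMap<String, List<String>> instantCache = new ConcurrentHashMap<>();

  EmbeddedTimelineService(MetastoreClient metastore) {
    this.metastore = metastore;
  }

  /** Tasks call this on the driver; only cache misses reach the metastore server. */
  List<String> getInstants(String tableId) {
    return instantCache.computeIfAbsent(tableId, metastore::listInstants);
  }

  /** Writers report completion; the embedded service forwards it and invalidates its cache. */
  void onInstantCompleted(String tableId, String instant) {
    metastore.reportCompletedInstant(tableId, instant);
    instantCache.remove(tableId); // force a reload on the next read
  }
}
```

The design point this illustrates is the access-pressure argument above: the metastore server sees one client per job rather than one per task.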
@minihippo Picking this back up again. What are the next steps in our plan here?
@vinothchandar Thanks for the review,
@minihippo Sounds good! We can revisit once you have the basic PR out.
@minihippo This is great work 👍. I think it can also solve the problem I recently met (HUDI-3634), since we keep commit instants consistent in the Hudi metastore server. But I'm curious how the Spark side gets the metadata of a Hudi table (stored in the Hudi metastore server) and a Hive table (stored in the HMS) in one query (like a Hudi table joining a Hive table). Will we handle this in the HudiCatalog, getting the Hudi table metadata from the Hudi metastore server and the Hive table metadata from HMS, or will we provide a unified view in the Hudi metastore server and let the Hudi metastore request the HMS server if it's a Hive table?
Very valuable idea! Further, maybe we can do more interesting things based on this very valuable Hudi metastore server, which would help realize a Hudi Lake Manager that could decouple Hudi ingestion from Hudi table services, including cleaner, archival, clustering, compaction and any table services in the future. This lake manager could unify and automatically invoke table services such as cleaner/clustering/compaction/archival (multi-writer and async) based on this metastore server. Users would only need to care about their own ingestion pipeline and leave all the table services to the manager to automatically discover and manage the Hudi table, thereby greatly reducing the pressure of operation and maintenance and the cost of onboarding. Maybe we could expand this RFC or raise a new RFC and take this metastore server as the information input? CC @yihua and @nsivabalan
@boneanxs In the ByteDance in-house implementation, we do it more like the second way. There is a proxy over the Hudi metastore server and the Hive metastore server; the proxy routes requests to the corresponding server according to the table type.
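A minimal sketch of that routing idea, assuming a simple per-table type registry; `CatalogProxy`, `TableCatalog` and the backing clients are placeholders invented for illustration, not the actual ByteDance proxy.

```java
// Illustrative sketch of a catalog proxy that routes metadata requests by table type:
// Hudi tables go to the Hudi metastore server, everything else falls back to HMS.
import java.util.Map;

enum TableType { HUDI, HIVE }

interface TableCatalog {
  Object getTable(String db, String table);
}

class CatalogProxy implements TableCatalog {
  private final Map<String, TableType> tableTypeRegistry; // "db.table" -> type
  private final TableCatalog hudiMetastore;
  private final TableCatalog hiveMetastore;

  CatalogProxy(Map<String, TableType> registry, TableCatalog hudi, TableCatalog hive) {
    this.tableTypeRegistry = registry;
    this.hudiMetastore = hudi;
    this.hiveMetastore = hive;
  }

  @Override
  public Object getTable(String db, String table) {
    TableType type = tableTypeRegistry.getOrDefault(db + "." + table, TableType.HIVE);
    return (type == TableType.HUDI ? hudiMetastore : hiveMetastore).getTable(db, table);
  }
}
```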
@zhangyue19921010 #4309 here it is.
Yeah, I read the #4309 RFC. What I am thinking is whether we could expand that scope. Maybe it becomes more common infrastructure, covering not only clustering/compaction but also clean, archive and any other service in the future :)
@zhangyue19921010 Yes, it's on the list. Hi @yuzhaojing, could you supply this part in the RFC?
On this RFC, I think the main thing is to decide the first-phase scope. IMO, it can be limited to just Hudi tables for now. Does the RFC address high availability/sharding of metadata? Have you thought about these? If the metastore will also deal with locks, then the servers will become stateful. Maybe we can phase them as well? @minihippo thoughts?
@vinothchandar Sorry for replying to the comments so late. When designing the storage schema of the metadata store, tbl_id is included in each storage table so that metadata can be sharded by tbl_id, and all the metadata of a table stays in one shard. There are no problems with joining across shards.
Short-term plan (target 1.0):
- Phase 1: implement the basic functions
- Phase 2: extensions
| **How Hudi metadata is stored** |
| The metadata of Hudi includes the table location, configuration and schema, the timeline generated by instants, and the metadata of each commit/instant, which records the files created/updated, the number of new records, and so on for that commit. Besides, the information about the files in a Hudi table is also part of the Hudi metadata. |
It would be good to list all the metadata stored in the file system today and explicitly call out the metadata that will be in the metadata server as per this design, versus the metadata we still plan to read from the file system. Eventually I think it is good to evolve the design in the direction of abstracting ALL the details from the file system and then having multiple implementations.
List of metadata I can think of (a sketch of these as a single interface follows the list):
- Dataset Partitioning
- Partition to File Group IDs mapping
- File Group Id to Data files mapping (Base file, log files for a specific version)
- Transaction timeline
- Schema registry
- Physical data statistics (column stats)
- Index (bloom, etc)
- Lock Manager
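One way to read the list above is as the surface of a single metadata abstraction with multiple backends (file system, metadata table, metastore server). A hedged sketch, with method names invented for illustration rather than taken from the RFC:

```java
// Hypothetical abstraction over all table metadata, independent of where it is stored.
// Each method maps to one item in the list above; names and signatures are illustrative only.
import java.util.List;
import java.util.Map;

interface TableMetadataStore {
  List<String> getPartitions(String tableId);                          // dataset partitioning
  List<String> getFileGroupIds(String tableId, String partition);      // partition -> file group ids
  List<String> getFiles(String tableId, String fileGroupId);           // base file + log files of a version
  List<String> getTimeline(String tableId);                            // transaction timeline
  String getSchema(String tableId);                                    // schema registry
  Map<String, String> getColumnStats(String tableId, String filePath); // physical data statistics
  byte[] getBloomFilter(String tableId, String filePath);              // index (bloom, etc.)
  AutoCloseable acquireLock(String tableId);                           // lock manager
}
```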
Good suggestions. I will add them and explain how they are stored in the RFC.
This discussion brings me to a high-level question. Today, column stats are already stored at the file level in the metadata table. So do we intend to completely replace the metadata table with this new metastore server?
Or do we intend to use the metastore server only to store table-level stats, similar to how the Hive metastore does?
Another possibility I can think of is just exposing endpoints via the metastore service to interact with the different partitions of the metadata table, as Vinoth pointed out in another comment.
@minihippo
| The RFC-15 metadata table is a proposal that can solve these problems. However, it only manages the metadata of one table; there is a lack of a unified view. |
| **The integration of Hive metastore and Hudi metadata lacks a single source of truth.** |
Agree with this generally. Moving a Hudi user away from the Hive metastore into a Hudi-specific metastore requires a lot more work on establishing trust and working out details. We have to remember that a lot of these companies have existing support contracts with engines, and those engines support the Hive metastore or a cloud-specific metastore; I don't see it as practical for us to push towards replacing them.
In my opinion, the practical way is to abstract all metadata interactions. Design a metaserver that serves metadata much more efficiently than doing it through the file system, plug that into these catalogs (HMS, Glue, etc.), and for any non-standard API provide a thin client that abstracts remote API calls into the Hudi metaserver if configured. The Hudi metaserver becomes an invisible implementation that, if configured, makes the Hudi write and read paths much more efficient.
For the first part, I think it's a question of how we provide a lake metastore that is Hive compatible, so that the metastore can connect to engines with no effort. There is a discussion / simple idea left in the Hive Metastore Adapter part at the end of the RFC page. Maybe the lake metastore is another store which adapts to the existing Hive metastore, and the lake one provides a superset of functions.
For the second part, I totally agree with you. Those are features that the metastore desperately needs to support; they have a higher priority than the Hive metastore adapter.
| There are specific requirements for the metastore server in different scenarios. Though the storage of the server is pluggable, considering the general case of disk storage, good read and write performance, and convenience of development, an RDBMS may be the better choice. |
| #### Storage Schema |
We need to think about the storage schema version at the highest level, and every evolution of the storage schema needs to ensure strict forwards and backwards compatibility. Think about scenarios where Hudi is upgraded and then downgraded to an older version in production.
It's a good question about compatibility. In my opinion, scalability will bring the biggest changes to the storage schema. So, at first we thought db sharding could make the metastore storage scalable, with tbl_id as the shard key, because sharding limits joins and only relations belonging to the same tbl_id will be joined together.
During the metastore's evolution at ByteDance, the other reasons for schema changes could all be solved by adding a new column with a default value to the table, so there is no compatibility problem.
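As a hedged illustration of the sharding argument (the routing function is assumed, not specified in the RFC): because every storage table carries tbl_id, a shard can be chosen from tbl_id alone, so all rows of one Hudi table land in the same shard and joins between them never cross shards.

```java
// Minimal sketch of shard selection by tbl_id; the modulo routing is an assumption
// for illustration, any deterministic function of the shard key would do.
class ShardRouter {
  private final int shardCount;

  ShardRouter(int shardCount) {
    this.shardCount = shardCount;
  }

  /** All metadata rows of the table identified by tblId map to the same shard. */
  int shardFor(long tblId) {
    return (int) Math.floorMod(tblId, (long) shardCount);
  }
}
```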
| | action | tinyint | instant action: commit, deltacommit, etc. | |
| | state | tinyint | instant state: requested, inflight, etc. | |
| | duration | int | for heartbeat | |
| | start_ts | timestamp | for heartbeat | |
What is the reason behind having 2 different fields for heartbeat? Can you please elaborate?
Whether the heartbeat has failed is judged by whether start_ts + duration < current_time.
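In code, that liveness check could look like the following; only the column semantics (start_ts, duration) come from the schema above, the helper class itself is illustrative.

```java
// Heartbeat liveness check as described above: the heartbeat is considered expired
// once start_ts + duration has passed. Class and method names are hypothetical.
class HeartbeatChecker {

  /**
   * @param startTsMillis  start_ts column, epoch millis
   * @param durationMillis duration column, the heartbeat lease length in millis
   * @param nowMillis      current time, epoch millis
   * @return true if the heartbeat has expired
   */
  static boolean isExpired(long startTsMillis, long durationMillis, long nowMillis) {
    return startTsMillis + durationMillis < nowMillis;
  }
}
```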
| - **Easy to be expanded** |
| - The service is stateless, so it can be scaled horizontally to support higher QPS. The storage can be split vertically to store more data. |
| - **Compatible with multiple computing engines** |
You mean compatible with multiple catalog services? HMS is not a computing engine, precisely speaking.
| - To the client, it exposes APIs that: |
| - get the latest snapshot of a partition without multiple file versions |
| - get the incremental files after a specified timestamp, for incremental reading |
To be precise, are these incremental files the files for the current incremental query results? In RFC-51, CDC support will enrich the CDC results by having the complete changed data. Maybe this API should belong to a "CDC service" that supports both the current basic incremental files and CDC files? The name "snapshot" usually does not imply CDC capabilities.
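For reference, a hedged sketch of a client-side view of the two calls quoted above (signatures invented for illustration); a separate "CDC service", as suggested, could reuse the same shape with richer change records.

```java
// Illustrative client-side view of the two APIs quoted above; not the actual RFC interface.
import java.util.List;

interface SnapshotServiceClient {
  /** Latest snapshot of a partition: one file slice per file group, no older versions. */
  List<String> getLatestSnapshotFiles(String tableId, String partition);

  /** Files added or changed after the given instant time, for incremental reading. */
  List<String> getIncrementalFiles(String tableId, String partition, String afterInstantTs);
}
```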
| - No external components are introduced and maintained. |
| crons: |
Suggestion: change `crons:` to `cons:`.
| - **tbl_params** |
| - is used to record table level statistics like file num, total size. |
If it's meant for statistics, then tbl_stats is more precise; params may lead to confusion with table properties.
| | update_time | timestamp | partition updated time | |
| | is_deleted | tinyint | whether the partition is deleted | |
| - **partition_params** |
ditto.
| - **partition_key_val** |
| - is used to support partition pruning. |
A bit confused by "key_val" vs "params". I suppose in "params" we store statistics just like "table params", but don't you also store min/max stats in the partition params? If so, then partition pruning should leverage "params" too. Not sure what is planned for "key_val".
rfc/rfc-36/rfc-36.md (outdated)
| [TBD] |
| ## Implementation |
Would love to see some plans on when/how to incorporate existing sync features like hive-sync, glue-sync, datahub-sync, and bigquery/snowflake-sync. It would be a good idea to consolidate the sync features.
@minihippo Cross-posting a related comment here: #5064 (comment)
Clarifying the RFC plans will certainly help expedite the implementation's review process.
Hi! This RFC has not been merged. Can you explain whether we are going to develop the metaserver, or has it been decided to abandon this idea?
@osipovgit The metaserver was implemented; see https://github.com/apache/hudi/tree/master/hudi-platform-service/hudi-metaserver. The RFC was not merged, as there are discussion points to be finalized, and the current implementation does not include every feature proposed or to-be-proposed.
What is the purpose of the pull request
A new RFC for the Hudi metastore server
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.