<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# RFC-44: Hudi Connector for Presto

## Proposers

- @7c00

## Approvers

- @codope
- @vinothchandar

## Status

JIRA: [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210)

> Please keep the status updated in `rfc/README.md`.

## Abstract

The support for querying Hudi tables in Presto is provided by the Presto Hive connector. The implementation is built on
the InputFormat interface from the hudi-hadoop-mr module. This approach has known performance and stability issues, and
it is also hard to adopt new Hudi features due to the restrictions of the current code. A separate Hudi connector would
make it efficient to apply Hudi-specific optimizations, add new functionality, integrate advanced features, and evolve
rapidly with the upstream project.

> Reviewer comment: actually, the real issue has been maintainability, i.e. code often gets changed in Presto without us
> tracking/following with side effects.

> Reviewer comment: +1 with the new connector; it will be easy to expose all the indexes as well.

## Background

The current Presto integration reads Hudi tables as regular Hive tables with some additional processing. For a COW or
MOR-RO table, Presto applies a dedicated file filter (`HoodieROTablePathFilter`) to prune invalid files during split
generation. For an MOR-RT table, Presto delegates split generation to `HoodieParquetRealtimeInputFormat#getSplits`,
and data loading to the reader created by `HoodieParquetRealtimeInputFormat#getRecordReader`.

> Reviewer comment: we have actually completely moved away from this to using the FileSystemView directly, so a bunch of
> these issues should not exist, at least in the latest presto/master.

This implementation reuses existing code, but it has several drawbacks.

- Due to the MVCC design, the file layouts of Hudi tables are quite different from those of regular Hive tables.
Mixing the split generation for Hudi and non-Hudi tables makes it hard to integrate custom split scheduling
strategies. For example, HoodieParquetRealtimeInputFormat generates one split per file slice, which is not performant
and could be improved by fine-grained splitting.
- In order to transport Hudi-only split properties, CustomSplitConverter is applied. The restrictions of the current
code make such hacky and tricky code necessary, and increase the difficulty of testing. This would continue to hamper
the adoption of new Hudi features in the future.
- By delegating most work to HoodieParquetRealtimeInputFormat, memory usage is outside the control of Presto's memory
management mechanism. That hurts system robustness: workers tend to crash with OOM errors when querying a large MOR-RT table.

With a new separate connector, the Hudi integration could be improved without the restrictions of the current code in the
Hive connector. A separate connector also isolates Hudi's bugs from the Hive connector, giving more confidence to add
new code, since it can never break the Hive connector.

## Implementation

Presto provides a set of service provider interfaces (SPI) that allow developers to create plugins, which are
dynamically loaded into the Presto runtime to add custom functionality. A connector is a kind of plugin,
which can read data from, and sink data to, external data sources.

To create a connector for Hudi, the SPIs below are to be implemented:

- **Plugin**: the entry class to create plugin components (including connectors), instantiated and invoked by Presto
internal services when booting. `HudiPlugin` which implements Plugin interface is to be added for Hudi connector.
- **ConnectorFactory**: the factory class to create connector instances. `HudiConnectorFactory` which implements
ConnectorFactory is to be added for Hudi connector.
- **Connector**: the facade class of all service classes for a connector. `HudiConnector` which implements Connector
is to be added for Hudi connector. Primary service classes for Hudi connector are listed below.
- **ConnectorMetadata**: the service class to retrieve (and update if possible) metadata from/to a data source.
`HudiMetadata` which implements ConnectorMetadata is to be added to Hudi connector.
- **ConnectorSplitManager**: the service class to generate ConnectorSplits for accessing the data source.
A ConnectorSplit is similar to a Split in the Hadoop MapReduce paradigm. `HudiSplitManager`,
which implements ConnectorSplitManager, is to be added to Hudi connector.
- **ConnectorPageSourceProvider**: the service class to generate a reader to load data from the data source.
`HudiPageSourceProvider` which implements ConnectorPageSourceProvider, is to be added to Hudi connector.

There are other service classes (e.g. `ConnectorPageSinkProvider`), which are not covered in this RFC but might be
implemented in the future. A class-diagrammatic view of the different components is shown below.

![](presto-connector.png)
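To make the wiring concrete, here is a minimal sketch of how these classes fit together. The interfaces below are simplified stand-ins for the real Presto SPI types (the actual interfaces live in the Presto codebase and carry many more methods), so that the sketch is self-contained; only the class names `HudiPlugin`, `HudiConnectorFactory`, and `HudiConnector` come from this RFC.

```java
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the Presto SPI types; the real interfaces
// (Plugin, ConnectorFactory, Connector) have more methods.
interface Connector {}

interface ConnectorFactory {
    String getName();  // the connector name used in catalog configuration, e.g. "hudi"
    Connector create(String catalogName, Map<String, String> config);
}

interface Plugin {
    Iterable<ConnectorFactory> getConnectorFactories();
}

// Facade over the per-service classes (metadata, split manager, page sources).
class HudiConnector implements Connector {
    private final String catalogName;
    HudiConnector(String catalogName) { this.catalogName = catalogName; }
}

class HudiConnectorFactory implements ConnectorFactory {
    @Override public String getName() { return "hudi"; }
    @Override public Connector create(String catalogName, Map<String, String> config) {
        // In the real connector this would bootstrap HudiMetadata,
        // HudiSplitManager, and HudiPageSourceProvider.
        return new HudiConnector(catalogName);
    }
}

// Entry point discovered and invoked by Presto's plugin loader at boot time.
class HudiPlugin implements Plugin {
    @Override public Iterable<ConnectorFactory> getConnectorFactories() {
        return List.of(new HudiConnectorFactory());
    }
}
```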

### HudiMetadata

HudiMetadata implements the ConnectorMetadata interface and provides methods to access metadata such as databases
(called schemas in Presto), table and column structures, statistics, properties, and other info.
As Hudi table metadata is synchronized to the Hive MetaStore, HudiMetadata reuses ExtendedHiveMetastore from the Presto
codebase to retrieve Hudi table metadata. In addition, HudiMetadata uses HoodieTableMetaClient to expose Hudi-specific
metadata, e.g. the list of instants in a table's timeline, represented as a SystemTable.

### HudiSplitManager & HudiSplit

HudiSplitManager implements the ConnectorSplitManager interface. It partitions a Hudi table into multiple
individual chunks (called ConnectorSplits in Presto), so that the data set can be processed in parallel.
HudiSplitManager also performs partition pruning where possible.

HudiSplit, which implements ConnectorSplit, describes which files to read and their properties (e.g. offset, length,
location). For a normal Hudi table, HudiSplitManager generates one split per FileSlice at a certain instant.
For a Hudi table that contains large base files, HudiSplitManager divides each base file into multiple splits,
each containing all the log files and a part of the base file. This should improve performance.
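The fine-grained splitting of a large base file can be sketched as follows. The fixed target split size and the `SplitRange` type are illustrative assumptions, not the connector's actual policy; the point is that every range keeps all log files of the slice, since log records may touch rows anywhere in the base file.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of fine-grained split planning for a file slice (illustrative only). */
class HudiSplitPlanner {

    /** A byte range [offset, offset + length) of the base file, plus the slice's log files. */
    record SplitRange(long offset, long length, List<String> logFiles) {}

    /**
     * Divide a base file into ranges of at most targetSplitSize bytes.
     * Every range carries all log files of the slice, since log records
     * may update rows anywhere in the base file.
     */
    static List<SplitRange> plan(long baseFileLength, long targetSplitSize, List<String> logFiles) {
        List<SplitRange> splits = new ArrayList<>();
        for (long offset = 0; offset < baseFileLength; offset += targetSplitSize) {
            long length = Math.min(targetSplitSize, baseFileLength - offset);
            splits.add(new SplitRange(offset, length, logFiles));
        }
        return splits;
    }
}
```

For a 100-byte base file and a 40-byte target, this yields three ranges: [0, 40), [40, 80), and [80, 100).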

For large Hudi tables, generating splits tends to be time-consuming. This can be improved in two ways.
First, increase parallelism: since partitions are independent of one another, processing the partitions in multiple
threads, one thread per partition, reduces wall time. Second, use the Hudi metadata table or external file-listing
services, which avoids a large number of listing requests to HDFS.

> Reviewer comment: +1 to having some pluggable options here.

> Reviewer comment: this was added in the PR.
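A minimal sketch of the one-thread-per-partition idea, assuming a hypothetical `listFileSlices` callback that stands in for the real file-system or metadata-table listing:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Sketch: list the file slices of many partitions concurrently (illustrative only). */
class ParallelPartitionLister {

    static Map<String, List<String>> listAll(
            List<String> partitions,
            Function<String, List<String>> listFileSlices,  // hypothetical per-partition listing call
            int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per partition: partitions are independent of one another.
            Map<String, Future<List<String>>> futures = new LinkedHashMap<>();
            for (String partition : partitions) {
                futures.put(partition, pool.submit(() -> listFileSlices.apply(partition)));
            }
            Map<String, List<String>> result = new LinkedHashMap<>();
            for (Map.Entry<String, Future<List<String>>> e : futures.entrySet()) {
                try {
                    result.put(e.getKey(), e.getValue().get());
                } catch (InterruptedException | ExecutionException ex) {
                    throw new RuntimeException("listing failed for " + e.getKey(), ex);
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```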


### HudiPageSourceProvider & HudiPageSource

For each HudiSplit, HudiPageSourceProvider creates a reader that loads data from the data source (e.g. HCFS)
into memory. Presto organizes in-memory data in the form of Pages, which are implemented in a columnar fashion.
This design helps Presto's query performance; in particular, Presto implements its own readers to
accelerate data loading from columnar file formats such as Parquet and ORC.

For the snapshot query on a COW table, HudiPageSourceProvider takes advantage of Presto's native ParquetPageSource,
as it is well optimized for Parquet files as described above.

For the snapshot query on an MOR table, HudiPageSourceProvider is planned to add a page source that mixes
ParquetPageSource and HoodieRealtimeRecordReader. The new page source keeps memory usage under Presto's memory
tracking framework. A native log file reader optimized for Presto might be provided in the future.

> Reviewer comment: now the Hudi log format supports parquet blocks as well. Is there a way we can read/merge without
> involving the Hive/RecordReader APIs? Something we are exploring independently as well.

> Reviewer comment: +1, HUDI-4007 to track this task.
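At the record level, the merge such a page source performs can be illustrated as below. The in-memory lists stand in for the parquet reader and the log scanner, and last-write-wins per record key is the assumed merge semantics; real payload merging and delete handling are omitted.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of record-level merging for an MOR snapshot read (illustrative only). */
class MorSnapshotMerger {

    /** Stand-in row: a Hudi record key plus an opaque payload. */
    record Rec(String key, String payload) {}

    /**
     * Overlay log records on top of base-file records, last write wins per key.
     * In the real page source the base side comes from the parquet reader and
     * the log side from the log scanner; deletes are omitted here for brevity.
     */
    static List<Rec> merge(List<Rec> baseRecords, List<Rec> logRecords) {
        Map<String, Rec> merged = new LinkedHashMap<>();
        for (Rec r : baseRecords) merged.put(r.key(), r);
        for (Rec r : logRecords) merged.put(r.key(), r);  // update or insert
        return List.copyOf(merged.values());
    }
}
```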

## Rollout/Adoption Plan

> Reviewer comment: Can we also add an example of how to connect and query? Basically, what will it be like for the
> users? Are they going to run `USE hudi.schema_name` and `SELECT * FROM hudi.schema_name.table_name`?
> I believe for the existing Hudi tables in Hive, they should be able to query through the Hudi connector just by
> changing the catalog name, right?

> Author reply: We are going to add a catalog remapping service to remap the catalog (e.g. hive to hudi), so the end
> user could query the Hudi table `hudi.schema1.table1` with the name `hive.schema1.table1`. But before that, users
> need to change their queries.

- What impact (if any) will there be on existing users?

There will be no impact on existing users because this is a new connector. It does not change the behavior of the
current integration through the existing Hive connector; it gives users more choice. However, in order to use this
connector, the catalog name must be changed from `hive` to `hudi`.
For example, after users have configured the Hudi connector, `USE hudi.schema_name` should be used instead of `USE hive.schema_name`.
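For illustration, configuring and querying through the new connector could look like the following. The property names are assumptions modeled on Presto's usual catalog layout; the actual configuration keys will be defined by the connector implementation.

```properties
# etc/catalog/hudi.properties — hypothetical catalog file registering the Hudi connector
connector.name=hudi
hive.metastore.uri=thrift://metastore-host:9083
```

```sql
-- Query an existing Hudi table through the new catalog
USE hudi.schema_name;
SELECT * FROM hudi.schema_name.table_name;
```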

- What do we lose if we move away from the Hive connector?

Some features in the Hive connector (e.g. RaptorX) might be temporarily unavailable during the early development stage
of the Hudi connector. Most of them will be ported to the Hudi connector as it evolves.

> Reviewer comment: Apart from caching, are we going to miss out on something in filter evaluation? Will the filter
> evaluation be done by the Presto engine instead of the Hive connector?

> Author reply (@7c00, Jan 17, 2022): What is the exact meaning of filter evaluation here? AFAIK, Presto pushes down
> some predicate expressions to the connectors. Connectors use the expressions to prune some chunks of input data.
> For the Hudi connector, we could do it as well.

- If we need special migration tools, describe them here.

The Hudi connector provides another choice besides Hive connector for people to query Hudi tables from Presto.
No migration is needed.

- When will we remove the existing behavior?

We are not proposing to remove the existing behavior. We hope to build a critical mass of users who will want to use
the new Hudi connector. That said, we should continue to support the current integration.

## Test Plan

- [x] POC for snapshot query on COW table
- [ ] Unit tests for the connector
- [ ] Product integration tests
- [ ] Benchmark snapshot query for large tables