diff --git a/rfc/rfc-44/presto-connector.png b/rfc/rfc-44/presto-connector.png new file mode 100644 index 0000000000000..21248e0f1ca01 Binary files /dev/null and b/rfc/rfc-44/presto-connector.png differ diff --git a/rfc/rfc-44/rfc-44.md b/rfc/rfc-44/rfc-44.md new file mode 100644 index 0000000000000..a55cce46ec684 --- /dev/null +++ b/rfc/rfc-44/rfc-44.md @@ -0,0 +1,158 @@ + + +# RFC-44: Hudi Connector for Presto + +## Proposers + +- @7c00 + +## Approvers + +- @codope +- @vinothchandar + +## Status + +JIRA: [HUDI-3210](https://issues.apache.org/jira/browse/HUDI-3210) + +> Please keep the status updated in `rfc/README.md`. + +## Abstract + +The support for querying Hudi tables in Presto is provided by Presto Hive connector. The implementation is built on the +InputFormat interface from hudi-hadoop-mr module. This approach has known performance and stability issues. It's also +hard to adopt new Hudi features due to the restrictions from current code. A separate Hudi connector would make it +efficient to make specific optimization, add new functionalities, integrate advanced features, and evolve rapidly with +the upstream project. + +## Background + +The current Presto integration reads Hudi tables as regular Hive tables with some additional processing. For a COW or +MOR-RO table, Presto applies a dedicated file filter (`HoodieROTablePathFilter`) to prune invalid files during split +generation. For an MOR-RT table, Presto delegates split generation to `HoodieParquetRealtimeInputFormat#getSplits`, +and data loading to the reader created by `HoodieParquetRealtimeInputFormat#getRecordReader`. + +This implementation could take advantage of existing code. But there are some drawbacks. + +- Due to the MVCC design, file layouts of Hudi tables are quite different from those of regular Hive tables. + Mixing the split generation for Hudi and non-Hudi tables makes it hard to extend to integrate custom split scheduling + strategies. For example, HoodieParquetRealtimeInputFormat generates a split for a file slice, which is not performant + and could be improved by fine-grained splitting. +- In order to transport Hudi-only split properties, CustomSplitConverter is applied. The restrictions from current code + make it necessary to have such hacking and tricky code, and increases the difficulty of testing. It would continue to + affect the new code to adopt new Hudi features in the future. +- By delegating most work to HoodieParquetRealtimeInputFormat, memory usage is out of control of Presto memory + management mechanism. That hurts the system robust: workers tend to crash for OOM when querying a large MOR-RT table. + +With a new separator connector, Hudi integration could be improved without the restriction of current code in Hive +connector. A separate connector also separates Hudi's bugs away from Hive connector. It becomes more confident to add +new code, since it would never break Hive connector. + +## Implementation + +Presto provides a set of service provider interfaces (SPI) to allow developers to create plugins that would be +dynamically loaded into Presto runtime system to add custom functionalities. The connector is a kind of plugin, +which can read and sink data from/to external data sources. + +To create a connector for Hudi, SPIs below are to be implements: + +- **Plugin**: the entry class to create plugin components (including connectors), instantiated and invoked by Presto + internal services when booting. `HudiPlugin` which implements Plugin interface is to be added for Hudi connector. +- **ConnectorFactory**: the factory class to create connector instances. `HudiConnectorFactory` which implements + ConnectorFactory is to be added for Hudi connector. +- **Connector**: the facade class of all service classes for a connector. `HudiConnector` which implements Connector + is to be added for Hudi connector. Primary service classes for Hudi connector are listed below. + - **ConnectorMetadata**: the service class to retrieve (and update if possible) metadata from/to a data source. + `HudiMetadata` which implements ConnectorMetadata is to be added to Hudi connector. + - **ConnectorSplitManager**: the service class to generate ConnectorSplit for accessing the data source. + A ConnectorSplit is similar to a Split in Hadoop MapReduce computation paradigm. `HudiConnectorSplitManager` + which implements ConnectorSplitManager, is to be added to Hudi connector. + - **ConnectorPageSourceProvider**: the service class to generate a reader to load data from the data source. + `HudiPageSourceProvider` which implements ConnectorPageSourceProvider, is to be added to Hudi connector. + +There are other service classes (e.g. `ConnectorPageSinkProvider`), which are not covered in this RFC but might be +implemented in the future. A class-diagrammatic view of the different components is shown below. + +![](presto-connector.png) + +### HudiMetadata + +HudiMetadata implements ConnectorMetadata interface, and provides methods to access the metadata like databases +(called schemas in Presto), table and column structures, statistics and properties, and other info. +As Hudi table metadata is synchronized to Hive MetaStore, HudiMetadata reuses ExtendHiveMetastore from Presto codebase +to retrieve Hudi table metadata. Besides, HudiMetadata takes use of HoodieTableMetaClient to expose Hudi specified +metadata, e.g. instants list of a table timeline, represented as SystemTable. + +### HudiSplitManager & HudiSplit + +HudiSplitManager implements the ConnectorSplitManager interface. It partitions a Hudi table to read into multiple +individual chunks (called ConnectorSplit in Presto), so that the data set can be processed in parallel. +HudiSplitManager also performs partition pruning if possible. + +HudiSplit, which implements ConnectorSplit, describes which files to read, and the file properties (i.e. offset, length, +location, etc). For a normal Hudi table, HudiSplitManager generates a split for each FileSlice at certain instant. +For a Hudi table that contains large base files, HudiSplitManager divides the base file into multiple splits, +each containing all the log files and part of the base file. This should improve the performance. + +For large Hudi tables, generating splits tends to cost lots of time. This can be improved in the following ways. +Firstly, increase parallelism. Since each partition is independent of another, processing the partitions in multiple +threads, one thread per partition, helps reduce the wall time. Secondly, take use of Hudi metadata table or external +listing services. This should save a large amount of cost to make requests to HDFS. + + +### HudiPageSourceProvider & HudiPageSource + +For each HudiSplit, HudiPageSourceProvider generates a reader for it to load data from the data source (e.g. HCFS) +into memory. Presto organizes the memory data in the form of Pages. Page is implemented in a column-store favor. +This design contributes to enhance Presto's query performance. In particular, Presto writes its own readers to +accelerate the data loading for column-store favored files, such as Parquet and ORC. + +For the snapshot query on a COW table, HudiPageSourceProvider takes advantage of Presto's native ParquetPageSource +as it's well optimized for Parquet files as described above. + +For the snapshot query on an MOR table, HudiPageSourceProvider is planned to add a page source that makes mixed uses of +ParquetPageSource and HoodieRealtimeRecordReader. The new page source takes care of memory usage under Presto's memory +tracking framework. A native log file reader optimized for Presto might be provided in the future. + +## Rollout/Adoption Plan + +- What impact (if any) will there be on existing users? + + There will be no impact on existing users because this is a new connector. It does not change the behavior of current + integration through the existing Hive connector. It gives users more choice. However, in order to use this connector, the catalog name should be changed to `hudi` from `hive`. + For example, after users have configured Hudi connector, then `USE hudi.schema_name` should be used instead of `USE hive.schema_name`. + +- What do we lose if we move away from the Hive connector? + + Some features in Hive connectors (e.g. RaptorX) might be temporarily unavailable during the early development stage of + Hudi connector. Most of them will be ported to Hudi connector as the connector evolution. + +- If we need special migration tools, describe them here. + + The Hudi connector provides another choice besides Hive connector for people to query Hudi tables from Presto. + No migration is needed. + +- When will we remove the existing behavior? + + We are not proposing to remove the existing behavior. We hope that we will have a critical mass of users who will like + to use the new Hudi connector. That said, we should continue to support the current integration. + +## Test Plan + +- [x] POC for snapshot query on COW table +- [ ] Unit tests for the connector +- [ ] Product integration tests +- [ ] Benchmark snapshot query for large tables \ No newline at end of file