diff --git a/docs/src/main/sphinx/connector.rst b/docs/src/main/sphinx/connector.rst
index 54cca76b3874..de328d25fc36 100644
--- a/docs/src/main/sphinx/connector.rst
+++ b/docs/src/main/sphinx/connector.rst
@@ -19,6 +19,7 @@ from different data sources.
     Elasticsearch
     Google Sheets
     Hive
+    Hudi
     Iceberg
     JMX
     Kafka
diff --git a/docs/src/main/sphinx/connector/hudi.rst b/docs/src/main/sphinx/connector/hudi.rst
new file mode 100644
index 000000000000..8b532c6f6e28
--- /dev/null
+++ b/docs/src/main/sphinx/connector/hudi.rst
@@ -0,0 +1,167 @@
+==============
+Hudi connector
+==============
+
+.. raw:: html
+
+
+
+The Hudi connector enables querying `Hudi <https://hudi.apache.org/>`_ tables.
+
+Requirements
+------------
+
+To use the Hudi connector, you need:
+
+* Network access from the Trino coordinator and workers to the Hudi storage.
+* Access to the Hive metastore service (HMS).
+* Network access from the Trino coordinator to the HMS.
+
+Configuration
+-------------
+
+The connector requires a Hive metastore for table metadata and supports the same
+metastore configuration properties as the :doc:`Hive connector <hive>`. At a
+minimum, ``hive.metastore.uri`` must be configured. The connector recognizes
+Hudi tables synced to the metastore by the `Hudi sync tool
+<https://hudi.apache.org/docs/syncing_metastore>`_.
+
+To create a catalog that uses the Hudi connector, create a catalog properties file,
+for example ``etc/catalog/example.properties``, that references the ``hudi``
+connector. Update ``hive.metastore.uri`` with the URI of your Hive metastore
+Thrift service:
+
+.. code-block:: properties
+
+    connector.name=hudi
+    hive.metastore.uri=thrift://example.net:9083
+
+Additionally, the following configuration properties can be set, depending on
+the use case.
+
+.. list-table:: Hudi configuration properties
+   :widths: 30, 55, 15
+   :header-rows: 1
+
+   * - Property name
+     - Description
+     - Default
+   * - ``hudi.metadata-enabled``
+     - Fetch the list of file names and sizes from metadata rather than storage.
+     - ``false``
+   * - ``hudi.columns-to-hide``
+     - List of column names that are hidden from the query output.
+       It can be used to hide Hudi meta fields. By default, no fields are hidden.
+     -
+   * - ``hudi.parquet.use-column-names``
+     - Access Parquet columns using names from the file. If disabled, columns
+       are accessed by index. Only applicable to the Parquet file format.
+     - ``true``
+   * - ``hudi.min-partition-batch-size``
+     - Minimum number of partitions returned in a single batch.
+     - ``10``
+   * - ``hudi.max-partition-batch-size``
+     - Maximum number of partitions returned in a single batch.
+     - ``100``
+   * - ``hudi.size-based-split-weights-enabled``
+     - Unlike uniform splitting, size-based splitting ensures that each batch of
+       splits has enough data to process. Enabled by default to improve performance.
+     - ``true``
+   * - ``hudi.standard-split-weight-size``
+     - The split size corresponding to the standard weight (1.0)
+       when size-based split weights are enabled.
+     - ``128MB``
+   * - ``hudi.minimum-assigned-split-weight``
+     - Minimum weight that a split can be assigned
+       when size-based split weights are enabled.
+     - ``0.05``
+   * - ``hudi.max-splits-per-second``
+     - Rate at which splits are queued for processing.
+       The queue is throttled if this rate limit is breached.
+     - ``Integer.MAX_VALUE``
+   * - ``hudi.max-outstanding-splits``
+     - Maximum outstanding splits in a batch enqueued for processing.
+     - ``1000``
+
+Supported file types
+--------------------
+
+The connector supports the Parquet file type.
+
+SQL support
+-----------
+
+The connector provides read access to data in Hudi tables that have been synced
+to the Hive metastore. The :ref:`globally available <sql-globally-available>`
+and :ref:`read operation <sql-read-operations>` statements are supported.
+
+Supported query types
+^^^^^^^^^^^^^^^^^^^^^
+
+Hudi supports `two types of tables <https://hudi.apache.org/docs/table_types>`_,
+depending on how the data is indexed and laid out on the file system.
+The following table displays a support matrix of table types and query types
+for the connector.
+
+=========================== =============================================
+Table type                  Supported query type
+=========================== =============================================
+Copy on write               Snapshot queries
+Merge on read               Read optimized queries
+=========================== =============================================
+
+Example queries
+^^^^^^^^^^^^^^^
+
+In the queries below, ``stock_ticks_cow`` is the Hudi copy-on-write table
+referred to in the Hudi `quickstart
+<https://hudi.apache.org/docs/docker_demo>`_ documentation.
+
+Here are some sample queries:
+
+.. code-block:: sql
+
+    USE "a-catalog".myschema;
+
+    SELECT symbol, max(ts)
+    FROM stock_ticks_cow
+    GROUP BY symbol
+    HAVING symbol = 'GOOG';
+
+.. code-block:: text
+
+     symbol |        _col1        |
+    --------+---------------------+
+     GOOG   | 2018-08-31 10:59:00 |
+    (1 row)
+
+.. code-block:: sql
+
+    SELECT dt, symbol
+    FROM stock_ticks_cow
+    WHERE symbol = 'GOOG';
+
+.. code-block:: text
+
+         dt     | symbol |
+    ------------+--------+
+     2018-08-31 | GOOG   |
+    (1 row)
+
+.. code-block:: sql
+
+    SELECT dt, count(*)
+    FROM stock_ticks_cow
+    GROUP BY dt;
+
+.. code-block:: text
+
+         dt     | _col1 |
+    ------------+-------+
+     2018-08-31 |    99 |
+    (1 row)
+
+Reading Hudi tables with the Hive connector
+-------------------------------------------
+
+Hudi tables can also be accessed through a catalog that uses the Hive connector.
+The query types supported by the Hive connector are the same as those of the
+Hudi connector. To query Hudi tables on Trino with the Hive connector, place the
+`hudi-trino-bundle <https://mvnrepository.com/artifact/org.apache.hudi/hudi-trino-bundle>`_
+JAR file in the ``plugin/hive`` directory of the Trino installation.
diff --git a/docs/src/main/sphinx/static/img/hudi.png b/docs/src/main/sphinx/static/img/hudi.png
new file mode 100644
index 000000000000..96ae05e396fc
Binary files /dev/null and b/docs/src/main/sphinx/static/img/hudi.png differ
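The ``hudi.columns-to-hide`` property documented in the diff above can be used to suppress Hudi's bookkeeping columns from query output. The following catalog file is a minimal sketch, assuming the standard Hudi meta column names (``_hoodie_commit_time`` and friends); verify the field names against your own tables before using it:

.. code-block:: properties

    connector.name=hudi
    hive.metastore.uri=thrift://example.net:9083

    # Hide Hudi meta fields from query output. The list assumes the default
    # Hudi meta column names; adjust it if your tables differ.
    hudi.columns-to-hide=_hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name

With this in place, ``SELECT *`` on a Hudi table returns only the data columns, while the meta fields remain queryable in other catalogs that do not hide them.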