1 change: 1 addition & 0 deletions docs/src/main/sphinx/connector.rst
@@ -19,6 +19,7 @@ from different data sources.
   Elasticsearch <connector/elasticsearch>
   Google Sheets <connector/googlesheets>
   Hive <connector/hive>
   Hudi <connector/hudi>
   Iceberg <connector/iceberg>
   JMX <connector/jmx>
   Kafka <connector/kafka>
167 changes: 167 additions & 0 deletions docs/src/main/sphinx/connector/hudi.rst
@@ -0,0 +1,167 @@
==============
Hudi connector
==============

.. raw:: html

    <img src="../_static/img/hudi.png" class="connector-logo">

The Hudi connector enables querying `Hudi <https://hudi.apache.org/docs/overview/>`_ tables.

Requirements
------------

To use the Hudi connector, you need:

* Network access from the Trino coordinator and workers to the Hudi storage.
* Access to the Hive metastore service (HMS).
* Network access from the Trino coordinator to the HMS.

Configuration
-------------

The connector requires a Hive metastore for table metadata and supports the same
metastore configuration properties as the :doc:`Hive connector
</connector/hive>`. At a minimum, ``hive.metastore.uri`` must be configured.
The connector recognizes Hudi tables synced to the metastore by the
`Hudi sync tool <https://hudi.apache.org/docs/syncing_metastore>`_.

To create a catalog that uses the Hudi connector, create a catalog properties file,
for example ``etc/catalog/example.properties``, that references the ``hudi``
connector. Update the ``hive.metastore.uri`` with the URI of your Hive metastore
Thrift service:

.. code-block:: properties

    connector.name=hudi
    hive.metastore.uri=thrift://example.net:9083

Additionally, the following configuration properties can be set depending on the use case.

.. list-table:: Hudi configuration properties
   :widths: 30, 55, 15
   :header-rows: 1

   * - Property name
     - Description
     - Default
   * - ``hudi.metadata-enabled``
     - Fetch the list of file names and sizes from metadata rather than storage.
     - ``false``
   * - ``hudi.columns-to-hide``
Contributor: We are speaking here about the metadata columns and not regular
table columns. Please add a note about what those metadata columns are. Do
reconsider having metadata columns in the Hudi connector (same as in Hive -
https://trino.io/docs/current/connector/hive.html#special-columns). Let's
strive for consistency within Trino, and not for consistency with the status
quo of Hudi's approach of exposing metadata columns as regular columns by
default.

cc @martint @electrum

Contributor (author): Fair enough. It's not a blocker for us. Let me think a
bit more about this. I also want to discuss this internally with the team.

     - List of column names that are hidden from the query output. It can be
       used to hide Hudi meta fields. By default, no fields are hidden.
     -
   * - ``hudi.parquet.use-column-names``
     - Access Parquet columns using names from the file. If disabled, columns
       are accessed using the index. Only applicable to the Parquet file format.
     - ``true``
   * - ``hudi.min-partition-batch-size``
     - Minimum number of partitions returned in a single batch.
     - ``10``
   * - ``hudi.max-partition-batch-size``
     - Maximum number of partitions returned in a single batch.
     - ``100``
   * - ``hudi.size-based-split-weights-enabled``
     - Unlike uniform splitting, size-based splitting ensures that each batch
       of splits has enough data to process. By default, it is enabled to
       improve performance.
     - ``true``
   * - ``hudi.standard-split-weight-size``
     - The split size corresponding to the standard weight (1.0) when
       size-based split weights are enabled.
     - ``128MB``
   * - ``hudi.minimum-assigned-split-weight``
     - Minimum weight that a split can be assigned when size-based split
       weights are enabled.
     - ``0.05``
   * - ``hudi.max-splits-per-second``
     - Rate at which splits are queued for processing. The queue is throttled
       if this rate limit is breached.
     - ``Integer.MAX_VALUE``
   * - ``hudi.max-outstanding-splits``
     - Maximum outstanding splits in a batch enqueued for processing.
     - ``1000``
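Putting several of these properties together, a catalog file that enables
metadata-based file listing and hides the Hudi meta fields might look like the
sketch below. The five ``_hoodie_*`` column names are the standard Hudi meta
fields, and the comma-separated format for ``hudi.columns-to-hide`` is an
assumption to verify against your deployment:

.. code-block:: properties

    connector.name=hudi
    hive.metastore.uri=thrift://example.net:9083
    # Fetch file names and sizes from metadata rather than storage
    hudi.metadata-enabled=true
    # Hide the standard Hudi meta fields from query output (assumed list format)
    hudi.columns-to-hide=_hoodie_commit_time,_hoodie_commit_seqno,_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name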

Supported file types
--------------------

The connector supports the Parquet file format.

SQL support
-----------

The connector provides read access to data in Hudi tables that have been
synced to the Hive metastore. The :ref:`globally available
<sql-globally-available>` and :ref:`read operation <sql-read-operations>`
statements are supported.
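For instance, with a catalog named ``example`` (a hypothetical name) backed by
the Hudi connector, the usual read-only metadata statements apply:

.. code-block:: sql

    -- List the Hudi tables synced to the metastore in a schema
    SHOW TABLES FROM example.myschema;

    -- Inspect the columns of a synced table
    DESCRIBE example.myschema.stock_ticks_cow;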

Supported query types
^^^^^^^^^^^^^^^^^^^^^

Contributor: BTW, talk about the exposed virtual tables ``table_name_rt`` and
``table_name_ro`` and what they are.

Hudi supports `two types of tables <https://hudi.apache.org/docs/table_types>`_
depending on how the data is indexed and laid out on the file system. The
following table shows a support matrix of table types and query types for the
connector.

=========================== =============================================
Table type                  Supported query type
=========================== =============================================
Copy on write               Snapshot queries
Merge on read               Read optimized queries
=========================== =============================================
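For merge-on-read tables, the Hudi sync tool typically registers two tables in
the metastore, suffixed ``_ro`` (read optimized) and ``_rt`` (real time); the
exact naming depends on your sync configuration. A read optimized query
against such a table is a regular ``SELECT`` (``stock_ticks_mor_ro`` is a
hypothetical table name):

.. code-block:: sql

    -- Read optimized query: reads only the compacted base Parquet files
    SELECT symbol, max(ts)
    FROM stock_ticks_mor_ro
    GROUP BY symbol;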

Contributor: Do you care to mention any limitations (if known)?

For example, Hive has a "Hive 3 related limitations" section:
https://trino.io/docs/current/connector/hive.html#hive-3-related-limitations

Member: Let's not talk about limitations; we generally document only what
works, not all the stuff that doesn't.

Example queries
^^^^^^^^^^^^^^^

In the queries below, ``stock_ticks_cow`` is the Hudi copy-on-write table
referred to in the Hudi `quickstart
<https://hudi.apache.org/docs/docker_demo/>`_ documentation.

Here are some sample queries:

.. code-block:: sql

    USE "a-catalog".myschema;

    SELECT symbol, max(ts)
    FROM stock_ticks_cow
    GROUP BY symbol
    HAVING symbol = 'GOOG';

.. code-block:: text

     symbol |        _col1        |
    --------+---------------------+
     GOOG   | 2018-08-31 10:59:00 |
    (1 row)

.. code-block:: sql

    SELECT dt, symbol
    FROM stock_ticks_cow
    WHERE symbol = 'GOOG';

.. code-block:: text

         dt     | symbol |
    ------------+--------+
     2018-08-31 | GOOG   |
    (1 row)

.. code-block:: sql

    SELECT dt, count(*)
    FROM stock_ticks_cow
    GROUP BY dt;

.. code-block:: text

         dt     | _col1 |
    ------------+-------+
     2018-08-31 |    99 |
    (1 row)
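The queries above can also be run non-interactively with the Trino CLI. The
server URL, catalog name ``example``, and schema name are assumptions to adapt
to your cluster:

.. code-block:: shell

    # Run a single query against the Hudi catalog and print the result
    trino --server http://trino.example.net:8080 \
          --catalog example --schema myschema \
          --execute "SELECT dt, count(*) FROM stock_ticks_cow GROUP BY dt"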

Reading Hudi tables with the Hive connector
-------------------------------------------

Hudi tables can also be accessed through a catalog using the Hive connector.
The query types supported by the Hive connector are the same as those of the
Hudi connector. To query Hudi tables on Trino, place the `hudi-trino-bundle
<https://repo.maven.apache.org/maven2/org/apache/hudi/hudi-trino-bundle>`_
JAR file into the Hive connector plugin directory,
``<trino_install>/plugin/hive``.
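A sketch of the JAR placement, with ``<version>`` and ``<trino_install>`` left
as placeholders to fill in for your Hudi release and Trino installation:

.. code-block:: shell

    # Download the Hudi bundle for Trino (replace <version> with your Hudi release)
    curl -O https://repo.maven.apache.org/maven2/org/apache/hudi/hudi-trino-bundle/<version>/hudi-trino-bundle-<version>.jar

    # Copy it into the Hive connector plugin directory and restart Trino
    cp hudi-trino-bundle-<version>.jar <trino_install>/plugin/hive/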
Binary file added docs/src/main/sphinx/static/img/hudi.png