Add documentation for Hudi connector #13753
==============
Hudi connector
==============

.. raw:: html

  <img src="../_static/img/hudi.png" class="connector-logo">

The Hudi connector enables querying `Hudi <https://hudi.apache.org/docs/overview/>`_ tables.

Requirements
------------

To use the Hudi connector, you need:

* Network access from the Trino coordinator and workers to the Hudi storage.
* Access to the Hive metastore service (HMS).
* Network access from the Trino coordinator to the HMS.

Configuration
-------------

The connector requires a Hive metastore for table metadata and supports the same
metastore configuration properties as the :doc:`Hive connector
</connector/hive>`. At a minimum, ``hive.metastore.uri`` must be configured.
The connector recognizes Hudi tables synced to the metastore by the
`Hudi sync tool <https://hudi.apache.org/docs/syncing_metastore>`_.

To create a catalog that uses the Hudi connector, create a catalog properties file,
for example ``etc/catalog/example.properties``, that references the ``hudi``
connector. Update ``hive.metastore.uri`` with the URI of your Hive metastore
Thrift service:

.. code-block:: properties

    connector.name=hudi
    hive.metastore.uri=thrift://example.net:9083

Additionally, the following configuration properties can be set depending on the use case:

.. list-table:: Hudi configuration properties
   :widths: 30, 55, 15
   :header-rows: 1

   * - Property name
     - Description
     - Default
   * - ``hudi.metadata-enabled``
     - Fetch the list of file names and sizes from metadata rather than storage.
     - ``false``
   * - ``hudi.columns-to-hide``
     - List of column names that are hidden from the query output.
       It can be used to hide Hudi meta fields. By default, no fields are hidden.
     -
   * - ``hudi.parquet.use-column-names``
     - Access Parquet columns using names from the file. If disabled, then columns
       are accessed using the index. Only applicable to Parquet file format.
     - ``true``
   * - ``hudi.min-partition-batch-size``
     - Minimum number of partitions returned in a single batch.
     - ``10``
   * - ``hudi.max-partition-batch-size``
     - Maximum number of partitions returned in a single batch.
     - ``100``
   * - ``hudi.size-based-split-weights-enabled``
     - Unlike uniform splitting, size-based splitting ensures that each batch of splits
       has enough data to process. By default, it is enabled to improve performance.
     - ``true``
   * - ``hudi.standard-split-weight-size``
     - The split size corresponding to the standard weight (1.0)
       when size-based split weights are enabled.
     - ``128MB``
   * - ``hudi.minimum-assigned-split-weight``
     - Minimum weight that a split can be assigned
       when size-based split weights are enabled.
     - ``0.05``
   * - ``hudi.max-splits-per-second``
     - Rate at which splits are queued for processing.
       The queue is throttled if this rate limit is breached.
     - ``Integer.MAX_VALUE``
   * - ``hudi.max-outstanding-splits``
     - Maximum outstanding splits in a batch enqueued for processing.
     - ``1000``

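The size-based split weight properties interact as follows: each split receives a
weight proportional to its data size relative to ``hudi.standard-split-weight-size``,
clamped below by ``hudi.minimum-assigned-split-weight`` and above by the standard
weight of 1.0. The following Python snippet is an illustration of that behavior with
the default values, not Trino's actual implementation:

.. code-block:: python

    # Illustrative sketch of size-based split weighting using the defaults
    # of the properties above; not Trino's exact code.
    STANDARD_SPLIT_WEIGHT_SIZE = 128 * 1024 * 1024  # hudi.standard-split-weight-size
    MINIMUM_ASSIGNED_SPLIT_WEIGHT = 0.05            # hudi.minimum-assigned-split-weight

    def split_weight(file_size_bytes: int) -> float:
        """Weight proportional to data size, clamped to [minimum, 1.0]."""
        raw = file_size_bytes / STANDARD_SPLIT_WEIGHT_SIZE
        return min(1.0, max(MINIMUM_ASSIGNED_SPLIT_WEIGHT, raw))

    # A 128MB file gets the standard weight; tiny files are floored at the
    # minimum, so a batch of many small splits still carries some weight.
    print(split_weight(128 * 1024 * 1024))  # 1.0
    print(split_weight(64 * 1024 * 1024))   # 0.5
    print(split_weight(1024))               # 0.05
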
Supported file types
--------------------

The connector supports the Parquet file type.

SQL support
-----------

The connector provides read access to data in Hudi tables that have been synced to
the Hive metastore. The :ref:`globally available <sql-globally-available>`
and :ref:`read operation <sql-read-operations>` statements are supported.

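Synced tables can be explored with standard metadata statements. The catalog name
``example`` matches the catalog properties file shown earlier; the schema
``myschema`` and table ``stock_ticks_cow`` are illustrative:

.. code-block:: sql

    SHOW SCHEMAS FROM example;
    SHOW TABLES FROM example.myschema;
    DESCRIBE example.myschema.stock_ticks_cow;
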
Supported query types
^^^^^^^^^^^^^^^^^^^^^

Hudi supports `two types of tables <https://hudi.apache.org/docs/table_types>`_
depending on how the data is indexed and laid out on the file system. The following
table displays a support matrix of table types and query types for the connector.

=========================== =============================================
Table type                  Supported query type
=========================== =============================================
Copy on write               Snapshot queries
Merge on read               Read optimized queries
=========================== =============================================

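For merge-on-read tables, the Hudi sync tool typically registers a read optimized
view with an ``_ro`` suffix (and a real-time view with an ``_rt`` suffix) in the
metastore, so a read optimized query targets the ``_ro`` table. The table name
below is hypothetical:

.. code-block:: sql

    SELECT symbol, max(ts)
    FROM stock_ticks_mor_ro
    GROUP BY symbol;
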
Example queries
^^^^^^^^^^^^^^^

In the queries below, ``stock_ticks_cow`` is a Hudi copy-on-write table referred to
in the Hudi `quickstart <https://hudi.apache.org/docs/docker_demo/>`_ documentation.

Here are some sample queries:

.. code-block:: sql

    USE example.myschema;

    SELECT symbol, max(ts)
    FROM stock_ticks_cow
    GROUP BY symbol
    HAVING symbol = 'GOOG';

.. code-block:: text

      symbol   |        _col1         |
    -----------+----------------------+
     GOOG      | 2018-08-31 10:59:00  |
    (1 row)

.. code-block:: sql

    SELECT dt, symbol
    FROM stock_ticks_cow
    WHERE symbol = 'GOOG';

.. code-block:: text

        dt      | symbol |
    ------------+--------+
     2018-08-31 | GOOG   |
    (1 row)

.. code-block:: sql

    SELECT dt, count(*)
    FROM stock_ticks_cow
    GROUP BY dt;

.. code-block:: text

        dt      | _col1 |
    ------------+-------+
     2018-08-31 | 99    |
    (1 row)

Reading Hudi tables with the Hive connector
-------------------------------------------

Hudi tables can also be accessed with a catalog using the Hive connector. The supported
query types with the Hive connector are the same as those of the Hudi connector. To query
Hudi tables on Trino, place the
`hudi-trino-bundle <https://repo.maven.apache.org/maven2/org/apache/hudi/hudi-trino-bundle>`_
JAR file into the Hive connector installation, ``<trino_install>/plugin/hive``.
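
A minimal sketch of that installation step follows; the bundle version is hypothetical,
and ``<trino_install>`` is a placeholder for your actual installation directory:

.. code-block:: shell

    # Hypothetical version; choose the bundle matching your Hudi release.
    VERSION=0.11.1
    curl -O "https://repo.maven.apache.org/maven2/org/apache/hudi/hudi-trino-bundle/${VERSION}/hudi-trino-bundle-${VERSION}.jar"
    cp "hudi-trino-bundle-${VERSION}.jar" <trino_install>/plugin/hive/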