diff --git a/docs/src/main/sphinx/connector/delta-lake.rst b/docs/src/main/sphinx/connector/delta-lake.md
similarity index 54%
rename from docs/src/main/sphinx/connector/delta-lake.rst
rename to docs/src/main/sphinx/connector/delta-lake.md
index ad98019d280e..be630b0cd059 100644
--- a/docs/src/main/sphinx/connector/delta-lake.rst
+++ b/docs/src/main/sphinx/connector/delta-lake.md
@@ -1,59 +1,53 @@
-====================
-Delta Lake connector
-====================
+# Delta Lake connector
-.. raw:: html
+```{raw} html
+
+```
-
-
-The Delta Lake connector allows querying data stored in the `Delta Lake
-<https://delta.io>`_ format, including `Databricks Delta Lake
-<https://docs.databricks.com/delta/index.html>`_. The connector can natively
+The Delta Lake connector allows querying data stored in the [Delta Lake](https://delta.io) format, including [Databricks Delta Lake](https://docs.databricks.com/delta/index.html). The connector can natively
read the Delta Lake transaction log and thus detect when external systems change
data.
-Requirements
-------------
+## Requirements
To connect to Databricks Delta Lake, you need:
-* Tables written by Databricks Runtime 7.3 LTS, 9.1 LTS, 10.4 LTS, 11.3 LTS, and
+- Tables written by Databricks Runtime 7.3 LTS, 9.1 LTS, 10.4 LTS, 11.3 LTS, and
12.2 LTS are supported.
-* Deployments using AWS, HDFS, Azure Storage, and Google Cloud Storage (GCS) are
+- Deployments using AWS, HDFS, Azure Storage, and Google Cloud Storage (GCS) are
fully supported.
-* Network access from the coordinator and workers to the Delta Lake storage.
-* Access to the Hive metastore service (HMS) of Delta Lake or a separate HMS,
+- Network access from the coordinator and workers to the Delta Lake storage.
+- Access to the Hive metastore service (HMS) of Delta Lake or a separate HMS,
or a Glue metastore.
-* Network access to the HMS from the coordinator and workers. Port 9083 is the
+- Network access to the HMS from the coordinator and workers. Port 9083 is the
default port for the Thrift protocol used by the HMS.
-* Data files stored in the Parquet file format. These can be configured using
-  :ref:`file format configuration properties <hive-parquet-configuration>` per
+- Data files stored in the Parquet file format. These can be configured using
+  {ref}`file format configuration properties <hive-parquet-configuration>` per
catalog.
-General configuration
----------------------
+## General configuration
To configure the Delta Lake connector, create a catalog properties file
-``etc/catalog/example.properties`` that references the ``delta_lake``
+`etc/catalog/example.properties` that references the `delta_lake`
connector and defines a metastore. You must configure a metastore for table
-metadata. If you are using a :ref:`Hive metastore <hive-thrift-metastore>`,
-``hive.metastore.uri`` must be configured:
-
-.. code-block:: properties
-
- connector.name=delta_lake
- hive.metastore.uri=thrift://example.net:9083
+metadata. If you are using a {ref}`Hive metastore <hive-thrift-metastore>`,
+`hive.metastore.uri` must be configured:
-If you are using :ref:`AWS Glue <hive-glue-metastore>` as your metastore, you
-must instead set ``hive.metastore`` to ``glue``:
+```properties
+connector.name=delta_lake
+hive.metastore.uri=thrift://example.net:9083
+```
-.. code-block:: properties
+If you are using {ref}`AWS Glue <hive-glue-metastore>` as your metastore, you
+must instead set `hive.metastore` to `glue`:
- connector.name=delta_lake
- hive.metastore=glue
+```properties
+connector.name=delta_lake
+hive.metastore=glue
+```
Each metastore type has specific configuration properties along with
-:ref:`general metastore configuration properties <general-metastore-properties>`.
+{ref}`general metastore configuration properties <general-metastore-properties>`.
The connector recognizes Delta Lake tables created in the metastore by the Databricks
runtime. If non-Delta Lake tables are present in the metastore as well, they are not
@@ -62,16 +56,16 @@ visible to the connector.
To configure access to S3 and S3-compatible storage, Azure storage, and others,
consult the appropriate section of the Hive documentation:
-* :doc:`Amazon S3 </connector/hive-s3>`
-* :doc:`Azure storage documentation </connector/hive-azure>`
-* :ref:`GCS <hive-google-cloud-storage-configuration>`
+- {doc}`Amazon S3 </connector/hive-s3>`
+- {doc}`Azure storage documentation </connector/hive-azure>`
+- {ref}`GCS <hive-google-cloud-storage-configuration>`
-Delta Lake general configuration properties
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Delta Lake general configuration properties
The following configuration properties all use reasonable, tested default
values. Typical usage does not require you to configure them.
+```{eval-rst}
.. list-table:: Delta Lake configuration properties
:widths: 30, 55, 15
:header-rows: 1
@@ -180,13 +174,14 @@ values. Typical usage does not require you to configure them.
The equivalent catalog session property is
``vacuum_min_retention``.
- ``7 DAYS``
+```
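
+As an example, a catalog file can override one of these defaults, such as the
+minimum retention enforced for `VACUUM` (the values shown are illustrative):
+
+```properties
+connector.name=delta_lake
+hive.metastore.uri=thrift://example.net:9083
+delta.vacuum.min-retention=7d
+```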
-Catalog session properties
-^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Catalog session properties
-The following table describes :ref:`catalog session properties
+The following table describes {ref}`catalog session properties
<session-properties-definition>` supported by the Delta Lake connector:
+```{eval-rst}
.. list-table:: Catalog session properties
:widths: 40, 60, 20
:header-rows: 1
@@ -209,29 +204,28 @@ The following table describes :ref:`catalog session properties
* - ``projection_pushdown_enabled``
- Read only projected fields from row columns while performing ``SELECT`` queries
- ``true``
+```
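
+A session property from this table can be toggled for the current session; for
+example, assuming a catalog named `example`:
+
+```sql
+SET SESSION example.projection_pushdown_enabled = false;
+```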
-.. _delta-lake-type-mapping:
+(delta-lake-type-mapping)=
-Type mapping
-------------
+## Type mapping
Because Trino and Delta Lake each support types that the other does not, this
-connector :ref:`modifies some types <type-mapping-overview>` when reading or
+connector {ref}`modifies some types <type-mapping-overview>` when reading or
writing data. Data types might not map the same way in both directions between
Trino and the data source. Refer to the following sections for type mapping in
each direction.
-See the `Delta Transaction Log specification
-<https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types>`_
+See the [Delta Transaction Log specification](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types)
for more information about supported data types in the Delta Lake table format
specification.
-Delta Lake to Trino type mapping
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Delta Lake to Trino type mapping
The connector maps Delta Lake types to the corresponding Trino types following
this table:
+```{eval-rst}
.. list-table:: Delta Lake to Trino type mapping
:widths: 40, 60
:header-rows: 1
@@ -270,15 +264,16 @@ this table:
- ``MAP``
* - ``STRUCT(...)``
- ``ROW(...)``
+```
No other types are supported.
-Trino to Delta Lake type mapping
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Trino to Delta Lake type mapping
The connector maps Trino types to the corresponding Delta Lake types following
this table:
+```{eval-rst}
.. list-table:: Trino to Delta Lake type mapping
:widths: 60, 40
:header-rows: 1
@@ -315,25 +310,25 @@ this table:
- ``MAP``
* - ``ROW(...)``
- ``STRUCT(...)``
+```
No other types are supported.
-Security
---------
+## Security
The Delta Lake connector allows you to choose one of several means of providing
authorization at the catalog level. You can select a different type of
authorization check in different Delta Lake catalog files.
-.. _delta-lake-authorization:
+(delta-lake-authorization)=
-Authorization checks
-^^^^^^^^^^^^^^^^^^^^
+### Authorization checks
-Enable authorization checks for the connector by setting the ``delta.security``
+Enable authorization checks for the connector by setting the `delta.security`
property in the catalog properties file. This property must be one of the
security values in the following table:
+```{eval-rst}
.. list-table:: Delta Lake security values
:widths: 30, 60
:header-rows: 1
@@ -355,273 +350,277 @@ security values in the following table:
catalog configuration property. See
:ref:`catalog-file-based-access-control` for information on the
authorization configuration file.
+```
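
+For example, to combine the connector with file-based access control (a
+sketch; the `FILE` value and the rules file path are assumptions, not tested
+configuration):
+
+```properties
+delta.security=FILE
+security.config-file=etc/catalog/rules.json
+```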
-.. _delta-lake-sql-support:
+(delta-lake-sql-support)=
-SQL support
------------
+## SQL support
The connector provides read and write access to data and metadata in
-Delta Lake. In addition to the :ref:`globally available
-<sql-globally-available>` and :ref:`read operation <sql-read-operations>`
+Delta Lake. In addition to the {ref}`globally available
+<sql-globally-available>` and {ref}`read operation <sql-read-operations>`
statements, the connector supports the following features:
-* :ref:`sql-data-management`, see also :ref:`delta-lake-data-management`
-* :ref:`sql-view-management`
-* :doc:`/sql/create-schema`, see also :ref:`delta-lake-sql-basic-usage`
-* :doc:`/sql/create-table`, see also :ref:`delta-lake-sql-basic-usage`
-* :doc:`/sql/create-table-as`
-* :doc:`/sql/drop-table`
-* :doc:`/sql/alter-table`
-* :doc:`/sql/drop-schema`
-* :doc:`/sql/show-create-schema`
-* :doc:`/sql/show-create-table`
-* :doc:`/sql/comment`
+- {ref}`sql-data-management`, see also {ref}`delta-lake-data-management`
+- {ref}`sql-view-management`
+- {doc}`/sql/create-schema`, see also {ref}`delta-lake-sql-basic-usage`
+- {doc}`/sql/create-table`, see also {ref}`delta-lake-sql-basic-usage`
+- {doc}`/sql/create-table-as`
+- {doc}`/sql/drop-table`
+- {doc}`/sql/alter-table`
+- {doc}`/sql/drop-schema`
+- {doc}`/sql/show-create-schema`
+- {doc}`/sql/show-create-table`
+- {doc}`/sql/comment`
-.. _delta-lake-sql-basic-usage:
+(delta-lake-sql-basic-usage)=
-Basic usage examples
-^^^^^^^^^^^^^^^^^^^^
+### Basic usage examples
The connector supports creating schemas. You can create a schema with or without
a specified location.
-You can create a schema with the :doc:`/sql/create-schema` statement and the
-``location`` schema property. Tables in this schema are located in a
+You can create a schema with the {doc}`/sql/create-schema` statement and the
+`location` schema property. Tables in this schema are located in a
subdirectory under the schema location. Data files for tables in this schema
-using the default location are cleaned up if the table is dropped::
+using the default location are cleaned up if the table is dropped:
- CREATE SCHEMA example.example_schema
- WITH (location = 's3://my-bucket/a/path');
+```sql
+CREATE SCHEMA example.example_schema
+WITH (location = 's3://my-bucket/a/path');
+```
Optionally, the location can be omitted. Tables in this schema must have a
location included when you create them. The data files for these tables are not
-removed if the table is dropped::
-
- CREATE SCHEMA example.example_schema;
+removed if the table is dropped:
+```sql
+CREATE SCHEMA example.example_schema;
+```
When Delta Lake tables exist in storage but not in the metastore, Trino can be
-used to register the tables::
-
- CREATE TABLE example.default.example_table (
- dummy BIGINT
- )
- WITH (
- location = '...'
- )
-
-Columns listed in the DDL, such as ``dummy`` in the preceding example, are
+used to register the tables:
+
+```sql
+CREATE TABLE example.default.example_table (
+ dummy BIGINT
+)
+WITH (
+ location = '...'
+)
+```
+
+Columns listed in the DDL, such as `dummy` in the preceding example, are
ignored. The table schema is read from the transaction log instead. If the
schema is changed by an external system, Trino automatically uses the new
schema.
-.. warning::
-
- Using ``CREATE TABLE`` with an existing table content is deprecated, instead
- use the ``system.register_table`` procedure. The ``CREATE TABLE ... WITH
- (location=...)`` syntax can be temporarily re-enabled using the
- ``delta.legacy-create-table-with-existing-location.enabled`` catalog
- configuration property or
- ``legacy_create_table_with_existing_location_enabled`` catalog session
- property.
+:::{warning}
+Using `CREATE TABLE` with existing table content is deprecated; use the
+`system.register_table` procedure instead. The `CREATE TABLE ... WITH
+(location=...)` syntax can be temporarily re-enabled using the
+`delta.legacy-create-table-with-existing-location.enabled` catalog
+configuration property or the
+`legacy_create_table_with_existing_location_enabled` catalog session
+property.
+:::
If the specified location does not already contain a Delta table, the connector
automatically writes the initial transaction log entries and registers the table
-in the metastore. As a result, any Databricks engine can write to the table::
+in the metastore. As a result, any Databricks engine can write to the table:
- CREATE TABLE example.default.new_table (id BIGINT, address VARCHAR);
+```sql
+CREATE TABLE example.default.new_table (id BIGINT, address VARCHAR);
+```
-The Delta Lake connector also supports creating tables using the :doc:`CREATE
+The Delta Lake connector also supports creating tables using the {doc}`CREATE
TABLE AS </sql/create-table-as>` syntax.
-Procedures
-^^^^^^^^^^
+### Procedures
-Use the :doc:`/sql/call` statement to perform data manipulation or
+Use the {doc}`/sql/call` statement to perform data manipulation or
administrative tasks. Procedures are available in the system schema of each
catalog. The following code snippet displays how to call the
-``example_procedure`` in the ``examplecatalog`` catalog::
+`example_procedure` in the `examplecatalog` catalog:
- CALL examplecatalog.system.example_procedure()
+```sql
+CALL examplecatalog.system.example_procedure()
+```
-.. _delta-lake-register-table:
+(delta-lake-register-table)=
-Register table
-""""""""""""""
+#### Register table
The connector can register a table in the metastore using its existing
transaction logs and data files.
-The ``system.register_table`` procedure allows the caller to register an
+The `system.register_table` procedure allows the caller to register an
existing Delta Lake table in the metastore, using its existing transaction logs
-and data files::
+and data files:
- CALL example.system.register_table(schema_name => 'testdb', table_name => 'customer_orders', table_location => 's3://my-bucket/a/path')
+```sql
+CALL example.system.register_table(schema_name => 'testdb', table_name => 'customer_orders', table_location => 's3://my-bucket/a/path')
+```
To prevent unauthorized users from accessing data, this procedure is disabled by
default. The procedure is enabled only when
-``delta.register-table-procedure.enabled`` is set to ``true``.
+`delta.register-table-procedure.enabled` is set to `true`.
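
+For example, in the catalog properties file:
+
+```properties
+delta.register-table-procedure.enabled=true
+```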
-.. _delta-lake-unregister-table:
+(delta-lake-unregister-table)=
+
+#### Unregister table
-Unregister table
-""""""""""""""""
The connector can unregister existing Delta Lake tables from the metastore.
-The procedure ``system.unregister_table`` allows the caller to unregister an
-existing Delta Lake table from the metastores without deleting the data::
+The procedure `system.unregister_table` allows the caller to unregister an
+existing Delta Lake table from the metastore without deleting the data:
- CALL example.system.unregister_table(schema_name => 'testdb', table_name => 'customer_orders')
+```sql
+CALL example.system.unregister_table(schema_name => 'testdb', table_name => 'customer_orders')
+```
-.. _delta-lake-flush-metadata-cache:
+(delta-lake-flush-metadata-cache)=
-Flush metadata cache
-""""""""""""""""""""
+#### Flush metadata cache
-* ``system.flush_metadata_cache()``
+- `system.flush_metadata_cache()`
Flushes all metadata caches.
-* ``system.flush_metadata_cache(schema_name => ..., table_name => ...)``
+- `system.flush_metadata_cache(schema_name => ..., table_name => ...)`
Flushes metadata cache entries of a specific table.
The procedure requires named parameters.
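
+For example, to flush cached metadata for a single table:
+
+```sql
+CALL example.system.flush_metadata_cache(schema_name => 'testdb', table_name => 'customer_orders')
+```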
-.. _delta-lake-vacuum:
+(delta-lake-vacuum)=
-``VACUUM``
-""""""""""
+#### `VACUUM`
-The ``VACUUM`` procedure removes all old files that are not in the transaction
+The `VACUUM` procedure removes all old files that are not in the transaction
log, as well as files that are not needed to read table snapshots newer than the
-current time minus the retention period defined by the ``retention period``
+current time minus the retention period defined by the `retention period`
parameter.
-Users with ``INSERT`` and ``DELETE`` permissions on a table can run ``VACUUM``
+Users with `INSERT` and `DELETE` permissions on a table can run `VACUUM`
as follows:
-.. code-block:: shell
-
- CALL example.system.vacuum('exampleschemaname', 'exampletablename', '7d');
+```sql
+CALL example.system.vacuum('exampleschemaname', 'exampletablename', '7d');
+```
All parameters are required and must be presented in the following order:
-* Schema name
-* Table name
-* Retention period
+- Schema name
+- Table name
+- Retention period
-The ``delta.vacuum.min-retention`` configuration property provides a safety
+The `delta.vacuum.min-retention` configuration property provides a safety
measure to ensure that files are retained as expected. The minimum value for
-this property is ``0s``. There is a minimum retention session property as well,
-``vacuum_min_retention``.
+this property is `0s`. There is a minimum retention session property as well,
+`vacuum_min_retention`.
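
+For example, the retention floor can be raised for the current session
+(assuming a catalog named `example`; the value is illustrative):
+
+```sql
+SET SESSION example.vacuum_min_retention = '14d';
+```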
-.. _delta-lake-write-support:
+(delta-lake-write-support)=
-Updating data
-^^^^^^^^^^^^^
+### Updating data
-You can use the connector to :doc:`/sql/insert`, :doc:`/sql/delete`,
-:doc:`/sql/update`, and :doc:`/sql/merge` data in Delta Lake tables.
+You can use the connector to {doc}`/sql/insert`, {doc}`/sql/delete`,
+{doc}`/sql/update`, and {doc}`/sql/merge` data in Delta Lake tables.
Write operations are supported for tables stored on the following systems:
-* Azure ADLS Gen2, Google Cloud Storage
+- Azure ADLS Gen2, Google Cloud Storage
Writes to the Azure ADLS Gen2 and Google Cloud Storage are
enabled by default. Trino detects write collisions on these storage systems
when writing from multiple Trino clusters, or from other query engines.
-* S3 and S3-compatible storage
+- S3 and S3-compatible storage
-  Writes to :doc:`Amazon S3 </connector/hive-s3>` and S3-compatible storage must be enabled
-  with the ``delta.enable-non-concurrent-writes`` property. Writes to S3 can
+  Writes to {doc}`Amazon S3 </connector/hive-s3>` and S3-compatible storage must be enabled
+  with the `delta.enable-non-concurrent-writes` property. Writes to S3 can
safely be made from multiple Trino clusters; however, write collisions are not
detected when writing concurrently from other Delta Lake engines. You need to
make sure that no concurrent data modifications are run to avoid data
corruption.
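
+For example, to enable writes to S3 in the catalog properties file:
+
+```properties
+delta.enable-non-concurrent-writes=true
+```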
-.. _delta-lake-data-management:
+(delta-lake-data-management)=
-Data management
-^^^^^^^^^^^^^^^
+### Data management
-You can use the connector to :doc:`/sql/insert`, :doc:`/sql/delete`,
-:doc:`/sql/update`, and :doc:`/sql/merge` data in Delta Lake tables.
+You can use the connector to {doc}`/sql/insert`, {doc}`/sql/delete`,
+{doc}`/sql/update`, and {doc}`/sql/merge` data in Delta Lake tables.
Write operations are supported for tables stored on the following systems:
-* Azure ADLS Gen2, Google Cloud Storage
+- Azure ADLS Gen2, Google Cloud Storage
Writes to the Azure ADLS Gen2 and Google Cloud Storage are
enabled by default. Trino detects write collisions on these storage systems
when writing from multiple Trino clusters, or from other query engines.
-* S3 and S3-compatible storage
+- S3 and S3-compatible storage
-  Writes to :doc:`Amazon S3 </connector/hive-s3>` and S3-compatible storage must be enabled
-  with the ``delta.enable-non-concurrent-writes`` property. Writes to S3 can
+  Writes to {doc}`Amazon S3 </connector/hive-s3>` and S3-compatible storage must be enabled
+  with the `delta.enable-non-concurrent-writes` property. Writes to S3 can
safely be made from multiple Trino clusters; however, write collisions are not
detected when writing concurrently from other Delta Lake engines. You must
make sure that no concurrent data modifications are run to avoid data
corruption.
-Schema and table management
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Schema and table management
-The :ref:`sql-schema-table-management` functionality includes support for:
+The {ref}`sql-schema-table-management` functionality includes support for:
-* :doc:`/sql/create-schema`
-* :doc:`/sql/drop-schema`
-* :doc:`/sql/alter-schema`
-* :doc:`/sql/create-table`
-* :doc:`/sql/create-table-as`
-* :doc:`/sql/drop-table`
-* :doc:`/sql/alter-table`
-* :doc:`/sql/comment`
+- {doc}`/sql/create-schema`
+- {doc}`/sql/drop-schema`
+- {doc}`/sql/alter-schema`
+- {doc}`/sql/create-table`
+- {doc}`/sql/create-table-as`
+- {doc}`/sql/drop-table`
+- {doc}`/sql/alter-table`
+- {doc}`/sql/comment`
-.. _delta-lake-alter-table-execute:
+(delta-lake-alter-table-execute)=
-ALTER TABLE EXECUTE
-"""""""""""""""""""
+#### ALTER TABLE EXECUTE
The connector supports the following commands for use with
-:ref:`ALTER TABLE EXECUTE <alter-table-execute>`.
+{ref}`ALTER TABLE EXECUTE <alter-table-execute>`.
-optimize
-~~~~~~~~
+##### optimize
-The ``optimize`` command is used for rewriting the content of the specified
+The `optimize` command is used for rewriting the content of the specified
table so that it is merged into fewer but larger files. If the table is
partitioned, the data compaction acts separately on each partition selected for
optimization. This operation improves read performance.
-All files with a size below the optional ``file_size_threshold`` parameter
-(default value for the threshold is ``100MB``) are merged:
-
-.. code-block:: sql
+All files with a size below the optional `file_size_threshold` parameter
+(default value for the threshold is `100MB`) are merged:
- ALTER TABLE test_table EXECUTE optimize
+```sql
+ALTER TABLE test_table EXECUTE optimize
+```
The following statement merges files in a table that are
under 10 megabytes in size:
-.. code-block:: sql
+```sql
+ALTER TABLE test_table EXECUTE optimize(file_size_threshold => '10MB')
+```
- ALTER TABLE test_table EXECUTE optimize(file_size_threshold => '10MB')
-
-You can use a ``WHERE`` clause with the columns used to partition the table
+You can use a `WHERE` clause with the columns used to partition the table
to filter which partitions are optimized:
-.. code-block:: sql
+```sql
+ALTER TABLE test_partitioned_table EXECUTE optimize
+WHERE partition_key = 1
+```
- ALTER TABLE test_partitioned_table EXECUTE optimize
- WHERE partition_key = 1
+#### Table properties
-Table properties
-""""""""""""""""
The following table properties are available for use:
+```{eval-rst}
.. list-table:: Delta Lake table properties
:widths: 40, 60
:header-rows: 1
@@ -644,50 +643,56 @@ The following table properties are available for use:
* ``NONE``
Defaults to ``NONE``.
+```
-The following example uses all available table properties::
+The following example uses all available table properties:
- CREATE TABLE example.default.example_partitioned_table
- WITH (
- location = 's3://my-bucket/a/path',
- partitioned_by = ARRAY['regionkey'],
- checkpoint_interval = 5,
- change_data_feed_enabled = false,
- column_mapping_mode = 'name'
- )
- AS SELECT name, comment, regionkey FROM tpch.tiny.nation;
+```sql
+CREATE TABLE example.default.example_partitioned_table
+WITH (
+ location = 's3://my-bucket/a/path',
+ partitioned_by = ARRAY['regionkey'],
+ checkpoint_interval = 5,
+ change_data_feed_enabled = false,
+ column_mapping_mode = 'name'
+)
+AS SELECT name, comment, regionkey FROM tpch.tiny.nation;
+```
-Metadata tables
-"""""""""""""""
+#### Metadata tables
The connector exposes several metadata tables for each Delta Lake table.
These metadata tables contain information about the internal structure
of the Delta Lake table. You can query each metadata table by appending the
-metadata table name to the table name::
+metadata table name to the table name:
- SELECT * FROM "test_table$history"
+```sql
+SELECT * FROM "test_table$history"
+```
-``$history`` table
-~~~~~~~~~~~~~~~~~~
+##### `$history` table
-The ``$history`` table provides a log of the metadata changes performed on
+The `$history` table provides a log of the metadata changes performed on
the Delta Lake table.
-You can retrieve the changelog of the Delta Lake table ``test_table``
-by using the following query::
+You can retrieve the changelog of the Delta Lake table `test_table`
+by using the following query:
- SELECT * FROM "test_table$history"
+```sql
+SELECT * FROM "test_table$history"
+```
-.. code-block:: text
-
- version | timestamp | user_id | user_name | operation | operation_parameters | cluster_id | read_version | isolation_level | is_blind_append
- ---------+---------------------------------------+---------+-----------+--------------+---------------------------------------+---------------------------------+--------------+-------------------+----------------
- 2 | 2023-01-19 07:40:54.684 Europe/Vienna | trino | trino | WRITE | {queryId=20230119_064054_00008_4vq5t} | trino-406-trino-coordinator | 2 | WriteSerializable | true
- 1 | 2023-01-19 07:40:41.373 Europe/Vienna | trino | trino | ADD COLUMNS | {queryId=20230119_064041_00007_4vq5t} | trino-406-trino-coordinator | 0 | WriteSerializable | true
- 0 | 2023-01-19 07:40:10.497 Europe/Vienna | trino | trino | CREATE TABLE | {queryId=20230119_064010_00005_4vq5t} | trino-406-trino-coordinator | 0 | WriteSerializable | true
+```text
+ version | timestamp | user_id | user_name | operation | operation_parameters | cluster_id | read_version | isolation_level | is_blind_append
+---------+---------------------------------------+---------+-----------+--------------+---------------------------------------+---------------------------------+--------------+-------------------+----------------
+ 2 | 2023-01-19 07:40:54.684 Europe/Vienna | trino | trino | WRITE | {queryId=20230119_064054_00008_4vq5t} | trino-406-trino-coordinator | 2 | WriteSerializable | true
+ 1 | 2023-01-19 07:40:41.373 Europe/Vienna | trino | trino | ADD COLUMNS | {queryId=20230119_064041_00007_4vq5t} | trino-406-trino-coordinator | 0 | WriteSerializable | true
+ 0 | 2023-01-19 07:40:10.497 Europe/Vienna | trino | trino | CREATE TABLE | {queryId=20230119_064010_00005_4vq5t} | trino-406-trino-coordinator | 0 | WriteSerializable | true
+```
The output of the query has the following history columns:
+```{eval-rst}
.. list-table:: History columns
:widths: 30, 30, 40
:header-rows: 1
@@ -725,306 +730,295 @@ The output of the query has the following history columns:
* - ``is_blind_append``
- ``BOOLEAN``
- Whether or not the operation appended data
+```
-``$properties`` table
-~~~~~~~~~~~~~~~~~~~~~
+##### `$properties` table
-The ``$properties`` table provides access to Delta Lake table configuration,
+The `$properties` table provides access to Delta Lake table configuration,
table features, and table properties. The table rows are key/value pairs.
You can retrieve the properties of the Delta
-table ``test_table`` by using the following query::
-
- SELECT * FROM "test_table$properties"
+table `test_table` by using the following query:
-.. code-block:: text
+```sql
+SELECT * FROM "test_table$properties"
+```
- key | value |
- ----------------------------+-----------------+
- delta.minReaderVersion | 1 |
- delta.minWriterVersion | 4 |
- delta.columnMapping.mode | name |
- delta.feature.columnMapping | supported |
+```text
+ key | value |
+----------------------------+-----------------+
+delta.minReaderVersion | 1 |
+delta.minWriterVersion | 4 |
+delta.columnMapping.mode | name |
+delta.feature.columnMapping | supported |
+```
-.. _delta-lake-special-columns:
+(delta-lake-special-columns)=
-Metadata columns
-""""""""""""""""
+#### Metadata columns
In addition to the defined columns, the Delta Lake connector automatically
exposes metadata in a number of hidden columns in each table. You can use these
columns in your SQL statements like any other column, e.g., they can be selected
directly or used in conditional statements.
-* ``$path``
- Full file system path name of the file for this row.
-
-* ``$file_modified_time``
- Date and time of the last modification of the file for this row.
-
-* ``$file_size``
- Size of the file for this row.
+- `$path`
+ : Full file system path name of the file for this row.
+- `$file_modified_time`
+ : Date and time of the last modification of the file for this row.
+- `$file_size`
+ : Size of the file for this row.
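
+For example, the hidden columns can be selected like regular columns; quoting
+is required because of the `$` character:
+
+```sql
+SELECT *, "$path", "$file_size", "$file_modified_time"
+FROM test_table
+```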
-.. _delta-lake-fte-support:
+(delta-lake-fte-support)=
-Fault-tolerant execution support
---------------------------------
+## Fault-tolerant execution support
-The connector supports :doc:`/admin/fault-tolerant-execution` of query
+The connector supports {doc}`/admin/fault-tolerant-execution` of query
processing. Read and write operations are both supported with any retry policy.
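
+Fault-tolerant execution is configured on the cluster rather than in the
+catalog; as an illustrative sketch, a coordinator's `config.properties` can
+set:
+
+```properties
+retry-policy=QUERY
+```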
-
-Table functions
----------------
+## Table functions
The connector provides the following table functions:
-table_changes
-^^^^^^^^^^^^^
+### table_changes
Allows reading Change Data Feed (CDF) entries to expose row-level changes
-between two versions of a Delta Lake table. When the ``change_data_feed_enabled``
-table property is set to ``true`` on a specific Delta Lake table,
+between two versions of a Delta Lake table. When the `change_data_feed_enabled`
+table property is set to `true` on a specific Delta Lake table,
the connector records change events for all data changes on the table.
This is how these changes can be read:
-.. code-block:: sql
-
- SELECT
- *
- FROM
- TABLE(
- system.table_changes(
- schema_name => 'test_schema',
- table_name => 'tableName',
- since_version => 0
- )
- );
+```sql
+SELECT
+ *
+FROM
+ TABLE(
+ system.table_changes(
+ schema_name => 'test_schema',
+ table_name => 'tableName',
+ since_version => 0
+ )
+ );
+```
-``schema_name`` - type ``VARCHAR``, required, name of the schema for which the function is called
+`schema_name` - type `VARCHAR`, required, name of the schema for which the function is called
-``table_name`` - type ``VARCHAR``, required, name of the table for which the function is called
+`table_name` - type `VARCHAR`, required, name of the table for which the function is called
-``since_version`` - type ``BIGINT``, optional, version from which changes are shown, exclusive
+`since_version` - type `BIGINT`, optional, version from which changes are shown, exclusive
In addition to returning the columns present in the table, the function
returns the following values for each change event:
-* ``_change_type``
- Gives the type of change that occurred. Possible values are ``insert``,
- ``delete``, ``update_preimage`` and ``update_postimage``.
-
-* ``_commit_version``
- Shows the table version for which the change occurred.
-
-* ``_commit_timestamp``
- Represents the timestamp for the commit in which the specified change happened.
+- `_change_type`
+ : Gives the type of change that occurred. Possible values are `insert`,
+ `delete`, `update_preimage` and `update_postimage`.
+- `_commit_version`
+ : Shows the table version for which the change occurred.
+- `_commit_timestamp`
+ : Represents the timestamp for the commit in which the specified change happened.
This is how it would normally be used:
Create table:
-.. code-block:: sql
-
- CREATE TABLE test_schema.pages (page_url VARCHAR, domain VARCHAR, views INTEGER)
- WITH (change_data_feed_enabled = true);
+```sql
+CREATE TABLE test_schema.pages (page_url VARCHAR, domain VARCHAR, views INTEGER)
+ WITH (change_data_feed_enabled = true);
+```
Insert data:
-.. code-block:: sql
-
- INSERT INTO test_schema.pages
- VALUES
- ('url1', 'domain1', 1),
- ('url2', 'domain2', 2),
- ('url3', 'domain1', 3);
- INSERT INTO test_schema.pages
- VALUES
- ('url4', 'domain1', 400),
- ('url5', 'domain2', 500),
- ('url6', 'domain3', 2);
+```sql
+INSERT INTO test_schema.pages
+ VALUES
+ ('url1', 'domain1', 1),
+ ('url2', 'domain2', 2),
+ ('url3', 'domain1', 3);
+INSERT INTO test_schema.pages
+ VALUES
+ ('url4', 'domain1', 400),
+ ('url5', 'domain2', 500),
+ ('url6', 'domain3', 2);
+```
Update data:
-.. code-block:: sql
-
- UPDATE test_schema.pages
- SET domain = 'domain4'
- WHERE views = 2;
+```sql
+UPDATE test_schema.pages
+ SET domain = 'domain4'
+ WHERE views = 2;
+```
Select changes:
-.. code-block:: sql
-
- SELECT
- *
- FROM
- TABLE(
- system.table_changes(
- schema_name => 'test_schema',
- table_name => 'pages',
- since_version => 1
- )
- )
- ORDER BY _commit_version ASC;
+```sql
+SELECT
+ *
+FROM
+ TABLE(
+ system.table_changes(
+ schema_name => 'test_schema',
+ table_name => 'pages',
+ since_version => 1
+ )
+ )
+ORDER BY _commit_version ASC;
+```
The preceding sequence of SQL statements returns the following result:
-.. code-block:: text
-
- page_url | domain | views | _change_type | _commit_version | _commit_timestamp
- url4 | domain1 | 400 | insert | 2 | 2023-03-10T21:22:23.000+0000
- url5 | domain2 | 500 | insert | 2 | 2023-03-10T21:22:23.000+0000
- url6 | domain3 | 2 | insert | 2 | 2023-03-10T21:22:23.000+0000
- url2 | domain2 | 2 | update_preimage | 3 | 2023-03-10T22:23:24.000+0000
- url2 | domain4 | 2 | update_postimage | 3 | 2023-03-10T22:23:24.000+0000
- url6 | domain3 | 2 | update_preimage | 3 | 2023-03-10T22:23:24.000+0000
- url6 | domain4 | 2 | update_postimage | 3 | 2023-03-10T22:23:24.000+0000
+```text
+page_url | domain | views | _change_type | _commit_version | _commit_timestamp
+url4 | domain1 | 400 | insert | 2 | 2023-03-10T21:22:23.000+0000
+url5 | domain2 | 500 | insert | 2 | 2023-03-10T21:22:23.000+0000
+url6 | domain3 | 2 | insert | 2 | 2023-03-10T21:22:23.000+0000
+url2 | domain2 | 2 | update_preimage | 3 | 2023-03-10T22:23:24.000+0000
+url2 | domain4 | 2 | update_postimage | 3 | 2023-03-10T22:23:24.000+0000
+url6 | domain3 | 2 | update_preimage | 3 | 2023-03-10T22:23:24.000+0000
+url6 | domain4 | 2 | update_postimage | 3 | 2023-03-10T22:23:24.000+0000
+```
The output shows which changes happened in which version.
For example, in version 3 two rows were modified: the first changed from
-``('url2', 'domain2', 2)`` into ``('url2', 'domain4', 2)`` and the second from
-``('url6', 'domain2', 2)`` into ``('url6', 'domain4', 2)``.
+`('url2', 'domain2', 2)` into `('url2', 'domain4', 2)` and the second from
+`('url6', 'domain3', 2)` into `('url6', 'domain4', 2)`.
-If ``since_version`` is not provided the function produces change events
+If `since_version` is not provided, the function produces change events
starting from when the table was created.
-.. code-block:: sql
-
- SELECT
- *
- FROM
- TABLE(
- system.table_changes(
- schema_name => 'test_schema',
- table_name => 'pages'
- )
- )
- ORDER BY _commit_version ASC;
+```sql
+SELECT
+ *
+FROM
+ TABLE(
+ system.table_changes(
+ schema_name => 'test_schema',
+ table_name => 'pages'
+ )
+ )
+ORDER BY _commit_version ASC;
+```
The preceding SQL statement returns the following result:
-.. code-block:: text
-
- page_url | domain | views | _change_type | _commit_version | _commit_timestamp
- url1 | domain1 | 1 | insert | 1 | 2023-03-10T20:21:22.000+0000
- url2 | domain2 | 2 | insert | 1 | 2023-03-10T20:21:22.000+0000
- url3 | domain1 | 3 | insert | 1 | 2023-03-10T20:21:22.000+0000
- url4 | domain1 | 400 | insert | 2 | 2023-03-10T21:22:23.000+0000
- url5 | domain2 | 500 | insert | 2 | 2023-03-10T21:22:23.000+0000
- url6 | domain3 | 2 | insert | 2 | 2023-03-10T21:22:23.000+0000
- url2 | domain2 | 2 | update_preimage | 3 | 2023-03-10T22:23:24.000+0000
- url2 | domain4 | 2 | update_postimage | 3 | 2023-03-10T22:23:24.000+0000
- url6 | domain3 | 2 | update_preimage | 3 | 2023-03-10T22:23:24.000+0000
- url6 | domain4 | 2 | update_postimage | 3 | 2023-03-10T22:23:24.000+0000
+```text
+page_url | domain | views | _change_type | _commit_version | _commit_timestamp
+url1 | domain1 | 1 | insert | 1 | 2023-03-10T20:21:22.000+0000
+url2 | domain2 | 2 | insert | 1 | 2023-03-10T20:21:22.000+0000
+url3 | domain1 | 3 | insert | 1 | 2023-03-10T20:21:22.000+0000
+url4 | domain1 | 400 | insert | 2 | 2023-03-10T21:22:23.000+0000
+url5 | domain2 | 500 | insert | 2 | 2023-03-10T21:22:23.000+0000
+url6 | domain3 | 2 | insert | 2 | 2023-03-10T21:22:23.000+0000
+url2 | domain2 | 2 | update_preimage | 3 | 2023-03-10T22:23:24.000+0000
+url2 | domain4 | 2 | update_postimage | 3 | 2023-03-10T22:23:24.000+0000
+url6 | domain3 | 2 | update_preimage | 3 | 2023-03-10T22:23:24.000+0000
+url6 | domain4 | 2 | update_postimage | 3 | 2023-03-10T22:23:24.000+0000
+```
You can see changes that occurred at version 1 as three inserts. They are
-not visible in the previous statement when ``since_version`` value was set to 1.
+not visible in the previous statement when the `since_version` value was set to 1.
-Performance
------------
+## Performance
The connector includes a number of performance improvements detailed in the
following sections:
-* Support for :doc:`write partitioning `.
+- Support for {doc}`write partitioning </admin/properties-write-partitioning>`.
-.. _delta-lake-table-statistics:
+(delta-lake-table-statistics)=
-Table statistics
-^^^^^^^^^^^^^^^^
+### Table statistics
-Use :doc:`/sql/analyze` statements in Trino to populate data size and
+Use {doc}`/sql/analyze` statements in Trino to populate data size and
number of distinct values (NDV) extended table statistics in Delta Lake.
The minimum value, maximum value, value count, and null value count
statistics are computed on the fly out of the transaction log of the
-Delta Lake table. The :doc:`cost-based optimizer
+Delta Lake table. The {doc}`cost-based optimizer
</optimizer/cost-based-optimizations>` then uses these statistics to improve
query performance.
Extended statistics enable a broader set of optimizations, including join
-reordering. The controlling catalog property ``delta.table-statistics-enabled``
-is enabled by default. The equivalent :ref:`catalog session property
-` is ``statistics_enabled``.
-
-Each ``ANALYZE`` statement updates the table statistics incrementally, so only
-the data changed since the last ``ANALYZE`` is counted. The table statistics are
-not automatically updated by write operations such as ``INSERT``, ``UPDATE``,
-and ``DELETE``. You must manually run ``ANALYZE`` again to update the table
+reordering. The controlling catalog property `delta.table-statistics-enabled`
+is enabled by default. The equivalent {ref}`catalog session property
+<session-properties-definition>` is `statistics_enabled`.
+
+Each `ANALYZE` statement updates the table statistics incrementally, so only
+the data changed since the last `ANALYZE` is counted. The table statistics are
+not automatically updated by write operations such as `INSERT`, `UPDATE`,
+and `DELETE`. You must manually run `ANALYZE` again to update the table
statistics.
-To collect statistics for a table, execute the following statement::
+To collect statistics for a table, execute the following statement:
- ANALYZE table_schema.table_name;
+```sql
+ANALYZE table_schema.table_name;
+```
-To recalculate from scratch the statistics for the table use additional parameter ``mode``:
+To recalculate the statistics for the table from scratch, use the additional `mode` parameter:
- ANALYZE table_schema.table_name WITH(mode = 'full_refresh');
+```sql
+ANALYZE table_schema.table_name WITH(mode = 'full_refresh');
+```
-There are two modes available ``full_refresh`` and ``incremental``.
-The procedure use ``incremental`` by default.
+There are two modes available: `full_refresh` and `incremental`.
+The procedure uses `incremental` by default.
-To gain the most benefit from cost-based optimizations, run periodic ``ANALYZE``
+To gain the most benefit from cost-based optimizations, run periodic `ANALYZE`
statements on every large table that is frequently queried.
-Fine-tuning
-"""""""""""
+#### Fine-tuning
-The ``files_modified_after`` property is useful if you want to run the
-``ANALYZE`` statement on a table that was previously analyzed. You can use it to
+The `files_modified_after` property is useful if you want to run the
+`ANALYZE` statement on a table that was previously analyzed. You can use it to
limit the amount of data used to generate the table statistics:
-.. code-block:: SQL
-
- ANALYZE example_table WITH(files_modified_after = TIMESTAMP '2021-08-23
- 16:43:01.321 Z')
+```sql
+ANALYZE example_table WITH(files_modified_after = TIMESTAMP '2021-08-23 16:43:01.321 Z')
+```
As a result, only files newer than the specified time stamp are used in the
analysis.
-You can also specify a set or subset of columns to analyze using the ``columns``
+You can also specify a set or subset of columns to analyze using the `columns`
property:
-.. code-block:: SQL
+```sql
+ANALYZE example_table WITH(columns = ARRAY['nationkey', 'regionkey'])
+```
- ANALYZE example_table WITH(columns = ARRAY['nationkey', 'regionkey'])
-
-To run ``ANALYZE`` with ``columns`` more than once, the next ``ANALYZE`` must
+To run `ANALYZE` with `columns` more than once, the next `ANALYZE` must
run on the same set or a subset of the original columns used.
-To broaden the set of ``columns``, drop the statistics and reanalyze the table.
+To broaden the set of `columns`, drop the statistics and reanalyze the table.
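+
+For example, to broaden the analyzed columns, drop the extended statistics and
+run a full analysis with the larger column set. The catalog, schema, table, and
+column names below are illustrative:
+
+```sql
+CALL example.system.drop_extended_stats('example_schema', 'example_table');
+ANALYZE example.example_schema.example_table
+WITH(columns = ARRAY['nationkey', 'regionkey', 'name']);
+```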
-Disable and drop extended statistics
-""""""""""""""""""""""""""""""""""""
+#### Disable and drop extended statistics
You can disable extended statistics with the catalog configuration property
-``delta.extended-statistics.enabled`` set to ``false``. Alternatively, you can
-disable it for a session, with the :doc:`catalog session property
-` ``extended_statistics_enabled`` set to ``false``.
+`delta.extended-statistics.enabled` set to `false`. Alternatively, you can
+disable it for a session, with the {doc}`catalog session property
+</sql/set-session>` `extended_statistics_enabled` set to `false`.
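+
+For example, the following statement disables extended statistics for the
+current session, assuming a catalog named `example`:
+
+```sql
+SET SESSION example.extended_statistics_enabled = false;
+```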
-If a table is changed with many delete and update operation, calling ``ANALYZE``
+If a table is changed with many delete and update operations, calling `ANALYZE`
does not result in accurate statistics. To correct the statistics, you have to
drop the extended statistics and analyze the table again.
-Use the ``system.drop_extended_stats`` procedure in the catalog to drop the
+Use the `system.drop_extended_stats` procedure in the catalog to drop the
extended statistics for a specified table in a specified schema:
-.. code-block::
-
- CALL example.system.drop_extended_stats('example_schema', 'example_table')
+```sql
+CALL example.system.drop_extended_stats('example_schema', 'example_table')
+```
-Memory usage
-^^^^^^^^^^^^
+### Memory usage
The Delta Lake connector is memory intensive and the amount of required memory
grows with the size of Delta Lake transaction logs of any accessed tables. It is
important to take that into account when provisioning the coordinator.
-You must decrease memory usage by keeping the number of active data files in
-the table low by regularly running ``OPTIMIZE`` and ``VACUUM`` in Delta Lake.
+Decrease memory usage by keeping the number of active data files in the table
+low, which you can achieve by regularly running `OPTIMIZE` and `VACUUM` in
+Delta Lake.
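+
+Both operations can also be run from Trino. A sketch, assuming a catalog named
+`example` and a seven-day retention period:
+
+```sql
+ALTER TABLE example.example_schema.example_table EXECUTE optimize;
+CALL example.system.vacuum('example_schema', 'example_table', '7d');
+```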
-Memory monitoring
-"""""""""""""""""
+#### Memory monitoring
When using the Delta Lake connector, you must monitor memory usage on the
coordinator. Specifically, monitor JVM heap utilization using standard tools as
@@ -1032,54 +1026,56 @@ part of routine operation of the cluster.
A good proxy for memory usage is the cache utilization of Delta Lake caches. It
is exposed by the connector with the
-``plugin.deltalake.transactionlog:name=,type=transactionlogaccess``
+`plugin.deltalake.transactionlog:name=<catalog-name>,type=transactionlogaccess`
JMX bean.
You can access it with any standard monitoring software with JMX support, or use
-the :doc:`/connector/jmx` with the following query::
+the {doc}`/connector/jmx` with the following query:
- SELECT * FROM jmx.current."*.plugin.deltalake.transactionlog:name=,type=transactionlogaccess"
+```sql
+SELECT * FROM jmx.current."*.plugin.deltalake.transactionlog:name=<catalog-name>,type=transactionlogaccess"
+```
Following is an example result:
-.. code-block:: text
+```text
+datafilemetadatacachestats.hitrate | 0.97
+datafilemetadatacachestats.missrate | 0.03
+datafilemetadatacachestats.requestcount | 3232
+metadatacachestats.hitrate | 0.98
+metadatacachestats.missrate | 0.02
+metadatacachestats.requestcount | 6783
+node | trino-master
+object_name | io.trino.plugin.deltalake.transactionlog:type=TransactionLogAccess,name=delta
+```
- datafilemetadatacachestats.hitrate | 0.97
- datafilemetadatacachestats.missrate | 0.03
- datafilemetadatacachestats.requestcount | 3232
- metadatacachestats.hitrate | 0.98
- metadatacachestats.missrate | 0.02
- metadatacachestats.requestcount | 6783
- node | trino-master
- object_name | io.trino.plugin.deltalake.transactionlog:type=TransactionLogAccess,name=delta
+In a healthy system, both `datafilemetadatacachestats.hitrate` and
+`metadatacachestats.hitrate` are close to `1.0`.
-In a healthy system, both ``datafilemetadatacachestats.hitrate`` and
-``metadatacachestats.hitrate`` are close to ``1.0``.
+(delta-lake-table-redirection)=
-.. _delta-lake-table-redirection:
+### Table redirection
-Table redirection
-^^^^^^^^^^^^^^^^^
-
-.. include:: table-redirection.fragment
+```{include} table-redirection.fragment
+```
The connector supports redirection from Delta Lake tables to Hive tables
-with the ``delta.hive-catalog-name`` catalog configuration property.
+with the `delta.hive-catalog-name` catalog configuration property.
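+
+For example, the following catalog property sketch redirects to a Hive catalog
+assumed to be named `hive`:
+
+```text
+delta.hive-catalog-name=hive
+```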
-Performance tuning configuration properties
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Performance tuning configuration properties
The following table describes performance tuning catalog properties for the
connector.
-.. warning::
-
- Performance tuning configuration properties are considered expert-level
- features. Altering these properties from their default values is likely to
- cause instability and performance degradation. It is strongly suggested that
- you use them only to address non-trivial performance issues, and that you
- keep a backup of the original values if you change them.
+:::{warning}
+Performance tuning configuration properties are considered expert-level
+features. Altering these properties from their default values is likely to
+cause instability and performance degradation. It is strongly suggested that
+you use them only to address non-trivial performance issues, and that you
+keep a backup of the original values if you change them.
+:::
+```{eval-rst}
.. list-table:: Delta Lake performance tuning configuration properties
:widths: 30, 50, 20
:header-rows: 1
@@ -1145,3 +1141,4 @@ connector.
the ``query_partition_filter_required`` catalog session property for
temporary, catalog specific use.
- ``false``
+```
diff --git a/docs/src/main/sphinx/connector/hive-alluxio.md b/docs/src/main/sphinx/connector/hive-alluxio.md
new file mode 100644
index 000000000000..594295178119
--- /dev/null
+++ b/docs/src/main/sphinx/connector/hive-alluxio.md
@@ -0,0 +1,16 @@
+# Hive connector with Alluxio
+
+The {doc}`hive` can read and write tables stored in the [Alluxio Data Orchestration
+System](https://www.alluxio.io/),
+leveraging Alluxio's distributed block-level read/write caching functionality.
+The tables must be created in the Hive metastore with the `alluxio://`
+location prefix (see [Running Apache Hive with Alluxio](https://docs.alluxio.io/os/user/stable/en/compute/Hive.html)
+for details and examples).
+
+Trino queries will then transparently retrieve and cache files or objects from
+a variety of disparate storage systems including HDFS and S3.
+
+## Setting up Alluxio with Trino
+
+For information on how to set up, configure, and use Alluxio, refer to [Alluxio's
+documentation on using their platform with Trino](https://docs.alluxio.io/ee/user/stable/en/compute/Trino.html).
diff --git a/docs/src/main/sphinx/connector/hive-alluxio.rst b/docs/src/main/sphinx/connector/hive-alluxio.rst
deleted file mode 100644
index b084daf99970..000000000000
--- a/docs/src/main/sphinx/connector/hive-alluxio.rst
+++ /dev/null
@@ -1,21 +0,0 @@
-===========================
-Hive connector with Alluxio
-===========================
-
-The :doc:`hive` can read and write tables stored in the `Alluxio Data Orchestration
-System `_,
-leveraging Alluxio's distributed block-level read/write caching functionality.
-The tables must be created in the Hive metastore with the ``alluxio://``
-location prefix (see `Running Apache Hive with Alluxio
-`_
-for details and examples).
-
-Trino queries will then transparently retrieve and cache files or objects from
-a variety of disparate storage systems including HDFS and S3.
-
-Setting up Alluxio with Trino
------------------------------
-
-For information on how to setup, configure, and use Alluxio, refer to `Alluxio's
-documentation on using their platform with Trino
-`_.
diff --git a/docs/src/main/sphinx/connector/hive-azure.rst b/docs/src/main/sphinx/connector/hive-azure.md
similarity index 50%
rename from docs/src/main/sphinx/connector/hive-azure.rst
rename to docs/src/main/sphinx/connector/hive-azure.md
index a2af709bf8d8..048d90fcf613 100644
--- a/docs/src/main/sphinx/connector/hive-azure.rst
+++ b/docs/src/main/sphinx/connector/hive-azure.md
@@ -1,22 +1,15 @@
-=================================
-Hive connector with Azure Storage
-=================================
+# Hive connector with Azure Storage
-The :doc:`hive` can be configured to use `Azure Data Lake Storage (Gen2)
-`_. Trino
+The {doc}`hive` can be configured to use [Azure Data Lake Storage (Gen2)](https://azure.microsoft.com/products/storage/data-lake-storage/). Trino
supports Azure Blob File System (ABFS) to access data in ADLS Gen2.
-Trino also supports `ADLS Gen1
-`_
-and Windows Azure Storage Blob driver (WASB), but we recommend `migrating to
-ADLS Gen2
-`_,
+Trino also supports [ADLS Gen1](https://learn.microsoft.com/azure/data-lake-store/data-lake-store-overview)
+and Windows Azure Storage Blob driver (WASB), but we recommend [migrating to
+ADLS Gen2](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-migrate-gen1-to-gen2-azure-portal),
as ADLS Gen1 and WASB are legacy options that will be removed in the future.
-Learn more from `the official documentation
-`_.
+Learn more from [the official documentation](https://docs.microsoft.com/azure/data-lake-store/data-lake-store-overview).
-Hive connector configuration for Azure Storage credentials
-----------------------------------------------------------
+## Hive connector configuration for Azure Storage credentials
To configure Trino to use the Azure Storage credentials, set the following
configuration properties in the catalog properties file. It is best to use this
@@ -26,16 +19,16 @@ The specific configuration depends on the type of storage and uses the
properties from the following sections in the catalog properties file.
For more complex use cases, such as configuring multiple secondary storage
-accounts using Hadoop's ``core-site.xml``, see the
-:ref:`hive-azure-advanced-config` options.
+accounts using Hadoop's `core-site.xml`, see the
+{ref}`hive-azure-advanced-config` options.
-ADLS Gen2 / ABFS storage
-^^^^^^^^^^^^^^^^^^^^^^^^
+### ADLS Gen2 / ABFS storage
To connect to ABFS storage, you may either use the storage account's access
key, or a service principal. Do not use both sets of properties at the
same time.
+```{eval-rst}
.. list-table:: ABFS Access Key
:widths: 30, 70
:header-rows: 1
@@ -46,7 +39,9 @@ same time.
- The name of the ADLS Gen2 storage account
* - ``hive.azure.abfs-access-key``
- The decrypted access key for the ADLS Gen2 storage account
+```
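+
+For example, a minimal catalog file using an access key could look like the
+following sketch; the storage account name and key are placeholders:
+
+```text
+connector.name=hive
+hive.azure.abfs-storage-account=abfsexample
+hive.azure.abfs-access-key=examplekey...
+```
+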
+```{eval-rst}
.. list-table:: ABFS Service Principal OAuth
:widths: 30, 70
:header-rows: 1
@@ -59,28 +54,28 @@ same time.
- The service principal's client/application ID.
* - ``hive.azure.abfs.oauth.secret``
- A client secret for the service principal.
+```
When using a service principal, it must have the Storage Blob Data Owner,
Contributor, or Reader role on the storage account you are using, depending on
which operations you would like to use.
-ADLS Gen1 (legacy)
-^^^^^^^^^^^^^^^^^^
+### ADLS Gen1 (legacy)
While it is advised to migrate to ADLS Gen2 whenever possible, if you still
choose to use ADLS Gen1 you need to include the following properties in your
catalog configuration.
-.. note::
-
- Credentials for the filesystem can be configured using ``ClientCredential``
- type. To authenticate with ADLS Gen1 you must create a new application
- secret for your ADLS Gen1 account's App Registration, and save this value
- because you won't able to retrieve the key later. Refer to the Azure
- `documentation
- `_
- for details.
+:::{note}
+Credentials for the filesystem can be configured using `ClientCredential`
+type. To authenticate with ADLS Gen1 you must create a new application
+secret for your ADLS Gen1 account's App Registration, and save this value
+because you won't be able to retrieve the key later. Refer to the Azure
+[documentation](https://docs.microsoft.com/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory)
+for details.
+:::
+```{eval-rst}
.. list-table:: ADLS properties
:widths: 30, 70
:header-rows: 1
@@ -97,10 +92,11 @@ catalog configuration.
* - ``hive.azure.adl-proxy-host``
- Proxy host and port in ``host:port`` format. Use this property to connect
to an ADLS endpoint via a SOCKS proxy.
+```
-WASB storage (legacy)
-^^^^^^^^^^^^^^^^^^^^^
+### WASB storage (legacy)
+```{eval-rst}
.. list-table:: WASB properties
:widths: 30, 70
:header-rows: 1
@@ -111,54 +107,51 @@ WASB storage (legacy)
- Storage account name of Azure Blob Storage
* - ``hive.azure.wasb-access-key``
- The decrypted access key for the Azure Blob Storage
+```
-.. _hive-azure-advanced-config:
+(hive-azure-advanced-config)=
-Advanced configuration
-^^^^^^^^^^^^^^^^^^^^^^
+### Advanced configuration
All of the configuration properties for the Azure storage driver are stored in
-the Hadoop ``core-site.xml`` configuration file. When there are secondary
+the Hadoop `core-site.xml` configuration file. When there are secondary
storage accounts involved, we recommend configuring Trino using a
-``core-site.xml`` containing the appropriate credentials for each account.
+`core-site.xml` containing the appropriate credentials for each account.
The path to the file must be configured in the catalog properties file:
-.. code-block:: text
-
- hive.config.resources=
+```text
+hive.config.resources=<path_to_core-site.xml>
+```
One way to find your account key is to ask for the connection string for the
-storage account. The ``abfsexample.dfs.core.windows.net`` account refers to the
+storage account. The `abfsexample.dfs.core.windows.net` account refers to the
storage account. The connection string contains the account key:
-.. code-block:: text
+```text
+az storage account show-connection-string --name abfsexample
+{
+ "connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfsexample;AccountKey=examplekey..."
+}
+```
- az storage account show-connection-string --name abfswales1
- {
- "connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfsexample;AccountKey=examplekey..."
- }
-
-When you have the account access key, you can add it to your ``core-site.xml``
+When you have the account access key, you can add it to your `core-site.xml`
or Java cryptography extension (JCEKS) file. Alternatively, you can have your
cluster management tool set the option
-``fs.azure.account.key.STORAGE-ACCOUNT`` to the account key value:
-
-.. code-block:: text
+`fs.azure.account.key.STORAGE-ACCOUNT` to the account key value:
-
-   <property>
-      <name>fs.azure.account.key.abfsexample.dfs.core.windows.net</name>
-      <value>examplekey...</value>
-   </property>
-
+```text
+<property>
+  <name>fs.azure.account.key.abfsexample.dfs.core.windows.net</name>
+  <value>examplekey...</value>
+</property>
+```
-For more information, see `Hadoop Azure Support: ABFS
-`_.
+For more information, see [Hadoop Azure Support: ABFS](https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html).
-Accessing Azure Storage data
-----------------------------
+## Accessing Azure Storage data
-URI scheme to reference data
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### URI scheme to reference data
Consistent with other FileSystem implementations within Hadoop, the Azure
Standard Blob and Azure Data Lake Storage Gen2 (ABFS) drivers define their own
@@ -169,86 +162,89 @@ different systems.
ABFS URI:
-.. code-block:: text
-
- abfs[s]://@.dfs.core.windows.net///
+```text
+abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
+```
ADLS Gen1 URI:
-.. code-block:: text
-
- adl://.azuredatalakestore.net//
+```text
+adl://<data_lake_storage_gen1_name>.azuredatalakestore.net/<path>/<file_name>
+```
Azure Standard Blob URI:
-.. code-block:: text
+```text
+wasb[s]://<container>@<account_name>.blob.core.windows.net/<path>/<file_name>
+```
- wasb[s]://@.blob.core.windows.net///
-
-Querying Azure Storage
-^^^^^^^^^^^^^^^^^^^^^^
+### Querying Azure Storage
You can query tables already configured in your Hive metastore used in your Hive
catalog. To access Azure Storage data that is not yet mapped in the Hive
metastore, you need to provide the schema of the data, the file format, and the
data location.
-For example, if you have ORC or Parquet files in an ABFS ``file_system``, you
-need to execute a query::
+For example, if you have ORC or Parquet files in an ABFS `file_system`, you
+need to execute a query:
- -- select schema in which the table is to be defined, must already exist
- USE hive.default;
+```sql
+-- select schema in which the table is to be defined, must already exist
+USE hive.default;
- -- create table
- CREATE TABLE orders (
- orderkey BIGINT,
- custkey BIGINT,
- orderstatus VARCHAR(1),
- totalprice DOUBLE,
- orderdate DATE,
- orderpriority VARCHAR(15),
- clerk VARCHAR(15),
- shippriority INTEGER,
- comment VARCHAR(79)
- ) WITH (
- external_location = 'abfs[s]://@.dfs.core.windows.net///',
- format = 'ORC' -- or 'PARQUET'
- );
+-- create table
+CREATE TABLE orders (
+ orderkey BIGINT,
+ custkey BIGINT,
+ orderstatus VARCHAR(1),
+ totalprice DOUBLE,
+ orderdate DATE,
+ orderpriority VARCHAR(15),
+ clerk VARCHAR(15),
+ shippriority INTEGER,
+ comment VARCHAR(79)
+) WITH (
+ external_location = 'abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/',
+ format = 'ORC' -- or 'PARQUET'
+);
+```
-Now you can query the newly mapped table::
+Now you can query the newly mapped table:
- SELECT * FROM orders;
+```sql
+SELECT * FROM orders;
+```
-Writing data
-------------
+## Writing data
-Prerequisites
-^^^^^^^^^^^^^
+### Prerequisites
Before you attempt to write data to Azure Storage, make sure you have configured
everything necessary to read data from the storage.
-Create a write schema
-^^^^^^^^^^^^^^^^^^^^^
+### Create a write schema
If the Hive metastore contains schema(s) mapped to Azure storage filesystems,
you can use them to write data to Azure storage.
If you don't want to use existing schemas, or there are no appropriate schemas
-in the Hive metastore, you need to create a new one::
+in the Hive metastore, you need to create a new one:
- CREATE SCHEMA hive.abfs_export
- WITH (location = 'abfs[s]://file_system@account_name.dfs.core.windows.net/');
+```sql
+CREATE SCHEMA hive.abfs_export
+WITH (location = 'abfs[s]://file_system@account_name.dfs.core.windows.net/');
+```
-Write data to Azure Storage
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Write data to Azure Storage
Once you have a schema pointing to a location where you want to write the data,
-you can issue a ``CREATE TABLE AS`` statement and select your desired file
+you can issue a `CREATE TABLE AS` statement and select your desired file
format. The data will be written to one or more files within the
-``abfs[s]://file_system@account_name.dfs.core.windows.net//my_table``
-namespace. Example::
-
- CREATE TABLE hive.abfs_export.orders_abfs
- WITH (format = 'ORC')
- AS SELECT * FROM tpch.sf1.orders;
+`abfs[s]://file_system@account_name.dfs.core.windows.net//my_table`
+namespace. Example:
+
+```sql
+CREATE TABLE hive.abfs_export.orders_abfs
+WITH (format = 'ORC')
+AS SELECT * FROM tpch.sf1.orders;
+```
diff --git a/docs/src/main/sphinx/connector/hive-caching.rst b/docs/src/main/sphinx/connector/hive-caching.md
similarity index 84%
rename from docs/src/main/sphinx/connector/hive-caching.rst
rename to docs/src/main/sphinx/connector/hive-caching.md
index 8cdb76a9f0f9..edc3cf3f5048 100644
--- a/docs/src/main/sphinx/connector/hive-caching.rst
+++ b/docs/src/main/sphinx/connector/hive-caching.md
@@ -1,16 +1,13 @@
-==============================
-Hive connector storage caching
-==============================
+# Hive connector storage caching
-Querying object storage with the :doc:`/connector/hive` is a
+Querying object storage with the {doc}`/connector/hive` is a
very common use case for Trino. It often involves the transfer of large amounts
of data. The objects are retrieved from HDFS, or any other supported object
storage, by multiple workers and processed on these workers. Repeated queries
with different parameters, or even different queries from different users, often
access, and therefore transfer, the same objects.
-Benefits
---------
+## Benefits
Enabling caching can result in significant benefits:
@@ -53,8 +50,7 @@ significantly reduced network traffic. Network traffic however is a considerable
cost factor in a setup, especially when hosted in public cloud provider
systems.
-Architecture
-------------
+## Architecture
Caching can operate in two modes. The async mode provides the queried data
directly and caches any objects asynchronously afterwards. Async is the default
@@ -75,17 +71,16 @@ storage.
The cache chunks are 1MB in size and are well suited for ORC or Parquet file
formats.
-Configuration
--------------
+## Configuration
-The caching feature is part of the :doc:`/connector/hive` and
+The caching feature is part of the {doc}`/connector/hive` and
can be activated in the catalog properties file:
-.. code-block:: text
-
- connector.name=hive
- hive.cache.enabled=true
- hive.cache.location=/opt/hive-cache
+```text
+connector.name=hive
+hive.cache.enabled=true
+hive.cache.location=/opt/hive-cache
+```
The cache operates on the coordinator and all workers accessing the object
storage. The used networking ports for the managing BookKeeper and the data
@@ -94,6 +89,7 @@ transfer, by default 8898 and 8899, need to be available.
To use caching on multiple catalogs, you need to configure different caching
directories and different BookKeeper and data-transfer ports.
+```{eval-rst}
.. list-table:: **Cache Configuration Parameters**
:widths: 25, 63, 12
:header-rows: 1
@@ -128,11 +124,11 @@ directories and different BookKeeper and data-transfer ports.
* - ``hive.cache.disk-usage-percentage``
- Percentage of disk space used for cached data.
- 80
+```
-.. _hive-cache-recommendations:
+(hive-cache-recommendations)=
-Recommendations
----------------
+## Recommendations
The speed of the local cache storage is crucial to the performance of the cache.
The most common and cost efficient approach is to attach high performance SSD
@@ -157,45 +153,41 @@ caching. Typically you need to connect a fast storage system, like an SSD drive,
and ensure that it is mounted on the configured path. Kubernetes, CFT, and other
systems allow this via volumes.
-Object storage systems
-----------------------
+## Object storage systems
The following object storage systems are tested:
-* HDFS
-* :doc:`Amazon S3 and S3-compatible systems `
-* :doc:`Azure storage systems `
-* Google Cloud Storage
+- HDFS
+- {doc}`Amazon S3 and S3-compatible systems </connector/hive-s3>`
+- {doc}`Azure storage systems </connector/hive-azure>`
+- Google Cloud Storage
-Metrics
--------
+## Metrics
To verify how caching works on your system, you can take multiple
approaches:
-* Inspect the disk usage on the cache storage drives on all nodes
-* Query the metrics of the caching system exposed by JMX
+- Inspect the disk usage on the cache storage drives on all nodes
+- Query the metrics of the caching system exposed by JMX
-The implementation of the cache exposes a `number of metrics
-`_ via JMX. You can
-:doc:`inspect these and other metrics directly in Trino with the JMX connector
+The implementation of the cache exposes a [number of metrics](https://rubix.readthedocs.io/en/latest/metrics.html) via JMX. You can
+{doc}`inspect these and other metrics directly in Trino with the JMX connector
or in external tools </admin/jmx>`.
Basic caching statistics for the catalog are available in the
-``jmx.current."rubix:catalog=,name=stats"`` table.
-The table ``jmx.current."rubix:catalog=,type=detailed,name=stats``
+`jmx.current."rubix:catalog=<catalog_name>,name=stats"` table.
+The table `jmx.current."rubix:catalog=<catalog_name>,type=detailed,name=stats"`
contains more detailed statistics.
-The following example query returns the average cache hit ratio for the ``hive`` catalog:
-
-.. code-block:: sql
+The following example query returns the average cache hit ratio for the `hive` catalog:
- SELECT avg(cache_hit)
- FROM jmx.current."rubix:catalog=hive,name=stats"
- WHERE NOT is_nan(cache_hit);
+```sql
+SELECT avg(cache_hit)
+FROM jmx.current."rubix:catalog=hive,name=stats"
+WHERE NOT is_nan(cache_hit);
+```
-Limitations
------------
+## Limitations
Caching does not support user impersonation and cannot be used with HDFS secured by Kerberos.
It does not take any user-specific access rights to the object storage into account.
diff --git a/docs/src/main/sphinx/connector/hive-cos.md b/docs/src/main/sphinx/connector/hive-cos.md
new file mode 100644
index 000000000000..b9b9a83e75a0
--- /dev/null
+++ b/docs/src/main/sphinx/connector/hive-cos.md
@@ -0,0 +1,98 @@
+# Hive connector with IBM Cloud Object Storage
+
+Configure the {doc}`hive` to support [IBM Cloud Object Storage COS](https://www.ibm.com/cloud/object-storage) access.
+
+## Configuration
+
+To use COS, you need to configure a catalog file to use the Hive
+connector. For example, create a file `etc/ibmcos.properties` and
+specify the path to the COS service config file with the
+`hive.cos.service-config` property.
+
+```properties
+connector.name=hive
+hive.cos.service-config=etc/cos-service.properties
+```
+
+The service configuration file contains the access and secret keys, as well as
+the endpoints for one or multiple COS services:
+
+```properties
+service1.access-key=
+service1.secret-key=
+service1.endpoint=
+service2.access-key=
+service2.secret-key=
+service2.endpoint=
+```
+
+The `endpoint` property is optional. `service1` and `service2` are
+placeholders for unique COS service names. The granularity for providing access
+credentials is at the COS service level.
+
+To use an IBM COS service, specify the service name, for example `service1`,
+in the COS path. The general URI path pattern is
+`cos://<bucket>.<service>/object(s)`.
+
+```
+cos://example-bucket.service1/orders_tiny
+```
+
+Trino translates the COS path and uses the `service1` endpoint and
+credentials from `cos-service.properties` to access
+`cos://example-bucket.service1/object`.
+
+By default, the Hive Metastore (HMS) does not support the IBM COS filesystem.
+The [Stocator library](https://github.com/CODAIT/stocator) is a possible
+solution to this problem. Download the [Stocator JAR](https://repo1.maven.org/maven2/com/ibm/stocator/stocator/1.1.4/stocator-1.1.4.jar),
+and add it to the Hadoop classpath. The [Stocator IBM COS configuration](https://github.com/CODAIT/stocator#reference-stocator-in-the-core-sitexml)
+must be placed in `core-site.xml`. For example:
+
+```xml
+<property>
+    <name>fs.stocator.scheme.list</name>
+    <value>cos</value>
+</property>
+<property>
+    <name>fs.cos.impl</name>
+    <value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
+</property>
+<property>
+    <name>fs.stocator.cos.impl</name>
+    <value>com.ibm.stocator.fs.cos.COSAPIClient</value>
+</property>
+<property>
+    <name>fs.stocator.cos.scheme</name>
+    <value>cos</value>
+</property>
+<property>
+    <name>fs.cos.service1.endpoint</name>
+    <value>http://s3.eu-de.cloud-object-storage.appdomain.cloud</value>
+</property>
+<property>
+    <name>fs.cos.service1.access.key</name>
+    <value>access-key</value>
+</property>
+<property>
+    <name>fs.cos.service1.secret.key</name>
+    <value>secret-key</value>
+</property>
+```
+
+## Alternative configuration using S3 compatibility
+
+Use the S3 properties for the Hive connector in the catalog file. If only one
+IBM Cloud Object Storage endpoint is used, then the configuration can be
+simplified:
+
+```properties
+hive.s3.endpoint=http://s3.eu-de.cloud-object-storage.appdomain.cloud
+hive.s3.aws-access-key=access-key
+hive.s3.aws-secret-key=secret-key
+```
+
+Use the `s3` protocol instead of `cos` for the table location:
+
+```
+s3://example-bucket/object/
+```
diff --git a/docs/src/main/sphinx/connector/hive-cos.rst b/docs/src/main/sphinx/connector/hive-cos.rst
deleted file mode 100644
index d877da076954..000000000000
--- a/docs/src/main/sphinx/connector/hive-cos.rst
+++ /dev/null
@@ -1,105 +0,0 @@
-============================================
-Hive connector with IBM Cloud Object Storage
-============================================
-
-Configure the :doc:`hive` to support `IBM Cloud Object Storage COS
-`_ access.
-
-Configuration
--------------
-
-To use COS, you need to configure a catalog file to use the Hive
-connector. For example, create a file ``etc/ibmcos.properties`` and
-specify the path to the COS service config file with the
-``hive.cos.service-config`` property.
-
-.. code-block:: properties
-
- connector.name=hive
- hive.cos.service-config=etc/cos-service.properties
-
-The service configuration file contains the access and secret keys, as well as
-the endpoints for one or multiple COS services:
-
-.. code-block:: properties
-
- service1.access-key=
- service1.secret-key=
- service1.endpoint=
- service2.access-key=
- service2.secret-key=
- service2.endpoint=
-
-The endpoints property is optional. ``service1`` and ``service2`` are
-placeholders for unique COS service names. The granularity for providing access
-credentials is at the COS service level.
-
-To use IBM COS service, specify the service name, for example: ``service1`` in
-the COS path. The general URI path pattern is
-``cos://./object(s)``.
-
-.. code-block::
-
- cos://example-bucket.service1/orders_tiny
-
-Trino translates the COS path, and uses the ``service1`` endpoint and
-credentials from ``cos-service.properties`` to access
-``cos://example-bucket.service1/object``.
-
-The Hive Metastore (HMS) does not support the IBM COS filesystem, by default.
-The `Stocator library `_ is a possible
-solution to this problem. Download the `Stocator JAR
-`_,
-and place it in Hadoop PATH. The `Stocator IBM COS configuration
-`_
-should be placed in ``core-site.xml``. For example:
-
-.. code-block::
-
-   <property>
-      <name>fs.stocator.scheme.list</name>
-      <value>cos</value>
-   </property>
-   <property>
-      <name>fs.cos.impl</name>
-      <value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
-   </property>
-   <property>
-      <name>fs.stocator.cos.impl</name>
-      <value>com.ibm.stocator.fs.cos.COSAPIClient</value>
-   </property>
-   <property>
-      <name>fs.stocator.cos.scheme</name>
-      <value>cos</value>
-   </property>
-   <property>
-      <name>fs.cos.service1.endpoint</name>
-      <value>http://s3.eu-de.cloud-object-storage.appdomain.cloud</value>
-   </property>
-   <property>
-      <name>fs.cos.service1.access.key</name>
-      <value>access-key</value>
-   </property>
-   <property>
-      <name>fs.cos.service1.secret.key</name>
-      <value>secret-key</value>
-   </property>
-
-Alternative configuration using S3 compatibility
-------------------------------------------------
-
-Use the S3 properties for the Hive connector in the catalog file. If only one
-IBM Cloud Object Storage endpoint is used, then the configuration can be
-simplified:
-
-.. code-block::
-
- hive.s3.endpoint=http://s3.eu-de.cloud-object-storage.appdomain.cloud
- hive.s3.aws-access-key=access-key
- hive.s3.aws-secret-key=secret-key
-
-Use ``s3`` protocol instead of ``cos`` for the table location:
-
-.. code-block::
-
- s3://example-bucket/object/
diff --git a/docs/src/main/sphinx/connector/hive-gcs-tutorial.md b/docs/src/main/sphinx/connector/hive-gcs-tutorial.md
new file mode 100644
index 000000000000..3c5c3a9fa5a6
--- /dev/null
+++ b/docs/src/main/sphinx/connector/hive-gcs-tutorial.md
@@ -0,0 +1,81 @@
+# Google Cloud Storage
+
+Object storage connectors can access
+[Google Cloud Storage](https://cloud.google.com/storage/) data using the
+`gs://` URI prefix.
+
+## Requirements
+
+To use Google Cloud Storage with non-anonymous access objects, you need:
+
+- A [Google Cloud service account](https://console.cloud.google.com/projectselector2/iam-admin/serviceaccounts)
+- The key for the service account in JSON format
+
+(hive-google-cloud-storage-configuration)=
+
+## Configuration
+
+The use of Google Cloud Storage as a storage location for an object storage
+catalog requires setting a configuration property that defines the
+[authentication method for any non-anonymous access object](https://cloud.google.com/storage/docs/authentication). Access methods cannot
+be combined.
+
+The default root path used by the `gs://` prefix is set in the catalog by the
+contents of the specified key file, or the key file used to create the OAuth
+token.
+
+```{eval-rst}
+.. list-table:: Google Cloud Storage configuration properties
+ :widths: 35, 65
+ :header-rows: 1
+
+ * - Property Name
+ - Description
+ * - ``hive.gcs.json-key-file-path``
+ - JSON key file used to authenticate your Google Cloud service account
+ with Google Cloud Storage.
+ * - ``hive.gcs.use-access-token``
+ - Use client-provided OAuth token to access Google Cloud Storage.
+```
+
+The following example uses the Delta Lake connector in a minimal configuration
+file for an object storage catalog with a JSON key file:
+
+```properties
+connector.name=delta_lake
+hive.metastore.uri=thrift://example.net:9083
+hive.gcs.json-key-file-path=${ENV:GCP_CREDENTIALS_FILE_PATH}
+```
+
+## General usage
+
+Create a schema to use if one does not already exist, as in the following
+example:
+
+```sql
+CREATE SCHEMA storage_catalog.sales_data_in_gcs WITH (location = 'gs://example_location');
+```
+
+Once you have created a schema, you can create tables in the schema, as in the
+following example:
+
+```sql
+CREATE TABLE storage_catalog.sales_data_in_gcs.orders (
+ orderkey BIGINT,
+ custkey BIGINT,
+ orderstatus VARCHAR(1),
+ totalprice DOUBLE,
+ orderdate DATE,
+ orderpriority VARCHAR(15),
+ clerk VARCHAR(15),
+ shippriority INTEGER,
+ comment VARCHAR(79)
+);
+```
+
+This statement creates the folder `orders` under the `gs://example_location`
+location defined for the schema.
+
+Your table is now ready to populate with data using `INSERT` statements.
+Alternatively, you can use `CREATE TABLE AS` statements to create and
+populate the table in a single statement.
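+
+As a sketch, a `CREATE TABLE AS` statement creates and populates a table in
+one step; the target table name below is hypothetical:
+
+```sql
+CREATE TABLE storage_catalog.sales_data_in_gcs.orders_copy
+AS SELECT * FROM storage_catalog.sales_data_in_gcs.orders;
+```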
diff --git a/docs/src/main/sphinx/connector/hive-gcs-tutorial.rst b/docs/src/main/sphinx/connector/hive-gcs-tutorial.rst
deleted file mode 100644
index 6f03c76f1f3c..000000000000
--- a/docs/src/main/sphinx/connector/hive-gcs-tutorial.rst
+++ /dev/null
@@ -1,84 +0,0 @@
-Google Cloud Storage
-====================
-
-Object storage connectors can access
-`Google Cloud Storage `_ data using the
-``gs://`` URI prefix.
-
-Requirements
--------------
-
-To use Google Cloud Storage with non-anonymous access objects, you need:
-
-* A `Google Cloud service account `_
-* The key for the service account in JSON format
-
-.. _hive-google-cloud-storage-configuration:
-
-Configuration
--------------
-
-The use of Google Cloud Storage as a storage location for an object storage
-catalog requires setting a configuration property that defines the
-`authentication method for any non-anonymous access object
-`_. Access methods cannot
-be combined.
-
-The default root path used by the ``gs:\\`` prefix is set in the catalog by the
-contents of the specified key file, or the key file used to create the OAuth
-token.
-
-.. list-table:: Google Cloud Storage configuration properties
- :widths: 35, 65
- :header-rows: 1
-
- * - Property Name
- - Description
- * - ``hive.gcs.json-key-file-path``
- - JSON key file used to authenticate your Google Cloud service account
- with Google Cloud Storage.
- * - ``hive.gcs.use-access-token``
- - Use client-provided OAuth token to access Google Cloud Storage.
-
-The following uses the Delta Lake connector in an example of a minimal
-configuration file for an object storage catalog using a JSON key file:
-
-.. code-block:: properties
-
- connector.name=delta_lake
- hive.metastore.uri=thrift://example.net:9083
- hive.gcs.json-key-file-path=${ENV:GCP_CREDENTIALS_FILE_PATH}
-
-General usage
--------------
-
-Create a schema to use if one does not already exist, as in the following
-example:
-
-.. code-block:: sql
-
- CREATE SCHEMA storage_catalog.sales_data_in_gcs WITH (location = 'gs://example_location');
-
-Once you have created a schema, you can create tables in the schema, as in the
-following example:
-
-.. code-block:: sql
-
- CREATE TABLE storage_catalog.sales_data_in_gcs.orders (
- orderkey BIGINT,
- custkey BIGINT,
- orderstatus VARCHAR(1),
- totalprice DOUBLE,
- orderdate DATE,
- orderpriority VARCHAR(15),
- clerk VARCHAR(15),
- shippriority INTEGER,
- comment VARCHAR(79)
- );
-
-This statement creates the folder ``gs://sales_data_in_gcs/orders`` in the root
-folder defined in the JSON key file.
-
-Your table is now ready to populate with data using ``INSERT`` statements.
-Alternatively, you can use ``CREATE TABLE AS`` statements to create and
-populate the table in a single statement.
\ No newline at end of file
diff --git a/docs/src/main/sphinx/connector/hive-s3.rst b/docs/src/main/sphinx/connector/hive-s3.md
similarity index 51%
rename from docs/src/main/sphinx/connector/hive-s3.rst
rename to docs/src/main/sphinx/connector/hive-s3.md
index 028235f27b29..70c11a6b8219 100644
--- a/docs/src/main/sphinx/connector/hive-s3.rst
+++ b/docs/src/main/sphinx/connector/hive-s3.md
@@ -1,20 +1,18 @@
-=============================
-Hive connector with Amazon S3
-=============================
+# Hive connector with Amazon S3
-The :doc:`hive` can read and write tables that are stored in
-`Amazon S3 `_ or S3-compatible systems.
+The {doc}`hive` can read and write tables that are stored in
+[Amazon S3](https://aws.amazon.com/s3/) or S3-compatible systems.
This is accomplished by having a table or database location that
uses an S3 prefix, rather than an HDFS prefix.
Trino uses its own S3 filesystem for the URI prefixes
-``s3://``, ``s3n://`` and ``s3a://``.
+`s3://`, `s3n://` and `s3a://`.
-.. _hive-s3-configuration:
+(hive-s3-configuration)=
-S3 configuration properties
----------------------------
+## S3 configuration properties
+```{eval-rst}
.. list-table::
:widths: 35, 65
:header-rows: 1
@@ -115,45 +113,42 @@ S3 configuration properties
* - ``hive.s3.sts.region``
- Optional override for the sts region given that IAM role based
authentication via sts is used.
+```
-.. _hive-s3-credentials:
+(hive-s3-credentials)=
-S3 credentials
---------------
+## S3 credentials
If you are running Trino on Amazon EC2, using EMR or another facility,
it is recommended that you use IAM Roles for EC2 to govern access to S3.
To enable this, your EC2 instances need to be assigned an IAM Role which
grants appropriate access to the data stored in the S3 bucket(s) you wish
-to use. It is also possible to configure an IAM role with ``hive.s3.iam-role``
+to use. It is also possible to configure an IAM role with `hive.s3.iam-role`
that is used for accessing any S3 bucket. This is much cleaner than
-setting AWS access and secret keys in the ``hive.s3.aws-access-key``
-and ``hive.s3.aws-secret-key`` settings, and also allows EC2 to automatically
+setting AWS access and secret keys in the `hive.s3.aws-access-key`
+and `hive.s3.aws-secret-key` settings, and also allows EC2 to automatically
rotate credentials on a regular basis without any additional work on your part.
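+
+As a minimal sketch, a catalog file that relies on a cluster-wide IAM role
+could look like the following; the metastore URI and role ARN are placeholders:
+
+```properties
+connector.name=hive
+hive.metastore.uri=thrift://example.net:9083
+hive.s3.iam-role=arn:aws:iam::123456789101:role/example-s3-role
+```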
-Custom S3 credentials provider
-------------------------------
+## Custom S3 credentials provider
You can configure a custom S3 credentials provider by setting the configuration
-property ``trino.s3.credentials-provider`` to the fully qualified class name of
+property `trino.s3.credentials-provider` to the fully qualified class name of
a custom AWS credentials provider implementation. The property must be set in
-the Hadoop configuration files referenced by the ``hive.config.resources`` Hive
+the Hadoop configuration files referenced by the `hive.config.resources` Hive
connector property.
The class must implement the
-`AWSCredentialsProvider `_
+[AWSCredentialsProvider](http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/AWSCredentialsProvider.html)
interface and provide a two-argument constructor that takes a
-``java.net.URI`` and a Hadoop ``org.apache.hadoop.conf.Configuration``
+`java.net.URI` and a Hadoop `org.apache.hadoop.conf.Configuration`
as arguments. A custom credentials provider can be used to provide
-temporary credentials from STS (using ``STSSessionCredentialsProvider``),
-IAM role-based credentials (using ``STSAssumeRoleSessionCredentialsProvider``),
+temporary credentials from STS (using `STSSessionCredentialsProvider`),
+IAM role-based credentials (using `STSAssumeRoleSessionCredentialsProvider`),
or credentials for a specific use case (for example, bucket- or user-specific credentials).
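+
+A custom provider is a plain Java class built against the AWS SDK for Java v1
+and the Hadoop API. The following sketch shows the required shape, namely the
+two-argument constructor and the `AWSCredentialsProvider` methods; the
+configuration property names it reads are hypothetical:
+
+```java
+import java.net.URI;
+
+import com.amazonaws.auth.AWSCredentials;
+import com.amazonaws.auth.AWSCredentialsProvider;
+import com.amazonaws.auth.BasicAWSCredentials;
+import org.apache.hadoop.conf.Configuration;
+
+public class ExampleCredentialsProvider
+        implements AWSCredentialsProvider
+{
+    private final String accessKey;
+    private final String secretKey;
+
+    // Trino instantiates the provider with the S3 URI and the Hadoop configuration
+    public ExampleCredentialsProvider(URI uri, Configuration conf)
+    {
+        // Hypothetical property names; read whatever your key management system needs
+        this.accessKey = conf.get("example.s3.access-key");
+        this.secretKey = conf.get("example.s3.secret-key");
+    }
+
+    @Override
+    public AWSCredentials getCredentials()
+    {
+        return new BasicAWSCredentials(accessKey, secretKey);
+    }
+
+    @Override
+    public void refresh()
+    {
+        // No-op: this sketch returns static credentials
+    }
+}
+```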
+(hive-s3-security-mapping)=
-.. _hive-s3-security-mapping:
-
-S3 security mapping
--------------------
+## S3 security mapping
Trino supports flexible security mapping for S3, allowing for separate
credentials or IAM roles for specific users or buckets/paths. The IAM role
@@ -163,206 +158,171 @@ it as an *extra credential*.
Each security mapping entry may specify one or more match criteria. If multiple
criteria are specified, all criteria must match. Available match criteria:
-* ``user``: Regular expression to match against username. Example: ``alice|bob``
-
-* ``group``: Regular expression to match against any of the groups that the user
- belongs to. Example: ``finance|sales``
-
-* ``prefix``: S3 URL prefix. It can specify an entire bucket or a path within a
- bucket. The URL must start with ``s3://`` but will also match ``s3a`` or ``s3n``.
- Example: ``s3://bucket-name/abc/xyz/``
+- `user`: Regular expression to match against username. Example: `alice|bob`
+- `group`: Regular expression to match against any of the groups that the user
+ belongs to. Example: `finance|sales`
+- `prefix`: S3 URL prefix. It can specify an entire bucket or a path within a
+ bucket. The URL must start with `s3://` but will also match `s3a` or `s3n`.
+ Example: `s3://bucket-name/abc/xyz/`
The security mapping must provide one or more configuration settings:
-* ``accessKey`` and ``secretKey``: AWS access key and secret key. This overrides
+- `accessKey` and `secretKey`: AWS access key and secret key. This overrides
any globally configured credentials, such as access key or instance credentials.
-
-* ``iamRole``: IAM role to use if no user provided role is specified as an
+- `iamRole`: IAM role to use if no user provided role is specified as an
extra credential. This overrides any globally configured IAM role. This role
is allowed to be specified as an extra credential, although specifying it
explicitly has no effect, as it would be used anyway.
-
-* ``roleSessionName``: Optional role session name to use with ``iamRole``. This can only
- be used when ``iamRole`` is specified. If ``roleSessionName`` includes the string
- ``${USER}``, then the ``${USER}`` portion of the string will be replaced with the
- current session's username. If ``roleSessionName`` is not specified, it defaults
- to ``trino-session``.
-
-* ``allowedIamRoles``: IAM roles that are allowed to be specified as an extra
+- `roleSessionName`: Optional role session name to use with `iamRole`. This can only
+ be used when `iamRole` is specified. If `roleSessionName` includes the string
+ `${USER}`, then the `${USER}` portion of the string will be replaced with the
+ current session's username. If `roleSessionName` is not specified, it defaults
+ to `trino-session`.
+- `allowedIamRoles`: IAM roles that are allowed to be specified as an extra
credential. This is useful because a particular AWS account may have permissions
to use many roles, but a specific user should only be allowed to use a subset
of those roles.
-
-* ``kmsKeyId``: ID of KMS-managed key to be used for client-side encryption.
-
-* ``allowedKmsKeyIds``: KMS-managed key IDs that are allowed to be specified as an extra
- credential. If list cotains "*", then any key can be specified via extra credential.
+- `kmsKeyId`: ID of KMS-managed key to be used for client-side encryption.
+- `allowedKmsKeyIds`: KMS-managed key IDs that are allowed to be specified as an extra
+  credential. If the list contains `*`, then any key can be specified via extra credential.
The security mapping entries are processed in the order listed in the configuration
JSON. More specific mappings should thus be specified before less specific mappings.
-For example, the mapping list might have URL prefix ``s3://abc/xyz/`` followed by
-``s3://abc/`` to allow different configuration for a specific path within a bucket
+For example, the mapping list might have URL prefix `s3://abc/xyz/` followed by
+`s3://abc/` to allow different configuration for a specific path within a bucket
than for other paths within the bucket. You can set a default configuration by not
including any match criteria for the last entry in the list.
In addition to the rules above, the default mapping can contain the optional
-``useClusterDefault`` boolean property with the following behavior:
+`useClusterDefault` boolean property with the following behavior:
-- ``false`` - (is set by default) property is ignored.
-- ``true`` - This causes the default cluster role to be used as a fallback option.
+- `false` - (set by default) The property is ignored.
+
+- `true` - This causes the default cluster role to be used as a fallback option.
  It cannot be used with the following configuration properties:
- - ``accessKey``
- - ``secretKey``
- - ``iamRole``
- - ``allowedIamRoles``
+ - `accessKey`
+ - `secretKey`
+ - `iamRole`
+ - `allowedIamRoles`
If no mapping entry matches and no default is configured, access is denied.
The configuration JSON can be retrieved from a file or a REST endpoint, specified via
-``hive.s3.security-mapping.config-file``.
+`hive.s3.security-mapping.config-file`.
Example JSON configuration:
-.. code-block:: json
-
+```json
+{
+ "mappings": [
{
- "mappings": [
- {
- "prefix": "s3://bucket-name/abc/",
- "iamRole": "arn:aws:iam::123456789101:role/test_path"
- },
- {
- "user": "bob|charlie",
- "iamRole": "arn:aws:iam::123456789101:role/test_default",
- "allowedIamRoles": [
- "arn:aws:iam::123456789101:role/test1",
- "arn:aws:iam::123456789101:role/test2",
- "arn:aws:iam::123456789101:role/test3"
- ]
- },
- {
- "prefix": "s3://special-bucket/",
- "accessKey": "AKIAxxxaccess",
- "secretKey": "iXbXxxxsecret"
- },
- {
- "prefix": "s3://encrypted-bucket/",
- "kmsKeyId": "kmsKey_10",
- },
- {
- "user": "test.*",
- "iamRole": "arn:aws:iam::123456789101:role/test_users"
- },
- {
- "group": "finance",
- "iamRole": "arn:aws:iam::123456789101:role/finance_users"
- },
- {
- "iamRole": "arn:aws:iam::123456789101:role/default"
- }
+ "prefix": "s3://bucket-name/abc/",
+ "iamRole": "arn:aws:iam::123456789101:role/test_path"
+ },
+ {
+ "user": "bob|charlie",
+ "iamRole": "arn:aws:iam::123456789101:role/test_default",
+ "allowedIamRoles": [
+ "arn:aws:iam::123456789101:role/test1",
+ "arn:aws:iam::123456789101:role/test2",
+ "arn:aws:iam::123456789101:role/test3"
]
+ },
+ {
+ "prefix": "s3://special-bucket/",
+ "accessKey": "AKIAxxxaccess",
+ "secretKey": "iXbXxxxsecret"
+ },
+ {
+ "prefix": "s3://encrypted-bucket/",
+    "kmsKeyId": "kmsKey_10"
+ },
+ {
+ "user": "test.*",
+ "iamRole": "arn:aws:iam::123456789101:role/test_users"
+ },
+ {
+ "group": "finance",
+ "iamRole": "arn:aws:iam::123456789101:role/finance_users"
+ },
+ {
+ "iamRole": "arn:aws:iam::123456789101:role/default"
}
+ ]
+}
+```
-======================================================= =================================================================
-Property name Description
-======================================================= =================================================================
-``hive.s3.security-mapping.config-file`` The JSON configuration file or REST-endpoint URI containing
- security mappings.
-``hive.s3.security-mapping.json-pointer`` A JSON pointer (RFC 6901) to mappings inside the JSON retrieved from
- the config file or REST-endpont. The whole document ("") by default.
+| Property name | Description |
+| ----------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `hive.s3.security-mapping.config-file` | The JSON configuration file or REST-endpoint URI containing security mappings. |
+| `hive.s3.security-mapping.json-pointer`               | A JSON pointer (RFC 6901) to mappings inside the JSON retrieved from the config file or REST endpoint. The whole document ("") by default.                                                                                                                                                         |
+| `hive.s3.security-mapping.iam-role-credential-name` | The name of the *extra credential* used to provide the IAM role. |
+| `hive.s3.security-mapping.kms-key-id-credential-name` | The name of the *extra credential* used to provide the KMS-managed key ID. |
+| `hive.s3.security-mapping.refresh-period` | How often to refresh the security mapping configuration. |
+| `hive.s3.security-mapping.colon-replacement` | The character or characters to be used in place of the colon (`:`) character when specifying an IAM role name as an extra credential. Any instances of this replacement value in the extra credential value will be converted to a colon. Choose a value that is not used in any of your IAM ARNs. |
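+
+As an example of how these properties combine, assume the catalog sets
+`hive.s3.security-mapping.iam-role-credential-name=s3_role` and
+`hive.s3.security-mapping.colon-replacement=#` (both values hypothetical). A
+client could then pass a role ARN as an extra credential with the colons
+replaced:
+
+```text
+trino --extra-credential s3_role=arn#aws#iam##123456789101#role/example
+```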
-``hive.s3.security-mapping.iam-role-credential-name`` The name of the *extra credential* used to provide the IAM role.
+(hive-s3-tuning-configuration)=
-``hive.s3.security-mapping.kms-key-id-credential-name`` The name of the *extra credential* used to provide the
- KMS-managed key ID.
-
-``hive.s3.security-mapping.refresh-period`` How often to refresh the security mapping configuration.
-
-``hive.s3.security-mapping.colon-replacement`` The character or characters to be used in place of the colon
- (``:``) character when specifying an IAM role name as an
- extra credential. Any instances of this replacement value in the
- extra credential value will be converted to a colon. Choose a
- value that is not used in any of your IAM ARNs.
-======================================================= =================================================================
-
-.. _hive-s3-tuning-configuration:
-
-Tuning properties
------------------
+## Tuning properties
The following tuning properties affect the behavior of the client
used by the Trino S3 filesystem when communicating with S3.
-Most of these parameters affect settings on the ``ClientConfiguration``
-object associated with the ``AmazonS3Client``.
-
-===================================== =========================================================== ==========================
-Property name Description Default
-===================================== =========================================================== ==========================
-``hive.s3.max-error-retries`` Maximum number of error retries, set on the S3 client. ``10``
-
-``hive.s3.max-client-retries`` Maximum number of read attempts to retry. ``5``
-
-``hive.s3.max-backoff-time`` Use exponential backoff starting at 1 second up to ``10 minutes``
- this maximum value when communicating with S3.
-
-``hive.s3.max-retry-time`` Maximum time to retry communicating with S3. ``10 minutes``
-
-``hive.s3.connect-timeout`` TCP connect timeout. ``5 seconds``
-
-``hive.s3.connect-ttl`` TCP connect TTL, which affects connection reusage. Connections do not expire.
-
-``hive.s3.socket-timeout`` TCP socket read timeout. ``5 seconds``
-
-``hive.s3.max-connections`` Maximum number of simultaneous open connections to S3. ``500``
-
-``hive.s3.multipart.min-file-size`` Minimum file size before multi-part upload to S3 is used. ``16 MB``
-
-``hive.s3.multipart.min-part-size`` Minimum multi-part upload part size. ``5 MB``
-===================================== =========================================================== ==========================
-
-.. _hive-s3-data-encryption:
-
-S3 data encryption
-------------------
+Most of these parameters affect settings on the `ClientConfiguration`
+object associated with the `AmazonS3Client`.
+
+| Property name | Description | Default |
+| --------------------------------- | ------------------------------------------------------------------------------------------------- | -------------------------- |
+| `hive.s3.max-error-retries` | Maximum number of error retries, set on the S3 client. | `10` |
+| `hive.s3.max-client-retries` | Maximum number of read attempts to retry. | `5` |
+| `hive.s3.max-backoff-time` | Use exponential backoff starting at 1 second up to this maximum value when communicating with S3. | `10 minutes` |
+| `hive.s3.max-retry-time` | Maximum time to retry communicating with S3. | `10 minutes` |
+| `hive.s3.connect-timeout` | TCP connect timeout. | `5 seconds` |
+| `hive.s3.connect-ttl`             | TCP connect TTL, which affects connection reuse.                                                  | Connections do not expire. |
+| `hive.s3.socket-timeout` | TCP socket read timeout. | `5 seconds` |
+| `hive.s3.max-connections` | Maximum number of simultaneous open connections to S3. | `500` |
+| `hive.s3.multipart.min-file-size` | Minimum file size before multi-part upload to S3 is used. | `16 MB` |
+| `hive.s3.multipart.min-part-size` | Minimum multi-part upload part size. | `5 MB` |
+
+(hive-s3-data-encryption)=
+
+## S3 data encryption
Trino supports reading and writing encrypted data in S3 using both
server-side encryption with S3 managed keys and client-side encryption using
either the Amazon KMS or a software plugin to manage AES encryption keys.
-With `S3 server-side encryption `_,
+With [S3 server-side encryption](http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html),
called *SSE-S3* in the Amazon documentation, the S3 infrastructure takes care of all encryption and decryption
-work. One exception is SSL to the client, assuming you have ``hive.s3.ssl.enabled`` set to ``true``.
-S3 also manages all the encryption keys for you. To enable this, set ``hive.s3.sse.enabled`` to ``true``.
+work. One exception is SSL to the client, assuming you have `hive.s3.ssl.enabled` set to `true`.
+S3 also manages all the encryption keys for you. To enable this, set `hive.s3.sse.enabled` to `true`.
-With `S3 client-side encryption `_,
+With [S3 client-side encryption](http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingClientSideEncryption.html),
S3 stores encrypted data and the encryption keys are managed outside of the S3 infrastructure. Data is encrypted
and decrypted by Trino instead of in the S3 infrastructure. In this case, encryption keys can be managed
either by using the AWS KMS, or your own key management system. To use the AWS KMS for key management, set
-``hive.s3.kms-key-id`` to the UUID of a KMS key. Your AWS credentials or EC2 IAM role will need to be
+`hive.s3.kms-key-id` to the UUID of a KMS key. Your AWS credentials or EC2 IAM role will need to be
granted permission to use the given key as well.
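+
+For example, client-side encryption with a KMS-managed key needs only one
+additional catalog property; the key UUID below is a placeholder:
+
+```properties
+hive.s3.kms-key-id=11111111-2222-3333-4444-555555555555
+```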
-To use a custom encryption key management system, set ``hive.s3.encryption-materials-provider`` to the
+To use a custom encryption key management system, set `hive.s3.encryption-materials-provider` to the
fully qualified name of a class which implements the
-`EncryptionMaterialsProvider `_
+[EncryptionMaterialsProvider](http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/EncryptionMaterialsProvider.html)
interface from the AWS Java SDK. This class has to be accessible to the Hive Connector through the
classpath and must be able to communicate with your custom key management system. If this class also implements
-the ``org.apache.hadoop.conf.Configurable`` interface from the Hadoop Java API, then the Hadoop configuration
+the `org.apache.hadoop.conf.Configurable` interface from the Hadoop Java API, then the Hadoop configuration
is passed in after the object instance is created, and before it is asked to provision or retrieve any
encryption keys.
-.. _s3selectpushdown:
+(s3selectpushdown)=
-S3 Select pushdown
-------------------
+## S3 Select pushdown
S3 Select pushdown enables pushing down projection (SELECT) and predicate (WHERE)
-processing to `S3 Select `_.
+processing to [S3 Select](https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectSELECTContent.html).
With S3 Select Pushdown, Trino only retrieves the required data from S3 instead
of entire S3 objects, reducing both latency and network usage.
-Is S3 Select a good fit for my workload?
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Is S3 Select a good fit for my workload?
Performance of S3 Select pushdown depends on the amount of data filtered by the
query. Filtering a large number of rows should result in better performance. If
@@ -372,68 +332,65 @@ benchmark your workloads with and without S3 Select to see if using it may be
suitable for your workload. By default, S3 Select Pushdown is disabled and you
should enable it in production after proper benchmarking and cost analysis. For
more information on S3 Select request cost, please see
-`Amazon S3 Cloud Storage Pricing `_.
+[Amazon S3 Cloud Storage Pricing](https://aws.amazon.com/s3/pricing/).
Use the following guidelines to determine if S3 Select is a good fit for your
workload:
-* Your query filters out more than half of the original data set.
-* Your query filter predicates use columns that have a data type supported by
+- Your query filters out more than half of the original data set.
+- Your query filter predicates use columns that have a data type supported by
Trino and S3 Select.
- The ``TIMESTAMP``, ``DECIMAL``, ``REAL``, and ``DOUBLE`` data types are not
+ The `TIMESTAMP`, `DECIMAL`, `REAL`, and `DOUBLE` data types are not
supported by S3 Select Pushdown. For more information about supported data
types for S3 Select, see the
- `Data Types documentation `_.
-* Your network connection between Amazon S3 and the Amazon EMR cluster has good
+ [Data Types documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-data-types.html).
+- Your network connection between Amazon S3 and the Amazon EMR cluster has good
transfer speed and available bandwidth. Amazon S3 Select does not compress
HTTP responses, so the response size may increase for compressed input files.
-Considerations and limitations
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Considerations and limitations
-* Only objects stored in JSON format are supported. Objects can be uncompressed,
+- Only objects stored in JSON format are supported. Objects can be uncompressed,
or optionally compressed with gzip or bzip2.
-* The "AllowQuotedRecordDelimiters" property is not supported. If this property
+- The `AllowQuotedRecordDelimiters` property is not supported. If this property
is specified, the query fails.
-* Amazon S3 server-side encryption with customer-provided encryption keys
+- Amazon S3 server-side encryption with customer-provided encryption keys
(SSE-C) and client-side encryption are not supported.
-* S3 Select Pushdown is not a substitute for using columnar or compressed file
+- S3 Select Pushdown is not a substitute for using columnar or compressed file
formats such as ORC and Parquet.
-Enabling S3 Select pushdown
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Enabling S3 Select pushdown
-You can enable S3 Select Pushdown using the ``s3_select_pushdown_enabled``
-Hive session property, or using the ``hive.s3select-pushdown.enabled``
+You can enable S3 Select Pushdown using the `s3_select_pushdown_enabled`
+Hive session property, or using the `hive.s3select-pushdown.enabled`
configuration property. The session property overrides the config
property, allowing you to enable or disable it on a per-query basis. Non-filtering
-queries (``SELECT * FROM table``) are not pushed down to S3 Select,
+queries (`SELECT * FROM table`) are not pushed down to S3 Select,
as they retrieve the entire object content.
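
For example, assuming a Hive catalog named `example`, pushdown can be enabled for the catalog and then toggled per query with the session property; the catalog name is an assumption:

```text
# In etc/catalog/example.properties:
hive.s3select-pushdown.enabled=true

-- Per-query override from a client session:
SET SESSION example.s3_select_pushdown_enabled = false;
```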
For uncompressed files, S3 Select scans ranges of bytes in parallel. The scan range
requests run across the byte ranges of the internal Hive splits for the query fragments
-pushed down to S3 Select. Changes in the Hive connector :ref:`performance tuning
+pushed down to S3 Select. Changes in the Hive connector {ref}`performance tuning
configuration properties <hive-performance-tuning-configuration>` are likely to impact
S3 Select pushdown performance.
S3 Select can be enabled for TEXTFILE data using the
-``hive.s3select-pushdown.experimental-textfile-pushdown-enabled`` configuration property,
+`hive.s3select-pushdown.experimental-textfile-pushdown-enabled` configuration property;
however, this has been shown to produce incorrect results. For more information, see
-`the GitHub Issue. `_
+[the GitHub issue](https://github.com/trinodb/trino/issues/17775).
-Understanding and tuning the maximum connections
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Understanding and tuning the maximum connections
Trino can use its native S3 file system or EMRFS. When using the native FS, the
-maximum connections is configured via the ``hive.s3.max-connections``
+maximum number of connections is configured via the `hive.s3.max-connections`
configuration property. When using EMRFS, the maximum number of connections is configured
-via the ``fs.s3.maxConnections`` Hadoop configuration property.
+via the `fs.s3.maxConnections` Hadoop configuration property.
S3 Select Pushdown bypasses the file systems when accessing Amazon S3 for
predicate operations. In this case, the value of
-``hive.s3select-pushdown.max-connections`` determines the maximum number of
+`hive.s3select-pushdown.max-connections` determines the maximum number of
client connections allowed for those operations from worker nodes.
If your workload experiences the error *Timeout waiting for connection from
-pool*, increase the value of both ``hive.s3select-pushdown.max-connections`` and
+pool*, increase the value of both `hive.s3select-pushdown.max-connections` and
the maximum connections configuration for the file system you are using.
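
As an illustrative sketch, when using the native S3 file system both limits could be raised together in the catalog properties file; the values below are arbitrary examples, not recommendations:

```text
hive.s3.max-connections=2000
hive.s3select-pushdown.max-connections=2000
```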
diff --git a/docs/src/main/sphinx/connector/hive-security.rst b/docs/src/main/sphinx/connector/hive-security.md
similarity index 62%
rename from docs/src/main/sphinx/connector/hive-security.rst
rename to docs/src/main/sphinx/connector/hive-security.md
index 9ae33fec80fa..9ebdd748baa1 100644
--- a/docs/src/main/sphinx/connector/hive-security.rst
+++ b/docs/src/main/sphinx/connector/hive-security.md
@@ -1,24 +1,20 @@
-=====================================
-Hive connector security configuration
-=====================================
+# Hive connector security configuration
-.. _hive-security-impersonation:
+(hive-security-impersonation)=
-Overview
-========
+## Overview
The Hive connector supports both authentication and authorization.
Trino can impersonate the end user who is running a query. In the case of a
user running a query from the command line interface, the end user is the
username associated with the Trino CLI process or argument to the optional
-``--user`` option.
+`--user` option.
Authentication can be configured with or without user impersonation on
Kerberized Hadoop clusters.
-Requirements
-============
+## Requirements
End user authentication is limited to Kerberized Hadoop clusters. Authentication
with user impersonation is available for both Kerberized and non-Kerberized clusters.
@@ -26,48 +22,46 @@ user impersonation is available for both Kerberized and non-Kerberized clusters.
You must ensure that you meet the Kerberos, user impersonation and keytab
requirements described in this section that apply to your configuration.
-.. _hive-security-kerberos-support:
+(hive-security-kerberos-support)=
-Kerberos
---------
+### Kerberos
-In order to use the Hive connector with a Hadoop cluster that uses ``kerberos``
+In order to use the Hive connector with a Hadoop cluster that uses `kerberos`
authentication, you must configure the connector to work with two services on
the Hadoop cluster:
-* The Hive metastore Thrift service
-* The Hadoop Distributed File System (HDFS)
+- The Hive metastore Thrift service
+- The Hadoop Distributed File System (HDFS)
Access to these services by the Hive connector is configured in the properties
file that contains the general Hive connector configuration.
Kerberos authentication by ticket cache is not yet supported.
-.. note::
+:::{note}
+If your `krb5.conf` location is different from `/etc/krb5.conf` you
+must set it explicitly using the `java.security.krb5.conf` JVM property
+in `jvm.config` file.
- If your ``krb5.conf`` location is different from ``/etc/krb5.conf`` you
- must set it explicitly using the ``java.security.krb5.conf`` JVM property
- in ``jvm.config`` file.
+Example: `-Djava.security.krb5.conf=/example/path/krb5.conf`.
+:::
- Example: ``-Djava.security.krb5.conf=/example/path/krb5.conf``.
+:::{warning}
+Access to the Trino coordinator must be secured, for example by using Kerberos or
+password authentication, when using Kerberos authentication to Hadoop services.
+Failure to secure access to the Trino coordinator could result in unauthorized
+access to sensitive data on the Hadoop cluster. Refer to {doc}`/security` for
+further information.
-.. warning::
+See {doc}`/security/kerberos` for information on setting up Kerberos authentication.
+:::
- Access to the Trino coordinator must be secured e.g., using Kerberos or
- password authentication, when using Kerberos authentication to Hadoop services.
- Failure to secure access to the Trino coordinator could result in unauthorized
- access to sensitive data on the Hadoop cluster. Refer to :doc:`/security` for
- further information.
+(hive-security-additional-keytab)=
- See :doc:`/security/kerberos` for information on setting up Kerberos authentication.
-
-.. _hive-security-additional-keytab:
-
-Keytab files
-^^^^^^^^^^^^
+#### Keytab files
Keytab files contain encryption keys that are used to authenticate principals
-to the Kerberos :abbr:`KDC (Key Distribution Center)`. These encryption keys
+to the Kerberos {abbr}`KDC (Key Distribution Center)`. These encryption keys
must be stored securely; you must take the same precautions to protect them
that you take to protect SSH private keys.
@@ -84,44 +78,41 @@ node.
You must ensure that the keytab files have the correct permissions on every
node after distributing them.
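
For example, on each node the keytab can be restricted to the operating system user running Trino; the user name and path below are assumptions:

```text
chown trino:trino /etc/trino/hive.keytab
chmod 600 /etc/trino/hive.keytab
```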
-.. _configuring-hadoop-impersonation:
+(configuring-hadoop-impersonation)=
-Impersonation in Hadoop
------------------------
+### Impersonation in Hadoop
In order to use impersonation, the Hadoop cluster must be
configured to allow the user or principal that Trino is running as to
impersonate the users who log in to Trino. Impersonation in Hadoop is
-configured in the file :file:`core-site.xml`. A complete description of the
-configuration options can be found in the `Hadoop documentation
-`_.
+configured in the file {file}`core-site.xml`. A complete description of the
+configuration options can be found in the [Hadoop documentation](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html#Configurations).
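
For example, assuming Trino runs as the operating system user `trino`, a minimal `core-site.xml` fragment allowing that user to impersonate others might look like the following; restrict the hosts and groups values to fit your environment:

```xml
<property>
    <name>hadoop.proxyuser.trino.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.trino.groups</name>
    <value>*</value>
</property>
```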
-Authentication
-==============
+## Authentication
-The default security configuration of the :doc:`/connector/hive` does not use
+The default security configuration of the {doc}`/connector/hive` does not use
authentication when connecting to a Hadoop cluster. All queries are executed as
the user who runs the Trino process, regardless of which user submits the
query.
The Hive connector provides additional security options to support Hadoop
-clusters that have been configured to use :ref:`Kerberos
+clusters that have been configured to use {ref}`Kerberos
<hive-security-kerberos-support>`.
-When accessing :abbr:`HDFS (Hadoop Distributed File System)`, Trino can
-:ref:`impersonate` the end user who is running the
-query. This can be used with HDFS permissions and :abbr:`ACLs (Access Control
+When accessing {abbr}`HDFS (Hadoop Distributed File System)`, Trino can
+{ref}`impersonate <hive-security-impersonation>` the end user who is running the
+query. This can be used with HDFS permissions and {abbr}`ACLs (Access Control
Lists)` to provide additional security for data.
-Hive metastore Thrift service authentication
---------------------------------------------
+### Hive metastore Thrift service authentication
In a Kerberized Hadoop cluster, Trino connects to the Hive metastore Thrift
-service using :abbr:`SASL (Simple Authentication and Security Layer)` and
+service using {abbr}`SASL (Simple Authentication and Security Layer)` and
authenticates using Kerberos. Kerberos authentication for the metastore is
configured in the connector's properties file using the following optional
properties:
+```{eval-rst}
.. list-table:: Hive metastore Thrift service authentication properties
:widths: 30, 55, 15
:header-rows: 1
@@ -182,61 +173,59 @@ properties:
specified by ``hive.metastore.client.principal``. This file must be
readable by the operating system user running Trino.
-
+```
-Configuration examples
-^^^^^^^^^^^^^^^^^^^^^^
+#### Configuration examples
The following sections describe the configuration properties and values needed
for the various authentication configurations used with the Hive metastore
Thrift service and the Hive connector.
-Default ``NONE`` authentication without impersonation
-"""""""""""""""""""""""""""""""""""""""""""""""""""""
+##### Default `NONE` authentication without impersonation
-.. code-block:: text
+```text
+hive.metastore.authentication.type=NONE
+```
- hive.metastore.authentication.type=NONE
-
-The default authentication type for the Hive metastore is ``NONE``. When the
-authentication type is ``NONE``, Trino connects to an unsecured Hive
+The default authentication type for the Hive metastore is `NONE`. When the
+authentication type is `NONE`, Trino connects to an unsecured Hive
metastore. Kerberos is not used.
-.. _hive-security-metastore-impersonation:
-
-``KERBEROS`` authentication with impersonation
-""""""""""""""""""""""""""""""""""""""""""""""
+(hive-security-metastore-impersonation)=
-.. code-block:: text
+##### `KERBEROS` authentication with impersonation
- hive.metastore.authentication.type=KERBEROS
- hive.metastore.thrift.impersonation.enabled=true
- hive.metastore.service.principal=hive/hive-metastore-host.example.com@EXAMPLE.COM
- hive.metastore.client.principal=trino@EXAMPLE.COM
- hive.metastore.client.keytab=/etc/trino/hive.keytab
+```text
+hive.metastore.authentication.type=KERBEROS
+hive.metastore.thrift.impersonation.enabled=true
+hive.metastore.service.principal=hive/hive-metastore-host.example.com@EXAMPLE.COM
+hive.metastore.client.principal=trino@EXAMPLE.COM
+hive.metastore.client.keytab=/etc/trino/hive.keytab
+```
When the authentication type for the Hive metastore Thrift service is
-``KERBEROS``, Trino connects as the Kerberos principal specified by the
-property ``hive.metastore.client.principal``. Trino authenticates this
-principal using the keytab specified by the ``hive.metastore.client.keytab``
+`KERBEROS`, Trino connects as the Kerberos principal specified by the
+property `hive.metastore.client.principal`. Trino authenticates this
+principal using the keytab specified by the `hive.metastore.client.keytab`
property, and verifies that the identity of the metastore matches
-``hive.metastore.service.principal``.
+`hive.metastore.service.principal`.
-When using ``KERBEROS`` Metastore authentication with impersonation, the
-principal specified by the ``hive.metastore.client.principal`` property must be
+When using `KERBEROS` metastore authentication with impersonation, the
+principal specified by the `hive.metastore.client.principal` property must be
allowed to impersonate the current Trino user, as discussed in the section
-:ref:`configuring-hadoop-impersonation`.
+{ref}`configuring-hadoop-impersonation`.
Keytab files must be distributed to every node in the cluster that runs Trino.
-:ref:`Additional Information About Keytab Files.`
+{ref}`Additional information about keytab files <hive-security-additional-keytab>`.
-HDFS authentication
--------------------
+### HDFS authentication
In a Kerberized Hadoop cluster, Trino authenticates to HDFS using Kerberos.
Kerberos authentication for HDFS is configured in the connector's properties
file using the following optional properties:
+```{eval-rst}
.. list-table:: HDFS authentication properties
:widths: 30, 55, 15
:header-rows: 1
@@ -282,92 +271,88 @@ file using the following optional properties:
HDFS. Note that using wire encryption may impact query execution
performance.
-
+```
-Configuration examples
-^^^^^^^^^^^^^^^^^^^^^^
+#### Configuration examples
The following sections describe the configuration properties and values needed
for the various authentication configurations with HDFS and the Hive connector.
-.. _hive-security-simple:
+(hive-security-simple)=
-Default ``NONE`` authentication without impersonation
-"""""""""""""""""""""""""""""""""""""""""""""""""""""
+##### Default `NONE` authentication without impersonation
-.. code-block:: text
+```text
+hive.hdfs.authentication.type=NONE
+```
- hive.hdfs.authentication.type=NONE
-
-The default authentication type for HDFS is ``NONE``. When the authentication
-type is ``NONE``, Trino connects to HDFS using Hadoop's simple authentication
+The default authentication type for HDFS is `NONE`. When the authentication
+type is `NONE`, Trino connects to HDFS using Hadoop's simple authentication
mechanism. Kerberos is not used.
-.. _hive-security-simple-impersonation:
-
-``NONE`` authentication with impersonation
-""""""""""""""""""""""""""""""""""""""""""
+(hive-security-simple-impersonation)=
-.. code-block:: text
+##### `NONE` authentication with impersonation
- hive.hdfs.authentication.type=NONE
- hive.hdfs.impersonation.enabled=true
+```text
+hive.hdfs.authentication.type=NONE
+hive.hdfs.impersonation.enabled=true
+```
-When using ``NONE`` authentication with impersonation, Trino impersonates
+When using `NONE` authentication with impersonation, Trino impersonates
the user who is running the query when accessing HDFS. The user Trino is
running as must be allowed to impersonate this user, as discussed in the
-section :ref:`configuring-hadoop-impersonation`. Kerberos is not used.
+section {ref}`configuring-hadoop-impersonation`. Kerberos is not used.
-.. _hive-security-kerberos:
+(hive-security-kerberos)=
-``KERBEROS`` authentication without impersonation
-"""""""""""""""""""""""""""""""""""""""""""""""""
+##### `KERBEROS` authentication without impersonation
-.. code-block:: text
+```text
+hive.hdfs.authentication.type=KERBEROS
+hive.hdfs.trino.principal=hdfs@EXAMPLE.COM
+hive.hdfs.trino.keytab=/etc/trino/hdfs.keytab
+```
- hive.hdfs.authentication.type=KERBEROS
- hive.hdfs.trino.principal=hdfs@EXAMPLE.COM
- hive.hdfs.trino.keytab=/etc/trino/hdfs.keytab
-
-When the authentication type is ``KERBEROS``, Trino accesses HDFS as the
-principal specified by the ``hive.hdfs.trino.principal`` property. Trino
+When the authentication type is `KERBEROS`, Trino accesses HDFS as the
+principal specified by the `hive.hdfs.trino.principal` property. Trino
authenticates this principal using the keytab specified by the
-``hive.hdfs.trino.keytab`` keytab.
+`hive.hdfs.trino.keytab` property.
Keytab files must be distributed to every node in the cluster that runs Trino.
-:ref:`Additional Information About Keytab Files.`
-
-.. _hive-security-kerberos-impersonation:
+{ref}`Additional information about keytab files <hive-security-additional-keytab>`.
-``KERBEROS`` authentication with impersonation
-""""""""""""""""""""""""""""""""""""""""""""""
+(hive-security-kerberos-impersonation)=
-.. code-block:: text
+##### `KERBEROS` authentication with impersonation
- hive.hdfs.authentication.type=KERBEROS
- hive.hdfs.impersonation.enabled=true
- hive.hdfs.trino.principal=trino@EXAMPLE.COM
- hive.hdfs.trino.keytab=/etc/trino/hdfs.keytab
+```text
+hive.hdfs.authentication.type=KERBEROS
+hive.hdfs.impersonation.enabled=true
+hive.hdfs.trino.principal=trino@EXAMPLE.COM
+hive.hdfs.trino.keytab=/etc/trino/hdfs.keytab
+```
-When using ``KERBEROS`` authentication with impersonation, Trino impersonates
+When using `KERBEROS` authentication with impersonation, Trino impersonates
the user who is running the query when accessing HDFS. The principal
-specified by the ``hive.hdfs.trino.principal`` property must be allowed to
+specified by the `hive.hdfs.trino.principal` property must be allowed to
impersonate the current Trino user, as discussed in the section
-:ref:`configuring-hadoop-impersonation`. Trino authenticates
-``hive.hdfs.trino.principal`` using the keytab specified by
-``hive.hdfs.trino.keytab``.
+{ref}`configuring-hadoop-impersonation`. Trino authenticates
+`hive.hdfs.trino.principal` using the keytab specified by
+`hive.hdfs.trino.keytab`.
Keytab files must be distributed to every node in the cluster that runs Trino.
-:ref:`Additional Information About Keytab Files.`
+{ref}`Additional information about keytab files <hive-security-additional-keytab>`.
-Authorization
-=============
+## Authorization
-You can enable authorization checks for the :doc:`hive` by setting
-the ``hive.security`` property in the Hive catalog properties file. This
+You can enable authorization checks for the {doc}`hive` by setting
+the `hive.security` property in the Hive catalog properties file. This
property must be one of the following values:
+```{eval-rst}
.. list-table:: ``hive.security`` property values
:widths: 30, 60
:header-rows: 1
@@ -398,25 +383,25 @@ property must be one of the following values:
See the :ref:`hive-sql-standard-based-authorization` section for details.
* - ``allow-all``
- No authorization checks are enforced.
+```
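
For example, to turn on SQL standard based authorization for a catalog, set the property in that catalog's properties file:

```properties
hive.security=sql-standard
```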
-.. _hive-sql-standard-based-authorization:
+(hive-sql-standard-based-authorization)=
-SQL standard based authorization
---------------------------------
+### SQL standard-based authorization
-When ``sql-standard`` security is enabled, Trino enforces the same SQL
+When `sql-standard` security is enabled, Trino enforces the same SQL
standard-based authorization as Hive does.
-Since Trino's ``ROLE`` syntax support matches the SQL standard, and
+Since Trino's `ROLE` syntax support matches the SQL standard, and
Hive does not exactly follow the SQL standard, the following
limitations and differences apply:
-* ``CREATE ROLE role WITH ADMIN`` is not supported.
-* The ``admin`` role must be enabled to execute ``CREATE ROLE``, ``DROP ROLE`` or ``CREATE SCHEMA``.
-* ``GRANT role TO user GRANTED BY someone`` is not supported.
-* ``REVOKE role FROM user GRANTED BY someone`` is not supported.
-* By default, all a user's roles, except ``admin``, are enabled in a new user session.
-* One particular role can be selected by executing ``SET ROLE role``.
-* ``SET ROLE ALL`` enables all of a user's roles except ``admin``.
-* The ``admin`` role must be enabled explicitly by executing ``SET ROLE admin``.
-* ``GRANT privilege ON SCHEMA schema`` is not supported. Schema ownership can be changed with ``ALTER SCHEMA schema SET AUTHORIZATION user``
\ No newline at end of file
+- `CREATE ROLE role WITH ADMIN` is not supported.
+- The `admin` role must be enabled to execute `CREATE ROLE`, `DROP ROLE` or `CREATE SCHEMA`.
+- `GRANT role TO user GRANTED BY someone` is not supported.
+- `REVOKE role FROM user GRANTED BY someone` is not supported.
+- By default, all of a user's roles, except `admin`, are enabled in a new user session.
+- One particular role can be selected by executing `SET ROLE role`.
+- `SET ROLE ALL` enables all of a user's roles except `admin`.
+- The `admin` role must be enabled explicitly by executing `SET ROLE admin`.
+- `GRANT privilege ON SCHEMA schema` is not supported. Schema ownership can be changed with `ALTER SCHEMA schema SET AUTHORIZATION user`.
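
The rules above can be illustrated with a short session; the catalog, schema, role, and user names are assumptions:

```sql
-- the admin role must be enabled explicitly before managing roles
SET ROLE admin;
CREATE ROLE analyst;
GRANT analyst TO USER alice;

-- in alice's session, all granted roles except admin are enabled by default
SET ROLE analyst;

-- schema-level grants are not supported, but ownership can be transferred
ALTER SCHEMA example.web SET AUTHORIZATION alice;
```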
diff --git a/docs/src/main/sphinx/connector/hive.rst b/docs/src/main/sphinx/connector/hive.md
similarity index 63%
rename from docs/src/main/sphinx/connector/hive.rst
rename to docs/src/main/sphinx/connector/hive.md
index 852aeca2b3b8..e4ad6e539545 100644
--- a/docs/src/main/sphinx/connector/hive.rst
+++ b/docs/src/main/sphinx/connector/hive.md
@@ -1,56 +1,54 @@
-==============
-Hive connector
-==============
-
-.. raw:: html
-
-
-
-.. toctree::
- :maxdepth: 1
- :hidden:
-
- Metastores
- Security
- Amazon S3
- Azure Storage
- Google Cloud Storage
- IBM Cloud Object Storage
- Storage Caching
- Alluxio
- Object storage file formats
+# Hive connector
+
+```{raw} html
+
+```
+
+```{toctree}
+:hidden: true
+:maxdepth: 1
+
+Metastores
+Security
+Amazon S3
+Azure Storage
+Google Cloud Storage
+IBM Cloud Object Storage
+Storage Caching
+Alluxio
+Object storage file formats
+```
The Hive connector allows querying data stored in an
-`Apache Hive `_
+[Apache Hive](https://hive.apache.org/)
data warehouse. Hive is a combination of three components:
-* Data files in varying formats, that are typically stored in the
+- Data files in varying formats that are typically stored in the
Hadoop Distributed File System (HDFS) or in object storage systems
such as Amazon S3.
-* Metadata about how the data files are mapped to schemas and tables. This
+- Metadata about how the data files are mapped to schemas and tables. This
metadata is stored in a database, such as MySQL, and is accessed via the Hive
metastore service.
-* A query language called HiveQL. This query language is executed on a
+- A query language called HiveQL. This query language is executed on a
distributed computing framework such as MapReduce or Tez.
Trino only uses the first two components: the data and the metadata.
It does not use HiveQL or any part of Hive's execution environment.
-Requirements
-------------
+## Requirements
The Hive connector requires a
-:ref:`Hive metastore service ` (HMS), or a compatible
-implementation of the Hive metastore, such as
-:ref:`AWS Glue `.
+{ref}`Hive metastore service <hive-thrift-metastore>` (HMS), or a compatible
+implementation of the Hive metastore, such as
+{ref}`AWS Glue <hive-glue-metastore>`.
Apache Hadoop HDFS 2.x and 3.x are supported.
Many distributed storage systems including HDFS,
-:doc:`Amazon S3 ` or S3-compatible systems,
-`Google Cloud Storage <#google-cloud-storage-configuration>`__,
-:doc:`Azure Storage `, and
-:doc:`IBM Cloud Object Storage` can be queried with the Hive
+{doc}`Amazon S3 ` or S3-compatible systems,
+[Google Cloud Storage](hive-gcs-tutorial),
+{doc}`Azure Storage `, and
+{doc}`IBM Cloud Object Storage` can be queried with the Hive
connector.
The coordinator and all workers must have network access to the Hive metastore
@@ -60,62 +58,59 @@ to using port 9083.
Data files must be in a supported file format. Some file formats can be
configured using file format configuration properties per catalog:
-* :ref:`ORC `
-* :ref:`Parquet `
-* Avro
-* RCText (RCFile using ColumnarSerDe)
-* RCBinary (RCFile using LazyBinaryColumnarSerDe)
-* SequenceFile
-* JSON (using org.apache.hive.hcatalog.data.JsonSerDe)
-* CSV (using org.apache.hadoop.hive.serde2.OpenCSVSerde)
-* TextFile
+- {ref}`ORC `
+- {ref}`Parquet `
+- Avro
+- RCText (RCFile using ColumnarSerDe)
+- RCBinary (RCFile using LazyBinaryColumnarSerDe)
+- SequenceFile
+- JSON (using org.apache.hive.hcatalog.data.JsonSerDe)
+- CSV (using org.apache.hadoop.hive.serde2.OpenCSVSerde)
+- TextFile
-General configuration
----------------------
+## General configuration
To configure the Hive connector, create a catalog properties file
-``etc/catalog/example.properties`` that references the ``hive``
+`etc/catalog/example.properties` that references the `hive`
connector and defines a metastore. You must configure a metastore for table
-metadata. If you are using a :ref:`Hive metastore `,
-``hive.metastore.uri`` must be configured:
+metadata. If you are using a {ref}`Hive metastore <hive-thrift-metastore>`,
+`hive.metastore.uri` must be configured:
-.. code-block:: properties
+```properties
+connector.name=hive
+hive.metastore.uri=thrift://example.net:9083
+```
- connector.name=hive
- hive.metastore.uri=thrift://example.net:9083
+If you are using {ref}`AWS Glue <hive-glue-metastore>` as your metastore, you
+must instead set `hive.metastore` to `glue`:
-If you are using :ref:`AWS Glue ` as your metastore, you
-must instead set ``hive.metastore`` to ``glue``:
-
-.. code-block:: properties
-
- connector.name=hive
- hive.metastore=glue
+```properties
+connector.name=hive
+hive.metastore=glue
+```
Each metastore type has specific configuration properties along with
-:ref:`general metastore configuration properties `.
+{ref}`general metastore configuration properties <general-metastore-properties>`.
-Multiple Hive clusters
-^^^^^^^^^^^^^^^^^^^^^^
+### Multiple Hive clusters
You can have as many catalogs as you need, so if you have additional
-Hive clusters, simply add another properties file to ``etc/catalog``
-with a different name, making sure it ends in ``.properties``. For
-example, if you name the property file ``sales.properties``, Trino
-creates a catalog named ``sales`` using the configured connector.
+Hive clusters, simply add another properties file to `etc/catalog`
+with a different name, making sure it ends in `.properties`. For
+example, if you name the property file `sales.properties`, Trino
+creates a catalog named `sales` using the configured connector.
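
For example, a second catalog file might look like this, with the metastore hostname being an assumption:

```properties
# etc/catalog/sales.properties
connector.name=hive
hive.metastore.uri=thrift://sales-metastore.example.net:9083
```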
-HDFS configuration
-^^^^^^^^^^^^^^^^^^
+### HDFS configuration
For basic setups, Trino configures the HDFS client automatically and
does not require any configuration files. In some cases, such as when using
federated HDFS or NameNode high availability, it is necessary to specify
additional HDFS client options in order to access your HDFS cluster. To do so,
-add the ``hive.config.resources`` property to reference your HDFS config files:
-
-.. code-block:: text
+add the `hive.config.resources` property to reference your HDFS config files:
- hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
+```text
+hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
+```
Only specify additional configuration files if necessary for your setup.
We recommend reducing the configuration files to have the minimum
@@ -125,43 +120,42 @@ The configuration files must exist on all Trino nodes. If you are
referencing existing Hadoop config files, make sure to copy them to
any Trino nodes that are not running Hadoop.
-HDFS username and permissions
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### HDFS username and permissions
-Before running any ``CREATE TABLE`` or ``CREATE TABLE AS`` statements
+Before running any `CREATE TABLE` or `CREATE TABLE AS` statements
for Hive tables in Trino, you must check that the user Trino is
using to access HDFS has access to the Hive warehouse directory. The Hive
warehouse directory is specified by the configuration variable
-``hive.metastore.warehouse.dir`` in ``hive-site.xml``, and the default
-value is ``/user/hive/warehouse``.
+`hive.metastore.warehouse.dir` in `hive-site.xml`, and the default
+value is `/user/hive/warehouse`.
When not using Kerberos with HDFS, Trino accesses HDFS using the
OS user of the Trino process. For example, if Trino is running as
-``nobody``, it accesses HDFS as ``nobody``. You can override this
-username by setting the ``HADOOP_USER_NAME`` system property in the
-Trino :ref:`jvm-config`, replacing ``hdfs_user`` with the
+`nobody`, it accesses HDFS as `nobody`. You can override this
+username by setting the `HADOOP_USER_NAME` system property in the
+Trino {ref}`jvm-config`, replacing `hdfs_user` with the
appropriate username:
-.. code-block:: text
+```text
+-DHADOOP_USER_NAME=hdfs_user
+```
- -DHADOOP_USER_NAME=hdfs_user
-
-The ``hive`` user generally works, since Hive is often started with
-the ``hive`` user and this user has access to the Hive warehouse.
+The `hive` user generally works, since Hive is often started with
+the `hive` user and this user has access to the Hive warehouse.
Whenever you change the user Trino is using to access HDFS, remove
-``/tmp/presto-*`` on HDFS, as the new user may not have access to
+`/tmp/presto-*` on HDFS, as the new user may not have access to
the existing temporary directories.
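
For example, the old scratch directories can be removed with the HDFS CLI; run this as a user with sufficient permissions:

```text
hdfs dfs -rm -r -f '/tmp/presto-*'
```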
-.. _hive-configuration-properties:
+(hive-configuration-properties)=
-Hive general configuration properties
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Hive general configuration properties
The following table lists general configuration properties for the Hive
connector. There are additional sets of configuration properties throughout the
Hive connector documentation.
+```{eval-rst}
.. list-table:: Hive general configuration properties
:widths: 35, 50, 15
:header-rows: 1
@@ -361,286 +355,306 @@ Hive connector documentation.
- Enables auto-commit for all writes. This can be used to disallow
multi-statement write transactions.
- ``false``
+```
-Storage
--------
+## Storage
The Hive connector supports the following storage options:
-* :doc:`Amazon S3 `
-* :doc:`Azure Storage `
-* :doc:`Google Cloud Storage `
-* :doc:`IBM Cloud Object Storage `
+- {doc}`Amazon S3 `
+- {doc}`Azure Storage `
+- {doc}`Google Cloud Storage `
+- {doc}`IBM Cloud Object Storage `
-The Hive connector also supports :doc:`storage caching `.
+The Hive connector also supports {doc}`storage caching `.
-Security
---------
+## Security
-Please see the :doc:`/connector/hive-security` section for information on the
+Please see the {doc}`/connector/hive-security` section for information on the
security options available for the Hive connector.
-.. _hive-sql-support:
+(hive-sql-support)=
-SQL support
------------
+## SQL support
The connector provides read access and write access to data and metadata in the
configured object storage system and metadata stores:
-* :ref:`Globally available statements `; see also
- :ref:`Globally available statements `
-* :ref:`Read operations `
-* :ref:`sql-write-operations`:
-
- * :ref:`sql-data-management`; see also
- :ref:`Hive-specific data management `
- * :ref:`sql-schema-table-management`; see also
- :ref:`Hive-specific schema and table management `
- * :ref:`sql-view-management`; see also
- :ref:`Hive-specific view management `
-
-* :ref:`sql-security-operations`: see also
- :ref:`SQL standard-based authorization for object storage `
-* :ref:`sql-transactions`
-
-Refer to :doc:`the migration guide ` for practical advice
-on migrating from Hive to Trino.
-
-The following sections provide Hive-specific information regarding SQL support.
-
-.. _hive-examples:
-
-Basic usage examples
-^^^^^^^^^^^^^^^^^^^^
-
-The examples shown here work on Google Cloud Storage by replacing ``s3://`` with
-``gs://``.
-
-Create a new Hive table named ``page_views`` in the ``web`` schema
-that is stored using the ORC file format, partitioned by date and
-country, and bucketed by user into ``50`` buckets. Note that Hive
-requires the partition columns to be the last columns in the table::
-
- CREATE TABLE example.web.page_views (
- view_time TIMESTAMP,
- user_id BIGINT,
- page_url VARCHAR,
- ds DATE,
- country VARCHAR
- )
- WITH (
- format = 'ORC',
- partitioned_by = ARRAY['ds', 'country'],
- bucketed_by = ARRAY['user_id'],
- bucket_count = 50
- )
-
-Create a new Hive schema named ``web`` that stores tables in an
-S3 bucket named ``my-bucket``::
-
- CREATE SCHEMA example.web
- WITH (location = 's3://my-bucket/')
-
-Drop a schema::
-
- DROP SCHEMA example.web
-
-Drop a partition from the ``page_views`` table::
-
- DELETE FROM example.web.page_views
- WHERE ds = DATE '2016-08-09'
- AND country = 'US'
-
-Query the ``page_views`` table::
-
- SELECT * FROM example.web.page_views
+- {ref}`Globally available statements <sql-globally-available>`; see also
+  {ref}`Globally available statements <hive-procedures>`
-List the partitions of the ``page_views`` table::
+- {ref}`Read operations <sql-read-operations>`
- SELECT * FROM example.web."page_views$partitions"
+- {ref}`sql-write-operations`:
-Create an external Hive table named ``request_logs`` that points at
-existing data in S3::
+ - {ref}`sql-data-management`; see also
+    {ref}`Hive-specific data management <hive-data-management>`
+ - {ref}`sql-schema-table-management`; see also
+    {ref}`Hive-specific schema and table management <hive-schema-and-table-management>`
+ - {ref}`sql-view-management`; see also
+    {ref}`Hive-specific view management <hive-sql-view-management>`
- CREATE TABLE example.web.request_logs (
- request_time TIMESTAMP,
- url VARCHAR,
- ip VARCHAR,
- user_agent VARCHAR
- )
- WITH (
- format = 'TEXTFILE',
- external_location = 's3://my-bucket/data/logs/'
- )
+- {ref}`sql-security-operations`; see also
+  {ref}`SQL standard-based authorization for object storage <hive-sql-standard-based-authorization>`
-Collect statistics for the ``request_logs`` table::
+- {ref}`sql-transactions`
- ANALYZE example.web.request_logs;
-
-Drop the external table ``request_logs``. This only drops the metadata
-for the table. The referenced data directory is not deleted::
-
- DROP TABLE example.web.request_logs
-
-* :doc:`/sql/create-table-as` can be used to create transactional tables in ORC format like this::
-
- CREATE TABLE
- WITH (
- format='ORC',
- transactional=true
- )
- AS
-
-
-Add an empty partition to the ``page_views`` table::
+Refer to {doc}`the migration guide </appendix/from-hive>` for practical advice
+on migrating from Hive to Trino.
- CALL system.create_empty_partition(
- schema_name => 'web',
- table_name => 'page_views',
- partition_columns => ARRAY['ds', 'country'],
- partition_values => ARRAY['2016-08-09', 'US']);
+The following sections provide Hive-specific information regarding SQL support.
-Drop stats for a partition of the ``page_views`` table::
+(hive-examples)=
- CALL system.drop_stats(
- schema_name => 'web',
- table_name => 'page_views',
- partition_values => ARRAY[ARRAY['2016-08-09', 'US']]);
+### Basic usage examples
-.. _hive-procedures:
+The examples shown here work on Google Cloud Storage by replacing `s3://` with
+`gs://`.
-Procedures
-^^^^^^^^^^
+Create a new Hive table named `page_views` in the `web` schema
+that is stored using the ORC file format, partitioned by date and
+country, and bucketed by user into `50` buckets. Note that Hive
+requires the partition columns to be the last columns in the table:
+
+```
+CREATE TABLE example.web.page_views (
+ view_time TIMESTAMP,
+ user_id BIGINT,
+ page_url VARCHAR,
+ ds DATE,
+ country VARCHAR
+)
+WITH (
+ format = 'ORC',
+ partitioned_by = ARRAY['ds', 'country'],
+ bucketed_by = ARRAY['user_id'],
+ bucket_count = 50
+)
+```
+
+Create a new Hive schema named `web` that stores tables in an
+S3 bucket named `my-bucket`:
+
+```
+CREATE SCHEMA example.web
+WITH (location = 's3://my-bucket/')
+```
+
+Drop a schema:
+
+```
+DROP SCHEMA example.web
+```
+
+Drop a partition from the `page_views` table:
+
+```
+DELETE FROM example.web.page_views
+WHERE ds = DATE '2016-08-09'
+ AND country = 'US'
+```
+
+Query the `page_views` table:
+
+```
+SELECT * FROM example.web.page_views
+```
+
+List the partitions of the `page_views` table:
+
+```
+SELECT * FROM example.web."page_views$partitions"
+```
+
+Create an external Hive table named `request_logs` that points at
+existing data in S3:
+
+```
+CREATE TABLE example.web.request_logs (
+ request_time TIMESTAMP,
+ url VARCHAR,
+ ip VARCHAR,
+ user_agent VARCHAR
+)
+WITH (
+ format = 'TEXTFILE',
+ external_location = 's3://my-bucket/data/logs/'
+)
+```
+
+Collect statistics for the `request_logs` table:
+
+```
+ANALYZE example.web.request_logs;
+```
+
+Drop the external table `request_logs`. This only drops the metadata
+for the table. The referenced data directory is not deleted:
+
+```
+DROP TABLE example.web.request_logs
+```
+
+- {doc}`/sql/create-table-as` can be used to create transactional tables in ORC format like this:
+
+  ```
+  CREATE TABLE <name>
+  WITH (
+    format='ORC',
+    transactional=true
+  )
+  AS <query>
+  ```
+
+Add an empty partition to the `page_views` table:
+
+```
+CALL system.create_empty_partition(
+ schema_name => 'web',
+ table_name => 'page_views',
+ partition_columns => ARRAY['ds', 'country'],
+ partition_values => ARRAY['2016-08-09', 'US']);
+```
+
+Drop stats for a partition of the `page_views` table:
+
+```
+CALL system.drop_stats(
+ schema_name => 'web',
+ table_name => 'page_views',
+ partition_values => ARRAY[ARRAY['2016-08-09', 'US']]);
+```
+
+(hive-procedures)=
+
+### Procedures
-Use the :doc:`/sql/call` statement to perform data manipulation or
+Use the {doc}`/sql/call` statement to perform data manipulation or
administrative tasks. Procedures must include a qualified catalog name, if your
-Hive catalog is called ``web``::
+Hive catalog is called `web`:
- CALL web.system.example_procedure()
+```
+CALL web.system.example_procedure()
+```
The following procedures are available:
-* ``system.create_empty_partition(schema_name, table_name, partition_columns, partition_values)``
+- `system.create_empty_partition(schema_name, table_name, partition_columns, partition_values)`
Create an empty partition in the specified table.
-* ``system.sync_partition_metadata(schema_name, table_name, mode, case_sensitive)``
+- `system.sync_partition_metadata(schema_name, table_name, mode, case_sensitive)`
Check and update partitions list in metastore. There are three modes available:
- * ``ADD`` : add any partitions that exist on the file system, but not in the metastore.
- * ``DROP``: drop any partitions that exist in the metastore, but not on the file system.
- * ``FULL``: perform both ``ADD`` and ``DROP``.
+  - `ADD`: add any partitions that exist on the file system, but not in the metastore.
+ - `DROP`: drop any partitions that exist in the metastore, but not on the file system.
+ - `FULL`: perform both `ADD` and `DROP`.
- The ``case_sensitive`` argument is optional. The default value is ``true`` for compatibility
- with Hive's ``MSCK REPAIR TABLE`` behavior, which expects the partition column names in
- file system paths to use lowercase (e.g. ``col_x=SomeValue``). Partitions on the file system
- not conforming to this convention are ignored, unless the argument is set to ``false``.
+ The `case_sensitive` argument is optional. The default value is `true` for compatibility
+ with Hive's `MSCK REPAIR TABLE` behavior, which expects the partition column names in
+ file system paths to use lowercase (e.g. `col_x=SomeValue`). Partitions on the file system
+ not conforming to this convention are ignored, unless the argument is set to `false`.
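+
+  A sketch of a full synchronization of the `web.page_views` table used in
+  the earlier examples, adding missing partitions and dropping stale ones:
+
+  ```
+  CALL system.sync_partition_metadata(
+      schema_name => 'web',
+      table_name => 'page_views',
+      mode => 'FULL');
+  ```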
-* ``system.drop_stats(schema_name, table_name, partition_values)``
+- `system.drop_stats(schema_name, table_name, partition_values)`
Drops statistics for a subset of partitions or the entire table. The partitions are specified as an
- array whose elements are arrays of partition values (similar to the ``partition_values`` argument in
- ``create_empty_partition``). If ``partition_values`` argument is omitted, stats are dropped for the
+ array whose elements are arrays of partition values (similar to the `partition_values` argument in
+  `create_empty_partition`). If the `partition_values` argument is omitted, stats are dropped for the
entire table.
-.. _register-partition:
+(register-partition)=
-* ``system.register_partition(schema_name, table_name, partition_columns, partition_values, location)``
+- `system.register_partition(schema_name, table_name, partition_columns, partition_values, location)`
Registers existing location as a new partition in the metastore for the specified table.
- When the ``location`` argument is omitted, the partition location is
- constructed using ``partition_columns`` and ``partition_values``.
+ When the `location` argument is omitted, the partition location is
+ constructed using `partition_columns` and `partition_values`.
- Due to security reasons, the procedure is enabled only when ``hive.allow-register-partition-procedure``
- is set to ``true``.
+  For security reasons, the procedure is enabled only when `hive.allow-register-partition-procedure`
+  is set to `true`.
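+
+  For example, to register a partition at an explicit location, following the
+  earlier examples (the bucket path is illustrative):
+
+  ```
+  CALL system.register_partition(
+      schema_name => 'web',
+      table_name => 'page_views',
+      partition_columns => ARRAY['ds', 'country'],
+      partition_values => ARRAY['2016-08-09', 'US'],
+      location => 's3://my-bucket/data/page_views/ds=2016-08-09/country=US');
+  ```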
-.. _unregister-partition:
+(unregister-partition)=
-* ``system.unregister_partition(schema_name, table_name, partition_columns, partition_values)``
+- `system.unregister_partition(schema_name, table_name, partition_columns, partition_values)`
Unregisters given, existing partition in the metastore for the specified table.
The partition data is not deleted.
-.. _hive-flush-metadata-cache:
+(hive-flush-metadata-cache)=
-* ``system.flush_metadata_cache()``
+- `system.flush_metadata_cache()`
Flush all Hive metadata caches.
-* ``system.flush_metadata_cache(schema_name => ..., table_name => ...)``
+- `system.flush_metadata_cache(schema_name => ..., table_name => ...)`
Flush Hive metadata caches entries connected with selected table.
Procedure requires named parameters to be passed
-* ``system.flush_metadata_cache(schema_name => ..., table_name => ..., partition_columns => ARRAY[...], partition_values => ARRAY[...])``
+- `system.flush_metadata_cache(schema_name => ..., table_name => ..., partition_columns => ARRAY[...], partition_values => ARRAY[...])`
Flush Hive metadata cache entries connected with selected partition.
Procedure requires named parameters to be passed.
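+
+  For example, to flush cache entries for a single partition of the
+  `web.page_views` table from the earlier examples:
+
+  ```
+  CALL system.flush_metadata_cache(
+      schema_name => 'web',
+      table_name => 'page_views',
+      partition_columns => ARRAY['ds', 'country'],
+      partition_values => ARRAY['2016-08-09', 'US']);
+  ```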
-.. _hive-data-management:
+(hive-data-management)=
-Data management
-^^^^^^^^^^^^^^^
+### Data management
-Some :ref:`data management ` statements may be affected by
-the Hive catalog's authorization check policy. In the default ``legacy`` policy,
-some statements are disabled by default. See :doc:`hive-security` for more
+Some {ref}`data management ` statements may be affected by
+the Hive catalog's authorization check policy. In the default `legacy` policy,
+some statements are disabled by default. See {doc}`hive-security` for more
information.
-The :ref:`sql-data-management` functionality includes support for ``INSERT``,
-``UPDATE``, ``DELETE``, and ``MERGE`` statements, with the exact support
+The {ref}`sql-data-management` functionality includes support for `INSERT`,
+`UPDATE`, `DELETE`, and `MERGE` statements, with the exact support
depending on the storage system, file format, and metastore.
When connecting to a Hive metastore version 3.x, the Hive connector supports
reading from and writing to insert-only and ACID tables, with full support for
partitioning and bucketing.
-:doc:`/sql/delete` applied to non-transactional tables is only supported if the
-table is partitioned and the ``WHERE`` clause matches entire partitions.
+{doc}`/sql/delete` applied to non-transactional tables is only supported if the
+table is partitioned and the `WHERE` clause matches entire partitions.
Transactional Hive tables with ORC format support "row-by-row" deletion, in
-which the ``WHERE`` clause may match arbitrary sets of rows.
+which the `WHERE` clause may match arbitrary sets of rows.
-:doc:`/sql/update` is only supported for transactional Hive tables with format
-ORC. ``UPDATE`` of partition or bucket columns is not supported.
+{doc}`/sql/update` is only supported for transactional Hive tables with format
+ORC. `UPDATE` of partition or bucket columns is not supported.
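+
+As a sketch, assuming `page_views` was created as a transactional ORC table,
+a row-level update could look like:
+
+```
+UPDATE example.web.page_views
+SET page_url = 'https://example.org/error'
+WHERE user_id = 42
+```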
-:doc:`/sql/merge` is only supported for ACID tables.
+{doc}`/sql/merge` is only supported for ACID tables.
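+
+A minimal sketch, assuming an ACID target table and a hypothetical
+`page_views_staging` source table with the same columns:
+
+```
+MERGE INTO example.web.page_views t
+USING example.web.page_views_staging s
+  ON t.user_id = s.user_id AND t.ds = s.ds AND t.country = s.country
+WHEN MATCHED
+    THEN UPDATE SET page_url = s.page_url
+WHEN NOT MATCHED
+    THEN INSERT VALUES (s.view_time, s.user_id, s.page_url, s.ds, s.country)
+```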
-ACID tables created with `Hive Streaming Ingest `_
+ACID tables created with [Hive Streaming Ingest](https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest)
are not supported.
-.. _hive-schema-and-table-management:
+(hive-schema-and-table-management)=
-Schema and table management
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Schema and table management
The Hive connector supports querying and manipulating Hive tables and schemas
(databases). While some uncommon operations must be performed using
Hive directly, most operations can be performed using Trino.
-Schema evolution
-""""""""""""""""
+#### Schema evolution
Hive allows the partitions in a table to have a different schema than the
table. This occurs when the column types of a table are changed after
partitions already exist (that use the original column types). The Hive
connector supports this by allowing the same conversions as Hive:
-* ``VARCHAR`` to and from ``TINYINT``, ``SMALLINT``, ``INTEGER`` and ``BIGINT``
-* ``REAL`` to ``DOUBLE``
-* Widening conversions for integers, such as ``TINYINT`` to ``SMALLINT``
+- `VARCHAR` to and from `TINYINT`, `SMALLINT`, `INTEGER` and `BIGINT`
+- `REAL` to `DOUBLE`
+- Widening conversions for integers, such as `TINYINT` to `SMALLINT`
Any conversion failure results in null, which is the same behavior
-as Hive. For example, converting the string ``'foo'`` to a number,
-or converting the string ``'1234'`` to a ``TINYINT`` (which has a
-maximum value of ``127``).
+as Hive. For example, converting the string `'foo'` to a number,
+or converting the string `'1234'` to a `TINYINT` (which has a
+maximum value of `127`).
-.. _hive-avro-schema:
+(hive-avro-schema)=
-Avro schema evolution
-"""""""""""""""""""""
+#### Avro schema evolution
Trino supports querying and manipulating Hive tables with the Avro storage
format, which has the schema set based on an Avro schema file/literal. Trino is
@@ -648,36 +662,38 @@ also capable of creating the tables in Trino by infering the schema from a
valid Avro schema file located locally, or remotely in HDFS/Web server.
To specify that the Avro schema should be used for interpreting table data, use
-the ``avro_schema_url`` table property.
+the `avro_schema_url` table property.
The schema can be placed in the local file system or remotely in the following
locations:
-- HDFS (e.g. ``avro_schema_url = 'hdfs://user/avro/schema/avro_data.avsc'``)
-- S3 (e.g. ``avro_schema_url = 's3n:///schema_bucket/schema/avro_data.avsc'``)
-- A web server (e.g. ``avro_schema_url = 'http://example.org/schema/avro_data.avsc'``)
+- HDFS (e.g. `avro_schema_url = 'hdfs://user/avro/schema/avro_data.avsc'`)
+- S3 (e.g. `avro_schema_url = 's3n:///schema_bucket/schema/avro_data.avsc'`)
+- A web server (e.g. `avro_schema_url = 'http://example.org/schema/avro_data.avsc'`)
The URL, where the schema is located, must be accessible from the Hive metastore
and Trino coordinator/worker nodes.
-Alternatively, you can use the table property ``avro_schema_literal`` to define
+Alternatively, you can use the table property `avro_schema_literal` to define
the Avro schema.
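+
+For example, the following sketch defines the schema inline instead of
+pointing at a file (the record definition is illustrative):
+
+```
+CREATE TABLE example.avro.avro_data (
+  id BIGINT
+)
+WITH (
+  format = 'AVRO',
+  avro_schema_literal = '{
+    "type": "record",
+    "name": "avro_data",
+    "fields": [
+      {"name": "id", "type": "long"}
+    ]
+  }'
+)
+```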
-The table created in Trino using the ``avro_schema_url`` or
-``avro_schema_literal`` property behaves the same way as a Hive table with
-``avro.schema.url`` or ``avro.schema.literal`` set.
+The table created in Trino using the `avro_schema_url` or
+`avro_schema_literal` property behaves the same way as a Hive table with
+`avro.schema.url` or `avro.schema.literal` set.
-Example::
+Example:
- CREATE TABLE example.avro.avro_data (
- id BIGINT
- )
- WITH (
- format = 'AVRO',
- avro_schema_url = '/usr/local/avro_data.avsc'
- )
+```
+CREATE TABLE example.avro.avro_data (
+ id BIGINT
+)
+WITH (
+ format = 'AVRO',
+ avro_schema_url = '/usr/local/avro_data.avsc'
+)
+```
-The columns listed in the DDL (``id`` in the above example) is ignored if ``avro_schema_url`` is specified.
+The columns listed in the DDL (`id` in the above example) are ignored if `avro_schema_url` is specified.
The table schema matches the schema in the Avro schema file. Before any read operation, the Avro schema is
accessed so the query result reflects any changes in schema. Thus Trino takes advantage of Avro's backward compatibility abilities.
@@ -686,100 +702,97 @@ Newly added/renamed fields *must* have a default value in the Avro schema file.
The schema evolution behavior is as follows:
-* Column added in new schema:
+- Column added in new schema:
Data created with an older schema produces a *default* value when table is using the new schema.
-
-* Column removed in new schema:
+- Column removed in new schema:
Data created with an older schema no longer outputs the data from the column that was removed.
-
-* Column is renamed in the new schema:
+- Column is renamed in the new schema:
This is equivalent to removing the column and adding a new one, and data created with an older schema
produces a *default* value when table is using the new schema.
-
-* Changing type of column in the new schema:
+- Changing type of column in the new schema:
If the type coercion is supported by Avro or the Hive connector, then the conversion happens.
An error is thrown for incompatible types.
-Limitations
-~~~~~~~~~~~
+##### Limitations
-The following operations are not supported when ``avro_schema_url`` is set:
+The following operations are not supported when `avro_schema_url` is set:
-* ``CREATE TABLE AS`` is not supported.
-* Bucketing(``bucketed_by``) columns are not supported in ``CREATE TABLE``.
-* ``ALTER TABLE`` commands modifying columns are not supported.
+- `CREATE TABLE AS` is not supported.
+- Bucketing (`bucketed_by`) columns are not supported in `CREATE TABLE`.
+- `ALTER TABLE` commands modifying columns are not supported.
-.. _hive-alter-table-execute:
+(hive-alter-table-execute)=
-ALTER TABLE EXECUTE
-"""""""""""""""""""
+#### ALTER TABLE EXECUTE
-The connector supports the ``optimize`` command for use with
-:ref:`ALTER TABLE EXECUTE `.
+The connector supports the `optimize` command for use with
+{ref}`ALTER TABLE EXECUTE `.
-The ``optimize`` command is used for rewriting the content
+The `optimize` command is used for rewriting the content
of the specified non-transactional table so that it is merged
into fewer but larger files.
In case that the table is partitioned, the data compaction
acts separately on each partition selected for optimization.
This operation improves read performance.
-All files with a size below the optional ``file_size_threshold``
-parameter (default value for the threshold is ``100MB``) are
+All files with a size below the optional `file_size_threshold`
+parameter (default value for the threshold is `100MB`) are
merged:
-.. code-block:: sql
-
- ALTER TABLE test_table EXECUTE optimize
+```sql
+ALTER TABLE test_table EXECUTE optimize
+```
The following statement merges files in a table that are
under 10 megabytes in size:
-.. code-block:: sql
-
- ALTER TABLE test_table EXECUTE optimize(file_size_threshold => '10MB')
+```sql
+ALTER TABLE test_table EXECUTE optimize(file_size_threshold => '10MB')
+```
-You can use a ``WHERE`` clause with the columns used to partition the table,
+You can use a `WHERE` clause with the columns used to partition the table,
to filter which partitions are optimized:
-.. code-block:: sql
+```sql
+ALTER TABLE test_partitioned_table EXECUTE optimize
+WHERE partition_key = 1
+```
- ALTER TABLE test_partitioned_table EXECUTE optimize
- WHERE partition_key = 1
-
-The ``optimize`` command is disabled by default, and can be enabled for a
-catalog with the ``.non_transactional_optimize_enabled``
+The `optimize` command is disabled by default, and can be enabled for a
+catalog with the `<catalog_name>.non_transactional_optimize_enabled`
session property:
-.. code-block:: sql
-
- SET SESSION .non_transactional_optimize_enabled=true
+```sql
+SET SESSION <catalog_name>.non_transactional_optimize_enabled=true
+```
-.. warning::
+:::{warning}
+Because Hive tables are non-transactional, take note of the following possible
+outcomes:
- Because Hive tables are non-transactional, take note of the following possible
- outcomes:
+- If queries are run against tables that are currently being optimized,
+ duplicate rows may be read.
+- In rare cases where exceptions occur during the `optimize` operation,
+ a manual cleanup of the table directory is needed. In this situation, refer
+ to the Trino logs and query failure messages to see which files must be
+ deleted.
+:::
- * If queries are run against tables that are currently being optimized,
- duplicate rows may be read.
- * In rare cases where exceptions occur during the ``optimize`` operation,
- a manual cleanup of the table directory is needed. In this situation, refer
- to the Trino logs and query failure messages to see which files must be
- deleted.
+(hive-table-properties)=
-.. _hive-table-properties:
-
-Table properties
-""""""""""""""""
+#### Table properties
Table properties supply or set metadata for the underlying tables. This
-is key for :doc:`/sql/create-table-as` statements. Table properties are passed
-to the connector using a :doc:`WITH ` clause::
+is key for {doc}`/sql/create-table-as` statements. Table properties are passed
+to the connector using a {doc}`WITH </sql/create-table-as>` clause:
- CREATE TABLE tablename
- WITH (format='CSV',
- csv_escape = '"')
+```
+CREATE TABLE tablename
+WITH (format='CSV',
+ csv_escape = '"')
+```
+```{eval-rst}
.. list-table:: Hive connector table properties
:widths: 20, 60, 20
:header-rows: 1
@@ -890,59 +903,60 @@ to the connector using a :doc:`WITH ` clause::
and are available in the ``$properties`` metadata table.
The properties are not included in the output of ``SHOW CREATE TABLE`` statements.
-
+```
-.. _hive-special-tables:
+(hive-special-tables)=
-Metadata tables
-"""""""""""""""
+#### Metadata tables
The raw Hive table properties are available as a hidden table, containing a
separate column per table property, with a single row containing the property
values.
-``$properties`` table
-~~~~~~~~~~~~~~~~~~~~~
+##### `$properties` table
-The properties table name is composed with the table name and ``$properties`` appended.
+The properties table name consists of the table name with `$properties` appended.
It exposes the parameters of the table in the metastore.
-You can inspect the property names and values with a simple query::
-
- SELECT * FROM example.web."page_views$properties";
+You can inspect the property names and values with a simple query:
+```
+SELECT * FROM example.web."page_views$properties";
+```
-.. code-block:: text
+```text
+ stats_generated_via_stats_task | auto.purge | presto_query_id | presto_version | transactional
+---------------------------------------------+------------+-----------------------------+----------------+---------------
+ workaround for potential lack of HIVE-12730 | false | 20230705_152456_00001_nfugi | 423 | false
+```
- stats_generated_via_stats_task | auto.purge | presto_query_id | presto_version | transactional
- ---------------------------------------------+------------+-----------------------------+----------------+---------------
- workaround for potential lack of HIVE-12730 | false | 20230705_152456_00001_nfugi | 423 | false
+##### `$partitions` table
-``$partitions`` table
-~~~~~~~~~~~~~~~~~~~~~
-
-The ``$partitions`` table provides a list of all partition values
+The `$partitions` table provides a list of all partition values
of a partitioned table.
The following example query returns all partition values from the
-``page_views`` table in the ``web`` schema of the ``example`` catalog::
-
- SELECT * FROM example.web."page_views$partitions";
+`page_views` table in the `web` schema of the `example` catalog:
-.. code-block:: text
+```
+SELECT * FROM example.web."page_views$partitions";
+```
- day | country
- ------------+---------
- 2023-07-01 | POL
- 2023-07-02 | POL
- 2023-07-03 | POL
- 2023-03-01 | USA
- 2023-03-02 | USA
+```text
+ day | country
+------------+---------
+ 2023-07-01 | POL
+ 2023-07-02 | POL
+ 2023-07-03 | POL
+ 2023-03-01 | USA
+ 2023-03-02 | USA
+```
-.. _hive-column-properties:
+(hive-column-properties)=
-Column properties
-"""""""""""""""""
+#### Column properties
+```{eval-rst}
.. list-table:: Hive connector column properties
:widths: 20, 60, 20
:header-rows: 1
@@ -1003,57 +1017,54 @@ Column properties
Mapped from the AWS Athena table property
`projection.${columnName}.interval.unit `_.
-
+```
-.. _hive-special-columns:
+(hive-special-columns)=
-Metadata columns
-""""""""""""""""
+#### Metadata columns
In addition to the defined columns, the Hive connector automatically exposes
metadata in a number of hidden columns in each table:
-* ``$bucket``: Bucket number for this row
-
-* ``$path``: Full file system path name of the file for this row
-
-* ``$file_modified_time``: Date and time of the last modification of the file for this row
-
-* ``$file_size``: Size of the file for this row
-
-* ``$partition``: Partition name for this row
+- `$bucket`: Bucket number for this row
+- `$path`: Full file system path name of the file for this row
+- `$file_modified_time`: Date and time of the last modification of the file for this row
+- `$file_size`: Size of the file for this row
+- `$partition`: Partition name for this row
You can use these columns in your SQL statements like any other column. They
can be selected directly, or used in conditional statements. For example, you
-can inspect the file size, location and partition for each record::
+can inspect the file size, location and partition for each record:
- SELECT *, "$path", "$file_size", "$partition"
- FROM example.web.page_views;
+```
+SELECT *, "$path", "$file_size", "$partition"
+FROM example.web.page_views;
+```
Retrieve all records that belong to files stored in the partition
-``ds=2016-08-09/country=US``::
+`ds=2016-08-09/country=US`:
- SELECT *, "$path", "$file_size"
- FROM example.web.page_views
- WHERE "$partition" = 'ds=2016-08-09/country=US'
+```
+SELECT *, "$path", "$file_size"
+FROM example.web.page_views
+WHERE "$partition" = 'ds=2016-08-09/country=US'
+```
-.. _hive-sql-view-management:
+(hive-sql-view-management)=
-View management
-^^^^^^^^^^^^^^^
+### View management
Trino allows reading from Hive materialized views, and can be configured to
support reading Hive views.
-Materialized views
-""""""""""""""""""
+#### Materialized views
The Hive connector supports reading from Hive materialized views.
In Trino, these views are presented as regular, read-only tables.
-.. _hive-views:
+(hive-views)=
-Hive views
-""""""""""
+#### Hive views
Hive views are defined in HiveQL and stored in the Hive Metastore Service. They
are analyzed to allow read access to the data.
@@ -1061,15 +1072,15 @@ are analyzed to allow read access to the data.
The Hive connector includes support for reading Hive views with three different
modes.
-* Disabled
-* Legacy
-* Experimental
+- Disabled
+- Legacy
+- Experimental
You can configure the behavior in your catalog properties file.
-By default, Hive views are executed with the ``RUN AS DEFINER`` security mode.
-Set the ``hive.hive-views.run-as-invoker`` catalog configuration property to
-``true`` to use ``RUN AS INVOKER`` semantics.
+By default, Hive views are executed with the `RUN AS DEFINER` security mode.
+Set the `hive.hive-views.run-as-invoker` catalog configuration property to
+`true` to use `RUN AS INVOKER` semantics.
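+
+For example, a catalog properties file that enables Hive view execution with
+`RUN AS INVOKER` semantics could contain the following (a sketch, to be
+combined with your existing catalog settings):
+
+```text
+hive.hive-views.enabled=true
+hive.hive-views.run-as-invoker=true
+```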
**Disabled**
@@ -1080,12 +1091,12 @@ logic and data encoded in the views is not available in Trino.
A very simple implementation to execute Hive views, and therefore allow read
access to the data in Trino, can be enabled with
-``hive.hive-views.enabled=true`` and
-``hive.hive-views.legacy-translation=true``.
+`hive.hive-views.enabled=true` and
+`hive.hive-views.legacy-translation=true`.
For temporary usage of the legacy behavior for a specific catalog, you can set
-the ``hive_views_legacy_translation`` :doc:`catalog session property
-` to ``true``.
+the `hive_views_legacy_translation` {doc}`catalog session property
+</sql/set-session>` to `true`.
This legacy behavior interprets any HiveQL query that defines a view as if it
is written in SQL. It does not do any translation, but instead relies on the
@@ -1105,58 +1116,56 @@ rewrite Hive views and contained expressions and statements.
It supports the following Hive view functionality:
-* ``UNION [DISTINCT]`` and ``UNION ALL`` against Hive views
-* Nested ``GROUP BY`` clauses
-* ``current_user()``
-* ``LATERAL VIEW OUTER EXPLODE``
-* ``LATERAL VIEW [OUTER] EXPLODE`` on array of struct
-* ``LATERAL VIEW json_tuple``
+- `UNION [DISTINCT]` and `UNION ALL` against Hive views
+- Nested `GROUP BY` clauses
+- `current_user()`
+- `LATERAL VIEW OUTER EXPLODE`
+- `LATERAL VIEW [OUTER] EXPLODE` on array of struct
+- `LATERAL VIEW json_tuple`
You can enable the experimental behavior with
-``hive.hive-views.enabled=true``. Remove the
-``hive.hive-views.legacy-translation`` property or set it to ``false`` to make
+`hive.hive-views.enabled=true`. Remove the
+`hive.hive-views.legacy-translation` property or set it to `false` to make
sure legacy is not enabled.
Keep in mind that numerous features are not yet implemented when experimenting
with this feature. The following is an incomplete list of **missing**
functionality:
-* HiveQL ``current_date``, ``current_timestamp``, and others
-* Hive function calls including ``translate()``, window functions, and others
-* Common table expressions and simple case expressions
-* Honor timestamp precision setting
-* Support all Hive data types and correct mapping to Trino types
-* Ability to process custom UDFs
+- HiveQL `current_date`, `current_timestamp`, and others
+- Hive function calls including `translate()`, window functions, and others
+- Common table expressions and simple case expressions
+- Honor timestamp precision setting
+- Support all Hive data types and correct mapping to Trino types
+- Ability to process custom UDFs
-.. _hive-fte-support:
+(hive-fte-support)=
-Fault-tolerant execution support
---------------------------------
+## Fault-tolerant execution support
-The connector supports :doc:`/admin/fault-tolerant-execution` of query
+The connector supports {doc}`/admin/fault-tolerant-execution` of query
processing. Read and write operations are both supported with any retry policy
on non-transactional tables.
Read operations are supported with any retry policy on transactional tables.
-Write operations and ``CREATE TABLE ... AS`` operations are not supported with
+Write operations and `CREATE TABLE ... AS` operations are not supported with
any retry policy on transactional tables.
-Performance
------------
+## Performance
The connector includes a number of performance improvements, detailed in the
following sections.
-Table statistics
-^^^^^^^^^^^^^^^^
+### Table statistics
-The Hive connector supports collecting and managing :doc:`table statistics
+The Hive connector supports collecting and managing {doc}`table statistics
</optimizer/statistics>` to improve query processing performance.
When writing data, the Hive connector always collects basic statistics
-(``numFiles``, ``numRows``, ``rawDataSize``, ``totalSize``)
+(`numFiles`, `numRows`, `rawDataSize`, `totalSize`)
and by default will also collect column level statistics:
+```{eval-rst}
.. list-table:: Available table statistics
:widths: 35, 65
:header-rows: 1
@@ -1189,58 +1198,65 @@ and by default will also collect column level statistics:
- Number of nulls
* - ``BOOLEAN``
- Number of nulls, number of true/false values
+```
-.. _hive-analyze:
+(hive-analyze)=
-Updating table and partition statistics
-"""""""""""""""""""""""""""""""""""""""
+#### Updating table and partition statistics
If your queries are complex and include joining large data sets,
-running :doc:`/sql/analyze` on tables/partitions may improve query performance
+running {doc}`/sql/analyze` on tables/partitions may improve query performance
by collecting statistical information about the data.
When analyzing a partitioned table, the partitions to analyze can be specified
-via the optional ``partitions`` property, which is an array containing
-the values of the partition keys in the order they are declared in the table schema::
+via the optional `partitions` property, which is an array containing
+the values of the partition keys in the order they are declared in the table schema:
- ANALYZE table_name WITH (
- partitions = ARRAY[
- ARRAY['p1_value1', 'p1_value2'],
- ARRAY['p2_value1', 'p2_value2']])
+```
+ANALYZE table_name WITH (
+ partitions = ARRAY[
+ ARRAY['p1_value1', 'p1_value2'],
+ ARRAY['p2_value1', 'p2_value2']])
+```
This query collects statistics for two partitions with keys
-``p1_value1, p1_value2`` and ``p2_value1, p2_value2``.
+`p1_value1, p1_value2` and `p2_value1, p2_value2`.
On wide tables, collecting statistics for all columns can be expensive and can have a
detrimental effect on query planning. It is also typically unnecessary, as statistics are
only useful for specific columns, such as join keys, predicates, and grouping keys. You can
-specify a subset of columns to be analyzed via the optional ``columns`` property::
+specify a subset of columns to be analyzed via the optional `columns` property:
- ANALYZE table_name WITH (
- partitions = ARRAY[ARRAY['p2_value1', 'p2_value2']],
- columns = ARRAY['col_1', 'col_2'])
+```
+ANALYZE table_name WITH (
+ partitions = ARRAY[ARRAY['p2_value1', 'p2_value2']],
+ columns = ARRAY['col_1', 'col_2'])
+```
-This query collects statistics for columns ``col_1`` and ``col_2`` for the partition
-with keys ``p2_value1, p2_value2``.
+This query collects statistics for columns `col_1` and `col_2` for the partition
+with keys `p2_value1, p2_value2`.
Note that if statistics were previously collected for all columns, they must be dropped
-before re-analyzing just a subset::
+before re-analyzing just a subset:
- CALL system.drop_stats('schema_name', 'table_name')
+```
+CALL system.drop_stats('schema_name', 'table_name')
+```
-You can also drop statistics for selected partitions only::
+You can also drop statistics for selected partitions only:
- CALL system.drop_stats(
- schema_name => 'schema',
- table_name => 'table',
- partition_values => ARRAY[ARRAY['p2_value1', 'p2_value2']])
+```
+CALL system.drop_stats(
+ schema_name => 'schema',
+ table_name => 'table',
+ partition_values => ARRAY[ARRAY['p2_value1', 'p2_value2']])
+```
-.. _hive-dynamic-filtering:
+(hive-dynamic-filtering)=
-Dynamic filtering
-^^^^^^^^^^^^^^^^^
+### Dynamic filtering
-The Hive connector supports the :doc:`dynamic filtering ` optimization.
+The Hive connector supports the {doc}`dynamic filtering </admin/dynamic-filtering>` optimization.
Dynamic partition pruning is supported for partitioned tables stored in any file format
for broadcast as well as partitioned joins.
Dynamic bucket pruning is supported for bucketed tables stored in any file format for
@@ -1255,8 +1271,7 @@ This is because grouping similar data within the same stripe or row-group
greatly improves the selectivity of the min/max indexes maintained at stripe or
row-group level.
-Delaying execution for dynamic filters
-""""""""""""""""""""""""""""""""""""""
+#### Delaying execution for dynamic filters
It can often be beneficial to wait for the collection of dynamic filters before starting
a table scan. This extra wait time can potentially result in significant overall savings
@@ -1264,36 +1279,36 @@ in query and CPU time, if dynamic filtering is able to reduce the amount of scan
For the Hive connector, a table scan can be delayed for a configured amount of
time until the collection of dynamic filters by using the configuration property
-``hive.dynamic-filtering.wait-timeout`` in the catalog file or the catalog
-session property ``.dynamic_filtering_wait_timeout``.
+`hive.dynamic-filtering.wait-timeout` in the catalog file or the catalog
+session property `<hive-catalog>.dynamic_filtering_wait_timeout`.
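
For example, assuming a Hive catalog named `example`, the timeout can be raised for a
single session; the one-minute value is illustrative:

```sql
-- Delay table scans for up to 1 minute while dynamic filters are collected
SET SESSION example.dynamic_filtering_wait_timeout = '1m';
```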
-.. _hive-table-redirection:
+(hive-table-redirection)=
-Table redirection
-^^^^^^^^^^^^^^^^^
+### Table redirection
-.. include:: table-redirection.fragment
+```{include} table-redirection.fragment
+```
The connector supports redirection from Hive tables to Iceberg
and Delta Lake tables with the following catalog configuration properties:
-- ``hive.iceberg-catalog-name`` for redirecting the query to :doc:`/connector/iceberg`
-- ``hive.delta-lake-catalog-name`` for redirecting the query to :doc:`/connector/delta-lake`
+- `hive.iceberg-catalog-name` for redirecting the query to {doc}`/connector/iceberg`
+- `hive.delta-lake-catalog-name` for redirecting the query to {doc}`/connector/delta-lake`
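
A sketch of the corresponding catalog file entries, assuming catalogs named `iceberg`
and `delta` are separately configured on the same cluster:

```properties
# Hypothetical target catalog names; each must exist as its own catalog
hive.iceberg-catalog-name=iceberg
hive.delta-lake-catalog-name=delta
```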
-.. _hive-performance-tuning-configuration:
+(hive-performance-tuning-configuration)=
-Performance tuning configuration properties
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Performance tuning configuration properties
The following table describes performance tuning properties for the Hive
connector.
-.. warning::
-
- Performance tuning configuration properties are considered expert-level
- features. Altering these properties from their default values is likely to
- cause instability and performance degradation.
+:::{warning}
+Performance tuning configuration properties are considered expert-level
+features. Altering these properties from their default values is likely to
+cause instability and performance degradation.
+:::
+```{eval-rst}
.. list-table::
:widths: 30, 50, 20
:header-rows: 1
@@ -1330,23 +1345,20 @@ connector.
splits result in more parallelism and thus can decrease latency, but
also have more overhead and increase load on the system.
- ``64 MB``
+```
-Hive 3-related limitations
---------------------------
+## Hive 3-related limitations
-* For security reasons, the ``sys`` system catalog is not accessible.
-
-* Hive's ``timestamp with local zone`` data type is mapped to
- ``timestamp with time zone`` with UTC timezone. It only supports reading
+- For security reasons, the `sys` system catalog is not accessible.
+- Hive's `timestamp with local zone` data type is mapped to
+ `timestamp with time zone` with UTC timezone. It only supports reading
values. Writing to tables with columns of this type is not supported.
-
-* Due to Hive issues `HIVE-21002 `_
- and `HIVE-22167 `_, Trino does
- not correctly read ``TIMESTAMP`` values from Parquet, RCBinary, or Avro
+- Due to Hive issues [HIVE-21002](https://issues.apache.org/jira/browse/HIVE-21002)
+ and [HIVE-22167](https://issues.apache.org/jira/browse/HIVE-22167), Trino does
+ not correctly read `TIMESTAMP` values from Parquet, RCBinary, or Avro
file formats created by Hive 3.1 or later. When reading from these file formats,
Trino returns different results than Hive.
-
-* Trino does not support gathering table statistics for Hive transactional tables.
+- Trino does not support gathering table statistics for Hive transactional tables.
You must use Hive to gather table statistics with
- `ANALYZE statement `_
+ [ANALYZE statement](https://cwiki.apache.org/confluence/display/hive/statsdev#StatsDev-ExistingTables%E2%80%93ANALYZE)
after table creation.
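
As a sketch, the statistics can be gathered with HiveQL directly in Hive; the table name
is hypothetical and these statements do not run in Trino:

```sql
-- Run in Hive (for example via beeline), not in Trino
ANALYZE TABLE orders_transactional COMPUTE STATISTICS;
ANALYZE TABLE orders_transactional COMPUTE STATISTICS FOR COLUMNS;
```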
diff --git a/docs/src/main/sphinx/connector/hudi.rst b/docs/src/main/sphinx/connector/hudi.md
similarity index 57%
rename from docs/src/main/sphinx/connector/hudi.rst
rename to docs/src/main/sphinx/connector/hudi.md
index 025a42ec9416..2c257abd9627 100644
--- a/docs/src/main/sphinx/connector/hudi.rst
+++ b/docs/src/main/sphinx/connector/hudi.md
@@ -1,46 +1,42 @@
-==============
-Hudi connector
-==============
+# Hudi connector
-.. raw:: html
+```{raw} html
+
+```
-
+The Hudi connector enables querying [Hudi](https://hudi.apache.org/docs/overview/) tables.
-The Hudi connector enables querying `Hudi `_ tables.
-
-Requirements
-------------
+## Requirements
To use the Hudi connector, you need:
-* Hudi version 0.12.3 or higher.
-* Network access from the Trino coordinator and workers to the Hudi storage.
-* Access to a Hive metastore service (HMS).
-* Network access from the Trino coordinator to the HMS.
-* Data files stored in the Parquet file format. These can be configured using
- :ref:`file format configuration properties ` per
+- Hudi version 0.12.3 or higher.
+- Network access from the Trino coordinator and workers to the Hudi storage.
+- Access to a Hive metastore service (HMS).
+- Network access from the Trino coordinator to the HMS.
+- Data files stored in the Parquet file format. These can be configured using
+ {ref}`file format configuration properties <hive-file-format-configuration>` per
catalog.
-General configuration
----------------------
+## General configuration
To configure the Hudi connector, create a catalog properties file
-``etc/catalog/example.properties`` that references the ``hudi``
-connector and defines the HMS to use with the ``hive.metastore.uri``
+`etc/catalog/example.properties` that references the `hudi`
+connector and defines the HMS to use with the `hive.metastore.uri`
configuration property:
-.. code-block:: properties
+```properties
+connector.name=hudi
+hive.metastore.uri=thrift://example.net:9083
+```
- connector.name=hudi
- hive.metastore.uri=thrift://example.net:9083
-
-There are :ref:`HMS configuration properties `
+There are {ref}`HMS configuration properties <general-metastore-properties>`
available for use with the Hudi connector. The connector recognizes Hudi tables
-synced to the metastore by the `Hudi sync tool
-`_.
+synced to the metastore by the [Hudi sync tool](https://hudi.apache.org/docs/syncing_metastore).
Additionally, the following configuration properties can be set depending on the use case:
+```{eval-rst}
.. list-table:: Hudi configuration properties
:widths: 30, 55, 15
:header-rows: 1
@@ -90,70 +86,68 @@ Additionally, following configuration properties can be set depending on the use
the Hive metastore cache.
- ``2000``
+```
-SQL support
------------
+## SQL support
The connector provides read access to data in Hudi tables that have been synced to the
-Hive metastore. The :ref:`globally available `
-and :ref:`read operation ` statements are supported.
-
-Basic usage examples
-^^^^^^^^^^^^^^^^^^^^
-
-In the following example queries, ``stock_ticks_cow`` is the Hudi copy-on-write
-table referred to in the Hudi `quickstart guide
-`_.
-
-.. code-block:: sql
-
- USE example.example_schema;
-
- SELECT symbol, max(ts)
- FROM stock_ticks_cow
- GROUP BY symbol
- HAVING symbol = 'GOOG';
-
-.. code-block:: text
-
- symbol | _col1 |
- -----------+----------------------+
- GOOG | 2018-08-31 10:59:00 |
- (1 rows)
-
-.. code-block:: sql
-
- SELECT dt, symbol
- FROM stock_ticks_cow
- WHERE symbol = 'GOOG';
-
-.. code-block:: text
-
- dt | symbol |
- ------------+--------+
- 2018-08-31 | GOOG |
- (1 rows)
-
-.. code-block:: sql
-
- SELECT dt, count(*)
- FROM stock_ticks_cow
- GROUP BY dt;
-
-.. code-block:: text
-
- dt | _col1 |
- ------------+--------+
- 2018-08-31 | 99 |
- (1 rows)
-
-Schema and table management
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Hudi supports `two types of tables `_
+Hive metastore. The {ref}`globally available <sql-globally-available>`
+and {ref}`read operation <sql-read-operations>` statements are supported.
+
+### Basic usage examples
+
+In the following example queries, `stock_ticks_cow` is the Hudi copy-on-write
+table referred to in the Hudi [quickstart guide](https://hudi.apache.org/docs/docker_demo/).
+
+```sql
+USE example.example_schema;
+
+SELECT symbol, max(ts)
+FROM stock_ticks_cow
+GROUP BY symbol
+HAVING symbol = 'GOOG';
+```
+
+```text
+ symbol | _col1 |
+-----------+----------------------+
+ GOOG | 2018-08-31 10:59:00 |
+(1 rows)
+```
+
+```sql
+SELECT dt, symbol
+FROM stock_ticks_cow
+WHERE symbol = 'GOOG';
+```
+
+```text
+ dt | symbol |
+------------+--------+
+ 2018-08-31 | GOOG |
+(1 rows)
+```
+
+```sql
+SELECT dt, count(*)
+FROM stock_ticks_cow
+GROUP BY dt;
+```
+
+```text
+ dt | _col1 |
+------------+--------+
+ 2018-08-31 | 99 |
+(1 rows)
+```
+
+### Schema and table management
+
+Hudi supports [two types of tables](https://hudi.apache.org/docs/table_types)
depending on how the data is indexed and laid out on the file system. The following
table displays a support matrix of table types and query types for the connector:
+```{eval-rst}
.. list-table:: Hudi table type support matrix
:widths: 45, 55
:header-rows: 1
@@ -164,39 +158,43 @@ table displays a support matrix of tables types and query types for the connecto
- Snapshot queries
* - Merge on read
- Read-optimized queries
+```
-.. _hudi-metadata-tables:
+(hudi-metadata-tables)=
-Metadata tables
-"""""""""""""""
+#### Metadata tables
The connector exposes a metadata table for each Hudi table.
The metadata table contains information about the internal structure
of the Hudi table. You can query each metadata table by appending the
-metadata table name to the table name::
+metadata table name to the table name:
- SELECT * FROM "test_table$timeline"
+```
+SELECT * FROM "test_table$timeline"
+```
-``$timeline`` table
-~~~~~~~~~~~~~~~~~~~
+##### `$timeline` table
-The ``$timeline`` table provides a detailed view of meta-data instants
+The `$timeline` table provides a detailed view of metadata instants
in the Hudi table. Instants are specific points in time.
You can retrieve the information about the timeline of the Hudi table
-``test_table`` by using the following query::
-
- SELECT * FROM "test_table$timeline"
+`test_table` by using the following query:
-.. code-block:: text
+```
+SELECT * FROM "test_table$timeline"
+```
- timestamp | action | state
- --------------------+---------+-----------
- 8667764846443717831 | commit | COMPLETED
- 7860805980949777961 | commit | COMPLETED
+```text
+ timestamp | action | state
+--------------------+---------+-----------
+8667764846443717831 | commit | COMPLETED
+7860805980949777961 | commit | COMPLETED
+```
The output of the query has the following columns:
+```{eval-rst}
.. list-table:: Timeline columns
:widths: 20, 30, 50
:header-rows: 1
@@ -212,4 +210,5 @@ The output of the query has the following columns:
- `Type of action `_ performed on the table.
* - ``state``
- ``VARCHAR``
- - Current state of the instant.
\ No newline at end of file
+ - Current state of the instant.
+```
diff --git a/docs/src/main/sphinx/connector/iceberg.rst b/docs/src/main/sphinx/connector/iceberg.md
similarity index 54%
rename from docs/src/main/sphinx/connector/iceberg.rst
rename to docs/src/main/sphinx/connector/iceberg.md
index 06906891d731..d10a3b8eab15 100644
--- a/docs/src/main/sphinx/connector/iceberg.rst
+++ b/docs/src/main/sphinx/connector/iceberg.md
@@ -1,14 +1,12 @@
-=================
-Iceberg connector
-=================
+# Iceberg connector
-.. raw:: html
-
-
+```{raw} html
+
+```
Apache Iceberg is an open table format for huge analytic datasets. The Iceberg
connector allows querying data stored in files written in Iceberg format, as
-defined in the `Iceberg Table Spec `_. The
+defined in the [Iceberg Table Spec](https://iceberg.apache.org/spec/). The
connector supports Apache Iceberg table spec versions 1 and 2.
The table state is maintained in metadata files. All changes to table
@@ -17,13 +15,13 @@ swap. The table metadata file tracks the table schema, partitioning
configuration, custom properties, and snapshots of the table contents.
Iceberg data files are stored in either Parquet, ORC, or Avro format, as
-determined by the ``format`` property in the table definition. The default
-``format`` value is ``PARQUET``.
+determined by the `format` property in the table definition. The default
+`format` value is `PARQUET`.
Iceberg is designed to improve on the known scalability limitations of Hive,
which stores table metadata in a metastore that is backed by a relational
database such as MySQL. It tracks partition locations in the metastore, but not
-individual data files. Trino queries using the :doc:`/connector/hive` must
+individual data files. Trino queries using the {doc}`/connector/hive` must
first call the metastore to get partition locations, then call the underlying
file system to list all data files inside each partition, and then read metadata
from each data file.
@@ -31,47 +29,48 @@ from each data file.
Since Iceberg stores the paths to data files in the metadata files, it only
consults the underlying file system for files that must be read.
-Requirements
-------------
+## Requirements
To use Iceberg, you need:
-* Network access from the Trino coordinator and workers to the distributed
+- Network access from the Trino coordinator and workers to the distributed
object storage.
-* Access to a :ref:`Hive metastore service (HMS) `, an
- :ref:`AWS Glue catalog `, a :ref:`JDBC catalog
- `, a :ref:`REST catalog `, or a
- :ref:`Nessie server `.
-* Data files stored in a supported file format. These can be configured using
+
+- Access to a {ref}`Hive metastore service (HMS) <hive-thrift-metastore>`, an
+ {ref}`AWS Glue catalog <iceberg-glue-catalog>`, a {ref}`JDBC catalog
+ <iceberg-jdbc-catalog>`, a {ref}`REST catalog <iceberg-rest-catalog>`, or a
+ {ref}`Nessie server <iceberg-nessie-catalog>`.
+
+- Data files stored in a supported file format. These can be configured using
file format configuration properties per catalog:
-
- - :ref:`ORC `
- - :ref:`Parquet ` (default)
-General configuration
----------------------
+ - {ref}`ORC <hive-orc-configuration>`
+ - {ref}`Parquet <hive-parquet-configuration>` (default)
+
+## General configuration
To configure the Iceberg connector, create a catalog properties file
-``etc/catalog/example.properties`` that references the ``iceberg``
+`etc/catalog/example.properties` that references the `iceberg`
connector and defines a metastore type. The Hive metastore catalog is the
-default implementation. To use a :ref:`Hive metastore `,
-``iceberg.catalog.type`` must be set to ``hive_metastore`` and
-``hive.metastore.uri`` must be configured:
+default implementation. To use a {ref}`Hive metastore `,
+`iceberg.catalog.type` must be set to `hive_metastore` and
+`hive.metastore.uri` must be configured:
-.. code-block:: properties
-
- connector.name=iceberg
- iceberg.catalog.type=hive_metastore
- hive.metastore.uri=thrift://example.net:9083
+```properties
+connector.name=iceberg
+iceberg.catalog.type=hive_metastore
+hive.metastore.uri=thrift://example.net:9083
+```
Other metadata catalog types as listed in the requirements section of this topic
are available. Each metastore type has specific configuration properties along
-with :ref:`general metastore configuration properties
+with {ref}`general metastore configuration properties
<general-metastore-properties>`.
The following configuration properties are independent of which catalog
implementation is used:
+```{eval-rst}
.. list-table:: Iceberg general configuration properties
:widths: 30, 58, 12
:header-rows: 1
@@ -158,15 +157,15 @@ implementation is used:
* - ``iceberg.register-table-procedure.enabled``
- Enable to allow user to call ``register_table`` procedure.
- ``false``
+```
-Type mapping
-------------
+## Type mapping
The connector reads and writes data into the supported data file formats Avro,
ORC, and Parquet, following the Iceberg specification.
Because Trino and Iceberg each support types that the other does not, this
-connector :ref:`modifies some types ` when reading or
+connector {ref}`modifies some types <type-mapping-overview>` when reading or
writing data. Data types may not map the same way in both directions between
Trino and the data source. Refer to the following sections for type mapping in
each direction.
@@ -174,16 +173,16 @@ each direction.
The Iceberg specification includes supported data types and the mapping to the
formatting in the Avro, ORC, or Parquet files:
-* `Iceberg to Avro `_
-* `Iceberg to ORC `_
-* `Iceberg to Parquet `_
+- [Iceberg to Avro](https://iceberg.apache.org/spec/#avro)
+- [Iceberg to ORC](https://iceberg.apache.org/spec/#orc)
+- [Iceberg to Parquet](https://iceberg.apache.org/spec/#parquet)
-Iceberg to Trino type mapping
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Iceberg to Trino type mapping
The connector maps Iceberg types to the corresponding Trino types according to
the following table:
+```{eval-rst}
.. list-table:: Iceberg to Trino type mapping
:widths: 40, 60
:header-rows: 1
@@ -224,15 +223,16 @@ the following table:
- ``ARRAY(e)``
* - ``MAP(k,v)``
- ``MAP(k,v)``
+```
No other types are supported.
-Trino to Iceberg type mapping
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Trino to Iceberg type mapping
The connector maps Trino types to the corresponding Iceberg types according to
the following table:
+```{eval-rst}
.. list-table:: Trino to Iceberg type mapping
:widths: 40, 60
:header-rows: 1
@@ -271,24 +271,24 @@ the following table:
- ``LIST(e)``
* - ``MAP(k,v)``
- ``MAP(k,v)``
+```
No other types are supported.
-Security
---------
+## Security
The Iceberg connector allows you to choose one of several means of providing
authorization at the catalog level.
-.. _iceberg-authorization:
+(iceberg-authorization)=
-Authorization checks
-^^^^^^^^^^^^^^^^^^^^
+### Authorization checks
You can enable authorization checks for the connector by setting the
-``iceberg.security`` property in the catalog properties file. This property must
+`iceberg.security` property in the catalog properties file. This property must
be one of the following values:
+```{eval-rst}
.. list-table:: Iceberg security values
:widths: 30, 60
:header-rows: 1
@@ -310,370 +310,386 @@ be one of the following values:
catalog configuration property. See
:ref:`catalog-file-based-access-control` for information on the
authorization configuration file.
+```
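
For example, a catalog restricted to read operations needs a single property in its
catalog file; this sketch assumes `read_only` is the desired value from the table above:

```properties
# Hypothetical: allow only read operations against this catalog
iceberg.security=read_only
```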
-.. _iceberg-sql-support:
+(iceberg-sql-support)=
-SQL support
------------
+## SQL support
This connector provides read access and write access to data and metadata in
-Iceberg. In addition to the :ref:`globally available `
-and :ref:`read operation ` statements, the connector
+Iceberg. In addition to the {ref}`globally available <sql-globally-available>`
+and {ref}`read operation <sql-read-operations>` statements, the connector
supports the following features:
-* :ref:`sql-write-operations`:
+- {ref}`sql-write-operations`:
- * :ref:`iceberg-schema-table-management` and :ref:`iceberg-tables`
- * :ref:`iceberg-data-management`
- * :ref:`sql-view-management`
- * :ref:`sql-materialized-view-management`, see also :ref:`iceberg-materialized-views`
+ - {ref}`iceberg-schema-table-management` and {ref}`iceberg-tables`
+ - {ref}`iceberg-data-management`
+ - {ref}`sql-view-management`
+ - {ref}`sql-materialized-view-management`, see also {ref}`iceberg-materialized-views`
-Basic usage examples
-^^^^^^^^^^^^^^^^^^^^
+### Basic usage examples
The connector supports creating schemas. You can create a schema with or without
a specified location.
-You can create a schema with the :doc:`/sql/create-schema` statement and the
-``location`` schema property. The tables in this schema, which have no explicit
-``location`` set in :doc:`/sql/create-table` statement, are located in a
+You can create a schema with the {doc}`/sql/create-schema` statement and the
+`location` schema property. The tables in this schema, which have no explicit
+`location` set in {doc}`/sql/create-table` statement, are located in a
subdirectory under the directory corresponding to the schema location.
-Create a schema on S3::
+Create a schema on S3:
- CREATE SCHEMA example.example_s3_schema
- WITH (location = 's3://my-bucket/a/path/');
+```
+CREATE SCHEMA example.example_s3_schema
+WITH (location = 's3://my-bucket/a/path/');
+```
-Create a schema on an S3-compatible object storage such as MinIO::
+Create a schema on an S3-compatible object storage such as MinIO:
- CREATE SCHEMA example.example_s3a_schema
- WITH (location = 's3a://my-bucket/a/path/');
+```
+CREATE SCHEMA example.example_s3a_schema
+WITH (location = 's3a://my-bucket/a/path/');
+```
-Create a schema on HDFS::
+Create a schema on HDFS:
- CREATE SCHEMA example.example_hdfs_schema
- WITH (location='hdfs://hadoop-master:9000/user/hive/warehouse/a/path/');
+```
+CREATE SCHEMA example.example_hdfs_schema
+WITH (location='hdfs://hadoop-master:9000/user/hive/warehouse/a/path/');
+```
-Optionally, on HDFS, the location can be omitted::
+Optionally, on HDFS, the location can be omitted:
- CREATE SCHEMA example.example_hdfs_schema;
+```
+CREATE SCHEMA example.example_hdfs_schema;
+```
-The Iceberg connector supports creating tables using the :doc:`CREATE TABLE
-` syntax. Optionally, specify the :ref:`table properties
-` supported by this connector::
+The Iceberg connector supports creating tables using the {doc}`CREATE TABLE
+` syntax. Optionally, specify the {ref}`table properties
+` supported by this connector:
- CREATE TABLE example_table (
- c1 INTEGER,
- c2 DATE,
- c3 DOUBLE
- )
- WITH (
- format = 'PARQUET',
- partitioning = ARRAY['c1', 'c2'],
- sorted_by = ARRAY['c3'],
- location = 's3://my-bucket/a/path/'
- );
+```
+CREATE TABLE example_table (
+ c1 INTEGER,
+ c2 DATE,
+ c3 DOUBLE
+)
+WITH (
+ format = 'PARQUET',
+ partitioning = ARRAY['c1', 'c2'],
+ sorted_by = ARRAY['c3'],
+ location = 's3://my-bucket/a/path/'
+);
+```
-When the ``location`` table property is omitted, the content of the table is
+When the `location` table property is omitted, the content of the table is
stored in a subdirectory under the directory corresponding to the schema
location.
-The Iceberg connector supports creating tables using the :doc:`CREATE TABLE AS
-` with :doc:`SELECT ` syntax::
-
- CREATE TABLE tiny_nation
- WITH (
- format = 'PARQUET'
- )
- AS
- SELECT *
- FROM nation
- WHERE nationkey < 10;
-
-Another flavor of creating tables with :doc:`CREATE TABLE AS
-` is with :doc:`VALUES ` syntax::
-
- CREATE TABLE yearly_clicks (
- year,
- clicks
- )
- WITH (
- partitioning = ARRAY['year']
- )
- AS VALUES
- (2021, 10000),
- (2022, 20000);
-
-Procedures
-^^^^^^^^^^
-
-Use the :doc:`/sql/call` statement to perform data manipulation or
+The Iceberg connector supports creating tables using the {doc}`CREATE TABLE AS
+` with {doc}`SELECT ` syntax:
+
+```
+CREATE TABLE tiny_nation
+WITH (
+ format = 'PARQUET'
+)
+AS
+ SELECT *
+ FROM nation
+ WHERE nationkey < 10;
+```
+
+Another flavor of creating tables with {doc}`CREATE TABLE AS
+` is with {doc}`VALUES ` syntax:
+
+```
+CREATE TABLE yearly_clicks (
+ year,
+ clicks
+)
+WITH (
+ partitioning = ARRAY['year']
+)
+AS VALUES
+ (2021, 10000),
+ (2022, 20000);
+```
+
+### Procedures
+
+Use the {doc}`/sql/call` statement to perform data manipulation or
administrative tasks. Procedures are available in the system schema of each
catalog. The following code snippet displays how to call the
-``example_procedure`` in the ``examplecatalog`` catalog::
+`example_procedure` in the `examplecatalog` catalog:
- CALL examplecatalog.system.example_procedure()
+```
+CALL examplecatalog.system.example_procedure()
+```
-.. _iceberg-register-table:
+(iceberg-register-table)=
+
+#### Register table
-Register table
-""""""""""""""
The connector can register existing Iceberg tables with the catalog.
-The procedure ``system.register_table`` allows the caller to register an
+The procedure `system.register_table` allows the caller to register an
existing Iceberg table in the metastore, using its existing metadata and data
-files::
+files:
- CALL example.system.register_table(schema_name => 'testdb', table_name => 'customer_orders', table_location => 'hdfs://hadoop-master:9000/user/hive/warehouse/customer_orders-581fad8517934af6be1857a903559d44')
+```
+CALL example.system.register_table(schema_name => 'testdb', table_name => 'customer_orders', table_location => 'hdfs://hadoop-master:9000/user/hive/warehouse/customer_orders-581fad8517934af6be1857a903559d44')
+```
In addition, you can provide a file name to register a table with specific
metadata. This may be used to register the table with some specific table state,
or may be necessary if the connector cannot automatically figure out the
-metadata version to use::
+metadata version to use:
- CALL example.system.register_table(schema_name => 'testdb', table_name => 'customer_orders', table_location => 'hdfs://hadoop-master:9000/user/hive/warehouse/customer_orders-581fad8517934af6be1857a903559d44', metadata_file_name => '00003-409702ba-4735-4645-8f14-09537cc0b2c8.metadata.json')
+```
+CALL example.system.register_table(schema_name => 'testdb', table_name => 'customer_orders', table_location => 'hdfs://hadoop-master:9000/user/hive/warehouse/customer_orders-581fad8517934af6be1857a903559d44', metadata_file_name => '00003-409702ba-4735-4645-8f14-09537cc0b2c8.metadata.json')
+```
To prevent unauthorized users from accessing data, this procedure is disabled by
default. The procedure is enabled only when
-``iceberg.register-table-procedure.enabled`` is set to ``true``.
+`iceberg.register-table-procedure.enabled` is set to `true`.
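
Enabling it is a single catalog file entry:

```properties
# Opt in to the register_table procedure for this catalog
iceberg.register-table-procedure.enabled=true
```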
+
+(iceberg-unregister-table)=
-.. _iceberg-unregister-table:
+#### Unregister table
-Unregister table
-""""""""""""""""
The connector can unregister existing Iceberg tables from the catalog.
-The procedure ``system.unregister_table`` allows the caller to unregister an
-existing Iceberg table from the metastores without deleting the data::
+The procedure `system.unregister_table` allows the caller to unregister an
+existing Iceberg table from the metastore without deleting the data:
- CALL example.system.unregister_table(schema_name => 'testdb', table_name => 'customer_orders')
+```
+CALL example.system.unregister_table(schema_name => 'testdb', table_name => 'customer_orders')
+```
-Migrate table
-"""""""""""""
+#### Migrate table
The connector can read from or write to Hive tables that have been migrated to
Iceberg.
-Use the procedure ``system.migrate`` to move a table from the Hive format to the
+Use the procedure `system.migrate` to move a table from the Hive format to the
Iceberg format, loaded with the source’s data files. Table schema, partitioning,
properties, and location are copied from the source table. A bucketed Hive table
will be migrated as a non-bucketed Iceberg table. The data files in the Hive table
must use the Parquet, ORC, or Avro file format.
-The procedure must be called for a specific catalog ``example`` with the
+The procedure must be called for a specific catalog `example` with the
relevant schema and table names supplied with the required parameters
-``schema_name`` and ``table_name``::
+`schema_name` and `table_name`:
- CALL example.system.migrate(
- schema_name => 'testdb',
- table_name => 'customer_orders')
+```
+CALL example.system.migrate(
+ schema_name => 'testdb',
+ table_name => 'customer_orders')
+```
Migrate fails if any table partition uses an unsupported file format.
-In addition, you can provide a ``recursive_directory`` argument to migrate a
-Hive table that contains subdirectories::
+In addition, you can provide a `recursive_directory` argument to migrate a
+Hive table that contains subdirectories:
- CALL example.system.migrate(
- schema_name => 'testdb',
- table_name => 'customer_orders',
- recursive_directory => 'true')
+```
+CALL example.system.migrate(
+ schema_name => 'testdb',
+ table_name => 'customer_orders',
+ recursive_directory => 'true')
+```
-The default value is ``fail``, which causes the migrate procedure to throw an
-exception if subdirectories are found. Set the value to ``true`` to migrate
-nested directories, or ``false`` to ignore them.
+The default value of `recursive_directory` is `fail`, which causes the
+procedure to throw an exception if subdirectories are found. Set the value to
+`true` to migrate nested directories, or `false` to ignore them.
-.. _iceberg-data-management:
+(iceberg-data-management)=
-Data management
-^^^^^^^^^^^^^^^
+### Data management
-The :ref:`sql-data-management` functionality includes support for ``INSERT``,
-``UPDATE``, ``DELETE``, and ``MERGE`` statements.
+The {ref}`sql-data-management` functionality includes support for `INSERT`,
+`UPDATE`, `DELETE`, and `MERGE` statements.
-.. _iceberg-delete:
+(iceberg-delete)=
-Deletion by partition
-"""""""""""""""""""""
+#### Deletion by partition
For partitioned tables, the Iceberg connector supports the deletion of entire
-partitions if the ``WHERE`` clause specifies filters only on the
+partitions if the `WHERE` clause specifies filters only on the
identity-transformed partitioning columns, such that the filters match entire partitions.
-Given the table definition from :ref:`Partitioned Tables `
+Given the table definition from {ref}`Partitioned Tables `
section, the following SQL statement deletes all partitions for which
-``country`` is ``US``::
+`country` is `US`:
- DELETE FROM example.testdb.customer_orders
- WHERE country = 'US'
+```sql
+DELETE FROM example.testdb.customer_orders
+WHERE country = 'US'
+```
-A partition delete is performed if the ``WHERE`` clause meets these conditions.
+A partition delete is performed if the `WHERE` clause meets these conditions.
-Row level deletion
-""""""""""""""""""
+#### Row level deletion
Tables using v2 of the Iceberg specification support deletion of individual rows
by writing position delete files.
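+For example, a `DELETE` with a predicate on a non-partition column removes
+individual rows by writing position delete files instead of rewriting whole
+data files. The `order_status` column here is hypothetical:
+
+```sql
+DELETE FROM example.testdb.customer_orders
+WHERE order_status = 'CANCELLED'
+```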
-.. _iceberg-schema-table-management:
+(iceberg-schema-table-management)=
-Schema and table management
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+### Schema and table management
-The :ref:`sql-schema-table-management` functionality includes support for:
+The {ref}`sql-schema-table-management` functionality includes support for:
-* :doc:`/sql/create-schema`
-* :doc:`/sql/drop-schema`
-* :doc:`/sql/alter-schema`
-* :doc:`/sql/create-table`
-* :doc:`/sql/create-table-as`
-* :doc:`/sql/drop-table`
-* :doc:`/sql/alter-table`
-* :doc:`/sql/comment`
+- {doc}`/sql/create-schema`
+- {doc}`/sql/drop-schema`
+- {doc}`/sql/alter-schema`
+- {doc}`/sql/create-table`
+- {doc}`/sql/create-table-as`
+- {doc}`/sql/drop-table`
+- {doc}`/sql/alter-table`
+- {doc}`/sql/comment`
-Schema evolution
-""""""""""""""""
+#### Schema evolution
Iceberg supports schema evolution, with safe column add, drop, reorder, and
rename operations, including in nested structures. Table partitioning can also
be changed and the connector can still query data created before the
partitioning change.
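+For example, the column operations map to standard `ALTER TABLE` statements;
+the `zip` column used here is hypothetical:
+
+```sql
+ALTER TABLE example.testdb.customer_orders ADD COLUMN zip varchar;
+ALTER TABLE example.testdb.customer_orders RENAME COLUMN zip TO zip_code;
+ALTER TABLE example.testdb.customer_orders DROP COLUMN zip_code;
+```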
-.. _iceberg-alter-table-execute:
+(iceberg-alter-table-execute)=
-ALTER TABLE EXECUTE
-"""""""""""""""""""
+#### ALTER TABLE EXECUTE
-The connector supports the following commands for use with :ref:`ALTER TABLE
+The connector supports the following commands for use with {ref}`ALTER TABLE
EXECUTE `.
-optimize
-~~~~~~~~
+##### optimize
-The ``optimize`` command is used for rewriting the active content of the
+The `optimize` command is used for rewriting the active content of the
specified table so that it is merged into fewer but larger files. If the
table is partitioned, data compaction acts separately on each partition
selected for optimization. This operation improves read performance.
-All files with a size below the optional ``file_size_threshold`` parameter
-(default value for the threshold is ``100MB``) are merged:
+All files with a size below the optional `file_size_threshold` parameter
+(default value for the threshold is `100MB`) are merged:
-.. code-block:: sql
-
- ALTER TABLE test_table EXECUTE optimize
+```sql
+ALTER TABLE test_table EXECUTE optimize
+```
The following statement merges the files in a table that are under 10 megabytes
in size:
-.. code-block:: sql
-
- ALTER TABLE test_table EXECUTE optimize(file_size_threshold => '10MB')
+```sql
+ALTER TABLE test_table EXECUTE optimize(file_size_threshold => '10MB')
+```
-You can use a ``WHERE`` clause with the columns used to partition the table, to
-apply ``optimize`` only on the partitions corresponding to the filter:
+You can use a `WHERE` clause with the columns used to partition the table
+to apply `optimize` only to the partitions matching the filter:
-.. code-block:: sql
+```sql
+ALTER TABLE test_partitioned_table EXECUTE optimize
+WHERE partition_key = 1
+```
- ALTER TABLE test_partitioned_table EXECUTE optimize
- WHERE partition_key = 1
+##### expire_snapshots
-expire_snapshots
-~~~~~~~~~~~~~~~~
-
-The ``expire_snapshots`` command removes all snapshots and all related metadata
+The `expire_snapshots` command removes all snapshots and all related metadata
and data files. Regularly expiring snapshots is recommended to delete data files
that are no longer needed, and to keep the size of table metadata small. The
procedure affects all snapshots that are older than the time period configured
-with the ``retention_threshold`` parameter.
-
-``expire_snapshots`` can be run as follows:
+with the `retention_threshold` parameter.
-.. code-block:: sql
+`expire_snapshots` can be run as follows:
- ALTER TABLE test_table EXECUTE expire_snapshots(retention_threshold => '7d')
+```sql
+ALTER TABLE test_table EXECUTE expire_snapshots(retention_threshold => '7d')
+```
-The value for ``retention_threshold`` must be higher than or equal to
-``iceberg.expire_snapshots.min-retention`` in the catalog, otherwise the
-procedure fails with a similar message: ``Retention specified (1.00d) is shorter
-than the minimum retention configured in the system (7.00d)``. The default value
-for this property is ``7d``.
+The value for `retention_threshold` must be higher than or equal to
+`iceberg.expire_snapshots.min-retention` in the catalog; otherwise, the
+procedure fails with a message similar to: `Retention specified (1.00d) is
+shorter than the minimum retention configured in the system (7.00d)`. The
+default value for this property is `7d`.
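+If you need a shorter retention period, lower the minimum first in the
+catalog properties file. The value `1d` below is only an example:
+
+```text
+iceberg.expire_snapshots.min-retention=1d
+```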
-remove_orphan_files
-~~~~~~~~~~~~~~~~~~~
+##### remove_orphan_files
-The ``remove_orphan_files`` command removes all files from a table's data
+The `remove_orphan_files` command removes all files from a table's data
directory that are not linked from metadata files and that are older than the
-value of ``retention_threshold`` parameter. Deleting orphan files from time to
+value of the `retention_threshold` parameter. Deleting orphan files from time to
time is recommended to keep the size of a table's data directory under control.
-``remove_orphan_files`` can be run as follows:
-
-.. code-block:: sql
+`remove_orphan_files` can be run as follows:
- ALTER TABLE test_table EXECUTE remove_orphan_files(retention_threshold => '7d')
+```sql
+ALTER TABLE test_table EXECUTE remove_orphan_files(retention_threshold => '7d')
+```
-The value for ``retention_threshold`` must be higher than or equal to
-``iceberg.remove_orphan_files.min-retention`` in the catalog otherwise the
-procedure fails with a similar message: ``Retention specified (1.00d) is shorter
-than the minimum retention configured in the system (7.00d)``. The default value
-for this property is ``7d``.
+The value for `retention_threshold` must be higher than or equal to
+`iceberg.remove_orphan_files.min-retention` in the catalog; otherwise, the
+procedure fails with a message similar to: `Retention specified (1.00d) is
+shorter than the minimum retention configured in the system (7.00d)`. The
+default value for this property is `7d`.
-.. _drop-extended-stats:
+(drop-extended-stats)=
-drop_extended_stats
-~~~~~~~~~~~~~~~~~~~
+##### drop_extended_stats
-The ``drop_extended_stats`` command removes all extended statistics information
+The `drop_extended_stats` command removes all extended statistics information
from the table.
-``drop_extended_stats`` can be run as follows:
+`drop_extended_stats` can be run as follows:
-.. code-block:: sql
+```sql
+ALTER TABLE test_table EXECUTE drop_extended_stats
+```
- ALTER TABLE test_table EXECUTE drop_extended_stats
+(iceberg-alter-table-set-properties)=
-.. _iceberg-alter-table-set-properties:
-
-ALTER TABLE SET PROPERTIES
-""""""""""""""""""""""""""
+#### ALTER TABLE SET PROPERTIES
The connector supports modifying the properties on existing tables using
-:ref:`ALTER TABLE SET PROPERTIES `.
+{ref}`ALTER TABLE SET PROPERTIES `.
The following table properties can be updated after a table is created:
-* ``format``
-* ``format_version``
-* ``partitioning``
-* ``sorted_by``
+- `format`
+- `format_version`
+- `partitioning`
+- `sorted_by`
For example, to update a table from v1 of the Iceberg specification to v2:
-.. code-block:: sql
-
- ALTER TABLE table_name SET PROPERTIES format_version = 2;
+```sql
+ALTER TABLE table_name SET PROPERTIES format_version = 2;
+```
-Or to set the column ``my_new_partition_column`` as a partition column on a
+Or to set the column `my_new_partition_column` as a partition column on a
table:
-.. code-block:: sql
+```sql
+ALTER TABLE table_name SET PROPERTIES partitioning = ARRAY[