diff --git a/docs/src/main/sphinx/connector/delta-lake.rst b/docs/src/main/sphinx/connector/delta-lake.rst
index 59bdd78f4d8e..ad98019d280e 100644
--- a/docs/src/main/sphinx/connector/delta-lake.rst
+++ b/docs/src/main/sphinx/connector/delta-lake.rst
@@ -22,7 +22,8 @@ To connect to Databricks Delta Lake, you need:
 * Deployments using AWS, HDFS, Azure Storage, and Google Cloud Storage (GCS)
   are fully supported.
 * Network access from the coordinator and workers to the Delta Lake storage.
-* Access to the Hive metastore service (HMS) of Delta Lake or a separate HMS.
+* Access to the Hive metastore service (HMS) of Delta Lake or a separate HMS,
+  or a Glue metastore.
 * Network access to the HMS from the coordinator and workers. Port 9083 is the
   default port for the Thrift protocol used by the HMS.
 * Data files stored in the Parquet file format. These can be configured using
@@ -32,36 +33,31 @@ To connect to Databricks Delta Lake, you need:
 General configuration
 ---------------------
 
-The connector requires a Hive metastore for table metadata and supports the same
-metastore configuration properties as the :doc:`Hive connector
-</connector/hive>`. At a minimum, ``hive.metastore.uri`` must be configured.
-
-The connector recognizes Delta tables created in the metastore by the Databricks
-runtime. If non-Delta tables are present in the metastore as well, they are not
-visible to the connector.
-
 To configure the Delta Lake connector, create a catalog properties file
 ``etc/catalog/example.properties`` that references the ``delta_lake``
-connector. Update the ``hive.metastore.uri`` with the URI of your Hive metastore
-Thrift service:
+connector and defines a metastore. You must configure a metastore for table
+metadata. If you are using a :ref:`Hive metastore <hive-thrift-metastore>`,
+``hive.metastore.uri`` must be configured:
 
 ..
code-block:: properties
 
    connector.name=delta_lake
    hive.metastore.uri=thrift://example.net:9083
 
-If you are using AWS Glue as Hive metastore, you can simply set the metastore to
-``glue``:
+If you are using :ref:`AWS Glue <hive-glue-metastore>` as your metastore, you
+must instead set ``hive.metastore`` to ``glue``:
 
 .. code-block:: properties
 
    connector.name=delta_lake
    hive.metastore=glue
 
-The Delta Lake connector reuses certain functionalities from the Hive connector,
-including the metastore :ref:`Thrift <hive-thrift-metastore>` and :ref:`Glue
-<hive-glue-metastore>` configuration, detailed in the :doc:`Hive connector
-documentation </connector/hive>`.
+Each metastore type has specific configuration properties along with
+:ref:`general metastore configuration properties <general-metastore-properties>`.
+
+The connector recognizes Delta Lake tables created in the metastore by the Databricks
+runtime. If non-Delta Lake tables are present in the metastore as well, they are not
+visible to the connector.
 
 To configure access to S3 and S3-compatible storage, Azure storage, and others,
 consult the appropriate section of the Hive documentation:
 
diff --git a/docs/src/main/sphinx/connector/hive.rst b/docs/src/main/sphinx/connector/hive.rst
index 3ab0b9973455..852aeca2b3b8 100644
--- a/docs/src/main/sphinx/connector/hive.rst
+++ b/docs/src/main/sphinx/connector/hive.rst
@@ -10,6 +10,7 @@ Hive connector
    :maxdepth: 1
    :hidden:
 
+   Metastores <metastores>
    Security <hive-security>
    Amazon S3 <hive-s3>
    Azure Storage <hive-azure>
@@ -38,9 +39,10 @@ It does not use HiveQL or any part of Hive's execution environment.
 
 Requirements
 ------------
 
-The Hive connector requires a Hive metastore service (HMS), or a compatible
-implementation of the Hive metastore, such as
-`AWS Glue Data Catalog <https://aws.amazon.com/glue/>`_.
+The Hive connector requires a
+:ref:`Hive metastore service <hive-thrift-metastore>` (HMS), or a compatible
+implementation of the Hive metastore, such as
+:ref:`AWS Glue <hive-glue-metastore>`.
 
 Apache Hadoop HDFS 2.x and 3.x are supported.
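The Glue variant of the Delta Lake configuration shown earlier can be combined with Glue-specific settings in a single catalog file. A minimal sketch, assuming a Glue metastore; the region value is illustrative, and ``hive.metastore.glue.region`` is documented with the other properties in the AWS Glue catalog configuration properties section:

.. code-block:: properties

   connector.name=delta_lake
   hive.metastore=glue
   # Region value is illustrative; it is required when Trino is not running
   # in EC2, or when the catalog is in a different region.
   hive.metastore.glue.region=us-east-1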
@@ -71,16 +73,28 @@ configured using file format configuration properties per catalog:
 
 General configuration
 ---------------------
 
-Create ``etc/catalog/example.properties`` with the following contents
-to mount the ``hive`` connector as the ``example`` catalog,
-replacing ``example.net:9083`` with the correct host and port
-for your Hive metastore Thrift service:
+To configure the Hive connector, create a catalog properties file
+``etc/catalog/example.properties`` that references the ``hive``
+connector and defines a metastore. You must configure a metastore for table
+metadata. If you are using a :ref:`Hive metastore <hive-thrift-metastore>`,
+``hive.metastore.uri`` must be configured:
 
-.. code-block:: text
+.. code-block:: properties
 
    connector.name=hive
    hive.metastore.uri=thrift://example.net:9083
 
+If you are using :ref:`AWS Glue <hive-glue-metastore>` as your metastore, you
+must instead set ``hive.metastore`` to ``glue``:
+
+.. code-block:: properties
+
+   connector.name=hive
+   hive.metastore=glue
+
+Each metastore type has specific configuration properties along with
+:ref:`general metastore configuration properties <general-metastore-properties>`.
+
 Multiple Hive clusters
 ^^^^^^^^^^^^^^^^^^^^^^
 
@@ -348,294 +362,6 @@ Hive connector documentation.
     multi-statement write transactions.
   - ``false``
 
-Metastores
-----------
-
-The Hive connector supports the use of the Hive Metastore Service (HMS) and AWS
-Glue data catalog.
-
-Additionally, accessing tables with Athena partition projection metadata, as
-well as first class support for Avro tables, are available with additional
-configuration.
-
-General metastore configuration properties
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The required Hive metastore can be configured with a number of properties.
-Specific properties can be used to further configure the
-`Thrift <#thrift-metastore-configuration-properties>`__ or
-`Glue <#aws-glue-catalog-configuration-properties>`__ metastore.
-
-..
list-table:: General metastore configuration properties - :widths: 35, 50, 15 - :header-rows: 1 - - * - Property Name - - Description - - Default - * - ``hive.metastore`` - - The type of Hive metastore to use. Trino currently supports the default - Hive Thrift metastore (``thrift``), and the AWS Glue Catalog (``glue``) - as metadata sources. - - ``thrift`` - * - ``hive.metastore-cache.cache-partitions`` - - Enable caching for partition metadata. You can disable caching to avoid - inconsistent behavior that results from it. - - ``true`` - * - ``hive.metastore-cache-ttl`` - - Duration of how long cached metastore data is considered valid. - - ``0s`` - * - ``hive.metastore-stats-cache-ttl`` - - Duration of how long cached metastore statistics are considered valid. - If ``hive.metastore-cache-ttl`` is larger then it takes precedence - over ``hive.metastore-stats-cache-ttl``. - - ``5m`` - * - ``hive.metastore-cache-maximum-size`` - - Maximum number of metastore data objects in the Hive metastore cache. - - ``10000`` - * - ``hive.metastore-refresh-interval`` - - Asynchronously refresh cached metastore data after access if it is older - than this but is not yet expired, allowing subsequent accesses to see - fresh data. - - - * - ``hive.metastore-refresh-max-threads`` - - Maximum threads used to refresh cached metastore data. - - ``10`` - * - ``hive.metastore-timeout`` - - Timeout for Hive metastore requests. - - ``10s`` - * - ``hive.hide-delta-lake-tables`` - - Controls whether to hide Delta Lake tables in table listings. Currently - applies only when using the AWS Glue metastore. - - ``false`` - -.. _hive-thrift-metastore: - -Thrift metastore configuration properties -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -In order to use a Hive Thrift metastore, you must configure the metastore with -``hive.metastore=thrift`` and provide further details with the following -properties: - -.. 
list-table:: Thrift metastore configuration properties - :widths: 35, 50, 15 - :header-rows: 1 - - * - Property name - - Description - - Default - * - ``hive.metastore.uri`` - - The URIs of the Hive metastore to connect to using the Thrift protocol. - If a comma-separated list of URIs is provided, the first URI is used by - default, and the rest of the URIs are fallback metastores. This property - is required. Example: ``thrift://192.0.2.3:9083`` or - ``thrift://192.0.2.3:9083,thrift://192.0.2.4:9083`` - - - * - ``hive.metastore.username`` - - The username Trino uses to access the Hive metastore. - - - * - ``hive.metastore.authentication.type`` - - Hive metastore authentication type. Possible values are ``NONE`` or - ``KERBEROS``. - - ``NONE`` - * - ``hive.metastore.thrift.impersonation.enabled`` - - Enable Hive metastore end user impersonation. - - - * - ``hive.metastore.thrift.use-spark-table-statistics-fallback`` - - Enable usage of table statistics generated by Apache Spark when Hive - table statistics are not available. - - ``true`` - * - ``hive.metastore.thrift.delegation-token.cache-ttl`` - - Time to live delegation token cache for metastore. - - ``1h`` - * - ``hive.metastore.thrift.delegation-token.cache-maximum-size`` - - Delegation token cache maximum size. - - ``1000`` - * - ``hive.metastore.thrift.client.ssl.enabled`` - - Use SSL when connecting to metastore. - - ``false`` - * - ``hive.metastore.thrift.client.ssl.key`` - - Path to private key and client certification (key store). - - - * - ``hive.metastore.thrift.client.ssl.key-password`` - - Password for the private key. - - - * - ``hive.metastore.thrift.client.ssl.trust-certificate`` - - Path to the server certificate chain (trust store). Required when SSL is - enabled. - - - * - ``hive.metastore.thrift.client.ssl.trust-certificate-password`` - - Password for the trust store. - - - * - ``hive.metastore.service.principal`` - - The Kerberos principal of the Hive metastore service. 
- - - * - ``hive.metastore.client.principal`` - - The Kerberos principal that Trino uses when connecting to the Hive - metastore service. - - - * - ``hive.metastore.client.keytab`` - - Hive metastore client keytab location. - - - * - ``hive.metastore.thrift.delete-files-on-drop`` - - Actively delete the files for managed tables when performing drop table - or partition operations, for cases when the metastore does not delete the - files. - - ``false`` - * - ``hive.metastore.thrift.assume-canonical-partition-keys`` - - Allow the metastore to assume that the values of partition columns can be - converted to string values. This can lead to performance improvements in - queries which apply filters on the partition columns. Partition keys with - a ``TIMESTAMP`` type do not get canonicalized. - - ``false`` - * - ``hive.metastore.thrift.client.socks-proxy`` - - SOCKS proxy to use for the Thrift Hive metastore. - - - * - ``hive.metastore.thrift.client.max-retries`` - - Maximum number of retry attempts for metastore requests. - - ``9`` - * - ``hive.metastore.thrift.client.backoff-scale-factor`` - - Scale factor for metastore request retry delay. - - ``2.0`` - * - ``hive.metastore.thrift.client.max-retry-time`` - - Total allowed time limit for a metastore request to be retried. - - ``30s`` - * - ``hive.metastore.thrift.client.min-backoff-delay`` - - Minimum delay between metastore request retries. - - ``1s`` - * - ``hive.metastore.thrift.client.max-backoff-delay`` - - Maximum delay between metastore request retries. - - ``1s`` - * - ``hive.metastore.thrift.txn-lock-max-wait`` - - Maximum time to wait to acquire hive transaction lock. - - ``10m`` - -.. _hive-glue-metastore: - -AWS Glue catalog configuration properties -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -In order to use a Glue catalog, you must configure the metastore with -``hive.metastore=glue`` and provide further details with the following -properties: - -.. 
list-table:: AWS Glue catalog configuration properties - :widths: 35, 50, 15 - :header-rows: 1 - - * - Property Name - - Description - - Default - * - ``hive.metastore.glue.region`` - - AWS region of the Glue Catalog. This is required when not running in - EC2, or when the catalog is in a different region. Example: - ``us-east-1`` - - - * - ``hive.metastore.glue.endpoint-url`` - - Glue API endpoint URL (optional). Example: - ``https://glue.us-east-1.amazonaws.com`` - - - * - ``hive.metastore.glue.sts.region`` - - AWS region of the STS service to authenticate with. This is required - when running in a GovCloud region. Example: ``us-gov-east-1`` - - - * - ``hive.metastore.glue.proxy-api-id`` - - The ID of the Glue Proxy API, when accessing Glue via an VPC endpoint in - API Gateway. - - - * - ``hive.metastore.glue.sts.endpoint`` - - STS endpoint URL to use when authenticating to Glue (optional). Example: - ``https://sts.us-gov-east-1.amazonaws.com`` - - - * - ``hive.metastore.glue.pin-client-to-current-region`` - - Pin Glue requests to the same region as the EC2 instance where Trino is - running. - - ``false`` - * - ``hive.metastore.glue.max-connections`` - - Max number of concurrent connections to Glue. - - ``30`` - * - ``hive.metastore.glue.max-error-retries`` - - Maximum number of error retries for the Glue client. - - ``10`` - * - ``hive.metastore.glue.default-warehouse-dir`` - - Default warehouse directory for schemas created without an explicit - ``location`` property. - - - * - ``hive.metastore.glue.aws-credentials-provider`` - - Fully qualified name of the Java class to use for obtaining AWS - credentials. Can be used to supply a custom credentials provider. - - - * - ``hive.metastore.glue.aws-access-key`` - - AWS access key to use to connect to the Glue Catalog. If specified along - with ``hive.metastore.glue.aws-secret-key``, this parameter takes - precedence over ``hive.metastore.glue.iam-role``. 
- - - * - ``hive.metastore.glue.aws-secret-key`` - - AWS secret key to use to connect to the Glue Catalog. If specified along - with ``hive.metastore.glue.aws-access-key``, this parameter takes - precedence over ``hive.metastore.glue.iam-role``. - - - * - ``hive.metastore.glue.catalogid`` - - The ID of the Glue Catalog in which the metadata database resides. - - - * - ``hive.metastore.glue.iam-role`` - - ARN of an IAM role to assume when connecting to the Glue Catalog. - - - * - ``hive.metastore.glue.external-id`` - - External ID for the IAM role trust policy when connecting to the Glue - Catalog. - - - * - ``hive.metastore.glue.partitions-segments`` - - Number of segments for partitioned Glue tables. - - ``5`` - * - ``hive.metastore.glue.get-partition-threads`` - - Number of threads for parallel partition fetches from Glue. - - ``20`` - * - ``hive.metastore.glue.read-statistics-threads`` - - Number of threads for parallel statistic fetches from Glue. - - ``5`` - * - ``hive.metastore.glue.write-statistics-threads`` - - Number of threads for parallel statistic writes to Glue. - - ``5`` - -.. _partition-projection: - -Accessing tables with Athena partition projection metadata -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -`Partition projection `_ -is a feature of AWS Athena often used to speed up query processing with highly -partitioned tables. - -Trino supports partition projection table properties stored in the metastore, -and it reimplements this functionality. Currently, there is a limitation in -comparison to AWS Athena for date projection, as it only supports intervals of -``DAYS``, ``HOURS``, ``MINUTES``, and ``SECONDS``. - -If there are any compatibility issues blocking access to a requested table when -you have partition projection enabled, you can set the -``partition_projection_ignore`` table property to ``true`` for a table to bypass -any errors. 
- -Refer to :ref:`hive-table-properties` and :ref:`hive-column-properties` for -configuration of partition projection. - -Metastore configuration for Avro -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -In order to enable first-class support for Avro tables when using -Hive 3.x, you must add the following property definition to the Hive metastore -configuration file ``hive-site.xml`` and restart the metastore service: - -.. code-block:: xml - - - - metastore.storage.schema.reader.impl - org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader - - Storage ------- diff --git a/docs/src/main/sphinx/connector/hudi.rst b/docs/src/main/sphinx/connector/hudi.rst index 059ea4a37c17..025a42ec9416 100644 --- a/docs/src/main/sphinx/connector/hudi.rst +++ b/docs/src/main/sphinx/connector/hudi.rst @@ -15,7 +15,7 @@ To use the Hudi connector, you need: * Hudi version 0.12.3 or higher. * Network access from the Trino coordinator and workers to the Hudi storage. -* Access to the Hive metastore service (HMS). +* Access to a Hive metastore service (HMS). * Network access from the Trino coordinator to the HMS. * Data files stored in the Parquet file format. These can be configured using :ref:`file format configuration properties ` per @@ -24,21 +24,20 @@ To use the Hudi connector, you need: General configuration --------------------- -The connector requires a Hive metastore for table metadata and supports the same -metastore configuration properties as the :doc:`Hive connector -`. At a minimum, ``hive.metastore.uri`` must be configured. -The connector recognizes Hudi tables synced to the metastore by the -`Hudi sync tool `_. - -To create a catalog that uses the Hudi connector, create a catalog properties -file ``etc/catalog/example.properties`` that references the ``hudi`` connector. 
-Update the ``hive.metastore.uri`` with the URI of your Hive metastore Thrift
-service:
+To configure the Hudi connector, create a catalog properties file
+``etc/catalog/example.properties`` that references the ``hudi``
+connector and defines the HMS to use with the ``hive.metastore.uri``
+configuration property:
 
 .. code-block:: properties
 
    connector.name=hudi
    hive.metastore.uri=thrift://example.net:9083
+
+There are :ref:`HMS configuration properties <hive-thrift-metastore>`
+available for use with the Hudi connector. The connector recognizes Hudi tables
+synced to the metastore by the `Hudi sync tool
+<https://hudi.apache.org/docs/syncing_metastore>`_.
 
 Additionally, following configuration properties can be set depending on the
 use-case:
 
diff --git a/docs/src/main/sphinx/connector/iceberg.rst b/docs/src/main/sphinx/connector/iceberg.rst
index d25f6b3cf615..06906891d731 100644
--- a/docs/src/main/sphinx/connector/iceberg.rst
+++ b/docs/src/main/sphinx/connector/iceberg.rst
@@ -38,10 +38,10 @@ To use Iceberg, you need:
 
 * Network access from the Trino coordinator and workers to the distributed
   object storage.
-* Access to a :ref:`Hive metastore service (HMS)<iceberg-hive-catalog>`, an
-  :ref:`AWS Glue catalog<iceberg-glue-catalog>`, a :ref:`JDBC catalog
-  <iceberg-jdbc-catalog>`, a :ref:`REST catalog<iceberg-rest-catalog>`, or a
-  :ref:`Nessie server<iceberg-nessie-catalog>`.
+* Access to a :ref:`Hive metastore service (HMS) <iceberg-hive-catalog>`, an
+  :ref:`AWS Glue catalog <iceberg-glue-catalog>`, a :ref:`JDBC catalog
+  <iceberg-jdbc-catalog>`, a :ref:`REST catalog <iceberg-rest-catalog>`, or a
+  :ref:`Nessie server <iceberg-nessie-catalog>`.
 * Data files stored in a supported file format. These can be configured using
   file format configuration properties per catalog:
 
@@ -51,8 +51,26 @@ To use Iceberg, you need:
 
 General configuration
 ---------------------
 
-These configuration properties are independent of which catalog implementation
-is used.
+To configure the Iceberg connector, create a catalog properties file
+``etc/catalog/example.properties`` that references the ``iceberg``
+connector and defines a metastore type. The Hive metastore catalog is the
+default implementation.
To use a :ref:`Hive metastore <hive-thrift-metastore>`,
+``iceberg.catalog.type`` must be set to ``hive_metastore`` and
+``hive.metastore.uri`` must be configured:
+
+.. code-block:: properties
+
+   connector.name=iceberg
+   iceberg.catalog.type=hive_metastore
+   hive.metastore.uri=thrift://example.net:9083
+
+Other metadata catalog types, as listed in the requirements section of this
+topic, are also available. Each metastore type has specific configuration
+properties along with :ref:`general metastore configuration properties
+<general-metastore-properties>`.
+
+The following configuration properties are independent of which catalog
+implementation is used:
 
 .. list-table:: Iceberg general configuration properties
    :widths: 30, 58, 12
@@ -61,6 +79,15 @@ is used.
    * - Property name
      - Description
      - Default
+   * - ``iceberg.catalog.type``
+     - Define the metastore type to use. Possible values are:
+
+       * ``hive_metastore``
+       * ``glue``
+       * ``jdbc``
+       * ``rest``
+       * ``nessie``
+     -
    * - ``iceberg.file-format``
      - Define the data storage file format for Iceberg tables. Possible values
       are:
@@ -132,170 +159,6 @@ is used.
      - Enable to allow user to call ``register_table`` procedure.
      - ``false``
 
-Metastores
-----------
-
-The Iceberg table format manages most metadata in metadata files in the object
-storage itself. A small amount of metadata, however, still requires the use of a
-metastore. In the Iceberg ecosystem, these smaller metastores are called Iceberg
-metadata catalogs, or just catalogs. The examples in each subsection depict the
-contents of a Trino catalog file that uses the the Iceberg connector to
-configures different Iceberg metadata catalogs.
-
-The connector supports multiple Iceberg catalog types; you can use either a Hive
-metastore service (HMS), AWS Glue, a REST catalog, or Nessie. The catalog type
-is determined by the ``iceberg.catalog.type`` property. It can be set to
-``HIVE_METASTORE``, ``GLUE``, ``JDBC``, ``REST``, or ``NESSIE``.
-
-..
_iceberg-hive-catalog: - -Hive metastore catalog -^^^^^^^^^^^^^^^^^^^^^^ - -The Hive metastore catalog is the default implementation. When using it, the -Iceberg connector supports the same metastore configuration properties as the -Hive connector. At a minimum, ``hive.metastore.uri`` must be configured. See -:ref:`Thrift metastore configuration`. - -.. code-block:: text - - connector.name=iceberg - hive.metastore.uri=thrift://localhost:9083 - -.. _iceberg-glue-catalog: - -Glue catalog -^^^^^^^^^^^^ - -When using the Glue catalog, the Iceberg connector supports the same -configuration properties as the Hive connector's Glue setup. See :ref:`AWS Glue -metastore configuration`. - -.. code-block:: text - - connector.name=iceberg - iceberg.catalog.type=glue - -.. list-table:: Iceberg Glue catalog configuration properties - :widths: 35, 50, 15 - :header-rows: 1 - - * - Property name - - Description - - Default - * - ``iceberg.glue.skip-archive`` - - Skip archiving an old table version when creating a new version in a - commit. See `AWS Glue Skip Archive - `_. - - ``false`` - -.. _iceberg-rest-catalog: - -REST catalog -^^^^^^^^^^^^^^ - -In order to use the Iceberg REST catalog, ensure to configure the catalog type -with ``iceberg.catalog.type=rest`` and provide further details with the -following properties: - -.. list-table:: Iceberg REST catalog configuration properties - :widths: 40, 60 - :header-rows: 1 - - * - Property name - - Description - * - ``iceberg.rest-catalog.uri`` - - REST server API endpoint URI (required). - Example: ``http://iceberg-with-rest:8181`` - * - ``iceberg.rest-catalog.warehouse`` - - Warehouse identifier/location for the catalog (optional). Example: - ``s3://my_bucket/warehouse_location`` - * - ``iceberg.rest-catalog.security`` - - The type of security to use (default: ``NONE``). ``OAUTH2`` requires - either a ``token`` or ``credential``. 
Example: ``OAUTH2`` - * - ``iceberg.rest-catalog.session`` - - Session information included when communicating with the REST Catalog. - Options are ``NONE`` or ``USER`` (default: ``NONE``). - * - ``iceberg.rest-catalog.oauth2.token`` - - The bearer token used for interactions with the server. A ``token`` or - ``credential`` is required for ``OAUTH2`` security. Example: - ``AbCdEf123456`` - * - ``iceberg.rest-catalog.oauth2.credential`` - - The credential to exchange for a token in the OAuth2 client credentials - flow with the server. A ``token`` or ``credential`` is required for - ``OAUTH2`` security. Example: ``AbCdEf123456`` - -.. code-block:: text - - connector.name=iceberg - iceberg.catalog.type=rest - iceberg.rest-catalog.uri=http://iceberg-with-rest:8181 - -REST catalog does not support :doc:`views` or -:doc:`materialized views`. - -.. _iceberg-nessie-catalog: - -Nessie catalog -^^^^^^^^^^^^^^ - -In order to use a Nessie catalog, ensure to configure the catalog type with -``iceberg.catalog.type=nessie`` and provide further details with the following -properties: - -.. list-table:: Nessie catalog configuration properties - :widths: 40, 60 - :header-rows: 1 - - * - Property name - - Description - * - ``iceberg.nessie-catalog.uri`` - - Nessie API endpoint URI (required). - Example: ``https://localhost:19120/api/v1`` - * - ``iceberg.nessie-catalog.ref`` - - The branch/tag to use for Nessie, defaults to ``main``. - * - ``iceberg.nessie-catalog.default-warehouse-dir`` - - Default warehouse directory for schemas created without an explicit - ``location`` property. Example: ``/tmp`` - -.. code-block:: text - - connector.name=iceberg - iceberg.catalog.type=nessie - iceberg.nessie-catalog.uri=https://localhost:19120/api/v1 - iceberg.nessie-catalog.default-warehouse-dir=/tmp - -.. _iceberg-jdbc-catalog: - -JDBC catalog -^^^^^^^^^^^^ - -.. warning:: - - The JDBC catalog could face the compatibility issue if Iceberg introduces - breaking changes in the future. 
Consider the :ref:`REST catalog - ` as an alternative solution. - -At a minimum, ``iceberg.jdbc-catalog.driver-class``, -``iceberg.jdbc-catalog.connection-url``, and -``iceberg.jdbc-catalog.catalog-name`` must be configured. When using any -database besides PostgreSQL, a JDBC driver jar file must be placed in the plugin -directory. - -.. code-block:: text - - connector.name=iceberg - iceberg.catalog.type=jdbc - iceberg.jdbc-catalog.catalog-name=test - iceberg.jdbc-catalog.driver-class=org.postgresql.Driver - iceberg.jdbc-catalog.connection-url=jdbc:postgresql://example.net:5432/database - iceberg.jdbc-catalog.connection-user=admin - iceberg.jdbc-catalog.connection-password=test - iceberg.jdbc-catalog.default-warehouse-dir=s3://bucket - -JDBC catalog does not support :doc:`views` or -:doc:`materialized views`. - Type mapping ------------ diff --git a/docs/src/main/sphinx/connector/metastores.rst b/docs/src/main/sphinx/connector/metastores.rst new file mode 100644 index 000000000000..5718475ba535 --- /dev/null +++ b/docs/src/main/sphinx/connector/metastores.rst @@ -0,0 +1,465 @@ +========== +Metastores +========== + +Object storage access is mediated through a *metastore*. Metastores provide +information on directory structure, file format, and metadata about the stored +data. Object storage connectors support the use of one or more metastores. A +supported metastore is required to use any object storage connector. + +Additional configuration is required in order to access tables with Athena +partition projection metadata or implement first class support for Avro tables. +These requirements are discussed later in this topic. + +.. _general-metastore-properties: + +General metastore configuration properties +------------------------------------------ + +The following table describes general metastore configuration properties, most +of which are used with either metastore. 
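As an illustration of how these general properties combine with a metastore definition, a hypothetical Hive catalog that tunes metastore caching might look like the following sketch; the host name and all cache values are arbitrary examples:

.. code-block:: properties

   connector.name=hive
   hive.metastore=thrift
   hive.metastore.uri=thrift://example.net:9083
   # Keep cached metastore data for up to 10 minutes, refreshing entries in
   # the background once they are older than 1 minute.
   hive.metastore-cache-ttl=10m
   hive.metastore-refresh-interval=1m
   hive.metastore-timeout=30s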
+
+At a minimum, each Delta Lake, Hive, or Hudi object storage catalog file must
+set the ``hive.metastore`` configuration property to define the type of
+metastore to use. Iceberg catalogs instead use the ``iceberg.catalog.type``
+configuration property.
+
+Additional configuration properties specific to the Thrift and Glue metastores
+are also available. They are discussed later in this topic.
+
+.. list-table:: General metastore configuration properties
+   :widths: 35, 50, 15
+   :header-rows: 1
+
+   * - Property Name
+     - Description
+     - Default
+   * - ``hive.metastore``
+     - The type of Hive metastore to use. Trino currently supports the default
+       Hive Thrift metastore (``thrift``), and the AWS Glue Catalog (``glue``)
+       as metadata sources. You must use this for all object storage catalogs
+       except Iceberg.
+     - ``thrift``
+   * - ``iceberg.catalog.type``
+     - The Iceberg table format manages most metadata in metadata files in the
+       object storage itself. A small amount of metadata, however, still
+       requires the use of a metastore. In the Iceberg ecosystem, these smaller
+       metastores are called Iceberg metadata catalogs, or just catalogs. The
+       examples in each subsection depict the contents of a Trino catalog file
+       that uses the Iceberg connector to configure different Iceberg
+       metadata catalogs.
+
+       You must set this property in all Iceberg catalog property files.
+       Valid values are ``HIVE_METASTORE``, ``GLUE``, ``JDBC``, ``REST``, and
+       ``NESSIE``.
+     -
+   * - ``hive.metastore-cache.cache-partitions``
+     - Enable caching for partition metadata. You can disable caching to avoid
+       inconsistent behavior that results from it.
+     - ``true``
+   * - ``hive.metastore-cache-ttl``
+     - Duration of how long cached metastore data is considered valid.
+     - ``0s``
+   * - ``hive.metastore-stats-cache-ttl``
+     - Duration of how long cached metastore statistics are considered valid.
+ If ``hive.metastore-cache-ttl`` is larger then it takes precedence + over ``hive.metastore-stats-cache-ttl``. + - ``5m`` + * - ``hive.metastore-cache-maximum-size`` + - Maximum number of metastore data objects in the Hive metastore cache. + - ``10000`` + * - ``hive.metastore-refresh-interval`` + - Asynchronously refresh cached metastore data after access if it is older + than this but is not yet expired, allowing subsequent accesses to see + fresh data. + - + * - ``hive.metastore-refresh-max-threads`` + - Maximum threads used to refresh cached metastore data. + - ``10`` + * - ``hive.metastore-timeout`` + - Timeout for Hive metastore requests. + - ``10s`` + * - ``hive.hide-delta-lake-tables`` + - Controls whether to hide Delta Lake tables in table listings. Currently + applies only when using the AWS Glue metastore. + - ``false`` + +.. _hive-thrift-metastore: + +Thrift metastore configuration properties +----------------------------------------- + +In order to use a Hive Thrift metastore, you must configure the metastore with +``hive.metastore=thrift`` and provide further details with the following +properties: + +.. list-table:: Thrift metastore configuration properties + :widths: 35, 50, 15 + :header-rows: 1 + + * - Property name + - Description + - Default + * - ``hive.metastore.uri`` + - The URIs of the Hive metastore to connect to using the Thrift protocol. + If a comma-separated list of URIs is provided, the first URI is used by + default, and the rest of the URIs are fallback metastores. This property + is required. Example: ``thrift://192.0.2.3:9083`` or + ``thrift://192.0.2.3:9083,thrift://192.0.2.4:9083`` + - + * - ``hive.metastore.username`` + - The username Trino uses to access the Hive metastore. + - + * - ``hive.metastore.authentication.type`` + - Hive metastore authentication type. Possible values are ``NONE`` or + ``KERBEROS``. + - ``NONE`` + * - ``hive.metastore.thrift.impersonation.enabled`` + - Enable Hive metastore end user impersonation. 
+ - + * - ``hive.metastore.thrift.use-spark-table-statistics-fallback`` + - Enable usage of table statistics generated by Apache Spark when Hive + table statistics are not available. + - ``true`` + * - ``hive.metastore.thrift.delegation-token.cache-ttl`` + - Time to live delegation token cache for metastore. + - ``1h`` + * - ``hive.metastore.thrift.delegation-token.cache-maximum-size`` + - Delegation token cache maximum size. + - ``1000`` + * - ``hive.metastore.thrift.client.ssl.enabled`` + - Use SSL when connecting to metastore. + - ``false`` + * - ``hive.metastore.thrift.client.ssl.key`` + - Path to private key and client certification (key store). + - + * - ``hive.metastore.thrift.client.ssl.key-password`` + - Password for the private key. + - + * - ``hive.metastore.thrift.client.ssl.trust-certificate`` + - Path to the server certificate chain (trust store). Required when SSL is + enabled. + - + * - ``hive.metastore.thrift.client.ssl.trust-certificate-password`` + - Password for the trust store. + - + * - ``hive.metastore.service.principal`` + - The Kerberos principal of the Hive metastore service. + - + * - ``hive.metastore.client.principal`` + - The Kerberos principal that Trino uses when connecting to the Hive + metastore service. + - + * - ``hive.metastore.client.keytab`` + - Hive metastore client keytab location. + - + * - ``hive.metastore.thrift.delete-files-on-drop`` + - Actively delete the files for managed tables when performing drop table + or partition operations, for cases when the metastore does not delete the + files. + - ``false`` + * - ``hive.metastore.thrift.assume-canonical-partition-keys`` + - Allow the metastore to assume that the values of partition columns can be + converted to string values. This can lead to performance improvements in + queries which apply filters on the partition columns. Partition keys with + a ``TIMESTAMP`` type do not get canonicalized. 
+ - ``false``
+ * - ``hive.metastore.thrift.client.socks-proxy``
+ - SOCKS proxy to use for the Thrift Hive metastore.
+ -
+ * - ``hive.metastore.thrift.client.max-retries``
+ - Maximum number of retry attempts for metastore requests.
+ - ``9``
+ * - ``hive.metastore.thrift.client.backoff-scale-factor``
+ - Scale factor for metastore request retry delay.
+ - ``2.0``
+ * - ``hive.metastore.thrift.client.max-retry-time``
+ - Total allowed time limit for a metastore request to be retried.
+ - ``30s``
+ * - ``hive.metastore.thrift.client.min-backoff-delay``
+ - Minimum delay between metastore request retries.
+ - ``1s``
+ * - ``hive.metastore.thrift.client.max-backoff-delay``
+ - Maximum delay between metastore request retries.
+ - ``1s``
+ * - ``hive.metastore.thrift.txn-lock-max-wait``
+ - Maximum time to wait to acquire a Hive transaction lock.
+ - ``10m``
+
+.. _hive-glue-metastore:
+
+AWS Glue catalog configuration properties
+-----------------------------------------
+
+In order to use an AWS Glue catalog, you must configure the metastore with
+``hive.metastore=glue`` and provide further details with the following
+properties:
+
+.. list-table:: AWS Glue catalog configuration properties
+ :widths: 35, 50, 15
+ :header-rows: 1
+
+ * - Property name
+ - Description
+ - Default
+ * - ``hive.metastore.glue.region``
+ - AWS region of the Glue Catalog. This is required when not running in
+ EC2, or when the catalog is in a different region. Example:
+ ``us-east-1``
+ -
+ * - ``hive.metastore.glue.endpoint-url``
+ - Glue API endpoint URL (optional). Example:
+ ``https://glue.us-east-1.amazonaws.com``
+ -
+ * - ``hive.metastore.glue.sts.region``
+ - AWS region of the STS service to authenticate with. This is required
+ when running in a GovCloud region. Example: ``us-gov-east-1``
+ -
+ * - ``hive.metastore.glue.proxy-api-id``
+ - The ID of the Glue Proxy API, when accessing Glue via a VPC endpoint in
+ API Gateway.
+ - + * - ``hive.metastore.glue.sts.endpoint`` + - STS endpoint URL to use when authenticating to Glue (optional). Example: + ``https://sts.us-gov-east-1.amazonaws.com`` + - + * - ``hive.metastore.glue.pin-client-to-current-region`` + - Pin Glue requests to the same region as the EC2 instance where Trino is + running. + - ``false`` + * - ``hive.metastore.glue.max-connections`` + - Max number of concurrent connections to Glue. + - ``30`` + * - ``hive.metastore.glue.max-error-retries`` + - Maximum number of error retries for the Glue client. + - ``10`` + * - ``hive.metastore.glue.default-warehouse-dir`` + - Default warehouse directory for schemas created without an explicit + ``location`` property. + - + * - ``hive.metastore.glue.aws-credentials-provider`` + - Fully qualified name of the Java class to use for obtaining AWS + credentials. Can be used to supply a custom credentials provider. + - + * - ``hive.metastore.glue.aws-access-key`` + - AWS access key to use to connect to the Glue Catalog. If specified along + with ``hive.metastore.glue.aws-secret-key``, this parameter takes + precedence over ``hive.metastore.glue.iam-role``. + - + * - ``hive.metastore.glue.aws-secret-key`` + - AWS secret key to use to connect to the Glue Catalog. If specified along + with ``hive.metastore.glue.aws-access-key``, this parameter takes + precedence over ``hive.metastore.glue.iam-role``. + - + * - ``hive.metastore.glue.catalogid`` + - The ID of the Glue Catalog in which the metadata database resides. + - + * - ``hive.metastore.glue.iam-role`` + - ARN of an IAM role to assume when connecting to the Glue Catalog. + - + * - ``hive.metastore.glue.external-id`` + - External ID for the IAM role trust policy when connecting to the Glue + Catalog. + - + * - ``hive.metastore.glue.partitions-segments`` + - Number of segments for partitioned Glue tables. + - ``5`` + * - ``hive.metastore.glue.get-partition-threads`` + - Number of threads for parallel partition fetches from Glue. 
+ - ``20``
+ * - ``hive.metastore.glue.read-statistics-threads``
+ - Number of threads for parallel statistic fetches from Glue.
+ - ``5``
+ * - ``hive.metastore.glue.write-statistics-threads``
+ - Number of threads for parallel statistic writes to Glue.
+ - ``5``
+
+.. _iceberg-glue-catalog:
+
+Iceberg-specific Glue catalog configuration properties
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When using the Glue catalog, the Iceberg connector supports the same
+:ref:`general Glue configuration properties <hive-glue-metastore>` as
+previously described with the following additional property:
+
+.. list-table:: Iceberg Glue catalog configuration property
+ :widths: 35, 50, 15
+ :header-rows: 1
+
+ * - Property name
+ - Description
+ - Default
+ * - ``iceberg.glue.skip-archive``
+ - Skip archiving an old table version when creating a new version in a
+ commit. See `AWS Glue Skip Archive
+ `_.
+ - ``false``
+
+Iceberg-specific metastores
+---------------------------
+
+The Iceberg table format manages most metadata in metadata files in the object
+storage itself. A small amount of metadata, however, still requires the use of a
+metastore. In the Iceberg ecosystem, these smaller metastores are called Iceberg
+metadata catalogs, or just catalogs.
+
+You can use a general metastore such as an HMS or AWS Glue, or you can use the
+Iceberg-specific REST, Nessie, or JDBC metadata catalogs, as discussed in this
+section.
+
+.. _iceberg-rest-catalog:
+
+REST catalog
+^^^^^^^^^^^^
+
+In order to use the Iceberg REST catalog, configure the catalog type
+with ``iceberg.catalog.type=rest``, and provide further details with the
+following properties:
+
+.. list-table:: Iceberg REST catalog configuration properties
+ :widths: 40, 60
+ :header-rows: 1
+
+ * - Property name
+ - Description
+ * - ``iceberg.rest-catalog.uri``
+ - REST server API endpoint URI (required).
+ Example: ``http://iceberg-with-rest:8181``
+ * - ``iceberg.rest-catalog.warehouse``
+ - Warehouse identifier/location for the catalog (optional).
+ Example: ``s3://my_bucket/warehouse_location``
+ * - ``iceberg.rest-catalog.security``
+ - The type of security to use (default: ``NONE``). ``OAUTH2`` requires
+ either a ``token`` or ``credential``. Example: ``OAUTH2``
+ * - ``iceberg.rest-catalog.session``
+ - Session information included when communicating with the REST Catalog.
+ Options are ``NONE`` or ``USER`` (default: ``NONE``).
+ * - ``iceberg.rest-catalog.oauth2.token``
+ - The bearer token used for interactions with the server. A
+ ``token`` or ``credential`` is required for ``OAUTH2`` security.
+ Example: ``AbCdEf123456``
+ * - ``iceberg.rest-catalog.oauth2.credential``
+ - The credential to exchange for a token in the OAuth2 client credentials
+ flow with the server. A ``token`` or ``credential`` is required for
+ ``OAUTH2`` security. Example: ``AbCdEf123456``
+
+The following example shows a minimal catalog configuration using an Iceberg
+REST metadata catalog:
+
+.. code-block:: properties
+
+ connector.name=iceberg
+ iceberg.catalog.type=rest
+ iceberg.rest-catalog.uri=http://iceberg-with-rest:8181
+
+The REST catalog does not support :doc:`views` or
+:doc:`materialized views`.
+
+.. _iceberg-jdbc-catalog:
+
+JDBC catalog
+^^^^^^^^^^^^
+
+In order to use the Iceberg JDBC catalog, configure the catalog type with
+``iceberg.catalog.type=jdbc``.
+
+.. warning::
+
+ The JDBC catalog may have compatibility issues if Iceberg introduces breaking
+ changes in the future. Consider the :ref:`REST catalog
+ <iceberg-rest-catalog>` as an alternative solution.
+
+At a minimum, ``iceberg.jdbc-catalog.driver-class``,
+``iceberg.jdbc-catalog.connection-url``, and
+``iceberg.jdbc-catalog.catalog-name`` must be configured. When using any
+database besides PostgreSQL, a JDBC driver jar file must be placed in the plugin
+directory. The following example shows a minimal catalog configuration using an
+Iceberg JDBC metadata catalog:
+
+.. code-block:: properties
+
+ connector.name=iceberg
+ iceberg.catalog.type=jdbc
+ iceberg.jdbc-catalog.catalog-name=test
+ iceberg.jdbc-catalog.driver-class=org.postgresql.Driver
+ iceberg.jdbc-catalog.connection-url=jdbc:postgresql://example.net:5432/database
+ iceberg.jdbc-catalog.connection-user=admin
+ iceberg.jdbc-catalog.connection-password=test
+ iceberg.jdbc-catalog.default-warehouse-dir=s3://bucket
+
+The JDBC catalog does not support :doc:`views` or
+:doc:`materialized views`.
+
+.. _iceberg-nessie-catalog:
+
+Nessie catalog
+^^^^^^^^^^^^^^
+
+In order to use a Nessie catalog, configure the catalog type with
+``iceberg.catalog.type=nessie`` and provide further details with the following
+properties:
+
+.. list-table:: Nessie catalog configuration properties
+ :widths: 40, 60
+ :header-rows: 1
+
+ * - Property name
+ - Description
+ * - ``iceberg.nessie-catalog.uri``
+ - Nessie API endpoint URI (required).
+ Example: ``https://localhost:19120/api/v1``
+ * - ``iceberg.nessie-catalog.ref``
+ - The branch/tag to use for Nessie, defaults to ``main``.
+ * - ``iceberg.nessie-catalog.default-warehouse-dir``
+ - Default warehouse directory for schemas created without an explicit
+ ``location`` property. Example: ``/tmp``
+
+.. code-block:: properties
+
+ connector.name=iceberg
+ iceberg.catalog.type=nessie
+ iceberg.nessie-catalog.uri=https://localhost:19120/api/v1
+ iceberg.nessie-catalog.default-warehouse-dir=/tmp
+
+.. _partition-projection:
+
+Access tables with Athena partition projection metadata
+-------------------------------------------------------
+
+`Partition projection `_
+is a feature of AWS Athena often used to speed up query processing with highly
+partitioned tables when using the Hive connector.
+
+Trino supports partition projection table properties stored in the Hive
+metastore or Glue catalog, and it reimplements this functionality. Currently,
+there is a limitation in comparison to AWS Athena for date projection, as it
+only supports intervals of ``DAYS``, ``HOURS``, ``MINUTES``, and ``SECONDS``.
+
+If there are any compatibility issues blocking access to a requested table when
+partition projection is enabled, set the
+``partition_projection_ignore`` table property to ``true`` for a table to bypass
+any errors.
+
+Refer to :ref:`hive-table-properties` and :ref:`hive-column-properties` for
+configuration of partition projection.
+
+Configure metastore for Avro
+----------------------------
+
+For catalogs using the Hive connector, you must add the following property
+definition to the Hive metastore configuration file ``hive-site.xml`` and
+restart the metastore service to enable first-class support for Avro tables when
+using Hive 3.x:
+
+.. code-block:: xml
+
+ <property>
+   <name>metastore.storage.schema.reader.impl</name>
+   <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
+ </property>
\ No newline at end of file
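+
+As a closing illustration of how the metastore properties in this section fit
+into a catalog file, the following is a minimal sketch of a Hive catalog using
+the Thrift metastore; the ``example.net`` host and the cache TTL value are
+placeholders, not recommendations:
+
+.. code-block:: properties
+
+ connector.name=hive
+ hive.metastore=thrift
+ hive.metastore.uri=thrift://example.net:9083
+ hive.metastore-cache-ttl=20m
+
+The same pattern applies to the Glue metastore by replacing the Thrift
+properties with ``hive.metastore=glue`` and the Glue properties described
+above.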