114 changes: 110 additions & 4 deletions docs/src/main/sphinx/connector/hive.rst
@@ -865,6 +865,8 @@ as Hive. For example, converting the string ``'foo'`` to a number,
or converting the string ``'1234'`` to a ``tinyint`` (which has a
maximum value of ``127``).

.. _hive_avro_schema:

Avro schema evolution
---------------------

@@ -984,6 +986,109 @@ Procedures
Flushes Hive metadata cache entries associated with the selected partition.
The procedure requires named parameters to be passed.
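For illustration, assuming a ``web.page_views`` table partitioned by ``ds``
and ``country`` (hypothetical names), a call might look like the following;
the exact procedure and parameter names can vary between versions::

    CALL system.flush_metadata_cache(
        schema_name => 'web',
        table_name => 'page_views',
        partition_columns => ARRAY['ds', 'country'],
        partition_values => ARRAY['2016-08-09', 'US'])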

.. _hive_table_properties:

Table properties
----------------

Table properties supply or set metadata for the underlying tables. This
is particularly important for :doc:`/sql/create-table-as` statements. Table
properties are passed to the connector using a
:doc:`WITH </sql/create-table-as>` clause::

    CREATE TABLE tablename
    WITH (format = 'CSV',
          csv_escape = '"')

See the :ref:`hive_examples` for more information.

.. list-table:: Hive connector table properties
   :widths: 20, 60, 20
   :header-rows: 1

   * - Property name
     - Description
     - Default
   * - ``auto_purge``
     - Instructs the configured metastore to purge data when a table or
       partition is deleted, instead of performing a soft delete to the
       trash.
     -
   * - ``avro_schema_url``
     - The URI pointing to :ref:`hive_avro_schema` for the table.
     -
   * - ``bucket_count``
     - The number of buckets to group data into. Only valid if used with
       ``bucketed_by``.
     - 0
   * - ``bucketed_by``
     - The bucketing columns for the storage table. Only valid if used with
       ``bucket_count``.
     - ``[]``
   * - ``bucketing_version``
     - Specifies which Hive bucketing version to use. Valid values are ``1``
       or ``2``.
     -
   * - ``csv_escape``
     - The CSV escape character. Requires CSV format.
     -
   * - ``csv_quote``
     - The CSV quote character. Requires CSV format.
     -
   * - ``csv_separator``
     - The CSV separator character. Requires CSV format.
     -
   * - ``external_location``
     - The URI for an external Hive table on S3, Azure Blob Storage, and
       similar storage systems. See the :ref:`hive_examples` for more
       information.
     -
   * - ``format``
     - The table file format. Valid values include ``ORC``, ``PARQUET``,
       ``AVRO``, ``RCBINARY``, ``RCTEXT``, ``SEQUENCEFILE``, ``JSON``,
       ``TEXTFILE``, and ``CSV``. The catalog property
       ``hive.storage-format`` sets the default value.
     -
   * - ``null_format``
     - The string that represents ``NULL`` values in the serialized data.
       Requires TextFile, RCText, or SequenceFile format.
     -
   * - ``orc_bloom_filter_columns``
     - Comma-separated list of columns to use for ORC bloom filters. Bloom
       filters improve the performance of queries using range predicates
       when reading ORC files. Requires ORC format.
     - ``[]``
   * - ``orc_bloom_filter_fpp``
     - The false positive probability for ORC bloom filters. Requires ORC
       format.
     - 0.05
   * - ``partitioned_by``
     - The partitioning columns for the storage table. The columns listed in
       the ``partitioned_by`` clause must be the last columns as defined in
       the DDL.
     - ``[]``
   * - ``skip_footer_line_count``
     - The number of footer lines to ignore when parsing the file for data.
       Requires TextFile or CSV format tables.
     -
   * - ``skip_header_line_count``
     - The number of header lines to ignore when parsing the file for data.
       Requires TextFile or CSV format tables.
     -
   * - ``sorted_by``
     - The columns to sort by within each bucket. Only valid if
       ``bucketed_by`` and ``bucket_count`` are specified as well.
     - ``[]``
   * - ``textfile_field_separator``
     - Allows the use of custom field separators, such as ``|``, for
       TextFile formatted tables.
     -
   * - ``textfile_field_separator_escape``
     - Allows the use of a custom escape character for TextFile formatted
       tables.
     -
   * - ``transactional``
     - Set this property to ``true`` to create an ORC ACID transactional
       table. Requires ORC format. This property may be shown as ``true``
       for insert-only tables created using older versions of Hive.
     -
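As a sketch of how several of these properties combine, the following
hypothetical table is partitioned, bucketed, and sorted (the table and
column names are illustrative, not from the documentation above; note that
the partitioning columns appear last in the column list, as required)::

    CREATE TABLE hive.web.page_views_bucketed (
        user_id bigint,
        page_url varchar,
        ds date,
        country varchar
    )
    WITH (
        format = 'ORC',
        partitioned_by = ARRAY['ds', 'country'],
        bucketed_by = ARRAY['user_id'],
        bucket_count = 50,
        sorted_by = ARRAY['user_id']
    )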

.. _hive_special_columns:

Special columns
---------------

@@ -1014,11 +1119,10 @@ Retrieve all records that belong to files stored in the partition
    FROM hive.web.page_views
    WHERE "$partition" = 'ds=2016-08-09/country=US'
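Hidden columns can also be selected alongside regular columns. Assuming the
connector also exposes the ``"$path"`` hidden column (as in recent Trino
versions) and the same hypothetical ``page_views`` table, a query combining
the two might look like::

    SELECT "$path", "$partition", page_url
    FROM hive.web.page_views
    WHERE "$partition" = 'ds=2016-08-09/country=US'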

.. _hive_special_tables:

Special tables
--------------

The raw Hive table properties are available as a hidden table, containing a
separate column per table property, with a single row containing the property
@@ -1029,6 +1133,8 @@ You can inspect the property names and values with a simple query::

    SELECT * FROM hive.web."page_views$properties";

.. _hive_examples:

Examples
--------
