diff --git a/presto-docs/src/main/sphinx/functions.rst b/presto-docs/src/main/sphinx/functions.rst
index 02f8f7e46f9d5..45593f3468532 100644
--- a/presto-docs/src/main/sphinx/functions.rst
+++ b/presto-docs/src/main/sphinx/functions.rst
@@ -28,6 +28,7 @@ Functions and Operators
     functions/hyperloglog
     functions/khyperloglog
     functions/qdigest
+    functions/tdigest
     functions/color
     functions/session
     functions/teradata
diff --git a/presto-docs/src/main/sphinx/functions/qdigest.rst b/presto-docs/src/main/sphinx/functions/qdigest.rst
index 11bf6b307602c..c7e609e66417a 100644
--- a/presto-docs/src/main/sphinx/functions/qdigest.rst
+++ b/presto-docs/src/main/sphinx/functions/qdigest.rst
@@ -2,6 +2,18 @@
 Quantile Digest Functions
 =========================
 
+Presto implements two algorithms for estimating rank-based metrics, `quantile
+digest `_ and `T-digest
+`_. T-digest has `better
+performance `_ in general while the Presto
+implementation of quantile digests supports more numeric types. T-digest has
+better accuracy at the tails, often dramatically better, but may have worse
+accuracy at the median, depending on the compression factor used. In
+comparison, quantile digests support a maximum rank error, which guarantees
+relative uniformity of precision along the quantiles. Quantile digests are
+also formally proven to support lossless merges, while T-digest is not (but
+does empirically demonstrate lossless merges).
+
 Presto implements the ``approx_percentile`` function with the quantile digest data structure.
 The underlying data structure, :ref:`qdigest `, is exposed as a data type in Presto,
 and can be created, queried and stored
diff --git a/presto-docs/src/main/sphinx/functions/tdigest.rst b/presto-docs/src/main/sphinx/functions/tdigest.rst
new file mode 100644
index 0000000000000..539efe4050d8c
--- /dev/null
+++ b/presto-docs/src/main/sphinx/functions/tdigest.rst
@@ -0,0 +1,82 @@
+==================
+T-Digest Functions
+==================
+
+Presto implements two algorithms for estimating rank-based metrics, `quantile
+digest `_ and `T-digest
+`_. T-digest has `better
+performance `_ in general while the Presto
+implementation of quantile digests supports more numeric types. T-digest has
+better accuracy at the tails, often dramatically better, but may have worse
+accuracy at the median, depending on the compression factor used. In
+comparison, quantile digests support a maximum rank error, which guarantees
+relative uniformity of precision along the quantiles. Quantile digests are
+also formally proven to support lossless merges, while T-digest is not (but
+does empirically demonstrate lossless merges).
+
+T-digest was developed by Ted Dunning.
+
+Data Structures
+---------------
+
+A T-digest is a data sketch which stores approximate percentile information.
+The Presto type for this data structure is called :ref:`tdigest `,
+and it accepts a parameter of type ``double`` which represents the set of
+numbers to be ingested by the ``tdigest``. Other numeric types may be added
+in a future release.
+
+T-digests may be merged without losing precision, and for storage and retrieval
+they may be cast to/from ``VARBINARY``.
+
+Functions
+---------
+
+.. function:: merge(tdigest) -> tdigest
+    :noindex:
+
+    Merges all input ``tdigest``\ s into a single ``tdigest``.
+
+.. function:: value_at_quantile(tdigest, quantile) -> double
+
+    Returns the approximate percentile value from the T-digest given the
+    number ``quantile`` between 0 and 1.
+
+.. function:: quantile_at_value(tdigest, value) -> double
+
+    Returns the approximate quantile number between 0 and 1 from the T-digest
+    given an input ``value``. Null is returned if the T-digest is empty or the
+    input value is outside of the range of the digest.
+
+.. function:: scale_tdigest(tdigest, scale_factor) -> tdigest
+
+    Returns a ``tdigest`` whose distribution has been scaled by a factor
+    specified by ``scale_factor``.
+
+.. function:: values_at_quantiles(tdigest, quantiles) -> array
+
+    Returns the approximate percentile values as an array given the input
+    T-digest and array of values between 0 and 1 which represent the quantiles
+    to return.
+
+.. function:: tdigest_agg(x) -> tdigest
+
+    Returns the ``tdigest`` which is composed of all input values of ``x``.
+
+.. function:: tdigest_agg(x, w) -> tdigest
+
+    Returns the ``tdigest`` which is composed of all input values of ``x`` using
+    the per-item weight ``w``.
+
+.. function:: tdigest_agg(x, w, accuracy) -> tdigest
+
+    Returns the ``tdigest`` which is composed of all input values of ``x`` using
+    the per-item weight ``w`` and maximum error of ``accuracy``. ``accuracy``
+    must be a value greater than zero and less than one, and it must be constant
+    for all input rows.
+
+.. function:: destructure_tdigest(tdigest) -> row, centroid_weights array, compression double, min double, max double, sum double, count bigint>
+
+    Returns a row that represents a ``tdigest`` data structure in the form of
+    its component parts. These include arrays of the centroid means and weights,
+    the compression factor, and the maximum, minimum, sum and count of the
+    values in the digest.
diff --git a/presto-docs/src/main/sphinx/language/types.rst b/presto-docs/src/main/sphinx/language/types.rst
index 560fcc79d1a25..4e7b5897a1ffe 100644
--- a/presto-docs/src/main/sphinx/language/types.rst
+++ b/presto-docs/src/main/sphinx/language/types.rst
@@ -325,3 +325,18 @@ Quantile Digest
     percentile values that are read over the course of a week. Instead of calculating
     the past week of data with ``approx_percentile``, ``qdigest``\ s could be stored
     daily, and quickly merged to retrieve the 99th percentile value.
+
+    See :doc:`/functions/qdigest`.
+
+T-Digest
+---------------
+
+.. _tdigest_type:
+
+``TDigest``
+^^^^^^^^^^^
+
+    A t-digest is similar to :ref:`qdigest `, but it uses `a different algorithm
+    `_ to represent the approximate distribution of a set
+    of numbers. T-digest has better performance than quantile digests but only supports the
+    ``DOUBLE`` type. See :doc:`/functions/tdigest`.
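
To make the semantics of ``value_at_quantile``, ``quantile_at_value`` and ``values_at_quantiles`` concrete, the Python sketch below computes the exact quantities that the T-digest functions approximate. It is illustrative only and is not part of the patch or of Presto's implementation. The function names simply mirror the documented ones, they take a plain list of numbers rather than a ``tdigest``, and the nearest-rank and "fraction of values less than or equal to" conventions are assumptions chosen for simplicity.

```python
from bisect import bisect_right


def value_at_quantile(values, quantile):
    """Exact value at the given quantile (0..1), or None for empty input."""
    if not 0 <= quantile <= 1:
        raise ValueError("quantile must be between 0 and 1")
    if not values:
        return None
    ordered = sorted(values)
    # Nearest-rank convention (an assumption); a T-digest would instead
    # estimate this from its centroids.
    index = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[index]


def quantile_at_value(values, value):
    """Exact fraction of inputs <= value.

    Mirrors the documented null behaviour: returns None when the input is
    empty or when value lies outside the [min, max] range of the data.
    """
    if not values:
        return None
    ordered = sorted(values)
    if value < ordered[0] or value > ordered[-1]:
        return None
    return bisect_right(ordered, value) / len(ordered)


def values_at_quantiles(values, quantiles):
    """Array form of value_at_quantile, one result per requested quantile."""
    return [value_at_quantile(values, q) for q in quantiles]


if __name__ == "__main__":
    data = [float(x) for x in range(1, 101)]  # hypothetical sample: 1.0 .. 100.0
    print(value_at_quantile(data, 0.5))
    print(quantile_at_value(data, 25.0))
    print(quantile_at_value(data, 1000.0))  # outside the data range
    print(values_at_quantiles(data, [0.0, 1.0]))
```

The two single-value functions are approximate inverses of each other: one maps a rank to a value, the other maps a value back to a rank. An actual ``tdigest`` answers the same questions from a compressed summary, so its results may differ slightly from these exact ones.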