presto-docs/src/main/sphinx/functions/aggregate.rst

The thresholds are defined as a sequence whose :math:`j`-th entry is the :math:`j`-th smallest threshold.


Differential Entropy Functions
-------------------------------

The following functions approximate the `differential entropy <https://en.wikipedia.org/wiki/Differential_entropy>`_.
That is, for a random variable :math:`x`, they approximate

.. math ::

h(x) = - \int f(x) \log_2\left(f(x)\right) dx,

where :math:`f(x)` is the probability density function of :math:`x`.
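
For example, for a variable uniformly distributed over :math:`[a, b]`, :math:`f(x) = 1 / (b - a)` on that interval,
so :math:`h(x) = \log_2(b - a)`; a variable uniform over :math:`[0, 2]` thus has a differential entropy of one bit.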

.. note::

If :math:`x` has a known lower and upper bound,
prefer the versions taking ``(bucket_count, x, 1.0, 'fixed_histogram_mle', min, max)``,
or ``(bucket_count, x, 1.0, 'fixed_histogram_jacknife', min, max)``,
as they have better convergence.
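
For example, assuming the values of ``x`` are known to lie between 0.0 and 1.0 (the bucket count of 100 and these bounds are merely illustrative), use

.. code-block:: none

-- the bucket count of 100 and the bounds 0.0 and 1.0 are illustrative values
SELECT
differential_entropy(100, x, 1.0, 'fixed_histogram_mle', 0.0, 1.0)
FROM
data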

.. function:: differential_entropy(sample_size, x, weight)
.. note::

If :math:`x` has a known lower and upper bound,
prefer the versions taking ``(bucket_count, x, weight, 'fixed_histogram_mle', min, max)``,
or ``(bucket_count, x, weight, 'fixed_histogram_jacknife', min, max)``,
as they have better convergence.

.. function:: differential_entropy(bucket_count, x, weight, method, min, max) -> double

Otherwise, if the number of distinct weights is low,
especially if the number of samples is low, consider using the version taking
``(bucket_count, x, weight, 'fixed_histogram_jacknife', min, max)``, as jacknife bias correction
is better than maximum likelihood estimation. However, if the number of distinct weights is high,
consider using the version taking ``(bucket_count, x, weight, 'fixed_histogram_mle', min, max)``,
as this will reduce memory and running time.
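
For example, assuming the values of ``x`` are known to lie between 0.0 and 1.0, and using an illustrative bucket count of 100, a weighted, jacknife-corrected call could look like

.. code-block:: none

-- the bucket count of 100 and the bounds 0.0 and 1.0 are illustrative values
SELECT
differential_entropy(100, x, weight, 'fixed_histogram_jacknife', 0.0, 1.0)
FROM
data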


Discrete Entropy Functions
-------------------------------

The following functions approximate the `discrete entropy <https://en.wikipedia.org/wiki/Entropy_(information_theory)>`_.
That is, for a random variable :math:`x`, they approximate

.. math ::

H(x) = - \sum_x P(x) \log_2\left(P(x)\right),

where :math:`P(x)` is the probability of :math:`x`.
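
For example, a fair coin toss has :math:`P(\text{heads}) = P(\text{tails}) = 0.5`,
so its entropy is :math:`-2 \cdot 0.5 \log_2(0.5) = 1` bit.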

.. function:: discrete_entropy(x)

Returns the approximate log-2 discrete entropy from a random variable's sample outcomes. The function internally
creates a map of the (hashed) outcomes of :math:`x` to the number of their occurrences, then calculates
the entropy based on the maximum-likelihood estimate of the counts.

``x`` (``boolean``, ``double``, ``int``, ``long``, or ``varchar``) is the samples.

For example, to find the discrete entropy of ``x``, use

.. code-block:: none

SELECT
discrete_entropy(x)
FROM
data

.. note::

This is equivalent to ``discrete_entropy(x, 'mle')``. If the number of instances is small,
consider using jacknife correction via ``discrete_entropy(x, 'jacknife')``.

.. function:: discrete_entropy(x, weight)

Returns the approximate log-2 discrete entropy from a random variable's sample weighted outcomes. The function internally
creates a map of the (hashed) outcomes of :math:`x` to the total weight of their occurrences, then calculates
the entropy based on the maximum-likelihood estimate of the weights.

``x`` (``boolean``, ``double``, ``int``, ``long``, or ``varchar``) is the samples.

``weight`` (``double``) is the non-negative weights.

For example, to find the discrete entropy of ``x`` with weights ``weight``, use

.. code-block:: none

SELECT
discrete_entropy(x, weight)
FROM
data

.. note::

This is equivalent to ``discrete_entropy(x, weight, 'mle')``. If the number of instances is small,
consider using jacknife correction via ``discrete_entropy(x, weight, 'jacknife')``.

.. function:: discrete_entropy(x, method)

Returns the approximate log-2 discrete entropy from a random variable's sample outcomes.
If ``method`` is ``'mle'``, this is equivalent to ``discrete_entropy(x)``. If ``method`` is ``'jacknife'``,
the function internally
creates a map of the (hashed) outcomes of :math:`x` to the number of their occurrences,
then calculates the entropy based on the jacknife-corrected maximum-likelihood estimate of the counts.

``x`` (``boolean``, ``double``, ``int``, ``long``, or ``varchar``) is the samples.

``method`` is either ``'mle'`` or ``'jacknife'``.

For example, to find the discrete entropy of ``x`` using jacknife estimation, use

.. code-block:: none

SELECT
discrete_entropy(x, 'jacknife')
FROM
data

.. note::

If the number of instances is large, prefer using ``'mle'`` to ``'jacknife'``, as it is faster.

.. function:: discrete_entropy(x, weight, method)

Returns the approximate log-2 discrete entropy from a random variable's sample outcomes.
If ``method`` is ``'mle'``, this is equivalent to ``discrete_entropy(x, weight)``. If ``method`` is ``'jacknife'``,
the function internally
creates a map of the (hashed) outcomes of :math:`x` and their weights to the number of their occurrences,
then calculates the entropy based on the jacknife-corrected maximum-likelihood estimate of the counts.

``x`` (``boolean``, ``double``, ``int``, ``long``, or ``varchar``) is the samples.

``weight`` (``double``) is the non-negative weights.

``method`` is either ``'mle'`` or ``'jacknife'``.

For example, to find the discrete entropy of ``x`` using weights ``weight`` and jacknife estimation, use

.. code-block:: none

SELECT
discrete_entropy(x, weight, 'jacknife')
FROM
data

.. note::

If the number of instances is large, prefer using ``'mle'`` to ``'jacknife'``, as it is faster. If the number
of distinct weights is large, ``'jacknife'`` might have high memory usage.


Mutual Information for Classification Functions
--------------------------------------------------------------

The following functions approximate the binary
normalized `mutual information <https://en.wikipedia.org/wiki/Mutual_information>`_, which is a measure
of usefulness of a numerical feature for classification. They output a number between 0 (not predictive)
and 1 (completely predictive). See [Krier2006]_ for further details.

For a discrete random variable :math:`y` and a numerical random variable :math:`x`, they approximate

.. math ::

I(x, y) = {h(x) - h(x \;|\; y) \over H(y)},

where :math:`H` is `discrete entropy <https://en.wikipedia.org/wiki/Entropy_(information_theory)>`_ and
:math:`h` is `differential entropy <https://en.wikipedia.org/wiki/Differential_entropy>`_.
Thus, they measure by how much the entropy of :math:`y` is reduced by knowing :math:`x`,
normalized by the entropy of :math:`y`.

.. function:: normalized_differential_mutual_information_classification(sample_size, y, x)

Returns the approximate normalized mutual information between a discrete ``y`` and a continuous ``x`` using
reservoir sampling (see :func:`differential_entropy`).

The parameter ``sample_size`` determines the maximal number of reservoir samples.

If :math:`x` has a known lower and upper bound,
prefer the ``'fixed_histogram_mle'`` or ``'fixed_histogram_jacknife'`` methods, as they have better convergence.
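
For example, to estimate how predictive ``x`` is of ``y`` with an illustrative reservoir size of 1000000, use

.. code-block:: none

-- the reservoir size of 1000000 is an illustrative value
SELECT
normalized_differential_mutual_information_classification(1000000, y, x)
FROM
data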

.. function:: normalized_differential_mutual_information_classification(sample_size, y, x, weight, 'reservoir_vasicek')

Returns the approximate normalized mutual information between a discrete ``y`` and a continuous ``x`` using
reservoir sampling (see :func:`differential_entropy`).

The parameter ``sample_size`` determines the maximal number of reservoir samples. The parameter ``weight`` is the weight
of the sample, and must be non-negative.

If :math:`x` has a known lower and upper bound,
prefer the ``'fixed_histogram_mle'`` or ``'fixed_histogram_jacknife'`` methods, as they have better convergence.
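
For example, with sample weights in a column ``weight`` and an illustrative reservoir size of 1000000, use

.. code-block:: none

-- the reservoir size of 1000000 is an illustrative value
SELECT
normalized_differential_mutual_information_classification(1000000, y, x, weight, 'reservoir_vasicek')
FROM
data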

.. function:: normalized_differential_mutual_information_classification(bucket_count, y, x, weight, 'fixed_histogram_mle', min, max) -> double

Returns the approximate normalized mutual information between a discrete ``y`` and a continuous ``x`` using
the maximum-likelihood approximation of a histogram (see :func:`differential_entropy`).

The parameter ``bucket_count`` determines the number of histogram buckets. The parameters ``min`` and ``max`` are the
minimal and maximal values, respectively; the function will throw if there is an input outside this range.
The parameter ``weight`` is the weight of the sample, and must be non-negative.

If :math:`x` doesn't have known lower and upper bounds, prefer one of the two methods based on reservoir sampling.
Otherwise, if the number of samples is low, consider using the ``'fixed_histogram_jacknife'`` version.
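
For example, assuming the values of ``x`` are known to lie between 0.0 and 1.0, and using an illustrative bucket count of 100, use

.. code-block:: none

-- the bucket count of 100 and the bounds 0.0 and 1.0 are illustrative values
SELECT
normalized_differential_mutual_information_classification(100, y, x, weight, 'fixed_histogram_mle', 0.0, 1.0)
FROM
data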

.. function:: normalized_differential_mutual_information_classification(bucket_count, y, x, weight, 'fixed_histogram_jacknife', min, max) -> double

Returns the approximate normalized mutual information between a discrete ``y`` and a continuous ``x`` using
a jacknife approximation of a histogram (see :func:`differential_entropy`).

The parameter ``bucket_count`` determines the number of histogram buckets. The parameters ``min`` and ``max`` are the
minimal and maximal values, respectively; the function will throw if there is an input outside this range.
The parameter ``weight`` is the weight of the sample, and must be non-negative.

If :math:`x` doesn't have known lower and upper bounds, prefer one of the two methods based on reservoir sampling.
Otherwise, if ``weight`` can take on a wide range of distinct values, avoid using this method, as space and time costs
might be very high; instead, use ``'fixed_histogram_mle'``.
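
For example, assuming the values of ``x`` are known to lie between 0.0 and 1.0, and using an illustrative bucket count of 100, use

.. code-block:: none

-- the bucket count of 100 and the bounds 0.0 and 1.0 are illustrative values
SELECT
normalized_differential_mutual_information_classification(100, y, x, weight, 'fixed_histogram_jacknife', 0.0, 1.0)
FROM
data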

---------------------------

.. [Alizadeh2010] Alizadeh Noughabi, Hadi & Arghami, N. (2010). "A New Estimator of Entropy".

.. [Beirlant2001] Beirlant, Dudewicz, Gyorfi, and van der Meulen,
"Nonparametric entropy estimation: an overview", (2001)

.. [BenHaimTomTov2010] Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
J. Machine Learning Research 11 (2010), pp. 849--872.

.. [Black2015] Black, Paul E. (26 January 2015). "Reservoir sampling". Dictionary of Algorithms and Data Structures.

.. [Efraimidis2006] Efraimidis, Pavlos S.; Spirakis, Paul G. (2006-03-16). "Weighted random sampling with a reservoir".
Information Processing Letters. 97 (5): 181–185.

.. [Krier2006] Krier, C & François, Damien & Wertz, Vincent & Verleysen, Michel. (2006).
   "Feature scoring by mutual information for classification of mass spectra". 10.1142/9789812774118_0079.
