Calculate Iceberg NDV with a Theta sketch #14290
force-pushed from 6584896 to 4e250f4
CI #14239 (pinot failure considered unrelated)
Is there any performance difference from using that implementation? I realize we don't have a choice, even if the performance is worse; I'm just curious.
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergTypes.java
I don't know. From the user perspective, this is still an experimental and opt-in feature, so perf regressions are fine.
CI #13946 but -- hold your breath -- the other fault-tolerant failure is actually related.
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/aggregation/DataSketchState.java
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
force-pushed from c481861 to b64bd26
The fault-tolerant tests expected ANALYZE results to be stable, and with the new implementation they are not.
force-pushed from 916783a to 148b0ee
CI --
I discussed the lack-of-stability problem with David. It seems fine to roll with the current implementation and improve stability (e.g. by calculating HLL as well) later, if needed.
Reduce code noise by moving the Symbol -> SymbolReference conversion to the construction method.
A hypothetical new format value should automatically get the same coverage (same number of assertions) as all the existing ones. Before the change, some assertions were run conditionally only if format is one of the known ones, which wasn't future-proof.
The test is analyzing _a table_ to calculate statistics.
ANALYZE can compute data size and number of distinct values (NDV) using approximations, and the result is not guaranteed to be stable. When testing ANALYZE with fault tolerance, allow the results to differ between the test run and the control run by some small percentage.
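The tolerance check described above can be sketched as a relative-error comparison. This is a hypothetical helper, not the actual Trino test-framework API; all names here are illustrative:

```java
// Hypothetical helper illustrating a relative-tolerance comparison of
// approximate statistics (e.g. ANALYZE NDVs) between a fault-tolerant
// test run and its control run. Not the actual Trino test code.
public final class ApproxAssert
{
    private ApproxAssert() {}

    // Returns true when `actual` is within `tolerance` (e.g. 0.05 = 5%)
    // of `expected`, measured relative to `expected`.
    public static boolean withinRelativeTolerance(double expected, double actual, double tolerance)
    {
        if (expected == 0) {
            return actual == 0;
        }
        return Math.abs(actual - expected) / Math.abs(expected) <= tolerance;
    }

    public static void main(String[] args)
    {
        // NDV from the control run vs. the fault-tolerant run
        System.out.println(withinRelativeTolerance(1000, 1023, 0.05)); // true: 2.3% off
        System.out.println(withinRelativeTolerance(1000, 1200, 0.05)); // false: 20% off
    }
}
```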
force-pushed from b989476 to 81cb1bd
(rebased to resolve conflict; build was green https://github.com/trinodb/trino/actions/runs/3142623927/jobs/5106437131)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergTypes.java
This looks asymmetrical with the serialize method: it writes a null Block if there is no sketch set, but here you throw.
Non-static deserialize(Block block, int index, DataSketchState state) is a counterpart to non-static serialize(DataSketchState state, BlockBuilder out).
This method is a helper only used:
- in deserialize(Block block, int index, DataSketchState state), conditionally after checking for non-null
- in IcebergMetadata to pull final results (where it's guaranteed to be non-null, by means of the @CombineFunction)
This is ensured by the engine?
I think the SQL spec says that NULL values do not contribute to aggregations and the engine enforces that, but the whole truth may be more complex :)
Yes, it's enforced by the engine.
I thought it's still nice to check for null before reading some dummy value, so that the code is self-descriptive.
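The null-handling symmetry discussed in this thread can be sketched as follows, with Optional<byte[]> standing in for Trino's Block/BlockBuilder. This is an illustration only, not the actual DataSketchState code; all names are made up:

```java
import java.util.Optional;

// Illustrative sketch of symmetric null handling between serialize and
// deserialize for an aggregation state. Optional<byte[]> stands in for
// a nullable Block slot; not the real Trino DataSketchState.
final class SketchStateDemo
{
    private byte[] serializedSketch; // null when no input rows were seen

    void setSketch(byte[] sketch)
    {
        this.serializedSketch = sketch;
    }

    // serialize: emits an empty (null) slot when no sketch is set,
    // instead of throwing
    Optional<byte[]> serialize()
    {
        return Optional.ofNullable(serializedSketch);
    }

    // deserialize: checks for null before reading, keeping the code
    // self-descriptive even though the engine already skips null inputs
    void deserialize(Optional<byte[]> slot)
    {
        if (slot.isPresent()) {
            this.serializedSketch = slot.get();
        }
    }
}
```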
For reference, I ran a small experiment comparing exact count(DISTINCT ...), approx_distinct and the new "$iceberg_theta_stat" aggregation. For that, I've changed the output function of the "$iceberg_theta_stat" aggregation. I was running on a 3-node cluster, in case the distribution matters.

Test query:

```sql
SELECT
  format(
    'SELECT
        column_name, exact_count,
        approx_count, round(abs(approx_count-exact_count)*1e2/exact_count, 1) approx_err,
        theta_count, round(abs(theta_count-exact_count)*1e2/exact_count, 1) theta_err
    FROM ( SELECT %s FROM orders)
    CROSS JOIN UNNEST(ARRAY[%s]) AS _(column_name, exact_count, approx_count, theta_count)
    ;',
    LISTAGG(
      format('count(DISTINCT %1$s) AS "%1$s_exact", approx_distinct(%1$s) AS "%1$s_approx", "$iceberg_theta_stat"(%1$s) AS "%1$s_theta"', column_name),
      U&',\000a') WITHIN GROUP (ORDER BY ordinal_position),
    LISTAGG(
      format('(''%1$s'', "%1$s_exact", "%1$s_approx", "%1$s_theta")', column_name),
      U&',\000a') WITHIN GROUP (ORDER BY ordinal_position)
  )
FROM information_schema.columns
WHERE table_schema = CURRENT_SCHEMA
  AND table_name = 'orders';
```

Results: (tables not captured). Same experiment, this time for ... (tables not captured). Observations: (not captured).
- use applicable static imports
- cast to primitive directly (same behavior, but looks nicer)
Move `ExpressionConverter.getIcebergLiteralValue` to `IcebergTypes` for re-use. Name it `convertTrinoValueToIceberg` to be consistent with existing `convertIcebergValueToTrino` method in that class.
- suppress intentional "redundant casts"
- exact division
- cast to primitive directly (looks nicer, same meaning)
- remove redundant `instanceof`
- put decimals together with other numbers
The Iceberg specification defines the Apache DataSketches Theta sketch as the common data sketch for keeping track of distinct values in a table. This change replaces the use of HLL within Iceberg's ANALYZE with a Theta sketch. The follow-up work is to store the serialized compact form of the sketch inside the Iceberg Puffin statistics file, but this requires Iceberg API changes, which are still in progress.

A side effect of this change is that complex types (array, map, row) can no longer be analyzed: Trino can calculate a HyperLogLog for these types, while Iceberg does not specify a binary representation for them, which is required to feed data into a Theta sketch. However, NDV for complex types is not as useful as it is for scalar types, so this shouldn't matter in practice.
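For intuition on why a Theta-style sketch yields approximate (and not perfectly stable) NDVs, here is a toy "k minimum values" estimator, the core idea behind Theta sketches: hash each value, keep only the k smallest hashes, and infer the distinct count from how densely they pack near zero. This is a plain-Java illustration of the principle only, not the Apache DataSketches implementation this PR adopts; all names and constants are made up for the sketch:

```java
import java.util.TreeSet;

// Toy KMV ("k minimum values") distinct-count estimator, illustrating
// the principle behind Theta sketches. NOT the DataSketches library.
final class KmvSketch
{
    private final int k;
    private final TreeSet<Double> minHashes = new TreeSet<>();

    KmvSketch(int k)
    {
        this.k = k;
    }

    void update(long value)
    {
        // cheap multiplicative mix to spread values over [0, 1);
        // duplicates hash identically, so the set deduplicates them
        long h = value * 0x9E3779B97F4A7C15L;
        double u = (h >>> 11) / (double) (1L << 53);
        minHashes.add(u);
        if (minHashes.size() > k) {
            minHashes.pollLast(); // keep only the k smallest hashes
        }
    }

    double estimate()
    {
        if (minHashes.size() < k) {
            return minHashes.size(); // exact while NDV < k
        }
        // k-th smallest hash ~ k / NDV, so NDV ~ (k - 1) / kthSmallest
        return (k - 1) / minHashes.last();
    }
}
```

Because the estimate depends on which hashes happen to be smallest, two runs over differently-split data can merge to slightly different intermediate states in edge cases, which is the kind of instability the fault-tolerant ANALYZE tests had to tolerate.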
force-pushed from 81cb1bd to 90fb612
Added some code reuse in response to #14290 (comment), no other changes.
The Iceberg specification defines the Apache DataSketches Theta sketch as the
common data sketch for keeping track of distinct values in a table.
This change replaces the use of HLL within Iceberg's ANALYZE with Theta
sketch. The follow-up work is to store the serialized compact form of
the sketch inside the Iceberg Puffin statistics file, but this requires
Iceberg API changes, which are still in progress (e.g. apache/iceberg#4741, apache/iceberg#5794).