From 35d6684829dd326254b7f35ca8242c22ecc055cb Mon Sep 17 00:00:00 2001
From: Simhadri Govindappa
Date: Wed, 2 Aug 2023 21:07:04 +0530
Subject: [PATCH] Add KLL Datasketch and Hive ColumnStatisticsObj as standard blob types to puffin file

---
 landing-page/content/common/puffin-spec.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/landing-page/content/common/puffin-spec.md b/landing-page/content/common/puffin-spec.md
index 36481a950..674a2d364 100644
--- a/landing-page/content/common/puffin-spec.md
+++ b/landing-page/content/common/puffin-spec.md
@@ -126,6 +126,24 @@ The blob metadata for this blob may include following properties:
 
 - `ndv`: estimate of number of distinct values, derived from the sketch.
 
+#### `column-statistics-obj` blob type
+
+A serialized form of Hive ColumnStatsObject.
+
+The columnStatsObject supports Histograms, NDV, Min and Max values, Number of nulls, Number of trues, column name, type.
+A full list of supported statistics is listed in the table here:
+[ColumnStatistics](https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics)
+
+#### `apache-datasketches-KLL-sketch` blob type
+
+A serialized form of a "compact" KLL-sketch produced by the [Apache
+DataSketches](https://datasketches.apache.org/) library.
+Apache-datasketches-KLL-sketch is an implementation of a very compact quantiles
+sketch with lazy compaction scheme and nearly optimal accuracy per bit.
+
+Histograms are derived from this sketch.
+
+
 ### Compression codecs
 
 The data can also be uncompressed. If it is compressed the codec should be one of