[Hotfix]nullsCount in columnStatistic should marked as not present in… by VicoWu · Pull Request #11549 · prestodb/presto

VicoWu · 2018-09-21T12:07:53Z

My presto is running on my Cloudera 5.7.6 hive meastore.
When upgrading to 0.210, Many queries previously run successfully failed with following error:

java.lang.IllegalArgumentException: Nulls fraction should be within [0, 1] or NaN, got: -0.16666666666666666
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:210)
	at com.facebook.presto.cost.SymbolStatsEstimate.<init>(SymbolStatsEstimate.java:54)
	at com.facebook.presto.cost.SymbolStatsEstimate$Builder.build(SymbolStatsEstimate.java:225)
	at com.facebook.presto.cost.TableScanStatsRule.toSymbolStatistics(TableScanStatsRule.java:92)
	at com.facebook.presto.cost.TableScanStatsRule.lambda$doCalculate$0(TableScanStatsRule.java:73)
	at java.util.Optional.map(Optional.java:215)
	at com.facebook.presto.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:73)
	at com.facebook.presto.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:41)
	at com.facebook.presto.cost.SimpleStatsRule.calculate(SimpleStatsRule.java:39)
	at com.facebook.presto.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:80)
	at com.facebook.presto.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:70)
	at com.facebook.presto.cost.CachingStatsProvider.getStats(CachingStatsProvider.java:70)
	at com.facebook.presto.cost.CostCalculatorUsingExchanges$CostEstimator.getStats(CostCalculatorUsingExchanges.java:243)
	at com.facebook.presto.cost.CostCalculatorUsingExchanges$CostEstimator.visitTableScan(CostCalculatorUsingExchanges.java:141)
	at com.facebook.presto.cost.CostCalculatorUsingExchanges$CostEstimator.visitTableScan(CostCalculatorUsingExchanges.java:99)
	at com.facebook.presto.sql.planner.plan.TableScanNode.accept(TableScanNode.java:128)
	at com.facebook.presto.cost.CostCalculatorUsingExchanges.calculateCost(CostCalculatorUsingExchanges.java:96)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges.calculateCost(CostCalculatorWithEstimatedExchanges.java:73)
	at com.facebook.presto.cost.CachingCostProvider.calculateCumulativeCost(CachingCostProvider.java:103)
	at com.facebook.presto.cost.CachingCostProvider.getGroupCost(CachingCostProvider.java:95)
	at com.facebook.presto.cost.CachingCostProvider.getCumulativeCost(CachingCostProvider.java:72)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.createJoinEnumerationResult(ReorderJoins.java:391)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.getJoinSource(ReorderJoins.java:342)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.createJoin(ReorderJoins.java:258)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.createJoinAccordingToPartitioning(ReorderJoins.java:229)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.chooseJoinOrder(ReorderJoins.java:173)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.access$000(ReorderJoins.java:135)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins.apply(ReorderJoins.java:127)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins.apply(ReorderJoins.java:88)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.transform(IterativeOptimizer.java:165)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreNode(IterativeOptimizer.java:138)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:103)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:185)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:105)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:185)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:105)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.optimize(IterativeOptimizer.java:94)
	at com.facebook.presto.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:150)
	at com.facebook.presto.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:139)
	at com.facebook.presto.execution.SqlQueryExecution.doAnalyzeQuery(SqlQueryExecution.java:360)
	at com.facebook.presto.execution.SqlQueryExecution.analyzeQuery(SqlQueryExecution.java:345)
	at com.facebook.presto.execution.SqlQueryExecution.start(SqlQueryExecution.java:289)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

This is because hive metastore use -1 to mark the column nullCounts as unavailable.

After I upgrading to master branch, the error still exists and the error become this:

com.facebook.presto.spi.PrestoException: Corrupted partition statistics (Table: default.plus_higher_category Partition: [<UNPARTITIONED>] Column: plus_higher_category_name): nullsCount must be greater than or equal to zero: -1
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.checkStatistics(MetastoreHiveStatisticsProvider.java:350)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.lambda$validateColumnStatistics$14(MetastoreHiveStatisticsProvider.java:222)
	at java.util.OptionalLong.ifPresent(OptionalLong.java:142)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.validateColumnStatistics(MetastoreHiveStatisticsProvider.java:221)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.lambda$null$10(MetastoreHiveStatisticsProvider.java:211)
	at com.google.common.collect.RegularImmutableMap.forEach(RegularImmutableMap.java:187)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.lambda$validatePartitionStatistics$11(MetastoreHiveStatisticsProvider.java:211)
	at com.google.common.collect.SingletonImmutableBiMap.forEach(SingletonImmutableBiMap.java:65)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.validatePartitionStatistics(MetastoreHiveStatisticsProvider.java:204)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.getTableStatistics(MetastoreHiveStatisticsProvider.java:142)
	at com.facebook.presto.hive.HiveMetadata.getTableStatistics(HiveMetadata.java:536)
	at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorMetadata.getTableStatistics(ClassLoaderSafeConnectorMetadata.java:201)
	at com.facebook.presto.metadata.MetadataManager.getTableStatistics(MetadataManager.java:401)
	at com.facebook.presto.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:61)
	at com.facebook.presto.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:36)
	at com.facebook.presto.cost.SimpleStatsRule.calculate(SimpleStatsRule.java:39)
	at com.facebook.presto.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:80)
	at com.facebook.presto.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:70)
	at com.facebook.presto.cost.CachingStatsProvider.getGroupStats(CachingStatsProvider.java:91)
	at com.facebook.presto.cost.CachingStatsProvider.getStats(CachingStatsProvider.java:68)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges$ExchangeCostEstimator.getStats(CostCalculatorWithEstimatedExchanges.java:198)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges$ExchangeCostEstimator.calculateJoinCost(CostCalculatorWithEstimatedExchanges.java:153)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges$ExchangeCostEstimator.visitJoin(CostCalculatorWithEstimatedExchanges.java:133)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges$ExchangeCostEstimator.visitJoin(CostCalculatorWithEstimatedExchanges.java:76)
	at com.facebook.presto.sql.planner.plan.JoinNode.accept(JoinNode.java:289)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges.calculateCost(CostCalculatorWithEstimatedExchanges.java:72)
	at com.facebook.presto.cost.CachingCostProvider.calculateCumulativeCost(CachingCostProvider.java:103)
	at com.facebook.presto.cost.CachingCostProvider.getCumulativeCost(CachingCostProvider.java:80)
	at com.facebook.presto.sql.planner.iterative.rule.DetermineJoinDistributionType.getJoinNodeWithCost(DetermineJoinDistributionType.java:133)
	at com.facebook.presto.sql.planner.iterative.rule.DetermineJoinDistributionType.getCostBasedJoin(DetermineJoinDistributionType.java:77)
	at com.facebook.presto.sql.planner.iterative.rule.DetermineJoinDistributionType.apply(DetermineJoinDistributionType.java:66)
	at com.facebook.presto.sql.planner.iterative.rule.DetermineJoinDistributionType.apply(DetermineJoinDistributionType.java:43)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.transform(IterativeOptimizer.java:165)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreNode(IterativeOptimizer.java:138)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:103)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:185)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:105)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:185)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:105)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.optimize(IterativeOptimizer.java:94)
	at com.facebook.presto.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:150)
	at com.facebook.presto.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:139)
	at com.facebook.presto.execution.SqlQueryExecution.doAnalyzeQuery(SqlQueryExecution.java:362)
	at com.facebook.presto.execution.SqlQueryExecution.analyzeQuery(SqlQueryExecution.java:347)
	at com.facebook.presto.execution.SqlQueryExecution.start(SqlQueryExecution.java:291)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

The safe solution is to mark the nullsCount as unavailable when hive return -1 for it, instead of just failing user's query!
**Anytime we should not fail users' query when statistic information is not present or looks illegal! Illegal statistics information should be marked as absent **

facebook-github-bot · 2018-09-21T12:07:59Z

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need the corporate CLA signed.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

facebook-github-bot · 2018-09-21T12:18:45Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

arhimondr

@VicoWu

You should fix the corrupted statistics for the tables you have. As a workaround there is a session (ignore_corrupted_statistics) and a config (hive.ignore-corrupted-statistics) property that allows to temporarily ignore the column statistics that doesn't satisfy basic sanity checks. When this property is set, and the corruption is detected - all the statistics for the table are ignored.

VicoWu · 2018-09-23T02:26:32Z

@arhimondr , thank you for your reply. But I have following concerns:

I think there should be a value in hive to mark the column statistic as absent. In my hive, it is -1. So, what value is supposed to be when a column statistic is absent in Presto? I believe column Statistics value missing is very common.
hive.ignore-corrupted-statistics and ignore_corrupted_statistics will ignore all other statistics for this table, instead just ignore this absent column statistics, why ?
Above config are session configs, after I upgrade presto to 0.210, should I have to notify all users to add these configs if they get above error?

findepi · 2018-09-23T20:19:13Z

@VicoWu

Above config are session configs, after I upgrade presto to 0.210, should I have to notify all users to add these configs if they get above error?
hive.ignore-corrupted-statistics is something you can set cluster-wide, in hive.properties (catalog configuration file)

I think there should be a value in hive to mark the column statistic as absent. In my hive, it is -1.

I think the absence of the statistic is often signalled by the absence of the corresponding table property. Do you know why this isn't in your case?
If you delete the table property with value -1, will it be written again? How do you collect stats?

(Maybe this behavior is very standard for Cloudera 5.7.6, i just don't have this version at hand.)

VicoWu · 2018-09-25T01:02:38Z

@findepi , thank you for your response.
I have created an issue to Cloudera tech support, any update I will respond here.
But I still think that for the statistic information, only existing and legal value should be considered, all other cases, absent or illegal , should be considered absent. We should never consider a statistic value as illegal and fail users' query, because at any time a statistic information should not be guaranteed existence.

electrum · 2018-09-25T01:53:02Z

There seems to be some miscommunication on both sides, so let me try to clear things up:

The Hive metastore should support skipping (not storing) statistics for certain columns. Presto supports this today -- if the metastore simply leaves the field unset/null in the Thrift RPC response, then it will be ignored by Presto.

Presto does not allow invalid values such as -1. The field needs to be unset or null.

We consider invalid statistics to be a bug in whatever is writing the data and publishing the stats to the Hive metastore. We choose to fail on corruption because we want to know about it immediately and fix the problem and thus minimize the amount of bad data. If one value is obviously bad, we don't trust the other statistics, because they might also be bad, but simply in a way that we cannot detect.

If you want, you can set hive.ignore-corrupted-statistics=true as a Hive config property. This will take effect for all queries, so users will not need to change anything. The session property allows overriding it on a per-query basis.

VicoWu · 2018-09-25T02:44:45Z

@electrum Thanks for your response.
You mean that the statistic information may be written back to hive metastore? I have already seen CachingHiveMetastore.updatePartitionStatistics() method which may do this. So, the illegal statistic information is intolerable, right?
And also, I cannot find hive.ignore-corrupted-statistics in presto and hive. You mean that I should add this feature by myself? Is there any misunderstanding between us?

arhimondr · 2018-09-26T15:05:48Z

@VicoWu , hive.ignore-corrupted-statistics is existing experimental configuration property of the presto-hive connector.

https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java#L1103

Also you can ignore corrupted statistics session wide, by setting ignore_corrupted_statistics session property for hive connector, e.g.:

SET SESSION hive.ignore_corrupted_statistics = true;

(given the catalog you have the Hive connector installed is hive)

VicoWu · 2018-09-27T00:55:52Z

@arhimondr , yes , this is just what I want. I didn't find it because I am search it in my 0.210 branch. Thank you.

kokosing · 2018-09-27T07:35:10Z

@VicoWu How did you get such statistics? How have you analyzed the table? With Hive or Impala or something else (maybe spark?)? What exact commands have you run?

To me if there is a case that same external tool (like Hive) produces such values and Presto integrates with these tool then we should accept it.

kokosing · 2018-09-27T07:36:43Z

Also is your table partitioned?

kokosing · 2018-09-27T07:46:56Z

@VicoWu For what data types it happens?

kokosing · 2018-09-27T08:16:14Z

@VicoWu I tried to reproduce it, but with no luck on CDH 5.13. Can you try to reproduce it on CDH 5.13?

Can you try to use the below to reproduce your issue, then it will be easier to fix it.

# checkout presto codebase
# fully compile presto
cd presto-product-tests
export HADOOP_BASE_IMAGE=prestodb/cdh5.13-hive
./conf/docker/singlenode/compose.sh up -d
# wait a while
docker exec -it common_hadoop-master_1 hive
# do something that will make null_count hive statistics to become -1
../presto-cli/target/presto-cli-0.208-e.0.2-executable.jar --server localhost:8080 --catalog hive --schema default
# run SHOW STATS or query that reproduces the issue

VicoWu · 2018-09-29T08:08:15Z

@kokosing sorry for late reply. I have contacted with Cloudera Official Support, as their experiment, they claimed that this -1 is generated by Impala, and they are doing more experiments and inner team talk about this phenomenon, I am also waiting for their reply .
I believe it is a very common case that statistics information is -1 in my Cloudera environment. I will sync here if I get any response.

VicoWu · 2018-09-29T10:14:24Z

@kokosing , for your information:

+----------------------------------------------------------------+--+
| CREATE TABLE `default.plus_higher_category`(                   |
|   `plus_higher_category_id` int,                               |
|   `plus_higher_category_name` string)                          |
| ROW FORMAT SERDE                                               |
|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                  |
| STORED AS INPUTFORMAT                                          |
|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'            |
| OUTPUTFORMAT                                                   |
|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'           |
| LOCATION                                                       |
|   'hdfs://warehousestore/hive/warehouse/plus_higher_category'  |
| TBLPROPERTIES (                                                |
|   'COLUMN_STATS_ACCURATE'='false',                             |
|   'DO_NOT_UPDATE_STATS'='true',                                |
|   'numFiles'='1',                                              |
|   'numRows'='-1',                                              |
|   'rawDataSize'='-1',                                          |
|   'totalSize'='413',                                           |
|   'transient_lastDdlTime'='1538176155')                        |
+----------------------------------------------------------------+--+
19 rows selected (0.051 seconds)
0: jdbc:hive2://metrics-hive-server01:10000/d> describe formatted default.plus_higher_category plus_higher_category_name;
+----------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
|          col_name          |       data_type       |          min          |          max          |       num_nulls       |    distinct_count     |      avg_col_len      |      max_col_len      |       num_trues       |      num_falses       |        comment        |
+----------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
| # col_name                 | data_type             | min                   | max                   | num_nulls             | distinct_count        | avg_col_len           | max_col_len           | num_trues             | num_falses            | comment               |
|                            | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  |
| plus_higher_category_name  | string                |                       |                       | -1                    | 6                     | 7.333333333333333     | 11                    |                       |                       | from deserializer     |
+----------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
3 rows selected (0.143 seconds)
0: jdbc:hive2://hive-server01:10000/d>

You could see that the num_nulls is -1! this is a flat table, orc format.

kokosing · 2018-09-29T20:27:05Z

presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java

            OptionalLong min = longStatsData.isSetLowValue() ? OptionalLong.of(longStatsData.getLowValue()) : OptionalLong.empty();
            OptionalLong max = longStatsData.isSetHighValue() ? OptionalLong.of(longStatsData.getHighValue()) : OptionalLong.empty();
-            OptionalLong nullsCount = longStatsData.isSetNumNulls() ? OptionalLong.of(longStatsData.getNumNulls()) : OptionalLong.empty();
+            OptionalLong nullsCount = longStatsData.isSetNumNulls() && longStatsData.getNumNulls() >= 0l ? OptionalLong.of(longStatsData.getNumNulls()) : OptionalLong.empty();


please add a feature toggle to HiveClientConfig like hive.negative-count-statistics-enabled, so when this feature toggle is not set -1 will mean that stats are corrupted

please compare exactly against -1 here

@kokosing , thanks.
So, from my understanding:
When hive.negative-count-statistics-enabled = true, it means that when we met -1, we didn't corrupt the query, just ignore it.
When hive.negative-count-statistics-enabled = false, we should consider this -1 as corrupted and thus failing my query(if hive.ignore-corrupted-statistics=false)

Did I get your point ?

@kokosing , i have resubmit my code. I think that the name hive.negative-count-statistics-enabled is not value accurate, because it is just used indicate a single negative value -1, instead of all negative values.
what's your opinion?

We talked offline with @kokosing. Let's not add a new toggle for that.

kokosing · 2018-09-29T20:37:33Z

@arhimondr @electrum We have very similar case with one of our customers who is also using Impala. I will take a look into Impala code and see why they are using -1.

VicoWu · 2018-09-30T05:27:56Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveErrorCode.java

    HIVE_TABLE_DROPPED_DURING_QUERY(35, EXTERNAL),
    // HIVE_TOO_MANY_BUCKET_SORT_FILES(36) is deprecated
    HIVE_CORRUPTED_COLUMN_STATISTICS(37, EXTERNAL),
+    HIVE_NULL_COLUMN_NEGATIVE_STATISTICS(38, EXTERNAL),


add Exception code HIVE_NULL_COLUMN_NEGATIVE_STATISTICS when we meet -1 for nullCount

VicoWu · 2018-09-30T05:29:58Z

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java

        catch (PrestoException e) {
-            if (e.getErrorCode().equals(HIVE_CORRUPTED_COLUMN_STATISTICS.toErrorCode()) && isIgnoreCorruptedStatistics(session)) {
+            if (e.getErrorCode().equals(HIVE_CORRUPTED_COLUMN_STATISTICS.toErrorCode()) && isIgnoreCorruptedStatistics(session)
+                    || e.getErrorCode().equals(HIVE_NULL_COLUMN_NEGATIVE_STATISTICS.toErrorCode()) && isNegativeStatisticCountEnabled(session)) {


if we meet -1 for nullCounts and the toggle hive.negative-count-statistics-enabled is true, then we think the statistics as not present

VicoWu · 2018-09-30T05:30:44Z

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java


+    private static void checkMinusOneNullsCountStatistic(boolean expression, SchemaTableName table, String partition, String column, String message, Object... args)
+    {
+        if (!expression) {


if nullsCount = -1, we throw PrestoException with specific error code HIVE_NULL_COLUMN_NEGATIVE_STATISTICS

VicoWu · 2018-09-30T05:32:48Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveSessionProperties.java

                        "Experimental: Enables automatic column level statistics collection on write",
                        hiveClientConfig.isCollectColumnStatisticsOnWrite(),
+                        false),
+                booleanProperty(


add it as session property.

VicoWu · 2018-09-30T05:34:25Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java

    private int partitionStatisticsSampleSize = 100;
    private boolean ignoreCorruptedStatistics;
    private boolean collectColumnStatisticsOnWrite;
+    private boolean negativeStatisticCountEnabled;


add this HiveClientConfig. default value is false, which means that if we meet -1, we think that the statistics value is corrupted.

kokosing · 2018-09-30T20:44:47Z

Some pointers to Impala:

arhimondr · 2018-10-01T15:40:30Z

@VicoWu Adding a flag seems to be an overkill.

Please just add the fromMetastoreNullsCount method to the https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java#L336, where you check the nulls count for -1 explicitly, and if it is you set it as empty(). Please make sure to add a link to the ticket in the Impala ticket tracking system.

VicoWu · 2018-10-02T02:18:37Z

@arhimondr I have reset my code as your awesome advice.
Do you mean I should fire an impala ticket and add the ticket link to my presto code comment?

sopel39

Alternatively we could make ignore_corrupted_statistics to relax checks and try to extract whatever stats there seem to be valid (instead of ignoring entire table stats).

findepi · 2018-10-02T08:44:47Z

presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java

nit: "uses"

This is not really a javadoc comment (it does not describe what the method is generally doing). I would replace it with inline // Impala uses -1 to ... comment inside fromMetastoreNullsCount method body

what is "no-sample nullsCount"?

presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java

findepi · 2018-10-02T08:47:42Z

presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java

There are other places below that need an update (search for "nullsCount")

@findepi , thanks for your correction.

I have add the same conversion for other nullsCounts in method fromMetastoreApiColumnStatistics

I thinks other methods like createDateStatistics which is used to write back the statistics information to Hive, so I should not add this conversion method to it, right?
What your opinion?

When writing, -1 should by no means be a written value, so no need to change there.

@findepi

There are more places to be fixed as @findepi marked

sopel39 · 2018-10-02T08:53:40Z

Consider adding tests to: TestThriftMetastoreUtil

findepi · 2018-10-02T20:18:33Z

FWIW:

(i'm linking to today's master head commit, so that line markers remain accurate)

https://github.com/apache/impala/blob/ea2809f5ddcf48f3f41dcd12743e8a17b4ea8cd7/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java#L128-L130

https://github.com/apache/impala/blob/ea2809f5ddcf48f3f41dcd12743e8a17b4ea8cd7/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java#L253-L263

(I don't understand why counting NULLs in Impala should make stats 2x slower.. I always thought NDVs is what is hard to compute.)

arhimondr · 2018-10-02T20:32:18Z

@findepi From what i understand they literally compute number of nulls in IMPALA. Since there is no aggregation to directly compute number of nulls, a projection is needed. In Presto we compute number of non nulls by running count(column) directly, and than we simply subtract the value from the number of rows to get the number of nulls.

findepi · 2018-10-02T20:52:56Z

@arhimondr let me try -- https://gerrit.cloudera.org/11565

findepi · 2018-10-02T21:20:05Z

@kokosing @arhimondr i am concerned (but i didn't test), that the change we want to introduce here won't really help.

It will surely fix the SHOW STATS output for a table. However, when null fraction is unknown, the stats calculation rules will generally refuse to work. This is because com.facebook.presto.cost.SymbolStatsEstimate#getNullsFraction is most of time taken as is, without any polyfill value.
I am unsure how to address this. Perhaps, to actually achieve some useful interop with the stats computed by Impala, treating -1 as 0 (not unknown) would be more practical (in virtually all cases other than x IS NULL queries).

(This observation is by no means a blocker for this PR. I am still convinced the change we want to introduce here is an improvement.)

VicoWu · 2018-10-08T02:04:32Z

@findepi , so , do you have any further update for this PR , or , what can I do next?

findepi

LGTM. Please squash the commits and change the commit message to something like

Treat -1 null count statistic as absent

Impala `COMPUTE STATS` will write -1 as the null count.
See https://issues.apache.org/jira/browse/IMPALA-7497

arhimondr

LGTM % comments

arhimondr · 2018-10-08T14:10:51Z

presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java

nit: remove this comment

arhimondr · 2018-10-08T14:11:10Z

presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java

Add the link to the ticket number in IMPALA

arhimondr · 2018-10-08T14:11:38Z

presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java

squash this commit

@arhimondr , thanks for your correction.
I have squash my commit and re-submit. Please help to review.

Impala `COMPUTE STATS` will write -1 as the null count. See https://issues.apache.org/jira/browse/IMPALA-7497

kokosing · 2018-10-09T09:22:12Z

Merged, thanks!

VicoWu · 2018-10-09T09:31:25Z

@kokosing @findepi @arhimondr @sopel39
Great thanks for all you guys' review and hard work! I really learned a lot!

facebook-github-bot added the CLA Signed label Sep 21, 2018

findepi mentioned this pull request Sep 21, 2018

Refactor HiveStatisticsProvider and statistics SPI #11463

Merged

arhimondr requested changes Sep 21, 2018

View reviewed changes

kokosing reviewed Sep 29, 2018

View reviewed changes

VicoWu commented Sep 30, 2018

View reviewed changes

VicoWu force-pushed the hotfix-nullcounts-failed-query branch from 1f7ae71 to f5a5581 Compare October 2, 2018 03:09

sopel39 previously approved these changes Oct 2, 2018

View reviewed changes

findepi suggested changes Oct 2, 2018

View reviewed changes

VicoWu force-pushed the hotfix-nullcounts-failed-query branch 3 times, most recently from 481d964 to 0e4290c Compare October 3, 2018 02:33

findepi reviewed Oct 8, 2018

View reviewed changes

arhimondr approved these changes Oct 8, 2018

View reviewed changes

Treat -1 null count statistic as absent

202124b

Impala `COMPUTE STATS` will write -1 as the null count. See https://issues.apache.org/jira/browse/IMPALA-7497

VicoWu force-pushed the hotfix-nullcounts-failed-query branch from 0e4290c to 202124b Compare October 9, 2018 02:16

kokosing approved these changes Oct 9, 2018

View reviewed changes

findepi approved these changes Oct 9, 2018

View reviewed changes

kokosing merged commit c09b1e7 into prestodb:master Oct 9, 2018

VicoWu mentioned this pull request Nov 6, 2018

[hotfix]numFalse in columnStatistic should marked as not present when… #11859

Closed

Conversation

VicoWu commented Sep 21, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

facebook-github-bot commented Sep 21, 2018

Uh oh!

facebook-github-bot commented Sep 21, 2018

Uh oh!

arhimondr left a comment

Choose a reason for hiding this comment

Uh oh!

VicoWu commented Sep 23, 2018

Uh oh!

findepi commented Sep 23, 2018

Uh oh!

VicoWu commented Sep 25, 2018

Uh oh!

electrum commented Sep 25, 2018

Uh oh!

VicoWu commented Sep 25, 2018

Uh oh!

arhimondr commented Sep 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

VicoWu commented Sep 27, 2018

Uh oh!

kokosing commented Sep 27, 2018

Uh oh!

kokosing commented Sep 27, 2018

Uh oh!

kokosing commented Sep 27, 2018

Uh oh!

kokosing commented Sep 27, 2018

Uh oh!

VicoWu commented Sep 29, 2018

Uh oh!

VicoWu commented Sep 29, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kokosing commented Sep 29, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kokosing commented Sep 30, 2018

Uh oh!

arhimondr commented Oct 1, 2018

Uh oh!

VicoWu commented Oct 2, 2018

Uh oh!

sopel39 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

VicoWu Oct 3, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sopel39 commented Oct 2, 2018

VicoWu commented Sep 21, 2018 •

edited

Loading

arhimondr commented Sep 26, 2018 •

edited

Loading

VicoWu commented Sep 29, 2018 •

edited

Loading

VicoWu Oct 3, 2018 •

edited

Loading

findepi commented Oct 2, 2018 •

edited

Loading