[Hotfix] nullsCount in columnStatistic should marked as not present in… #11549

Merged
kokosing merged 1 commit into prestodb:master from VicoWu:hotfix-nullcounts-failed-query on Oct 9, 2018

Conversation

@VicoWu

@VicoWu VicoWu commented Sep 21, 2018

My Presto runs against a Cloudera 5.7.6 Hive metastore.
After upgrading to 0.210, many queries that previously ran successfully now fail with the following error:

java.lang.IllegalArgumentException: Nulls fraction should be within [0, 1] or NaN, got: -0.16666666666666666
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:210)
	at com.facebook.presto.cost.SymbolStatsEstimate.<init>(SymbolStatsEstimate.java:54)
	at com.facebook.presto.cost.SymbolStatsEstimate$Builder.build(SymbolStatsEstimate.java:225)
	at com.facebook.presto.cost.TableScanStatsRule.toSymbolStatistics(TableScanStatsRule.java:92)
	at com.facebook.presto.cost.TableScanStatsRule.lambda$doCalculate$0(TableScanStatsRule.java:73)
	at java.util.Optional.map(Optional.java:215)
	at com.facebook.presto.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:73)
	at com.facebook.presto.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:41)
	at com.facebook.presto.cost.SimpleStatsRule.calculate(SimpleStatsRule.java:39)
	at com.facebook.presto.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:80)
	at com.facebook.presto.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:70)
	at com.facebook.presto.cost.CachingStatsProvider.getStats(CachingStatsProvider.java:70)
	at com.facebook.presto.cost.CostCalculatorUsingExchanges$CostEstimator.getStats(CostCalculatorUsingExchanges.java:243)
	at com.facebook.presto.cost.CostCalculatorUsingExchanges$CostEstimator.visitTableScan(CostCalculatorUsingExchanges.java:141)
	at com.facebook.presto.cost.CostCalculatorUsingExchanges$CostEstimator.visitTableScan(CostCalculatorUsingExchanges.java:99)
	at com.facebook.presto.sql.planner.plan.TableScanNode.accept(TableScanNode.java:128)
	at com.facebook.presto.cost.CostCalculatorUsingExchanges.calculateCost(CostCalculatorUsingExchanges.java:96)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges.calculateCost(CostCalculatorWithEstimatedExchanges.java:73)
	at com.facebook.presto.cost.CachingCostProvider.calculateCumulativeCost(CachingCostProvider.java:103)
	at com.facebook.presto.cost.CachingCostProvider.getGroupCost(CachingCostProvider.java:95)
	at com.facebook.presto.cost.CachingCostProvider.getCumulativeCost(CachingCostProvider.java:72)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.createJoinEnumerationResult(ReorderJoins.java:391)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.getJoinSource(ReorderJoins.java:342)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.createJoin(ReorderJoins.java:258)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.createJoinAccordingToPartitioning(ReorderJoins.java:229)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.chooseJoinOrder(ReorderJoins.java:173)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins$JoinEnumerator.access$000(ReorderJoins.java:135)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins.apply(ReorderJoins.java:127)
	at com.facebook.presto.sql.planner.iterative.rule.ReorderJoins.apply(ReorderJoins.java:88)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.transform(IterativeOptimizer.java:165)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreNode(IterativeOptimizer.java:138)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:103)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:185)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:105)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:185)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:105)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.optimize(IterativeOptimizer.java:94)
	at com.facebook.presto.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:150)
	at com.facebook.presto.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:139)
	at com.facebook.presto.execution.SqlQueryExecution.doAnalyzeQuery(SqlQueryExecution.java:360)
	at com.facebook.presto.execution.SqlQueryExecution.analyzeQuery(SqlQueryExecution.java:345)
	at com.facebook.presto.execution.SqlQueryExecution.start(SqlQueryExecution.java:289)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

This is because the Hive metastore uses -1 to mark a column's nullCounts as unavailable.

After upgrading to the master branch, the error still occurs, but it now looks like this:

com.facebook.presto.spi.PrestoException: Corrupted partition statistics (Table: default.plus_higher_category Partition: [<UNPARTITIONED>] Column: plus_higher_category_name): nullsCount must be greater than or equal to zero: -1
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.checkStatistics(MetastoreHiveStatisticsProvider.java:350)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.lambda$validateColumnStatistics$14(MetastoreHiveStatisticsProvider.java:222)
	at java.util.OptionalLong.ifPresent(OptionalLong.java:142)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.validateColumnStatistics(MetastoreHiveStatisticsProvider.java:221)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.lambda$null$10(MetastoreHiveStatisticsProvider.java:211)
	at com.google.common.collect.RegularImmutableMap.forEach(RegularImmutableMap.java:187)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.lambda$validatePartitionStatistics$11(MetastoreHiveStatisticsProvider.java:211)
	at com.google.common.collect.SingletonImmutableBiMap.forEach(SingletonImmutableBiMap.java:65)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.validatePartitionStatistics(MetastoreHiveStatisticsProvider.java:204)
	at com.facebook.presto.hive.statistics.MetastoreHiveStatisticsProvider.getTableStatistics(MetastoreHiveStatisticsProvider.java:142)
	at com.facebook.presto.hive.HiveMetadata.getTableStatistics(HiveMetadata.java:536)
	at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorMetadata.getTableStatistics(ClassLoaderSafeConnectorMetadata.java:201)
	at com.facebook.presto.metadata.MetadataManager.getTableStatistics(MetadataManager.java:401)
	at com.facebook.presto.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:61)
	at com.facebook.presto.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:36)
	at com.facebook.presto.cost.SimpleStatsRule.calculate(SimpleStatsRule.java:39)
	at com.facebook.presto.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:80)
	at com.facebook.presto.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:70)
	at com.facebook.presto.cost.CachingStatsProvider.getGroupStats(CachingStatsProvider.java:91)
	at com.facebook.presto.cost.CachingStatsProvider.getStats(CachingStatsProvider.java:68)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges$ExchangeCostEstimator.getStats(CostCalculatorWithEstimatedExchanges.java:198)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges$ExchangeCostEstimator.calculateJoinCost(CostCalculatorWithEstimatedExchanges.java:153)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges$ExchangeCostEstimator.visitJoin(CostCalculatorWithEstimatedExchanges.java:133)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges$ExchangeCostEstimator.visitJoin(CostCalculatorWithEstimatedExchanges.java:76)
	at com.facebook.presto.sql.planner.plan.JoinNode.accept(JoinNode.java:289)
	at com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges.calculateCost(CostCalculatorWithEstimatedExchanges.java:72)
	at com.facebook.presto.cost.CachingCostProvider.calculateCumulativeCost(CachingCostProvider.java:103)
	at com.facebook.presto.cost.CachingCostProvider.getCumulativeCost(CachingCostProvider.java:80)
	at com.facebook.presto.sql.planner.iterative.rule.DetermineJoinDistributionType.getJoinNodeWithCost(DetermineJoinDistributionType.java:133)
	at com.facebook.presto.sql.planner.iterative.rule.DetermineJoinDistributionType.getCostBasedJoin(DetermineJoinDistributionType.java:77)
	at com.facebook.presto.sql.planner.iterative.rule.DetermineJoinDistributionType.apply(DetermineJoinDistributionType.java:66)
	at com.facebook.presto.sql.planner.iterative.rule.DetermineJoinDistributionType.apply(DetermineJoinDistributionType.java:43)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.transform(IterativeOptimizer.java:165)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreNode(IterativeOptimizer.java:138)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:103)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:185)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:105)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:185)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:105)
	at com.facebook.presto.sql.planner.iterative.IterativeOptimizer.optimize(IterativeOptimizer.java:94)
	at com.facebook.presto.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:150)
	at com.facebook.presto.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:139)
	at com.facebook.presto.execution.SqlQueryExecution.doAnalyzeQuery(SqlQueryExecution.java:362)
	at com.facebook.presto.execution.SqlQueryExecution.analyzeQuery(SqlQueryExecution.java:347)
	at com.facebook.presto.execution.SqlQueryExecution.start(SqlQueryExecution.java:291)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

The safe solution is to mark nullsCount as unavailable when Hive returns -1 for it, instead of failing the user's query.
**We should never fail a user's query when statistics are absent or look illegal; illegal statistics should be treated as absent.**

@facebook-github-bot
Collaborator

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot
Collaborator

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

Member

@arhimondr arhimondr left a comment

@VicoWu

You should fix the corrupted statistics for the tables you have. As a workaround, there is a session property (ignore_corrupted_statistics) and a config property (hive.ignore-corrupted-statistics) that allow you to temporarily ignore column statistics that don't satisfy basic sanity checks. When this property is set and corruption is detected, all statistics for the table are ignored.

@VicoWu
Author

VicoWu commented Sep 23, 2018

@arhimondr , thank you for your reply. But I have the following concerns:

  1. I think there should be a value in Hive that marks a column statistic as absent; in my Hive, it is -1. So what value is a column statistic supposed to have when it is absent in Presto? I believe missing column statistics are very common.
  2. hive.ignore-corrupted-statistics and ignore_corrupted_statistics ignore all other statistics for the table instead of just the absent column statistic. Why?
  3. The above are session configs. After I upgrade Presto to 0.210, do I have to notify all users to add these configs if they hit the above error?

@findepi
Contributor

findepi commented Sep 23, 2018

@VicoWu

Above config are session configs, after I upgrade presto to 0.210, should I have to notify all users to add these configs if they get above error?

hive.ignore-corrupted-statistics is something you can set cluster-wide, in hive.properties (the catalog configuration file).
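For example, the cluster-wide setting goes into the Hive catalog properties file (the etc/catalog/hive.properties path is an assumption based on a typical Presto layout; adjust for your installation):

```properties
# etc/catalog/hive.properties
hive.ignore-corrupted-statistics=true
```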

I think there should be a value in hive to mark the column statistic as absent. In my hive, it is -1.

I think the absence of a statistic is usually signalled by the absence of the corresponding table property. Do you know why that isn't happening in your case?
If you delete the table property with value -1, will it be written again? How do you collect stats?

(Maybe this behavior is very standard for Cloudera 5.7.6, i just don't have this version at hand.)

@VicoWu
Author

VicoWu commented Sep 25, 2018

@findepi , thank you for your response.
I have opened an issue with Cloudera tech support; I will post here when I have an update.
But I still think that for statistics, only existing and legal values should be considered; in all other cases, absent or illegal, the statistic should be treated as absent. We should never treat a statistic value as illegal and fail the user's query, because the existence of a statistic is never guaranteed.

@electrum
Contributor

There seems to be some miscommunication on both sides, so let me try to clear things up:

The Hive metastore should support skipping (not storing) statistics for certain columns. Presto supports this today -- if the metastore simply leaves the field unset/null in the Thrift RPC response, then it will be ignored by Presto.

Presto does not allow invalid values such as -1. The field needs to be unset or null.

We consider invalid statistics to be a bug in whatever is writing the data and publishing the stats to the Hive metastore. We choose to fail on corruption because we want to know about it immediately and fix the problem and thus minimize the amount of bad data. If one value is obviously bad, we don't trust the other statistics, because they might also be bad, but simply in a way that we cannot detect.

If you want, you can set hive.ignore-corrupted-statistics=true as a Hive config property. This will take effect for all queries, so users will not need to change anything. The session property allows overriding it on a per-query basis.

@VicoWu
Author

VicoWu commented Sep 25, 2018

@electrum Thanks for your response.
You mean that the statistics may be written back to the Hive metastore? I have seen the CachingHiveMetastore.updatePartitionStatistics() method, which may do this. So illegal statistics are intolerable, right?
Also, I cannot find hive.ignore-corrupted-statistics in Presto or Hive. Do you mean I should add this feature myself? Is there a misunderstanding between us?

@arhimondr
Member

arhimondr commented Sep 26, 2018

@VicoWu , hive.ignore-corrupted-statistics is an existing, experimental configuration property of the presto-hive connector.

https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java#L1103

You can also ignore corrupted statistics session-wide by setting the ignore_corrupted_statistics session property for the Hive connector, e.g.:

SET SESSION hive.ignore_corrupted_statistics = true;

(assuming the catalog with the Hive connector installed is named hive)

@VicoWu
Author

VicoWu commented Sep 27, 2018

@arhimondr , yes, this is just what I want. I didn't find it because I was searching in my 0.210 branch. Thank you.

@kokosing
Contributor

@VicoWu How did you get such statistics? How did you analyze the table? With Hive, Impala, or something else (maybe Spark)? What exact commands did you run?

To me, if an external tool (like Hive) produces such values and Presto integrates with that tool, then we should accept them.

@kokosing
Contributor

Also is your table partitioned?

@kokosing
Contributor

@VicoWu For what data types does it happen?

@kokosing
Contributor

@VicoWu I tried to reproduce it, but had no luck on CDH 5.13. Can you try to reproduce it on CDH 5.13?

Can you use the steps below to reproduce your issue? That would make it easier to fix.

# checkout presto codebase
# fully compile presto
cd presto-product-tests
export HADOOP_BASE_IMAGE=prestodb/cdh5.13-hive
./conf/docker/singlenode/compose.sh up -d
# wait a while
docker exec -it common_hadoop-master_1 hive
# do something that will make the null_count hive statistic become -1
../presto-cli/target/presto-cli-0.208-e.0.2-executable.jar --server localhost:8080 --catalog hive --schema default
# run SHOW STATS or query that reproduces the issue

@VicoWu
Author

VicoWu commented Sep 29, 2018

@kokosing sorry for the late reply. I have contacted Cloudera Official Support; according to their experiments, they claim this -1 is generated by Impala. They are doing further experiments and internal discussion about this behavior, and I am waiting for their reply.
I believe a statistics value of -1 is a very common case in my Cloudera environment. I will post here when I get a response.

@VicoWu
Author

VicoWu commented Sep 29, 2018

@kokosing , for your information:

+----------------------------------------------------------------+--+
| CREATE TABLE `default.plus_higher_category`(                   |
|   `plus_higher_category_id` int,                               |
|   `plus_higher_category_name` string)                          |
| ROW FORMAT SERDE                                               |
|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                  |
| STORED AS INPUTFORMAT                                          |
|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'            |
| OUTPUTFORMAT                                                   |
|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'           |
| LOCATION                                                       |
|   'hdfs://warehousestore/hive/warehouse/plus_higher_category'  |
| TBLPROPERTIES (                                                |
|   'COLUMN_STATS_ACCURATE'='false',                             |
|   'DO_NOT_UPDATE_STATS'='true',                                |
|   'numFiles'='1',                                              |
|   'numRows'='-1',                                              |
|   'rawDataSize'='-1',                                          |
|   'totalSize'='413',                                           |
|   'transient_lastDdlTime'='1538176155')                        |
+----------------------------------------------------------------+--+
19 rows selected (0.051 seconds)
0: jdbc:hive2://metrics-hive-server01:10000/d> describe formatted default.plus_higher_category plus_higher_category_name;
+----------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
|          col_name          |       data_type       |          min          |          max          |       num_nulls       |    distinct_count     |      avg_col_len      |      max_col_len      |       num_trues       |      num_falses       |        comment        |
+----------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
| # col_name                 | data_type             | min                   | max                   | num_nulls             | distinct_count        | avg_col_len           | max_col_len           | num_trues             | num_falses            | comment               |
|                            | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  | NULL                  |
| plus_higher_category_name  | string                |                       |                       | -1                    | 6                     | 7.333333333333333     | 11                    |                       |                       | from deserializer     |
+----------------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+--+
3 rows selected (0.143 seconds)
0: jdbc:hive2://hive-server01:10000/d>

You can see that num_nulls is -1. This is a flat (unpartitioned) table in ORC format.

Before (current code):

OptionalLong min = longStatsData.isSetLowValue() ? OptionalLong.of(longStatsData.getLowValue()) : OptionalLong.empty();
OptionalLong max = longStatsData.isSetHighValue() ? OptionalLong.of(longStatsData.getHighValue()) : OptionalLong.empty();
OptionalLong nullsCount = longStatsData.isSetNumNulls() ? OptionalLong.of(longStatsData.getNumNulls()) : OptionalLong.empty();

After (proposed change to the nullsCount line):

OptionalLong nullsCount = longStatsData.isSetNumNulls() && longStatsData.getNumNulls() >= 0L ? OptionalLong.of(longStatsData.getNumNulls()) : OptionalLong.empty();
Contributor

  • Please add a feature toggle to HiveClientConfig, such as hive.negative-count-statistics-enabled, so that when the toggle is not set, -1 means the stats are corrupted.
  • Please compare exactly against -1 here.

Author

@kokosing , thanks.
So, from my understanding:
When hive.negative-count-statistics-enabled = true and we encounter -1, we don't fail the query; we just ignore the statistic.
When hive.negative-count-statistics-enabled = false, we consider the -1 corrupted and thus fail the query (if hive.ignore-corrupted-statistics=false).

Did I get your point?

Author

@kokosing , I have resubmitted my code. I think the name hive.negative-count-statistics-enabled is not accurate, because it is only used to indicate the single negative value -1, not all negative values.
What's your opinion?

Contributor

We talked offline with @kokosing. Let's not add a new toggle for that.

@kokosing
Copy link
Contributor

@arhimondr @electrum We have a very similar case with one of our customers, who is also using Impala. I will take a look into the Impala code and see why they are using -1.

HIVE_TABLE_DROPPED_DURING_QUERY(35, EXTERNAL),
// HIVE_TOO_MANY_BUCKET_SORT_FILES(36) is deprecated
HIVE_CORRUPTED_COLUMN_STATISTICS(37, EXTERNAL),
HIVE_NULL_COLUMN_NEGATIVE_STATISTICS(38, EXTERNAL),
Author

Add the exception code HIVE_NULL_COLUMN_NEGATIVE_STATISTICS for when we meet -1 for nullCount.

Before:

catch (PrestoException e) {
    if (e.getErrorCode().equals(HIVE_CORRUPTED_COLUMN_STATISTICS.toErrorCode()) && isIgnoreCorruptedStatistics(session)) {

After:

catch (PrestoException e) {
    if (e.getErrorCode().equals(HIVE_CORRUPTED_COLUMN_STATISTICS.toErrorCode()) && isIgnoreCorruptedStatistics(session)
            || e.getErrorCode().equals(HIVE_NULL_COLUMN_NEGATIVE_STATISTICS.toErrorCode()) && isNegativeStatisticCountEnabled(session)) {
Author

If we meet -1 for nullsCount and the toggle hive.negative-count-statistics-enabled is true, we treat the statistic as not present.


private static void checkMinusOneNullsCountStatistic(boolean expression, SchemaTableName table, String partition, String column, String message, Object... args)
{
if (!expression) {
Author

If nullsCount = -1, we throw a PrestoException with the specific error code HIVE_NULL_COLUMN_NEGATIVE_STATISTICS.

"Experimental: Enables automatic column level statistics collection on write",
hiveClientConfig.isCollectColumnStatisticsOnWrite(),
false),
booleanProperty(
Author

Add it as a session property.

private int partitionStatisticsSampleSize = 100;
private boolean ignoreCorruptedStatistics;
private boolean collectColumnStatisticsOnWrite;
private boolean negativeStatisticCountEnabled;
Author

Add this to HiveClientConfig. The default value is false, which means that if we meet -1, we consider the statistics value corrupted.

@arhimondr
Member

@VicoWu Adding a flag seems to be overkill.

Please just add a fromMetastoreNullsCount method to https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java#L336, where you check the nulls count for -1 explicitly and, if it matches, return empty(). Please make sure to add a link to the corresponding ticket in the Impala ticket tracking system.
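A minimal sketch of what such a helper could look like, assuming it simply maps the -1 sentinel to an absent statistic (this illustrates the suggestion, not necessarily the exact code that was merged):

```java
import java.util.OptionalLong;

public class NullsCountConversion
{
    // Impala's COMPUTE STATS writes -1 when the null count is unavailable;
    // map that sentinel to an absent statistic instead of failing the query.
    // See https://issues.apache.org/jira/browse/IMPALA-7497
    static OptionalLong fromMetastoreNullsCount(long nullsCount)
    {
        if (nullsCount == -1L) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(nullsCount);
    }

    public static void main(String[] args)
    {
        System.out.println(fromMetastoreNullsCount(-1)); // OptionalLong.empty
        System.out.println(fromMetastoreNullsCount(6));  // OptionalLong[6]
    }
}
```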

@VicoWu
Author

VicoWu commented Oct 2, 2018

@arhimondr I have reworked my code following your advice.
Do you mean I should file an Impala ticket and add the ticket link in a comment in my Presto code?

@VicoWu VicoWu force-pushed the hotfix-nullcounts-failed-query branch from 1f7ae71 to f5a5581 on October 2, 2018 03:09
sopel39 previously approved these changes Oct 2, 2018
Contributor

@sopel39 sopel39 left a comment

Alternatively, we could make ignore_corrupted_statistics relax the checks and try to extract whatever stats seem to be valid (instead of ignoring the entire table's stats).

Contributor

  • nit: "uses"
  • This is not really a javadoc comment (it does not describe what the method generally does). I would replace it with an inline // Impala uses -1 to ... comment inside the fromMetastoreNullsCount method body.

What is "no-sample nullsCount"?

Contributor

There are other places below that need an update (search for "nullsCount")

Author

@VicoWu VicoWu Oct 3, 2018

@findepi , thanks for your correction.

I have added the same conversion for the other nullsCounts in the fromMetastoreApiColumnStatistics method.

I think other methods, like createDateStatistics, are used to write the statistics back to Hive, so I should not add this conversion to them, right?
What's your opinion?

Contributor

When writing, -1 should by no means be a written value, so there is no need to change anything there.

@sopel39 sopel39 dismissed their stale review October 2, 2018 08:50

There are more places to be fixed, as @findepi marked.

@sopel39
Contributor

sopel39 commented Oct 2, 2018

Consider adding tests to: TestThriftMetastoreUtil

@findepi
Contributor

findepi commented Oct 2, 2018

FWIW:

(I'm linking to today's master head commit, so that the line markers remain accurate.)

https://github.com/apache/impala/blob/ea2809f5ddcf48f3f41dcd12743e8a17b4ea8cd7/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java#L128-L130

https://github.com/apache/impala/blob/ea2809f5ddcf48f3f41dcd12743e8a17b4ea8cd7/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java#L253-L263

(I don't understand why counting NULLs in Impala should make stats 2x slower… I always thought NDVs are what is hard to compute.)

@arhimondr
Member

@findepi From what I understand, they literally compute the number of nulls in Impala. Since there is no aggregation to directly compute the number of nulls, a projection is needed. In Presto, we compute the number of non-nulls by running count(column) directly, and then we simply subtract that value from the number of rows to get the number of nulls.
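The subtraction described above can be sketched as follows (a hypothetical helper for illustration, not Presto's actual code):

```java
import java.util.OptionalLong;

public class NullsFromCounts
{
    // Derive the null count from the total row count (count(*)) and the
    // non-null count (count(column)), instead of counting nulls directly.
    static OptionalLong nullsCount(long rowCount, long nonNullCount)
    {
        if (rowCount < 0 || nonNullCount < 0 || nonNullCount > rowCount) {
            // inconsistent inputs: treat the statistic as unknown
            return OptionalLong.empty();
        }
        return OptionalLong.of(rowCount - nonNullCount);
    }

    public static void main(String[] args)
    {
        System.out.println(nullsCount(1000, 940)); // OptionalLong[60]
    }
}
```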

@findepi
Contributor

findepi commented Oct 2, 2018

@arhimondr let me try -- https://gerrit.cloudera.org/11565

@findepi
Contributor

findepi commented Oct 2, 2018

@kokosing @arhimondr I am concerned (though I didn't test it) that the change we want to introduce here won't really help.

It will surely fix the SHOW STATS output for a table. However, when the null fraction is unknown, the stats calculation rules will generally refuse to work, because com.facebook.presto.cost.SymbolStatsEstimate#getNullsFraction is most of the time taken as is, without any polyfill value.
I am unsure how to address this. Perhaps, to actually achieve useful interop with the stats computed by Impala, treating -1 as 0 (not unknown) would be more practical (in virtually all cases other than x IS NULL queries).

(This observation is by no means a blocker for this PR. I am still convinced the change we want to introduce here is an improvement.)

@VicoWu VicoWu force-pushed the hotfix-nullcounts-failed-query branch 3 times, most recently from 481d964 to 0e4290c on October 3, 2018 02:33
@VicoWu
Author

VicoWu commented Oct 8, 2018

@findepi , so, do you have any further updates for this PR, or what can I do next?

Contributor

@findepi findepi left a comment

LGTM. Please squash the commits and change the commit message to something like

Treat -1 null count statistic as absent

Impala `COMPUTE STATS` will write -1 as the null count.
See https://issues.apache.org/jira/browse/IMPALA-7497

Member

@arhimondr arhimondr left a comment

LGTM % comments

Member

nit: remove this comment

Member

Add a link to the IMPALA ticket.

Member

squash this commit

Author

@arhimondr , thanks for your correction.
I have squashed my commit and resubmitted. Please help review.

Impala `COMPUTE STATS` will write -1 as the null count.
See https://issues.apache.org/jira/browse/IMPALA-7497
@VicoWu VicoWu force-pushed the hotfix-nullcounts-failed-query branch from 0e4290c to 202124b on October 9, 2018 02:16
@kokosing kokosing merged commit c09b1e7 into prestodb:master Oct 9, 2018
@kokosing
Contributor

kokosing commented Oct 9, 2018

Merged, thanks!

@VicoWu
Author

VicoWu commented Oct 9, 2018

@kokosing @findepi @arhimondr @sopel39
Many thanks to all of you for your review and hard work! I really learned a lot!
