[Hotfix]nullsCount in columnStatistic should marked as not present in…#11549
arhimondr left a comment
You should fix the corrupted statistics for the tables you have. As a workaround, there is a session property (ignore_corrupted_statistics) and a config property (hive.ignore-corrupted-statistics) that allow you to temporarily ignore column statistics that don't satisfy basic sanity checks. When this property is set and corruption is detected, all the statistics for the table are ignored.
|
@arhimondr, thank you for your reply. But I have the following concerns:
|
I think the absence of the statistic is often signalled by the absence of the corresponding table property. Do you know why this isn't so in your case? (Maybe this behavior is very standard for Cloudera 5.7.6, I just don't have this version at hand.) |
|
@findepi , thank you for your response. |
|
There seems to be some miscommunication on both sides, so let me try to clear things up: The Hive metastore should support skipping (not storing) statistics for certain columns. Presto supports this today -- if the metastore simply leaves the field unset/null in the Thrift RPC response, then it will be ignored by Presto. Presto does not allow invalid values such as We consider invalid statistics to be a bug in whatever is writing the data and publishing the stats to the Hive metastore. We choose to fail on corruption because we want to know about it immediately and fix the problem and thus minimize the amount of bad data. If one value is obviously bad, we don't trust the other statistics, because they might also be bad, but simply in a way that we cannot detect. If you want, you can set |
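The policy described above can be illustrated with a minimal sketch. It is not Presto's code: `StatsPolicySketch`, `ColumnStats`, and `validate` are hypothetical names, and the only sanity check shown (a negative nulls count) is an assumption chosen for illustration. An unset value is simply absent, an invalid value fails the query, and opting in to ignoring corruption drops every statistic for the table rather than just the bad one.

```java
import java.util.Collections;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical stand-ins for illustration only; these are not Presto classes.
final class StatsPolicySketch
{
    // One column's statistics; an empty OptionalLong means the metastore left the field unset.
    static final class ColumnStats
    {
        final String column;
        final OptionalLong nullsCount;

        ColumnStats(String column, OptionalLong nullsCount)
        {
            this.column = column;
            this.nullsCount = nullsCount;
        }
    }

    // Unset values pass through untouched. A negative nulls count is treated as corruption:
    // either the whole table's statistics are dropped (ignoreCorrupted = true) or the call fails.
    static List<ColumnStats> validate(List<ColumnStats> stats, boolean ignoreCorrupted)
    {
        for (ColumnStats stat : stats) {
            if (stat.nullsCount.isPresent() && stat.nullsCount.getAsLong() < 0) {
                if (ignoreCorrupted) {
                    // One bad value means the other values are not trusted either.
                    return Collections.emptyList();
                }
                throw new IllegalStateException(
                        "Corrupted statistics for column " + stat.column + ": nullsCount=" + stat.nullsCount.getAsLong());
            }
        }
        return stats;
    }
}
```

With this shape, a single `-1` in any column makes `validate(stats, true)` return an empty list, which mirrors the "ignore all statistics for the table" behavior described for the workaround property.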
|
@electrum Thanks for your response. |
|
@VicoWu , Also you can ignore corrupted statistics session wide, by setting (given the catalog you have the Hive connector installed is |
|
@arhimondr, yes, this is just what I want. I didn't find it because I was searching for it in my 0.210 branch. Thank you. |
|
@VicoWu How did you get such statistics? How have you analyzed the table? With Hive or Impala or something else (maybe Spark)? What exact commands have you run? To me, if an external tool (like Hive) produces such values and Presto integrates with that tool, then we should accept them. |
|
Also is your table partitioned? |
|
@VicoWu For what data types does it happen? |
|
@VicoWu I tried to reproduce it, but with no luck on CDH 5.13. Can you try to reproduce it on CDH 5.13? Can you try to use the below to reproduce your issue, then it will be easier to fix it. |
|
@kokosing sorry for the late reply. I have contacted Cloudera official support. Based on their experiment, they claim that this -1 is generated by Impala. They are running more experiments and discussing this behavior internally, and I am still waiting for their reply. |
|
@kokosing, for your information: You can see that num_nulls is -1! This is a flat table in ORC format. |
| OptionalLong min = longStatsData.isSetLowValue() ? OptionalLong.of(longStatsData.getLowValue()) : OptionalLong.empty(); | ||
| OptionalLong max = longStatsData.isSetHighValue() ? OptionalLong.of(longStatsData.getHighValue()) : OptionalLong.empty(); | ||
| OptionalLong nullsCount = longStatsData.isSetNumNulls() ? OptionalLong.of(longStatsData.getNumNulls()) : OptionalLong.empty(); | ||
| OptionalLong nullsCount = longStatsData.isSetNumNulls() && longStatsData.getNumNulls() >= 0l ? OptionalLong.of(longStatsData.getNumNulls()) : OptionalLong.empty(); |
- please add a feature toggle to `HiveClientConfig` like `hive.negative-count-statistics-enabled`, so when this feature toggle is not set `-1` will mean that stats are corrupted
- please compare exactly against `-1` here
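To make the difference between the two comparisons concrete, here is a minimal sketch. `NullsCountSketch` and both method names are hypothetical and not part of the PR. The lenient variant mirrors the `>= 0l` check in the diff above, while the strict variant treats only the exact sentinel `-1` as "not present", so other negative values stay visible to later sanity checks.

```java
import java.util.OptionalLong;

// Hypothetical helper for illustration; not the code that was merged.
final class NullsCountSketch
{
    // Strict: only the exact sentinel -1 is mapped to "not present".
    // Any other negative value is kept so that corruption checks can still see it.
    static OptionalLong exactSentinel(long nullsCount)
    {
        return nullsCount == -1L ? OptionalLong.empty() : OptionalLong.of(nullsCount);
    }

    // Lenient: every negative value is mapped to "not present",
    // which would also hide values such as -5 that are more likely to be real corruption.
    static OptionalLong anyNegative(long nullsCount)
    {
        return nullsCount >= 0L ? OptionalLong.of(nullsCount) : OptionalLong.empty();
    }
}
```

For an input of `-5`, `exactSentinel` keeps the value and `anyNegative` discards it; for `-1` and for non-negative inputs the two agree.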
@kokosing , thanks.
So, from my understanding:
When hive.negative-count-statistics-enabled = true, it means that when we encounter -1, we don't fail the query; we just ignore the value.
When hive.negative-count-statistics-enabled = false, we should consider this -1 as corrupted and thus fail the query (if hive.ignore-corrupted-statistics=false).
Did I get your point?
@kokosing, I have resubmitted my code. I think that the name hive.negative-count-statistics-enabled is not accurate, because it only covers the single negative value -1 rather than all negative values.
What's your opinion?
We talked offline with @kokosing. Let's not add a new toggle for that.
|
@arhimondr @electrum We have very similar case with one of our customers who is also using Impala. I will take a look into Impala code and see why they are using |
| HIVE_TABLE_DROPPED_DURING_QUERY(35, EXTERNAL), | ||
| // HIVE_TOO_MANY_BUCKET_SORT_FILES(36) is deprecated | ||
| HIVE_CORRUPTED_COLUMN_STATISTICS(37, EXTERNAL), | ||
| HIVE_NULL_COLUMN_NEGATIVE_STATISTICS(38, EXTERNAL), |
Add the exception code HIVE_NULL_COLUMN_NEGATIVE_STATISTICS for when we encounter -1 for nullCount.
| catch (PrestoException e) { | ||
| if (e.getErrorCode().equals(HIVE_CORRUPTED_COLUMN_STATISTICS.toErrorCode()) && isIgnoreCorruptedStatistics(session)) { | ||
| if (e.getErrorCode().equals(HIVE_CORRUPTED_COLUMN_STATISTICS.toErrorCode()) && isIgnoreCorruptedStatistics(session) | ||
| || e.getErrorCode().equals(HIVE_NULL_COLUMN_NEGATIVE_STATISTICS.toErrorCode()) && isNegativeStatisticCountEnabled(session)) { |
If we encounter -1 for nullCounts and the toggle hive.negative-count-statistics-enabled is true, then we treat the statistic as not present.
|
|
||
| private static void checkMinusOneNullsCountStatistic(boolean expression, SchemaTableName table, String partition, String column, String message, Object... args) | ||
| { | ||
| if (!expression) { |
If nullsCount = -1, we throw a PrestoException with the specific error code HIVE_NULL_COLUMN_NEGATIVE_STATISTICS.
| "Experimental: Enables automatic column level statistics collection on write", | ||
| hiveClientConfig.isCollectColumnStatisticsOnWrite(), | ||
| false), | ||
| booleanProperty( |
| private int partitionStatisticsSampleSize = 100; | ||
| private boolean ignoreCorruptedStatistics; | ||
| private boolean collectColumnStatisticsOnWrite; | ||
| private boolean negativeStatisticCountEnabled; |
Add this to HiveClientConfig. The default value is false, which means that if we encounter -1, we treat the statistics value as corrupted.
|
@VicoWu Adding a flag seems to be an overkill. Please just add the |
|
@arhimondr I have reset my code as per your awesome advice. |
1f7ae71 to f5a5581
sopel39 left a comment
Alternatively, we could make ignore_corrupted_statistics relax the checks and try to extract whatever stats seem to be valid (instead of ignoring the entire table's stats).
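A rough sketch of what such a relaxed mode might look like, assuming a simple per-column map of raw nulls counts. `RelaxedStatsSketch` and `salvage` are hypothetical names, and the sanity check (non-negative and no larger than the row count) is an assumption made for illustration, not the connector's actual rule set.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical illustration of the "keep what looks valid" alternative; not Presto code.
final class RelaxedStatsSketch
{
    // Keeps each column's nulls count when it passes the sanity check and replaces it
    // with empty otherwise, so a single bad value does not discard the other columns.
    // Assumes the map contains no null values.
    static Map<String, OptionalLong> salvage(Map<String, Long> rawNullsCounts, long rowCount)
    {
        Map<String, OptionalLong> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : rawNullsCounts.entrySet()) {
            long value = entry.getValue();
            boolean valid = value >= 0 && value <= rowCount;
            result.put(entry.getKey(), valid ? OptionalLong.of(value) : OptionalLong.empty());
        }
        return result;
    }
}
```

The trade-off raised earlier in the thread still applies: a value that passes the check may be wrong in a way that cannot be detected, which is the argument for discarding everything once one value is known to be bad.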
- nit: "uses"
- This is not really a javadoc comment (it does not describe what the method is generally doing). I would replace it with an inline `// Impala uses -1 to ...` comment inside the `fromMetastoreNullsCount` method body
what is "no-sample nullsCount"?
presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java
There are other places below that need an update (search for "nullsCount")
@findepi , thanks for your correction.
I have added the same conversion for the other nullsCounts in the method fromMetastoreApiColumnStatistics.
I think other methods like createDateStatistics are used to write the statistics information back to Hive, so I should not add this conversion to them, right?
What's your opinion?
When writing, -1 should by no means be a written value, so no need to change there.
There are more places to be fixed as @findepi marked
|
Consider adding tests to: |
|
FWIW: (i'm linking to today's (I don't understand why counting NULLs in Impala should make stats 2x slower.. I always thought NDVs is what is hard to compute.) |
|
@findepi From what i understand they literally compute number of nulls in IMPALA. Since there is no aggregation to directly compute number of nulls, a projection is needed. In Presto we compute number of non nulls by running |
|
@arhimondr let me try -- https://gerrit.cloudera.org/11565 |
|
@kokosing @arhimondr i am concerned (but i didn't test), that the change we want to introduce here won't really help. It will surely fix the (This observation is by no means a blocker for this PR. I am still convinced the change we want to introduce here is an improvement.) |
481d964 to 0e4290c
|
@findepi, do you have any further updates for this PR, or is there anything I should do next? |
findepi left a comment
LGTM. Please squash the commits and change the commit message to something like
Treat -1 null count statistic as absent
Impala `COMPUTE STATS` will write -1 as the null count.
See https://issues.apache.org/jira/browse/IMPALA-7497
Add the link to the ticket number in IMPALA
@arhimondr , thanks for your correction.
I have squashed my commits and resubmitted. Please help review.
Impala `COMPUTE STATS` will write -1 as the null count. See https://issues.apache.org/jira/browse/IMPALA-7497
0e4290c to 202124b
|
Merged, thanks! |
|
@kokosing @findepi @arhimondr @sopel39 |
My Presto is running against a Cloudera 5.7.6 Hive metastore.
After upgrading to 0.210, many queries that previously ran successfully failed with the following error:
This is because the Hive metastore uses -1 to mark the column nullCounts as unavailable.
After I upgraded to the master branch, the error still exists, and it becomes this:
The safe solution is to mark the nullsCount as unavailable when Hive returns -1 for it, instead of just failing the user's query!
**We should never fail a user's query when statistics information is not present or looks illegal! Illegal statistics information should be marked as absent.**