Refactor HiveStatisticsProvider and statistics SPI#11463
Refactor HiveStatisticsProvider and statistics SPI#11463arhimondr merged 14 commits intoprestodb:masterfrom
Conversation
2c66de4 to
5180de8
Compare
There was a problem hiding this comment.
Maybe you could extract to some Preconditions class in spi?
There was a problem hiding this comment.
We got a lot of checkArgument and vefiry methods implementation across the presto-spi module. I think that a good idea to have a Precondition class in that module to avoid implementing this boilerplate code every time someone needs that. However to keep it consistent i would rather like to do this as a separate PR, and i would like to replace the usages everywhere.
There was a problem hiding this comment.
s/create/estimate
then you could static import this
There was a problem hiding this comment.
Although i agree that createObjectName sounds more fluent that ObectName.create, i can see some disadvantages in the createObjectName approach. Here some of them:
- Additional (static) import introduces additional merge conflicts
- It is not always clear what is the exact name of the factory method. For example taking a class called
DummyFunnyCircle, it is not always obvious how's the factory method is called.createDummyFunnyCircle?createFunnyCircle?createCircle? - Static import may introduce a name clash. For example in test classes you may want to have a private factory method called
createFunnyCircle(some parameters). In that method you have no other way, but useDummyFunnyCircle.createFunnyCircle()without static import. Using it without static import is tedious, and tautological.
I can agree that these arguments are not very strong. But still, i think it is better to follow some existing conventions:
- https://docs.oracle.com/javase/tutorial/datetime/overview/naming.html
- http://www.informit.com/articles/article.aspx?p=1216151 (check the last paragraph)
Considering that we are broadly using Optional and Guava's ImmutableList, ImmutableMap, Immutable*... i think that it is more consistent to use similar convention when naming the factory methods. Just because people are more used to it.
I'm going to rename Estimate.create to Estimate.of to follow the Optional convention.
Also i think we should adopt this convention all across the code.
For example:
- Some neutral, "empty" value must be expressed as
Something.empty(), and notEMPTY_SOMETHING,SOMETHING_EMPTY,emptySomething(), so on. - Static factory methods should follow this convention: https://docs.oracle.com/javase/tutorial/datetime/overview/naming.html
- Builder constructors should look like
Something.builder(), and notbulderOfSomethingor similar.
Ideally it would be nice to have a static rule that checks it. But i don't think it is easily implementable.
There was a problem hiding this comment.
emptyStatistics? That way you could static import this.
There was a problem hiding this comment.
Ditto. e.g. Optional.empty()
presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
what happened to timestamp?
There was a problem hiding this comment.
It is not supported by the optimizer. I found it to be misleading, that's why i decided to remove it until we support it in the optimizer.
There was a problem hiding this comment.
This is a transitional method that is removed in the very next commit. I just moved it to the MetastoreHiveStatisticsProvider, since this is the only class that uses it.
There was a problem hiding this comment.
This value is out of range of the values, precisely represented by double. Before, when the value was transferred as Object it was truncated when converting to SymbolStatsEstimate. But now the truncation is visible to the end user.
There was a problem hiding this comment.
explain in commit message what you change and motivation behind this
5180de8 to
3328e0f
Compare
mbasmanova
left a comment
There was a problem hiding this comment.
Remove RangeColumnStatistics
There was a problem hiding this comment.
consider using MoreObjects#toStringHelper
There was a problem hiding this comment.
Unfortunately Guava is not available in the presto-spi module
presto-spi/src/main/java/com/facebook/presto/spi/statistics/ColumnStatistics.java
Outdated
Show resolved
Hide resolved
presto-spi/src/main/java/com/facebook/presto/spi/statistics/ColumnStatistics.java
Outdated
Show resolved
Hide resolved
presto-spi/src/main/java/com/facebook/presto/spi/statistics/ColumnStatistics.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Perhaps, remove range from the message. Here and in 2 more places further down.
There was a problem hiding this comment.
Perhaps, add a check for nonNullRowsCount being 0?
presto-spi/src/main/java/com/facebook/presto/spi/statistics/Estimate.java
Outdated
Show resolved
Hide resolved
mbasmanova
left a comment
There was a problem hiding this comment.
Add TableStatistics#empty() method
There was a problem hiding this comment.
Do we need isEmpty() method?
There was a problem hiding this comment.
No, i don't think we ever check it for empty explicitly.
presto-spi/src/main/java/com/facebook/presto/spi/statistics/TableStatistics.java
Outdated
Show resolved
Hide resolved
mbasmanova
left a comment
There was a problem hiding this comment.
Refactor TableStatistics
There was a problem hiding this comment.
Perhaps, use MoreObjects.toStringHelper
There was a problem hiding this comment.
It is not available in presto-spi
mbasmanova
left a comment
There was a problem hiding this comment.
Add ignore_corrupted_statistics session property
LGTM
mbasmanova
left a comment
There was a problem hiding this comment.
Extract parsePartition in PartitionManager
There was a problem hiding this comment.
s/partitionId/partitionName for consistency with partitionName parameter in extractPartitionKeyValues method
There was a problem hiding this comment.
s/keys/values - extractPartitionKeyValues name is confusing; it suggests that the return value is a key-value maps of partition column names and values; in fact, the return value is a list of values of partition columns
There was a problem hiding this comment.
This code assumes that partition_name lists partition columns in proper order, however, extractPartitionKeyValues doesn't check for that; should we have an explicit check?
There was a problem hiding this comment.
The partitionName is something very confusing. Although all the column names are listed, and technically it is possible to shuffle the column names without loosing any information. However it is not legit. The partition values in the partition name should always be in order of partition columns.
mbasmanova
left a comment
There was a problem hiding this comment.
Make getPartitionsSample deterministic
LGTM
mbasmanova
left a comment
There was a problem hiding this comment.
Print column statistics in deterministic manner
Could you update commit message to clarify that this change applies to SHOW STATS command?
There was a problem hiding this comment.
Shouldn't we use a List to preserve the order of columns as defined in a table?
There was a problem hiding this comment.
We rely that the Map is ordered. Unfortunately this is something we rely on in the other parts of the code.
There was a problem hiding this comment.
@arhimondr Not sure I understand this completely. A more robust approach could be to fetch column list from ConnectorTableMetadata#columns.
There was a problem hiding this comment.
Could you add a test for a case when some columns (e.g. partition columns, maybe) don't have stats?
There was a problem hiding this comment.
That's one of the reasons why am i changing this. Before we were always assume that the TableStatistics object contains the statistics for all the columns. And it was true before the refactor. After refactor it changes. So it is implicitly tested by the existing tests.
mbasmanova
left a comment
There was a problem hiding this comment.
Remove validation from HiveBasicStatistics
LGTM
mbasmanova
left a comment
There was a problem hiding this comment.
Represent min and max statistics in SPI as double
There was a problem hiding this comment.
MoreObjects.toStringHelper
There was a problem hiding this comment.
@mbasmanova in SPI we don't have Guava dependency (and don't want to have it)
There was a problem hiding this comment.
@findepi Piotr, thanks for explaining. In this case, let's copy-paste most useful utilities like this one.
There was a problem hiding this comment.
@mbasmanova We didn't do this yet, because we didn't want to add publicly accessible classes not intended for public consumption and package private didn't make it, as there as subpackages in SPI... Of course, this is something we can revisit anytime. I am all for it.
There was a problem hiding this comment.
This is something we can explore after moving to Java 9 modules, as that will allow us to have module-private cross-package utility code.
There was a problem hiding this comment.
The name Range a bit too generic. Would you consider changing to DoubleRange. BTW, commons-lang also has DoubleRange class.
There was a problem hiding this comment.
We also already have com.facebook.presto.spi.predicate.Range
There was a problem hiding this comment.
com.facebook.presto.spi.predicate.Range is too generic. I gonna go with DoubleRange
There was a problem hiding this comment.
nit: perhaps, test the resulting object as well?
private static void assertRange(double min, double max)
{
Range range = new Range(min, max);
assertEquals(range.getMin(), min);
assertEquals(range.getMax(), max);
}
There was a problem hiding this comment.
Would you update the commit message to explicitly mention this change?
There was a problem hiding this comment.
This should be a separate commit.
There was a problem hiding this comment.
Extracted to a separate commit
3328e0f to
5e8d9e7
Compare
findepi
left a comment
There was a problem hiding this comment.
i just skimmed selected commits
@arhimondr apologies that i didn't do proper review you asked me for :/
There was a problem hiding this comment.
cmt msg
If corruption is detected and the session property is set to true,
the statistics provider logs the corruption details and returns empty statistics.
this isn't true yet, since no-one reads the new flag. Maybe squash this commit with the usage? Or amend the cmt msg to reflect this?
There was a problem hiding this comment.
I'm just going to squash it
There was a problem hiding this comment.
cmt msg:
Make getPartitionsSample deterministic
... but the test added here doesn't look like testing the determinism.
(the previous tests look more like it... but probably were insufficient)
There was a problem hiding this comment.
I don't know why did i add it. Let me remove it.
There was a problem hiding this comment.
Sure. Anyway, would it be possible to cover Make getPartitionsSample deterministic with some kind of tests?
There was a problem hiding this comment.
I see what is going on. Accidentally the test that verifies determinism went as part of the Extract parsePartition in PartitionManager commit. Let me move it here.
There was a problem hiding this comment.
This should be a separate commit.
There was a problem hiding this comment.
"ToDouble" doesn't convey the intended semantics; convertPrestoValueToStatsRepresentation
There was a problem hiding this comment.
nit: express in ctor params order:
if (min > max) {
throw new IllegalArgumentException(format("min must be lower than or equal to max. min: %s. max: %s.", min, max));
There was a problem hiding this comment.
Or even simpler
format("min (%s) cannot be larger than max (%s)", min, max)There was a problem hiding this comment.
@electrum I don't like to mix the values with the static part of an error message, as it is harder than to search for it in the codebase. If you don't mind I would like to leave it as is.
There was a problem hiding this comment.
@mbasmanova in SPI we don't have Guava dependency (and don't want to have it)
There was a problem hiding this comment.
Let me remove it. Currently all the representation that are allowed for Presto types are equatable
There was a problem hiding this comment.
cmt msg Normalize distinct values count:
s/or/of in than a total or non-null rows count
There was a problem hiding this comment.
"produced by the HLL"
or anything else, depending who wrote those stats (Hive, Spark, Impala, ...). Does Hive use HLL?
There was a problem hiding this comment.
Does Hive use HLL?
Yes, by default. There some other, legacy sketch it can use.
Let me remove the HLL
mbasmanova
left a comment
There was a problem hiding this comment.
Refactor MetastoreHiveStatisticsProvider
@arhimondr I'm still looking, but want to share comments I made before the code changed.
There was a problem hiding this comment.
perhaps, add checks to avoid creating empty instances using public constructor
There was a problem hiding this comment.
We should allow creating empty() implicitly. For example if there is no information neither about nullsFraction nor distinctValuesCount nor dataSize nor range in the Metastore.
There was a problem hiding this comment.
perhaps, combine these if statements
There was a problem hiding this comment.
I find it confusing to mix and match id and name terms wrt partition. I think name is more widely used term.
mbasmanova
left a comment
There was a problem hiding this comment.
Refactor MetastoreHiveStatisticsProvider
Some more questions.
...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java
Outdated
Show resolved
Hide resolved
...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
nit: perhaps, add Avg to the name to clarify that this method computes average number of rows
There was a problem hiding this comment.
renamed to calculateAverageRowsPerPartition. Also renamed all the rowsPerPartition variables to averageRowsPerPartition
...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java
Outdated
Show resolved
Hide resolved
...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java
Outdated
Show resolved
Hide resolved
mbasmanova
left a comment
There was a problem hiding this comment.
Normalize distinct values count
LGTM
|
@arhimondr Andrii, I finished reviewing this PR. Overall, I like these changes a lot. I have a few questions, though. |
5e8d9e7 to
2934f53
Compare
|
Comments addressed. Also i added one more tiny commit: |
mbasmanova
left a comment
There was a problem hiding this comment.
@arhimondr Andrii, this change overall looks great. I made a few comments, but nothing major. I'm thinking that it would be helpful to document the new constraints that Presto enforces on Metastore stats.
Done |
e1ee23d to
c6b78b9
Compare
|
Thanks for the review. Comments addressed. |
|
I am sorry that I was not able to do a proper review. From the discussions we had (online and offline ones) I see no blockers for that to be merged. |
|
@arhimondr Andrii, thanks for making all of these changes. I'm looking forward to seeing fewer queries fail when generating plans with |
- Rename factory methods: unknownValue -> unknown, zeroValue -> zero - Add factory method create(value). This factory method checks if value is finite - Validate estimates in ColumnStatistics
Instead of TableStatistics#EMPTY_STATISTICS static field
Use deterministic murmur3 hash. goodFastHash changes its seed on every restart.
Make SHOW STATS to print column statistics in the same order as they appear in the table. Also print rows with all nulls for columns with missing statistics.
This class must be able to store exactly what is stored in metastore. Sanity checks must be applied explicitly in MetastoreHiveStatisticsProvider.
MIN and MAX for TIMESTAMP column is not used by the optimizer. Having it in the Hive connector is misleading.
min and max for other types than numeric were simply ignored by the optimizer. Although SHOW STATS used to print min and max statistics for strings. Since the min and max are represented as double, the SHOW STATS command will no longer print these statistics, better representing the statistics that the optimizer actually takes into account.
- Add sanity checks to make sure that statistics returned make sense - Make the class to be more unit test friendly - Add extensive unit tests
Since the number of distinct values is estimated, it may end up higher than a total of non-null rows count. It makes sense to normalize it before writing and reading from the metastore.
If the statistics are corrupted, it doesn't make much sense to restore them on rollback.
c6b78b9 to
652112e
Compare
@mbasmanova @arhimondr what are the new constraints? |
The checks introduced are basically very straightforward sanity checks. Like if |
|
i am fine with not documenting these. However, if we run into real-life use-cases where the assertions do not hold (because of the way some other program populated the stats), we will need to update the code to support that. |
|
Aggreed. I'm going add a note to the release note about this change though. |
|
Apparently we have first real-life case of this already -- #11549 |
No description provided.