Refactor HiveStatisticsProvider and statistics SPI by arhimondr · Pull Request #11463 · prestodb/presto

arhimondr · 2018-09-12T16:22:16Z

No description provided.

kokosing

Just skimmed so far

kokosing · 2018-09-13T05:24:41Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/ColumnStatistics.java

Maybe you could extract to some Preconditions class in spi?

We have a lot of checkArgument and verify method implementations across the presto-spi module. I think it is a good idea to have a Preconditions class in that module to avoid reimplementing this boilerplate every time someone needs it. However, to keep it consistent, i would rather do this as a separate PR and replace the usages everywhere.
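For illustration, a Guava-free helper of the kind discussed here might look roughly like the sketch below. The class name SpiPreconditions is hypothetical, and throwing IllegalStateException from verify is an assumption of this sketch (Guava's own verify throws a different exception type).

```java
// Hypothetical helper: the name and placement are assumptions, not part of presto-spi.
final class SpiPreconditions
{
    private SpiPreconditions() {}

    // Rejects a bad argument passed by a caller.
    static void checkArgument(boolean condition, String message, Object... args)
    {
        if (!condition) {
            throw new IllegalArgumentException(String.format(message, args));
        }
    }

    // Signals a broken internal invariant (the exception type is an assumption of this sketch).
    static void verify(boolean condition, String message, Object... args)
    {
        if (!condition) {
            throw new IllegalStateException(String.format(message, args));
        }
    }
}
```

A call such as SpiPreconditions.checkArgument(value >= 0, "value is negative: %s", value) would then replace the hand-written if/throw blocks.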

kokosing · 2018-09-13T05:26:09Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/Estimate.java

s/create/estimate

then you could static import this

Although i agree that createObjectName sounds more fluent than ObjectName.create, i can see some disadvantages in the createObjectName approach. Here are some of them:

Additional (static) import introduces additional merge conflicts

It is not always clear what the exact name of the factory method is. For example, given a class called DummyFunnyCircle, it is not obvious what the factory method is called. createDummyFunnyCircle? createFunnyCircle? createCircle?

Static import may introduce a name clash. For example, in test classes you may want to have a private factory method called createFunnyCircle(some parameters). In that case you have no choice but to call DummyFunnyCircle.createFunnyCircle() without a static import, which is tedious and tautological.

I can agree that these arguments are not very strong. But still, i think it is better to follow some existing conventions:

https://docs.oracle.com/javase/tutorial/datetime/overview/naming.html

http://www.informit.com/articles/article.aspx?p=1216151 (check the last paragraph)

Considering that we are broadly using Optional and Guava's ImmutableList, ImmutableMap, Immutable*... i think that it is more consistent to use a similar convention when naming the factory methods, simply because people are more used to it.

I'm going to rename Estimate.create to Estimate.of to follow the Optional convention.
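As a rough sketch of what the Optional-style naming could look like, consider the class below. EstimateSketch is a hypothetical stand-in, and representing "unknown" as NaN as well as the exact error message are assumptions of this example rather than details confirmed in this thread.

```java
// Sketch only: representing "unknown" as NaN is an assumption of this example.
final class EstimateSketch
{
    private static final EstimateSketch UNKNOWN = new EstimateSketch(Double.NaN);
    private static final EstimateSketch ZERO = new EstimateSketch(0);

    private final double value;

    private EstimateSketch(double value)
    {
        this.value = value;
    }

    static EstimateSketch unknown()
    {
        return UNKNOWN;
    }

    static EstimateSketch zero()
    {
        return ZERO;
    }

    // Named like Optional.of; the factory rejects NaN and infinite values.
    static EstimateSketch of(double value)
    {
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            throw new IllegalArgumentException("value must be finite: " + value);
        }
        return new EstimateSketch(value);
    }

    boolean isUnknown()
    {
        return Double.isNaN(value);
    }

    double getValue()
    {
        return value;
    }
}
```

Call sites then read EstimateSketch.of(42) and EstimateSketch.unknown(), with no static import needed.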

Also i think we should adopt this convention all across the code.

For example:

Some neutral, "empty" value must be expressed as Something.empty(), and not EMPTY_SOMETHING, SOMETHING_EMPTY, emptySomething(), and so on.

Static factory methods should follow this convention: https://docs.oracle.com/javase/tutorial/datetime/overview/naming.html

Builder constructors should look like Something.builder(), and not builderOfSomething or similar.

Ideally it would be nice to have a static rule that checks it. But i don't think it is easily implementable.
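The empty() and builder() conventions listed above can be sketched on a small value class. LabelsSketch and its methods are hypothetical names used only for illustration; nothing here is taken from the Presto codebase.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical value class used only to illustrate the naming convention.
final class LabelsSketch
{
    private static final LabelsSketch EMPTY = new LabelsSketch(Collections.emptyList());

    private final List<String> values;

    private LabelsSketch(List<String> values)
    {
        this.values = Collections.unmodifiableList(new ArrayList<>(values));
    }

    // Neutral value: LabelsSketch.empty(), rather than a constant or emptyLabels().
    static LabelsSketch empty()
    {
        return EMPTY;
    }

    // Builder entry point: LabelsSketch.builder(), rather than builderOfLabels().
    static Builder builder()
    {
        return new Builder();
    }

    List<String> getValues()
    {
        return values;
    }

    static final class Builder
    {
        private final List<String> values = new ArrayList<>();

        Builder add(String value)
        {
            values.add(value);
            return this;
        }

        LabelsSketch build()
        {
            return new LabelsSketch(values);
        }
    }
}
```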

CC: @electrum @martint

kokosing · 2018-09-13T05:27:29Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/TableStatistics.java

emptyStatistics? That way you could static import this.

Ditto. e.g. Optional.empty()

presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java

kokosing · 2018-09-13T05:38:31Z

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java

what happened to timestamp?

It is not supported by the optimizer. I found it to be misleading, that's why i decided to remove it until we support it in the optimizer.

kokosing · 2018-09-13T05:40:19Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/Range.java

just create?

This is a transitional method that is removed in the very next commit. I just moved it to the MetastoreHiveStatisticsProvider, since this is the only class that uses it.

kokosing · 2018-09-13T05:42:40Z

presto-main/src/main/java/com/facebook/presto/sql/rewrite/ShowStatsRewrite.java

toStringLiteral?

kokosing · 2018-09-13T05:43:26Z

presto-product-tests/src/main/java/com/facebook/presto/tests/hive/TestHiveTableStatistics.java

why it differs?

This value is outside the range of values that can be represented precisely by double. Before, when the value was transferred as Object, it was truncated when converting to SymbolStatsEstimate. Now the truncation is visible to the end user.
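The precision loss described here is a general property of converting a 64-bit long to a double, which has a 53-bit significand. The helper below is a standalone sketch (DoublePrecisionSketch is a hypothetical name) and is not code from this PR.

```java
import java.math.BigDecimal;

final class DoublePrecisionSketch
{
    private DoublePrecisionSketch() {}

    // True when converting the long to double loses no information.
    static boolean isExactlyRepresentable(long value)
    {
        // new BigDecimal(double) captures the exact binary value of the double.
        return new BigDecimal((double) value).compareTo(BigDecimal.valueOf(value)) == 0;
    }
}
```

For instance, 1L << 53 converts exactly, while (1L << 53) + 1 is rounded to a neighbouring double, so a statistic holding such a value changes once it travels through a double.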

kokosing · 2018-09-13T05:46:40Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveMetadata.java

explain in commit message what you change and motivation behind this

mbasmanova

@arhimondr

Remove RangeColumnStatistics

mbasmanova · 2018-09-13T17:16:50Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/ColumnStatistics.java

consider using MoreObjects#toStringHelper

Unfortunately Guava is not available in the presto-spi module
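Without Guava, a comparable toString can be written with the JDK alone. The sketch below uses java.util.StringJoiner; the class and field names are hypothetical and only loosely modelled on the statistics classes discussed in this review.

```java
import java.util.StringJoiner;

// Hypothetical class; it shows a Guava-free toString, not the real ColumnStatistics.
final class ColumnStatsSketch
{
    private final double nullsFraction;
    private final double distinctValuesCount;

    ColumnStatsSketch(double nullsFraction, double distinctValuesCount)
    {
        this.nullsFraction = nullsFraction;
        this.distinctValuesCount = distinctValuesCount;
    }

    @Override
    public String toString()
    {
        // StringJoiner supplies the separator, prefix and suffix that toStringHelper would.
        return new StringJoiner(", ", "ColumnStatsSketch{", "}")
                .add("nullsFraction=" + nullsFraction)
                .add("distinctValuesCount=" + distinctValuesCount)
                .toString();
    }
}
```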

presto-spi/src/main/java/com/facebook/presto/spi/statistics/ColumnStatistics.java

mbasmanova · 2018-09-13T17:23:14Z

presto-hive/src/test/java/com/facebook/presto/hive/AbstractTestHiveClient.java

Perhaps, remove range from the message. Here and in 2 more places further down.

mbasmanova · 2018-09-13T17:24:28Z

presto-main/src/main/java/com/facebook/presto/cost/TableScanStatsRule.java

Perhaps, add a check for nonNullRowsCount being 0?

mbasmanova

@arhimondr

Refactor Estimate

presto-spi/src/main/java/com/facebook/presto/spi/statistics/Estimate.java

mbasmanova

Add TableStatistics#empty() method

mbasmanova · 2018-09-13T18:22:46Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/TableStatistics.java

Do we need isEmpty() method?

No, i don't think we ever check it for empty explicitly.

presto-spi/src/main/java/com/facebook/presto/spi/statistics/TableStatistics.java

mbasmanova

Refactor TableStatistics

mbasmanova · 2018-09-13T18:24:57Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/TableStatistics.java

Perhaps, use MoreObjects.toStringHelper

It is not available in presto-spi

mbasmanova

Add ignore_corrupted_statistics session property

LGTM

mbasmanova

Extract parsePartition in PartitionManager

mbasmanova · 2018-09-13T18:31:22Z

presto-hive/src/main/java/com/facebook/presto/hive/HivePartitionManager.java

s/partitionId/partitionName for consistency with partitionName parameter in extractPartitionKeyValues method

mbasmanova · 2018-09-13T18:32:27Z

presto-hive/src/main/java/com/facebook/presto/hive/HivePartitionManager.java

s/keys/values - extractPartitionKeyValues name is confusing; it suggests that the return value is a key-value map of partition column names and values; in fact, the return value is a list of values of partition columns

mbasmanova · 2018-09-13T18:34:04Z

presto-hive/src/main/java/com/facebook/presto/hive/HivePartitionManager.java

This code assumes that partition_name lists partition columns in proper order, however, extractPartitionKeyValues doesn't check for that; should we have an explicit check?

The partitionName is rather confusing. All the column names are listed, so technically it would be possible to shuffle them without losing any information. However, that is not legitimate: the partition values in the partition name should always be in the order of the partition columns.
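If an explicit order check were wanted, it could look roughly like the sketch below. It assumes the usual col1=value1/col2=value2 layout of a partition name and deliberately ignores escaping of special characters; PartitionNameSketch and extractValues are hypothetical names, not the methods in HivePartitionManager.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionNameSketch
{
    private PartitionNameSketch() {}

    // Extracts values from "col1=v1/col2=v2" and verifies that the column names
    // appear in the same order as the table's partition columns.
    static List<String> extractValues(String partitionName, List<String> partitionColumns)
    {
        String[] parts = partitionName.split("/");
        if (parts.length != partitionColumns.size()) {
            throw new IllegalArgumentException("Unexpected number of partition keys: " + partitionName);
        }
        List<String> values = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            int separator = parts[i].indexOf('=');
            if (separator < 0) {
                throw new IllegalArgumentException("Invalid partition name: " + partitionName);
            }
            String column = parts[i].substring(0, separator);
            if (!column.equals(partitionColumns.get(i))) {
                throw new IllegalArgumentException("Partition keys are out of order: " + partitionName);
            }
            values.add(parts[i].substring(separator + 1));
        }
        return values;
    }
}
```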

mbasmanova

Make getPartitionsSample deterministic

LGTM

mbasmanova

Print column statistics in deterministic manner

Could you update commit message to clarify that this change applies to SHOW STATS command?

mbasmanova · 2018-09-13T18:39:26Z

presto-main/src/main/java/com/facebook/presto/sql/rewrite/ShowStatsRewrite.java

Shouldn't we use a List to preserve the order of columns as defined in a table?

We rely on the Map being ordered. Unfortunately, other parts of the code rely on this as well.

@arhimondr Not sure I understand this completely. A more robust approach could be to fetch column list from ConnectorTableMetadata#columns.

mbasmanova · 2018-09-13T18:41:41Z

presto-main/src/main/java/com/facebook/presto/sql/rewrite/ShowStatsRewrite.java

Could you add a test for a case when some columns (e.g. partition columns, maybe) don't have stats?

That's one of the reasons why i am changing this. Before, we always assumed that the TableStatistics object contains statistics for all the columns, and that was true before the refactor. After the refactor this no longer holds, so it is implicitly tested by the existing tests.
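The behaviour discussed in these two threads (rows in table column order, with empty cells for columns that have no statistics) can be sketched as follows. The types are simplified to strings and a single statistic; ShowStatsRowsSketch is a hypothetical name and not the code in ShowStatsRewrite.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ShowStatsRowsSketch
{
    private ShowStatsRowsSketch() {}

    // Produces one row per column, in table order; a missing statistic becomes a NULL cell.
    static List<String> toRows(List<String> tableColumns, Map<String, Double> distinctValuesByColumn)
    {
        List<String> rows = new ArrayList<>();
        for (String column : tableColumns) {
            Double distinctValues = distinctValuesByColumn.get(column);
            String cell = (distinctValues == null) ? "NULL" : String.valueOf(distinctValues);
            rows.add(column + " | " + cell);
        }
        return rows;
    }
}
```

Driving the loop from the column list rather than from the statistics map means the output order no longer depends on the map's iteration order.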

mbasmanova

Remove validation from HiveBasicStatistics

LGTM

mbasmanova

Represent min and max statistics in SPI as double

mbasmanova · 2018-09-13T18:52:24Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/Range.java

MoreObjects.toStringHelper

@mbasmanova in SPI we don't have Guava dependency (and don't want to have it)

@findepi Piotr, thanks for explaining. In this case, let's copy-paste most useful utilities like this one.

@mbasmanova We didn't do this yet, because we didn't want to add publicly accessible classes not intended for public consumption, and package-private didn't cut it, as there are subpackages in SPI... Of course, this is something we can revisit anytime. I am all for it.

This is something we can explore after moving to Java 9 modules, as that will allow us to have module-private cross-package utility code.

mbasmanova · 2018-09-13T18:54:14Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/Range.java

The name Range is a bit too generic. Would you consider changing it to DoubleRange? BTW, commons-lang also has a DoubleRange class.

We also already have com.facebook.presto.spi.predicate.Range

com.facebook.presto.spi.predicate.Range is too generic. I'm going to go with DoubleRange

mbasmanova · 2018-09-13T18:56:17Z

presto-spi/src/test/java/com/facebook/presto/spi/statistics/TestRange.java

nit: perhaps, test the resulting object as well?

private static void assertRange(double min, double max) { Range range = new Range(min, max); assertEquals(range.getMin(), min); assertEquals(range.getMax(), max); }

mbasmanova · 2018-09-13T18:57:27Z

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java

Would you update the commit message to explicitly mention this change?

This should be a separate commit.

Extracted to a separate commit

findepi

i just skimmed selected commits
@arhimondr apologies that i didn't do proper review you asked me for :/

findepi · 2018-09-14T13:49:46Z

presto-hive/src/main/java/com/facebook/presto/hive/HiveClientConfig.java

cmt msg

If corruption is detected and the session property is set to true,
the statistics provider logs the corruption details and returns empty statistics.

this isn't true yet, since no-one reads the new flag. Maybe squash this commit with the usage? Or amend the cmt msg to reflect this?

I'm just going to squash it

findepi · 2018-09-14T13:58:39Z

...e/src/test/java/com/facebook/presto/hive/statistics/TestMetastoreHiveStatisticsProvider.java

cmt msg:

Make getPartitionsSample deterministic

... but the test added here doesn't look like testing the determinism.
(the previous tests look more like it... but probably were insufficient)

I don't know why i added it. Let me remove it.

Sure. Anyway, would it be possible to cover Make getPartitionsSample deterministic with some kind of tests?

I see what is going on. The test that verifies determinism accidentally went in as part of the Extract parsePartition in PartitionManager commit. Let me move it here.
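One way to picture a deterministic sample and a test for it is sketched below. It swaps in the JDK's CRC32 as a fixed, unseeded hash so that the example needs no external library; the commit itself uses murmur3, and the sampling strategy shown here (order by hash, keep the first N) is an assumption of this sketch, not necessarily what getPartitionsSample does.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

final class DeterministicSampleSketch
{
    private DeterministicSampleSketch() {}

    // Fixed, unseeded hash: the same name maps to the same value on every JVM start.
    static long stableHash(String value)
    {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Orders names by stable hash (ties broken by name) and keeps the first sampleSize.
    static List<String> sample(List<String> partitionNames, int sampleSize)
    {
        List<String> sorted = new ArrayList<>(partitionNames);
        sorted.sort((left, right) -> {
            int byHash = Long.compare(stableHash(left), stableHash(right));
            return byHash != 0 ? byHash : left.compareTo(right);
        });
        return new ArrayList<>(sorted.subList(0, Math.min(sampleSize, sorted.size())));
    }
}
```

A determinism test can then assert that the sample is identical regardless of the order in which the partition names are supplied.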

findepi · 2018-09-14T14:01:43Z

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java

This should be a separate commit.

findepi · 2018-09-14T14:02:37Z

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java

"ToDouble" doesn't convey the intended semantics; convertPrestoValueToStatsRepresentation

findepi · 2018-09-14T14:06:16Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/Range.java

nit: express in ctor params order:

if (min > max) { throw new IllegalArgumentException(format("min must be lower than or equal to max. min: %s. max: %s.", min, max)); }

Or even simpler

format("min (%s) cannot be larger than max (%s)", min, max)

@electrum I don't like to mix the values with the static part of an error message, as it is then harder to search for it in the codebase. If you don't mind, I would like to leave it as is.
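Putting the suggestions from this thread together, the constructor check might look like the sketch below. DoubleRangeSketch is a hypothetical stand-in for the class under review, and the NaN check is an assumption added for the example. The message keeps its static part intact so that it can be searched for in the codebase.

```java
// Hypothetical stand-in for the range class under review.
final class DoubleRangeSketch
{
    private final double min;
    private final double max;

    DoubleRangeSketch(double min, double max)
    {
        // Assumption of this sketch: NaN bounds are rejected up front.
        if (Double.isNaN(min) || Double.isNaN(max)) {
            throw new IllegalArgumentException("min and max must not be NaN");
        }
        // Parameters are checked in constructor order; values follow the static text.
        if (min > max) {
            throw new IllegalArgumentException(String.format(
                    "min must be less than or equal to max. min: %s. max: %s.", min, max));
        }
        this.min = min;
        this.max = max;
    }

    double getMin()
    {
        return min;
    }

    double getMax()
    {
        return max;
    }
}
```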

findepi · 2018-09-14T14:08:16Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/Range.java

@mbasmanova in SPI we don't have Guava dependency (and don't want to have it)

findepi · 2018-09-14T14:09:16Z

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java

Let me remove it. Currently all the representations that are allowed for Presto types are equatable

findepi · 2018-09-14T14:10:58Z

presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java

cmt msg Normalize distinct values count:

s/or/of in than a total or non-null rows count

findepi · 2018-09-14T14:11:58Z

presto-hive/src/main/java/com/facebook/presto/hive/metastore/thrift/ThriftMetastoreUtil.java

"produced by the HLL"

or anything else, depending who wrote those stats (Hive, Spark, Impala, ...). Does Hive use HLL?

Does Hive use HLL?

Yes, by default. There is also another, legacy sketch it can use.

Let me remove the HLL

mbasmanova

Refactor MetastoreHiveStatisticsProvider

@arhimondr I'm still looking, but want to share comments I made before the code changed.

mbasmanova · 2018-09-13T20:17:40Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/ColumnStatistics.java

perhaps, add checks to avoid creating empty instances using public constructor

We should allow creating empty() implicitly. For example, if there is no information about nullsFraction, distinctValuesCount, dataSize, or range in the Metastore.

mbasmanova · 2018-09-13T20:18:22Z

presto-spi/src/main/java/com/facebook/presto/spi/statistics/Range.java

Why not use Math.min?

mbasmanova · 2018-09-13T20:30:14Z

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java

perhaps, combine these if statements

mbasmanova · 2018-09-13T20:31:21Z

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java

I find it confusing to mix and match id and name terms wrt partition. I think name is the more widely used term.

mbasmanova

Refactor MetastoreHiveStatisticsProvider

Some more questions.

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java

mbasmanova · 2018-09-14T14:47:20Z

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java

nit: perhaps, add Avg to the name to clarify that this method computes average number of rows

renamed to calculateAverageRowsPerPartition. Also renamed all the rowsPerPartition variables to averageRowsPerPartition

...-hive/src/main/java/com/facebook/presto/hive/statistics/MetastoreHiveStatisticsProvider.java

mbasmanova

Normalize distinct values count

LGTM

mbasmanova · 2018-09-14T14:55:46Z

@arhimondr Andrii, I finished reviewing this PR. Overall, I like these changes a lot. I have a few questions, though.

arhimondr · 2018-09-17T21:19:49Z

Comments addressed. Also i added one more tiny commit: Ignore corrupted statistics when altering partition

mbasmanova

@arhimondr Andrii, this change overall looks great. I made a few comments, but nothing major. I'm thinking that it would be helpful to document the new constraints that Presto enforces on Metastore stats.

arhimondr · 2018-09-19T18:12:45Z

@mbasmanova

Would you update the commit message to clarify that this change affects SHOW STATS command?

Done

arhimondr · 2018-09-19T18:52:38Z

@mbasmanova

Thanks for the review. Comments addressed.

kokosing · 2018-09-19T19:42:40Z

I am sorry that I was not able to do a proper review. From the discussions we had (online and offline ones) I see no blockers for that to be merged.

mbasmanova · 2018-09-19T21:08:32Z

@arhimondr Andrii, thanks for making all of these changes. I'm looking forward to seeing fewer queries fail when generating plans with hive.statistics_enabled = true.

min and max for other types than numeric were simply ignored by the optimizer. Although SHOW STATS used to print min and max statistics for strings. Since the min and max are represented as double, the SHOW STATS command will no longer print these statistics, better representing the statistics that the optimizer actually takes into account.

findepi · 2018-09-20T07:27:00Z

I'm thinking that it would be helpful to document the new constraints that Presto enforces on Metastore stats.

@mbasmanova @arhimondr what are the new constraints?

arhimondr · 2018-09-20T13:36:25Z

@mbasmanova @findepi

I'm thinking that it would be helpful to document the new constraints that Presto enforces on Metastore stats.

The checks introduced are basically very straightforward sanity checks, for example that rowCount is not negative, or that nullsCount is less than or equal to rowCount. I'm not sure it makes sense to document them explicitly, as they seem pretty obvious.
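The two checks named here can be written down as a small sketch. The class name, method name, exception type and messages are all hypothetical; the actual provider may report corruption differently.

```java
final class StatisticsChecksSketch
{
    private StatisticsChecksSketch() {}

    // Rejects statistics that cannot describe a real table or partition.
    static void checkStatistics(long rowCount, long nullsCount)
    {
        if (rowCount < 0) {
            throw new IllegalArgumentException("rowCount must be greater than or equal to zero: " + rowCount);
        }
        if (nullsCount < 0) {
            throw new IllegalArgumentException("nullsCount must be greater than or equal to zero: " + nullsCount);
        }
        if (nullsCount > rowCount) {
            throw new IllegalArgumentException(String.format(
                    "nullsCount must be less than or equal to rowCount. nullsCount: %s. rowCount: %s.",
                    nullsCount, rowCount));
        }
    }
}
```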

findepi · 2018-09-20T13:41:07Z

i am fine with not documenting these. However, if we run into real-life use-cases where the assertions do not hold (because of the way some other program populated the stats), we will need to update the code to support that.

arhimondr · 2018-09-20T14:40:24Z

Agreed. I'm going to add a note to the release notes about this change though.

findepi · 2018-09-21T13:28:31Z

Apparently we have first real-life case of this already -- #11549

facebook-github-bot added the CLA Signed label Sep 12, 2018

arhimondr force-pushed the refactor-statistics-spi branch 3 times, most recently from 2c66de4 to 5180de8 Compare September 12, 2018 23:38

arhimondr requested review from findepi, kokosing, mbasmanova and rschlussel September 12, 2018 23:39

arhimondr changed the title ~~WIP: Refactor HiveStatisticsProvider and statistics SPI~~ Refactor HiveStatisticsProvider and statistics SPI Sep 12, 2018

kokosing reviewed Sep 13, 2018

arhimondr force-pushed the refactor-statistics-spi branch from 5180de8 to 3328e0f Compare September 13, 2018 14:05

mbasmanova self-assigned this Sep 13, 2018

mbasmanova reviewed Sep 13, 2018

mbasmanova reviewed Sep 13, 2018

arhimondr force-pushed the refactor-statistics-spi branch from 3328e0f to 5e8d9e7 Compare September 13, 2018 22:05

findepi reviewed Sep 14, 2018

mbasmanova reviewed Sep 14, 2018

arhimondr force-pushed the refactor-statistics-spi branch from 5e8d9e7 to 2934f53 Compare September 17, 2018 21:17

mbasmanova approved these changes Sep 19, 2018

arhimondr force-pushed the refactor-statistics-spi branch from e1ee23d to c6b78b9 Compare September 19, 2018 18:51

arhimondr added 14 commits September 19, 2018 19:28

Remove RangeColumnStatistics

04920e5

Refactor Estimate

351a61f

- Rename factory methods: unknownValue -> unknown, zeroValue -> zero - Add factory method create(value). This factory method checks if value is finite - Validate estimates in ColumnStatistics

Add TableStatistics#empty() method

31ce369

Instead of TableStatistics#EMPTY_STATISTICS static field

Refactor TableStatistics

aed92f8

Extract parsePartition in PartitionManager

1ff1e28

Fix incorrect partitions sample size

b3827bf

Make getPartitionsSample deterministic

f847336

Use deterministic murmur3 hash. goodFastHash changes its seed on every restart.

Print column statistics in deterministic manner

845a0d8

Make SHOW STATS to print column statistics in the same order as they appear in the table. Also print rows with all nulls for columns with missing statistics.

Remove validation from HiveBasicStatistics

9229cda

This class must be able to store exactly what is stored in metastore. Sanity checks must be applied explicitly in MetastoreHiveStatisticsProvider.

Remove MIN/MAX statistics support for TIMESTAMP in Hive Connector

5b9a58f

MIN and MAX for TIMESTAMP column is not used by the optimizer. Having it in the Hive connector is misleading.

Refactor MetastoreHiveStatisticsProvider

5f5e662

- Add sanity checks to make sure that statistics returned make sense - Make the class to be more unit test friendly - Add extensive unit tests

Normalize distinct values count

1344eb9

Since the number of distinct values is estimated, it may end up higher than a total of non-null rows count. It makes sense to normalize it before writing and reading from the metastore.
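The normalization described in this commit message amounts to capping the estimate at the number of non-null rows. A minimal sketch, with hypothetical names and with clamping of a negative non-null count added as an assumption:

```java
final class DistinctValuesSketch
{
    private DistinctValuesSketch() {}

    // Caps an estimated distinct values count at the number of non-null rows.
    static long normalize(long distinctValuesCount, long rowCount, long nullsCount)
    {
        // Assumption of this sketch: a negative difference is treated as zero.
        long nonNullsCount = Math.max(rowCount - nullsCount, 0);
        return Math.min(distinctValuesCount, nonNullsCount);
    }
}
```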

Ignore corrupted statistics when altering partition

652112e

If the statistics are corrupted, it doesn't make much sense to restore them on rollback.

arhimondr force-pushed the refactor-statistics-spi branch from c6b78b9 to 652112e Compare September 19, 2018 23:28

arhimondr merged commit 652112e into prestodb:master Sep 19, 2018

arhimondr deleted the refactor-statistics-spi branch September 20, 2018 13:36

arhimondr mentioned this pull request Oct 2, 2018

Avoid making wild guesses in FILTER and JOIN estimates #11625

Merged

arhimondr mentioned this pull request Oct 10, 2018

Replace constants with static factory methods #11660

Merged
