Refactor HiveStatisticsProvider and statistics SPI #11463

Merged
arhimondr merged 14 commits into prestodb:master from arhimondr:refactor-statistics-spi
Sep 19, 2018

Conversation

@arhimondr
Member

No description provided.

@arhimondr arhimondr force-pushed the refactor-statistics-spi branch 3 times, most recently from 2c66de4 to 5180de8 Compare September 12, 2018 23:38
@arhimondr arhimondr changed the title WIP: Refactor HiveStatisticsProvider and statistics SPI Refactor HiveStatisticsProvider and statistics SPI Sep 12, 2018
Contributor

@kokosing kokosing left a comment

Just skimmed so far

Contributor

Maybe you could extract to some Preconditions class in spi?

Member Author

We have a lot of checkArgument and verify method implementations across the presto-spi module. I think it is a good idea to have a Preconditions class in that module to avoid reimplementing this boilerplate every time someone needs it. However, to keep it consistent I would rather do this as a separate PR and replace the usages everywhere.
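A minimal sketch of what such a Guava-free helper could look like. The class name `SpiPreconditions` and the choice of exception types are assumptions made for illustration; this is not an existing presto-spi class.

```java
// Hypothetical helper: the name and exception choices are illustrative only.
final class SpiPreconditions
{
    private SpiPreconditions() {}

    // Rejects an invalid caller-supplied argument.
    static void checkArgument(boolean condition, String format, Object... args)
    {
        if (!condition) {
            throw new IllegalArgumentException(String.format(format, args));
        }
    }

    // Rejects a broken internal invariant (assumption: IllegalStateException is used here).
    static void verify(boolean condition, String format, Object... args)
    {
        if (!condition) {
            throw new IllegalStateException(String.format(format, args));
        }
    }
}
```

A caller would then write `SpiPreconditions.checkArgument(rowCount >= 0, "rowCount is negative: %s", rowCount)` instead of repeating the `if`/`throw` pair.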

Contributor

s/create/estimate

then you could static import this

Member Author

Although I agree that createObjectName sounds more fluent than ObjectName.create, I can see some disadvantages in the createObjectName approach. Here are some of them:

  • An additional (static) import introduces additional merge conflicts
  • It is not always clear what the exact name of the factory method is. For example, for a class called DummyFunnyCircle, it is not obvious what the factory method is called. createDummyFunnyCircle? createFunnyCircle? createCircle?
  • A static import may introduce a name clash. For example, in test classes you may want a private factory method called createFunnyCircle(some parameters). In that case you have no choice but to call DummyFunnyCircle.createFunnyCircle() without the static import, which is tedious and tautological.

I agree that these arguments are not very strong. But still, I think it is better to follow some existing conventions:

Considering that we broadly use Optional and Guava's ImmutableList, ImmutableMap, Immutable*... I think it is more consistent to use a similar convention when naming the factory methods, simply because people are more used to it.

I'm going to rename Estimate.create to Estimate.of to follow the Optional convention.

I also think we should adopt this convention all across the code.

For example:

  • A neutral, "empty" value must be expressed as Something.empty(), and not EMPTY_SOMETHING, SOMETHING_EMPTY, emptySomething(), and so on.
  • Static factory methods should follow this convention: https://docs.oracle.com/javase/tutorial/datetime/overview/naming.html
  • Builder constructors should look like Something.builder(), and not builderOfSomething or similar.

Ideally it would be nice to have a static rule that checks it. But i don't think it is easily implementable.
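The naming convention argued for above can be sketched on a small value class. `SketchEstimate` is a hypothetical stand-in, not the SPI's Estimate class, and representing the unknown value as NaN is an assumption made for this example.

```java
// Illustrative only: a value class following the Optional-style factory naming.
final class SketchEstimate
{
    // Assumption: the neutral "unknown" value is stored as NaN.
    private static final SketchEstimate UNKNOWN = new SketchEstimate(Double.NaN);
    private static final SketchEstimate ZERO = new SketchEstimate(0);

    private final double value;

    private SketchEstimate(double value)
    {
        this.value = value;
    }

    // Reads like Optional.of(...): SketchEstimate.of(42). Only finite values are accepted.
    static SketchEstimate of(double value)
    {
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            throw new IllegalArgumentException("value must be finite");
        }
        return new SketchEstimate(value);
    }

    // Reads like Optional.empty(): no EMPTY_* constant is exposed.
    static SketchEstimate unknown()
    {
        return UNKNOWN;
    }

    static SketchEstimate zero()
    {
        return ZERO;
    }

    boolean isUnknown()
    {
        return Double.isNaN(value);
    }

    double getValue()
    {
        return value;
    }
}
```

Call sites stay qualified (`SketchEstimate.of(10)`, `SketchEstimate.unknown()`), so no static import is needed and a local helper with a similar name cannot clash.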

CC: @electrum @martint

Contributor

emptyStatistics? That way you could static import this.

Member Author

Ditto. e.g. Optional.empty()

Contributor

what happened to timestamp?

Member Author

It is not supported by the optimizer. I found it to be misleading, that's why i decided to remove it until we support it in the optimizer.

Contributor

just create?

Member Author

This is a transitional method that is removed in the very next commit. I just moved it to the MetastoreHiveStatisticsProvider, since this is the only class that uses it.

Contributor

toStringLiteral?

Contributor

why does it differ?

Member Author

This value is outside the range of values that a double can represent precisely. Before, when the value was passed as an Object, it was truncated when converting to SymbolStatsEstimate. Now the truncation is visible to the end user.
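The thread does not say which value changed. Assuming it is a large integer statistic, the effect can be reproduced with plain Java: every long with magnitude up to 2^53 converts to double exactly, while larger values may be rounded. The class and method names below are made up for the illustration.

```java
final class DoublePrecisionDemo
{
    // 2^53: every long with magnitude up to this value converts to double exactly.
    static final long MAX_EXACT = 1L << 53;

    // Returns true when converting to double and back yields the same long.
    static boolean survivesRoundTrip(long value)
    {
        return (long) (double) value == value;
    }

    public static void main(String[] args)
    {
        System.out.println(survivesRoundTrip(MAX_EXACT));     // prints "true"
        System.out.println(survivesRoundTrip(MAX_EXACT + 1)); // prints "false"
    }
}
```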

Contributor

explain in commit message what you change and motivation behind this

@arhimondr arhimondr force-pushed the refactor-statistics-spi branch from 5180de8 to 3328e0f Compare September 13, 2018 14:05
@mbasmanova mbasmanova self-assigned this Sep 13, 2018
Contributor

@mbasmanova mbasmanova left a comment

@arhimondr

Remove RangeColumnStatistics

Contributor

consider using MoreObjects#toStringHelper

Member Author

Unfortunately Guava is not available in the presto-spi module
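Without Guava, a `toString` in the style of `MoreObjects#toStringHelper` has to be written by hand. A minimal sketch, using a hypothetical `SketchRange` class rather than any class from this PR:

```java
final class SketchRange
{
    private final double min;
    private final double max;

    SketchRange(double min, double max)
    {
        this.min = min;
        this.max = max;
    }

    @Override
    public String toString()
    {
        // Hand-rolled output in the ClassName{field=value, ...} style.
        return "SketchRange{min=" + min + ", max=" + max + "}";
    }
}
```

For `new SketchRange(1, 2)` this yields `SketchRange{min=1.0, max=2.0}`.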

Contributor

Perhaps, remove range from the message. Here and in 2 more places further down.

Contributor

Perhaps, add a check for nonNullRowsCount being 0?

Contributor

@mbasmanova mbasmanova left a comment

@arhimondr

Refactor Estimate

Contributor

@mbasmanova mbasmanova left a comment

Add TableStatistics#empty() method

Contributor

Do we need isEmpty() method?

Member Author

No, i don't think we ever check it for empty explicitly.

Contributor

@mbasmanova mbasmanova left a comment

Refactor TableStatistics

Contributor

Perhaps, use MoreObjects.toStringHelper

Member Author

It is not available in presto-spi

Contributor

@mbasmanova mbasmanova left a comment

Add ignore_corrupted_statistics session property

LGTM

Contributor

@mbasmanova mbasmanova left a comment

Extract parsePartition in PartitionManager

Contributor

s/partitionId/partitionName for consistency with partitionName parameter in extractPartitionKeyValues method

Contributor

s/keys/values - the name extractPartitionKeyValues is confusing; it suggests that the return value is a key-value map of partition column names and values; in fact, the return value is a list of values of partition columns

Contributor

This code assumes that partition_name lists partition columns in proper order, however, extractPartitionKeyValues doesn't check for that; should we have an explicit check?

Member Author

The partitionName is confusing. All the column names are listed, so technically it is possible to shuffle them without losing any information. However, that is not valid: the partition values in the partition name should always be in the order of the partition columns.
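The explicit check asked about in this thread could look roughly like the sketch below. It is a simplified illustration, not the PartitionManager code: the class and method names are hypothetical, the `key=value/key=value` layout is assumed, and no unescaping of special characters is attempted.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionNameSketch
{
    private PartitionNameSketch() {}

    // Parses "ds=2018-09-12/country=US" into ["2018-09-12", "US"] and verifies that the
    // keys appear in the same order as the table's partition columns.
    static List<String> extractValues(String partitionName, List<String> partitionColumns)
    {
        String[] parts = partitionName.split("/");
        if (parts.length != partitionColumns.size()) {
            throw new IllegalArgumentException("Unexpected number of partition keys: " + partitionName);
        }
        List<String> values = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            int separator = parts[i].indexOf('=');
            if (separator < 0) {
                throw new IllegalArgumentException("Missing '=' in partition name: " + partitionName);
            }
            String key = parts[i].substring(0, separator);
            if (!key.equals(partitionColumns.get(i))) {
                throw new IllegalArgumentException("Partition keys are out of order: " + partitionName);
            }
            values.add(parts[i].substring(separator + 1));
        }
        return values;
    }
}
```

With partition columns `[ds, country]`, the name `country=US/ds=2018-09-12` is rejected instead of silently producing values in the wrong positions.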

Contributor

@mbasmanova mbasmanova left a comment

Make getPartitionsSample deterministic

LGTM

Contributor

@mbasmanova mbasmanova left a comment

Print column statistics in deterministic manner

Could you update commit message to clarify that this change applies to SHOW STATS command?

Contributor

Shouldn't we use a List to preserve the order of columns as defined in a table?

Member Author

We rely on the Map being ordered. Unfortunately, other parts of the code rely on this as well.

Contributor

@arhimondr Not sure I understand this completely. A more robust approach could be to fetch column list from ConnectorTableMetadata#columns.
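The more robust approach suggested here, iterating an explicit column list and looking statistics up by name, can be sketched as follows. The class, the method, and the single distinct-values figure per column are simplifications invented for this example; they are not the SHOW STATS implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ShowStatsOrderSketch
{
    private ShowStatsOrderSketch() {}

    // Emits one row per table column, in table order. Columns that have no entry
    // in the statistics map still get a row, with a null value.
    static List<String> rows(List<String> tableColumns, Map<String, Double> distinctValuesByColumn)
    {
        List<String> rows = new ArrayList<>();
        for (String column : tableColumns) {
            rows.add(column + ": " + distinctValuesByColumn.get(column));
        }
        return rows;
    }
}
```

Because the output order comes from the list, it no longer matters whether the statistics map is a `HashMap` or an ordered map.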

Contributor

Could you add a test for a case when some columns (e.g. partition columns, maybe) don't have stats?

Member Author

That's one of the reasons why I am changing this. Before, we always assumed that the TableStatistics object contains the statistics for all the columns, and that was true before the refactor. After the refactor it no longer is. So it is implicitly tested by the existing tests.

Contributor

@mbasmanova mbasmanova left a comment

Remove validation from HiveBasicStatistics

LGTM

Contributor

@mbasmanova mbasmanova left a comment

Represent min and max statistics in SPI as double

Contributor

MoreObjects.toStringHelper

Contributor

@mbasmanova in SPI we don't have Guava dependency (and don't want to have it)

Contributor

@findepi Piotr, thanks for explaining. In this case, let's copy-paste most useful utilities like this one.

Contributor

@mbasmanova We didn't do this yet, because we didn't want to add publicly accessible classes not intended for public consumption, and package-private wouldn't work, as there are subpackages in SPI... Of course, this is something we can revisit anytime. I am all for it.

Contributor

This is something we can explore after moving to Java 9 modules, as that will allow us to have module-private cross-package utility code.

Contributor

The name Range is a bit too generic. Would you consider changing it to DoubleRange? BTW, commons-lang also has a DoubleRange class.

Contributor

We also already have com.facebook.presto.spi.predicate.Range

Member Author

com.facebook.presto.spi.predicate.Range is too generic. I'm going to go with DoubleRange.

Contributor

nit: perhaps, test the resulting object as well?

private static void assertRange(double min, double max)
{
    Range range = new Range(min, max);
    assertEquals(range.getMin(), min);
    assertEquals(range.getMax(), max);
}

Contributor

Would you update the commit message to explicitly mention this change?

Contributor

This should be a separate commit.

Member Author

Extracted to a separate commit

@arhimondr arhimondr force-pushed the refactor-statistics-spi branch from 3328e0f to 5e8d9e7 Compare September 13, 2018 22:05
Contributor

@findepi findepi left a comment

i just skimmed selected commits
@arhimondr apologies that i didn't do proper review you asked me for :/

Contributor

cmt msg

If corruption is detected and the session property is set to true,
the statistics provider logs the corruption details and returns empty statistics.

this isn't true yet, since no-one reads the new flag. Maybe squash this commit with the usage? Or amend the cmt msg to reflect this?
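The behaviour described in the quoted commit message can be sketched as below. This is an illustration under stated assumptions, not the provider's code: the class and method are hypothetical, a single row count stands in for the full statistics, a negative count stands in for "corruption", and logging is reduced to `System.err`.

```java
import java.util.Optional;

final class CorruptedStatisticsSketch
{
    private CorruptedStatisticsSketch() {}

    // Returns the row count when it is sane. When it is not, either fails or,
    // if the ignore flag is set, logs the problem and returns an empty result.
    static Optional<Long> rowCount(long rawRowCount, boolean ignoreCorruptedStatistics)
    {
        if (rawRowCount >= 0) {
            return Optional.of(rawRowCount);
        }
        String message = "rowCount must be greater than or equal to zero: " + rawRowCount;
        if (ignoreCorruptedStatistics) {
            System.err.println("Ignoring corrupted statistics: " + message);
            return Optional.empty();
        }
        throw new IllegalStateException(message);
    }
}
```

The point raised in the review is that a flag like `ignoreCorruptedStatistics` only has this effect once some code path actually reads it.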

Member Author

I'm just going to squash it

Contributor

cmt msg:

Make getPartitionsSample deterministic

... but the test added here doesn't look like testing the determinism.
(the previous tests look more like it... but probably were insufficient)

Member Author

I don't know why I added it. Let me remove it.

Contributor

Sure. Anyway, would it be possible to cover Make getPartitionsSample deterministic with some kind of tests?

Member Author

I see what is going on. The test that verifies determinism accidentally went in as part of the Extract parsePartition in PartitionManager commit. Let me move it here.
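The commit message for this change says the sample uses a deterministic murmur3 hash because goodFastHash changes its seed on every restart. The sketch below shows the same idea with only the standard library, swapping in CRC32 as the stable hash; the class and method names are hypothetical and this is not the getPartitionsSample implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

final class DeterministicSampleSketch
{
    private DeterministicSampleSketch() {}

    // CRC32 has no per-process seed, so the same name always hashes to the same value.
    static long stableHash(String value)
    {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Picks the sampleSize names with the smallest hashes, independent of input order.
    static List<String> sample(List<String> partitionNames, int sampleSize)
    {
        List<String> sorted = new ArrayList<>(partitionNames);
        sorted.sort((left, right) -> {
            int byHash = Long.compare(stableHash(left), stableHash(right));
            return byHash != 0 ? byHash : left.compareTo(right);
        });
        return new ArrayList<>(sorted.subList(0, Math.min(sampleSize, sorted.size())));
    }
}
```

A determinism test can then feed the same partition names in two different orders and assert that both samples are equal.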

Contributor

This should be a separate commit.

Contributor

"ToDouble" doesn't convey the intended semantics; convertPrestoValueToStatsRepresentation

Contributor

nit: express in ctor params order:

if (min > max) {
    throw new IllegalArgumentException(format("min must be lower than or equal to max. min: %s. max: %s.", min, max));
}

Contributor

Or even simpler

format("min (%s) cannot be larger than max (%s)", min, max)

Member Author

@electrum I don't like to mix the values with the static part of an error message, as it is then harder to search for it in the codebase. If you don't mind I would like to leave it as is.

Contributor

@mbasmanova in SPI we don't have Guava dependency (and don't want to have it)

Contributor

?

Member Author

Let me remove it. Currently all the representations that are allowed for Presto types are equatable

Contributor

cmt msg Normalize distinct values count:

s/or/of in than a total or non-null rows count

Contributor

"produced by the HLL"

or anything else, depending who wrote those stats (Hive, Spark, Impala, ...). Does Hive use HLL?

Member Author

Does Hive use HLL?

Yes, by default. There is also another, legacy sketch it can use.

Let me remove the HLL

Contributor

@mbasmanova mbasmanova left a comment

Refactor MetastoreHiveStatisticsProvider

@arhimondr I'm still looking, but want to share comments I made before the code changed.

Contributor

perhaps, add checks to avoid creating empty instances using public constructor

Member Author

We should allow creating empty() implicitly. For example, when the Metastore has no information about nullsFraction, distinctValuesCount, dataSize, or range.

Contributor

Why not use Math.min?

Contributor

perhaps, combine these if statements

Contributor

I find it confusing to mix and match id and name terms wrt partition. I think name is more widely used term.

Contributor

@mbasmanova mbasmanova left a comment

Refactor MetastoreHiveStatisticsProvider

Some more questions.

Contributor

nit: perhaps, add Avg to the name to clarify that this method computes average number of rows

Member Author

renamed to calculateAverageRowsPerPartition. Also renamed all the rowsPerPartition variables to averageRowsPerPartition

Contributor

@mbasmanova mbasmanova left a comment

Normalize distinct values count

LGTM
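The commit message for this change explains that an estimated distinct count can end up higher than the number of non-null rows. A minimal sketch of the normalization, with a hypothetical class and method name and plain longs standing in for the statistics objects:

```java
final class DistinctValuesSketch
{
    private DistinctValuesSketch() {}

    // Caps the estimated distinct values count at the number of non-null rows.
    static long normalize(long distinctValuesCount, long rowCount, long nullsCount)
    {
        long nonNullRowsCount = Math.max(rowCount - nullsCount, 0);
        return Math.min(distinctValuesCount, nonNullRowsCount);
    }
}
```

For 100 rows with 10 nulls, an estimate of 120 distinct values is reduced to 90, while an estimate of 5 is left unchanged.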

@mbasmanova
Contributor

@arhimondr Andrii, I finished reviewing this PR. Overall, I like these changes a lot. I have a few questions, though.

@arhimondr arhimondr force-pushed the refactor-statistics-spi branch from 5e8d9e7 to 2934f53 Compare September 17, 2018 21:17
@arhimondr
Member Author

Comments addressed. Also i added one more tiny commit: Ignore corrupted statistics when altering partition

Contributor

@mbasmanova mbasmanova left a comment

@arhimondr Andrii, this change overall looks great. I made a few comments, but nothing major. I'm thinking that it would be helpful to document the new constraints that Presto enforces on Metastore stats.

@arhimondr
Member Author

@mbasmanova

Would you update the commit message to clarify that this change affects SHOW STATS command?

Done

@arhimondr arhimondr force-pushed the refactor-statistics-spi branch from e1ee23d to c6b78b9 Compare September 19, 2018 18:51
@arhimondr
Member Author

@mbasmanova

Thanks for the review. Comments addressed.

@kokosing
Contributor

I am sorry that I was not able to do a proper review. From the discussions we had (online and offline ones) I see no blockers for that to be merged.

@mbasmanova
Contributor

@arhimondr Andrii, thanks for making all of these changes. I'm looking forward to seeing fewer queries fail when generating plans with hive.statistics_enabled = true.

- Rename factory methods: unknownValue -> unknown, zeroValue -> zero
- Add factory method create(value). This factory method checks if value is finite
- Validate estimates in ColumnStatistics

Instead of TableStatistics#EMPTY_STATISTICS static field

Use deterministic murmur3 hash. goodFastHash changes its seed on every restart.

Make SHOW STATS to print column statistics in the same order as they appear in the table.
Also print rows with all nulls for columns with missing statistics.

This class must be able to store exactly what is stored in metastore. Sanity checks
must be applied explicitly in MetastoreHiveStatisticsProvider.

MIN and MAX for TIMESTAMP column is not used by the optimizer.
Having it in the Hive connector is misleading.

min and max for other types than numeric were simply ignored by the optimizer.
Although SHOW STATS used to print min and max statistics for strings. Since the
min and max are represented as double, the SHOW STATS command will no longer print
these statistics, better representing the statistics that the optimizer actually takes
into account.

- Add sanity checks to make sure that statistics returned make sense
- Make the class to be more unit test friendly
- Add extensive unit tests

Since the number of distinct values is estimated, it may end up higher than a total
of non-null rows count. It makes sense to normalize it before writing and reading from
the metastore.

If the statistics are corrupted, it doesn't make much sense to restore
them on rollback.
@arhimondr arhimondr force-pushed the refactor-statistics-spi branch from c6b78b9 to 652112e Compare September 19, 2018 23:28
@arhimondr arhimondr merged commit 652112e into prestodb:master Sep 19, 2018
@findepi
Contributor

findepi commented Sep 20, 2018

I'm thinking that it would be helpful to document the new constraints that Presto enforces on Metastore stats.

@mbasmanova @arhimondr what are the new constraints?

@arhimondr
Member Author

@mbasmanova @findepi

I'm thinking that it would be helpful to document the new constraints that Presto enforces on Metastore stats.

The checks introduced are basically very straightforward sanity checks. For example, that rowCount is not negative, or that nullsCount is less than or equal to rowCount. I'm not sure it makes sense to document them explicitly, as they seem pretty obvious.
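The two checks named in this comment could be written as in the sketch below. The class name, method name, and message wording are assumptions; the actual checks live in the Hive statistics provider and may differ.

```java
final class StatisticsSanityChecksSketch
{
    private StatisticsSanityChecksSketch() {}

    // Throws when the basic statistics contradict each other.
    static void validate(long rowCount, long nullsCount)
    {
        if (rowCount < 0) {
            throw new IllegalArgumentException("rowCount must be greater than or equal to zero: " + rowCount);
        }
        if (nullsCount < 0) {
            throw new IllegalArgumentException("nullsCount must be greater than or equal to zero: " + nullsCount);
        }
        if (nullsCount > rowCount) {
            throw new IllegalArgumentException(
                    "nullsCount must be less than or equal to rowCount. nullsCount: " + nullsCount + ". rowCount: " + rowCount);
        }
    }
}
```

Statistics written by another engine that violate one of these assumptions would be rejected here, which is the real-life risk discussed in the follow-up comments.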

@arhimondr arhimondr deleted the refactor-statistics-spi branch September 20, 2018 13:36
@findepi
Contributor

findepi commented Sep 20, 2018

i am fine with not documenting these. However, if we run into real-life use-cases where the assertions do not hold (because of the way some other program populated the stats), we will need to update the code to support that.

@arhimondr
Member Author

Agreed. I'm going to add a note to the release notes about this change though.

@findepi
Contributor

findepi commented Sep 21, 2018

Apparently we have first real-life case of this already -- #11549

7 participants