Skip to content

Add a comment about ColumnStatistics' "unknown" estimates semantics#12177

Merged
rschlussel merged 1 commit intoprestodb:masterfrom
rzeyde-varada:column-estimate-semantics
Feb 22, 2019
Merged

Add a comment about ColumnStatistics' "unknown" estimates semantics#12177
rschlussel merged 1 commit intoprestodb:masterfrom
rzeyde-varada:column-estimate-semantics

Conversation

@rzeyde-varada
Copy link
Contributor

Recently, we have implemented column statistics estimation for our SPI connector [1].

One (a bit surprising) issue was that we had to return a not-NaN estimate for nullsFraction [2] even if we had an estimate of the row count of the table (initially we assumed that Presto will assume 0 NULLs if it isn’t specified, so we specified only the dataSize of each column statistic).

Since “unknown” values are marked by NaN, the resulting costs became NaN as well [3], so the CBO was ignoring our queries (until we fixed that by returning a not-NaN values for each column statistic).

[1]

TableStatistics getTableStatistics(Session session, TableHandle tableHandle, Constraint<ColumnHandle> constraint);

[2]

[3]

private SymbolStatsEstimate toSymbolStatistics(TableStatistics tableStatistics, ColumnStatistics columnStatistics)

Copy link
Contributor

@electrum electrum Jan 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment implies that cost based optimization can only happen when all estimates are provided. If that's true, why do we allow setting them individually? What happens if, say, dataSize is missing, but the others are present?

Copy link
Contributor Author

@rzeyde-varada rzeyde-varada Jan 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@electrum If I understand correctly, in case dataSize is missing - the cost won't be estimated, since the averageRowSize below will be NaN:

double averageRowSize = nonNullRowsCount == 0 ? 0 : columnStatistics.getDataSize().getValue() / nonNullRowsCount;

Is it possible to verify that the user sets all the estimates, or none of them at Builder.build()?

public ColumnStatistics build()
{
return new ColumnStatistics(nullsFraction, distinctValuesCount, dataSize, range);
}

Please let me know if the comment needs to be changed to clarify the intended use of this API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have dataSize it will be unknown in the symbol stats estimate, but we use a default value for cost estimations. For fixed-width types (e.g. int, boolean) we use the sizes of the data type assuming no compression and for non-fixed-width types (e.g. varchar, varbinary) we use a default value of 50 bytes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(So basically the code comment itself incorrect, but your impression of the consequences may be)

Copy link
Contributor Author

@rzeyde-varada rzeyde-varada Jan 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rschlussel Thanks for the clarification, I have removed the last line from the comment, and pushed 1627b3d7389a77c0dceb664f29306731a6b09e10.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please let me know if the rest of the comment should be changed, or whether it can be removed. The main issue I was trying to solve is to document the requirements for the ColumnStatistics provider (which estimates must be specified and which can be omitted).

@rzeyde-varada rzeyde-varada force-pushed the column-estimate-semantics branch from e6a9dae to 1627b3d Compare January 9, 2019 07:02
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove the word "value" here to make the sentence easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks!
Fixed at 37f6cff and force-pushed.

@rzeyde-varada rzeyde-varada force-pushed the column-estimate-semantics branch from 1627b3d to 37f6cff Compare January 10, 2019 17:41
@rzeyde-varada
Copy link
Contributor Author

Ping :)

@rschlussel
Copy link
Contributor

I'm so sorry I forgot to merge this. Master is frozen now for the release, but I'll merge it as soon as it's open again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants