Add a comment about ColumnStatistics' "unknown" estimates semantics #12177
presto-spi/src/main/java/com/facebook/presto/spi/statistics/ColumnStatistics.java
This comment implies that cost based optimization can only happen when all estimates are provided. If that's true, why do we allow setting them individually? What happens if, say, dataSize is missing, but the others are present?
@electrum If I understand correctly, if dataSize is missing, the cost won't be estimated, since the averageRowSize below will be NaN:
Is it possible to verify that the user sets all the estimates, or none of them at Builder.build()?
Please let me know if the comment needs to be changed to clarify the intended use of this API.
If we don't have dataSize it will be unknown in the symbol stats estimate, but we use a default value for cost estimations. For fixed-width types (e.g. int, boolean) we use the sizes of the data type assuming no compression and for non-fixed-width types (e.g. varchar, varbinary) we use a default value of 50 bytes.
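The fallback described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code Presto uses: the class name, the `sizeForCosting` helper, and its parameters are hypothetical, and returning a known size unchanged is an assumption about the case the comment does not cover.

```java
import java.util.OptionalInt;

class DefaultDataSizeSketch {
    // Default per-value size for non-fixed-width types, as stated in the comment above.
    static final double DEFAULT_VARIABLE_WIDTH_BYTES = 50;

    /**
     * Hypothetical helper: picks the per-value size to use for cost estimation.
     *
     * @param averageValueSize size derived from dataSize, or NaN when dataSize is unknown
     * @param fixedWidthBytes  uncompressed size for fixed-width types (e.g. 4 for int),
     *                         empty for non-fixed-width types (e.g. varchar)
     */
    static double sizeForCosting(double averageValueSize, OptionalInt fixedWidthBytes) {
        if (!Double.isNaN(averageValueSize)) {
            // Assumption: a known size is used as-is.
            return averageValueSize;
        }
        if (fixedWidthBytes.isPresent()) {
            // Unknown size, fixed-width type: fall back to the type's own size.
            return fixedWidthBytes.getAsInt();
        }
        // Unknown size, non-fixed-width type: fall back to the 50-byte default.
        return DEFAULT_VARIABLE_WIDTH_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(sizeForCosting(Double.NaN, OptionalInt.of(4)));   // int-like column
        System.out.println(sizeForCosting(Double.NaN, OptionalInt.empty())); // varchar-like column
    }
}
```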
(So basically the code comment itself isn't incorrect, but your impression of the consequences may be.)
@rschlussel Thanks for the clarification, I have removed the last line from the comment, and pushed 1627b3d7389a77c0dceb664f29306731a6b09e10.
Please let me know if the rest of the comment should be changed, or whether it can be removed. The main issue I was trying to solve is to document the requirements for the ColumnStatistics provider (which estimates must be specified and which can be omitted).
e6a9dae to 1627b3d
nit: remove the word "value" here to make the sentence easier to read.
Good catch, thanks!
Fixed at 37f6cff and force-pushed.
1627b3d to 37f6cff
Ping :)
I'm so sorry I forgot to merge this. Master is frozen now for the release, but I'll merge it as soon as it's open again.
We recently implemented column statistics estimation for our SPI connector [1].
One somewhat surprising issue was that we had to return a non-NaN estimate for nullsFraction [2] even though we had an estimate of the table's row count. Initially we assumed that Presto would assume 0 NULLs if the fraction wasn't specified, so we specified only the dataSize of each column statistic.
Since "unknown" values are marked by NaN, the resulting costs became NaN as well [3], so the CBO was ignoring our queries until we fixed that by returning non-NaN values for each column statistic.
[1] presto/presto-main/src/main/java/com/facebook/presto/metadata/Metadata.java, line 102 in 57c70ae
[2] presto/presto-spi/src/main/java/com/facebook/presto/spi/statistics/ColumnStatistics.java, line 22 in b6addd4
[3] presto/presto-main/src/main/java/com/facebook/presto/cost/TableScanStatsRule.java, line 75 in f112441
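The way a single unknown estimate spreads into everything derived from it follows from ordinary IEEE 754 arithmetic, where any operation involving NaN yields NaN. The sketch below illustrates that effect only; it is not the code in TableScanStatsRule, and the class, method names, and formulas are hypothetical stand-ins for whatever the stats rule actually computes.

```java
class NanPropagationSketch {
    /** Hypothetical estimate: number of non-null values in a column. */
    static double nonNullRows(double rowCount, double nullsFraction) {
        return rowCount * (1 - nullsFraction);
    }

    /** Hypothetical estimate: average size in bytes of a non-null value. */
    static double averageValueSize(double dataSize, double rowCount, double nullsFraction) {
        return dataSize / nonNullRows(rowCount, nullsFraction);
    }

    public static void main(String[] args) {
        double rowCount = 1_000;
        double dataSize = 8_000;

        // nullsFraction left "unknown" (NaN): the derived estimate is NaN too,
        // even though rowCount and dataSize are both known.
        System.out.println(averageValueSize(dataSize, rowCount, Double.NaN)); // prints NaN

        // nullsFraction explicitly set to 0: the derived estimate is usable.
        System.out.println(averageValueSize(dataSize, rowCount, 0.0)); // prints 8.0
    }
}
```

Because NaN never compares equal to anything, including itself, such values have to be detected with `Double.isNaN` rather than `==`.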