Avoid computation of output size if underlying stats are not confident#21280
Avoid computation of output size if underlying stats are not confident#21280jaystarshot wants to merge 1 commit intoprestodb:masterfrom
Conversation
5d568aa to
9c9745c
Compare
...t/java/com/facebook/presto/sql/planner/iterative/rule/TestDetermineJoinDistributionType.java
Outdated
Show resolved
Hide resolved
9c9745c to
c687df9
Compare
05d509d to
ec7eeb5
Compare
Summary: Open source stats framework expects all tables to have stats but we in uber don't for now. Test Plan: Existing unit tests + Shadow test - https://querybuilder.uberinternal.com/r/Bif6wGlLL/run/lj0NW5XEl Reviewers: #ldap_presto-core, hitarth Reviewed By: #ldap_presto-core, hitarth Subscribers: hitarth, O4263 subscribe to presto changes JIRA Issues: PRESTO-5669 Differential Revision: https://code.uberinternal.com/D11559869
ec7eeb5 to
6d30fa4
Compare
| Constraint<ColumnHandle> constraint = new Constraint<>(node.getCurrentConstraint()); | ||
|
|
||
| TableStatistics tableStatistics = metadata.getTableStatistics(session, node.getTable(), ImmutableList.copyOf(node.getAssignments().values()), constraint); | ||
| if (tableStatistics.getRowCount().isUnknown()) { |
There was a problem hiding this comment.
If no variable statistics, we could set the confidence to be false but not sure that would be effective
| "subgraph cluster_1 {\n" + | ||
| "label = \"SOURCE\"\n" + | ||
| "plannode_1[label=\"{TableScan | [TableHandle \\{connectorId='connector_id', connectorHandle='com.facebook.presto.testing.TestingMetadata$TestingTableHandle@1af56f7', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (0B), cpu: ?, memory: ?, network: ?\\}\n" + | ||
| "plannode_1[label=\"{TableScan | [TableHandle \\{connectorId='connector_id', connectorHandle='com.facebook.presto.testing.TestingMetadata$TestingTableHandle@1af56f7', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (?), cpu: ?, memory: ?, network: ?\\}\n" + |
There was a problem hiding this comment.
In this case, the size is 0B because the number of output variables is 0, i.e. no output at all, and outputing 0B sounds better here.
feilong-liu
left a comment
There was a problem hiding this comment.
My understanding is that, this PR addresses the issue where the input table statistics are unknown, and we should not output any valid output size estimate. However, when we have input statistics to be unknown, is it that the case that all downstream plan nodes will have unknown estimates as well in current production?
The high level question I have here, in which cases, we will have valid output size estimate given the input table statistics is unknown?
| if (inputTableStatistics.stream().anyMatch(stat -> stat.getRowCount().isUnknown())) { | ||
| // return most recent run stats if input table stats were not found | ||
| return lastRunsStatistics.get(lastRunsStatistics.size() - 1).getPlanStatistics(); | ||
| } |
There was a problem hiding this comment.
In current logic, for case when input statistics for some table is unknown, it will match history which also has unknown statistics for the same table (and similar statistics for other table), otherwise will not match. This will make HBO to always return the latest run, even there are history with closer match, i.e. unknown for the same table and similar statistics for other tables.
There was a problem hiding this comment.
I see makes sense, i can remove this
| return planNodeStatsEstimate; | ||
| } | ||
| boolean confident = sourceStats.getStats(node.getSources().get(0)).isConfident(); | ||
| for (PlanNode source : node.getSources()) { |
There was a problem hiding this comment.
Confidence level should not only depends on the source inputs, for example, EnforceSingleRowNode node should always be confident that the output is one single row. We need to exclude these rules from this check.
There was a problem hiding this comment.
I see maybe better to add this to an abstract method and override it in those rules
| "subgraph cluster_1 {\n" + | ||
| "label = \"SOURCE\"\n" + | ||
| "plannode_1[label=\"{TableScan | [TableHandle \\{connectorId='connector_id', connectorHandle='com.facebook.presto.testing.TestingMetadata$TestingTableHandle@1af56f7', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (0B), cpu: ?, memory: ?, network: ?\\}\n" + | ||
| "plannode_1[label=\"{TableScan | [TableHandle \\{connectorId='connector_id', connectorHandle='com.facebook.presto.testing.TestingMetadata$TestingTableHandle@1af56f7', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (?), cpu: ?, memory: ?, network: ?\\}\n" + |
There was a problem hiding this comment.
In this case, the size is 0B because the number of output variables is 0, i.e. no output at all, and outputing 0B sounds better here.
Yes according to the current implementation. Unless some downstream plan has historical stats. (StatsProvider will provide these which are always confident -here)
I think only in cases where intermediate plan nodes have historical statistics |
If the output size is from historical statistics, then downstream plan nodes can use these statistics to estimate their output size, and we do not need the change here? |
Indeed, that's true. However, this solution addresses situations in which historical statistics are accessible for one side of the upstream plan, while even table statistics are unavailable for the other side. Currently, we resort to random estimations in such cases. The purpose of this pull request or discussion is to rectify this issue. |
Can you give an example of this? I just do not understand why this will happen and giving an example will be very helpful. |
Sure Lets say we have historical stats for Join1 (or table stats for table1, table2) |
Can you give a working example to reproduce what you described above? |
|
I think the output estimation of the second join will be unknown in my previous example. |
|
Closing this since @feilong-liu has a planned improvement over this PR in #22791 |
Summary: Current stats framework expects all tables to have stats. It might not be true and if there are no stats, we should not calcuate some random output size estimation which is used in cost based rules like DetermineJoinDistributionType and Reorder Joins.
This change propagates the confidence as is downstream but we can change rules later to tailor confidence propagation downstream.
Test Plan:
Existing unit tests +
Shadow test
Description
Motivation and Context
Impact
Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.
If release note is NOT required, use: