Closed
@@ -40,7 +40,10 @@ public static PlanStatistics getPredictedPlanStatistics(
if (lastRunsStatistics.isEmpty()) {
return PlanStatistics.empty();
}

if (inputTableStatistics.stream().anyMatch(stat -> stat.getRowCount().isUnknown())) {
// return most recent run stats if input table stats were not found
return lastRunsStatistics.get(lastRunsStatistics.size() - 1).getPlanStatistics();
}
Comment on lines +43 to +46
Contributor

With the current logic, when the input statistics for some table are unknown, the matcher selects a historical run that also has unknown statistics for the same table (and similar statistics for the other tables); otherwise nothing matches. This change makes HBO always return the latest run, even when the history contains a closer match, i.e. a run with unknown statistics for the same table and similar statistics for the other tables.

Member Author

I see, that makes sense. I can remove this.
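
The matching behavior described in this thread can be sketched as follows. This is a minimal illustration, not the actual `getSimilarStatsIndex` implementation: the class `HistoryMatchSketch`, its methods, the use of `NaN` for an unknown row count, and the newest-first scan order are all assumptions made for the example.

```java
import java.util.List;
import java.util.OptionalInt;

// Hypothetical sketch; none of these names come from the Presto code base.
final class HistoryMatchSketch
{
    private HistoryMatchSketch() {}

    // Two row counts are "similar" if both are unknown (NaN), or if both are
    // known and differ by at most `threshold` as a fraction of the historical value.
    static boolean similar(double historical, double current, double threshold)
    {
        if (Double.isNaN(historical) || Double.isNaN(current)) {
            return Double.isNaN(historical) && Double.isNaN(current);
        }
        if (historical == current) {
            return true;
        }
        return Math.abs(current - historical) <= threshold * Math.abs(historical);
    }

    // Scans the history from newest to oldest and returns the index of the first
    // run whose per-table row counts are all similar to the current inputs.
    static OptionalInt similarRunIndex(List<double[]> history, double[] input, double threshold)
    {
        for (int i = history.size() - 1; i >= 0; i--) {
            double[] run = history.get(i);
            if (run.length != input.length) {
                continue;
            }
            boolean allSimilar = true;
            for (int t = 0; t < input.length && allSimilar; t++) {
                allSimilar = similar(run[t], input[t], threshold);
            }
            if (allSimilar) {
                return OptionalInt.of(i);
            }
        }
        return OptionalInt.empty();
    }
}
```

With a history of `{NaN, 100}`, `{NaN, 5000}`, `{200, 5000}` (oldest first), an input of `{NaN, 105}` and a threshold of 0.1, this sketch selects index 0, whereas an unconditional fallback to the latest run would pick index 2.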

Optional<Integer> similarStatsIndex = getSimilarStatsIndex(historicalPlanStatistics, inputTableStatistics, historyMatchingThreshold);

if (similarStatsIndex.isPresent()) {
@@ -160,6 +160,10 @@ public double getOutputSizeInBytes(PlanNode planNode)
if (!sourceInfo.estimateSizeUsingVariables() && !isNaN(totalSize)) {
return totalSize;
}
if (!isConfident()) {
Member Author

Main change

// If we are not confident (non-HBO stats and no row count information available), we should not compute the output size
return NaN;
}

return getOutputSizeForVariables(planNode.getOutputVariables());
}
@@ -36,8 +36,23 @@ protected SimpleStatsRule(StatsNormalizer normalizer)
@Override
public final Optional<PlanNodeStatsEstimate> calculate(T node, StatsProvider sourceStats, Lookup lookup, Session session, TypeProvider types)
{
- return doCalculate(node, sourceStats, lookup, session, types)
+ Optional<PlanNodeStatsEstimate> planNodeStatsEstimate = doCalculate(node, sourceStats, lookup, session, types)
.map(estimate -> normalizer.normalize(estimate, node.getOutputVariables()));
if (node.getSources().isEmpty()) {
// don't do the confidence check for table scan stats
Member Author

@jaystarshot Oct 31, 2023


Propagated up

return planNodeStatsEstimate;
}
boolean confident = sourceStats.getStats(node.getSources().get(0)).isConfident();
for (PlanNode source : node.getSources()) {
Contributor

The confidence level should not depend only on the source inputs. For example, the rule for EnforceSingleRowNode should always be confident that the output is a single row. We need to exclude such rules from this check.

Member Author

I see. It may be better to add this as an abstract method and override it in those rules.
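
A rough sketch of that idea, using simplified stand-in types rather than the real Presto classes. `ConfidenceHookSketch`, `Estimate`, `Rule`, `FilterLikeRule` and `SingleRowRule` are hypothetical names, and the hook is shown as an overridable method with a default rather than a strictly abstract one, so that only the exceptional rules need to override it.

```java
import java.util.List;

// Hypothetical, simplified stand-ins; these are not the real Presto classes.
final class ConfidenceHookSketch
{
    private ConfidenceHookSketch() {}

    static final class Estimate
    {
        final double rowCount;
        final boolean confident;

        Estimate(double rowCount, boolean confident)
        {
            this.rowCount = rowCount;
            this.confident = confident;
        }
    }

    abstract static class Rule
    {
        // Template method: compute the estimate, then stamp it with the rule's confidence.
        final Estimate calculate(List<Estimate> sources)
        {
            Estimate raw = doCalculate(sources);
            return new Estimate(raw.rowCount, isConfident(sources));
        }

        abstract Estimate doCalculate(List<Estimate> sources);

        // Default: confident only if every source is confident.
        boolean isConfident(List<Estimate> sources)
        {
            return sources.stream().allMatch(source -> source.confident);
        }
    }

    // An ordinary rule that inherits the default, source-driven confidence.
    static final class FilterLikeRule extends Rule
    {
        @Override
        Estimate doCalculate(List<Estimate> sources)
        {
            return new Estimate(sources.get(0).rowCount * 0.5, false);
        }
    }

    // A rule whose output is known by construction, so it overrides the hook.
    static final class SingleRowRule extends Rule
    {
        @Override
        Estimate doCalculate(List<Estimate> sources)
        {
            return new Estimate(1, false);
        }

        @Override
        boolean isConfident(List<Estimate> sources)
        {
            return true;
        }
    }
}
```

The default here performs the same all-sources check as the loop in `calculate`; leaf nodes, which the PR handles separately, are not modeled.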

confident = sourceStats.getStats(source).isConfident();
if (!confident) {
break;
}
}
boolean finalConfident = confident;
return planNodeStatsEstimate.map(p -> new PlanNodeStatsEstimate(p.getOutputRowCount(),
p.getTotalSize(), finalConfident,
p.getVariableStatistics(), p.getJoinNodeStatsEstimate(), p.getTableWriterNodeStatsEstimate()));
}

protected abstract Optional<PlanNodeStatsEstimate> doCalculate(T node, StatsProvider sourceStats, Lookup lookup, Session session, TypeProvider types);
@@ -59,6 +59,10 @@ protected Optional<PlanNodeStatsEstimate> doCalculate(TableScanNode node, StatsP
Constraint<ColumnHandle> constraint = new Constraint<>(node.getCurrentConstraint());

TableStatistics tableStatistics = metadata.getTableStatistics(session, node.getTable(), ImmutableList.copyOf(node.getAssignments().values()), constraint);
if (tableStatistics.getRowCount().isUnknown()) {
Member Author

If there are no variable statistics, we could also set the confidence to false, but I am not sure that would be effective.

// Since we do not have any HMS statistics, we should not be confident
return Optional.of(PlanNodeStatsEstimate.unknown());
}
Map<VariableReferenceExpression, VariableStatsEstimate> outputVariableStats = new HashMap<>();

for (Map.Entry<VariableReferenceExpression, ColumnHandle> entry : node.getAssignments().entrySet()) {
@@ -756,6 +756,7 @@ private static PlanNodeStatsEstimate statsEstimate(Collection<VariableReferenceE
.setAverageRowSize(AVERAGE_ROW_SIZE)
.build());
}
builder.setConfident(true);
return builder.build();
}

@@ -247,6 +247,39 @@ public void testRetainDistributionType()
.doesNotFire();
}

// A replicated join must not be chosen when the stats are not confident
@Test
public void testJoinDeterminationWithNonConfidentStats()
{
int aRows = 100;
int bRows = 10_000;
assertDetermineJoinDistributionType()
.setSystemProperty(JOIN_DISTRIBUTION_TYPE, JoinDistributionType.AUTOMATIC.name())
.overrideStats("valuesA", PlanNodeStatsEstimate.builder()
.setOutputRowCount(aRows)
.addVariableStatistics(ImmutableMap.of(new VariableReferenceExpression(Optional.empty(), "A1", BIGINT), new VariableStatsEstimate(0, 100, 0, 6400, 100)))
.build(), false)
.overrideStats("valuesB", PlanNodeStatsEstimate.builder()
.setOutputRowCount(bRows)
.addVariableStatistics(ImmutableMap.of(new VariableReferenceExpression(Optional.empty(), "B1", BIGINT), new VariableStatsEstimate(0, 100, 0, 640000, 100)))
.build(), false)
.on(p ->
p.join(
INNER,
p.values(new PlanNodeId("valuesA"), aRows, p.variable("A1", BIGINT)),
p.values(new PlanNodeId("valuesB"), bRows, p.variable("B1", BIGINT)),
ImmutableList.of(new JoinNode.EquiJoinClause(p.variable("A1", BIGINT), p.variable("B1", BIGINT))),
ImmutableList.of(p.variable("A1", BIGINT), p.variable("B1", BIGINT)),
Optional.empty()))
.matches(join(
INNER,
ImmutableList.of(equiJoinClause("B1", "A1")),
Optional.empty(),
Optional.of(PARTITIONED),
values(ImmutableMap.of("B1", 0)),
values(ImmutableMap.of("A1", 0))));
}

@Test
public void testFlipAndReplicateWhenOneTableMuchSmaller()
{
@@ -1333,9 +1366,11 @@ public void testGetSourceTablesSizeInBytes()
// two source plan nodes
PlanNodeStatsEstimate sourceStatsEstimate1 = PlanNodeStatsEstimate.builder()
.setOutputRowCount(10)
.setConfident(true)
.build();
PlanNodeStatsEstimate sourceStatsEstimate2 = PlanNodeStatsEstimate.builder()
.setOutputRowCount(20)
.setConfident(true)
.build();
assertEquals(
getSourceTablesSizeInBytes(
@@ -1405,15 +1440,19 @@ public void testGetApproximateSourceSizeInBytes()
// two source plan nodes
PlanNodeStatsEstimate sourceStatsEstimate1 = PlanNodeStatsEstimate.builder()
.setOutputRowCount(1000)
.setConfident(true)
.build();
PlanNodeStatsEstimate sourceStatsEstimate2 = PlanNodeStatsEstimate.builder()
.setOutputRowCount(2000)
.setConfident(true)
.build();
PlanNodeStatsEstimate filterStatsEstimate = PlanNodeStatsEstimate.builder()
.setOutputRowCount(250)
.setConfident(true)
.build();
PlanNodeStatsEstimate limitStatsEstimate = PlanNodeStatsEstimate.builder()
.setOutputRowCount(20)
.setConfident(true)
.build();
double sourceRowCount = sourceStatsEstimate1.getOutputRowCount() + sourceStatsEstimate2.getOutputRowCount();
double unionInputRowCount = filterStatsEstimate.getOutputRowCount() + limitStatsEstimate.getOutputRowCount();
@@ -107,7 +107,16 @@ public RuleAssert withSession(Session session)

public RuleAssert overrideStats(String nodeId, PlanNodeStatsEstimate nodeStats)
{
- statsCalculator.setNodeStats(new PlanNodeId(nodeId), nodeStats);
// For testing all stats are confident
return overrideStats(nodeId, nodeStats, true);
}

public RuleAssert overrideStats(String nodeId, PlanNodeStatsEstimate nodeStats, boolean confidence)
{
PlanNodeStatsEstimate statsWithConfidence = new PlanNodeStatsEstimate(nodeStats.getOutputRowCount(),
nodeStats.getTotalSize(), confidence,
nodeStats.getVariableStatistics(), nodeStats.getJoinNodeStatsEstimate(), nodeStats.getTableWriterNodeStatsEstimate());
statsCalculator.setNodeStats(new PlanNodeId(nodeId), statsWithConfidence);
return this;
}

@@ -67,7 +67,7 @@ public class TestGraphvizPrinter
TupleDomain.all(),
TupleDomain.all());
private static final String TEST_TABLE_SCAN_NODE_INNER_OUTPUT = format(
- "label=\"{TableScan | [TableHandle \\{connectorId='%s', connectorHandle='%s', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (0B), cpu: ?, memory: ?, network: ?\\}\n" +
+ "label=\"{TableScan | [TableHandle \\{connectorId='%s', connectorHandle='%s', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (?), cpu: ?, memory: ?, network: ?\\}\n" +
"}\", style=\"rounded, filled\", shape=record, fillcolor=deepskyblue",
TEST_CONNECTOR_ID,
TEST_CONNECTOR_TABLE_HANDLE);
@@ -133,12 +133,12 @@ public void testPrintDistributedFromFragments()
String expected = "digraph distributed_plan {\n" +
"subgraph cluster_0 {\n" +
"label = \"SOURCE\"\n" +
- "plannode_1[label=\"{TableScan | [TableHandle \\{connectorId='connector_id', connectorHandle='com.facebook.presto.testing.TestingMetadata$TestingTableHandle@1af56f7', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (0B), cpu: ?, memory: ?, network: ?\\}\n" +
+ "plannode_1[label=\"{TableScan | [TableHandle \\{connectorId='connector_id', connectorHandle='com.facebook.presto.testing.TestingMetadata$TestingTableHandle@1af56f7', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (?), cpu: ?, memory: ?, network: ?\\}\n" +
"}\", style=\"rounded, filled\", shape=record, fillcolor=deepskyblue];\n" +
"}\n" +
"subgraph cluster_1 {\n" +
"label = \"SOURCE\"\n" +
- "plannode_1[label=\"{TableScan | [TableHandle \\{connectorId='connector_id', connectorHandle='com.facebook.presto.testing.TestingMetadata$TestingTableHandle@1af56f7', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (0B), cpu: ?, memory: ?, network: ?\\}\n" +
+ "plannode_1[label=\"{TableScan | [TableHandle \\{connectorId='connector_id', connectorHandle='com.facebook.presto.testing.TestingMetadata$TestingTableHandle@1af56f7', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (?), cpu: ?, memory: ?, network: ?\\}\n" +
Member Author

Good change

Contributor

In this case, the size is 0B because the number of output variables is 0, i.e. there is no output at all, and printing 0B sounds better here.
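
One possible way to keep `0B` for nodes without output variables, sketched under assumptions: `OutputSizeSketch`, `outputSizeInBytes` and `format` are hypothetical, and per-variable sizes are passed in directly instead of being derived from variable statistics. The only point is the ordering of the empty-output check before the confidence check.

```java
// Hypothetical sketch; not the PlanNodeStatsEstimate implementation.
final class OutputSizeSketch
{
    private OutputSizeSketch() {}

    static double outputSizeInBytes(double[] perVariableSizes, boolean confident)
    {
        // No output variables means no output bytes, regardless of confidence.
        if (perVariableSizes.length == 0) {
            return 0;
        }
        // Without confident stats, report the size as unknown.
        if (!confident) {
            return Double.NaN;
        }
        double total = 0;
        for (double size : perVariableSizes) {
            total += size;
        }
        return total;
    }

    // Renders an unknown size as "?" and a known size as whole bytes.
    static String format(double bytes)
    {
        return Double.isNaN(bytes) ? "?" : ((long) bytes) + "B";
    }
}
```

Under this ordering, a node with no output variables would still print `0B`, while a non-confident node with output variables would print `?`.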

"}\", style=\"rounded, filled\", shape=record, fillcolor=deepskyblue];\n" +
"}\n" +
"}\n";
@@ -175,11 +175,11 @@ public void testPrintLogicalForJoinNode()
String expected = "digraph logical_plan {\n" +
"subgraph cluster_0 {\n" +
"label = \"SOURCE\"\n" +
- "plannode_1[label=\"{CrossJoin[REPLICATED]|Estimates: \\{rows: ? (0B), cpu: ?, memory: ?, network: ?\\}\n" +
+ "plannode_1[label=\"{CrossJoin[REPLICATED]|Estimates: \\{rows: ? (?), cpu: ?, memory: ?, network: ?\\}\n" +
"}\", style=\"rounded, filled\", shape=record, fillcolor=orange];\n" +
- "plannode_2[label=\"{TableScan | [TableHandle \\{connectorId='connector_id', connectorHandle='com.facebook.presto.testing.TestingMetadata$TestingTableHandle@1af56f7', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (0B), cpu: ?, memory: ?, network: ?\\}\n" +
+ "plannode_2[label=\"{TableScan | [TableHandle \\{connectorId='connector_id', connectorHandle='com.facebook.presto.testing.TestingMetadata$TestingTableHandle@1af56f7', layout='Optional.empty'\\}]|Estimates: \\{rows: ? (?), cpu: ?, memory: ?, network: ?\\}\n" +
"}\", style=\"rounded, filled\", shape=record, fillcolor=deepskyblue];\n" +
- "plannode_3[label=\"{Values|Estimates: \\{rows: ? (0B), cpu: ?, memory: ?, network: ?\\}\n" +
+ "plannode_3[label=\"{Values|Estimates: \\{rows: ? (?), cpu: ?, memory: ?, network: ?\\}\n" +
"}\", style=\"rounded, filled\", shape=record, fillcolor=deepskyblue];\n" +
"}\n" +
"plannode_1 -> plannode_3 [label = \"Build\"];\n" + //valuesNode should be the Build side