Test cost-based decisions on TPCH, TPCDS queries and CBO improvements#11267
Test cost-based decisions on TPCH, TPCDS queries and CBO improvements#11267findepi wants to merge 8 commits intoprestodb:masterfrom
Conversation
fd9806a to
e6f254e
Compare
e6f254e to
7e47c8a
Compare
There was a problem hiding this comment.
This was renamed to PARTITIONED
There was a problem hiding this comment.
Maybe extract JoinEnumerationResult from ReorderJoins instead.
There was a problem hiding this comment.
+1. Or better use this class in ReorderJoins
There was a problem hiding this comment.
The classes are simple and yet different -- join reordering result needs Optional<PlanNode>. IMO reusing single class wouldn't improve readability.
For now, i just moved this class as an private member of DetermineJoinDistributionType.
There was a problem hiding this comment.
This is hard to follow and I don't like that it relies on knowing what makes something shouldRepartition and mustBroadcast (i.e. when flipping will be a problem)
Instead try:
if(!mustBroadcast(joinNode)) {
possibleJoinNodes.add(getJoinNodeWithCost(costProvider, joinNode.withDistributionType(PARTITIONED))
}
if(!mustRepartition(joinNode)) {
possibleJoinNodes.add(getJoinNodeWithCost(costProvider, joinNode.withDistributionType(REPLICATED))
}
JoinNode flipped = joinNode.flipChildren();
if(!mustBroadcast(flipped)) {
possibleJoinNodes.add(getJoinNodeWithCost(costProvider, flipped.withDistributionType(PARTITIONED))
}
if(!mustRepartition(flipped)) {
possibleJoinNodes.add(getJoinNodeWithCost(costProvider, flipped.withDistributionType(REPLICATED))
}
And then have
private static boolean mustRepartition(JoinNode node)
{
JoinNode.Type type = joinNode.getType();
return type == RIGHT || type == FULL
}
There was a problem hiding this comment.
if(mustRepartition(node) || (!mustBroadcast(node) && joinDistributionType.canRepartition())) {
return node.withDistributionType(PARTITIONED)
}
return node.withDistributionType(REPLICATED);
There was a problem hiding this comment.
OR
private PlanNode getSyntacticOrderJoin(JoinNode node, Context context, JoinDistributionType joinDistributionType)
{
if (mustRepartition(node)) {
return node.withDistributionType(PARTITIONED);
}
if (mustBroadcast(node)) {
return node.withDistributionType(REPLICATED);
}
if (isAtMostScalar(node.getRight(), context.getLookup())) {
return node.withDistributionType(REPLICATED);
}
if (joinDistributionType.canPartition()) {
return node.withDistributionType(PARTITIONED);
}
return node.withDistributionType(REPLICATED);
}
private static boolean mustRepartition(JoinNode node)
{
// With REPLICATED, the unmatched rows from right-side would be duplicated.
return node.getType() == RIGHT || node.getType() == FULL;
}
private static boolean mustBroadcast(JoinNode node)
{
return node.getCriteria().isEmpty();
}
There was a problem hiding this comment.
I think it's pretty confusing what shouldRepartition means. for some things it's "must" and for some things it's "can". As suggested above, I'd replace this with a mustRepartition method and put the other logic in the calling code in getSyntacticOrderJoin (we don't even care about those things for getCostBasedJoin).
There was a problem hiding this comment.
i rewrite all the logic in DetermineJoinDistributionType according to your suggestions.
There was a problem hiding this comment.
why only joins? seems like it would be equally useful for other kinds of nodes. and in the future we might generate multiple plans that differ in things other than join nodes.
There was a problem hiding this comment.
Determining whether plans describe the same relation is not an easy problem and for caching purposes, this needs to be computed efficiently. Note also, that we not always need to find out equivalent plans. Memo equivalence groups contain equivalent plans (today, only 1 plan, i know), without need for any computations.
There was a problem hiding this comment.
remove? this looks left over from manual testing.
There was a problem hiding this comment.
literalEncoder.toExpression() already returns the Expression if the value is an instanceof Expression, and a NullLiteral if it equals null.
I would get rid of the if statements and always return literalEncoder.toExpression(value,BOOLEAN). And add a visitNullLiteral to FilterExpressionStatsCalculatingVisitor.
There was a problem hiding this comment.
literalEncoder.toExpression(value,BOOLEAN) returns CAST(NULL AS boolean) which cannot be handled as nicely as defining visitNullLiteral().
There was a problem hiding this comment.
I'm obviously not going through all the plan output files for tpc-h and tpc-ds so I'm just going to assume someone verified that the plans are good.
I have some concerns about this testing style though. It seems very brittle and frustrating to update for intentional changes. I like the idea of having a quick way to know whether you possibly broke tpch or tpcds, but I think it will become more trouble than it's worth.
I'm also not sure why this doesn't use the existing plan matching framework. Is it because this is simpler and someone autogenerated the result files?
There was a problem hiding this comment.
I'm also a little bit afraid of s burden introduced by this PR. If something changes in the plan structure we would need to analyze all the failures, and fix over 9000 text files.
Or, as it will happen more likely - we will be just blindly regenerating the test results (we end up doing this for plan matching tests in product tests at Teradata). That kinda kills the purpose of a test.
I would really try to hard about whats the purpose of this test. And if the value worth the maintenance costs.
There was a problem hiding this comment.
There was a problem hiding this comment.
Those tests are valuable as they allow us to catch real world regressions with supposedly innocent CBO/stats changes. Consider them as product tests that test CBO end-to-end. In general, any CBO changes should be benchmarked. We've seen changes that should improve stats calculations, but caused performance regressions.
Without framework here you would need to benchmark some set of queries to see if performance has changed. With this PR, you can focus on just a subset of queries and you can also validate if changed plans are rational. This greatly simplifies process of evaluating CBO modifications but also provides an automated way of catching potential regressions (we've been kicked with such case once).
One drawback though is that in order to validate performance changes one need to recreate real environment used for generating this plans. Currently it assumes 9 machines and we run actually benchmarks against them. So the question is how we provide standardized performance testbed so that we can validate CBO decisions against real query times.
There was a problem hiding this comment.
I don't have a strong opinion. Perhaps you can keep them for now. Let's see how often will we need to change them in practice.
There was a problem hiding this comment.
I would rather not have them here until it's easily actionable for people if they change. If I don't have a way to evaluate whether a plan change caused a regression, I'm just going to blindly change the results.
There was a problem hiding this comment.
That's ok as long as you know what you're doing (you will know) and also, you know that you are doing this (you will know because this of this test).
There was a problem hiding this comment.
but i wouldn't actually check for a regression because I wouldn't have the cluster set up to to do that, so it's just adding overhead. It's one (or 100) more things to change without really helping with identifying problems.
There was a problem hiding this comment.
+1. Or better use this class in ReorderJoins
There was a problem hiding this comment.
Why not just node.getCriteria().isEmpty()? If there is no equi criteria - the type of join doesn't matter much.
There was a problem hiding this comment.
You cannot replicate RIGHT join, otherwise unmatched RIGHT rows would be duplicated.
There was a problem hiding this comment.
But if there is no equi criteria, how can you partition the RIGHT join? Should we fail saying that right non equi join is not supported?
There was a problem hiding this comment.
it will get partitioned on "no symbols", i.e. a constant, single node.
There was a problem hiding this comment.
This is actually looks more like a should
There was a problem hiding this comment.
Because it still can be executed as a repartitioned join (very skewed one though).
There was a problem hiding this comment.
Files.asCharSource(file, UTF8).read()
There was a problem hiding this comment.
It's used by IntelliJ to annotate test executions with @DataProvider
There was a problem hiding this comment.
The name of this method is confusion. How about generateQueryPlan, and change the parameter name queryFilePath
There was a problem hiding this comment.
Do we need to have a similar one for the SemiJoin?
There was a problem hiding this comment.
Hm... we should. But why now?
There was a problem hiding this comment.
Yeah, we can introduce it as a separate PR. Just in case you can have it stashed somewhere, we can include it here.
|
Please create a separate PR for Also i would recommend dropping the Also i would think twice about the |
|
@sopel39 could you please add more details to the |
@arhimondr @rschlussel ok, but why so? |
|
@arhimondr I think the description for |
7e47c8a to
50305bb
Compare
|
AC |
It has to be disabled by default at least. We do not enable other cost based rules by default not because we are not sure in rules correctness itself, but rather because our estimates might not be accurate enough. Also the commit itself is huge. It might be easier to have it as a separate PR. |
|
i will drop Should I squash the fixups now? |
Yup |
605cfe1 to
0f6ffb1
Compare
|
squashed & rebased |
0f6ffb1 to
fdea61c
Compare
|
i dropped |
f93ce09 to
f5966cf
Compare
There was a problem hiding this comment.
why get rid of the override?
There was a problem hiding this comment.
i don't understand?
There was a problem hiding this comment.
wow i totally misread that. you added it.
rschlussel
left a comment
There was a problem hiding this comment.
I still don't think we should have the plan tests for tpc-h and tpc-ds queries, but everything else looks good.
This is great feature. Image you are changing some heuristic. To test your change you would need to run whole benchmark suite. Having tests you could easily find queries which are affected, so you can only tests them. |
|
@kokosing I agree it's very useful if you're then going to run benchmarks. However, right now nobody other than you guys has the set up to run those benchmarks. That means people will just blindly change test files without getting any extra signal. That's why I feel that we shouldn't have this feature until there's a test set up the community can use. |
|
And it will impact planner changes unrelated to CBO heuristics, which could become burdensome. However, I'm not going to hold up the PR for this. We can always re-evalute if we find it does become burdensome. |
But at least we will be aware that plans got changed. We will know what change affect plans and what is not. It is very easy to regenerate these plans and do a review plan changes. Notice that plan format is much simpler and it is should not change so frequently like output of the EXPLAIN. |
Formatted dates were added in fc5419d, but the reading code was dropped as "unused" during review process.
This is often necessary in tests to have VALUES filled with some data rows, otherwise `QueryCardinalityUtil` will deem VALUES as scalar, affecting testing conditions.
When join reordering is disabled, `ReorderJoins` doesn't select join distribution. This commit makes `DetermineJoinDistributionType` rule to make this decision on cost basis, if only table statistics are available.
During TPCH/TPCDS benchmarking this optimization proved to performed worse that without it. After investigation we found that cross-column correlation is not accounted for when applying filtering for lo-max ranges (e.g: when filtering dimension table on date, we don't adjust row id column range which is tightly correlated to date column). This caused great underestimation of filtered rows.
Since an optimizer rule may create new expressions (new filters, new joins) and call cost/stats estimation on that, `FilterStatsCalculator` should not assume that all expressions have been simplified earlier.
TPC-H connector has 2 tables with partitioning (`lineitem`, `orders`). By introducing a switch disabling this behavior, we can simulate tables that are not partitioned.
f5966cf to
6653f68
Compare
|
Let's postpone the discussion about merit of these tests until we have some real experience. Our (internal) experience was positive, but I acknowledge the mileage may vary. I am going to merge this once tests pass. |
arhimondr
left a comment
There was a problem hiding this comment.
Reviewed Add test for CBO decisions on TPC-H, TPC-DS queries. Everything else looks good.
|
|
||
| public static <T> List<T> nElements(int n, IntFunction<T> function) | ||
| { | ||
| checkArgument(n >= 0); |
| .collect(toImmutableList()); | ||
| } | ||
|
|
||
| public static <T> List<T> nElements(int n, IntFunction<T> function) |
There was a problem hiding this comment.
This one of these function that are being only "sometimes" reused even by the author itself. Someone who doesn't know that this very specific function particularly exist - never would try to lookup one, but instead will try to implement either a copy, or just inline the algorithm. I'm not even saying it is available only in a single module. I don't think it is worth adding such a functions.
| System.out.println(query.id); | ||
| } | ||
| catch (IOException e) { | ||
| throw new RuntimeException(e); |
| return Files.asCharSource(file.toFile(), UTF_8).read(); | ||
| } | ||
| catch (IOException e) { | ||
| throw new RuntimeException(e); |
|
|
||
| public static final class Query | ||
| { | ||
| private final String id; |
There was a problem hiding this comment.
nit: getters-settere even for the inner class are good
| @Override | ||
| protected Path getExpectedJoinOrderingFile(String queryId) | ||
| { | ||
| return Paths.get(format("./presto-main/src/test/resources/tpch-join-ordering/%s.txt", queryId)); |
There was a problem hiding this comment.
This is one is more tricker. It is used for generation. It is easy to read from resources, but it is difficult to update them.
| .boxed() | ||
| .flatMap(i -> { | ||
| String queryId = format("q%02d", i); | ||
| if (i == 14 || i == 23 || i == 24 || i == 39) { |
There was a problem hiding this comment.
This looks like something very ugly
There was a problem hiding this comment.
It looks, but this is how tpcds standard is defined. What do you propose to handle queries 14_1, 14_2, 23_1, 23_2...?
|
|
||
| private static Path tpcdsQuery(String queryId) | ||
| { | ||
| return Paths.get(format("./presto-benchto-benchmarks/src/main/resources/sql/presto/tpcds/%s.sql", queryId)); |
| @Override | ||
| protected Path getExpectedJoinOrderingFile(String queryId) | ||
| { | ||
| return Paths.get(format("./presto-tpcds/src/test/resources/tpcds-join-ordering/%s.txt", queryId)); |
There was a problem hiding this comment.
It is needed to expected query plans generation.
| private String generateQueryPlan(Path queryFile) | ||
| { | ||
| String sql = read(queryFile) | ||
| .replaceFirst(";", "") |
There was a problem hiding this comment.
Please just copy the queries here. After you have the queries in, introduce some convention, and remove all the weird methods for resolving the queries/results. Once you have them in a single folder, you can scan the classpath, run everything under this directory and compare it with the results stored in some other directory under the classpath.
There was a problem hiding this comment.
It is about 125 queries which we modify from time to time. The goal was to test exactly what we benchmark. Once we copy them here it will be no longer guaranteed.
|
Merged everything, but |
|
Most of these was merged, remaining is superseded with: #11665 |
The idea was abandoned during prestodb#11267 review.
The idea was abandoned during prestodb#11267 review.
The idea was abandoned during prestodb/presto#11267 review.
The idea was abandoned during prestodb/presto#11267 review.
The idea was abandoned during prestodb/presto#11267 review.
No description provided.