Skip to content

Test cost-based decisions on TPCH, TPCDS queries and CBO improvements#11267

Closed
findepi wants to merge 8 commits intoprestodb:masterfrom
starburstdata:epic/cbo/pr/42
Closed

Test cost-based decisions on TPCH, TPCDS queries and CBO improvements#11267
findepi wants to merge 8 commits intoprestodb:masterfrom
starburstdata:epic/cbo/pr/42

Conversation

@findepi
Copy link
Contributor

@findepi findepi commented Aug 14, 2018

No description provided.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was renamed to PARTITIONED

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe extract JoinEnumerationResult from ReorderJoins instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Or better use this class in ReorderJoins

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The classes are simple and yet different -- join reordering result needs Optional<PlanNode>. IMO reusing single class wouldn't improve readability.
For now, i just moved this class as an private member of DetermineJoinDistributionType.

Copy link
Contributor

@rschlussel rschlussel Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to follow and I don't like that it relies on knowing what makes something shouldRepartition and mustBroadcast (i.e. when flipping will be a problem)
Instead try:

if(!mustBroadcast(joinNode)) {
    possibleJoinNodes.add(getJoinNodeWithCost(costProvider, joinNode.withDistributionType(PARTITIONED))
}
if(!mustRepartition(joinNode)) {
   possibleJoinNodes.add(getJoinNodeWithCost(costProvider, joinNode.withDistributionType(REPLICATED))
}
JoinNode flipped = joinNode.flipChildren();
if(!mustBroadcast(flipped)) {
    possibleJoinNodes.add(getJoinNodeWithCost(costProvider, flipped.withDistributionType(PARTITIONED))
}
if(!mustRepartition(flipped)) {
   possibleJoinNodes.add(getJoinNodeWithCost(costProvider, flipped.withDistributionType(REPLICATED))
}

And then have

private static boolean mustRepartition(JoinNode node) 
{
     JoinNode.Type type = joinNode.getType();
     return type == RIGHT || type == FULL
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if(mustRepartition(node) || (!mustBroadcast(node) && joinDistributionType.canRepartition())) {
    return node.withDistributionType(PARTITIONED)
} 
return node.withDistributionType(REPLICATED);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OR

private PlanNode getSyntacticOrderJoin(JoinNode node, Context context, JoinDistributionType joinDistributionType)
    {
        if (mustRepartition(node)) {
            return node.withDistributionType(PARTITIONED);
        }
        if (mustBroadcast(node)) {
            return node.withDistributionType(REPLICATED);
        }
        if (isAtMostScalar(node.getRight(), context.getLookup())) {
            return node.withDistributionType(REPLICATED);
        }
        if (joinDistributionType.canPartition()) {
            return node.withDistributionType(PARTITIONED);
        }
        return node.withDistributionType(REPLICATED);
    }

    private static boolean mustRepartition(JoinNode node)
    {
        // With REPLICATED, the unmatched rows from right-side would be duplicated.
        return node.getType() == RIGHT || node.getType() == FULL;
    }

    private static boolean mustBroadcast(JoinNode node)
    {
        return node.getCriteria().isEmpty();
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's pretty confusing what shouldRepartition means. for some things it's "must" and for some things it's "can". As suggested above, I'd replace this with a mustRepartition method and put the other logic in the calling code in getSyntacticOrderJoin (we don't even care about those things for getCostBasedJoin).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i rewrite all the logic in DetermineJoinDistributionType according to your suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only joins? seems like it would be equally useful for other kinds of nodes. and in the future we might generate multiple plans that differ in things other than join nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Determining whether plans describe the same relation is not an easy problem and for caching purposes, this needs to be computed efficiently. Note also, that we not always need to find out equivalent plans. Memo equivalence groups contain equivalent plans (today, only 1 plan, i know), without need for any computations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove? this looks left over from manual testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

literalEncoder.toExpression() already returns the Expression if the value is an instanceof Expression, and a NullLiteral if it equals null.

I would get rid of the if statements and always return literalEncoder.toExpression(value,BOOLEAN). And add a visitNullLiteral to FilterExpressionStatsCalculatingVisitor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

literalEncoder.toExpression(value,BOOLEAN) returns CAST(NULL AS boolean) which cannot be handled as nicely as defining visitNullLiteral().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/join/JOIN/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm obviously not going through all the plan output files for tpc-h and tpc-ds so I'm just going to assume someone verified that the plans are good.

I have some concerns about this testing style though. It seems very brittle and frustrating to update for intentional changes. I like the idea of having a quick way to know whether you possibly broke tpch or tpcds, but I think it will become more trouble than it's worth.

I'm also not sure why this doesn't use the existing plan matching framework. Is it because this is simpler and someone autogenerated the result files?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also a little bit afraid of s burden introduced by this PR. If something changes in the plan structure we would need to analyze all the failures, and fix over 9000 text files.

Or, as it will happen more likely - we will be just blindly regenerating the test results (we end up doing this for plan matching tests in product tests at Teradata). That kinda kills the purpose of a test.

I would really try to hard about whats the purpose of this test. And if the value worth the maintenance costs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plans are not well-covered by other existing test. We created this tests because for an innocent change we don't really know if it causes any (unintentional) plan changes or not and it's not always possible to run a change on a benchmarking cluster.

@kokosing @sopel39 what's your opinion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those tests are valuable as they allow us to catch real world regressions with supposedly innocent CBO/stats changes. Consider them as product tests that test CBO end-to-end. In general, any CBO changes should be benchmarked. We've seen changes that should improve stats calculations, but caused performance regressions.

Without framework here you would need to benchmark some set of queries to see if performance has changed. With this PR, you can focus on just a subset of queries and you can also validate if changed plans are rational. This greatly simplifies process of evaluating CBO modifications but also provides an automated way of catching potential regressions (we've been kicked with such case once).

One drawback though is that in order to validate performance changes one need to recreate real environment used for generating this plans. Currently it assumes 9 machines and we run actually benchmarks against them. So the question is how we provide standardized performance testbed so that we can validate CBO decisions against real query times.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion. Perhaps you can keep them for now. Let's see how often will we need to change them in practice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not have them here until it's easily actionable for people if they change. If I don't have a way to evaluate whether a plan change caused a regression, I'm just going to blindly change the results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's ok as long as you know what you're doing (you will know) and also, you know that you are doing this (you will know because this of this test).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but i wouldn't actually check for a regression because I wouldn't have the cluster set up to to do that, so it's just adding overhead. It's one (or 100) more things to change without really helping with identifying problems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Or better use this class in ReorderJoins

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

message

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

message

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just node.getCriteria().isEmpty()? If there is no equi criteria - the type of join doesn't matter much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot replicate RIGHT join, otherwise unmatched RIGHT rows would be duplicated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if there is no equi criteria, how can you partition the RIGHT join? Should we fail saying that right non equi join is not supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will get partitioned on "no symbols", i.e. a constant, single node.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually looks more like a should

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it still can be executed as a repartitioned join (very skewed one though).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Files.asCharSource(file, UTF8).read()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why path is not included?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used by IntelliJ to annotate test executions with @DataProvider

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

queryFile ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this method is confusion. How about generateQueryPlan, and change the parameter name queryFilePath

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have a similar one for the SemiJoin?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... we should. But why now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can introduce it as a separate PR. Just in case you can have it stashed somewhere, we can include it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't :(

@arhimondr
Copy link
Member

Please create a separate PR for Add Exchange before GroupId to improve Partial Aggregation.

Also i would recommend dropping the Add test for CBO decisions on TPC-H, TPC-DS queries (and the Introduce partitioning switch in TPC-H connector). IMFO this type of testing introduces more burden that value.

Also i would think twice about the Cache stats for equivalent join nodes. It introduces additional "brittleness" for a really little value.

@arhimondr
Copy link
Member

@sopel39 could you please add more details to the Remove min/max intersection heuristics from equality stats calculator commit message?

@findepi
Copy link
Contributor Author

findepi commented Aug 18, 2018

Please create a separate PR for Add Exchange before GroupId to improve Partial Aggregation.

@arhimondr @rschlussel ok, but why so?
This is a cost-based (actually: stats-based) rule for some particular use of partial aggregations.
Also, it's relation to the rest of the PR is fundamental -- in some tpc-ds queries (two or three), ReoderJoins makes optimal decision (as long as joins are concerns), but causing a regression overall (because the plan above joins benefited from the fact the data was being repartitioned for the sake of the joins).

@sopel39
Copy link
Contributor

sopel39 commented Aug 20, 2018

@arhimondr I think the description for Remove min/max intersection heuristics from equality stats calculator should be along the lines:

During TPCH/TPCDS benchmarking this optimization proved to performed worse that without it.
After investigation we found that cross-column correlation is not accounted for when applying
filtering for lo-max ranges (e.g: when filtering dimension table on date, we don't adjust row id
column range which is tightly correlated to date column). This caused great underestimation
of filtered rows.

@findepi
Copy link
Contributor Author

findepi commented Aug 20, 2018

AC

@arhimondr
Copy link
Member

This is a cost-based (actually: stats-based) rule for some particular use of partial aggregations.
Also, it's relation to the rest of the PR is fundamental -- in some tpc-ds queries (two or three), ReoderJoins makes optimal decision (as long as joins are concerns), but causing a regression overall (because the plan above joins benefited from the fact the data was being repartitioned for the sake of the joins).

It has to be disabled by default at least. We do not enable other cost based rules by default not because we are not sure in rules correctness itself, but rather because our estimates might not be accurate enough.

Also the commit itself is huge. It might be easier to have it as a separate PR.

@findepi
Copy link
Contributor Author

findepi commented Aug 21, 2018

i will drop Add Exchange before GroupId to improve Partial Aggregation from this PR in the next squash/rebase.

Should I squash the fixups now?

@arhimondr
Copy link
Member

Should I squash the fixups now?

Yup

@findepi
Copy link
Contributor Author

findepi commented Aug 28, 2018

squashed & rebased

@findepi
Copy link
Contributor Author

findepi commented Aug 28, 2018

i dropped Add Exchange before GroupId to improve Partial Aggregation (will be in separate PR) and Cache stats for equivalent join nodes commits.

@findepi findepi force-pushed the epic/cbo/pr/42 branch 2 times, most recently from f93ce09 to f5966cf Compare August 28, 2018 10:40
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why get rid of the override?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow i totally misread that. you added it.

Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think we should have the plan tests for tpc-h and tpc-ds queries, but everything else looks good.

@kokosing
Copy link
Contributor

I still don't think we should have the plan tests for tpc-h and tpc-ds queries, but everything else looks good.

This is great feature. Image you are changing some heuristic. To test your change you would need to run whole benchmark suite. Having tests you could easily find queries which are affected, so you can only tests them.

@rschlussel
Copy link
Contributor

@kokosing I agree it's very useful if you're then going to run benchmarks. However, right now nobody other than you guys has the set up to run those benchmarks. That means people will just blindly change test files without getting any extra signal. That's why I feel that we shouldn't have this feature until there's a test set up the community can use.

@rschlussel
Copy link
Contributor

And it will impact planner changes unrelated to CBO heuristics, which could become burdensome. However, I'm not going to hold up the PR for this. We can always re-evalute if we find it does become burdensome.

@kokosing
Copy link
Contributor

That means people will just blindly change test files without getting any extra signal.

But at least we will be aware that plans got changed. We will know what change affect plans and what is not. It is very easy to regenerate these plans and do a review plan changes. Notice that plan format is much simpler and it is should not change so frequently like output of the EXPLAIN.

findepi and others added 8 commits August 30, 2018 12:11
Formatted dates were added in fc5419d,
but the reading code was dropped as "unused" during review process.
This is often necessary in tests to have VALUES filled with some data
rows, otherwise `QueryCardinalityUtil` will deem VALUES as scalar,
affecting testing conditions.
When join reordering is disabled, `ReorderJoins` doesn't select join
distribution.  This commit makes `DetermineJoinDistributionType` rule to
make this decision on cost basis, if only table statistics are
available.
During TPCH/TPCDS benchmarking this optimization proved to performed worse that without it.
After investigation we found that cross-column correlation is not accounted for when applying
filtering for lo-max ranges (e.g: when filtering dimension table on date, we don't adjust row id
column range which is tightly correlated to date column). This caused great underestimation
of filtered rows.
Since an optimizer rule may create new expressions (new filters, new
joins) and call cost/stats estimation on that, `FilterStatsCalculator`
should not assume that all expressions have been simplified earlier.
TPC-H connector has 2 tables with partitioning (`lineitem`, `orders`).
By introducing a switch disabling this behavior, we can simulate tables
that are not partitioned.
@findepi
Copy link
Contributor Author

findepi commented Aug 30, 2018

Let's postpone the discussion about merit of these tests until we have some real experience. Our (internal) experience was positive, but I acknowledge the mileage may vary.

I am going to merge this once tests pass.

Copy link
Member

@arhimondr arhimondr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed Add test for CBO decisions on TPC-H, TPC-DS queries. Everything else looks good.


public static <T> List<T> nElements(int n, IntFunction<T> function)
{
checkArgument(n >= 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a message

.collect(toImmutableList());
}

public static <T> List<T> nElements(int n, IntFunction<T> function)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one of these function that are being only "sometimes" reused even by the author itself. Someone who doesn't know that this very specific function particularly exist - never would try to lookup one, but instead will try to implement either a copy, or just inline the algorithm. I'm not even saying it is available only in a single module. I don't think it is worth adding such a functions.

System.out.println(query.id);
}
catch (IOException e) {
throw new RuntimeException(e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UncheckedIOException

return Files.asCharSource(file.toFile(), UTF_8).read();
}
catch (IOException e) {
throw new RuntimeException(e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UncheckedIOException


public static final class Query
{
private final String id;
Copy link
Member

@arhimondr arhimondr Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: getters-settere even for the inner class are good

@Override
protected Path getExpectedJoinOrderingFile(String queryId)
{
return Paths.get(format("./presto-main/src/test/resources/tpch-join-ordering/%s.txt", queryId));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto about path

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one is more tricker. It is used for generation. It is easy to read from resources, but it is difficult to update them.

.boxed()
.flatMap(i -> {
String queryId = format("q%02d", i);
if (i == 14 || i == 23 || i == 24 || i == 39) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like something very ugly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks, but this is how tpcds standard is defined. What do you propose to handle queries 14_1, 14_2, 23_1, 23_2...?


private static Path tpcdsQuery(String queryId)
{
return Paths.get(format("./presto-benchto-benchmarks/src/main/resources/sql/presto/tpcds/%s.sql", queryId));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto about paths

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fixed

@Override
protected Path getExpectedJoinOrderingFile(String queryId)
{
return Paths.get(format("./presto-tpcds/src/test/resources/tpcds-join-ordering/%s.txt", queryId));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto about paths

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed to expected query plans generation.

private String generateQueryPlan(Path queryFile)
{
String sql = read(queryFile)
.replaceFirst(";", "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please just copy the queries here. After you have the queries in, introduce some convention, and remove all the weird methods for resolving the queries/results. Once you have them in a single folder, you can scan the classpath, run everything under this directory and compare it with the results stored in some other directory under the classpath.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is about 125 queries which we modify from time to time. The goal was to test exactly what we benchmark. Once we copy them here it will be no longer guaranteed.

@arhimondr
Copy link
Member

@findepi @kokosing I'm going to merge everything, but the last commit. You can address the comments for the last commit and merge it in separate.

@arhimondr
Copy link
Member

arhimondr commented Sep 4, 2018

Merged everything, but Add test for CBO decisions on TPC-H, TPC-DS queries

@kokosing
Copy link
Contributor

kokosing commented Oct 9, 2018

Most of these was merged, remaining is superseded with: #11665

@kokosing kokosing closed this Oct 9, 2018
@findepi findepi deleted the epic/cbo/pr/42 branch October 9, 2018 09:29
kokosing pushed a commit to starburstdata/facebook-presto that referenced this pull request Dec 14, 2018
The idea was abandoned during
prestodb#11267 review.
kokosing pushed a commit to starburstdata/facebook-presto that referenced this pull request Dec 14, 2018
The idea was abandoned during
prestodb#11267 review.
findepi added a commit to findepi/trino that referenced this pull request Jan 29, 2019
The idea was abandoned during
prestodb/presto#11267 review.
findepi added a commit to trinodb/trino that referenced this pull request Feb 1, 2019
The idea was abandoned during
prestodb/presto#11267 review.
findepi added a commit to starburstdata/trino that referenced this pull request Feb 1, 2019
The idea was abandoned during
prestodb/presto#11267 review.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants