Skip to content

Add timeout for HBO stats fetch#19933

Merged
feilong-liu merged 2 commits intoprestodb:masterfrom
feilong-liu:add_timeout_hbo
Jul 21, 2023
Merged

Add timeout for HBO stats fetch#19933
feilong-liu merged 2 commits intoprestodb:masterfrom
feilong-liu:add_timeout_hbo

Conversation

@feilong-liu
Copy link
Contributor

@feilong-liu feilong-liu commented Jun 22, 2023

Fix issue #20355
Depended by https://github.com/facebookexternal/presto-facebook/pull/2394

  • Add timeout to HBO optimizers, will not populate stats equivalent nodes field if timeout
  • Add queries which timeouts to log

Test plan - (Please fill in how you tested your changes)

Tested locally end to end.

== RELEASE NOTES ==

General Changes
* Add timeout for HBO optimizer, timeout set by session parameter `history_based_optimizer_timeout_limit`

@feilong-liu feilong-liu requested a review from a team as a code owner June 22, 2023 00:44
@feilong-liu feilong-liu requested a review from presto-oss June 22, 2023 00:44
@feilong-liu feilong-liu marked this pull request as draft June 22, 2023 00:44
@feilong-liu feilong-liu force-pushed the add_timeout_hbo branch 12 times, most recently from eff74bd to 169e486 Compare June 27, 2023 05:03
Comment on lines 102 to 108
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When HBO register function failed to load statistics, add warning and also add to logging

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we consider empty stats returned as failure of loading history

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove query ID when we invalidate the cache

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass timeout limit to stats provider

@feilong-liu feilong-liu reopened this Jun 28, 2023
@feilong-liu feilong-liu force-pushed the add_timeout_hbo branch 2 times, most recently from 2e2cf6d to d745c1a Compare June 28, 2023 02:10
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

directly return if load of history failed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Store the query ID which failed to get history stats

@feilong-liu feilong-liu marked this pull request as ready for review June 28, 2023 02:15
@feilong-liu feilong-liu changed the title [Draft] Add timeout for HBO Add timeout for HBO stats fetch Jun 28, 2023
Copy link
Contributor

@pranjalssh pranjalssh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if it catches timeout appropriately for pipeline in D46899410

A lot of cost will come from HistoricalStatisticsEquivalentPlanMarkingOptimizer where we prefetch all hbo stats

@feilong-liu feilong-liu force-pushed the add_timeout_hbo branch 4 times, most recently from a8ef310 to d8204b5 Compare July 4, 2023 00:14
@feilong-liu feilong-liu force-pushed the add_timeout_hbo branch 3 times, most recently from fb07a4a to cdcbe41 Compare July 10, 2023 18:28
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In unit tests, we want to test for queries which do not have join/aggregation too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we only care about query plans with join/aggregation, return false here, and skip adding stats equivalent node for this query plan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for unit tests, where queries without join/aggregation are tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if false returned, we will skip adding stats equivalent node in HBO marking optimizer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if fetch history timeouts or return empty, we still need StatsEquivalentPlanNode as it will be used in populating HBO history.

@feilong-liu feilong-liu marked this pull request as ready for review July 10, 2023 18:40
@feilong-liu
Copy link
Contributor Author

Can you check if it catches timeout appropriately for pipeline in D46899410

A lot of cost will come from HistoricalStatisticsEquivalentPlanMarkingOptimizer where we prefetch all hbo stats

I debug the pipeline in this diff, the problem is not in fetching stats, but from calculating plan node hashes, which includes inputing table meta data access that is the problem here.

This PR does not solve this problem, as I do not find a way to pass timeout to meta data access here, and solving this problem should be from a separate PR imo.

But still, this PR can solve the following problems: 1) reduce overhead, as I added a new field to track queries which fail to fetch history stats, and skip trying to get HBO stats in this case 2) add new field in logging for timeout queries, so that we can track these queries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to log additional information about the failure?
do we plan to use this for other usecases beyond HBO timeout?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps make the function more general and pass in the optimizer name as argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to log additional information about the failure?

Maybe just this for now, we can add more later if needed.

do we plan to use this for other usecases beyond HBO timeout?

Other optimizers can populate this field as well.

perhaps make the function more general and pass in the optimizer name as argument

This is a private method in HBO optimizer, will not be used by other optimizers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this is a set? do we expect to have more than 1 item there? or is it so we can quickly check for containment without worrying if it is set or not

Copy link
Contributor Author

@feilong-liu feilong-liu Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query ID will be removed after the query completes. Yes, there can be more than 1 item, when multiple queries are running concurrently.

The life span of the query ID will be the same as in other Maps in the same class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's cool!
I think another use case for this field could be tracking bugs in disabled optimizers: when checking if an optimizer is applicable (when verbose_optimizer_info_enabled=true) I added a try/block to prevent the main loop crashing with a buggy (but disabled) optimizer: instead of silently ignoring this, we can record that we found a bug for an applicable optimizer (see PlanOptimizer::isApplicable)

@feilong-liu feilong-liu requested review from pranjalssh and removed request for pranjalssh July 12, 2023 23:23
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets call it "optimizer.history-based-optimizer-timeout". And use Duration instead of int. Config can then read values like "1s" or "100ms"

@feilong-liu feilong-liu requested a review from jainxrohit July 20, 2023 17:02
@feilong-liu feilong-liu force-pushed the add_timeout_hbo branch 3 times, most recently from 8944c73 to 99182f3 Compare July 21, 2023 06:10
Add timeout to HBO mark optimizer, the timeout value is specified by
session property history_based_optimizer_timeout_limit.
@feilong-liu feilong-liu merged commit 342dd5f into prestodb:master Jul 21, 2023
@feilong-liu feilong-liu deleted the add_timeout_hbo branch July 21, 2023 21:20
@wanglinsong wanglinsong mentioned this pull request Jul 27, 2023
28 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants