@feilong-liu feilong-liu commented Sep 18, 2023

Description

Addresses #20355
Record the number of tasks used by scaled writers in history-based optimization (HBO) statistics, and use those statistics to set the initial number of writer tasks for scaled writers.

Motivation and Context

Scaled writers start with a single task writing data out and add tasks as needed when the source is throttled. With this PR, a scaled writer starts with a task count based on the number of tasks used in previous runs, so that it has more parallelism from the beginning and hence lower latency.
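
A minimal sketch of the idea, not the code in this PR: the task count observed in a previous run is recorded under a plan key and looked up on the next run. `WriterTaskHistory`, `record`, `initialTaskCount`, and the string plan key are hypothetical names used only for illustration; the real HBO statistics store is more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an HBO store: maps a plan key to the
// number of writer tasks observed in the most recent run.
final class WriterTaskHistory {
    private final Map<String, Integer> taskCountByPlanKey = new HashMap<>();

    // Record the writer task count seen at the end of a run.
    void record(String planKey, int taskCount) {
        if (taskCount < 1) {
            throw new IllegalArgumentException("taskCount must be at least 1");
        }
        taskCountByPlanKey.put(planKey, taskCount);
    }

    // Pick the starting task count for the next run. Falls back to 1
    // (the pre-existing behavior) when the feature is disabled or
    // no history exists for this plan key.
    int initialTaskCount(String planKey, boolean hboEnabled) {
        if (!hboEnabled) {
            return 1;
        }
        return taskCountByPlanKey.getOrDefault(planKey, 1);
    }
}
```

With this shape, a first run starts at 1 task, and a later run of the same plan starts at whatever count the earlier run recorded.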

Impact

Latency improvement for scaled writer pipelines

Test Plan

Test query

Also run with verifier suite

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add an HBO-based optimization for scaled writers, which can improve latency for queries with scaled writers enabled. It is controlled by the session property `enable_hbo_for_scaled_writer`, which defaults to `false`.

@feilong-liu feilong-liu requested review from a team and shrinidhijoshi as code owners September 18, 2023 21:33
@feilong-liu feilong-liu marked this pull request as draft September 18, 2023 21:33
@feilong-liu feilong-liu force-pushed the scaledwriter_hbo branch 6 times, most recently from 3c0ba5b to f8efc49 Compare September 19, 2023 23:13
Contributor Author

Get the number of tasks for the stage, and record it.

Contributor Author

Add table writer stats to estimate

Contributor Author

Add table writer node statistics to plan statistics

Contributor Author

Add a field that specifies the number of tasks a scaled writer starts with

Contributor Author

Start from 1 if no initial task number is specified
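
A rough sketch of this fallback, under stated assumptions: `ScaledWriterSketch` is a hypothetical class, not the scheduler in this PR, and clamping the initial count to a maximum is an assumption added for the example.

```java
import java.util.Optional;

// Hypothetical sketch of a scaled writer's task-count bookkeeping.
final class ScaledWriterSketch {
    private final int maxTaskCount;
    private int activeTaskCount;

    ScaledWriterSketch(Optional<Integer> initialTaskCount, int maxTaskCount) {
        if (maxTaskCount < 1) {
            throw new IllegalArgumentException("maxTaskCount must be at least 1");
        }
        this.maxTaskCount = maxTaskCount;
        // Start from 1 when no initial count is given.
        // Assumption: the value is clamped to [1, maxTaskCount].
        int requested = initialTaskCount.orElse(1);
        this.activeTaskCount = Math.max(1, Math.min(requested, maxTaskCount));
    }

    int activeTaskCount() {
        return activeTaskCount;
    }

    // Add one task when the source is throttled and there is headroom.
    int onSourceThrottled() {
        if (activeTaskCount < maxTaskCount) {
            activeTaskCount++;
        }
        return activeTaskCount;
    }
}
```

Using `Optional<Integer>` keeps the "not specified" case explicit, so the existing start-at-1 behavior is unchanged when no history is available.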

Contributor Author

Get the suggested number of table writer tasks from the query plan: find the TableWriterNode and read its "task number if scaled writer" field.
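
The lookup described here could be sketched roughly as follows. The `Node` and `WriterNode` classes and the `taskCountIfScaledWriter` field name are hypothetical stand-ins modeled on the comment's wording; they are not Presto's actual plan node API.

```java
import java.util.List;
import java.util.Optional;

final class PlanSketch {
    // Hypothetical minimal plan node holding only its child nodes.
    static class Node {
        final List<Node> sources;

        Node(List<Node> sources) {
            this.sources = sources;
        }
    }

    // Hypothetical writer node carrying an optional suggested task count.
    static final class WriterNode extends Node {
        final Optional<Integer> taskCountIfScaledWriter;

        WriterNode(List<Node> sources, Optional<Integer> taskCountIfScaledWriter) {
            super(sources);
            this.taskCountIfScaledWriter = taskCountIfScaledWriter;
        }
    }

    // Depth-first search for the first writer node that has a suggested count.
    static Optional<Integer> suggestedWriterTaskCount(Node root) {
        if (root instanceof WriterNode) {
            Optional<Integer> count = ((WriterNode) root).taskCountIfScaledWriter;
            if (count.isPresent()) {
                return count;
            }
        }
        for (Node source : root.sources) {
            Optional<Integer> found = suggestedWriterTaskCount(source);
            if (found.isPresent()) {
                return found;
            }
        }
        return Optional.empty();
    }
}
```

An empty result means the plan carries no suggestion, in which case the caller would keep the default starting count.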

Contributor Author

Add a rule to set the initial number of tasks for a table writer

Contributor Author

Add a field to specify the number of tasks to begin with if it's a scaled writer

@feilong-liu feilong-liu force-pushed the scaledwriter_hbo branch 4 times, most recently from 0660644 to 70ad759 Compare September 27, 2023 22:34
Contributor Author

Get the preferred number of tasks from table writer nodes in the plan

@feilong-liu feilong-liu marked this pull request as ready for review September 28, 2023 00:06
Contributor

is TableWriterMergeNode relevant here?

Contributor Author

No, the scaled writer optimization only concerns the table writer node, not the table writer merge node.

@feilong-liu feilong-liu requested a review from mlyublena October 2, 2023 17:34
@feilong-liu feilong-liu force-pushed the scaledwriter_hbo branch 2 times, most recently from dc63547 to af6430c Compare October 6, 2023 18:47
Contributor

I just merged #20990 that tracks which optimizers were cost-based and the source of stats used (CBO/HBO).
Can you override functions isCostBased and getStatsSource so this optimizer also gets tracked?

Contributor Author

Added in a separate PR #21120

@feilong-liu feilong-liu merged commit 18ac248 into prestodb:master Oct 11, 2023
@feilong-liu feilong-liu deleted the scaledwriter_hbo branch October 11, 2023 21:51
@wanglinsong wanglinsong mentioned this pull request Dec 8, 2023