Skip to content

Add an option to have HBO return output size with individual variable size#25400

Merged
feilong-liu merged 1 commit intoprestodb:masterfrom
feilong-liu:hbo_cpp
Jun 26, 2025
Merged

Add an option to have HBO return output size with individual variable size#25400
feilong-liu merged 1 commit intoprestodb:masterfrom
feilong-liu:hbo_cpp

Conversation

@feilong-liu
Copy link
Contributor

@feilong-liu feilong-liu commented Jun 21, 2025

Description

For history based optimizer (HBO), when getting the size of output of a plan node, it will use the recorded output size in history, i.e. estimateSizeUsingVariables() returns false for HBO

public double getOutputSizeInBytes(PlanNode planNode)
{
requireNonNull(planNode, "planNode is null");
if (!sourceInfo.estimateSizeUsingVariables() && !isNaN(totalSize)) {
return totalSize;
}

However, this can be a problem for Prestissimo. The reported output size of a plan node is a best estimation and depending on a lot of factors, e.g. data late materialization, decode/encode etc.
For example, according to @kevinwilfong, in hash join, data will be materialized for hash build side, while for probe side, data may or may not be materialized depending on batch size. And data not materialized does not count when reporting output size. This can lead to a problem, for example, the probe side input will report a smaller size if not materialized. However, when later the join is reordered that the probe side is switched to build side, data will be materialized, the output size will be larger than history records, hence lead to query oom.

In this PR, I added an option to HBO to use individual variable size to estimate the plan output size. This is also what CBO is using.

Motivation and Context

To make HBO work better with Prestissimo

Impact

To make HBO work better with Prestissimo

Test Plan

Unit test

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add a session property to have HBO estimate plan node output size using individual variables

@feilong-liu feilong-liu requested a review from a team as a code owner June 21, 2025 17:08
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Jun 21, 2025
@feilong-liu feilong-liu marked this pull request as draft June 21, 2025 17:08
@feilong-liu feilong-liu force-pushed the hbo_cpp branch 2 times, most recently from a326b45 to f046dee Compare June 22, 2025 21:57
@feilong-liu feilong-liu marked this pull request as ready for review June 23, 2025 05:54
@jaystarshot
Copy link
Member

If outputsize is incorrect why won't outputRowCount be incorrect?
If thats is incorrect even getOutputSizeForVariables(planNode.getOutputVariables()); will be wrong? Maybe prestissimo should report the correct value?

@feilong-liu
Copy link
Contributor Author

If outputsize is incorrect why won't outputRowCount be incorrect? If thats is incorrect even getOutputSizeForVariables(planNode.getOutputVariables()); will be wrong? Maybe prestissimo should report the correct value?

OutputRowCount is correct. Only the output size can be inaccurate. It's due to multiple reasons in my understanding, for example encoding of data where you do not know the size after decoding, late materialization where at this stage no knowledge of data size since it's not loaded yet etc. And that's why I think having an option to use the variable sizes with accurate row count may be a better option.

@steveburnett
Copy link
Contributor

steveburnett commented Jun 25, 2025

  • Please include documentation for the new session property in this PR. See iii(b) in Designing Your Code in CONTRIBUTING.md.

  • Add the session property name in the release note as a link to the documentation. See Formatting in the Release Notes Guidelines and the example here for how to create the link.

== RELEASE NOTES ==

General Changes
* Add the session property :ref:`admin/properties-session:\`\`name_of_session_property\`\`` to have the HBO estimate plan node output size using individual variables.

@feilong-liu feilong-liu merged commit 73c3d4a into prestodb:master Jun 26, 2025
173 of 175 checks passed
@feilong-liu feilong-liu deleted the hbo_cpp branch June 26, 2025 16:26
@steveburnett
Copy link
Contributor

steveburnett commented Jun 26, 2025

Opened #25447 for the documentation of the new session property.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:Meta PR from Meta

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants