Skip to content

Conversation

@feilong-liu
Copy link
Contributor

@feilong-liu feilong-liu commented May 23, 2025

Description

This optimization converts queries like

select id, max(ds), max_by(feature1, ds), max_by(feature2, ds) from table group by id

to

select id, ds, feature1, feature2 from (select id, ds, feature1, feature2, row_number() over (partition by id order by ds desc) row_num) where row_num = 1

Here feature1, feature2 are maps. This rewrite can avoid the expensive cost of aggregations on feature1 and feature2. This is commonly used in getting latest features in machine learning workload.

Motivation and Context

Query optimization to reduce cost.

Impact

Query optimization to reduce cost.

Test Plan

Unit tests

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add a new optimization `MinMaxByToWindowFunction` to rewrite min_by/max_by aggregations with row_number window function

@feilong-liu feilong-liu requested a review from ZacBlanco May 23, 2025 20:57
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label May 23, 2025
@feilong-liu feilong-liu marked this pull request as draft May 23, 2025 20:57
@feilong-liu feilong-liu requested a review from kaikalur May 23, 2025 20:57
@feilong-liu feilong-liu force-pushed the feature_dedup branch 3 times, most recently from dbc5e09 to f1cd3c3 Compare May 28, 2025 05:24
@feilong-liu feilong-liu marked this pull request as ready for review May 28, 2025 16:36
@feilong-liu feilong-liu requested a review from rschlussel May 28, 2025 16:36
@feilong-liu feilong-liu requested a review from hantangwangd June 2, 2025 19:29
@jaystarshot
Copy link
Member

jaystarshot commented Jun 2, 2025

Maybe I am missing something but In
select id, max(ds), max_by(feature1, ds), max_by(feature2, ds) from table group by id
feature1/feature2 are not aggregated right? (they just fetch the feature from the max ds row?) so i don't understand

Here feature1, feature2 are maps. This rewrite can avoid the expensive cost of aggregations on feature1 and feature2.

@feilong-liu
Copy link
Contributor Author

Maybe I am missing something but In select id, max(ds), max_by(feature1, ds), max_by(feature2, ds) from table group by id feature1/feature2 are not aggregated right? (they just fetch the feature from the max ds row?) so i don't understand

Here feature1, feature2 are maps. This rewrite can avoid the expensive cost of aggregations on feature1 and feature2.

Correct, here is the definition of max_by(x, y) -> [same as x]() https://prestodb.io/docs/current/functions/aggregate.html#max_by-x-y-same-as-x Returns the value of x associated with the maximum value of y over all input values
This is commonly used in feature selection, i.e. to select the latest feature in a table for a user.

@feilong-liu
Copy link
Contributor Author

Maybe I am missing something but In select id, max(ds), max_by(feature1, ds), max_by(feature2, ds) from table group by id feature1/feature2 are not aggregated right? (they just fetch the feature from the max ds row?) so i don't understand

Here feature1, feature2 are maps. This rewrite can avoid the expensive cost of aggregations on feature1 and feature2.

And the expensive cost of aggregations I mean the process of feature maps in the accumulator (although it's not actually doing aggregation on it by semantics, it can still be expensive, especially for large maps)

import static com.facebook.presto.sql.relational.Expressions.comparisonExpression;
import static com.google.common.collect.ImmutableMap.toImmutableMap;

public class MinMaxByToWindowFunction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a small comment explaining the plan changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will add in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private int eagerPlanValidationThreadPoolSize = 20;
private boolean innerJoinPushdownEnabled;
private boolean inEqualityJoinPushdownEnabled;
private boolean rewriteMinMaxByToTopNEnabled;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be on by default?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess row number adds sorting so might not be always efficient but if your performance numbers show other wise then we can make it on by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to be conservative for now. Will consider to set it to be true after getting more stats for this optimizer

@feilong-liu feilong-liu merged commit 0729c1d into prestodb:master Jun 4, 2025
97 checks passed
@feilong-liu feilong-liu deleted the feature_dedup branch June 4, 2025 00:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:Meta PR from Meta

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants