Skip to content

optimize APPROX_DISTINCT operations on constant values#25262

Closed
hdikeman wants to merge 1 commit intoprestodb:masterfrom
hdikeman:export-D76161617
Closed

optimize APPROX_DISTINCT operations on constant values#25262
hdikeman wants to merge 1 commit intoprestodb:masterfrom
hdikeman:export-D76161617

Conversation

@hdikeman
Copy link
Contributor

@hdikeman hdikeman commented Jun 6, 2025

Description

APPROX_DISTINCT operations on a constant value (e.g. APPROX_DISTINCT(IF(..., const))) are more expensive than and functionally equivalent to ARBITRARY(IF(..., 1, 0))
Adding an optimizer rule to replace any APPROX_DISTINCT operations on constant conditionals with equivalent calls to ARBITRARY

Motivation and Context

Some autogenerated queries use this pattern, which is inefficient and causes OOM errors for complex queries

Impact

Queries which use this APPROX_DISTINCT pattern will consume less memory

Test Plan

Adding test coverage, E2E and unit tests

Also did some E2E testing manually to make sure the substitution was occurring:

presto:tiny> explain select approx_distinct(if(orderkey = orderkey, 'const')) from orders;
                                                                                                                    Query Plan                                                               >
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
 - Output[PlanNodeId 8][_col0] => [approx_distinct:bigint]                                                                                                                                   >
         _col0 := approx_distinct (1:16)                                                                                                                                                     >
     - Aggregate(FINAL)[PlanNodeId 3] => [approx_distinct:bigint]                                                                                                                            >
             approx_distinct := "presto.default.arbitrary"((arbitrary)) (1:16)                                                                                                               >
         - LocalExchange[PlanNodeId 285][SINGLE] () => [arbitrary:bigint]                                                                                                                    >
             - RemoteStreamingExchange[PlanNodeId 291][GATHER - COLUMNAR] => [arbitrary:bigint]                                                                                              >
                 - Aggregate(PARTIAL)[PlanNodeId 289] => [arbitrary:bigint]                                                                                                                  >
                         arbitrary := "presto.default.arbitrary"((expr_4)) (1:16)                                                                                                            >
                     - ScanProject[PlanNodeId 0,1][table = TableHandle {connectorId='tpch', connectorHandle='orders:sf0.01', layout='Optional[orders:sf0.01]'}, projectLocality = LOCAL] => [>
                             Estimates: {source: CostBasedSourceInfo, rows: 15,000 (131.84kB), cpu: 135,000.00, memory: 0.00, network: 0.00}/{source: CostBasedSourceInfo, rows: 15,000 (131.>
                             expr_4 := IF((orderkey) = (orderkey), BIGINT'1', BIGINT'0') (1:72)                                                                                              >
                             orderkey := tpch:orderkey (1:71)                                                                                                                                >
                             tpch:orderstatus                                                                                                                                                >
                                 :: [["F"], ["O"], ["P"]] 

All unit tests were run on latest revision. Also, a verifier run was performed:

2025-06-23T16:28:11.319-0700    INFO    main    com.facebook.presto.verifier.framework.VerificationManager      Progress: 4637 succeeded, 2510 skipped, 1 resolved, 13 failed, 210 resubmitted, 100.00% done                                
[...]
2025-06-23 16:28:23 INFO: [verifier.py:1484] Imported verifier logs. Success rate: 99.87%. Correctness rate: 99.94%.

failed queries were due to load

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@hdikeman hdikeman requested a review from a team as a code owner June 6, 2025 23:45
@hdikeman hdikeman requested a review from hantangwangd June 6, 2025 23:45
@facebook-github-bot
Copy link
Collaborator

This pull request was exported from Phabricator. Differential Revision: D76161617

@hdikeman hdikeman marked this pull request as draft June 6, 2025 23:53
hdikeman added a commit to hdikeman/presto that referenced this pull request Jun 9, 2025
Summary:

`APPROX_DISTINCT` operations on a constant value (e.g. `APPROX_DISTINCT('abcd')`) are more expensive than and functionally equivalent to `ARBITRARY(1)`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant values with equivalent calls to `ARBITRARY`

Right now I inline constants into the aggregation itself, though I could do this with another optimizer rule

Differential Revision: D76161617
@hdikeman hdikeman force-pushed the export-D76161617 branch from 7f3cedb to 22421bd Compare June 9, 2025 21:57
@facebook-github-bot
Copy link
Collaborator

This pull request was exported from Phabricator. Differential Revision: D76161617

@kaikalur
Copy link
Contributor

kaikalur commented Jun 9, 2025

on thinking more - I think we can simply replace this to 1 - we don't even need arbitrary. I was just concerned about messing with aggs but I think we have good enough optimizations now to figure out the constant 1 and hoist it.

@kaikalur
Copy link
Contributor

kaikalur commented Jun 9, 2025

on thinking more - I think we can simply replace this to 1 - we don't even need arbitrary. I was just concerned about messing with aggs but I think we have good enough optimizations now to figure out the constant 1 and hoist it.

Actually approx_distinct(null) is 0 but any other constant should be 1

hdikeman added a commit to hdikeman/presto that referenced this pull request Jun 10, 2025
Summary:

`APPROX_DISTINCT` operations on a constant value (e.g. `APPROX_DISTINCT('abcd')`) are more expensive than and functionally equivalent to `ARBITRARY(1)`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant values with equivalent calls to `ARBITRARY`

Right now I inline constants into the aggregation itself, though I could do this with another optimizer rule

Differential Revision: D76161617
@facebook-github-bot
Copy link
Collaborator

This pull request was exported from Phabricator. Differential Revision: D76161617

hdikeman added a commit to hdikeman/presto that referenced this pull request Jun 10, 2025
Summary:

`APPROX_DISTINCT` operations on a constant value (e.g. `APPROX_DISTINCT('abcd')`) are more expensive than and functionally equivalent to `ARBITRARY(1)`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant values with equivalent calls to `ARBITRARY`

Right now I inline constants into the aggregation itself, though I could do this with another optimizer rule

Differential Revision: D76161617
@facebook-github-bot
Copy link
Collaborator

This pull request was exported from Phabricator. Differential Revision: D76161617

@tdcmeehan tdcmeehan added the from:Meta PR from Meta label Jun 12, 2025
@prestodb-ci
Copy link
Contributor

Saved that user @hdikeman is from Meta

@kaikalur kaikalur requested a review from feilong-liu June 12, 2025 17:47
@kaikalur
Copy link
Contributor

Also look at:

#25301

that's the most general way to do this. And actually easier!

@facebook-github-bot
Copy link
Collaborator

This pull request was exported from Phabricator. Differential Revision: D76161617

hdikeman added a commit to hdikeman/presto that referenced this pull request Jun 13, 2025
Summary:
Pull Request resolved: prestodb#25262

`APPROX_DISTINCT` operations on a constant value (e.g. `APPROX_DISTINCT('abcd')`) are more expensive than and functionally equivalent to `ARBITRARY(1)`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant values with equivalent calls to `ARBITRARY`

Differential Revision: D76161617
@hdikeman
Copy link
Contributor Author

on thinking more - I think we can simply replace this to 1 - we don't even need arbitrary. I was just concerned about messing with aggs but I think we have good enough optimizations now to figure out the constant 1 and hoist it.

Gotcha. Wonder if we could defer that or open a separate task. Might be finicky to try to eliminate the aggregation entirely

Actually approx_distinct(null) is 0 but any other constant should be 1

The way I got around this @kaikalur was by just ignoring NULLs and only replacing true constants (I check !isNull prior to firing the optimizer on a particular aggregation)

I could have handled it but have another task for optimization of functions with NULL inputs where could handle it more generally

By the way, thanks for the review! Sorry I'm just seeing this, I dumped this PR into draft mode and forgot about it while I worked out a couple bugs

@hdikeman hdikeman marked this pull request as ready for review June 13, 2025 04:59
@kaikalur
Copy link
Contributor

I could have handled it but have another task for optimization of functions with NULL inputs where could handle it more generally

This non-trivial. Too many subtle semantics. So your two choices:

  • Just complete this PR to handle approx_distinct for all null and non-null constants only
  • Fix the linked issue to extend RowExpressionInterpter to handle all these single value aggs like min/max/approx_distinct/count(distinct)/arbitarary explicitly for constant inputs.

@hdikeman
Copy link
Contributor Author

hdikeman commented Jun 13, 2025

@kaikalur and I talked offline

For the case which motivated this change (APPROX_DISTINCT(IF(expression, constant))) the RowExpressionEvaluator will not work, since the conditional predicate may contain non-constant variable references. But for cases of aggregation(constant) generally, that is a better approach

So I will remove the handling for the APPROX_DISTINCT(constant) case and leave the if-then replacement in. The handling for aggregations with pure constant inputs will be handled in a follow-up task. Will get an amendment shortly

@kaikalur
Copy link
Contributor

@kaikalur and I talked offline

For the case which motivated this change (APPROX_DISTINCT(expression, constant)) the RowExpressionEvaluator will not work, since the conditional predicate may contain non-constant variable references. But for cases of aggregation(constant) generally, that is a better approach

So I will remove the handling for the APPROX_DISTINCT(constant) case and leave the if-then replacement in. The handling for aggregations with pure constant inputs will be handled in a follow-up task. Will get an amendment shortly

Yeah - let's make sure to run this rule after constant pull up happens

@facebook-github-bot
Copy link
Collaborator

This pull request was exported from Phabricator. Differential Revision: D76161617

@facebook-github-bot
Copy link
Collaborator

@hdikeman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@steveburnett
Copy link
Contributor

If this PR doesn't need a release note, please add

== NO RELEASE NOTE ==

Thanks!

hdikeman added a commit to hdikeman/presto that referenced this pull request Jun 16, 2025
…restodb#25262)

Summary:
Pull Request resolved: prestodb#25262

`APPROX_DISTINCT` operations on a conditional constant value (e.g. `APPROX_DISTINCT(IF(expr, 'abcd'))`) are more expensive than and functionally equivalent to `ARBITRARY(IF(expr, 1, 0))`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant conditional values with equivalent calls to `ARBITRARY`

This comes up in some automated queries

Differential Revision: D76161617
@facebook-github-bot
Copy link
Collaborator

@hdikeman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hdikeman hdikeman requested a review from kaikalur June 16, 2025 20:02
hdikeman added a commit to hdikeman/presto that referenced this pull request Jun 17, 2025
…restodb#25262)

Summary:
Pull Request resolved: prestodb#25262

`APPROX_DISTINCT` operations on a conditional constant value (e.g. `APPROX_DISTINCT(IF(expr, 'abcd'))`) are more expensive than and functionally equivalent to `ARBITRARY(IF(expr, 1, 0))`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant conditional values with equivalent calls to `ARBITRARY`

This comes up in some automated queries

Differential Revision: D76161617
@hdikeman
Copy link
Contributor Author

We talked about this comment (#25262 (comment)) more offline:

I am not sure about this part. The result can be either 1 or 2, when the two non-null arguments are different.

Turns out this indeterminism is also an issue even for the "typical" case of approx_distinct(if(..., const))

i.e. the function arbitrary(if(..., 1, 0)) is ALSO indeterminate in the case of a condition which may be either true or false for different rows: arbitrary will return 0 or 1 at random basically. What we really want is arbitrary(if(..., 1, NULL))

However, this has the added complication that approx_distinct(NULL) = 0 whereas arbitrary(NULL) = NULL. So we need to add another project on top to convert those NULLs to zeros appropriately

Something else to consider: we could switch from ARBITRARY to MAX and handle this as max(if(..., 1, 0)), but it would halfway defeat the point of optimizing this case at all. Maybe would still represent an improvement though (if my assumption is correct and max is likely to chew up less memory than approx_distinct)

hdikeman added a commit to hdikeman/presto that referenced this pull request Jun 19, 2025
…restodb#25262)

Summary:
Pull Request resolved: prestodb#25262

`APPROX_DISTINCT` operations on a conditional constant value (e.g. `APPROX_DISTINCT(IF(expr, 'abcd'))`) are more expensive than and functionally equivalent to `ARBITRARY(IF(expr, 1, 0))`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant conditional values with equivalent calls to `ARBITRARY`

This comes up in some automated queries

Differential Revision: D76161617
@facebook-github-bot
Copy link
Collaborator

@hdikeman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@feilong-liu feilong-liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution.
Overall lgtm.
Can you follow the instructions here to run some tests?

…restodb#25262)

Summary:
Pull Request resolved: prestodb#25262

`APPROX_DISTINCT` operations on a conditional constant value (e.g. `APPROX_DISTINCT(IF(expr, 'abcd'))`) are more expensive than and functionally equivalent to `ARBITRARY(IF(expr, 1, 0))`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant conditional values with equivalent calls to `ARBITRARY`

This comes up in some automated queries

Differential Revision: D76161617
@facebook-github-bot
Copy link
Collaborator

@hdikeman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hdikeman hdikeman requested a review from feilong-liu June 24, 2025 16:35
hdikeman added a commit to hdikeman/presto that referenced this pull request Jun 24, 2025
…restodb#25262)

Summary:
Pull Request resolved: prestodb#25262

`APPROX_DISTINCT` operations on a conditional constant value (e.g. `APPROX_DISTINCT(IF(expr, 'abcd'))`) are more expensive than and functionally equivalent to `ARBITRARY(IF(expr, 1, 0))`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant conditional values with equivalent calls to `ARBITRARY`

This comes up in some automated queries

Differential Revision: D76161617
@hdikeman
Copy link
Contributor Author

I don't know how to resolve Sreeni's change request. He's out on vacation. I was told the easiest way around this is to close and reopen the PR. New one is here: #25428

@hdikeman hdikeman closed this Jun 24, 2025
hdikeman added a commit to hdikeman/presto that referenced this pull request Jun 25, 2025
…restodb#25262)

Summary:
Pull Request resolved: prestodb#25262

`APPROX_DISTINCT` operations on a conditional constant value (e.g. `APPROX_DISTINCT(IF(expr, 'abcd'))`) are more expensive than and functionally equivalent to `ARBITRARY(IF(expr, 1, 0))`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant conditional values with equivalent calls to `ARBITRARY`

This comes up in some automated queries

Differential Revision: D76161617
feilong-liu pushed a commit that referenced this pull request Jun 25, 2025
…25262)

Summary:
Pull Request resolved: #25262

`APPROX_DISTINCT` operations on a conditional constant value (e.g. `APPROX_DISTINCT(IF(expr, 'abcd'))`) are more expensive than and functionally equivalent to `ARBITRARY(IF(expr, 1, 0))`

Adding an optimizer rule to replace any `APPROX_DISTINCT` operations on constant conditional values with equivalent calls to `ARBITRARY`

This comes up in some automated queries

Differential Revision: D76161617
@prestodb-ci prestodb-ci mentioned this pull request Jul 28, 2025
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants