Enable subfield pushdown for map_subset function#25394
Enable subfield pushdown for map_subset function#25394feilong-liu merged 1 commit intoprestodb:masterfrom
Conversation
f894a58 to
63eace6
Compare
63eace6 to
fb0de3c
Compare
| if (!isPushDownSubfieldsFromLambdasEnabled) { | ||
| context.setLambdaSubfields(Context.ALL_SUBFIELDS_OF_ARRAY_ELEMENT_OR_MAP_VALUE); | ||
| call.getArguments().forEach(argument -> argument.accept(this, context)); | ||
| // map_subset(feature, constant_array) is only accessing fields specified in feature map. |
There was a problem hiding this comment.
why does isPushDownSubfieldsFromLambdasEnabled have to be false?
There was a problem hiding this comment.
It does not need to, it's just the lambda handling part is so complex and I haven't figured out how to make it work for it. Since this isPushDownSubfieldsFromLambdasEnabled is default to false, and what is added here is more like a new feature, so I chose to fix for the case when isPushDownSubfieldsFromLambdasEnabled for now. We can create a new issue for it when this PR is merged.
There was a problem hiding this comment.
I just followed @kevintang2022 's suggestion and the fix worked for lambda too (and also added test)
| call.getArguments().forEach(argument -> argument.accept(this, context)); | ||
| // map_subset(feature, constant_array) is only accessing fields specified in feature map. | ||
| // For example map_subset(feature, array[1, 2]) is equivalent to calling element_at(feature, 1) and element_at(feature, 2) for subfield extraction | ||
| if (isPushdownSubfieldsForMapSubsetEnabled && functionResolution.isMapSubSetFunction(call.getFunctionHandle()) && call.getArguments().get(0) instanceof VariableReferenceExpression && call.getArguments().get(1) instanceof ConstantExpression) { |
There was a problem hiding this comment.
I am wondering if PushdownSubfields is the right place for this optimization because here only the context is changed and the name of this visitor also is SubfieldExtractor
There was a problem hiding this comment.
Yeah, I have been thinking to add a new optimizer which translates all map_subset(feature, array[1, 2, 3]) with constant array elements to something like map(array[1, 2, 3], array[element_at(feature, 1), element_at(feature, 2), element_at(feature, 3)]).
However then I realized that this rewrite will not be useful if there are other reference to this feature map. In order to make it work, we need to check for other non subfield access of feature, which is nontrivial and is basically duplicating most of the logic here.
And also I think it kind of make sense to put it here, as the optimizer is to extract subfield access and push it down, which is what map_subset is doing, i.e. accessing only a part of a map.
|
Thanks for the release note entry! Suggest adding a link to the doc. See Formatting in the Release Notes Guidelines. Also, can you describe what user-visible effect this might have? Reading the PR description I came up with this. Let me know what you think! |
|
I think there's a place in the same file PushdownSubfields.java that is doing something similar. Is there a way to add this change in the same place? It's in the toSubfield method |
16f3445 to
750a2e7
Compare
| } | ||
|
|
||
| private static Optional<Subfield> toSubfield( | ||
| private static Optional<List<Subfield>> toSubfield( |
There was a problem hiding this comment.
map_subset can return a list of subfields
@kevintang2022 Thanks for the suggestion, updated the PR to move the change here. |
hantangwangd
left a comment
There was a problem hiding this comment.
Thanks for the optimization, LGTM! There is an unused import in TestHiveLogicalPlanner that causes the checkstyle failure.
99cf482
750a2e7 to
99cf482
Compare
| if (index == null) { | ||
| return Optional.empty(); | ||
| } | ||
| if (index instanceof Number) { |
There was a problem hiding this comment.
i thought map_subset specifies map keys and not indexes? why can't they be negative? And why do they need to be numeric?
There was a problem hiding this comment.
If that is the case we cannot translate to elementAt function i think
There was a problem hiding this comment.
element_at for maps also takes a key and not an index (see docs here: https://prestodb.io/docs/current/functions/map.html). element_at returns the value for a given key. map_subset returns a subset of the map that includes only the entries with keys in the specified array.
There was a problem hiding this comment.
Nice catch, I have tried removing the restriction on negative values and adding support for varchar type here, and found that both can be successfully pushed down and got the expected results.
There was a problem hiding this comment.
i thought map_subset specifies map keys and not indexes?
Rename to mapKey
And why do they need to be numeric?
Add case for varchar keys
why can't they be negative?
You are right, the associated issue is for array, not for map
There was a problem hiding this comment.
By the way, since subfield pushdown of negative keys work for map, I drafted another PR #25445 to fix the current implementation for element_at and subscript functions.
| } | ||
|
|
||
| @Test | ||
| public void testPushdownSubfieldsForMapSubset() |
There was a problem hiding this comment.
can you add a test where the specified keys to subset are not also the index of that key? Also for non-numeric map keys.
There was a problem hiding this comment.
a test where the specified keys to subset are not also the index of that key
What do you mean by not also the index of that key? Do you mean the type of index in array is different from map key?
If map key is of integer type, and element in array is of bigint type, the map will be wrapped with a cast and thie pushdown field will not work (and this needs to be resolved with this PR #25395)
If element type of array is not compatible with map key type, for example integer vs. varchar, query will fail during during parsing
Also for non-numeric map keys.
Added tests for varchar keys too
e0d8dbe to
f1f5138
Compare
f1f5138 to
3928122
Compare
3928122 to
fd18a02
Compare
Description
MAP_SUBSET function is commonly used in ML workload, which is used to extract a subset of features. For example, map_subset(feature, array[301, 205]) is used to extract the features with key to be 301 and 205.
If this is the only case where feature is accessed, we only need to read two fields of the map. However currently PushDownSubfields does not recognize this pattern. In this PR I enabled subfield pushdown for map_subset when the input array is a constant array.
In one of the motivating queries we found, this enhancement reduce the resource usage by 17% (orig vs. improved)
Without change:
With change:
map2 := map2:map<int,int>:1:REGULAR (1:67)vs.map2 := map2:map<int,int>:1:REGULAR:[map2[1]] (1:67)Motivation and Context
As in description
Impact
In one of the motivating queries we found, this enhancement reduce the resource usage by 17%
Test Plan
unit tests, also end to end test locally
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.