feat: Add subfields pushdown for cardinality function#26834
feat: Add subfields pushdown for cardinality function#26834kevintang2022 merged 1 commit intoprestodb:masterfrom
Conversation
Reviewer's GuideImplements subfield pushdown support for the cardinality() function by introducing a new StructureOnly subfield path element, a configurable optimizer/session property, and planner logic to extract and de-duplicate these subfields, with corresponding tokenizer and configuration tests plus planner tests for MAP/ARRAY usages. Sequence diagram for cardinality() subfield pushdown during planningsequenceDiagram
actor User
participant Session
participant SystemSessionProperties
participant FeaturesConfig
participant Planner as PushdownSubfields
participant Extractor as SubfieldExtractor
participant FR as FunctionResolution
participant TS as TableScanNode
User->>Session: submit query with cardinality(map_col)
Note over Session,FeaturesConfig: Session created with FeaturesConfig
Session->>SystemSessionProperties: get PUSHDOWN_SUBFIELDS_FOR_CARDINALITY
SystemSessionProperties-->>Session: isPushSubfieldsForCardinalityEnabled = true
Session->>Planner: plan(query)
Planner->>Extractor: new SubfieldExtractor(FunctionResolution, ExpressionOptimizer, ConnectorSession, FunctionAndTypeManager, Session)
Extractor->>SystemSessionProperties: isPushSubfieldsForCardinalityEnabled(Session)
SystemSessionProperties-->>Extractor: true
Planner->>Extractor: visitCall(cardinality(map_col), Context)
Extractor->>FR: isCardinalityFunction(call.functionHandle)
FR-->>Extractor: true
Extractor->>Extractor: check single argument and type MapType or ArrayType
Extractor->>Context: add Subfield(map_col, [StructureOnly])
Planner->>TS: collect required Subfields from Context
Planner->>Planner: removeRedundantStructureOnlySubfields(columnSubfields)
Planner-->>TS: set requiredSubfields without redundant StructureOnly
TS-->>User: execute scan reading only structure for map_col
Class diagram for Subfield, StructureOnly, and SubfieldTokenizer changesclassDiagram
class Subfield {
-String rootName
-List~PathElement~ path
+Subfield(String path)
+Subfield(String rootName, List~PathElement~ path)
+String getRootName()
+List~PathElement~ getPath()
+String toString()
+static PathElement noSubfield()
+static PathElement wildcard()
+static PathElement allSubscripts()
+static PathElement structureOnly()
}
class PathElement {
<<interface>>
+boolean isSubscript()
}
class StructureOnly {
-static StructureOnly STRUCTURE_ONLY
-StructureOnly()
+static StructureOnly getInstance()
+boolean isSubscript()
+String toString()
}
class NestedField {
+NestedField(String name)
+String getName()
+boolean isSubscript()
+String toString()
}
class Subscript {
+Subscript(long index)
+long getIndex()
+boolean isSubscript()
+String toString()
}
class AllSubscripts {
+static AllSubscripts getInstance()
+boolean isSubscript()
+String toString()
}
class Wildcard {
+static Wildcard getInstance()
+boolean isSubscript()
+String toString()
}
class NoSubfield {
+static NoSubfield getInstance()
+boolean isSubscript()
+String toString()
}
class SubfieldTokenizer {
-String path
-int index
+SubfieldTokenizer(String path)
+boolean hasNext()
+Subfield.PathElement next()
-Subfield.PathElement computeNext()
-Subfield.PathElement matchQuotedSubscript()
-Subfield.PathElement matchWildcardSubscript()
-Subfield.PathElement matchUnquotedSubscript()
-Subfield.PathElement matchDollarPathElement()
-Subfield.PathElement matchStructureOnlySubscript()
}
Subfield o-- PathElement
StructureOnly ..|> PathElement
NestedField ..|> PathElement
Subscript ..|> PathElement
AllSubscripts ..|> PathElement
Wildcard ..|> PathElement
NoSubfield ..|> PathElement
SubfieldTokenizer --> PathElement
SubfieldTokenizer ..> StructureOnly : creates
Subfield ..> StructureOnly : structureOnly()
Class diagram for planner/cardinality subfield pushdown configurationclassDiagram
class PushdownSubfields {
+PlanNode visitTableScan(TableScanNode node, RewriteContext~Context~ context)
-static List~Subfield~ removeRedundantStructureOnlySubfields(List~Subfield~ subfields)
}
class SubfieldExtractor {
-FunctionResolution functionResolution
-ExpressionOptimizer expressionOptimizer
-ConnectorSession connectorSession
-FunctionAndTypeManager functionAndTypeManager
-boolean isPushDownSubfieldsFromLambdasEnabled
-boolean isPushdownSubfieldsForMapFunctionsEnabled
-boolean isPushdownSubfieldsForCardinalityEnabled
+SubfieldExtractor(FunctionResolution functionResolution, ExpressionOptimizer expressionOptimizer, ConnectorSession connectorSession, FunctionAndTypeManager functionAndTypeManager, Session session)
+Void visitCall(CallExpression call, Context context)
}
class FunctionResolution {
+boolean isMapFilterFunction(FunctionHandle functionHandle)
+boolean isCardinalityFunction(FunctionHandle functionHandle)
+FunctionHandle lookupBuiltInFunction(String functionName, List~Type~ inputTypes)
}
class FeaturesConfig {
-boolean pushdownSubfieldForMapFunctions
-boolean pushdownSubfieldForCardinality
+boolean isPushdownSubfieldForMapFunctions()
+FeaturesConfig setPushdownSubfieldForMapFunctions(boolean value)
+boolean isPushdownSubfieldForCardinality()
+FeaturesConfig setPushdownSubfieldForCardinality(boolean value)
}
class SystemSessionProperties {
+static String PUSHDOWN_SUBFIELDS_FOR_MAP_FUNCTIONS
+static String PUSHDOWN_SUBFIELDS_FOR_CARDINALITY
+SystemSessionProperties(FeaturesConfig featuresConfig, boolean experimentalSyntaxEnabled, boolean distributedIndexJoinsEnabled, boolean optimizedHashGenerationEnabled)
+static boolean isPushSubfieldsForMapFunctionsEnabled(Session session)
+static boolean isPushSubfieldsForCardinalityEnabled(Session session)
}
class Session {
+T getSystemProperty(String name, Class~T~ type)
}
class CallExpression {
+FunctionHandle getFunctionHandle()
+List~RowExpression~ getArguments()
}
class Context {
+List~Subfield~ subfields
}
class VariableReferenceExpression {
+String getName()
+Type getType()
}
class MapType {
}
class ArrayType {
}
class Subfield {
+Subfield(String rootName, List~PathElement~ path)
}
PushdownSubfields ..> Subfield : uses
PushdownSubfields ..> Session : uses
PushdownSubfields ..> SubfieldExtractor : creates
SubfieldExtractor ..> FunctionResolution : uses
SubfieldExtractor ..> ExpressionOptimizer : uses
SubfieldExtractor ..> ConnectorSession : uses
SubfieldExtractor ..> FunctionAndTypeManager : uses
SubfieldExtractor ..> Session : reads properties
SubfieldExtractor ..> SystemSessionProperties : calls isPushSubfieldsForCardinalityEnabled
SubfieldExtractor ..> Subfield : creates
SubfieldExtractor ..> VariableReferenceExpression : casts
SubfieldExtractor ..> MapType : type check
SubfieldExtractor ..> ArrayType : type check
SubfieldExtractor ..> CallExpression : visitCall
SubfieldExtractor ..> Context : adds subfields
FunctionResolution ..> FunctionHandle
SystemSessionProperties ..> FeaturesConfig : constructed from
SystemSessionProperties ..> Session : reads system properties
FeaturesConfig ..> SystemSessionProperties : configures defaults
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- The new
testPushdownSubfieldsForCardinalitymethod contains several very similar cases (e.g., the*_where_limitand*_with_subscripttables/queries are effectively duplicates); consider consolidating or parameterizing these to keep the test concise and focused on distinct behaviors. - In
Subfield.StructureOnly, overridingisSubscript()to returntrueis a bit non-obvious; adding a brief comment or renaming the class to indicate that this is modeled as a special kind of subscript (e.g., “structure-only subscript”) would make its role clearer to future maintainers.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `testPushdownSubfieldsForCardinality` method contains several very similar cases (e.g., the `*_where_limit` and `*_with_subscript` tables/queries are effectively duplicates); consider consolidating or parameterizing these to keep the test concise and focused on distinct behaviors.
- In `Subfield.StructureOnly`, overriding `isSubscript()` to return `true` is a bit non-obvious; adding a brief comment or renaming the class to indicate that this is modeled as a special kind of subscript (e.g., “structure-only subscript”) would make its role clearer to future maintainers.
## Individual Comments
### Comment 1
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveLogicalPlanner.java:1642` </location>
<code_context>
}
+ @Test
+ public void testPushdownSubfieldsForCardinality()
+ {
+ Session cardinalityPushdown = Session.builder(getSession())
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test case with pushdown-subfields-for-cardinality disabled to prove no subfields are pushed in that mode
Since this test only covers the enabled case, please also add coverage for the disabled case by using a session with `PUSHDOWN_SUBFIELDS_FOR_CARDINALITY = false` and asserting that no `[$]` subfields are pushed for the same queries. This will verify that the session property correctly gates the optimization and guard against regressions where cardinality pushdown is applied unconditionally.
Suggested implementation:
```java
import static com.facebook.presto.SystemSessionProperties.PUSHDOWN_SUBFIELDS_FOR_CARDINALITY;
import static com.facebook.presto.SystemSessionProperties.PUSHDOWN_SUBFIELDS_FOR_MAP_FUNCTIONS;
```
To implement the request fully, add a new test method that mirrors `testPushdownSubfieldsForCardinality` but with `PUSHDOWN_SUBFIELDS_FOR_CARDINALITY` set to `"false"` and with assertions that no `[$]` subfields are pushed.
Place this new method next to the existing `testPushdownSubfieldsForCardinality` method in `TestHiveLogicalPlanner`:
```java
@Test
public void testPushdownSubfieldsForCardinalityDisabled()
{
Session noCardinalityPushdown = Session.builder(getSession())
.setSystemProperty(PUSHDOWN_SUBFIELDS_FOR_CARDINALITY, "false")
.build();
// Verify that MAP cardinality does not push down subfields when disabled
assertUpdate("CREATE TABLE test_pushdown_cardinality_map(id integer, x map(integer, double))");
assertPushdownSubfields(
noCardinalityPushdown,
"SELECT t.id, cardinality(x) FROM test_pushdown_cardinality_map t",
"test_pushdown_cardinality_map",
ImmutableMap.of());
assertUpdate("DROP TABLE test_pushdown_cardinality_map");
// Verify that ARRAY cardinality does not push down subfields when disabled
assertUpdate("CREATE TABLE test_pushdown_cardinality_array(id integer, x array(double))");
assertPushdownSubfields(
noCardinalityPushdown,
"SELECT t.id, cardinality(x) FROM test_pushdown_cardinality_array t",
"test_pushdown_cardinality_array",
ImmutableMap.of());
assertUpdate("DROP TABLE test_pushdown_cardinality_array");
}
```
Notes / assumptions you may need to adjust to your codebase:
1. This assumes that `assertPushdownSubfields(Session, String, String, Map<String, Subfield>)` already treats `ImmutableMap.of()` as “no subfields pushed”. If your helper uses a different convention (e.g., `null` or a dedicated `assertNoSubfieldsPushed(...)`), adapt the assertions accordingly.
2. The table names and query shapes are intentionally identical to those in `testPushdownSubfieldsForCardinality`, so the only difference between the tests is the session property value and the expected subfields map.
3. Ensure the new test is inside the `TestHiveLogicalPlanner` class body and near the existing `testPushdownSubfieldsForCardinality` so future maintainers can easily see the enabled/disabled pair.
</issue_to_address>
### Comment 2
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveLogicalPlanner.java:1690-1699` </location>
<code_context>
+ assertUpdate("CREATE TABLE test_pushdown_cardinality_where_limit(id integer, features map(varchar, double))");
</code_context>
<issue_to_address>
**issue (testing):** Drop the last three test tables to avoid state leakage between tests
The earlier cases create a table, assert behavior, and then immediately `DROP` it. These three tests create tables but never drop them, which risks state leakage and order-dependent failures. Please add matching `DROP TABLE` statements for each of these tables for consistency and isolation.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| assertUpdate("CREATE TABLE test_pushdown_cardinality_where_limit(id integer, features map(varchar, double))"); | ||
| assertPushdownSubfields(cardinalityPushdown, "SELECT t.id FROM test_pushdown_cardinality_where_limit t WHERE cardinality(features) > 10 and features['0'] > 0.0", "test_pushdown_cardinality_where_limit", | ||
| ImmutableMap.of("features", toSubfields("features[\"0\"]"))); | ||
|
|
||
| assertUpdate("CREATE TABLE test_pushdown_cardinality_with_subscript(id integer, features map(varchar, double))"); | ||
| assertPushdownSubfields(cardinalityPushdown, "SELECT t.id FROM test_pushdown_cardinality_with_subscript t WHERE cardinality(features) > 10 AND features['0'] > 0.0", "test_pushdown_cardinality_with_subscript", | ||
| ImmutableMap.of("features", toSubfields("features[\"0\"]"))); | ||
|
|
||
| assertUpdate("CREATE TABLE test_pushdown_cardinality_multi_subscript(id integer, features map(varchar, double))"); | ||
| assertPushdownSubfields(cardinalityPushdown, "SELECT t.id FROM test_pushdown_cardinality_multi_subscript t WHERE cardinality(features) > 5 AND features['a'] > 1.0 AND features['b'] < 2.0", "test_pushdown_cardinality_multi_subscript", |
There was a problem hiding this comment.
issue (testing): Drop the last three test tables to avoid state leakage between tests
The earlier cases create a table, assert behavior, and then immediately DROP it. These three tests create tables but never drop them, which risks state leakage and order-dependent failures. Please add matching DROP TABLE statements for each of these tables for consistency and isolation.
|
Please add documentation for the new session property |
389b500 to
fbbc6c2
Compare
Summary: Pull Request resolved: prestodb#26834 Reviewed By: feilong-liu Differential Revision: D88739396
Summary: Pull Request resolved: prestodb#26834 Reviewed By: feilong-liu Differential Revision: D88739396
fbbc6c2 to
890b91a
Compare
steveburnett
left a comment
There was a problem hiding this comment.
LGTM! (docs)
Pull branch, local doc build, looks good. Thanks!
Description
This PR adds subfield pushdown optimization for the
cardinality()function in Presto. When enabled, this optimization allows the query engine to skip reading map keys/values or array elements when only the cardinality (count) of these collections is needed.This PR contains coordinator-side changes only; the corresponding worker-side changes will be added separately to the C++ worker. Since this feature is not yet fully tested end-to-end with the worker, the session property is disabled by default.
Additionally, this implementation takes a conservative approach to subfield pushdown for cardinality: if a column already has other subfields being accessed (e.g.,
features['key']), we skip adding the structure-only subfield for cardinality to avoid potential correctness issues.Key Changes:
Motivation and Context
When queries only need to know the size of a map or array (e.g., `SELECT cardinality(features) FROM table or WHERE cardinality(tags) > 10), there's no need to read all the keys/values or both. This optimization helps reduce shuffles improve the query performance.
Impact
pushdown-subfield-for-cardinalityTest Plan
Added comprehensive unit tests in TestHiveLogicalPlanner.java covering:
Contributor checklist
Release Notes