ESQL: Skip nullifying aliases for Aggregate groups#141340

Open
bpintea wants to merge 16 commits into elastic:main from bpintea:fix/unmapped_no_grouping_aliases

@bpintea bpintea commented Jan 27, 2026

This re-introduces skipping the `UnresolvedAttribute`s that are
collected from `Aggregate#aggregates` and are `Alias`es in
`Aggregate#groupings`. In this case the alias itself, not its child, shows up
as an `UnresolvedAttribute`. It must not be nullified, since it will be
resolved later, as part of resolving the `Alias`'s child.

Otherwise, the null-alias can shadow attributes
produced by the source or by an `Eval`.

Relatedly, `Aggregate` resolution now skips resolving the aggregates against
those input attributes that share a name with the not-yet-resolved
`UnresolvedAttribute`s. Resolving against them was incorrect, but the problem
didn't occur before the unmapped-fields feature, since the resolution of the
`Aggregate` all happened in one Analyzer cycle (unlike with unmapped-fields).

Another fix skips the right-hand side of `Join`s when
introducing null-aliases.
Also, null-aliases are no longer introduced behind `Aggregate`s
if they would shadow an existing source attribute. (This is a moot change,
since in this case plan verification would fail anyway, but it is the correct
behavior.)
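
The skip described above can be illustrated with a minimal sketch. It is not the PR's code: `Grouping`, `aliasNames` and `toNullify` are hypothetical stand-ins that model attributes as plain strings, assuming only what the description states (an unresolved name that matches a grouping alias is left alone).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class GroupingAliasSkipSketch {
    // Hypothetical stand-in for one Aggregate grouping of the form `alias = child`.
    // `alias` is null when the grouping is a bare field reference.
    record Grouping(String alias, String child) {}

    // Collect the alias names used in the groupings.
    static Set<String> aliasNames(List<Grouping> groupings) {
        Set<String> names = new LinkedHashSet<>();
        for (Grouping g : groupings) {
            if (g.alias() != null) {
                names.add(g.alias());
            }
        }
        return names;
    }

    // Keep only the unresolved names that are NOT grouping aliases;
    // those are the candidates for null-aliasing in this simplified model.
    static List<String> toNullify(List<String> unresolvedNames, List<Grouping> groupings) {
        Set<String> aliases = aliasNames(groupings);
        List<String> result = new ArrayList<>();
        for (String name : unresolvedNames) {
            if (aliases.contains(name) == false) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Models: ... | STATS c = COUNT(*) BY x = does_not_exist
        // where the aggregates list refers back to the grouping alias `x`.
        List<Grouping> groupings = List.of(new Grouping("x", "does_not_exist"));
        List<String> unresolved = List.of("x", "does_not_exist");
        System.out.println(toNullify(unresolved, groupings)); // [does_not_exist]
    }
}
```

In this model, nullifying `x` as well would create a null column named `x` that could shadow a same-named attribute coming from the source or an `Eval`, which is the side effect the description warns about.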

bpintea and others added 7 commits January 26, 2026 12:11
This fixes the generation of name IDs for the attributes that correspond
to unmapped fields and are pushed to different branches of a UnionAll.

So far, one set of IDs was generated and reused for all subplans. Now
each subplan gets its own set.
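
A rough sketch of the per-subplan idea, under stated assumptions: IDs are modeled as `long`s from a global counter, and `freshIds` / `idsPerBranch` are hypothetical names, not the ESQL API. It only shows that each branch receives its own ID set instead of sharing one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class PerBranchIdsSketch {
    // Hypothetical global ID source; the real code uses its own name-ID type.
    private static final AtomicLong COUNTER = new AtomicLong();

    // Generate one fresh ID per unmapped field name.
    static Map<String, Long> freshIds(List<String> fieldNames) {
        Map<String, Long> ids = new LinkedHashMap<>();
        for (String name : fieldNames) {
            ids.put(name, COUNTER.incrementAndGet());
        }
        return ids;
    }

    // Call freshIds once per branch, so no two branches share an ID.
    static List<Map<String, Long>> idsPerBranch(int branches, List<String> fieldNames) {
        List<Map<String, Long>> result = new ArrayList<>();
        for (int i = 0; i < branches; i++) {
            result.add(freshIds(fieldNames));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> ids = idsPerBranch(2, List.of("does_not_exist"));
        // The same field name maps to a different ID in each branch.
        System.out.println(ids.get(0).get("does_not_exist").equals(ids.get(1).get("does_not_exist"))); // false
    }
}
```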

A minor collateral proposed change: the CSV spec-based tests skipped due
to missing capabilities are now logged.
This re-introduces skipping the `UnresolvedAttribute`s that are
collected from the `Aggregate#aggregates` and are `Alias`es in the
`#groupings`. In this case, the aliases, not their children, result in an
`UnresolvedAttribute`. These must not be nullified, since they'll be
subsequently resolved as part of the `Alias`'s child resolution.

The side effect of doing otherwise is that they can shadow attributes
produced by the source or `Eval`'d.

Relatedly, `Aggregate` resolution now skips resolving the aggregates based on
those input attributes that share a name with the not-yet-resolved
`UnresolvedAttribute`s. Doing so was incorrect, but the problem didn't occur
before the unmapped-fields feature, since the resolution of the `Aggregate` all
happened in one Analyzer cycle (unlike with unmapped-fields).
- LookupJoin no longer has Evals inserted on the right-hand side
- nullifying now only iterates the plan once
- make use of the new transformDownSkipBranch
- avoid shadowing source attributes behind Aggregates
- check resulting plan on statement analysis
@bpintea bpintea added auto-backport Automatically create backport pull requests when merged and removed WIP labels Jan 30, 2026
@bpintea bpintea requested review from GalLalouche, alex-spies and astefan and removed request for GalLalouche January 30, 2026 06:09
@elasticsearchmachine

Hi @bpintea, I've created a changelog YAML for you.

@bpintea bpintea marked this pull request as ready for review January 30, 2026 06:10
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 30, 2026
@elasticsearchmachine

Pinging @elastic/es-analytical-engine (Team:Analytics)

@astefan astefan left a comment

Left a few comments. I still need to look over the tests in detail.

* @return the names of the aliases used in the grouping expressions of any Aggregate found in the plan.
*/
private static Set<String> aliasNamesInAggregateGroupings(LogicalPlan plan) {
Set<String> aliasNames = new LinkedHashSet<>();

I don't think you need a LinkedHashSet here, unless I'm missing something. A simple HashSet should be enough.
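
To make the suggestion concrete, here is a small sketch, assuming the set is only used for membership checks (an assumption; the PR may rely on ordering elsewhere). The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SetChoiceSketch {
    // Enough when callers only ever ask `contains(name)`.
    static Set<String> membershipOnly(List<String> names) {
        return new HashSet<>(names);
    }

    // Only needed when iteration must follow insertion order.
    static Set<String> orderPreserving(List<String> names) {
        return new LinkedHashSet<>(names);
    }

    public static void main(String[] args) {
        List<String> names = List.of("b", "a", "c", "a");
        // Both sets hold the same elements and answer contains() identically.
        System.out.println(membershipOnly(names).equals(orderPreserving(names))); // true
        // Only the LinkedHashSet guarantees this iteration order.
        System.out.println(orderPreserving(names)); // [b, a, c]
    }
}
```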

*/
private static List<UnresolvedAttribute> collectUnresolved(LogicalPlan plan) {
var aliasedGroupings = aliasNamesInAggregateGroupings(plan);
List<UnresolvedAttribute> unresolved = new ArrayList<>();

Here you could also build a LinkedHashSet directly and not call unresolvedLinkedSet (which is used only once).

}

private static List<Alias> removeShadowing(List<Alias> aliases, List<Attribute> exclude) {
Set<String> excludeNames = new HashSet<>(Expressions.names(exclude));

I am wondering, a wild thought, if the exclude list should skip synthetic attributes...
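
As a way to picture that thought, here is a hedged sketch of a name-based shadowing filter with an optional "ignore synthetic attributes" switch. `Attr`, the `synthetic` flag and the `ignoreSynthetic` parameter are hypothetical; the snippet above only shows that exclusion is done by name.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class RemoveShadowingSketch {
    // Hypothetical stand-in for an attribute: a name plus a synthetic marker.
    record Attr(String name, boolean synthetic) {}

    // Drop every alias whose name collides with an excluded attribute.
    // With ignoreSynthetic = true, synthetic attributes do not count as collisions.
    static List<String> removeShadowing(List<String> aliasNames, List<Attr> exclude, boolean ignoreSynthetic) {
        Set<String> excludeNames = new HashSet<>();
        for (Attr a : exclude) {
            if (ignoreSynthetic && a.synthetic()) {
                continue;
            }
            excludeNames.add(a.name());
        }
        return aliasNames.stream()
            .filter(name -> excludeNames.contains(name) == false)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> aliases = List.of("a", "$$tmp", "b");
        List<Attr> exclude = List.of(new Attr("a", false), new Attr("$$tmp", true));
        System.out.println(removeShadowing(aliases, exclude, false)); // [b]
        System.out.println(removeShadowing(aliases, exclude, true));  // [$$tmp, b]
    }
}
```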

var transformed = load ? load(plan, unresolvedLinkedSet) : nullify(plan, unresolvedLinkedSet);

-return transformed.equals(plan) ? plan : refreshPlan(transformed, unresolved);
+return transformed == plan ? plan : refreshPlan(transformed, unresolved);

Unrelated to this change: I found the refreshUnresolved(LogicalPlan plan, List<UnresolvedAttribute> unresolved) method to be unnecessary, imho. I guess it boils down to one's style/preference; for me the code is too fragmented in a few places, with methods that are called only once. refreshPlan would imho describe the logic better if all the code were in that method.

@astefan astefan left a comment

Tests do also LGTM

* max(@timestamp){r}#90, language_name{f}#57, does_not_exist1{r}#109, $$does_not_exist1$converted_to$long{r$}#149,
* does_not_exist2{r}#154]]
* \_Eval[[TOLONG(does_not_exist1{r}#109) AS $$does_not_exist1$converted_to$long#149]]
* \_Eval[[null[KEYWORD] AS languageName#89, null[DATETIME] AS max(@timestamp)#90]]

Unrelated to this PR.
We create null columns for fields that are not common to the subqueries, but for fields that are common and have different data types we seem to pick a single common data type.

For example (from our own test data):

FROM (FROM sample_data metadata _index
      | STATS cnt = count(*) by _index, client_ip )
   , (FROM sample_data_str metadata _index
      | STATS cnt = count(*) by _index, client_ip )
     metadata _index

has results with

        {
            "name": "client_ip",
            "type": "keyword"
        }

But, if I use FROM sample_data, sample_data_str, I get

            "name": "client_ip",
            "type": "unsupported",
            "original_types": [
                "ip",
                "keyword"
            ],
            "suggested_cast": "keyword"

I don't remember the sub-queries functionality well enough, but it's a bit surprising (to me at least). Given that unmapped fields will treat a nonexistent field as either NULL type or KEYWORD type, I am wondering whether users would expect some kind of SET to act on the scenario above, similarly to how the unmapped-fields functionality behaves.

&& (nAry instanceof Join == false || child == ((Join) nAry).left())) {
assertSourceType(source);
var nullAliases = removeShadowing(nullAliases(unresolved), source.output());
child = new Eval(source.source(), source, nullAliases);

While trying to break the logic here (there doesn't seem to be a way to protect the Eval constructor when nullAliases is empty, and removeShadowing could change its content), I ran into an issue with the following query:

from employees | eval does_not_exist = does_not_exist2 | mv_expand does_not_exist | keep does_not_exist*

which results in

                "type": "illegal_state_exception",
                "reason": "Found 1 problem\nline 4:85: Plan [ProjectExec[[<no-fields>{r$}#81]]] optimized incorrectly due to missing references [<no-fields>{r$}#81]",

My initial attempt was with

from employees | eval does_not_exist = does_not_exist2 | mv_expand does_not_exist | keep does_not_exist* | stats count(*)

which returned 0, which is not correct.
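
For the empty-`nullAliases` concern, one possible guard is sketched below with hypothetical stand-in types (`Plan`, `Source` and `Eval` here are simplified records, not the ESQL classes): the source is only wrapped when at least one null-alias survives the shadowing filter. Whether such a guard has any bearing on the failing queries above is not established here.

```java
import java.util.List;

public class EmptyEvalGuardSketch {
    // Simplified, hypothetical plan nodes for illustration only.
    interface Plan {}
    record Source(String index) implements Plan {}
    record Eval(Plan child, List<String> nullAliases) implements Plan {}

    // Return the source unchanged when there is nothing to alias,
    // so that no Eval with an empty field list is ever constructed.
    static Plan withNullAliases(Source source, List<String> nullAliases) {
        if (nullAliases.isEmpty()) {
            return source;
        }
        return new Eval(source, List.copyOf(nullAliases));
    }

    public static void main(String[] args) {
        Source employees = new Source("employees");
        System.out.println(withNullAliases(employees, List.of()) == employees);                     // true
        System.out.println(withNullAliases(employees, List.of("does_not_exist2")) instanceof Eval); // true
    }
}
```

Returning the same instance in the empty case also keeps a reference-equality check on the transformed plan meaningful.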

Labels

:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.3.1 v9.4.0

4 participants