ES|QL: Add MV_UNION Function #139664

Merged
mridula-s109 merged 22 commits into elastic:main from mridula-s109:mridula-s109/add_MV_UNION_function_esql
Dec 22, 2025

Conversation

@mridula-s109
Contributor

@mridula-s109 mridula-s109 commented Dec 17, 2025

related: #139298
Description:

  Adds MV_UNION function to ES|QL.
  
  Returns all unique values from both input multi-valued fields (set union).
  
  Example:
  Given set A = [1, 2, 3] and set B = [2, 3, 4]
  MV_UNION(A, B) returns [1, 2, 3, 4]
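
  As a rough illustration of the set-union semantics in the example above (not the code added in this PR), here is a minimal sketch using plain Java collections. The class name `MvUnionSketch` and the method `union` are hypothetical, and keeping values in first-seen order is an assumption; the real function works on ES|QL blocks rather than lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class MvUnionSketch {
    // Hypothetical helper: unique values of a followed by unseen values of b.
    static <T> List<T> union(List<T> a, List<T> b) {
        Set<T> result = new LinkedHashSet<>(a); // drops duplicates, keeps first-seen order
        result.addAll(b);                       // values already present are ignored
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        System.out.println(union(List.of(1, 2, 3), List.of(2, 3, 4))); // prints [1, 2, 3, 4]
    }
}
```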

@mridula-s109 mridula-s109 self-assigned this Dec 17, 2025
@mridula-s109 mridula-s109 added the >enhancement, :Analytics/ES|QL, and Team:Search Relevance labels Dec 17, 2025
@elasticsearchmachine
Collaborator

Hi @mridula-s109, I've created a changelog YAML for you.

github-actions bot commented Dec 17, 2025

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

@mridula-s109 mridula-s109 requested review from a team and ioanatia December 18, 2025 09:41
@mridula-s109 mridula-s109 added the Team:SearchOrg and Team:Analytics labels and removed the Team:SearchOrg label Dec 18, 2025
@mridula-s109 mridula-s109 marked this pull request as ready for review December 18, 2025 09:44
@elasticsearchmachine elasticsearchmachine removed the Team:Search Relevance label Dec 18, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Contributor

@ioanatia ioanatia left a comment

This looks great! I had one question about the behaviour we want when dealing with nulls.
One aspect we can consider as a follow-up is reducing the duplication with mv_intersection.
We could have an MvSetOperationFunction base class, or something similar, that both MvUnion and MvIntersect inherit from.
This would also be helpful if we want to implement another set operation such as set difference: mv_difference (we would have to think of the right name).
But I think what we have is good enough for now and we don't need to go through this refactor.

@markjhoy might also want to take a look since he recently did the mv_intersection function

@markjhoy
Contributor

markjhoy commented Dec 18, 2025

I think it's a good idea to always try to de-duplicate code. With the generators in ESQL and the process methods being static, it gets a bit tricky, though. One thing we could do to get at least some commonality concerns the processUnionSet method in MV_UNION (or processIntersectionSet in MV_INTERSECTION): a lot of the code for getting the values and indices is common to the two, so maybe a helper function something like this (it might look messy, but just as an idea):

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Consumer;
import org.elasticsearch.compute.data.Block;

<T> void processFieldSets(
    Block.Builder builder,
    int position,
    Block field1,
    Block field2,
    BiFunction<Integer, Block, T> getValueFunction,
    BiFunction<Set<T>, Set<T>, Set<T>> combinationFunction,
    Consumer<T> addValueFunction
) {
    int firstValueCount = field1.getValueCount(position);
    int secondValueCount = field2.getValueCount(position);

    // If either field has no values (is null), return null.
    // This behaviour differs between union and intersection, so the
    // "generic" version could simply drop this short-circuit block.
    if (firstValueCount == 0 || secondValueCount == 0) {
        builder.appendNull();
        return;
    }

    int firstValueIndex = field1.getFirstValueIndex(position);
    int secondValueIndex = field2.getFirstValueIndex(position);

    // Use LinkedHashSet to maintain insertion order
    Set<T> firstSet = new LinkedHashSet<>();

    // Add all values from the first field
    for (int i = 0; i < firstValueCount; i++) {
        firstSet.add(getValueFunction.apply(firstValueIndex + i, field1));
    }

    // Add all values from the second field (duplicates within the field are ignored by the Set)
    Set<T> secondSet = new LinkedHashSet<>();
    for (int i = 0; i < secondValueCount; i++) {
        secondSet.add(getValueFunction.apply(secondValueIndex + i, field2));
    }

    // Delegate the actual set operation (union, intersection, ...) to the caller
    Set<T> combinedSet = combinationFunction.apply(firstSet, secondSet);

    if (combinedSet.isEmpty()) {
        builder.appendNull();
        return;
    }

    // Build result from the combined set
    builder.beginPositionEntry();
    for (T value : combinedSet) {
        addValueFunction.accept(value);
    }
    builder.endPositionEntry();
}

Where the combinationFunction here would return the union of the two sets (and in MV_INTERSECTION would return the intersection)...

This would perhaps make it easier to extend to other set operations in the future (e.g. MV_DIFFERENCE, MV_COMPLEMENT, etc.)
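
To make the idea of a pluggable combinationFunction a little more concrete, here is a minimal sketch of what such functions could look like for union, intersection, and difference. The class `SetCombinators` and its method names are hypothetical and not part of this PR; they only show that each operation fits the `BiFunction<Set<T>, Set<T>, Set<T>>` shape used by the helper.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.BiFunction;

public class SetCombinators {
    // Hypothetical: values of a, then values of b not already present.
    static <T> BiFunction<Set<T>, Set<T>, Set<T>> union() {
        return (a, b) -> {
            Set<T> out = new LinkedHashSet<>(a);
            out.addAll(b);
            return out;
        };
    }

    // Hypothetical: values of a that also appear in b.
    static <T> BiFunction<Set<T>, Set<T>, Set<T>> intersection() {
        return (a, b) -> {
            Set<T> out = new LinkedHashSet<>(a);
            out.retainAll(b);
            return out;
        };
    }

    // Hypothetical: values of a that do not appear in b.
    static <T> BiFunction<Set<T>, Set<T>, Set<T>> difference() {
        return (a, b) -> {
            Set<T> out = new LinkedHashSet<>(a);
            out.removeAll(b);
            return out;
        };
    }
}
```

Each function copies its first argument before modifying it, so the input sets are left untouched.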

UPDATE:

(Edit: I just realized that when both fields are null, I would think the result should be null. However, if one of the fields is null and the other has values, the union should return those values.)

@markjhoy markjhoy self-requested a review December 18, 2025 16:27
@markjhoy markjhoy dismissed their stale review December 18, 2025 16:27

removing the blocker based on Craig's input

mridula-s109 and others added 5 commits December 19, 2025 15:42
…expression/function/scalar/multivalue/MvUnion.java

Co-authored-by: Liam Thompson <leemthompo@gmail.com>
…expression/function/scalar/multivalue/MvUnion.java

Co-authored-by: Liam Thompson <leemthompo@gmail.com>
@mridula-s109
Contributor Author

@ioanatia, @markjhoy Great suggestion. I agree that the code deduplication would be valuable for maintainability and for future set operations like MV_DIFFERENCE. Having said that, I will limit the scope of this PR and create a follow-up issue to refactor both MV_UNION and MV_INTERSECTION to share a common helper, which would also make adding new set operations easier.

@craigtaverner I have also updated the MV_UNION function to treat null as an empty set if only one of the input values is null.
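
For readers following the null discussion, the behaviour described here can be sketched roughly as follows. This is an illustration with plain Java lists, not the merged implementation: the class `MvUnionNullSketch` is hypothetical, a missing multi-value is modelled as a Java `null`, and returning null when both inputs are null is an assumption taken from the earlier review comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class MvUnionNullSketch {
    // Hypothetical helper: a null input stands for a field with no values.
    static <T> List<T> union(List<T> a, List<T> b) {
        if (a == null && b == null) {
            return null; // assumption: both inputs null gives null
        }
        Set<T> result = new LinkedHashSet<>();
        if (a != null) {
            result.addAll(a); // a single null input is treated as an empty set
        }
        if (b != null) {
            result.addAll(b);
        }
        return new ArrayList<>(result);
    }
}
```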

Contributor

@markjhoy markjhoy left a comment

👍 LGTM

Labels

:Analytics/ES|QL, >enhancement, Team:Analytics, v9.4.0

6 participants