ESQL: Fix incorrectly optimized fork with nullify unmapped_fields#143030
ESQL: Fix incorrectly optimized fork with nullify unmapped_fields#143030elasticsearchmachine merged 17 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
# Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
|
buildkite test this |
|
@astefan, I can not reproduce the failure, and it seems not related to this PR. |
|
buildkite test this |
|
Heya, in #141340, I'm adding a test that is failing due to this in GenerativeForkIT (build scan). I'll have to mute it in the other PR and it'd be super nice to unmute as part of this PR. In any case, will put the mute into Update: muting in c85e691 |
…fork-nullify-optimization
…fork-nullify-optimization
|
buildkite test this |
|
@alex-spies I've unmuted that test. From my local tests, it didn't complain. We'll see what CI says |
alex-spies
left a comment
There was a problem hiding this comment.
Thanks a lot @kanoshiou and @astefan !
The fix for Bug 2 was independently discovered by @idegtiarenko while fixing #141870. Apologies that I didn't spot that we're fixing the same issue in both PRs.
The fix for Bug 1 (keeping name ids for existing fork attributes while refreshing) looks super good to me. Thanks a lot!
| Set<String> childOutputNames = new HashSet<>(); | ||
| for (LogicalPlan child : plan.children()) { | ||
| for (Attribute attr : child.output()) { | ||
| childOutputNames.add(attr.name()); | ||
| } | ||
| } | ||
| unresolved.removeIf(ua -> childOutputNames.contains(ua.name())); | ||
|
|
||
| if (unresolved.isEmpty()) { | ||
| return plan; | ||
| } |
There was a problem hiding this comment.
Thanks, I agree that this is the right solution!
And I'm super sorry I didn't notice this earlier @astefan . Bug 2 from the PR description is indeed the same problem that @idegtiarenko fixed in #142300, and you two ended up implementing the very same solution.
The silver lining is that we very much agree on the solution. The added tests in this PR are great, too, as they highlight different queries where ImplicitCasting can introduce an intermediate step between ResolveRefs and ResolveUnmapped.
| required_capability: optional_fields_nullify_tech_preview | ||
| required_capability: fork_v9 | ||
| required_capability: fix_fork_unmapped_nullify |
There was a problem hiding this comment.
nit: I think we need fork_v9 to exclude this from CsvTests, but we should only require 1 of optional_fields_nullify_tech_preview and fix_fork_unmapped_nullify, no?
Also applies below.
| * Bug 2: After {@code ImplicitCasting} runs, the plan may remain unresolved because it requires a | ||
| * subsequent {@code ResolveRefs} pass to fully resolve. However, {@code ResolveUnmapped} runs before | ||
| * that second {@code ResolveRefs} pass and mistakenly treats those still-unresolved attributes as | ||
| * user-introduced unmapped fields, incorrectly nullifying valid references. |
There was a problem hiding this comment.
nit: unfortunately, this got stale. Bug 2 is already fixed.
| FIX_FULL_TEXT_FUNCTIONS_ON_RENAMED_FIELDS, | ||
|
|
||
| /** | ||
| * Fixes two independent analysis bugs in {@code FORK} with {@code unmapped_fields="nullify"}. |
| * matches one in {@code existingOutput}. Genuinely new attributes get fresh NameIds. | ||
| */ | ||
| public static List<Attribute> toReferenceAttributes(List<? extends NamedExpression> named) { | ||
| public static List<Attribute> toReferenceAttributes(List<? extends NamedExpression> named, List<Attribute> existingOutput) { |
There was a problem hiding this comment.
nit: method name doesn't really say what this does, now. Maybe toReferenceAttributesPreservingIds?
# Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/rules/ResolveUnmapped.java # x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/AnalyzerUnmappedTests.java
|
@astefan conflicts resolved |
|
buildkite test this |
…fork-nullify-optimization
|
buildkite test this |
…fork-nullify-optimization
|
buildkite test this |
|
|
||
| protected List<Attribute> refreshedOutput() { | ||
| return toReferenceAttributes(outputUnion(children())); | ||
| return toReferenceAttributesPreservingIds(outputUnion(children()), this.output()); |
There was a problem hiding this comment.
Hmm, this no longer refreshes the IDs. Meaning that some assumptions are either broken, or they were incorrect to have, or they're no longer valid (in the meantime).
There was a problem hiding this comment.
@bpintea can you expand on this, please? What use cases would this impact/break/change? (are we missing tests?)
| // We don't want to keep the same attributes that are outputted by the FORK branches. | ||
| // Keeping the same attributes can have unintended side effects when applying optimizations like constant folding. |
There was a problem hiding this comment.
This shouldn't be removed. The attribute IDs from within the branches are not kept / reused atop Fork.
|
Thanks for the review, @bpintea! I've updated the PR description to reflect your feedback. |
💔 Backport failed
You can use sqren/backport to manually backport by running |
|
I can help with the backport. |
|
To backport this PR, I think we need to wait for #144097 to be merged first. |
…elocations * upstream/main: (49 commits) CCS logging fixes (elastic#144070) Improve CPS cluster exclusion handling (elastic#143488) Remove snapshot condition now that node_reduce phase is in non-snapshot builds (elastic#144090) Drop deprecation warnings when updating a mapping in the cluster state applier (elastic#143884) (elastic#144040) Add ensureGreenAndNoInitializingShards helper (elastic#144044) Removed unnecessary applies_to blocks from deprecated query (elastic#144096) [CPS] Use single CrossProjectModeDecider instance (elastic#144030) Fix ESQL TS requests with LIMIT 0 (elastic#144031) ESQL: Remove `create` methods in aggs (elastic#144098) ES|QL: Refactor ChangeLimitOperator (elastic#144017) Add Paginated Hit Source Tests (elastic#142592) Fix test failure not preferred (elastic#144019) Remove serialization logic from EIS authorization response (elastic#144021) ESQL: CSV schema inference and parsing enhancements (elastic#144050) ESQL: Fix incorrectly optimized fork with nullify unmapped_fields (elastic#143030) Fix MMR release test using subqueries (elastic#144087) Refactoring `UserAgentPlugin` (elastic#140712) Drop non-finite samples in Prometheus remote write (elastic#144055) [TEST] Wait for internal inference indices to be created in authorization IT (elastic#143885) Disable ndjson datasource QA tests in release-tests (elastic#143992) ...
Thanks @kanoshiou ! It's merged! We may still see conflicts because #143399 was not backported. I don't think it should, though, as this one required a new transport version, increasing the cost of backports a bit. |
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
💚 All backports created successfully
Questions ?Please refer to the Backport tool documentation |
…astic#143030) This PR fixes a bug where `Fork.withSubPlans()` incorrectly reassigned new `NameId`s to its output attributes, breaking references in the upper plan. This issue specifically manifests when using `FORK` alongside the `SET unmapped_fields="nullify"` mode. By design, a `FORK` assigns new `NameId`s to its output attributes via `refreshOutput()` to decouple them from the internal branches. This isolation is necessary to prevent unintended side effects during plan optimizations, such as aggressive constant folding leaking across branches. However, the previous implementation unconditionally re-minted these `NameId`s every time `withSubPlans()` was called. Because of this, any node sitting above the `FORK` (like `EVAL` or `STATS`) that already held a reference to the initial `NameId`s would suddenly point to a nonexistent ID. Downstream analysis rules would then fail to resolve these orphaned references, causing the plan execution to fail with an *"optimized incorrectly due to missing references"* error. Fixes elastic#142762 (cherry picked from commit 5fb7136) # Conflicts: # muted-tests.yml # x-pack/plugin/esql/qa/server/src/main/java/org/elasticsearch/xpack/esql/qa/rest/generative/GenerativeRestTest.java # x-pack/plugin/esql/qa/testFixtures/src/main/resources/unmapped-nullify.csv-spec # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/rules/ResolveUnmapped.java # x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/AnalyzerUnmappedTests.java
…astic#143030) This PR fixes a bug where `Fork.withSubPlans()` incorrectly reassigned new `NameId`s to its output attributes, breaking references in the upper plan. This issue specifically manifests when using `FORK` alongside the `SET unmapped_fields="nullify"` mode. By design, a `FORK` assigns new `NameId`s to its output attributes via `refreshOutput()` to decouple them from the internal branches. This isolation is necessary to prevent unintended side effects during plan optimizations, such as aggressive constant folding leaking across branches. However, the previous implementation unconditionally re-minted these `NameId`s every time `withSubPlans()` was called. Because of this, any node sitting above the `FORK` (like `EVAL` or `STATS`) that already held a reference to the initial `NameId`s would suddenly point to a nonexistent ID. Downstream analysis rules would then fail to resolve these orphaned references, causing the plan execution to fail with an *"optimized incorrectly due to missing references"* error. Fixes elastic#142762 (cherry picked from commit 5fb7136)
…43030) (#144386) This PR fixes a bug where `Fork.withSubPlans()` incorrectly reassigned new `NameId`s to its output attributes, breaking references in the upper plan. This issue specifically manifests when using `FORK` alongside the `SET unmapped_fields="nullify"` mode. By design, a `FORK` assigns new `NameId`s to its output attributes via `refreshOutput()` to decouple them from the internal branches. This isolation is necessary to prevent unintended side effects during plan optimizations, such as aggressive constant folding leaking across branches. However, the previous implementation unconditionally re-minted these `NameId`s every time `withSubPlans()` was called. Because of this, any node sitting above the `FORK` (like `EVAL` or `STATS`) that already held a reference to the initial `NameId`s would suddenly point to a nonexistent ID. Downstream analysis rules would then fail to resolve these orphaned references, causing the plan execution to fail with an *"optimized incorrectly due to missing references"* error. Fixes #142762 (cherry picked from commit 5fb7136) Co-authored-by: kanoshiou <uiaao@tuta.io>
…astic#143030) This PR fixes a bug where `Fork.withSubPlans()` incorrectly reassigned new `NameId`s to its output attributes, breaking references in the upper plan. This issue specifically manifests when using `FORK` alongside the `SET unmapped_fields="nullify"` mode. By design, a `FORK` assigns new `NameId`s to its output attributes via `refreshOutput()` to decouple them from the internal branches. This isolation is necessary to prevent unintended side effects during plan optimizations, such as aggressive constant folding leaking across branches. However, the previous implementation unconditionally re-minted these `NameId`s every time `withSubPlans()` was called. Because of this, any node sitting above the `FORK` (like `EVAL` or `STATS`) that already held a reference to the initial `NameId`s would suddenly point to a nonexistent ID. Downstream analysis rules would then fail to resolve these orphaned references, causing the plan execution to fail with an *"optimized incorrectly due to missing references"* error. Fixes elastic#142762
This PR fixes a bug where
Fork.withSubPlans()incorrectly reassigned newNameIds to its output attributes, breaking references in the upper plan. This issue specifically manifests when usingFORKalongside theSET unmapped_fields="nullify"mode.By design, a
FORKassigns newNameIds to its output attributes viarefreshOutput()to decouple them from the internal branches. This isolation is necessary to prevent unintended side effects during plan optimizations, such as aggressive constant folding leaking across branches.However, the previous implementation unconditionally re-minted these
NameIds every timewithSubPlans()was called. Because of this, any node sitting above theFORK(likeEVALorSTATS) that already held a reference to the initialNameIds would suddenly point to a nonexistent ID. Downstream analysis rules would then fail to resolve these orphaned references, causing the plan execution to fail with an "optimized incorrectly due to missing references" error.Fixes #142762