ESQL: Prevent pushdown of unmapped fields in filters and sorts#143460
ESQL: Prevent pushdown of unmapped fields in filters and sorts#143460GalLalouche merged 54 commits intoelastic:mainfrom
Conversation
PotentiallyUnmappedKeywordEsField (used for fields loaded from _source when unmapped_fields="load") had isAggregatable=true, which caused isPushableFieldAttribute to short-circuit past the SearchStats check and push filters down to Lucene. On shards where the field is not indexed, the Lucene query returns no results instead of letting the compute engine evaluate the filter on _source-loaded values. Guard isPushableFieldAttribute against PotentiallyUnmappedKeywordEsField so these fields are always filtered in the compute engine. Co-authored-by: Cursor <cursoragent@cursor.com>
The isPushableFieldAttribute fix (rejecting PotentiallyUnmappedKeywordEsField) already covers sort pushdown since PushTopNToSource uses the same method. This test verifies correct sort order when a field is mapped in one index but unmapped in another under unmapped_fields="load". Co-authored-by: Cursor <cursoragent@cursor.com>
Remove _index from the SORT clause — it prevented the entire sort from being pushed to Lucene (canPushDownOrders requires all fields pushable), masking the bug. With only SORT message, the sort is pushed down and produces wrong order on the unmapped shard without the fix. Co-authored-by: Cursor <cursoragent@cursor.com>
Capture single-index unmapped behavior for nullify/load and switch GoldenTestCase to read unmapped settings from parsed statements so SET-based golden tests run like type_conflicts. Co-authored-by: Cursor <cursoragent@cursor.com> Made-with: Cursor
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Hi @GalLalouche, I've created a changelog YAML for you. |
alex-spies
left a comment
There was a problem hiding this comment.
LGTM except some minor test suggestions (and rebasing on the other PR that needs to precede this).
x-pack/plugin/esql/qa/testFixtures/src/main/resources/unmapped-load.csv-spec
Outdated
Show resolved
Hide resolved
| required_capability: unmapped_fields | ||
| required_capability: optional_fields | ||
| required_capability: field_alias_support |
There was a problem hiding this comment.
One capability is probably enough, and it should be a new capability added in this PR to disable bwc tests with earlier versions that can't properly run the added tests. (Serverless + 9.3 bwc tests should fail.)
OPTIONAL_FIELDS_V2, for instance.
| fieldAliasAndNonExistent | ||
| required_capability: unmapped_fields | ||
| required_capability: optional_fields | ||
| required_capability: field_alias_support |
There was a problem hiding this comment.
I think field_alias_support is a hallucination, which means this test doesn't actually run at all.
I thought that @idegtiarenko put in a check that prevents using non-existant caps? Please double check that :)
| FROM sample_data, no_mapping_sample_data METADATA _index | ||
| | KEEP _index, message | ||
| | WHERE message == "Connection error?" | ||
| | SORT _index |
There was a problem hiding this comment.
Slightly irrelevant sorting, but not wrong
| SET unmapped_fields="load"\; | ||
| FROM sample_data, no_mapping_sample_data METADATA _index | ||
| | KEEP _index, message | ||
| | SORT message |
There was a problem hiding this comment.
For good measure, let's maybe add a test that sorts by multiple fields, and expressions based on unmapped fields. Also a case where an unmapped/partially mapped field is cast to a long or so.
There was a problem hiding this comment.
nit: The location of the query.esql is a little inconsistent. Let's make it consistent :) There should be a separate query.esql for nullify with a separate SET as well.
| @@ -0,0 +1,4 @@ | |||
| SET unmapped_fields="load"; FROM sample_data | |||
| | KEEP message, does_not_exist | |||
| | WHERE does_not_exist::KEYWORD == "Connection error?" | |||
There was a problem hiding this comment.
Interesting to see: a case where a pushable and non-pushable (because unmapped) expression are conjuncted.
Similarly for the expressions in the SORT.
| \_TopNExec[[Order[does_not_exist{r}#1,ASC,LAST]],5[INTEGER],70] | ||
| \_EvalExec[[null[NULL] AS does_not_exist#1]] |
There was a problem hiding this comment.
TopN by a constant can be optimized to a limit, but that's a different issue.
@GalLalouche, could you please open an issue and track it internally so we consider this optimization before nullify's GA?
| | KEEP message | ||
| | WHERE message == "Connection error?" |
There was a problem hiding this comment.
Let's have a version of this that just does the filtering, no sort.
There was a problem hiding this comment.
Generally, tests look very nice, but we can try and add a couple more funky cases. E.g. more complex filters (also disjunctions with mixed pushable/non-pushable), more complex sorts; even moreso than in the spec tests.
- Add OPTIONAL_FIELDS_V2 capability for pushdown-elimination BWC. - Remove fieldAliasAndNonExistent test; fix capabilities on pushdown tests. - Remove SORT from filterOnPartiallyUnmappedField; add fully unmapped multi-sort, and cast tests in unmapped-load.csv-spec. - Add comment in LucenePushdownPredicates for PotentiallyUnmappedKeywordEsField. - Add cross-refs between PushdownGoldenTests and LocalPhysicalPlanOptimizerTests. - Fix assumeTrue: use OPTIONAL_FIELDS_V2 for load, remove redundant nullify. - Write query.esql to nested path in GoldenTestCase. This was actually a pretty "noisy" change, since it also affect SubstituteRoundToGoldenTests! - Add filter-only, conjunction pushable/non-pushable golden tests. Made-with: Cursor
…ton-based queries (elastic#142150)
Datasource tests fail on Windows CI and FIPS CI builds due to two independent issues introduced with the external sources feature. **Windows:** `StoragePath.of()` cannot parse `file://` URIs with Windows drive letters. A path like `file://C:\bk\path\file.txt` causes the colon after the drive letter `C` to be misinterpreted as a port separator, resulting in `NumberFormatException`. Both the production code (`LocalStorageProvider.toStoragePath()`) and the tests construct file URIs via manual string concatenation instead of using the existing `StoragePath.fileUri()` helper that normalizes Windows paths correctly. **FIPS:** `ExternalDistributedSpecIT` starts a test cluster with `xpack.security.enabled=false`, but FIPS mode requires security to be enabled. The Elasticsearch process dies during startup before any test method runs. Since the test relies on plain HTTP S3 fixtures that are inherently incompatible with FIPS, the test is now skipped in FIPS mode. Developed using AI-assisted tooling
…l sources (elastic#143420) * ESQL: Add extended distribution tests and fault injection for external sources Extends the test coverage for external source distributed execution with property tests for weighted round-robin and coalescing, and adds fault injection infrastructure for testing resilience under storage failures. - ExtendedDistributionPropertyTests: weighted RR load balancing bounds, coalescing preservation, coalesced+distribution integration - FaultInjectionRetryTests: retry policy behavior under transient and persistent fault patterns (503, connection reset, timeout) - FaultInjectingS3HttpHandler: configurable S3 fault injection with countdown, path filtering, and auto-clearing - FaultInjectingS3HttpHandlerIT: real HTTP server tests for the handler Developed using AI-assisted tooling * Update docs/changelog/143420.yaml
* Fix CSV-escaped quotes rendering in generated docs examples
DocsV3Support.renderTableLine() now unescapes RFC 4180 CSV quoting
(strips outer quote delimiters and replaces "" with ") so that JSON
strings in function example tables render correctly — e.g.
{"key":"value"} instead of {""key"":""value""}.
Affects json_extract and to_tdigest doc examples.
* Refine CSV unescaping to only unescape RFC 4180 doubled quotes; add tests
The previous approach stripped outer quotes from ALL quoted values, breaking
simple quoted values like "POINT(...)" and "foo". Now only cells with actual
RFC 4180 doubled-quote escaping ("") are unescaped, leaving simple quoted
values unchanged.
Added tests: testRenderingExampleResultCsvJsonUnescaping verifies JSON
unescaping works, testRenderingExampleResultSimpleQuotesPreserved verifies
simple quoted values are not modified.
Also adds changelog YAML for the PR.
…stic#143402) * Fix ES|QL function list versioning metadata Audit all _snippets/lists/ files against Java @FunctionAppliesTo annotations. Adds missing applies_to tags, corrects wrong versions, and applies cumulative preview→ga tags where functions graduated. Also adds missing applies_to front matter to time-series-aggregation-functions.md landing page. * update tags on commands lists
…rsionTests testLoadAll elastic#143471
Stopping the transform at the end of test, before the reset, can help other tests running in parallel. Resolve elastic#122980
Update the pattern_text downgrade test so that it includes adding docs and querying for docs.\ Specifically the test now does the following: 1. Create data stream with trial license and pattern text field 2. Index docs, verify they're searchable. 3. Downgrade to basic license. 4. Index more docs in same backing index, verify all docs searchable, verify settings unchanged. 5. Rollover the data stream, verify the new backing index has disable_templating=true. 6. Index more docs into the new backing index, verify all docs searchable across both indices. 7. Search with "fields": ["pattern_field"] to verify the valueFetcher() code path works across both backing indices.
Replace command parsing with regex that checks whole query for METADATA _index. Use static Pattern to avoid recompilation. Made-with: Cursor
30a7824 to
1db923f
Compare
|
Important Review skippedAuto reviews are limited based on label configuration. 🏷️ Required labels (at least one) (2)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…ic#143460) This PR fixes a bug in the behavior of SET UNMAPPED_FIELDS=LOAD where a potentially unmapped field would be pushed down to Lucene, and, when it doesn't exist in the mapping, would cause wrong results. Resolves: elastic#141920, elastic#141925.
…locations * upstream/main: (153 commits) ES|QL: Update docs for TOP_SNIPPETS and DECAY (elastic#143739) Correctly include endpoint id in log msg in AuthorizationPoller (elastic#143743) Bar searching or sorting on _seq_no when disabled (elastic#143600) Generalize `testClientCancellation` test (elastic#143586) JSON_EXTRACT: zero-copy byte slicing for object, array, and number extraction (elastic#143702) Track recycler pages in circuit breaker (elastic#143738) [ESQL] Enable distributed pipeline breakers for external sources via FragmentExec (elastic#143696) Adding 'mode' and 'codec' fields to ES monitoring template (elastic#143673) [ESQL] Columnar I/O and vectorized block conversion for external sources (elastic#143703) Fix flaky MMR diversification YAML tests (elastic#143706) ES|QL codegen: check builder arguments for vector support (elastic#143724) Add Views Security Model (elastic#141050) ESQL: Prevent pushdown of unmapped fields in filters and sorts (elastic#143460) Don't run seq_no pruning tests in release CI (elastic#143725) ESQL: Support intra-row field references in ROW command (elastic#140217) ES|QL: Remove implicit limit in FORK branches in CSV tests (elastic#143601) IndexRoutingTests with and without synthetic id (elastic#143566) Synthetic id upgrade test in serverless (elastic#142471) Disable "Review skipped" comments for PRs without specified labels (elastic#143728) Cleanup ES|QL T-Digest code duplication, add memory accounting (elastic#143662) ...
…ic#143460) This PR fixes a bug in the behavior of SET UNMAPPED_FIELDS=LOAD where a potentially unmapped field would be pushed down to Lucene, and, when it doesn't exist in the mapping, would cause wrong results. Resolves: elastic#141920, elastic#141925.
This PR fixes a bug in the behavior of
SET UNMAPPED_FIELDS=LOADwhere a potentially unmapped field would be pushed down to Lucene, and, when it doesn't exist in the mapping, would cause wrong results.Resolves: #141920, #141925.