ESQL: Load many fields column-at-a-time by nik9000 · Pull Request #141926 · elastic/elasticsearch

nik9000 · 2026-02-05T13:40:21Z

ESQL: Load many fields column-at-a-time

Adds support for ColumnAtATimeReader in the case where we're loading
from many segments. This should marginally speed up loading many
documents after a top n. More importantly, it lets #141672 kick in
when loading from many fields. This should save significant memory
when loading thousands of fields after a | SORT | LIMIT sequence.

Finally, this changes the rules for BlockLoader. Previously you
could return null from columnAtATimeReader but must never return
null from rowStrideReader. Now the rule is that you may return null
from either of the two, but not both. This should let us delete a
bunch of code. While we're at it, we should add a
read(builder, docs, offset, nullsFiltered) override to save a copy.
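A minimal sketch of the new rule, for illustration only. The interfaces below are hypothetical, simplified stand-ins and not the real BlockLoader API, and preferring the column-at-a-time reader when both exist is an assumption:

```java
// Hypothetical, simplified stand-ins; the real BlockLoader readers take more arguments.
interface ColumnReader {
    String describe();
}

interface RowReader {
    String describe();
}

interface SketchBlockLoader {
    ColumnReader columnAtATimeReader(); // may return null

    RowReader rowStrideReader(); // may return null
}

final class ReaderChoice {
    // Builds a loader from two possibly-null readers.
    static SketchBlockLoader loader(ColumnReader column, RowReader row) {
        return new SketchBlockLoader() {
            @Override
            public ColumnReader columnAtATimeReader() {
                return column;
            }

            @Override
            public RowReader rowStrideReader() {
                return row;
            }
        };
    }

    // Either reader may be null, but not both.
    static String choose(SketchBlockLoader loader) {
        ColumnReader column = loader.columnAtATimeReader();
        if (column != null) {
            return "column:" + column.describe();
        }
        RowReader row = loader.rowStrideReader();
        if (row != null) {
            return "row:" + row.describe();
        }
        throw new IllegalStateException("a BlockLoader must supply at least one reader");
    }

    public static void main(String[] args) {
        System.out.println(choose(loader(() -> "doc values", null)));
        System.out.println(choose(loader(null, () -> "stored fields")));
    }
}
```

With a rule like this a loader only has to implement the mode that suits it, which is what should let the "works both ways" glue code go away.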

elasticsearchmachine · 2026-02-05T13:40:47Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

nik9000 · 2026-02-05T13:45:00Z

I'm adding an integration test now. I'll see whether there's a real performance improvement too. It'll likely show up in a fun query like:

FROM foo
| SORT timestamp DESC
| LIMIT 10000000
| STATS SUM(f)

nik9000 · 2026-02-05T14:43:29Z

So, yes, there is some performance improvement:

while curl -s -HContent-Type:application/json -uelastic:password -XPOST localhost:9200/_query -d'{
    "query": "FROM test-index | SORT timestamp DESC | LIMIT 1000000 | STATS MIN(f)"
}' | jq .took; do echo ok; done

6764 -> 5654. That's not much, but I didn't expect much. The load is 25122339ns -> 79986530ns.

FWIW most of the time for that is spent in the top n operator - something I'll be looking into in a few weeks.

nik9000 · 2026-02-05T16:26:06Z

@martijnvg, a bunch of the time series tests fail in this PR. When I comment out the instanceof OptionalColumnAtATimeReader stuff they pass. It looks kind of like an off-by-one somewhere deep in the reader.

./gradlew ":x-pack:plugin:esql:qa:server:single-node:javaRestTest" --tests "org.elasticsearch.xpack.esql.qa.single_node.EsqlSpecIT" -Dtests.method="test {csv-spec:k8s-timeseries-min-over-time.Min_over_time_aggregate_metric_double_implicit_casting_grouping}" -Dtests.seed=8C6B0587BDC10671 -Dtests.locale=hu-Latn-HU -Dtests.timezone=Etc/GMT -Druntime.java=25

reproduces it.

nik9000 · 2026-02-05T17:28:19Z

.../compute/src/main/java/org/elasticsearch/compute/lucene/read/ValuesSourceReaderOperator.java


        BlockLoader loader;
+        // TODO rework this bit of mutable state into something harder to forget
+        // Seriously, I've tripped over this twice.


I'll grab this in a follow-up change.

parkertimmins · 2026-02-06T04:53:29Z

Looked into this a bit. The cause of the issue is that isDense assumes that the incoming doc ids do not contain duplicates:

elasticsearch/server/src/main/java/org/elasticsearch/index/codec/tsdb/es819/ES819TSDBDocValuesProducer.java

Line 2133 in d1924a4

if (isDense(index, docs.get(lastIndex), newLength)) {

I'm not yet sure if the fix is to make a more robust/expensive form of isDense, or something else. Anyway, I'll follow up on this tomorrow.
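To illustrate why duplicates can fool a cheap density check, here is a sketch. It assumes the check compares only the first doc id, the last doc id, and the number of docs; the real isDense in ES819TSDBDocValuesProducer may differ, and the names below are hypothetical:

```java
final class DenseCheckSketch {
    // Assumed shape of a cheap check: constant time, only valid for strictly increasing doc ids.
    static boolean looksDense(int firstDoc, int lastDoc, int length) {
        return lastDoc - firstDoc + 1 == length;
    }

    // A more robust but more expensive form: walks every doc id.
    static boolean isTrulyDense(int[] docs) {
        for (int i = 1; i < docs.length; i++) {
            if (docs[i] != docs[i - 1] + 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] unique = { 3, 4, 5, 6 };
        int[] withDuplicate = { 3, 4, 4, 6 }; // doc 4 repeated, doc 5 missing

        System.out.println(looksDense(unique[0], unique[3], unique.length));
        System.out.println(isTrulyDense(unique));

        // The cheap check is fooled: 6 - 3 + 1 == 4 even though the range has a hole.
        System.out.println(looksDense(withDuplicate[0], withDuplicate[3], withDuplicate.length));
        System.out.println(isTrulyDense(withDuplicate));
    }
}
```

Under that assumption a duplicate plus a gap cancel out, so a bulk read of the range would return values for the wrong documents, which would look a lot like an off-by-one.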

nik9000 · 2026-02-06T17:24:10Z

@dnhatn, @martijnvg, @parkertimmins, and I met to talk about this. DocVector usually doesn't contain duplicates. But TimeSeriesAggregatorOperator.selectedForDocIdsAggregator and ENRICH and LOOKUP JOIN can make duplicate doc ids.

If you use Lucene's reader interfaces you can read duplicate doc ids. Most of our BlockLoader implementations do that. Except the failing test. That goes lower, and it has the method @parkertimmins mentioned. It isn't tolerant of duplicates.

We really want the performance we can get by not being tolerant of duplicates. Specifically, we need that performance for the "first load" of fields. And, at least in the case of TimeSeriesAggregatorOperator.selectedForDocIdsAggregator, we're fine to slow down so we can handle duplicates.

I'm going to block this PR on another one I'm starting now. It will add a flag to DocVector saying duplicatesAllowed. Usually it'll be false and we can go fast. Sometimes it won't be and we won't take the fancy fast paths.
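A rough sketch of how such a flag could gate a fast path. Everything here is hypothetical (the Docs record, the method names, the long[] column); it borrows the mayContainDuplicates name that shows up in the review comments further down and is not the code from either PR:

```java
import java.util.Arrays;

final class DuplicateAwareRead {
    // Hypothetical stand-in for the doc ids handed to a reader.
    record Docs(int[] ids, boolean mayContainDuplicates) {}

    // Fast path: bulk-copies a dense range, or returns null when it can't be trusted.
    static long[] tryReadDense(long[] column, Docs docs) {
        if (docs.mayContainDuplicates()) {
            return null; // the density shortcut below is only valid without duplicates
        }
        int[] ids = docs.ids();
        if (ids.length == 0) {
            return new long[0];
        }
        int first = ids[0];
        int last = ids[ids.length - 1];
        if (last - first + 1 != ids.length) {
            return null; // not dense
        }
        return Arrays.copyOfRange(column, first, last + 1);
    }

    // Slow path: tolerant of duplicates and gaps.
    static long[] readOneByOne(long[] column, Docs docs) {
        int[] ids = docs.ids();
        long[] out = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            out[i] = column[ids[i]];
        }
        return out;
    }

    static long[] read(long[] column, Docs docs) {
        long[] fast = tryReadDense(column, docs);
        return fast != null ? fast : readOneByOne(column, docs);
    }

    public static void main(String[] args) {
        long[] column = { 10, 11, 12, 13, 14, 15, 16 };
        System.out.println(Arrays.toString(read(column, new Docs(new int[] { 2, 3, 4 }, false))));
        System.out.println(Arrays.toString(read(column, new Docs(new int[] { 3, 4, 4, 6 }, true))));
    }
}
```

The point of the flag is that the common "first load" keeps the cheap check, and only callers that can produce duplicates pay for the slow path.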

nik9000 · 2026-02-06T23:03:11Z

Blocked on #142055

nik9000 · 2026-02-09T12:29:41Z

Unblocked! @parkertimmins, could you have a look at this one? And, could you make a follow up with unit tests for y'all's fancy BlockLoader implementations?

parkertimmins · 2026-02-09T13:54:32Z

Thanks for adding the mayContainDuplicates logic, that looks good to me. (And wow -974 lines!) Sounds good, yep I'll follow up with some unit tests today for the block loaders.

dnhatn

LGTM. Thanks @nik9000

dnhatn · 2026-02-09T18:36:15Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/es819/ES819TSDBDocValuesProducer.java

                        boolean toInt,
                        boolean binaryMultiValuedFormat
                    ) throws IOException {
+                        if (docs.mayContainDuplicates()) {


I think we can go further even with duplicates, but Martijn or Parker can follow up on it.

martijnvg

Thanks Nik! One minor comment, LGTM 👍.

martijnvg · 2026-02-09T18:43:39Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/es819/ES819TSDBDocValuesProducer.java

                        boolean toInt,
                        boolean binaryMultiValuedFormat
                    ) throws IOException {
+                        if (docs.mayContainDuplicates()) {


Yes, we can look into removing these if statements in follow ups.

martijnvg · 2026-02-09T18:45:51Z

server/src/main/java/org/elasticsearch/index/mapper/BlockLoader.java

+         *     </li>
+         * </ul>
+         */
+        boolean mayContainDuplicates();


A quick look indicates that most implementations return false here. Maybe have a default implementation that returns false?

I thought about it but figured it was kinder to make the implementer think about the choice when implementing.
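The trade-off can be sketched like this. Both interfaces and both implementing classes are hypothetical and exist only to contrast the two options:

```java
// Option suggested in review: a default that returns false.
interface DocsWithDefault {
    default boolean mayContainDuplicates() {
        return false;
    }
}

// Option kept in the PR: no default, so every implementer has to decide.
interface DocsExplicit {
    boolean mayContainDuplicates();
}

final class DefaultVsAbstract {
    // Compiles without mentioning duplicates at all and silently inherits false.
    static final class ForgetfulDocs implements DocsWithDefault {}

    // Would not compile without this method, so the choice is visible in the code.
    static final class JoinDocs implements DocsExplicit {
        @Override
        public boolean mayContainDuplicates() {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(new ForgetfulDocs().mayContainDuplicates());
        System.out.println(new JoinDocs().mayContainDuplicates());
    }
}
```

A wrong false here would send duplicate doc ids down a fast path that can't handle them, which is the argument for making the implementer spell it out.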

…duplicates (#142409) Add a test to the es819 codec test to verify changes from #141926. It checks that code paths which require incoming docs to be free of duplicates return null from tryRead when passed docs with duplicates. Also, update DenseBinaryDocValues to return null if mayContainDuplicates.

In #141926 I deprecated the `AllReader` because we no longer need to make a `BlockLoader` work both row-by-row and column-at-a-time. Now it's fine for a `BlockLoader` to work in either mode. And `AllReader` was the tool that we used to support working both ways. So it can go! This removes it.

nik9000 added 8 commits February 4, 2026 14:21

WIP

1fb44d3

Update

70ef262

Compile:

182f2a1

Builder?

5ffcc09

fix

d5d073e

Format mostly

a428b94

Merge branch 'main' into esql_from_many_column

f56ee5a

TODO that'd save a copy

dce7cd9

nik9000 added >bug :Analytics/ES|QL AKA ESQL v9.4.0 labels Feb 5, 2026

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 5, 2026

Test

8a162b2

nik9000 requested review from craigtaverner, fang-xing-esql and martijnvg February 5, 2026 17:27

test update

6d2e67c

nik9000 commented Feb 5, 2026

Test

3a78572

nik9000 and others added 4 commits February 7, 2026 07:35

Merge branch 'main' into esql_from_many_column

1dc64fa

Update

22c7519

[CI] Auto commit changes from spotless

b9b6c14

Remove

d67a83d

nik9000 added 2 commits February 7, 2026 16:34

Merge remote-tracking branch 'nik9000/esql_from_many_column' into esql_from_many_column

792ada6

Merge branch 'main' into esql_from_many_column

51dc417

nik9000 requested a review from dnhatn February 8, 2026 17:19

nik9000 added 4 commits February 9, 2026 09:09

Merge branch 'main' into esql_from_many_column

738d45b

Merge branch 'main' into esql_from_many_column

1ab41aa

Merge remote-tracking branch 'nik9000/esql_from_many_column' into esql_from_many_column

0f95674

Merge branch 'main' into esql_from_many_column

4dfbb3c

dnhatn approved these changes Feb 9, 2026

martijnvg approved these changes Feb 9, 2026

Merge branch 'main' into esql_from_many_column

155441d

nik9000 enabled auto-merge (squash) February 9, 2026 20:54

Merge branch 'main' into esql_from_many_column

b7f814a

nik9000 merged commit 82756e9 into elastic:main Feb 10, 2026
35 checks passed

parkertimmins mentioned this pull request Feb 12, 2026

Add a es819 codec test to verify tryRead returns null if may contain duplicates #142409

Merged

nik9000 mentioned this pull request Feb 24, 2026

ESQL: Remove AllReader #142917

Merged

fang-xing-esql mentioned this pull request Feb 25, 2026

[ES|QL] Read many large keyword or text fields can take a ton of untracked memory #140218

Closed
