Integrate stored fields format bloom filter with synthetic _id by fcofdez · Pull Request #138515 · elastic/elasticsearch

fcofdez · 2025-11-24T15:25:57Z

This PR integrates ES93BloomFilterStoredFieldsFormat with the synthetic _id lookups. To do that, it introduces a new set of codecs meant to be used only by TIME_SERIES indices. These new codecs are necessary to cover the case where the codec is loaded through SPI (i.e. after a shard relocation or a node restart).

The new codecs just wrap the existing codecs and extend them with the necessary plumbing to populate the bloom filter during indexing.

elasticsearchmachine · 2025-11-24T15:26:45Z

Hi @fcofdez, I've created a changelog YAML for you.

tlrx

I need to take a deeper look but overall approach looks good.

...er/src/main/java/org/elasticsearch/index/codec/ES93TSDBDefaultCompressionLucene103Codec.java

server/src/main/java/org/elasticsearch/index/codec/ES93TSDBLuceneDefaultCodec.java

tlrx · 2025-11-24T15:34:31Z

server/src/main/java/org/elasticsearch/index/IndexSettings.java

        Property.Final
    );

+    public static final boolean USE_STORED_FIELDS_BLOOM_FILTER_FOR_ID_FEATURE_FLAG = new FeatureFlag("stored_field_bloom_filter")


Do you think we need another feature flag, or could it be folded with the existing one for synthetic id?

I'm not sure it makes a lot of sense to test one without the other, but maybe I'm missing a point.

My idea is that the bloom filter is an optimization on top of the synthetic id. But happy to get rid of the feature flag and the index.mapping.use_stored_field_bloom_filter_id index setting if we think that's redundant. It'll simplify the code a bit.

My idea is that the bloom filter is an optimization on top of the synthetic id

I agree, but I think we won't use synthetic ids without a bloom filter on top of them, and having two feature flags complicates the code. If that's OK, I would prefer to use only one feature flag for both.

I won't block the PR for this so if you want to keep it that's OK too.

I got rid of the setting and feature flag in 799fb3a

tlrx · 2025-11-24T15:44:32Z

server/src/main/java/org/elasticsearch/index/codec/CodecService.java


    }
+
+    private enum StorageMode {


I find this storage mode a bit confusing. Maybe a useBloomFilterSyntheticId local variable would be simpler?

I got rid of it in a72a66a


elasticsearchmachine · 2025-11-25T13:08:10Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

fcofdez · 2025-11-25T14:59:38Z

server/src/main/java/org/elasticsearch/index/codec/TSDBCodecWithSyntheticId.java

+    private final TSDBStoredFieldsFormat storedFieldsFormat;
+
+    TSDBCodecWithSyntheticId(String name, Codec delegate, BigArrays bigArrays) {
+        super(name, new TSDBSyntheticIdCodec(delegate));


I'm planning to incorporate the code from TSDBSyntheticIdCodec into this class in a follow-up PR. But I wanted to keep the change size under control.

tlrx

Great work @fcofdez ! I only left minor comments, the direction makes sense to me and we can improve in follow ups. I'd like to have Martijn or Alan review the codec part before merging though.

tlrx · 2025-11-25T14:12:35Z

server/src/main/java/org/elasticsearch/index/codec/CodecService.java

+        boolean useSyntheticId = IndexSettings.TSDB_SYNTHETIC_ID_FEATURE_FLAG
+            && mapperService != null
+            && mapperService.getIndexSettings().useTimeSeriesSyntheticId()
+            && mapperService.getIndexSettings().getMode() == IndexMode.TIME_SERIES;


mapperService.getIndexSettings().useTimeSeriesSyntheticId() already ensures that the index is a time-series index and that the feature flag is enabled.

Simplified in d0d94ef

tlrx · 2025-11-25T14:28:24Z

server/src/main/java/org/elasticsearch/index/codec/TSDBCodecWithSyntheticId.java

+    private final TSDBStoredFieldsFormat storedFieldsFormat;
+
+    TSDBCodecWithSyntheticId(String name, Codec delegate, BigArrays bigArrays) {
+        super(name, new TSDBSyntheticIdCodec(delegate));


We can merge TSDBSyntheticIdCodec and TSDBCodecWithSyntheticId together in a follow up.

Ok, I just saw your #138515 (comment) 👍

tlrx · 2025-11-25T14:40:18Z

...main/java/org/elasticsearch/index/codec/bloomfilter/DelegatingBloomFilterFieldsProducer.java

+        return new FilterLeafReader.FilterTerms(delegate.terms(field)) {
+            @Override
+            public TermsEnum iterator() throws IOException {
+                return new LazyFilterTermsEnum() {
+                    private TermsEnum delegate;
+
+                    @Override
+                    protected TermsEnum getDelegate() throws IOException {
+                        if (delegate == null) {
+                            delegate = in.iterator();
+                        }
+                        return delegate;
+                    }


nit: I was confused by the two delegates (the one in the lazy terms enum and the one in the bloom filter fields producer) and by what "in" was referring to.

Maybe something like this would help?

    final Terms terms = delegate.terms(field);
    return new FilterLeafReader.FilterTerms(terms) {
        @Override
        public TermsEnum iterator() throws IOException {
            return new LazyFilterTermsEnum() {
                private TermsEnum termsEnum;

                @Override
                protected TermsEnum getDelegate() throws IOException {
                    if (termsEnum == null) {
                        termsEnum = terms.iterator();
                    }
                    return termsEnum;
                }

                @Override
                public boolean seekExact(BytesRef text) throws IOException {
                    if (bloomFilter.mayContainTerm(field, text) == false) {
                        return false;
                    }
                    return getDelegate().seekExact(text);
                }
            };
        }
    };

Good idea, changed in 92a6daa
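
For readers following this thread, the pattern under discussion can be sketched outside Lucene as follows. This is only an illustration under stated assumptions: MayContain, TermLookup and FilteredLazyLookup are hypothetical stand-ins (not classes from this PR), terms are plain strings instead of BytesRef, and the opens counter exists only to make the laziness observable.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a bloom filter: false positives are possible, false negatives are not. */
interface MayContain {
    boolean mayContainTerm(String field, String term);
}

/** Hypothetical stand-in for a terms dictionary that is expensive to open. */
interface TermLookup {
    boolean seekExact(String term);
}

/** Consults the filter first and only opens the underlying lookup when the filter cannot rule the term out. */
final class FilteredLazyLookup implements TermLookup {
    private final String field;
    private final MayContain bloomFilter;
    private final Supplier<TermLookup> opener;
    private TermLookup termsEnum; // opened on the first lookup that passes the filter
    int opens = 0;                // illustration only: counts how often the delegate was opened

    FilteredLazyLookup(String field, MayContain bloomFilter, Supplier<TermLookup> opener) {
        this.field = field;
        this.bloomFilter = bloomFilter;
        this.opener = opener;
    }

    private TermLookup getDelegate() {
        if (termsEnum == null) {
            termsEnum = opener.get();
            opens++;
        }
        return termsEnum;
    }

    @Override
    public boolean seekExact(String term) {
        if (bloomFilter.mayContainTerm(field, term) == false) {
            return false; // definite miss: the delegate is never touched
        }
        return getDelegate().seekExact(term);
    }
}
```

A lookup for an id the filter rejects returns false without opening the delegate; the first id that passes the filter opens it exactly once.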

server/src/main/java/org/elasticsearch/index/codec/storedfields/TSDBStoredFieldsFormat.java

tlrx · 2025-11-25T14:53:27Z

server/src/main/java/org/elasticsearch/index/codec/CodecService.java

+        var legacyBestSpeedCodec = new LegacyPerFieldMapperCodec(Lucene103Codec.Mode.BEST_SPEED, mapperService, bigArrays);
        if (ZSTD_STORED_FIELDS_FEATURE_FLAG) {
-            codecs.put(DEFAULT_CODEC, new PerFieldMapperCodec(Zstd814StoredFieldsFormat.Mode.BEST_SPEED, mapperService, bigArrays));
+            PerFieldMapperCodec defaultZstdCodec = new PerFieldMapperCodec(


If we want to reduce the scope of this change, we could create our own default_code_with_synthetic_id and hard-code it in INDEX_CODEC_SETTING for all time-series indices with use_synthetic_id enabled.

Here we go for the complete solution immediately, which I'm OK with too.

I don't have a strong opinion about this. I'm ok with both approaches. The downside of an extra codec is that we need to maintain it indefinitely whereas with this change as long as the feature flag is off we keep the current behaviour.

...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java

tlrx · 2025-11-25T15:07:27Z

server/src/main/java/org/elasticsearch/index/codec/storedfields/TSDBStoredFieldsFormat.java

+ * @see StoredFieldsFormat
+ */
+public class TSDBStoredFieldsFormat extends StoredFieldsFormat {
+    private final StoredFieldsFormat storedFieldsFormat;


nit: I would call this delegate

Tackled in e1cbce6

tlrx · 2025-11-25T15:10:21Z

server/src/main/java/org/elasticsearch/index/codec/storedfields/TSDBStoredFieldsFormat.java

+
+        TSDBStoredFieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
+            boolean success = false;
+            List<Closeable> toClose = new ArrayList<>();


nit:

Suggested change

List<Closeable> toClose = new ArrayList<>();

List<Closeable> toClose = new ArrayList<>(2);

Tackled in 680edf1

tlrx · 2025-11-25T15:10:45Z

server/src/main/java/org/elasticsearch/index/codec/storedfields/TSDBStoredFieldsFormat.java

+
+        TSDBStoredFieldsReader(Directory directory, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException {
+            boolean success = false;
+            List<Closeable> toClose = new ArrayList<>();


Suggested change

List<Closeable> toClose = new ArrayList<>();

List<Closeable> toClose = new ArrayList<>(2);

Tackled in 680edf1
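
As context for the two suggestions above, the success/toClose idiom in these constructors can be sketched like this. It is a minimal sketch, not the PR's code: TwoPartResource and Opener are hypothetical names, and the sketch simply swallows close failures during cleanup.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical resource made of two parts that must both be open, or neither. */
final class TwoPartResource implements Closeable {
    /** Hypothetical factory for one of the parts. */
    interface Opener {
        Closeable open() throws IOException;
    }

    private final Closeable first;
    private final Closeable second;

    TwoPartResource(Opener openFirst, Opener openSecond) throws IOException {
        boolean success = false;
        // At most two resources are ever tracked, hence the initial capacity of 2.
        List<Closeable> toClose = new ArrayList<>(2);
        try {
            first = openFirst.open();
            toClose.add(first);
            second = openSecond.open();
            toClose.add(second);
            success = true;
        } finally {
            if (success == false) {
                // Release whatever was opened before the failure; the original exception still propagates.
                for (Closeable c : toClose) {
                    try {
                        c.close();
                    } catch (IOException ignored) {
                        // best-effort cleanup in this sketch
                    }
                }
            }
        }
    }

    @Override
    public void close() throws IOException {
        try {
            first.close();
        } finally {
            second.close();
        }
    }
}
```

If opening the second part throws, the first part is closed before the exception reaches the caller, so nothing leaks.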

...a-streams/src/internalClusterTest/java/org/elasticsearch/datastreams/TSDBSyntheticIdsIT.java

tlrx · 2025-11-28T09:51:44Z

@fcofdez I just noticed that org.elasticsearch.index.codec.tsdb.TSDBSyntheticIdPostingsFormat has to be declared in server/src/main/java/module-info.java under provides org.apache.lucene.codecs.PostingsFormat so that Lucene's PerFieldPostingsFormat can correctly load the posting format.

This is something I missed when I introduced TSDBSyntheticIdPostingsFormat.

fcofdez · 2025-11-28T10:19:21Z

@martijnvg It would be great if you could take a look into this PR once you have some time. Thanks!

tlrx · 2025-12-01T12:24:19Z

I just noticed that org.elasticsearch.index.codec.tsdb.TSDBSyntheticIdPostingsFormat has to be declared in server/src/main/java/module-info.java under provides org.apache.lucene.codecs.PostingsFormat so that Lucene's PerFieldPostingsFormat can correctly load the posting format.

I added TSDBSyntheticIdPostingsFormat in module-info.java in #138751 (merged).

martijnvg

Sorry for the delay @fcofdez. I left a number of questions.

martijnvg · 2025-12-01T10:57:36Z

server/src/main/java/org/elasticsearch/index/codec/CodecService.java

+                .getIndexVersionCreated()
+                .onOrAfter(IndexVersions.TIME_SERIES_USE_STORED_FIELDS_BLOOM_FILTER_FOR_ID);
+
+        var legacyBestSpeedCodec = new LegacyPerFieldMapperCodec(Lucene103Codec.Mode.BEST_SPEED, mapperService, bigArrays);


Maybe having a single useSyntheticId if statement with an else clause would be clearer than having useSyntheticId checks in several places?

martijnvg · 2025-12-01T12:39:09Z

server/src/main/java/org/elasticsearch/index/codec/CodecService.java

+            );
+            codecs.put(
+                DEFAULT_CODEC,
+                useSyntheticId ? new ES93TSDBZSTDCompressionLucene103Codec(defaultZstdCodec, bigArrays) : defaultZstdCodec


Is it possible to not support every codec combination here? Or at least not do this now. For legacy cases this doesn't seem necessary.

Typically DEFAULT_CODEC is used. We can maybe just enforce this if index.mapping.use_synthetic_id is configured?

That sounds good to me, I wasn't 100% sure if we should support all the codec types or if we could do with just the default one. I'll implement your idea so we reduce the risk surface area, we already discussed that possibility 👍.

@fcofdez what do you think of enforcing ES93TSDBDefaultCompressionLucene103Codec if index.mapping.use_synthetic_id is used? Even if zstd feature flag is enabled? I think that would reduce the size of this change as well.

Separate from this change, I think we need to remove the ZSTD_STORED_FIELDS_FEATURE_FLAG experiment. We have seen inconclusive results from changing to zstd for the default codec, so we should remove it for now.

that sounds good to me, I'll do that 👍

@martijnvg I've just pushed a commit leaving only ES93TSDBDefaultCompressionLucene103Codec even when the zstd feature flag is enabled.

martijnvg · 2025-12-01T12:41:09Z

server/src/main/java/org/elasticsearch/index/codec/ES93TSDBZSTDCompressionLucene103Codec.java

+
+import org.elasticsearch.common.util.BigArrays;
+
+public class ES93TSDBZSTDCompressionLucene103Codec extends TSDBCodecWithSyntheticId {


Based on my comment in CodecService, maybe we just need this codec implementation?

So is my understanding correct that even if zstd compression is behind a feature flag, it's the default compression algorithm for stored fields nowadays?

Good point. In released versions we end up using zstd-based stored fields; otherwise we end up using the default fast compression for the default codec. So I think we need to keep both here.

martijnvg · 2025-12-01T12:43:09Z

...a-streams/src/internalClusterTest/java/org/elasticsearch/datastreams/TSDBSyntheticIdsIT.java

+
+        // TODO: fix IndexDiskUsageStats to take into account synthetic _id terms
+        var checkDiskUsage = false;
+        if (checkDiskUsage) {


Does this fail currently, or is it inaccurate because the bloom filter size on disk can't be computed?

In fact this has already been fixed by another PR, so I'll remove this condition.

martijnvg · 2025-12-01T12:49:16Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdPostingsFormat.java

            var fieldsProducer = new TSDBSyntheticIdFieldsProducer(state, docValuesProducer);
            success = true;
-            return fieldsProducer;
+            return bloomFilter == null ? fieldsProducer : new DelegatingBloomFilterFieldsProducer(fieldsProducer, bloomFilter);


If field is _id then in what cases would the bloomFilter be null?

martijnvg · 2025-12-01T12:54:15Z

...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java

+    static class Reader extends StoredFieldsReader implements BloomFilter {
+        // The bloom filter can be null in cases where the indexed documents
+        // do not include a field bloomFilterFieldName and thus the bloom filter
+        // is empty. (This mostly apply for tests).


For which tests is this the case? Ideally if this format gets tested it should be in a setup that doesn't produce nulls?

It's mostly about BaseStoredFieldsFormatTestCase which is used by TSDBStoredFieldsFormatTests and we don't populate the bloom filter in that case since the base test expects all the fields to be stored. Maybe we can tighten this up once we synthesise _ids (there's a PR coming for that)

fcofdez · 2025-12-01T14:59:38Z

Thanks a lot for the review @martijnvg. I've pushed a commit addressing your comments. I left the two default codecs since I'm not sure if zstd is used always, but happy to remove the default compression codec if we think that's not needed.

romseygeek · 2025-12-01T11:15:48Z

server/src/main/java/org/elasticsearch/index/codec/bloomfilter/BloomFilter.java

+import java.io.Closeable;
+import java.io.IOException;
+
+public interface BloomFilter extends Closeable {


Mainly a style thing, but I think this would be nicer as a functional interface with a no-op static implementation. Something like:

static BloomFilter NO_FILTER = (field, term) -> true;

And then rather than checking for isFilterAvailable() you check for == BloomFilter.NO_FILTER.

Also I don't think it needs to be Closeable any more?

Thanks for looking into this @romseygeek, I've removed the isFilterAvailable method as it is indeed redundant (see 1ff2ccd). Regarding Closeable, we need it for DelegatingBloomFilterFieldsProducer to close the underlying bloom filter file once it's done with it.
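
One way the two points in this exchange could coexist is sketched below. This is an assumption for illustration, not what the PR merged: BloomFilterSketch and BloomFilterWiring are hypothetical names, terms are byte arrays rather than BytesRef, and close() is given a default body so the interface keeps a single abstract method and can still be implemented with a lambda.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical simplification of the interface discussed in this thread. */
interface BloomFilterSketch extends Closeable {
    /** No-op filter: every term "may" be present, so callers always fall through to the real lookup. */
    BloomFilterSketch NO_FILTER = (field, term) -> true;

    boolean mayContainTerm(String field, byte[] term) throws IOException;

    /** Defaulting close() leaves one abstract method, so lambdas remain valid implementations. */
    @Override
    default void close() throws IOException {}
}

final class BloomFilterWiring {
    private BloomFilterWiring() {}

    /** Wrapping a producer is only worthwhile when a real filter is present. */
    static boolean shouldWrap(BloomFilterSketch filter) {
        return filter != BloomFilterSketch.NO_FILTER;
    }
}
```

A file-backed implementation would override close() to release its file, while NO_FILTER inherits the empty default.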

fcofdez · 2025-12-08T18:14:27Z

@martijnvg could you take another look once you have the chance? happy to discuss it sync if that helps. Thanks!

martijnvg

Thanks @fcofdez, LGTM.
Let's also wait for @romseygeek to complete the review.

martijnvg · 2025-12-09T09:39:45Z

server/src/main/java/org/elasticsearch/index/codec/CodecService.java

+        if (useSyntheticId) {
+            // Use the default Lucene compression when the synthetic id is used even if the ZSTD feature flag is enabled
+            codecs.put(DEFAULT_CODEC, new ES93TSDBDefaultCompressionLucene103Codec(legacyBestSpeedCodec, bigArrays));
+        } else if (ZSTD_STORED_FIELDS_FEATURE_FLAG) {


I think we should remove this feature flag. I don't think we will use zstd for the default codec in the near future.
But I don't think this affects this PR.
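
For reference, the branch order in the snippet above can be modelled in isolation as below. This is a sketch only: DefaultCodecChoice and its enum are hypothetical, the "default" key is a placeholder for DEFAULT_CODEC, and the enum constants stand in for the real codec instances.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of how the default codec entry is chosen. */
final class DefaultCodecChoice {
    /** Stand-ins for the codec instances registered under the default key. */
    enum Choice { SYNTHETIC_ID_OVER_LEGACY_BEST_SPEED, ZSTD_BEST_SPEED, LEGACY_BEST_SPEED }

    static final String DEFAULT_CODEC = "default"; // placeholder key for this sketch

    private DefaultCodecChoice() {}

    static Map<String, Choice> register(boolean useSyntheticId, boolean zstdFeatureFlag) {
        Map<String, Choice> codecs = new HashMap<>();
        if (useSyntheticId) {
            // Synthetic id takes precedence: default Lucene compression even if the zstd flag is on.
            codecs.put(DEFAULT_CODEC, Choice.SYNTHETIC_ID_OVER_LEGACY_BEST_SPEED);
        } else if (zstdFeatureFlag) {
            codecs.put(DEFAULT_CODEC, Choice.ZSTD_BEST_SPEED);
        } else {
            codecs.put(DEFAULT_CODEC, Choice.LEGACY_BEST_SPEED);
        }
        return codecs;
    }
}
```

Checking useSyntheticId first means the zstd flag only matters for indices that do not use synthetic ids.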

martijnvg · 2025-12-09T09:43:21Z

server/src/main/java/org/elasticsearch/index/codec/CodecService.java

        } else {
            codecs.put(DEFAULT_CODEC, legacyBestSpeedCodec);
        }
+


nit: unneeded white space.

martijnvg · 2025-12-09T09:43:44Z

server/src/main/java/org/elasticsearch/index/codec/CodecService.java

            BEST_COMPRESSION_CODEC,
            new PerFieldMapperCodec(Zstd814StoredFieldsFormat.Mode.BEST_COMPRESSION, mapperService, bigArrays)
        );
-        Codec legacyBestCompressionCodec = new LegacyPerFieldMapperCodec(Lucene103Codec.Mode.BEST_COMPRESSION, mapperService, bigArrays);


Only changes formatting? Maybe just keep it the way it is?

romseygeek

LGTM

fcofdez · 2025-12-11T10:17:28Z

Thanks all for the reviews!

* upstream/main: (79 commits) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/140_pre_filter_search_shards/prefilter on non-indexed date fields} elastic#139381 Adjust error bounds for bfloat16 value checks (elastic#139371) Unmute some vector CSS tests (elastic#139370) Do not allow `project_routing` as a query param (elastic#139206) Unmute HalfFloat...Tests#testSynthesizeArrayRandom (elastic#139341) Remove leniency in LinkedProjectConfig builder methods (elastic#139012) EQL: fix project_routing (elastic#139366) Add patch version for 9.2 index version constant (elastic#139362) Mute org.elasticsearch.test.rest.yaml.RcsCcsCommonYamlTestSuiteIT test {p0=search.vectors/200_dense_vector_docvalue_fields/dense_vector docvalues with bfloat16} elastic#139368 ES|QL: Enable CCS tests for FORK (elastic#139302) Restructuring the semantic_text field type page (elastic#138571) AggregateMetricDouble fields should not build BKD indexes (elastic#138724) Feature/count by trunc with filter (elastic#138765) ESQL: Convert TS 500 error to 400 (elastic#139360) [CI] Rerun failing tests for periodic build pipelines (elastic#139200) revert muting saml test (elastic#139327) Add TDigest histogram as metric (elastic#139247) Links solved bugs to class cast exception changelog and unmutes errors (elastic#139340) Ensure integer sorts are rewritten to long sorts for BWC indexes (elastic#139293) Integrate stored fields format bloom filter with synthetic _id (elastic#138515) ...

Integrate stored fields format bloom filter with synthetic _id

429c14a

fcofdez requested review from burqen, martijnvg and tlrx November 24, 2025 15:25

fcofdez added >enhancement Team:StorageEngine :StorageEngine/Codec labels Nov 24, 2025

elasticsearchmachine added the v9.3.0 label Nov 24, 2025

fcofdez mentioned this pull request Nov 24, 2025

Add PerFieldStoredFieldsFormat to allow multiple stored field formats #138299

Closed

Update docs/changelog/138515.yaml

38984b6

This was referenced Nov 24, 2025

Add new codecs to support PerFieldStoredFieldsFormat and bloom filter stored fields #138301

Closed

Integrate bloom filter checks with TSDBSyntheticIdPostingsFormat #138357

Closed

tlrx reviewed Nov 24, 2025

View reviewed changes

fcofdez added 5 commits November 24, 2025 17:25

Wrong version

24fc7e8

Merge branch 'new-codec' of github.com:fcofdez/elasticsearch into new…

60b1eb7

…-codec

Use same name for codecs

7f6d6f3

Get rid of StorageMode

a72a66a

Some renaming

9d0a1f8

fcofdez marked this pull request as ready for review November 25, 2025 13:07

fcofdez requested a review from a team as a code owner November 25, 2025 13:07

fcofdez requested a review from kkrik-es November 25, 2025 13:23

fcofdez added 2 commits November 25, 2025 15:19

Remove setting

799fb3a

Merge remote-tracking branch 'origin/main' into new-codec

089f6f5

fcofdez commented Nov 25, 2025

View reviewed changes

tlrx reviewed Nov 25, 2025

View reviewed changes

fcofdez added 3 commits November 25, 2025 16:31

Improve use syntheticId check

d0d94ef

Add comment about read order

e01019b

Rename to delegate

e1cbce6

burqen mentioned this pull request Nov 28, 2025

Disk usage don't include synthetic _id postings #138745

Merged

martijnvg reviewed Dec 1, 2025

View reviewed changes

fcofdez added 2 commits December 1, 2025 15:22

Merge remote-tracking branch 'origin/main' into new-codec

8826d6d

Review comments

18e19f5

fcofdez requested a review from martijnvg December 1, 2025 14:59

fcofdez added 3 commits December 2, 2025 10:35

Merge remote-tracking branch 'origin/main' into new-codec

86908c1

Simplify codecs

e7efa2b

Merge remote-tracking branch 'origin/main' into new-codec

c07b92a

romseygeek reviewed Dec 4, 2025

View reviewed changes

fcofdez added 2 commits December 4, 2025 12:48

Remove isFilterAvailable

1ff2ccd

Merge remote-tracking branch 'origin/main' into new-codec

4a92842

fcofdez requested a review from romseygeek December 4, 2025 11:57

martijnvg approved these changes Dec 9, 2025

View reviewed changes

romseygeek approved these changes Dec 9, 2025

View reviewed changes

fcofdez added 6 commits December 10, 2025 10:26

Merge remote-tracking branch 'origin/main' into new-codec

52dd5e0

Fix nits

3afbc55

whitespace

3f89769

Merge remote-tracking branch 'origin/main' into new-codec

810f5f4

Merge remote-tracking branch 'origin/main' into new-codec

83cdf7c

Merge remote-tracking branch 'origin/main' into new-codec

186b4b0

fcofdez merged commit a0ee98c into elastic:main Dec 11, 2025
34 checks passed

burqen mentioned this pull request Dec 16, 2025

Single loop for FieldfInfo processing #137967

Closed

fcofdez mentioned this pull request Dec 18, 2025

[CI] ES93BloomFilterStoredFieldsFormatTests testRandomStoredFieldsWithIndexSort failing #139129

Closed

