
Integrate stored fields format bloom filter with synthetic _id #138515

Merged
fcofdez merged 32 commits into elastic:main from
fcofdez:new-codec
Dec 11, 2025

Conversation

@fcofdez
Contributor

@fcofdez fcofdez commented Nov 24, 2025

This PR integrates ES93BloomFilterStoredFieldsFormat with the synthetic _id lookups. To do so, it introduces a new set of codecs meant to be used only by TIME_SERIES indices. These new codecs are necessary to cover the case where the codec is loaded through SPI (i.e. after a shard relocation or a node restart).

The new codecs just wrap the existing codecs and extend them with the necessary plumbing to populate the bloom filter during indexing.
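
The wrapping idea can be pictured with a small, self-contained sketch. This is not the PR's code: `FieldWriter`, `ToyBloomFilter`, and `BloomFilterPopulatingWriter` are hypothetical stand-ins written in plain Java (no Lucene or Elasticsearch types), and they only illustrate a delegate being extended so that the values of one field also feed a bloom filter while they are written.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for a stored-fields writer; not a Lucene or Elasticsearch type.
interface FieldWriter {
    void writeField(String field, String value);
}

// Toy bloom filter: two hash probes into a fixed-size bit set.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;

    ToyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    void add(String term) {
        bits.set(probe(term, 1));
        bits.set(probe(term, 2));
    }

    // May return true for a term that was never added, but never false for one that was.
    boolean mayContain(String term) {
        return bits.get(probe(term, 1)) && bits.get(probe(term, 2));
    }

    private int probe(String term, int seed) {
        return Math.floorMod(term.hashCode() * 31 + seed * 0x9E3779B9, size);
    }
}

// Wraps an existing writer and records the values of one field in the bloom filter.
final class BloomFilterPopulatingWriter implements FieldWriter {
    private final FieldWriter delegate;
    private final ToyBloomFilter bloomFilter;
    private final String bloomFilterField;

    BloomFilterPopulatingWriter(FieldWriter delegate, ToyBloomFilter bloomFilter, String bloomFilterField) {
        this.delegate = delegate;
        this.bloomFilter = bloomFilter;
        this.bloomFilterField = bloomFilterField;
    }

    @Override
    public void writeField(String field, String value) {
        if (bloomFilterField.equals(field)) {
            bloomFilter.add(value);
        }
        delegate.writeField(field, value); // the wrapped writer still does the real work
    }
}

final class WrappingDemo {
    public static void main(String[] args) {
        List<String> written = new ArrayList<>();
        ToyBloomFilter filter = new ToyBloomFilter(1024);
        FieldWriter writer = new BloomFilterPopulatingWriter((f, v) -> written.add(f + "=" + v), filter, "_id");
        writer.writeField("_id", "doc-1");
        writer.writeField("host", "web-01");
        System.out.println(written.size() + " fields written, may contain doc-1: " + filter.mayContain("doc-1"));
    }
}
```

The point of the decorator shape is that the wrapped writer is untouched, which matches the description of codecs that "just wrap the existing codecs".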

@elasticsearchmachine
Collaborator

Hi @fcofdez, I've created a changelog YAML for you.

Member

@tlrx tlrx left a comment

I need to take a deeper look, but the overall approach looks good.

Property.Final
);

public static final boolean USE_STORED_FIELDS_BLOOM_FILTER_FOR_ID_FEATURE_FLAG = new FeatureFlag("stored_field_bloom_filter")
Member

Do you think we need another feature flag, or could it be folded with the existing one for synthetic id?

I'm not sure it makes a lot of sense to test one without the other, but maybe I'm missing a point.

Contributor Author

My idea is that the bloom filter is an optimization on top of the synthetic id. But happy to get rid of the feature flag and the index.mapping.use_stored_field_bloom_filter_id index setting if we think that's redundant. It'll simplify the code a bit.

Member

> My idea is that the bloom filter is an optimization on top of the synthetic id

I agree, but I think we won't use synthetic ids without a bloom filter on top of them, and having two feature flags complicates the code. If that's OK, I would prefer to use only one feature flag for both.

I won't block the PR for this so if you want to keep it that's OK too.

Contributor Author

I got rid of the setting and feature flag in 799fb3a


}

private enum StorageMode {
Member

I find this storage mode a bit confusing. Maybe a useBloomFilterSyntheticId local variable would be simpler?

Contributor Author

I got rid of it in a72a66a

@fcofdez fcofdez marked this pull request as ready for review November 25, 2025 13:07
@fcofdez fcofdez requested a review from a team as a code owner November 25, 2025 13:07
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@fcofdez fcofdez requested a review from kkrik-es November 25, 2025 13:23
private final TSDBStoredFieldsFormat storedFieldsFormat;

TSDBCodecWithSyntheticId(String name, Codec delegate, BigArrays bigArrays) {
super(name, new TSDBSyntheticIdCodec(delegate));
Contributor Author

@fcofdez fcofdez Nov 25, 2025

I'm planning to incorporate the code from TSDBSyntheticIdCodec into this class in a follow-up PR. But I wanted to keep the change size under control.

Member

@tlrx tlrx left a comment

Great work @fcofdez ! I only left minor comments, the direction makes sense to me and we can improve in follow ups. I'd like to have Martijn or Alan review the codec part before merging though.

boolean useSyntheticId = IndexSettings.TSDB_SYNTHETIC_ID_FEATURE_FLAG
    && mapperService != null
    && mapperService.getIndexSettings().useTimeSeriesSyntheticId()
    && mapperService.getIndexSettings().getMode() == IndexMode.TIME_SERIES;
Member

mapperService.getIndexSettings().useTimeSeriesSyntheticId() already ensures that the index is a time-series index and that the feature flag is enabled.

Contributor Author

Simplified in d0d94ef

private final TSDBStoredFieldsFormat storedFieldsFormat;

TSDBCodecWithSyntheticId(String name, Codec delegate, BigArrays bigArrays) {
super(name, new TSDBSyntheticIdCodec(delegate));
Member

We can merge TSDBSyntheticIdCodec and TSDBCodecWithSyntheticId together in a follow up.

Member

Ok, I just saw your #138515 (comment) 👍

Comment on lines 53 to 65
        return new FilterLeafReader.FilterTerms(delegate.terms(field)) {
            @Override
            public TermsEnum iterator() throws IOException {
                return new LazyFilterTermsEnum() {
                    private TermsEnum delegate;

                    @Override
                    protected TermsEnum getDelegate() throws IOException {
                        if (delegate == null) {
                            delegate = in.iterator();
                        }
                        return delegate;
                    }
Member

nit: I was confused by the two delegates (the one in the lazy terms enum and the one in the bloom filter) and by what `in` was referring to.

Maybe something like this would help?

        final Terms terms = delegate.terms(field);
        return new FilterLeafReader.FilterTerms(terms) {
            @Override
            public TermsEnum iterator() throws IOException {
                return new LazyFilterTermsEnum() {
                    private TermsEnum termsEnum;

                    @Override
                    protected TermsEnum getDelegate() throws IOException {
                        if (termsEnum == null) {
                            termsEnum = terms.iterator();
                        }
                        return termsEnum;
                    }

                    @Override
                    public boolean seekExact(BytesRef text) throws IOException {
                        if (bloomFilter.mayContainTerm(field, text) == false) {
                            return false;
                        }
                        return getDelegate().seekExact(text);
                    }
                };
            }
        };
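
The behaviour of the suggested `seekExact` can be tried out in isolation with a plain-Java sketch. Everything here is hypothetical and deliberately free of Lucene types: `ExactLookup` stands in for the exact-lookup part of a terms enum, and `FilterGatedLookup` shows the two ideas from the snippet above, namely a membership pre-check that short-circuits definite misses, and a delegate that is only created on the first lookup that passes the pre-check.

```java
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical stand-in for the exact-lookup part of a terms enum.
interface ExactLookup {
    boolean seekExact(String term);
}

// Consults a membership pre-check before touching the lazily created delegate.
final class FilterGatedLookup implements ExactLookup {
    private final Predicate<String> mayContain;
    private final Supplier<ExactLookup> delegateSupplier;
    private ExactLookup delegate; // created on the first lookup that passes the pre-check

    FilterGatedLookup(Predicate<String> mayContain, Supplier<ExactLookup> delegateSupplier) {
        this.mayContain = mayContain;
        this.delegateSupplier = delegateSupplier;
    }

    @Override
    public boolean seekExact(String term) {
        if (mayContain.test(term) == false) {
            return false; // definite miss: the delegate is neither created nor consulted
        }
        if (delegate == null) {
            delegate = delegateSupplier.get();
        }
        return delegate.seekExact(term);
    }
}

final class FilterGatedLookupDemo {
    public static void main(String[] args) {
        Set<String> ids = Set.of("doc-1", "doc-2");
        int[] delegateCreations = {0};
        // Assumption for determinism: an exact set plays the role of the bloom filter,
        // so this demo has no false positives. A real bloom filter can have them.
        ExactLookup lookup = new FilterGatedLookup(ids::contains, () -> {
            delegateCreations[0]++;
            return ids::contains;
        });
        System.out.println(lookup.seekExact("missing") + " " + delegateCreations[0]); // false 0
        System.out.println(lookup.seekExact("doc-1") + " " + delegateCreations[0]);   // true 1
    }
}
```

With a real bloom filter, a "may contain" answer can be a false positive, which is why the delegate still has the final say on hits.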

Contributor Author

Good idea, changed in 92a6daa

var legacyBestSpeedCodec = new LegacyPerFieldMapperCodec(Lucene103Codec.Mode.BEST_SPEED, mapperService, bigArrays);
if (ZSTD_STORED_FIELDS_FEATURE_FLAG) {
codecs.put(DEFAULT_CODEC, new PerFieldMapperCodec(Zstd814StoredFieldsFormat.Mode.BEST_SPEED, mapperService, bigArrays));
PerFieldMapperCodec defaultZstdCodec = new PerFieldMapperCodec(
Member

If we want to reduce the scope of this change, we could create our own default_code_with_synthetic_id and hard-code it in INDEX_CODEC_SETTING for all time-series indices with use_synthetic_id enabled.

Here we go for the complete solution immediately, which I'm OK with too.

Contributor Author

I don't have a strong opinion about this. I'm ok with both approaches. The downside of an extra codec is that we need to maintain it indefinitely whereas with this change as long as the feature flag is off we keep the current behaviour.

* @see StoredFieldsFormat
*/
public class TSDBStoredFieldsFormat extends StoredFieldsFormat {
private final StoredFieldsFormat storedFieldsFormat;
Member

nit: I would call this delegate

Contributor Author

Tackled in e1cbce6


TSDBStoredFieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
boolean success = false;
List<Closeable> toClose = new ArrayList<>();
Member

nit:

Suggested change
- List<Closeable> toClose = new ArrayList<>();
+ List<Closeable> toClose = new ArrayList<>(2);

Contributor Author

Tackled in 680edf1
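
For readers unfamiliar with the `success` flag plus `toClose` list seen in the constructor above, here is a hedged, self-contained sketch of that cleanup idiom. It does not reproduce the PR's writer: `Opener` and `TwoResourceHolder` are hypothetical names, and `closeQuietly` is a local helper written for this example rather than an Elasticsearch utility. The list is presized to 2 in the spirit of the suggestion, on the assumption that exactly two resources are tracked.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical factory for a resource that may fail to open.
interface Opener {
    Closeable open() throws IOException;
}

final class TwoResourceHolder implements Closeable {
    private final List<Closeable> resources;

    private TwoResourceHolder(List<Closeable> resources) {
        this.resources = resources;
    }

    static TwoResourceHolder open(Opener first, Opener second) throws IOException {
        boolean success = false;
        List<Closeable> toClose = new ArrayList<>(2); // exactly two resources are tracked
        try {
            toClose.add(first.open());
            toClose.add(second.open());
            TwoResourceHolder holder = new TwoResourceHolder(toClose);
            success = true;
            return holder;
        } finally {
            if (success == false) {
                closeQuietly(toClose); // release whatever was opened before the failure
            }
        }
    }

    @Override
    public void close() throws IOException {
        IOException failure = null;
        for (Closeable resource : resources) {
            try {
                resource.close();
            } catch (IOException e) {
                if (failure == null) {
                    failure = e;
                } else {
                    failure.addSuppressed(e);
                }
            }
        }
        if (failure != null) {
            throw failure;
        }
    }

    // Local helper for this sketch: closes everything and swallows close failures.
    private static void closeQuietly(List<Closeable> resources) {
        for (Closeable resource : resources) {
            try {
                resource.close();
            } catch (IOException ignored) {
                // the original failure is the one worth reporting
            }
        }
    }
}
```

If the second `open()` throws, the first resource is closed before the exception propagates, so a half-constructed holder never leaks a file handle.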


TSDBStoredFieldsReader(Directory directory, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException {
boolean success = false;
List<Closeable> toClose = new ArrayList<>();
Member

Suggested change
- List<Closeable> toClose = new ArrayList<>();
+ List<Closeable> toClose = new ArrayList<>(2);

Contributor Author

Tackled in 680edf1

@tlrx
Member

tlrx commented Nov 28, 2025

@fcofdez I just noticed that org.elasticsearch.index.codec.tsdb.TSDBSyntheticIdPostingsFormat has to be declared in server/src/main/java/module-info.java under provides org.apache.lucene.codecs.PostingsFormat so that Lucene's PerFieldPostingsFormat can correctly load the postings format.

This is something I missed when I introduced TSDBSyntheticIdPostingsFormat.

@fcofdez
Contributor Author

fcofdez commented Nov 28, 2025

@martijnvg It would be great if you could take a look into this PR once you have some time. Thanks!

@tlrx
Member

tlrx commented Dec 1, 2025

> I just noticed that org.elasticsearch.index.codec.tsdb.TSDBSyntheticIdPostingsFormat has to be declared in server/src/main/java/module-info.java under provides org.apache.lucene.codecs.PostingsFormat so that Lucene's PerFieldPostingsFormat can correctly load the postings format.

I added TSDBSyntheticIdPostingsFormat in module-info.java in #138751 (merged).
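
For context, the declaration being discussed would look roughly like the fragment below. This is a sketch based only on the class and service names quoted in this thread; the module name `org.elasticsearch.server` is an assumption, and the other directives and providers in the real file are elided rather than reproduced.

```java
// Fragment of server/src/main/java/module-info.java (other directives elided).
// Assumption: the server module is named org.elasticsearch.server.
module org.elasticsearch.server {
    // ... requires / exports / other provides directives ...

    provides org.apache.lucene.codecs.PostingsFormat
        with
            // ... existing PostingsFormat providers, comma-separated ...
            org.elasticsearch.index.codec.tsdb.TSDBSyntheticIdPostingsFormat;
}
```

Without such a `provides` entry, a service lookup by name from the module path would not find the format, which is the failure mode described above.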

Member

@martijnvg martijnvg left a comment

Sorry for the delay @fcofdez. I left a number of questions.

.getIndexVersionCreated()
.onOrAfter(IndexVersions.TIME_SERIES_USE_STORED_FIELDS_BLOOM_FILTER_FOR_ID);

var legacyBestSpeedCodec = new LegacyPerFieldMapperCodec(Lucene103Codec.Mode.BEST_SPEED, mapperService, bigArrays);
Member

Maybe having just one useSyntheticId if statement with an else clause is clearer than having useSyntheticId checks in several places?

);
codecs.put(
DEFAULT_CODEC,
useSyntheticId ? new ES93TSDBZSTDCompressionLucene103Codec(defaultZstdCodec, bigArrays) : defaultZstdCodec
Member

Is it possible not to cover every codec combination here? Or at least not to do this now. For legacy cases this doesn't seem necessary?

Typically DEFAULT_CODEC is used. We can maybe just enforce this if index.mapping.use_synthetic_id is configured?

Contributor Author

That sounds good to me. I wasn't 100% sure whether we should support all the codec types or could make do with just the default one. I'll implement your idea to reduce the risk surface area; we already discussed that possibility 👍.

Member

@fcofdez what do you think of enforcing ES93TSDBDefaultCompressionLucene103Codec if index.mapping.use_synthetic_id is used? Even if zstd feature flag is enabled? I think that would reduce the size of this change as well.

Separate from this change, I think we need to remove the ZSTD_STORED_FIELDS_FEATURE_FLAG experiment. We have seen inconclusive results when changing to zstd for the default codec, so we should remove it for now.

Contributor Author

that sounds good to me, I'll do that 👍

Contributor Author

@martijnvg I've just pushed a commit leaving only ES93TSDBDefaultCompressionLucene103Codec even when the zstd feature flag is enabled.
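
The resulting choice of default codec can be summarised as a small decision function. The sketch below is illustrative only: `DefaultCodecSelection` and its enum are hypothetical names, and the enum constants merely stand in for the codec classes mentioned in this conversation. It mirrors the order of checks discussed here, where the synthetic id setting wins over the zstd feature flag.

```java
final class DefaultCodecSelection {
    // Hypothetical labels standing in for the codec classes named in the review.
    enum DefaultCodec {
        TSDB_SYNTHETIC_ID_DEFAULT_COMPRESSION, // stands in for ES93TSDBDefaultCompressionLucene103Codec
        ZSTD_BEST_SPEED,                       // stands in for the zstd-based PerFieldMapperCodec
        LEGACY_BEST_SPEED                      // stands in for LegacyPerFieldMapperCodec
    }

    static DefaultCodec choose(boolean useSyntheticId, boolean zstdFeatureFlag) {
        if (useSyntheticId) {
            // Default Lucene compression is used with synthetic ids, even if the zstd flag is on.
            return DefaultCodec.TSDB_SYNTHETIC_ID_DEFAULT_COMPRESSION;
        } else if (zstdFeatureFlag) {
            return DefaultCodec.ZSTD_BEST_SPEED;
        } else {
            return DefaultCodec.LEGACY_BEST_SPEED;
        }
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true));
        System.out.println(choose(false, true));
        System.out.println(choose(false, false));
    }
}
```

Keeping the synthetic id branch first means only one new codec has to be registered for the default slot, which is the scope reduction requested in the review.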


import org.elasticsearch.common.util.BigArrays;

public class ES93TSDBZSTDCompressionLucene103Codec extends TSDBCodecWithSyntheticId {
Member

Based on my comment in CodecService, maybe we just need this codec implementation?

Contributor Author

So is my understanding correct that, even though zstd compression is behind a feature flag, it's the default compression algorithm for stored fields nowadays?

Member

Good point. In released versions we end up using zstd-based stored fields; otherwise we end up using the default fast compression for the default codec. So I think we need to keep both here.


// TODO: fix IndexDiskUsageStats to take into account synthetic _id terms
var checkDiskUsage = false;
if (checkDiskUsage) {
Member

Does this currently fail, or is it inaccurate because the bloom filter's size on disk can't be computed?

Contributor Author

In fact, this has already been fixed by another PR, so I'll remove this condition.

var fieldsProducer = new TSDBSyntheticIdFieldsProducer(state, docValuesProducer);
success = true;
- return fieldsProducer;
+ return bloomFilter == null ? fieldsProducer : new DelegatingBloomFilterFieldsProducer(fieldsProducer, bloomFilter);
Member

If the field is _id, in what cases would the bloomFilter be null?

static class Reader extends StoredFieldsReader implements BloomFilter {
// The bloom filter can be null in cases where the indexed documents
// do not include a field bloomFilterFieldName and thus the bloom filter
// is empty. (This mostly apply for tests).
Member

For which tests is this the case? Ideally if this format gets tested it should be in a setup that doesn't produce nulls?

Contributor Author

It's mostly about BaseStoredFieldsFormatTestCase which is used by TSDBStoredFieldsFormatTests and we don't populate the bloom filter in that case since the base test expects all the fields to be stored. Maybe we can tighten this up once we synthesise _ids (there's a PR coming for that)

@fcofdez
Contributor Author

fcofdez commented Dec 1, 2025

Thanks a lot for the review @martijnvg. I've pushed a commit addressing your comments. I left the two default codecs since I'm not sure if zstd is used always, but happy to remove the default compression codec if we think that's not needed.

@fcofdez fcofdez requested a review from martijnvg December 1, 2025 14:59
import java.io.Closeable;
import java.io.IOException;

public interface BloomFilter extends Closeable {
Contributor

Mainly a style thing, but I think this would be nicer as a functional interface with a no-op static implementation. Something like:

static BloomFilter NO_FILTER = (field, term) -> true;

And then rather than checking for isFilterAvailable() you check for == BloomFilter.NO_FILTER.

Also I don't think it needs to be Closeable any more?
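
A fuller version of this suggestion might look like the sketch below. It is only an illustration under stated assumptions: `SketchBloomFilter` is a hypothetical name, `String` terms replace Lucene's `BytesRef` to keep the example self-contained, and the default no-op `close()` is one possible way to stay lambda-compatible if the interface does need to remain `Closeable`; it is not something the PR is known to do.

```java
import java.io.Closeable;

@FunctionalInterface
interface SketchBloomFilter extends Closeable {
    // Shared no-op instance: every term "may" be present, so nothing is ever filtered out.
    SketchBloomFilter NO_FILTER = (field, term) -> true;

    boolean mayContainTerm(String field, String term);

    // Assumption: a default no-op close leaves mayContainTerm as the only abstract method,
    // so lambdas still work. Implementations that hold a file would override this.
    @Override
    default void close() {}

    // Replaces an isFilterAvailable()-style method with an identity check.
    static boolean isFiltering(SketchBloomFilter filter) {
        return filter != NO_FILTER;
    }
}

final class SketchBloomFilterDemo {
    public static void main(String[] args) {
        SketchBloomFilter onlyDoc1 = (field, term) -> "doc-1".equals(term);
        System.out.println(SketchBloomFilter.isFiltering(SketchBloomFilter.NO_FILTER)); // false
        System.out.println(SketchBloomFilter.isFiltering(onlyDoc1));                    // true
        System.out.println(onlyDoc1.mayContainTerm("_id", "doc-2"));                    // false
    }
}
```

Callers can then treat `NO_FILTER` as "no bloom filter present" without null checks, at the cost of relying on reference equality.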

Contributor Author

Thanks for looking into this @romseygeek. I've removed the isFilterAvailable method, as it is indeed redundant (see 1ff2ccd). Regarding Closeable, we need it so that DelegatingBloomFilterFieldsProducer can close the underlying bloom filter file once it's done with it.

@fcofdez fcofdez requested a review from romseygeek December 4, 2025 11:57
@fcofdez
Contributor Author

fcofdez commented Dec 8, 2025

@martijnvg could you take another look once you have the chance? happy to discuss it sync if that helps. Thanks!

Member

@martijnvg martijnvg left a comment

Thanks @fcofdez, LGTM.
Let's also wait for @romseygeek to complete the review.

if (useSyntheticId) {
// Use the default Lucene compression when the synthetic id is used even if the ZSTD feature flag is enabled
codecs.put(DEFAULT_CODEC, new ES93TSDBDefaultCompressionLucene103Codec(legacyBestSpeedCodec, bigArrays));
} else if (ZSTD_STORED_FIELDS_FEATURE_FLAG) {
Member

I think we should remove this feature flag. I don't think we will use zstd for the default codec in the near future.
But I don't think this affects this PR.

} else {
codecs.put(DEFAULT_CODEC, legacyBestSpeedCodec);
}

Member

nit: unneeded white space.

BEST_COMPRESSION_CODEC,
new PerFieldMapperCodec(Zstd814StoredFieldsFormat.Mode.BEST_COMPRESSION, mapperService, bigArrays)
);
Codec legacyBestCompressionCodec = new LegacyPerFieldMapperCodec(Lucene103Codec.Mode.BEST_COMPRESSION, mapperService, bigArrays);
Member

Only changes formatting? Maybe just keep it the way it is?

Contributor

@romseygeek romseygeek left a comment

LGTM

@fcofdez fcofdez merged commit a0ee98c into elastic:main Dec 11, 2025
34 checks passed
@fcofdez
Contributor Author

fcofdez commented Dec 11, 2025

Thanks all for the reviews!

szybia added a commit to szybia/elasticsearch that referenced this pull request Dec 11, 2025
* upstream/main: (79 commits)
  Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/140_pre_filter_search_shards/prefilter on non-indexed date fields} elastic#139381
  Adjust error bounds for bfloat16 value checks (elastic#139371)
  Unmute some vector CSS tests (elastic#139370)
  Do not allow `project_routing` as a query param (elastic#139206)
  Unmute HalfFloat...Tests#testSynthesizeArrayRandom (elastic#139341)
  Remove leniency in LinkedProjectConfig builder methods (elastic#139012)
  EQL: fix project_routing (elastic#139366)
  Add patch version for 9.2 index version constant (elastic#139362)
  Mute org.elasticsearch.test.rest.yaml.RcsCcsCommonYamlTestSuiteIT test {p0=search.vectors/200_dense_vector_docvalue_fields/dense_vector docvalues with bfloat16} elastic#139368
  ES|QL: Enable CCS tests for FORK (elastic#139302)
  Restructuring the semantic_text field type page  (elastic#138571)
  AggregateMetricDouble fields should not build BKD indexes (elastic#138724)
  Feature/count by trunc with filter (elastic#138765)
  ESQL: Convert TS 500 error to 400 (elastic#139360)
  [CI] Rerun failing tests for periodic build pipelines (elastic#139200)
  revert muting saml test (elastic#139327)
  Add TDigest histogram as metric (elastic#139247)
  Links solved bugs to class cast exception changelog and unmutes errors (elastic#139340)
  Ensure integer sorts are rewritten to long sorts for BWC indexes (elastic#139293)
  Integrate stored fields format bloom filter with synthetic _id (elastic#138515)
  ...

5 participants