Integrate stored fields format bloom filter with synthetic _id#138515
Integrate stored fields format bloom filter with synthetic _id#138515fcofdez merged 32 commits intoelastic:mainfrom
Conversation
|
Hi @fcofdez, I've created a changelog YAML for you. |
tlrx
left a comment
There was a problem hiding this comment.
I need to take a deeper look but overall approach looks good.
...er/src/main/java/org/elasticsearch/index/codec/ES93TSDBDefaultCompressionLucene103Codec.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/codec/ES93TSDBLuceneDefaultCodec.java
Outdated
Show resolved
Hide resolved
| Property.Final | ||
| ); | ||
|
|
||
| public static final boolean USE_STORED_FIELDS_BLOOM_FILTER_FOR_ID_FEATURE_FLAG = new FeatureFlag("stored_field_bloom_filter") |
There was a problem hiding this comment.
Do you think we need another feature flag, or could it be folded with the existing one for synthetic id?
I'm not sure it makes a lot of sense to test one without the other, but maybe I'm missing a point.
There was a problem hiding this comment.
My idea is that the bloom filter is an optimization on top of the synthetic id. But happy to get rid of the feature flag and the index.mapping.use_stored_field_bloom_filter_id index setting if we think that's redundant. It'll simplify the code a bit.
There was a problem hiding this comment.
My idea is that the bloom filter is an optimization on top of the synthetic id
I agree but I think we won't use synthetic ids without a bloom filter on top of it, and having two features flags complicate the code. If that's OK, I would prefer use only one feature flag for both.
I won't block the PR for this so if you want to keep it that's OK too.
There was a problem hiding this comment.
I got rid of the setting and feature flag in 799fb3a
|
|
||
| } | ||
|
|
||
| private enum StorageMode { |
There was a problem hiding this comment.
I find this storage mode a bit confusing. Maybe a useBloomFilterSyntheticId local variable would be simpler?
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
| private final TSDBStoredFieldsFormat storedFieldsFormat; | ||
|
|
||
| TSDBCodecWithSyntheticId(String name, Codec delegate, BigArrays bigArrays) { | ||
| super(name, new TSDBSyntheticIdCodec(delegate)); |
There was a problem hiding this comment.
I'm planning to incorporate the code from TSDBSyntheticIdCodec into this class in a follow-up PR. But I wanted to keep the change size under control.
| boolean useSyntheticId = IndexSettings.TSDB_SYNTHETIC_ID_FEATURE_FLAG | ||
| && mapperService != null | ||
| && mapperService.getIndexSettings().useTimeSeriesSyntheticId() | ||
| && mapperService.getIndexSettings().getMode() == IndexMode.TIME_SERIES; |
There was a problem hiding this comment.
mapperService.getIndexSettings().useTimeSeriesSyntheticId() already ensure that the index is a time-series index and that the feature flag is enabled.
| private final TSDBStoredFieldsFormat storedFieldsFormat; | ||
|
|
||
| TSDBCodecWithSyntheticId(String name, Codec delegate, BigArrays bigArrays) { | ||
| super(name, new TSDBSyntheticIdCodec(delegate)); |
There was a problem hiding this comment.
We can merge TSDBSyntheticIdCodec and TSDBCodecWithSyntheticId together in a follow up.
| return new FilterLeafReader.FilterTerms(delegate.terms(field)) { | ||
| @Override | ||
| public TermsEnum iterator() throws IOException { | ||
| return new LazyFilterTermsEnum() { | ||
| private TermsEnum delegate; | ||
|
|
||
| @Override | ||
| protected TermsEnum getDelegate() throws IOException { | ||
| if (delegate == null) { | ||
| delegate = in.iterator(); | ||
| } | ||
| return delegate; | ||
| } |
There was a problem hiding this comment.
nit: I've been confused by the two delegate (the on in lazy and the one in the bloom filter) and what in was referencing to.
Maybe something like this would help?
final Terms terms = delegate.terms(field);
return new FilterLeafReader.FilterTerms(terms) {
@Override
public TermsEnum iterator() throws IOException {
return new LazyFilterTermsEnum() {
private TermsEnum termsEnum;
@Override
protected TermsEnum getDelegate() throws IOException {
if (termsEnum == null) {
termsEnum = terms.iterator();
}
return termsEnum;
}
@Override
public boolean seekExact(BytesRef text) throws IOException {
if (bloomFilter.mayContainTerm(field, text) == false) {
return false;
}
return getDelegate().seekExact(text);
}
};
}
};
server/src/main/java/org/elasticsearch/index/codec/storedfields/TSDBStoredFieldsFormat.java
Show resolved
Hide resolved
| var legacyBestSpeedCodec = new LegacyPerFieldMapperCodec(Lucene103Codec.Mode.BEST_SPEED, mapperService, bigArrays); | ||
| if (ZSTD_STORED_FIELDS_FEATURE_FLAG) { | ||
| codecs.put(DEFAULT_CODEC, new PerFieldMapperCodec(Zstd814StoredFieldsFormat.Mode.BEST_SPEED, mapperService, bigArrays)); | ||
| PerFieldMapperCodec defaultZstdCodec = new PerFieldMapperCodec( |
There was a problem hiding this comment.
If we want to reduce the scope of this change, we could create our own default_code_with_synthetic_id and hard-coded this in INDEX_CODEC_SETTING for all time-series with use_synthetic_id enabled.
Here we go for the complete solution immediately, for which I'm ok too.
There was a problem hiding this comment.
I don't have a strong opinion about this. I'm ok with both approaches. The downside of an extra codec is that we need to maintain it indefinitely whereas with this change as long as the feature flag is off we keep the current behaviour.
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
Show resolved
Hide resolved
| * @see StoredFieldsFormat | ||
| */ | ||
| public class TSDBStoredFieldsFormat extends StoredFieldsFormat { | ||
| private final StoredFieldsFormat storedFieldsFormat; |
|
|
||
| TSDBStoredFieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException { | ||
| boolean success = false; | ||
| List<Closeable> toClose = new ArrayList<>(); |
There was a problem hiding this comment.
nit:
| List<Closeable> toClose = new ArrayList<>(); | |
| List<Closeable> toClose = new ArrayList<>(2); |
|
|
||
| TSDBStoredFieldsReader(Directory directory, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException { | ||
| boolean success = false; | ||
| List<Closeable> toClose = new ArrayList<>(); |
There was a problem hiding this comment.
| List<Closeable> toClose = new ArrayList<>(); | |
| List<Closeable> toClose = new ArrayList<>(2); |
...a-streams/src/internalClusterTest/java/org/elasticsearch/datastreams/TSDBSyntheticIdsIT.java
Outdated
Show resolved
Hide resolved
|
@fcofdez I just noticed that This is something I missed when I introduced TSDBSyntheticIdPostingsFormat. |
|
@martijnvg It would be great if you could take a look into this PR once you have some time. Thanks! |
I added |
| .getIndexVersionCreated() | ||
| .onOrAfter(IndexVersions.TIME_SERIES_USE_STORED_FIELDS_BLOOM_FILTER_FOR_ID); | ||
|
|
||
| var legacyBestSpeedCodec = new LegacyPerFieldMapperCodec(Lucene103Codec.Mode.BEST_SPEED, mapperService, bigArrays); |
There was a problem hiding this comment.
maybe just having one useSyntheticId if statement with an else clause is clearer then having useSyntheticId checks in several places?
| ); | ||
| codecs.put( | ||
| DEFAULT_CODEC, | ||
| useSyntheticId ? new ES93TSDBZSTDCompressionLucene103Codec(defaultZstdCodec, bigArrays) : defaultZstdCodec |
There was a problem hiding this comment.
Is is possible to not every codec combination here? Or at least not do this now. For legacy cases this doesn't seem necessary?
Typically DEFAULT_CODEC is used. We can maybe just enforce this if index.mapping.use_synthetic_id is configured?
There was a problem hiding this comment.
That sounds good to me, I wasn't 100% sure if we should support all the codec types or if we could do with just the default one. I'll implement your idea so we reduce the risk surface area, we already discussed that possibility 👍.
There was a problem hiding this comment.
@fcofdez what do you think of enforcing ES93TSDBDefaultCompressionLucene103Codec if index.mapping.use_synthetic_id is used? Even if zstd feature flag is enabled? I think that would reduce the size of this change as well.
Seperate from this change, I think we need to remove ZSTD_STORED_FIELDS_FEATURE_FLAG experiment. We seen inclusive results changing to zstd for default codec and so we should for now remove it.
There was a problem hiding this comment.
that sounds good to me, I'll do that 👍
There was a problem hiding this comment.
@martijnvg I've just pushed a commit leaving only ES93TSDBDefaultCompressionLucene103Codec even when the zstd feature flag is enabled.
|
|
||
| import org.elasticsearch.common.util.BigArrays; | ||
|
|
||
| public class ES93TSDBZSTDCompressionLucene103Codec extends TSDBCodecWithSyntheticId { |
There was a problem hiding this comment.
Based on my comment in CodecService, maybe we just need this codec implementation?
There was a problem hiding this comment.
So it's my understanding correct that even if zstd compression is behind a feature flag, that's the default compression algorithm for stored field nowadays?
There was a problem hiding this comment.
Good point. In released versions we endup using zstd based stored fields otherwise endup using default fast compression for default codec. So I think we need to keep both here.
|
|
||
| // TODO: fix IndexDiskUsageStats to take into account synthetic _id terms | ||
| var checkDiskUsage = false; | ||
| if (checkDiskUsage) { |
There was a problem hiding this comment.
Does this fail currently or is in accurate because bloom filter size on disk can't be computed?
There was a problem hiding this comment.
In fact this's been already fix by another PR, so I'll remove this condition.
| var fieldsProducer = new TSDBSyntheticIdFieldsProducer(state, docValuesProducer); | ||
| success = true; | ||
| return fieldsProducer; | ||
| return bloomFilter == null ? fieldsProducer : new DelegatingBloomFilterFieldsProducer(fieldsProducer, bloomFilter); |
There was a problem hiding this comment.
If field is _id then in what cases would the bloomFilter be null?
| static class Reader extends StoredFieldsReader implements BloomFilter { | ||
| // The bloom filter can be null in cases where the indexed documents | ||
| // do not include a field bloomFilterFieldName and thus the bloom filter | ||
| // is empty. (This mostly apply for tests). |
There was a problem hiding this comment.
For which tests is this the case? Ideally if this format gets tested it should be in a setup that doesn't produce nulls?
There was a problem hiding this comment.
It's mostly about BaseStoredFieldsFormatTestCase which is used by TSDBStoredFieldsFormatTests and we don't populate the bloom filter in that case since the base test expects all the fields to be stored. Maybe we can tighten this up once we synthesise _ids (there's a PR coming for that)
|
Thanks a lot for the review @martijnvg. I've pushed a commit addressing your comments. I left the two default codecs since I'm not sure if zstd is used always, but happy to remove the default compression codec if we think that's not needed. |
| import java.io.Closeable; | ||
| import java.io.IOException; | ||
|
|
||
| public interface BloomFilter extends Closeable { |
There was a problem hiding this comment.
Mainly a style thing, but I think this would be nicer as a function interface with a no-op static implementation. Something like:
static BloomFilter NO_FILTER = (field, term) -> true;
And then rather than checking for isFilterAvailable() you check for == BloomFilter.NO_FILTER.
Also I don't think it needs to be Closeable any more?
There was a problem hiding this comment.
Thanks for looking into this @romseygeek, I've removed the isFilterAvailable method as is indeed redundant (see 1ff2ccd). Regarding the Closeable we need it for DelegatingBloomFilterFieldsProducer to close the underlying bloom filter file once it's done with it.
|
@martijnvg could you take another look once you have the chance? happy to discuss it sync if that helps. Thanks! |
martijnvg
left a comment
There was a problem hiding this comment.
Thanks @fcofdez, LGTM.
Let's also wait for @romseygeek to complete the review.
| if (useSyntheticId) { | ||
| // Use the default Lucene compression when the synthetic id is used even if the ZSTD feature flag is enabled | ||
| codecs.put(DEFAULT_CODEC, new ES93TSDBDefaultCompressionLucene103Codec(legacyBestSpeedCodec, bigArrays)); | ||
| } else if (ZSTD_STORED_FIELDS_FEATURE_FLAG) { |
There was a problem hiding this comment.
I think we should remove this feature flag. I don't we will in the near term future use zstd for default codec.
But I don't think this affects this PR.
| } else { | ||
| codecs.put(DEFAULT_CODEC, legacyBestSpeedCodec); | ||
| } | ||
|
|
| BEST_COMPRESSION_CODEC, | ||
| new PerFieldMapperCodec(Zstd814StoredFieldsFormat.Mode.BEST_COMPRESSION, mapperService, bigArrays) | ||
| ); | ||
| Codec legacyBestCompressionCodec = new LegacyPerFieldMapperCodec(Lucene103Codec.Mode.BEST_COMPRESSION, mapperService, bigArrays); |
There was a problem hiding this comment.
Only changes formatting? Maybe just keep it the way it is?
|
Thanks all for the reviews! |
* upstream/main: (79 commits)
Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/140_pre_filter_search_shards/prefilter on non-indexed date fields} elastic#139381
Adjust error bounds for bfloat16 value checks (elastic#139371)
Unmute some vector CSS tests (elastic#139370)
Do not allow `project_routing` as a query param (elastic#139206)
Unmute HalfFloat...Tests#testSynthesizeArrayRandom (elastic#139341)
Remove leniency in LinkedProjectConfig builder methods (elastic#139012)
EQL: fix project_routing (elastic#139366)
Add patch version for 9.2 index version constant (elastic#139362)
Mute org.elasticsearch.test.rest.yaml.RcsCcsCommonYamlTestSuiteIT test {p0=search.vectors/200_dense_vector_docvalue_fields/dense_vector docvalues with bfloat16} elastic#139368
ES|QL: Enable CCS tests for FORK (elastic#139302)
Restructuring the semantic_text field type page (elastic#138571)
AggregateMetricDouble fields should not build BKD indexes (elastic#138724)
Feature/count by trunc with filter (elastic#138765)
ESQL: Convert TS 500 error to 400 (elastic#139360)
[CI] Rerun failing tests for periodic build pipelines (elastic#139200)
revert muting saml test (elastic#139327)
Add TDigest histogram as metric (elastic#139247)
Links solved bugs to class cast exception changelog and unmutes errors (elastic#139340)
Ensure integer sorts are rewritten to long sorts for BWC indexes (elastic#139293)
Integrate stored fields format bloom filter with synthetic _id (elastic#138515)
...
This PR integrates
ES93BloomFilterStoredFieldsFormatwith the synthetic _id lookups. For that, it introduces a new set ofCodecs meant to be used only byTIME_SERIESindices. These new codecs are necessary to cover the case when the codec is loaded through SPI (i.e. after a shard relocation or node restarts).The new codecs just wrap the existing codecs and extend them with the necessary plumbing to populate the bloom filter during indexing.