Single loop for FieldfInfo processing#137967
Conversation
No codec should extend DeduplicatingFieldInfoCodec, instead always wrap it.
Previously both DeduplicatingFieldInfosFormat and TSDBSyntheticIdCodec.RewriteFieldInfosFormat would iterate over FieldInfos. This is now optimised by letting TSDBSyntheticIdCodec.RewriteFieldInfosFormat extend DeduplicatingFieldInfosFormat in order to let RewriteFieldInfosFormat utilise the iteration done by DeduplicatingFieldInfosFormat. Also let TSDBSyntheticIdCodec extend DeduplicateFieldInfosCodec so that we can simplify the codec wrapping to always use only one of them.
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
|
Hi @burqen, I've created a changelog YAML for you. |
| return codec; | ||
| String name = e.getValue().getName(); | ||
| Codec codec = e.getValue(); | ||
| return useTsdbSyntheticId ? new TSDBSyntheticIdCodec(name, codec) : new DeduplicateFieldInfosCodec(name, codec); |
There was a problem hiding this comment.
nit:
| return useTsdbSyntheticId ? new TSDBSyntheticIdCodec(name, codec) : new DeduplicateFieldInfosCodec(name, codec); | |
| return useTsdbSyntheticId ? new TSDBSyntheticIdCodec(codec) : new DeduplicateFieldInfosCodec(codec); |
| protected DeduplicateFieldInfosCodec(String name, Codec delegate) { | ||
| super(name, delegate); | ||
| this.fieldInfosFormat = createFieldInfosFormat(delegate.fieldInfosFormat()); | ||
| } |
There was a problem hiding this comment.
| protected DeduplicateFieldInfosCodec(String name, Codec delegate) { | |
| super(name, delegate); | |
| this.fieldInfosFormat = createFieldInfosFormat(delegate.fieldInfosFormat()); | |
| } | |
| protected DeduplicateFieldInfosCodec(Codec delegate) { | |
| super(delegate.getName(), delegate); | |
| this.fieldInfosFormat = createFieldInfosFormat(delegate.fieldInfosFormat()); | |
| } |
|
|
||
| protected void validateFieldInfos(FieldInfos fieldInfos) {} | ||
|
|
||
| protected FieldInfo processFieldInfo(FieldInfo fi) { |
There was a problem hiding this comment.
nit:
| protected FieldInfo processFieldInfo(FieldInfo fi) { | |
| protected FieldInfo wrapFieldInfo(FieldInfo fi) { |
|
|
||
| private static Codec unwrappedCodec(CodecService codecService, String codecName) { | ||
| Codec codec = codecService.codec(codecName); | ||
| if (codec instanceof DeduplicateFieldInfosCodec deduplicatingCodec) { |
There was a problem hiding this comment.
I think this can be:
| if (codec instanceof DeduplicateFieldInfosCodec deduplicatingCodec) { | |
| if (codec instanceof FilterCoded filtered) { |
There was a problem hiding this comment.
Unfortunately not, the delegate is hidden from us there.
fcofdez
left a comment
There was a problem hiding this comment.
I'm afraid that we should revert the change around not extending DeduplicateFieldInfosCodec in the default codecs. The reason is that Lucene would use SPI to load the Codec and it will just instantiate the codec with the no-args constructor and thus we won't get to deduplicate the fields. This only applies when a node is restarted for example and we need to read the codec from the SegmentInfo and I guess that it applies to search nodes in serverless too.
I can see that Lucene is used to load the codecs here, but they will all be wrapped afterwards. Is there some other place as well where the codecs are service loaded outside of |
Yes, in Lucene, when a commit is read from disk (see https://github.com/apache/lucene/blob/e02bdb4c3c547488342b423e1b9b2b25519bd427/lucene/core/src/java/org/apache/lucene/index/SegmentInfos.java#L409-L412). We kind of rely implicitly that for reading a segment with a particular codec we can use the SPI loaded codec. |
Got it! Thanks for pointing that out. I'll revert and find some other approach. |
|
Good observation @fcofdez, this indeed makes wrapping approach not effective. But it should still be possible to loop once over field info in |
|
@martijnvg The problem is that the I tried a more brute force approach
This makes me think it's a bad idea to make deduplicating codec aware of of synthetic id all together. But I cannot figure out a way to get rid of the extra loop in any other way. All my ideas center around in one way or the other inject functionality into This makes me come back to your comment @fcofdez . How important is it that the service loaded codec benefit from the deduplication? They will still be loaded, only they will not be wrapped with the deduplication. How does that weigh against this optimization that we are trying to do here? Do you have any ideas? @fcofdez , @tlrx , @martijnvg ? About PostingsFormat: |
|
We discussed this offline and agreed on maybe relying on the field attributed defined in |
|
Replaced by #138515 |
Previously both DeduplicatingFieldInfosFormat and TSDBSyntheticIdCodec.RewriteFieldInfosFormat would iterate over FieldInfos.
This is now optimised by letting TSDBSyntheticIdCodec.RewriteFieldInfosFormat extend DeduplicatingFieldInfosFormat in order to let RewriteFieldInfosFormat utilise the iteration done by DeduplicatingFieldInfosFormat.
Also let TSDBSyntheticIdCodec extend DeduplicateFieldInfosCodec so that we can simplify the codec wrapping to always use only one of them.
Additional changes: