Speed up loading keyword fields with index sorts#132950
Speed up loading keyword fields with index sorts#132950martijnvg merged 8 commits intoelastic:mainfrom
Conversation
ce26e27 to
b2ae4bf
Compare
b2ae4bf to
503b524
Compare
3081cfa to
8639922
Compare
165df8b to
e24a8e0
Compare
martijnvg
left a comment
There was a problem hiding this comment.
Thanks this looks great - LGTM.
| try (var reader = DirectoryReader.open(iw)) { | ||
| int gaugeIndex = numDocs; | ||
| for (var leaf : reader.leaves()) { | ||
| var timestampDV = getBulkNumericDocValues(leaf.reader(), timestampField); |
There was a problem hiding this comment.
Like we talked about we need to adjust these test to also use sorted(set) doc values.
| readFields(in, state.fieldInfos); | ||
|
|
||
| final var indexSort = state.segmentInfo.getIndexSort(); | ||
| if (indexSort != null && indexSort.getSort().length > 0) { |
There was a problem hiding this comment.
👍 - good idea to extract this information from segment info, no need to check for index setting.
| if (lookaheadBlockIndex + 1 != blockIndex) { | ||
| lookaheadData.seek(indexReader.get(blockIndex)); | ||
| } | ||
| if (maxOrd >= 0) { |
There was a problem hiding this comment.
Currently this is only invoked from getSorted(...) and in that case maxOrd is always >= 0.
I think we should keep this branch, otherwise we can't reuse lookAheadValueAt(...) for numeric doc value.
There was a problem hiding this comment.
Maybe replace this with a switch statement:
switch (maxOrdAsInt) {
case -1:
decoder.decode(lookaheadData, lookaheadBlock);
break;
default:
decoder.decodeOrdinals(lookaheadData, lookaheadBlock, bitsPerOrd);
}
I think this is a little bit more efficient?
There was a problem hiding this comment.
👍 - I suspect that is also easier to be optimized at runtime by jvm.
| * to load the requested values, for example due to unsupported underlying data. | ||
| * This allows callers to optimistically try optimized loading strategies first, and fall back if necessary. | ||
| */ | ||
| interface OptionalColumnAtATimeReader { |
There was a problem hiding this comment.
👍 This is better than previous abstraction (BulkNumericDocValues).
| } | ||
|
|
||
| @Override | ||
| public BytesRefBlock constantBytes(BytesRef value, int count) { |
There was a problem hiding this comment.
Could this also be used in SingletonOrdinalsBuilder#tryBuildConstantBlock(...)?
There was a problem hiding this comment.
Great point. Unfortunately, we don't have DelegatingBlockLoaderFactory here. I will move this to BlockFactory in a follow-up.
| final class ES819TSDBDocValuesProducer extends DocValuesProducer { | ||
| final IntObjectHashMap<NumericEntry> numerics; | ||
| private int primarySortFieldNumber = -1; | ||
| private boolean primarySortFieldReversed = false; |
There was a problem hiding this comment.
Note that this field can be converted to a variable.
|
|
||
| @Override | ||
| long lookAheadValueAt(int targetDoc) throws IOException { | ||
| return 0L; // Only one ordinal! |
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
|
Hi @dnhatn, I've created a changelog YAML for you. |
* upstream/main: (32 commits) Speed up loading keyword fields with index sorts (elastic#132950) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testSyntheticSourceWithTranslogSnapshot elastic#132964 Simplify EsqlSession (elastic#132848) Implement WriteLoadConstraintDecider#canAllocate (elastic#132041) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/400_synthetic_source/_doc_count} elastic#132965 Switch to PR-based benchmark pipeline defined in ES repo (elastic#132941) Breakdown undesired allocations by shard routing role (elastic#132235) Implement v_magnitude function (elastic#132765) Introduce execution location marker for better handling of remote/local compatibility (elastic#132205) Mute org.elasticsearch.cluster.ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo elastic#132957 Unmuting simulate index data stream mapping overrides yaml rest test (elastic#132946) Remove CrossClusterCancellationIT.createLocalIndex() (elastic#132952) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetch elastic#132956 Fix failing UT by adding a required capability (elastic#132947) Precompute the BitsetCacheKey hashCode (elastic#132875) Adding simulate ingest effective mapping (elastic#132833) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetchMany elastic#132948 Rename skipping logic to remove hard link to skip_unavailable (elastic#132861) Store ignored source in unique stored fields per entry (elastic#132142) Add random tests with match_only_text multi-field (elastic#132380) ...
Reading keyword fields that are the primary sort in the index can be sped up by skip reading, as identical values are stored together. In this case, we can use values from the doc_values skipper instead of loading values from doc_values. However, the doc_values skipper is not enabled yet. Here, we use two buffers when reading ordinals: one for the beginning of the block and one for the end. If both return the same value, we can skip the middle. There is a follow-up step where we fill the values in the middle until we reach the last value. This optimization should speed up time-series queries.
…-stats * upstream/main: (36 commits) Fix reproducability of builds against Java EA versions (elastic#132847) Speed up loading keyword fields with index sorts (elastic#132950) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testSyntheticSourceWithTranslogSnapshot elastic#132964 Simplify EsqlSession (elastic#132848) Implement WriteLoadConstraintDecider#canAllocate (elastic#132041) Mute org.elasticsearch.test.rest.yaml.CcsCommonYamlTestSuiteIT test {p0=search/400_synthetic_source/_doc_count} elastic#132965 Switch to PR-based benchmark pipeline defined in ES repo (elastic#132941) Breakdown undesired allocations by shard routing role (elastic#132235) Implement v_magnitude function (elastic#132765) Introduce execution location marker for better handling of remote/local compatibility (elastic#132205) Mute org.elasticsearch.cluster.ClusterInfoServiceIT testMaxQueueLatenciesInClusterInfo elastic#132957 Unmuting simulate index data stream mapping overrides yaml rest test (elastic#132946) Remove CrossClusterCancellationIT.createLocalIndex() (elastic#132952) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetch elastic#132956 Fix failing UT by adding a required capability (elastic#132947) Precompute the BitsetCacheKey hashCode (elastic#132875) Adding simulate ingest effective mapping (elastic#132833) Mute org.elasticsearch.index.mapper.LongFieldMapperTests testFetchMany elastic#132948 Rename skipping logic to remove hard link to skip_unavailable (elastic#132861) Store ignored source in unique stored fields per entry (elastic#132142) ...
Reading keyword fields that are the primary sort in the index can be sped up by skip reading, as identical values are stored together. In this case, we can use values from the doc_values skipper instead of loading values from doc_values. However, the doc_values skipper is not enabled yet. Here, we use two buffers when reading ordinals: one for the beginning of the block and one for the end. If both return the same value, we can skip the middle. There is a follow-up step where we fill the values in the middle until we reach the last value. This optimization should speed up time-series queries.