Sparse doc values index for LogsDB host.name field #120741
server/src/test/java/org/elasticsearch/index/mapper/DocumentParserContextTests.java
server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java
| @Override | ||
| public KeywordFieldMapper build(MapperBuilderContext context) { | ||
| FieldType fieldtype = new FieldType(Defaults.FIELD_TYPE); | ||
| FieldType fieldtype = fieldType(indexSortConfig, indexMode, context.buildFullName(leafName())); |
Depending on subobjects the value of leafName could include the parent (before the dot) or not. We need the full name to check if we are dealing with host.name no matter the subobjects setting.
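To illustrate the point about leaf names, here is a minimal sketch. `FullNameSketch` and its `buildFullName(parentPath, leafName)` helper are hypothetical stand-ins, not the Elasticsearch `MapperBuilderContext` API; they only show why the check has to run against the full dotted path.

```java
// Hypothetical stand-in for the builder context; not the Elasticsearch class.
final class FullNameSketch {

    /** Joins the parent path and the leaf name into a dotted field path. */
    static String buildFullName(String parentPath, String leafName) {
        return parentPath.isEmpty() ? leafName : parentPath + "." + leafName;
    }

    /** True when the full path, not just the leaf, is host.name. */
    static boolean isHostName(String parentPath, String leafName) {
        return "host.name".equals(buildFullName(parentPath, leafName));
    }

    public static void main(String[] args) {
        // Subobjects enabled: "host" is the parent object and the leaf is "name".
        System.out.println(isHostName("host", "name"));      // true
        // Subobjects disabled: the leaf itself carries the dotted name.
        System.out.println(isHostName("", "host.name"));     // true
        // Comparing only the leaf name would miss the first case.
        System.out.println("host.name".equals("name"));      // false
    }
}
```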
| // deduplicate in the common default case to save some memory | ||
| fieldtype = Defaults.FIELD_TYPE; | ||
| } | ||
| if (fieldtype.equals(Defaults.FIELD_TYPE_WITH_SKIP_DOC_VALUES)) { |
An optimization as above.
| final IndexMode indexMode, | ||
| final String fullFieldName | ||
| ) { | ||
| return (defaultIndexedAndDocValues() || isNotIndexedAndHasDocValues()) |
Here I also enable the sparse index if the user explicitly defines the field not to be indexed and with doc values. In that case we pay a little price in terms of storage footprint to take advantage of the sparse index.
| this.scriptValues = builder.scriptValues(); | ||
| this.isDimension = builder.dimension.getValue(); | ||
| this.isSyntheticSource = isSyntheticSource; | ||
| this.hasDocValuesSparseIndex = DocValuesSkipIndexType.NONE.equals(fieldType.docValuesSkipIndexType()) == false; |
Checking this way we make sure the boolean value is correct even if new skip indices other than RANGE are introduced in Lucene in future releases.
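A small sketch of why the check is written against NONE rather than RANGE. The enum below is a local stand-in for Lucene's `DocValuesSkipIndexType`, and `FUTURE_TYPE` is a hypothetical constant added only to show the difference between the two checks.

```java
final class SkipIndexCheckSketch {

    // Local stand-in for Lucene's DocValuesSkipIndexType; FUTURE_TYPE is hypothetical.
    enum SkipIndexType { NONE, RANGE, FUTURE_TYPE }

    /** The form used in the PR: anything other than NONE counts as a sparse index. */
    static boolean hasSparseIndexViaNotNone(SkipIndexType type) {
        return SkipIndexType.NONE.equals(type) == false;
    }

    /** The alternative: only RANGE counts, so a new type would be missed. */
    static boolean hasSparseIndexViaRange(SkipIndexType type) {
        return SkipIndexType.RANGE.equals(type);
    }

    public static void main(String[] args) {
        for (SkipIndexType type : SkipIndexType.values()) {
            System.out.println(type + ": notNone=" + hasSparseIndexViaNotNone(type)
                + ", range=" + hasSparseIndexViaRange(type));
        }
    }
}
```

Both checks agree for NONE and RANGE; they diverge only for a type that does not exist today.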
Pinging @elastic/es-storage-engine (Team:StorageEngine)
| ); | ||
|
|
||
| final KeywordFieldMapper mapper = (KeywordFieldMapper) mapperService.documentMapper().mappers().getMapper("host.name"); | ||
| assertTrue(mapper.fieldType().hasDocValuesSparseIndex()); |
In this case we enable the doc values sparse index even if the user explicitly configures index: false and doc_values: true.
martijnvg left a comment
A sparse doc values index on the host.name field provides the following benefits:
Improves query performance when the host.name field is used. The sparse index allows skipping over irrelevant documents based on doc values, which reduces query latency and resources required to execute the query.
It reduces compute costs at query time, lowering latency for queries that take advantage of the sparse index.
As it is used when the host.name field is included in the primary sort configuration, the sparse index aligns with sorting requirements, further enhancing efficiency during data retrieval and aggregation.
I don't think querying will be faster than when there is an inverted index. Part of this exercise is to see how much query performance will be affected. However, given that host.name is the primary sort field, I think/hope that the drop in query performance will be acceptable.
Note that this PR can't be backported to 8.x, since it relies on doc values skippers, which are a Lucene 10 only feature.
| return isIndexed; | ||
| } | ||
|
|
||
| public boolean hasDocValuesSparseIndex() { |
I think for now, we don't have to add a new method to this base class? I see this is only used in tests and then we know the concrete class?
| && isPrimarySortField(indexSortConfig); | ||
| } | ||
|
|
||
| private boolean isHostNameField(final String fullFieldName) { |
I think it is more readable if these sub conditions are used directly in the return statement (^)? The sub conditions aren't long and moving them to private methods doesn't buy much?
I separated them because there are many of those...but if you prefer I can extract them.
My assumption is that for some queries at least, since doc values are sorted, the sparse index can allow skipping entire segments in some cases...so some filtering queries, for instance, might actually be faster? Think for instance having a segment where
Right, that is true. However, I don't think it can be faster than a term query on an indexed field, since an inverted index can also skip over segments / many docids that don't match. I expect at best similar performance. But I may be wrong here. I think the main point is to figure out what exactly the effect is.
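As a rough illustration of the skipping idea discussed in this exchange, the sketch below keeps a min/max pair per block of sorted values and skips any block whose range cannot contain the term. This is a simplified model under assumed names (`SparseIndexSkipSketch`, `Block`), not how Lucene implements doc values skippers, and it says nothing about how the cost compares to an inverted index.

```java
import java.util.List;

final class SparseIndexSkipSketch {

    /** A block of consecutive documents, sorted by value, with its min and max. */
    static final class Block {
        final String min;
        final String max;
        final List<String> values;

        Block(List<String> sortedValues) {
            this.values = sortedValues;
            this.min = sortedValues.get(0);
            this.max = sortedValues.get(sortedValues.size() - 1);
        }
    }

    /** Counts matching values; skipped[0] receives the number of blocks never scanned. */
    static int countMatches(List<Block> blocks, String term, int[] skipped) {
        int matches = 0;
        for (Block block : blocks) {
            // The term lies outside [min, max], so no value in this block can match.
            if (term.compareTo(block.min) < 0 || term.compareTo(block.max) > 0) {
                skipped[0]++;
                continue;
            }
            for (String value : block.values) {
                if (value.equals(term)) {
                    matches++;
                }
            }
        }
        return matches;
    }

    /** Sample data: host names already sorted, as they would be under a primary sort. */
    static List<Block> sampleBlocks() {
        return List.of(
            new Block(List.of("host-a", "host-a", "host-b")),
            new Block(List.of("host-b", "host-c", "host-c")),
            new Block(List.of("host-d", "host-d", "host-e"))
        );
    }

    public static void main(String[] args) {
        int[] skipped = new int[1];
        int matches = countMatches(sampleBlocks(), "host-c", skipped);
        System.out.println("matches=" + matches + ", skipped blocks=" + skipped[0]);
    }
}
```

Because the data is sorted on the field, matching values cluster in few blocks, which is what makes the min/max ranges selective.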
martijnvg left a comment
Thanks Salvatore, I left a few questions about the cases in which the sparse index should be enabled.
| boolean isHostNameField = HOST_NAME.equals(fullFieldName); | ||
| boolean isPrimarySortField = indexSortConfig != null && indexSortConfig.hasPrimarySortOnField(HOST_NAME); | ||
|
|
||
| return (isIndexedAndDocValuesDefault || isNotIndexedAndHasDocValues) && isLogsDbMode && isHostNameField && isPrimarySortField; |
I think we want to enable sparse index only if index has not been configured and doc values isn't disabled.
So I think this is easier:
indexed.isConfigured() == false && hasDocValues.getValue() == false && isLogsDbMode && isHostNameField && isPrimarySortField;
?
hasDocValues.getValue() == false? We need doc values to create the sparse index...
| ); | ||
|
|
||
| final KeywordFieldMapper mapper = (KeywordFieldMapper) mapperService.documentMapper().mappers().getMapper("host.name"); | ||
| assertFalse(mapper.fieldType().hasDocValuesSparseIndex()); |
I think supporting sparse index in this case is ok?
I interpreted this and the other test below as "if there are doc values and all other conditions hold true we would like to have the sparse index". Anyway, I think we can also decide that if index: false we create neither the inverted index nor the sparse index.
Anyway, I think we can also decide that if index: false we create neither the inverted index nor the sparse index.
Let's do that?
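One possible reading of the rule agreed in this thread, written as a standalone predicate. The class, method, and parameter names are hypothetical and the merged code may differ; the sketch only encodes "no sparse index when index has been explicitly configured or doc values are off, and only for host.name as primary sort field in LogsDB mode".

```java
final class SparseIndexDecisionSketch {

    private static final String HOST_NAME = "host.name";

    /**
     * Hypothetical predicate, not the KeywordFieldMapper implementation.
     * indexExplicitlyConfigured models "the user touched the index parameter".
     */
    static boolean useSparseIndex(
        boolean indexExplicitlyConfigured,
        boolean hasDocValues,
        boolean isLogsDbMode,
        String fullFieldName,
        String primarySortField
    ) {
        return indexExplicitlyConfigured == false
            && hasDocValues
            && isLogsDbMode
            && HOST_NAME.equals(fullFieldName)
            && HOST_NAME.equals(primarySortField);
    }

    public static void main(String[] args) {
        // Defaults in LogsDB with host.name as primary sort field.
        System.out.println(useSparseIndex(false, true, true, "host.name", "host.name"));
        // The user configured index explicitly: no sparse index under this reading.
        System.out.println(useSparseIndex(true, true, true, "host.name", "host.name"));
        // Doc values disabled: nothing to build the sparse index on.
        System.out.println(useSparseIndex(false, false, true, "host.name", "host.name"));
    }
}
```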
| ); | ||
|
|
||
| final KeywordFieldMapper mapper = (KeywordFieldMapper) mapperService.documentMapper().mappers().getMapper("host.name"); | ||
| assertTrue(mapper.fieldType().hasDocValuesSparseIndex()); |
I wonder whether the sparse index should be enabled when index is explicitly set to false?
See my comment above.
| return isSet && Objects.equals(value, getDefaultValue()) == false; | ||
| } | ||
|
|
||
| public boolean isSet() { |
Note that I had to add this method to be able to detect the situation where the index parameter is explicitly set to true (the default). The existing isConfigured method returns true only if the value is set and also differs from the default (which is not our case).
Do getValue() and isConfigured() indicate the same thing? Configured specifically to true (even if the attribute defaults to true)?
isConfigured is not true...
public boolean isConfigured() {
return isSet && Objects.equals(value, getDefaultValue()) == false;
}
Objects.equals in the second condition evaluates to true because value=true and getDefaultValue=true, so it is not false as required by == false, and the second condition fails. This implementation of isConfigured does not really make sense to me.
So
isSet=true
value=true
getDefaultValue=true
this returns FALSE because true && (true == false) evaluates to false.
IMO the error is in && used instead of ||...but I am really surprised.
The correct implementation should be
public boolean isConfigured() {
return isSet || Objects.equals(value, getDefaultValue()) == false;
}
Even if, when Objects.equals(value, getDefaultValue()) is false probably also isSet is true (isSet must be true if value is not the default, because it means a user explicitly set the non-default value).
Anyway if I try this change I have a lot of test failures...so maybe I am missing something?
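The difference between the two methods can be reproduced with a small standalone class. `ParameterSketch` is a hypothetical reduction that keeps only the three fields discussed in this thread; it is not the Elasticsearch `Parameter` class.

```java
import java.util.Objects;

// Hypothetical reduction of a mapping parameter; not the Elasticsearch class.
final class ParameterSketch<T> {
    private final T defaultValue;
    private T value;
    private boolean isSet;

    ParameterSketch(T defaultValue) {
        this.defaultValue = defaultValue;
        this.value = defaultValue;
    }

    void setValue(T newValue) {
        this.value = newValue;
        this.isSet = true;
    }

    /** True as soon as a value was provided, even if it equals the default. */
    boolean isSet() {
        return isSet;
    }

    /** True only if a value was provided and it differs from the default. */
    boolean isConfigured() {
        return isSet && Objects.equals(value, defaultValue) == false;
    }

    public static void main(String[] args) {
        ParameterSketch<Boolean> index = new ParameterSketch<>(true);
        index.setValue(true); // explicit "index": true, same as the default
        System.out.println("isSet=" + index.isSet() + ", isConfigured=" + index.isConfigured());
        // prints "isSet=true, isConfigured=false"
    }
}
```

Under this reading, isConfigured answers "does the value differ from the default" while isSet answers "did the user provide it at all", which is the distinction the new method relies on.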
I see, thanks for explaining this.
martijnvg left a comment
LGTM 👍
Let's see how the nightly benchmark picks this up.
#121751) In this PR, we change how the doc values sparse index is enabled for the host.name keyword field. The initial implementation of the sparse index for host.name was introduced in #120741. Previously, the choice between using an inverted index or a doc values sparse index was determined by the index parameter. With this change, we introduce a new final index-level setting, index.mapping.use_doc_values_sparse_index:
- When the setting is true, we enable the sparse index and omit the inverted index for host.name.
- When the setting is false (default), we retain the inverted index instead.
Additionally, this setting is only exposed if the doc_values_sparse_index feature flag is enabled. This change simplifies enabling the doc values sparse index and makes the selection of indexing strategies explicit at the index level. Moreover, the setting is not dynamic and is exposed only for stateful deployments.
The plan is to enable this setting in our nightly benchmarks and evaluate its impact on LogsDB indexing throughput, storage footprint and query latency. Based on benchmarking results, we will decide whether to adopt the sparse index and determine the best way to configure it.
This PR introduces a new field type in KeywordFieldMapper with support for a sparse doc values index when specific conditions are met:
- LOGSDB.
- host.name and mapped as a keyword.
When all the conditions above hold true we:
- FieldType with DocValuesSkipIndexType.RANGE as the sparse index type.
- KeywordFieldMapper to apply the new field type conditionally so as to have a sparse doc values index on the host.name field.
- host.name field dropping the inverted index (in favor of the sparse doc values index).
Some queries might be slower as a result of using a doc values sparse index instead of an inverted index.
Disabling the inverted index on the host.name field while enabling the doc values sparse index is expected to:
Introducing the sparse index is gated by a feature flag which will be used later to do the same for the @timestamp field too. As a result, we will see the impact on storage, indexing throughput and query latency only in snapshot builds.