Sparse doc values index for LogsDB `host.name` field by salvatore-campagna · Pull Request #120741 · elastic/elasticsearch

salvatore-campagna · 2025-01-23T16:18:43Z

This PR introduces a new field type in KeywordFieldMapper with support for a sparse doc values index when specific conditions are met:

The index mode is set to LOGSDB.
The field name is host.name and mapped as a keyword.
The field is included in the primary sort configuration.
The field has doc values and indexing is not disabled explicitly (not index: false).

When all the conditions above hold true we:

use a new FieldType with DocValuesSkipIndexType.RANGE as the sparse index type.
update the KeywordFieldMapper to apply the new field type conditionally, so that the host.name field gets a sparse doc values index.
disable indexing of the host.name field dropping the inverted index (in favor of the sparse doc values index).
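The conditions listed in the description can be sketched as a single predicate. This is only an illustration: the class, the `IndexMode` enum stand-in, the method name and its parameters are hypothetical and are not the signatures used in KeywordFieldMapper.

```java
final class SparseIndexEligibilitySketch {
    // Local stand-in for the real index mode enum; only LOGSDB matters here.
    enum IndexMode { STANDARD, LOGSDB }

    // Hypothetical helper mirroring the four conditions from the description.
    static boolean useSparseIndex(
        IndexMode indexMode,
        String fullFieldName,
        String primarySortField,
        boolean hasDocValues,
        boolean indexExplicitlyDisabled
    ) {
        boolean isLogsDbMode = indexMode == IndexMode.LOGSDB;
        boolean isHostNameField = "host.name".equals(fullFieldName);
        boolean isPrimarySortField = "host.name".equals(primarySortField);
        // Doc values are required, and the user must not have set index: false.
        return isLogsDbMode && isHostNameField && isPrimarySortField
            && hasDocValues && indexExplicitlyDisabled == false;
    }

    public static void main(String[] args) {
        System.out.println(useSparseIndex(IndexMode.LOGSDB, "host.name", "host.name", true, false));
        System.out.println(useSparseIndex(IndexMode.STANDARD, "host.name", "host.name", true, false));
    }
}
```

When the predicate holds, the mapper would pick the field type with the RANGE skip index and drop the inverted index; otherwise it keeps the default keyword field type.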

Some queries might be slower as a result of using a doc values sparse index instead of an inverted index.

Disabling the inverted index on the host.name field while enabling the doc values sparse index is expected to:

reduce storage footprint depending on the size of the inverted index relative to the sparse index.
improve indexing throughput as a result of reducing the amount of data written during segment flushes.

Introducing the sparse index is gated by a feature flag, which will be used later to do the same for the @timestamp field too. As a result, we will see the impact on storage, indexing throughput and query latency only in snapshot builds.

server/src/test/java/org/elasticsearch/index/mapper/DocumentParserContextTests.java

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

salvatore-campagna · 2025-01-24T11:00:45Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

        @Override
        public KeywordFieldMapper build(MapperBuilderContext context) {
-            FieldType fieldtype = new FieldType(Defaults.FIELD_TYPE);
+            FieldType fieldtype = fieldType(indexSortConfig, indexMode, context.buildFullName(leafName()));


Depending on the subobjects setting, the value of leafName may or may not include the parent (the part before the dot). We need the full name to check whether we are dealing with host.name, no matter the subobjects setting.
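A minimal sketch of why the full name is needed, assuming a simplified model in which the full name is the parent path joined to the leaf name with a dot. The `buildFullName(parentPath, leafName)` helper below is a hypothetical stand-in, not the MapperBuilderContext API.

```java
final class FullNameSketch {
    // Hypothetical stand-in: join the parent path (if any) and the leaf name.
    static String buildFullName(String parentPath, String leafName) {
        if (parentPath == null || parentPath.isEmpty()) {
            return leafName;
        }
        return parentPath + "." + leafName;
    }

    public static void main(String[] args) {
        // Leaf name without the parent: "name" under the object "host".
        System.out.println(buildFullName("host", "name"));
        // Leaf name that already carries the dotted path.
        System.out.println(buildFullName("", "host.name"));
    }
}
```

Both calls yield the same full name, so comparing the full name against host.name works regardless of how the leaf name was produced.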

salvatore-campagna · 2025-01-24T11:00:59Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

                // deduplicate in the common default case to save some memory
                fieldtype = Defaults.FIELD_TYPE;
            }
+            if (fieldtype.equals(Defaults.FIELD_TYPE_WITH_SKIP_DOC_VALUES)) {


The same deduplication optimization as above, applied to the field type with the doc values skip index.

salvatore-campagna · 2025-01-24T11:02:07Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+            final IndexMode indexMode,
+            final String fullFieldName
+        ) {
+            return (defaultIndexedAndDocValues() || isNotIndexedAndHasDocValues())


Here I also enable the sparse index if the user explicitly defines the field not to be indexed and with doc values. In that case we pay a little price in terms of storage footprint to take advantage of the sparse index.

salvatore-campagna · 2025-01-24T11:03:17Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

            this.scriptValues = builder.scriptValues();
            this.isDimension = builder.dimension.getValue();
            this.isSyntheticSource = isSyntheticSource;
+            this.hasDocValuesSparseIndex = DocValuesSkipIndexType.NONE.equals(fieldType.docValuesSkipIndexType()) == false;


Checking this way we make sure the boolean value is correct even if new skip indices other than RANGE are introduced in Lucene in future releases.
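To illustrate the point, here is a small sketch that compares the two ways of writing the check. The `SkipIndexType` enum is a local stand-in for Lucene's DocValuesSkipIndexType, and `FUTURE_KIND` is a made-up value that only represents "some skip index type added later".

```java
final class SkipIndexCheckSketch {
    // Local stand-in enum; FUTURE_KIND is hypothetical.
    enum SkipIndexType { NONE, RANGE, FUTURE_KIND }

    // The style used in the PR: anything other than NONE counts as a sparse index.
    static boolean hasSparseIndexNotNone(SkipIndexType type) {
        return SkipIndexType.NONE.equals(type) == false;
    }

    // The alternative: only RANGE counts, so new types would be missed.
    static boolean hasSparseIndexIsRange(SkipIndexType type) {
        return SkipIndexType.RANGE.equals(type);
    }

    public static void main(String[] args) {
        for (SkipIndexType type : SkipIndexType.values()) {
            System.out.println(type + ": notNone=" + hasSparseIndexNotNone(type)
                + ", isRange=" + hasSparseIndexIsRange(type));
        }
    }
}
```

The two checks agree for NONE and RANGE and differ only for the hypothetical future value.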

elasticsearchmachine · 2025-01-24T11:13:01Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

salvatore-campagna · 2025-01-24T11:25:57Z

server/src/test/java/org/elasticsearch/index/mapper/KeywordFieldMapperTests.java

+        );
+
+        final KeywordFieldMapper mapper = (KeywordFieldMapper) mapperService.documentMapper().mappers().getMapper("host.name");
+        assertTrue(mapper.fieldType().hasDocValuesSparseIndex());


In this case we enable the doc values sparse index even if the user explicitly configures index: false and doc_values: true.

martijnvg

A sparse doc values index on the host.name field provides the following benefits:

Improves query performance when the host.name field is used. The sparse index allows skipping over irrelevant documents based on doc values, which reduces query latency and resources required to execute the query.
It reduces compute costs at query time reducing query latency for queries taking advantage of the sparse index.
As it is used when the host.name field is included in the primary sort configuration, the sparse index aligns with sorting requirements, further enhancing efficiency during data retrieval and aggregation.

I don't think querying will be faster than when there is an inverted index. Part of this exercise is to see how much query performance will be affected. However, given that host.name is the primary sort field, I think/hope that the drop in query performance will be acceptable.

Note that this PR can't be backported to 8.x, since it relies on doc values skippers, which is a Lucene 10 only feature.

martijnvg · 2025-01-24T12:44:33Z

server/src/main/java/org/elasticsearch/index/mapper/MappedFieldType.java

        return isIndexed;
    }

+    public boolean hasDocValuesSparseIndex() {


I think for now, we don't have to add a new method to this base class? I see this is only used in tests and then we know the concrete class?

martijnvg · 2025-01-24T12:46:35Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+                && isPrimarySortField(indexSortConfig);
+        }
+
+        private boolean isHostNameField(final String fullFieldName) {


I think it is more readable if these sub conditions are used directly in the return statement (^)? The sub conditions aren't long and moving them to private methods doesn't buy much?

I separated them because there are many of those...but if you prefer I can extract them.

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

salvatore-campagna · 2025-01-24T13:07:46Z

My assumption is that for some queries at least, since doc values are sorted, the sparse index can allow skipping entire segments in some cases...so some filtering queries, for instance, might actually be faster? Think, for instance, of one segment where host.name is foo and another where host.name is bar, with a filter on foo. In that case the sparse index would allow skipping the segment where the value is bar. Am I wrong? This is why I thought some queries can actually be faster.

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

martijnvg · 2025-01-24T13:25:09Z

My assumption is that for some queries at least, since doc values are sorted, the sparse index can allow skipping entire segments in some cases...so some filtering queries, for instance, might actually be faster?

Right, that is true. However, I don't think it can be faster than a term query on an indexed field, since an inverted index can also skip over segments / many doc IDs that don't match. I expect similar performance at best, but I may be wrong here. I think the main point is to figure out exactly what the effect is.
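The foo/bar example from this thread can be sketched with a toy model of a range skip index: each block of documents records the minimum and maximum value it contains, and a filter skips any block whose range cannot contain the term. All names below are hypothetical, and the blocks here merely stand in for the segments mentioned above; this is not how Lucene lays out its skip data.

```java
import java.util.List;

final class RangeSkipSketch {
    /** A run of consecutive documents plus the min and max value seen in it. */
    static final class Block {
        final String min;
        final String max;
        final List<String> values;

        Block(String min, String max, List<String> values) {
            this.min = min;
            this.max = max;
            this.values = values;
        }
    }

    /** Builds a block and computes its min/max range from a non-empty list. */
    static Block blockOf(List<String> values) {
        String min = values.get(0);
        String max = values.get(0);
        for (String v : values) {
            if (v.compareTo(min) < 0) min = v;
            if (v.compareTo(max) > 0) max = v;
        }
        return new Block(min, max, values);
    }

    /** Returns {matching docs, blocks actually scanned}. */
    static int[] countMatches(List<Block> blocks, String term) {
        int matches = 0;
        int scanned = 0;
        for (Block block : blocks) {
            // The term lies outside [min, max], so no document here can match.
            if (term.compareTo(block.min) < 0 || term.compareTo(block.max) > 0) {
                continue;
            }
            scanned++;
            for (String v : block.values) {
                if (v.equals(term)) matches++;
            }
        }
        return new int[] { matches, scanned };
    }

    public static void main(String[] args) {
        List<Block> blocks = List.of(
            blockOf(List.of("bar", "bar", "bar")),
            blockOf(List.of("foo", "foo"))
        );
        int[] result = countMatches(blocks, "foo");
        System.out.println("matches=" + result[0] + ", blocksScanned=" + result[1]);
    }
}
```

Because the index is sorted on host.name, equal values cluster together and the ranges stay narrow, which is what makes the skipping effective in this model. Whether that matches the speed of a term query on an inverted index is exactly the open question in this thread.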

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

martijnvg

Thanks Salvatore, I left a few questions about the cases in which the sparse index should be enabled.

martijnvg · 2025-01-27T14:24:13Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+            boolean isHostNameField = HOST_NAME.equals(fullFieldName);
+            boolean isPrimarySortField = indexSortConfig != null && indexSortConfig.hasPrimarySortOnField(HOST_NAME);
+
+            return (isIndexedAndDocValuesDefault || isNotIndexedAndHasDocValues) && isLogsDbMode && isHostNameField && isPrimarySortField;


I think we want to enable sparse index only if index has not been configured and doc values isn't disabled.

So I think this is easier:

indexed.isConfigured() == false && hasDocValues.getValue() == false && isLogsDbMode && isHostNameField && isPrimarySortField;

?

hasDocValues.getValue() == false? We need doc values to create the sparse index...

martijnvg · 2025-01-27T14:25:27Z

server/src/test/java/org/elasticsearch/index/mapper/KeywordFieldMapperTests.java

+        );
+
+        final KeywordFieldMapper mapper = (KeywordFieldMapper) mapperService.documentMapper().mappers().getMapper("host.name");
+        assertFalse(mapper.fieldType().hasDocValuesSparseIndex());


I think supporting sparse index in this case is ok?

I interpreted this and the other test below as "if there are doc values and all other conditions hold true we would like to have the sparse index". Anyway, I think we can also decide that if index: false we create neither the inverted index nor the sparse index.

Anyway, I think we can also decide that if index: false we create neither the inverted index nor the sparse index.

Let's do that?

martijnvg · 2025-01-27T14:27:01Z

server/src/test/java/org/elasticsearch/index/mapper/KeywordFieldMapperTests.java

+        );
+
+        final KeywordFieldMapper mapper = (KeywordFieldMapper) mapperService.documentMapper().mappers().getMapper("host.name");
+        assertTrue(mapper.fieldType().hasDocValuesSparseIndex());


I wonder whether the sparse index should be enabled when index is explicitly set to false?

See my comment above.

salvatore-campagna · 2025-01-29T12:43:11Z

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

            return isSet && Objects.equals(value, getDefaultValue()) == false;
        }

+        public boolean isSet() {


Note that I had to add this method to be able to detect the case where the index parameter is explicitly set to true (the default). The existing isConfigured method returns true only if the value is set and also differs from the default (which is not our case).

Do getValue() and isConfigured() indicate the same thing? Configured specifically to true (even if the attribute defaults to true)?

isConfigured is not true...

public boolean isConfigured() { return isSet && Objects.equals(value, getDefaultValue()) == false; }

Objects.equals(value, getDefaultValue()) evaluates to true because value=true and getDefaultValue=true, so the second condition (which checks == false) is false. This implementation of isConfigured does not really make sense to me.

So

isSet=true value=true getDefaultValue=true

this returns FALSE because of true && true == false.

IMO the error is in && used instead of ||...but I am really surprised.

The correct implementation should be

public boolean isConfigured() { return isSet || Objects.equals(value, getDefaultValue()) == false; }

Even if, when Objects.equals(value, getDefaultValue()) is false probably also isSet is true (isSet must be true if value is not the default, because it means a user explicitly set the non-default value).

Anyway if I try this change I have a lot of test failures...so maybe I am missing something?

I see, thanks for explaining this.
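The difference between the two methods discussed in this thread can be sketched with a stripped-down parameter class. `ParameterSketch` is hypothetical and much simpler than FieldMapper.Parameter; `isConfigured` follows the implementation quoted above, while the body of `isSet` is an assumption, since only its signature appears in the diff.

```java
import java.util.Objects;

final class ParameterSketch<T> {
    private final T defaultValue;
    private T value;
    private boolean isSet;

    ParameterSketch(T defaultValue) {
        this.defaultValue = defaultValue;
        this.value = defaultValue;
    }

    void setValue(T newValue) {
        this.value = newValue;
        this.isSet = true;
    }

    // Assumed body: true as soon as the user supplied any value, default or not.
    boolean isSet() {
        return isSet;
    }

    // As quoted in the thread: true only for an explicit, non-default value.
    boolean isConfigured() {
        return isSet && Objects.equals(value, defaultValue) == false;
    }

    public static void main(String[] args) {
        ParameterSketch<Boolean> explicitDefault = new ParameterSketch<>(true);
        explicitDefault.setValue(true); // the user wrote index: true
        System.out.println("isSet=" + explicitDefault.isSet()
            + ", isConfigured=" + explicitDefault.isConfigured());
    }
}
```

In this model an explicit index: true is visible through isSet() but not through isConfigured(), which is the case the new method is meant to detect.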

martijnvg

LGTM 👍
Let's see how the nightly benchmark picks this up.

#121751) In this PR, we change how the doc values sparse index is enabled for the host.name keyword field. The initial implementation of the sparse index for host.name was introduced in #120741. Previously, the choice between using an inverted index or a doc values sparse index was determined by the index parameter.

With this change, we introduce a new final index-level setting, index.mapping.use_doc_values_sparse_index:

- When the setting is true, we enable the sparse index and omit the inverted index for host.name.
- When the setting is false (default), we retain the inverted index instead.

Additionally, this setting is only exposed if the doc_values_sparse_index feature flag is enabled. This change simplifies enabling the doc values sparse index and makes the selection of indexing strategies explicit at the index level. Moreover, the setting is not dynamic and is exposed only for stateful deployments.

The plan is to enable this setting in our nightly benchmarks and evaluate its impact on LogsDB indexing throughput, storage footprint and query latency. Based on benchmarking results, we will decide whether to adopt the sparse index and determine the best way to configure it.

feature: sparse index for logsdb host.name field

65f02eb

elasticsearchmachine added the v9.0.0 label Jan 23, 2025

salvatore-campagna self-assigned this Jan 23, 2025

salvatore-campagna added the >non-issue, auto-backport, v8.18.0, and :StorageEngine/Logs labels Jan 23, 2025

salvatore-campagna and others added 3 commits January 23, 2025 17:20

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

1e14746

fix: enable sparse index if field is indexed nad has doc values

6476c15

fix: rename method and make constant private

47039a9

salvatore-campagna commented Jan 24, 2025

server/src/test/java/org/elasticsearch/index/mapper/DocumentParserContextTests.java (outdated)

salvatore-campagna added 3 commits January 24, 2025 09:52

fix: keep existing Builder

15ecf22

fix: compare to NONE and flip equals

ffe0a2c

fix: refactor Builder constructor and remove unused variables

5cd8126

salvatore-campagna commented Jan 24, 2025

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

4b54de6

salvatore-campagna marked this pull request as ready for review January 24, 2025 11:12

salvatore-campagna requested a review from martijnvg January 24, 2025 11:12

elasticsearchmachine added the Team:StorageEngine label Jan 24, 2025

salvatore-campagna commented Jan 24, 2025

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

4b6b87d

martijnvg reviewed Jan 24, 2025

salvatore-campagna removed the v8.18.0 and auto-backport labels Jan 24, 2025

martijnvg reviewed Jan 24, 2025

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

salvatore-campagna added 2 commits January 24, 2025 15:32

fix: refactor Builder constructore and remove hasDocValuesSparseIndex

db8374e

fix: gate sparse index usage with feature flag

328e42b

salvatore-campagna commented Jan 24, 2025

server/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

salvatore-campagna added 3 commits January 24, 2025 15:41

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

3d27a95

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

4ce3758

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

6a452bf

martijnvg reviewed Jan 27, 2025

salvatore-campagna and others added 7 commits January 27, 2025 20:47

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

faea308

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

457d14f

fix: refactor sparse doc values index creation conditions

4909ecb

fix: host.name sparse doc values index tests

2e3a51e

fix: add missing assertions

1868134

fix: use sparse doc values index only for new indices

e4f0184

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

3d7fdc2

salvatore-campagna mentioned this pull request Jan 28, 2025

Sparse doc values index for LogsDB @timestamp field #121018

Closed

salvatore-campagna and others added 2 commits January 29, 2025 12:46

fix: either use the sparse idnex or the inverted index

24aa35d

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

16014a8

salvatore-campagna commented Jan 29, 2025

salvatore-campagna added 2 commits January 29, 2025 13:45

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

e0599ac

Merge branch 'main' into feature/timestamp-and-hostname-sparse-index

1254c7a

martijnvg approved these changes Jan 29, 2025

salvatore-campagna merged commit 1b6a080 into elastic:main Jan 29, 2025
16 checks passed

This was referenced Jan 29, 2025

[CI] KeywordFieldMapperTests testFieldTypeDefault_ConfiguredDocValues failing #121233

Closed

[TEST] Ensure that feature flag is enabled in new KeywordFieldMapperTests #121248

Merged

salvatore-campagna mentioned this pull request Feb 5, 2025

Refactoring doc values sparse index enabling for the host.name field #121751

Merged

martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Mar 13, 2025

Revert "Sparse doc values index for LogsDB host.name field (elastic…

01d89af

…#120741)" This reverts commit 1b6a080.

This was referenced Mar 13, 2025

Remove doc value skipper experimental code from 9.0 branch. #124803

Merged

Don't enable docvalues skippers by default for the time being. #124787

Merged

martijnvg added a commit that referenced this pull request Mar 13, 2025

Revert "Sparse doc values index for LogsDB host.name field (#120741)…

ecc423b

…" (#124803) This reverts commit 1b6a080.
