Skip to content

Comments

Sparse doc values index for LogsDB host.name field#120741

Merged
salvatore-campagna merged 25 commits intoelastic:mainfrom
salvatore-campagna:feature/timestamp-and-hostname-sparse-index
Jan 29, 2025
Merged

Sparse doc values index for LogsDB host.name field#120741
salvatore-campagna merged 25 commits intoelastic:mainfrom
salvatore-campagna:feature/timestamp-and-hostname-sparse-index

Conversation

@salvatore-campagna
Copy link
Contributor

@salvatore-campagna salvatore-campagna commented Jan 23, 2025

This PR introduces a new field type in KeywordFieldMapper with support for a sparse doc values index when specific conditions are met:

  • The index mode is set to LOGSDB.
  • The field name is host.name and mapped as a keyword.
  • The field is included in the primary sort configuration.
  • The field has doc values and indexing is not disabled explicitly (not index: false).

When all the conditions above hold true we:

  • use a new FieldType with DocValuesSkipIndexType.RANGE as the sparse index type.
  • update the KeywordFieldMapper to apply the new field type conditionally so to have a sparse doc values index on the host.name field.
  • disable indexing of the host.name field dropping the inverted index (in favor of the sparse doc values index).

Some queries might be slower as a result of using a doc values sparse index instead of an inverted index.

Disabling the inverted index on the host.name field while enabling the doc values sparse index is expected to:

  • reduce storage footprint depending on the size of the inverted index relative to the sparse index.
  • improve indexing throughput as a result of reducing the amount of data written during segment flushes.

Introducing the sparse index is gated by a feature flag which will be used later to do the same for the @timestamp field too. As a result, we will see the impact on storage, indexing throughput and query latency only is snapshot builds.

@salvatore-campagna salvatore-campagna self-assigned this Jan 23, 2025
@salvatore-campagna salvatore-campagna added >non-issue auto-backport Automatically create backport pull requests when merged v8.18.0 :StorageEngine/Logs You know, for Logs labels Jan 23, 2025
@Override
public KeywordFieldMapper build(MapperBuilderContext context) {
FieldType fieldtype = new FieldType(Defaults.FIELD_TYPE);
FieldType fieldtype = fieldType(indexSortConfig, indexMode, context.buildFullName(leafName()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on subobjects the value of leafName could include the parent (before the dot) or not. We need the full name to check if we are dealing with host.name no matter the subobjects setting.

// deduplicate in the common default case to save some memory
fieldtype = Defaults.FIELD_TYPE;
}
if (fieldtype.equals(Defaults.FIELD_TYPE_WITH_SKIP_DOC_VALUES)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An optimization as above.

final IndexMode indexMode,
final String fullFieldName
) {
return (defaultIndexedAndDocValues() || isNotIndexedAndHasDocValues())
Copy link
Contributor Author

@salvatore-campagna salvatore-campagna Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I also enable the sparse index if the user explicitly defines the field not to be indexed and with doc values. In that case we pay a little price in terms of storage footprint to take advantage of the sparse index.

this.scriptValues = builder.scriptValues();
this.isDimension = builder.dimension.getValue();
this.isSyntheticSource = isSyntheticSource;
this.hasDocValuesSparseIndex = DocValuesSkipIndexType.NONE.equals(fieldType.docValuesSkipIndexType()) == false;
Copy link
Contributor Author

@salvatore-campagna salvatore-campagna Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking this way we make sure the boolean value is correct even if new skip indices other than RANGE are introduced in Lucene in future releases.

@salvatore-campagna salvatore-campagna marked this pull request as ready for review January 24, 2025 11:12
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

);

final KeywordFieldMapper mapper = (KeywordFieldMapper) mapperService.documentMapper().mappers().getMapper("host.name");
assertTrue(mapper.fieldType().hasDocValuesSparseIndex());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we enable the doc values sparse index even if the user explicitly configures index: false and doc_values: true.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A sparse doc values index on the host.name field provides the following benefits:

Improves query performance when the host.name field is used. The sparse index allows skipping over irrelevant documents based on doc values, which reduces query latency and resources required to execute the query.
It reduces compute costs at query time reducing query latency for queries taking advantage of the sparse index.
As it is used when the host.name field is included in the primary sort configuration, the sparse index aligns with sorting requirements, further enhancing efficiency during data retrieval and aggregation.

I don't think querying will be faster than when there is an inverted index. Part of this exercise is to see how much query performance will be effected. However given that host.name is the primary sort field, I think/hope that drop in query performance will be acceptable.

Note that this PR can't be back ported to 8.x, since it relies on doc values skippers, which is a Lucene 10 only feature.

return isIndexed;
}

public boolean hasDocValuesSparseIndex() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now, we don't have to add a new method to this base class? I see this is only used in tests and then we know the concrete class?

&& isPrimarySortField(indexSortConfig);
}

private boolean isHostNameField(final String fullFieldName) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is more readable if these sub conditions are used direct in the return statement (^)? The sub conditions aren't long and moving to private methods doesn't buy much?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I separated them because there are many of those...but if you prefer I can extract them.

@salvatore-campagna salvatore-campagna removed v8.18.0 auto-backport Automatically create backport pull requests when merged labels Jan 24, 2025
@salvatore-campagna
Copy link
Contributor Author

salvatore-campagna commented Jan 24, 2025

A sparse doc values index on the host.name field provides the following benefits:

Improves query performance when the host.name field is used. The sparse index allows skipping over irrelevant documents based on doc values, which reduces query latency and resources required to execute the query. It reduces compute costs at query time reducing query latency for queries taking advantage of the sparse index. As it is used when the host.name field is included in the primary sort configuration, the sparse index aligns with sorting requirements, further enhancing efficiency during data retrieval and aggregation.

I don't think querying will be faster than when there is an inverted index. Part of this exercise is to see how much query performance will be effected. However given that host.name is the primary sort field, I think/hope that drop in query performance will be acceptable.

Note that this PR can't be back ported to 8.x, since it relies on doc values skippers, which is a Lucene 10 only feature.

My assumption si that for some queries at least, since doc values are sorted, the sparse index can allow skipping entire segments in some cases...so some filtering queries, for instance, might actually be faster? Think for instance having a segment where host.name is foo and one where host.name is bar and the filtering is, for instance on foo. In that case the sparse index would allow skipping the segment where the value is bar. Am I wrong? This is why I thought some queries can actually be faster.

@martijnvg
Copy link
Member

My assumption si that for some queries at least, since doc values are sorted, the sparse index can allow skipping entire segments in some cases...so some filtering queries, for instance, might actually be faster?

Right, that is true. However I don't think it can be faster than a term query on an indexed field, since an inverted index also can skip over segments / many docids that don't match. I expect at best similar performance. But I may be wrong here. I think main point is to figure out what effect exactly is.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Salvatore, I left a few questions about in what cases sparse index should be enabled.

boolean isHostNameField = HOST_NAME.equals(fullFieldName);
boolean isPrimarySortField = indexSortConfig != null && indexSortConfig.hasPrimarySortOnField(HOST_NAME);

return (isIndexedAndDocValuesDefault || isNotIndexedAndHasDocValues) && isLogsDbMode && isHostNameField && isPrimarySortField;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to enable sparse index only if index has not been configured and doc values isn't disabled.

So I think this is easier:

indexed.isConfigured() == false && hasDocValues.getValue() == false && isLogsDbMode && isHostNameField && isPrimarySortField;

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasDOcValues.getValue() == false? We need doc values to create the sparse index...

);

final KeywordFieldMapper mapper = (KeywordFieldMapper) mapperService.documentMapper().mappers().getMapper("host.name");
assertFalse(mapper.fieldType().hasDocValuesSparseIndex());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think supporting sparse index in this case is ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I interpreted this and the other test below as "if there are doc values and all other conditions hold true we would like to have the sparse index". Anyway I think we can also decide that if index: false we do not create neither the inverted index nor the sparse index.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway I think we can also decide that if index: false we do not create neither the inverted index nor the sparse index.

Let's do that?

);

final KeywordFieldMapper mapper = (KeywordFieldMapper) mapperService.documentMapper().mappers().getMapper("host.name");
assertTrue(mapper.fieldType().hasDocValuesSparseIndex());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether sparse index should be enabled whenwhen index is explicitly set to false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above.

return isSet && Objects.equals(value, getDefaultValue()) == false;
}

public boolean isSet() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I had to add this method to be able to detect the situation when the parameter index is set to true (the default) explicitly. The isConfigured existing method returns true only if the value is set and is different from the default too (which is not our case).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does getValue() and isConfigured() indicate the same? Configured specifically to true (even if attribute defaults to true)?

Copy link
Contributor Author

@salvatore-campagna salvatore-campagna Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isConfigured is not true...

public boolean isConfigured() {
            return isSet && Objects.equals(value, getDefaultValue()) == false;
        }

The second condition is true because value=true and getDefaultValue=true...so the Objects.equals evaluates to true (that is not false as checked by == false. This implementation of isConfigured does not really make sense to me.

So

isSet=true
value=true
getDefaultValue=true

this returns FALSE because of true && true == false.

Copy link
Contributor Author

@salvatore-campagna salvatore-campagna Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the error is in && used instead of ||...but I am really surprised.

The correct implementation should be

public boolean isConfigured() {
            return isSet || Objects.equals(value, getDefaultValue()) == false;
        }

Even if, when Objects.equals(value, getDefaultValue()) is false probably also isSet is true (isSet must be true if value is not the default, because it means a user explicitly set the non-default value).

Anyway if I try this change I have a lot of test failures...so maybe I am missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for explaining this.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍
Let's see how nighly benchmark picks this up.

@salvatore-campagna salvatore-campagna merged commit 1b6a080 into elastic:main Jan 29, 2025
16 checks passed
salvatore-campagna added a commit that referenced this pull request Feb 10, 2025
#121751)

In this PR, we change how the doc values sparse index is enabled for the host.name keyword field.  
The initial implementation of the sparse index for host.name was introduced in #120741.  

Previously, the choice between using an inverted index or a doc values sparse index was determined by the index parameter. With this change, we introduce a new final index-level setting, index.mapping.use_doc_values_sparse_index:

- When the setting is true, we enable the sparse index and omit the inverted index for host.name.  
- When the setting is false (default), we retain the inverted index instead.

Additionally, this setting is only exposed if the doc_values_sparse_index feature flag is enabled.

This change simplifies enabling the doc values sparse index and makes the selection of indexing strategies explicit at the index level. Moreover, the setting is not dynamic and is exposed only for stateful deployments.

The plan is to enable this setting in our nightly benchmarks and evaluate its impact on LogsDB indexing throughput, storage footprint and query latency. Based on benchmarking results, we will decide whether to adopt the sparse index and determine the best way to configure it.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Mar 13, 2025
martijnvg added a commit that referenced this pull request Mar 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants