Speed up `exists` and `missing` filters on high-cardinality fields #5659
Comments
I really like the
+1 on the
+1 on
So it's like `_all`, but for field names? Just thinking out loud. Maybe there is no use case for that...
I don't want this to be forgotten, so I've added a v1.3.0 label. No pressure ;)
The `exists` and `missing` filters need to merge postings lists of all existing terms, which can be very costly, especially on high-cardinality fields. This commit indexes the field names of a document under `_field_names` and reuses it to speed up the `exists` and `missing` filters. This is only enabled for indices that are created on or after Elasticsearch 1.3.0. Close elastic#5659
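To make the commit description concrete, here is a minimal Python sketch of the idea, not Elasticsearch's actual implementation. `ToyIndex` and all of its methods are hypothetical names invented for illustration. It assumes a toy in-memory inverted index in which every non-null field of a document also contributes its own name as a term under `_field_names`, so that `exists` becomes a single postings lookup and `missing` is its complement.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index (hypothetical); not Elasticsearch's real data structures."""

    def __init__(self):
        # (field, term) -> list of doc ids, sorted because ids are assigned in order
        self.postings = defaultdict(list)
        self.doc_count = 0

    def add(self, doc):
        doc_id = self.doc_count
        self.doc_count += 1
        for field, value in doc.items():
            if value is None:
                continue  # assumption: a null value does not make the field "exist"
            self.postings[(field, str(value))].append(doc_id)
            # Extra step from the commit description: index the field name itself.
            self.postings[("_field_names", field)].append(doc_id)
        return doc_id

    def exists(self, field):
        # One postings lookup, regardless of how many distinct terms the field has.
        return list(self.postings.get(("_field_names", field), []))

    def missing(self, field):
        # Complement of exists over all documents in the index.
        present = set(self.exists(field))
        return [d for d in range(self.doc_count) if d not in present]


index = ToyIndex()
index.add({"user": "alice", "age": 31})
index.add({"user": "bob"})
index.add({"age": 27, "city": None})

print(index.exists("user"))   # [0, 1]
print(index.missing("user"))  # [2]
```

In this sketch the cost of `exists` depends only on the number of documents that have the field, not on the number of distinct terms in it, which is the property the commit relies on.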
@spinscale asked me about the disk footprint of this feature. In general it is very low: its index options are I did some experiments in 2 extreme cases:
This looks very reasonable to me. Even the 2nd case which has very sparse documents takes less than one byte per field per document. |
The way that the `exists` filter works is by merging all postings lists. `missing` just wraps an `exists` filter into a `not` filter.
Merging all postings lists can however be very slow on high-cardinality fields. I think there are two ways to fix it:
- work on field data, or
- add a new metadata field, `_field_names`, that would index all field names of a document.
Working on field data has the drawback of requiring a lot of stuff to be loaded into memory if the field doesn't have doc values, and the returned filter cannot skip.
I tend to like indexing field names because it would not load anything into memory with a default setup, and the returned filter could skip efficiently since it would be based on a postings list. But unfortunately it could not be used on indices that have been created before we introduce this new metadata field.
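The trade-off described above can be illustrated with a rough Python sketch. It is only a model under stated assumptions: postings lists are plain sorted Python lists, and `exists_by_merging` and `advance` are hypothetical helper names, not Lucene or Elasticsearch APIs. The first function mimics merging the postings lists of every term in a field; the second shows why a filter backed by a single sorted postings list can skip ahead cheaply.

```python
import heapq
from bisect import bisect_left


def exists_by_merging(term_postings):
    """Union the sorted postings lists of every term of one field."""
    merged = []
    # heapq.merge walks all lists at once; the work grows with the number of terms.
    for doc_id in heapq.merge(*term_postings.values()):
        if not merged or merged[-1] != doc_id:
            merged.append(doc_id)  # drop duplicates from multi-valued documents
    return merged


def advance(postings, target):
    """Return the first doc id >= target in a sorted postings list, or None."""
    i = bisect_left(postings, target)
    return postings[i] if i < len(postings) else None


# A high-cardinality field: 10,000 documents, each with its own unique term.
term_postings = {f"session-{i}": [i] for i in range(10_000)}
merged = exists_by_merging(term_postings)  # touches 10,000 postings lists

# With field names indexed, the same answer is a single postings list
# (built by hand here to stand in for the `_field_names` entry).
field_names_postings = list(range(10_000))

print(merged == field_names_postings)       # True
print(advance(field_names_postings, 7500))  # 7500
```

Both approaches return the same documents here; the difference is that the merge has to open one postings list per distinct term, while the single list supports a binary-search style skip.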