Skip to content

Control storing array source with index setting#112397

Merged
elasticsearchmachine merged 25 commits intoelastic:mainfrom
kkrik-es:synthetic-source/store-source-array
Sep 4, 2024
Merged

Control storing array source with index setting#112397
elasticsearchmachine merged 25 commits intoelastic:mainfrom
kkrik-es:synthetic-source/store-source-array

Conversation

@kkrik-es
Copy link
Contributor

Introduce an index setting that forces storing the source of leaf field and object arrays in synthetic source mode. Nested objects are excluded as they already preserve ordering in synthetic source.

Next step is to introduce override params at the mapper level that will allow disabling the source, or storing the source for arrays (if not enabled at index level), or storing the source for both arrays and singletons. This will happen in follow-up changes, so that we can benchmark the impact of this change in parallel.

Related to #112012

@elasticsearchmachine
Copy link
Collaborator

Hi @kkrik-es, I've created a changelog YAML for you.

@kkrik-es
Copy link
Contributor Author

run elasticsearch-ci/bwc-snapshots

@kkrik-es kkrik-es marked this pull request as ready for review September 2, 2024 06:01
@kkrik-es kkrik-es requested a review from martijnvg September 2, 2024 06:02
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

public enum StoreSourceMode {
DISABLED("disabled"), // No source recording
ARRAYS("arrays"), // Store source for arrays of mapped fields
ENABLED("enabled"); // Store source for both singletons and arrays of mapped fields
Copy link
Contributor Author

@kkrik-es kkrik-es Sep 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not excited about the names here.. Maybe DISABLED => OFF or NONE , ENABLED => FULL or ALL?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe naive question, but why do we need to enum here? For now at least it seems that just the index.mapping.store_array_source setting is sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that we can add the FULL option for some fields, in case we want to track their source even if they're singletons, e.g. for geo or for objects with deep structure like nested. This option will never be set at the index level as it would disable synthetic source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that @jimczi also expressed the desire to generalize the option of storing the source of objects or fields verbatim.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to none/arrays/all, ptal.

@@ -0,0 +1,732 @@
---
object param - object array:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these just moved yaml tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah moved the ones with store_array_source and added some similar with the index setting.

public enum StoreSourceMode {
DISABLED("disabled"), // No source recording
ARRAYS("arrays"), // Store source for arrays of mapped fields
ENABLED("enabled"); // Store source for both singletons and arrays of mapped fields
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe naive question, but why do we need to enum here? For now at least it seems that just the index.mapping.store_array_source setting is sufficient?

@kkrik-es kkrik-es requested a review from lkts September 3, 2024 14:00

// Setting StoreSourceMode to ENABLED at the index level is equivalent to disabling synthetic source, which is not desired.
// Since the only valid option is to track array source by default, we use a boolean index setting for it.
public static final Setting<Boolean> STORE_ARRAY_SOURCE_SETTING = Setting.boolSetting(
Copy link
Contributor

@lkts lkts Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not really happy that the name will be different between index settings and mapper parameters (we have them named the same at least for ignore_malformed) but i see why it is done. Can we maybe make all not valid for index setting but have the same name?

Copy link
Contributor Author

@kkrik-es kkrik-es Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that was my initial take here, too. An enum here is more flexible, we may want to have other values in the future..

Changed to match, @martijnvg wdyt?

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we consider offering this option at the mapping level instead of applying it to the global index settings? We don't need this to be the default. It could be enabled at the object level within the mapping. We could also provide the option at the root object level, similar to how we handle subobjects.

Allowing this option for the entire mapping might not be ideal, especially if there are many objects. I believe the need to preserve array order is likely specific to certain parts of the mapping for particular use cases?

Another solution could be to require the field to be nested in order to preserve the ordering. Forcing users to use nested fields when they have multiple objects to differentiate feels more consistent to me. Otherwise, we're introducing a third method for handling arrays of objects, which would only apply to the _source field and be inconsistent with how the data is indexed.

@kkrik-es
Copy link
Contributor Author

kkrik-es commented Sep 3, 2024

Did we consider offering the index-level, catch-all option at the mapping level instead of applying it to the global index settings? We don't need this to be the default. It could be enabled at the object level within the mapping. We could also provide the option at the root object level, similar to how we handle subobjects.

Allowing this option for the entire mapping might not be ideal, especially if there are many objects. I believe the need to preserve array order is likely specific to certain parts of the mapping for particular use cases?

Another solution could be to require the field to be nested in order to preserve the ordering. Forcing users to use nested fields when they have multiple objects to differentiate feels more consistent to me. Otherwise, we're introducing a third method for handling arrays of objects, which would only apply to the _source field and be inconsistent with how the data is indexed.

We'll be adding the per-field option in follow-up PRs, hence the introduced enum.

We want to include the index-level, catch-all option for logsdb mode to avoid surprises when users get switched to use it under the covers (e.g. in serverless). They may have custom mappings that rely on array valyes appearing in the expected order in source (e.g. args), so the switch to synthetic source shouldn't break them.

Iiuc, nested fields (ie type:nested) have bloated overhead, compared to number or keyword fields, for arrays of leaf values - which seems to be the majority of cases in integrations (at least).

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining @kkrik-es!
I left one question regarding the implementation. My main worry is whether we can restrict the storage overhead to actual arrays of objects and not to all objects.

parser = context.parser();
}
context = context.createNestedContext((NestedObjectMapper) context.parent());
} else if (context.storeSourceModeFromIndexSettings() == Mapper.StoreSourceMode.ALL && context.canAddIgnoredField()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this is needed for the singleton case. I thought that store array source would only apply to arrays, do we also need to store single objects?

Copy link
Contributor Author

@kkrik-es kkrik-es Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm correct, this is not possible actually.. I'll revert it.

It will be added as a singleton option for per-field configuration - there are cases where we may want to store the source as-is, e.g. for geo or full object hierarchies. We also have the equivalent for nested - highjacking store_array_source:

https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/index/mapper/DocumentParser.java#L300

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw this is intended to replace store_array_source for objects too.

@kkrik-es
Copy link
Contributor Author

kkrik-es commented Sep 4, 2024

Early results from Rally show no significant increase in storage size (1%). Indexing took 15% more, something to keep in mind..

    |                    Cumulative indexing time of primary shards |    866.076       |    992.146       |    126.07        |    min |  +14.56% |
    |             Min cumulative indexing time across primary shard |      2.98425     |      3.5261      |      0.54185     |    min |  +18.16% |
    |          Median cumulative indexing time across primary shard |     10.8559      |     13.1929      |      2.33698     |    min |  +21.53% |
    |             Max cumulative indexing time across primary shard |    154.211       |    168.458       |     14.2462      |    min |   +9.24% |
    |           Cumulative indexing throttle time of primary shards |      0           |      0           |      0           |    min |    0.00% |
    |    Min cumulative indexing throttle time across primary shard |      0           |      0           |      0           |    min |    0.00% |
    | Median cumulative indexing throttle time across primary shard |      0           |      0           |      0           |    min |    0.00% |
    |    Max cumulative indexing throttle time across primary shard |      0           |      0           |      0           |    min |    0.00% |
    |                       Cumulative merge time of primary shards |    376.848       |    375.484       |     -1.36412     |    min |   -0.36% |
    |                      Cumulative merge count of primary shards |   1212           |   1377           |    165           |        |  +13.61% |
    |                Min cumulative merge time across primary shard |      0.57455     |      0.46835     |     -0.1062      |    min |  -18.48% |
    |             Median cumulative merge time across primary shard |      1.96435     |      2.15628     |      0.19193     |    min |   +9.77% |
    |                Max cumulative merge time across primary shard |    103.914       |     95.2852      |     -8.62887     |    min |   -8.30% |
    |              Cumulative merge throttle time of primary shards |    100.701       |     94.4205      |     -6.28095     |    min |   -6.24% |
    |       Min cumulative merge throttle time across primary shard |      0.00528333  |      0           |     -0.00528     |    min | -100.00% |
    |    Median cumulative merge throttle time across primary shard |      0.0449333   |      0.0538833   |      0.00895     |    min |  +19.92% |
    |       Max cumulative merge throttle time across primary shard |     30.6708      |     26.8331      |     -3.83773     |    min |  -12.51% |
    |                     Cumulative refresh time of primary shards |     32.9666      |     32.2713      |     -0.69533     |    min |   -2.11% |
    |                    Cumulative refresh count of primary shards |   8192           |   8155           |    -37           |        |   -0.45% |
    |              Min cumulative refresh time across primary shard |      0.0171667   |      0.0230167   |      0.00585     |    min |  +34.08% |
    |           Median cumulative refresh time across primary shard |      0.0488667   |      0.0546833   |      0.00582     |    min |  +11.90% |
    |              Max cumulative refresh time across primary shard |     10.6189      |     10.3388      |     -0.28003     |    min |   -2.64% |
    |                       Cumulative flush time of primary shards |     47.6639      |     42.9342      |     -4.7297      |    min |   -9.92% |
    |                      Cumulative flush count of primary shards |   3811           |   3483           |   -328           |        |   -8.61% |
    |                Min cumulative flush time across primary shard |      0.390167    |      0.372167    |     -0.018       |    min |   -4.61% |
    |             Median cumulative flush time across primary shard |      1.2428      |      1.0401      |     -0.2027      |    min |  -16.31% |
    |                Max cumulative flush time across primary shard |      3.05592     |      2.71155     |     -0.34437     |    min |  -11.27% |
    |                                       Total Young Gen GC time |     96.46        |    105.877       |      9.417       |      s |   +9.76% |
    |                                      Total Young Gen GC count |  11580           |  12883           |   1303           |        |  +11.25% |
    |                                         Total Old Gen GC time |      0           |      0           |      0           |      s |    0.00% |
    |                                        Total Old Gen GC count |      0           |      0           |      0           |        |    0.00% |
    |                                                  Dataset size |     77.716       |     78.7394      |      1.02342     |     GB |   +1.32% |
    |                                                    Store size |     77.716       |     78.7394      |      1.02342     |     GB |   +1.32% |
    |                                                 Translog size |      3.99537e-06 |      3.99537e-06 |      0           |     GB |    0.00% |
    |                                        Heap used for segments |      0           |      0           |      0           |     MB |    0.00% |
    |                                      Heap used for doc values |      0           |      0           |      0           |     MB |    0.00% |
    |                                           Heap used for terms |      0           |      0           |      0           |     MB |    0.00% |
    |                                           Heap used for norms |      0           |      0           |      0           |     MB |    0.00% |
    |                                          Heap used for points |      0           |      0           |      0           |     MB |    0.00% |
    |                                   Heap used for stored fields |      0           |      0           |      0           |     MB |    0.00% |
    |                                                 Segment count |    961           |   1050           |     89           |        |   +9.26% |
    |                                   Total Ingest Pipeline count |      4.8861e+08  |      4.8861e+08  |      0           |        |    0.00% |
    |                                    Total Ingest Pipeline time |      3.17516e+07 |      3.2345e+07  | 593340           |     ms |   +1.87% |
    |                                  Total Ingest Pipeline failed |      0           |      0           |      0           |        |    0.00% |

results.txt

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left two comments, otherwise LGTM

public static final Setting<StoreSourceMode> STORE_ARRAY_SOURCE_SETTING = Setting.enumSetting(
StoreSourceMode.class,
"index.mapping.store_source",
StoreSourceMode.NONE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the default by arrays if index.mode is logsdb?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should change this in the logs template, so that it can be overriden for logsdb indexes, if not desired.

Copy link
Member

@martijnvg martijnvg Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I didn't know this was the plan to enable. This makes sense to me.

@kkrik-es kkrik-es added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 4, 2024
@elasticsearchmachine elasticsearchmachine merged commit d5bae2c into elastic:main Sep 4, 2024
@kkrik-es kkrik-es deleted the synthetic-source/store-source-array branch September 4, 2024 15:12
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
Introduce an index setting that forces storing the source of leaf field
and object arrays in synthetic source mode. Nested objects are excluded
as they already preserve ordering in synthetic source.

Next step is to introduce override params at the mapper level that will
allow disabling the source, or storing the source for arrays (if not
enabled at index level), or storing the source for both arrays and
singletons. This will happen in follow-up changes, so that we can
benchmark the impact of this change in parallel.

Related to elastic#112012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement :StorageEngine/Mapping The storage related side of mappings Team:StorageEngine v8.16.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants