Control storing array source with index setting by kkrik-es · Pull Request #112397 · elastic/elasticsearch

kkrik-es · 2024-08-30T14:48:07Z

Introduce an index setting that forces storing the source of leaf field and object arrays in synthetic source mode. Nested objects are excluded as they already preserve ordering in synthetic source.

Next step is to introduce override params at the mapper level that will allow disabling the source, or storing the source for arrays (if not enabled at index level), or storing the source for both arrays and singletons. This will happen in follow-up changes, so that we can benchmark the impact of this change in parallel.

Related to #112012

elasticsearchmachine · 2024-08-30T14:49:12Z

Hi @kkrik-es, I've created a changelog YAML for you.

…ray' into synthetic-source/store-source-array

# Conflicts: # server/src/main/java/org/elasticsearch/index/mapper/MapperFeatures.java

kkrik-es · 2024-08-31T09:57:32Z

run elasticsearch-ci/bwc-snapshots

elasticsearchmachine · 2024-09-02T06:02:16Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

kkrik-es · 2024-09-02T06:04:54Z

server/src/main/java/org/elasticsearch/index/mapper/Mapper.java

+    public enum StoreSourceMode {
+        DISABLED("disabled"),  // No source recording
+        ARRAYS("arrays"),      // Store source for arrays of mapped fields
+        ENABLED("enabled");    // Store source for both singletons and arrays of mapped fields


Not excited about the names here.. Maybe DISABLED => OFF or NONE , ENABLED => FULL or ALL?

Maybe naive question, but why do we need to enum here? For now at least it seems that just the index.mapping.store_array_source setting is sufficient?

The idea is that we can add the FULL option for some fields, in case we want to track their source even if they're singletons, e.g. for geo or for objects with deep structure like nested. This option will never be set at the index level as it would disable synthetic source.

I think that @jimczi also expressed the desire to generalize the option of storing the source of objects or fields verbatim.

Updated to none/arrays/all, ptal.

martijnvg · 2024-09-02T10:40:32Z

.../src/yamlRestTest/resources/rest-api-spec/test/indices.create/21_synthetic_source_stored.yml

@@ -0,0 +1,732 @@
+---
+object param - object array:


Are these just moved yaml tests?

Yeah moved the ones with store_array_source and added some similar with the index setting.

martijnvg · 2024-09-02T10:42:59Z

server/src/main/java/org/elasticsearch/index/mapper/Mapper.java

+    public enum StoreSourceMode {
+        DISABLED("disabled"),  // No source recording
+        ARRAYS("arrays"),      // Store source for arrays of mapped fields
+        ENABLED("enabled");    // Store source for both singletons and arrays of mapped fields


Maybe naive question, but why do we need to enum here? For now at least it seems that just the index.mapping.store_array_source setting is sufficient?

…ray' into synthetic-source/store-source-array

...lasticsearch/datastreams/logsdb/qa/StandardVersusLogsIndexModeRandomDataChallengeRestIT.java

lkts · 2024-09-03T16:39:50Z

server/src/main/java/org/elasticsearch/index/mapper/Mapper.java

+
+    // Setting StoreSourceMode to ENABLED at the index level is equivalent to disabling synthetic source, which is not desired.
+    // Since the only valid option is to track array source by default, we use a boolean index setting for it.
+    public static final Setting<Boolean> STORE_ARRAY_SOURCE_SETTING = Setting.boolSetting(


I am not really happy that the name will be different between index settings and mapper parameters (we have them named the same at least for ignore_malformed) but i see why it is done. Can we maybe make all not valid for index setting but have the same name?

Yeah that was my initial take here, too. An enum here is more flexible, we may want to have other values in the future..

Changed to match, @martijnvg wdyt?

jimczi

Did we consider offering this option at the mapping level instead of applying it to the global index settings? We don't need this to be the default. It could be enabled at the object level within the mapping. We could also provide the option at the root object level, similar to how we handle subobjects.

Allowing this option for the entire mapping might not be ideal, especially if there are many objects. I believe the need to preserve array order is likely specific to certain parts of the mapping for particular use cases?

Another solution could be to require the field to be nested in order to preserve the ordering. Forcing users to use nested fields when they have multiple objects to differentiate feels more consistent to me. Otherwise, we're introducing a third method for handling arrays of objects, which would only apply to the _source field and be inconsistent with how the data is indexed.

kkrik-es · 2024-09-03T17:40:25Z

Did we consider offering the index-level, catch-all option at the mapping level instead of applying it to the global index settings? We don't need this to be the default. It could be enabled at the object level within the mapping. We could also provide the option at the root object level, similar to how we handle subobjects.

Allowing this option for the entire mapping might not be ideal, especially if there are many objects. I believe the need to preserve array order is likely specific to certain parts of the mapping for particular use cases?

Another solution could be to require the field to be nested in order to preserve the ordering. Forcing users to use nested fields when they have multiple objects to differentiate feels more consistent to me. Otherwise, we're introducing a third method for handling arrays of objects, which would only apply to the _source field and be inconsistent with how the data is indexed.

We'll be adding the per-field option in follow-up PRs, hence the introduced enum.

We want to include the index-level, catch-all option for logsdb mode to avoid surprises when users get switched to use it under the covers (e.g. in serverless). They may have custom mappings that rely on array valyes appearing in the expected order in source (e.g. args), so the switch to synthetic source shouldn't break them.

Iiuc, nested fields (ie type:nested) have bloated overhead, compared to number or keyword fields, for arrays of leaf values - which seems to be the majority of cases in integrations (at least).

jimczi

Thanks for explaining @kkrik-es!
I left one question regarding the implementation. My main worry is whether we can restrict the storage overhead to actual arrays of objects and not to all objects.

server/src/main/java/org/elasticsearch/index/mapper/Mapper.java

jimczi · 2024-09-03T18:43:47Z

server/src/main/java/org/elasticsearch/index/mapper/DocumentParser.java

                parser = context.parser();
            }
            context = context.createNestedContext((NestedObjectMapper) context.parent());
+        } else if (context.storeSourceModeFromIndexSettings() == Mapper.StoreSourceMode.ALL && context.canAddIgnoredField()) {


I don't understand why this is needed for the singleton case. I thought that store array source would only apply to arrays, do we also need to store single objects?

Hm correct, this is not possible actually.. I'll revert it.

It will be added as a singleton option for per-field configuration - there are cases where we may want to store the source as-is, e.g. for geo or full object hierarchies. We also have the equivalent for nested - highjacking store_array_source:

https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/index/mapper/DocumentParser.java#L300

Btw this is intended to replace store_array_source for objects too.

kkrik-es · 2024-09-04T08:44:51Z

Early results from Rally show no significant increase in storage size (1%). Indexing took 15% more, something to keep in mind..

    |                    Cumulative indexing time of primary shards |    866.076       |    992.146       |    126.07        |    min |  +14.56% |
    |             Min cumulative indexing time across primary shard |      2.98425     |      3.5261      |      0.54185     |    min |  +18.16% |
    |          Median cumulative indexing time across primary shard |     10.8559      |     13.1929      |      2.33698     |    min |  +21.53% |
    |             Max cumulative indexing time across primary shard |    154.211       |    168.458       |     14.2462      |    min |   +9.24% |
    |           Cumulative indexing throttle time of primary shards |      0           |      0           |      0           |    min |    0.00% |
    |    Min cumulative indexing throttle time across primary shard |      0           |      0           |      0           |    min |    0.00% |
    | Median cumulative indexing throttle time across primary shard |      0           |      0           |      0           |    min |    0.00% |
    |    Max cumulative indexing throttle time across primary shard |      0           |      0           |      0           |    min |    0.00% |
    |                       Cumulative merge time of primary shards |    376.848       |    375.484       |     -1.36412     |    min |   -0.36% |
    |                      Cumulative merge count of primary shards |   1212           |   1377           |    165           |        |  +13.61% |
    |                Min cumulative merge time across primary shard |      0.57455     |      0.46835     |     -0.1062      |    min |  -18.48% |
    |             Median cumulative merge time across primary shard |      1.96435     |      2.15628     |      0.19193     |    min |   +9.77% |
    |                Max cumulative merge time across primary shard |    103.914       |     95.2852      |     -8.62887     |    min |   -8.30% |
    |              Cumulative merge throttle time of primary shards |    100.701       |     94.4205      |     -6.28095     |    min |   -6.24% |
    |       Min cumulative merge throttle time across primary shard |      0.00528333  |      0           |     -0.00528     |    min | -100.00% |
    |    Median cumulative merge throttle time across primary shard |      0.0449333   |      0.0538833   |      0.00895     |    min |  +19.92% |
    |       Max cumulative merge throttle time across primary shard |     30.6708      |     26.8331      |     -3.83773     |    min |  -12.51% |
    |                     Cumulative refresh time of primary shards |     32.9666      |     32.2713      |     -0.69533     |    min |   -2.11% |
    |                    Cumulative refresh count of primary shards |   8192           |   8155           |    -37           |        |   -0.45% |
    |              Min cumulative refresh time across primary shard |      0.0171667   |      0.0230167   |      0.00585     |    min |  +34.08% |
    |           Median cumulative refresh time across primary shard |      0.0488667   |      0.0546833   |      0.00582     |    min |  +11.90% |
    |              Max cumulative refresh time across primary shard |     10.6189      |     10.3388      |     -0.28003     |    min |   -2.64% |
    |                       Cumulative flush time of primary shards |     47.6639      |     42.9342      |     -4.7297      |    min |   -9.92% |
    |                      Cumulative flush count of primary shards |   3811           |   3483           |   -328           |        |   -8.61% |
    |                Min cumulative flush time across primary shard |      0.390167    |      0.372167    |     -0.018       |    min |   -4.61% |
    |             Median cumulative flush time across primary shard |      1.2428      |      1.0401      |     -0.2027      |    min |  -16.31% |
    |                Max cumulative flush time across primary shard |      3.05592     |      2.71155     |     -0.34437     |    min |  -11.27% |
    |                                       Total Young Gen GC time |     96.46        |    105.877       |      9.417       |      s |   +9.76% |
    |                                      Total Young Gen GC count |  11580           |  12883           |   1303           |        |  +11.25% |
    |                                         Total Old Gen GC time |      0           |      0           |      0           |      s |    0.00% |
    |                                        Total Old Gen GC count |      0           |      0           |      0           |        |    0.00% |
    |                                                  Dataset size |     77.716       |     78.7394      |      1.02342     |     GB |   +1.32% |
    |                                                    Store size |     77.716       |     78.7394      |      1.02342     |     GB |   +1.32% |
    |                                                 Translog size |      3.99537e-06 |      3.99537e-06 |      0           |     GB |    0.00% |
    |                                        Heap used for segments |      0           |      0           |      0           |     MB |    0.00% |
    |                                      Heap used for doc values |      0           |      0           |      0           |     MB |    0.00% |
    |                                           Heap used for terms |      0           |      0           |      0           |     MB |    0.00% |
    |                                           Heap used for norms |      0           |      0           |      0           |     MB |    0.00% |
    |                                          Heap used for points |      0           |      0           |      0           |     MB |    0.00% |
    |                                   Heap used for stored fields |      0           |      0           |      0           |     MB |    0.00% |
    |                                                 Segment count |    961           |   1050           |     89           |        |   +9.26% |
    |                                   Total Ingest Pipeline count |      4.8861e+08  |      4.8861e+08  |      0           |        |    0.00% |
    |                                    Total Ingest Pipeline time |      3.17516e+07 |      3.2345e+07  | 593340           |     ms |   +1.87% |
    |                                  Total Ingest Pipeline failed |      0           |      0           |      0           |        |    0.00% |

results.txt

martijnvg

Left two comments, otherwise LGTM

server/src/main/java/org/elasticsearch/index/mapper/Mapper.java

martijnvg · 2024-09-04T11:56:06Z

server/src/main/java/org/elasticsearch/index/mapper/Mapper.java

+    public static final Setting<StoreSourceMode> STORE_ARRAY_SOURCE_SETTING = Setting.enumSetting(
+        StoreSourceMode.class,
+        "index.mapping.store_source",
+        StoreSourceMode.NONE,


Should the default by arrays if index.mode is logsdb?

I think we should change this in the logs template, so that it can be overriden for logsdb indexes, if not desired.

I see, I didn't know this was the plan to enable. This makes sense to me.

Introduce an index setting that forces storing the source of leaf field and object arrays in synthetic source mode. Nested objects are excluded as they already preserve ordering in synthetic source. Next step is to introduce override params at the mapper level that will allow disabling the source, or storing the source for arrays (if not enabled at index level), or storing the source for both arrays and singletons. This will happen in follow-up changes, so that we can benchmark the impact of this change in parallel. Related to elastic#112012

kkrik-es added 2 commits August 30, 2024 17:34

Control storing array source with index setting

b21c6dd

bypass for nested

717aacb

kkrik-es self-assigned this Aug 30, 2024

kkrik-es added >enhancement Team:StorageEngine :StorageEngine/Mapping The storage related side of mappings labels Aug 30, 2024

elasticsearchmachine added the v8.16.0 label Aug 30, 2024

Update docs/changelog/112397.yaml

4d27b0c

kkrik-es added 4 commits August 30, 2024 21:57

add node feature

0d1ec40

Merge remote-tracking branch 'origin/synthetic-source/store-source-ar…

2c3c5ef

…ray' into synthetic-source/store-source-array

Merge branch 'main' into synthetic-source/store-source-array

99bf5a9

# Conflicts: # server/src/main/java/org/elasticsearch/index/mapper/MapperFeatures.java

Merge branch 'main' into synthetic-source/store-source-array

cd14476

Merge branch 'elastic:main' into synthetic-source/store-source-array

0bac4fd

kkrik-es marked this pull request as ready for review September 2, 2024 06:01

kkrik-es requested a review from martijnvg September 2, 2024 06:02

kkrik-es commented Sep 2, 2024

View reviewed changes

martijnvg reviewed Sep 2, 2024

View reviewed changes

kkrik-es added 5 commits September 3, 2024 15:01

Merge branch 'main' into synthetic-source/store-source-array

cf74008

update names

d87257a

Merge remote-tracking branch 'origin/synthetic-source/store-source-ar…

8055705

…ray' into synthetic-source/store-source-array

add comment

524e82a

add randomized test coverage

40e7828

kkrik-es requested a review from lkts September 3, 2024 14:00

lkts approved these changes Sep 3, 2024

View reviewed changes

jimczi reviewed Sep 3, 2024

View reviewed changes

kkrik-es added 4 commits September 4, 2024 10:49

Merge branch 'main' into synthetic-source/store-source-array

2de740f

use enum in index setting

3bb31eb

revert change for using index.ALL

50d5dc5

update comment

2c4d97b

kkrik-es added 3 commits September 4, 2024 12:08

update randomized test

e65959d

skip nested objects

2d4ea8e

remove empty line

6ce3d8b

martijnvg approved these changes Sep 4, 2024

View reviewed changes

kkrik-es added 5 commits September 4, 2024 16:58

rename index setting

4ce6096

rename enum

1ef6219

rename enum

950fcfb

rename enum

4348e3f

rename enum

7d0e84a

kkrik-es added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 4, 2024

elasticsearchmachine merged commit d5bae2c into elastic:main Sep 4, 2024

kkrik-es deleted the synthetic-source/store-source-array branch September 4, 2024 15:12

kkrik-es mentioned this pull request Sep 13, 2024

Configure keeping source in FieldMapper #112706

Merged

Conversation

kkrik-es commented Aug 30, 2024

Uh oh!

elasticsearchmachine commented Aug 30, 2024

Uh oh!

kkrik-es commented Aug 31, 2024

Uh oh!

elasticsearchmachine commented Sep 2, 2024

Uh oh!

kkrik-es Sep 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lkts Sep 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kkrik-es Sep 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jimczi left a comment

Choose a reason for hiding this comment

Uh oh!

kkrik-es commented Sep 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jimczi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kkrik-es Sep 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kkrik-es commented Sep 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martijnvg left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martijnvg Sep 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

kkrik-es Sep 2, 2024 •

edited

Loading

lkts Sep 3, 2024 •

edited

Loading

kkrik-es Sep 4, 2024 •

edited

Loading

kkrik-es commented Sep 3, 2024 •

edited

Loading

kkrik-es Sep 3, 2024 •

edited

Loading

kkrik-es commented Sep 4, 2024 •

edited

Loading

martijnvg Sep 4, 2024 •

edited

Loading