Skip to content

Configure keeping source in FieldMapper#112706

Merged
elasticsearchmachine merged 22 commits intoelastic:mainfrom
kkrik-es:synthetic-source/keep-field-param
Sep 19, 2024
Merged

Configure keeping source in FieldMapper#112706
elasticsearchmachine merged 22 commits intoelastic:mainfrom
kkrik-es:synthetic-source/keep-field-param

Conversation

@kkrik-es
Copy link
Contributor

@kkrik-es kkrik-es commented Sep 10, 2024

Introduces per-field param synthetic_source_keep that overrides the behavior for keeping the field source in synthetic source mode:

  • none : no source is stored
  • arrays: the incoming source is recorded as-is for arrays of a given field
  • all: the incoming source is recorded as is for both singleton and array values of a given field

Related to #112012

@elasticsearchmachine
Copy link
Collaborator

Hi @kkrik-es, I've created a changelog YAML for you.

private boolean sourceMatchesExactly(MappingTransforms.FieldMapping mapping, List<Object> expectedValues) {
return mapping.parentMappingParameters().stream().anyMatch(m -> m.getOrDefault("enabled", "true").equals("false"));
return mapping.parentMappingParameters().stream().anyMatch(m -> m.getOrDefault("enabled", "true").equals("false"))
|| mapping.mappingParameters().getOrDefault("synthetic_source_keep", "none").equals("all");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lkts any hints on how to cover arrays too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that was the reason i passed expectedValues in here but it doesn't really work.
In theory this should work:

expectedValues.size() > 1 && mapping.mappingParameters().getOrDefault("synthetic_source_keep", "none").equals("arrays");

but it's also true when a field has a single value inside higher-level object array which won't trigger storing it in ignored source (at least in current impl).

Maybe we should embrace the pattern of "always try to exact match first and then do the custom match if that fails". I initially thought about it as temporary measure but it does make some things easier.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, there should be no complaints if we accidentally produce the same source..

But isn't this the case today, based on the code you call out above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can just remove sourceMatchesExactly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I misunderstood your suggestion. I think the current form is correct, the test should report an error if the source is expected to be an exact copy but it's not - even if relaxed matching succeeds.

@kkrik-es kkrik-es requested review from lkts and martijnvg September 10, 2024 17:59
@kkrik-es kkrik-es marked this pull request as ready for review September 10, 2024 17:59
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Copy link
Contributor

@lkts lkts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have questions about changes in StringStoredFieldFieldLoader and CompositeSyntheticFieldLoader.

private boolean sourceMatchesExactly(MappingTransforms.FieldMapping mapping, List<Object> expectedValues) {
return mapping.parentMappingParameters().stream().anyMatch(m -> m.getOrDefault("enabled", "true").equals("false"));
return mapping.parentMappingParameters().stream().anyMatch(m -> m.getOrDefault("enabled", "true").equals("false"))
|| mapping.mappingParameters().getOrDefault("synthetic_source_keep", "none").equals("all");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that was the reason i passed expectedValues in here but it doesn't really work.
In theory this should work:

expectedValues.size() > 1 && mapping.mappingParameters().getOrDefault("synthetic_source_keep", "none").equals("arrays");

but it's also true when a field has a single value inside higher-level object array which won't trigger storing it in ignored source (at least in current impl).

Maybe we should embrace the pattern of "always try to exact match first and then do the custom match if that fails". I initially thought about it as temporary measure but it does make some things easier.

What do you think?

this.parts = parts;
this.storedFieldLoadersHaveValues = false;
this.docValuesLoadersHaveValues = false;
// In text mappers, the leaf name is a prefix of the full name and we want to use the leaf name, e.g. `myfield.sdfge` => 'myfield'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should fix this specifically in text/keyword mapper which is the only place where this pattern is used. No need to complicate this generic code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the logic to keyword mapper, along with tests..

@Override
public String fieldName() {
return name;
return simpleName;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct when a field is inside an object.

Copy link
Contributor Author

@kkrik-es kkrik-es Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, either use a composite or store "actual field path" in this loader.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used a similar approach to keyword mapper, ptal.

}

@Override
protected SyntheticSourceSupport syntheticSourceSupportForKeepTests(boolean ignoreMalformed) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came up recently and it sounds like we don't intend to support BigDecimal in xcontent #111937. I wonder if this is a valid test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried adding casting for BigDecimal, still an issue.. It's freakin json.

# Conflicts:
#	test/framework/src/main/java/org/elasticsearch/logsdb/datageneration/datasource/DefaultMappingParametersHandler.java
@kkrik-es kkrik-es requested a review from lkts September 11, 2024 12:00
return fullPath.substring(0, fullPath.lastIndexOf(simpleName) + simpleName.length());
}

public SourceLoader.SyntheticFieldLoader syntheticFieldLoader(String simpleName, boolean trimLeafNameSuffix) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we take String fullPath as a parameter here? Then keyword and text mapper just pass fullPath() in here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sorry, simpler.

type: keyword
kw_arrays:
type: keyword
synthetic_source_keep: arrays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if _source.mode is stored and we use the synthetic_source_keep parameter? I expect that setting to be ignored completely. Mappings and settings are composed in templates so it is not unlikely that we will have mixed situations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the setting set to stored? I think I didn't get that part..

Copy link
Contributor

@salvatore-campagna salvatore-campagna Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This thing is called synthetic_source_keep but you can have an index that is not using synthetic source, right? In that case you would have a mapping parameter that is about synthetic source....but with no synthetic source (stored source). This might happen because of template and component template composition and to me looks a bit odd. Also nothing prevents a user from setting this parameter explicitly without using synthetic source.

Moreover, some users might adopt an index mode (like logsdb or tsdb) which uses syntehtic source without actually knowing that synthetic source is used under the hood. Then you have to explain them that they need a parameter called synthetic_source_keep to preserve arrays...which requires telling them that synthetic source is used under the hood.

Some index modes might also have both source modes...

I don't know TBH what is the right thing here...it just looks a bit odd to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing is the distinction between arrays and singletons...users might not necessarily know if a field is a single-value field or a multi-value field because they not necessarily control the mapping. Think about fields coming from OTel or some integration...we are asking them to make a choice...and most of the time, just to be on the safe side, they will need to chose all because they don't know if the filed is single-value or multi-value.

Copy link
Contributor

@salvatore-campagna salvatore-campagna Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should have just two choices, preserve_array, true or false. I understand that not making a distinction between singleton and arrays might result in performance penalty...under stored source preserve_array is a no-op.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a long thread on this topic in #112397.. Names are hard, we can argue for and against each option.

all should be used very deliberately, and probably rarely. The options for arrays is more widely applicable and easier to grasp, since this is a known limitation of synthetic source. It's thus better to have separate options for arrays and [arrays + singletons].

@salvatore-campagna
Copy link
Contributor

I think this needs to be documented with some examples...there might be corner cases where using this is not trivial. Is the documentation updated in another PR?

@kkrik-es
Copy link
Contributor Author

I think this needs to be documented with some examples...there might be corner cases where using this is not trivial. Is the documentation updated in another PR?

Yeah there will be another PR updating objects, I'll probably add documentation then. We also need to document the index-level setting..

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - left one question for my understanding.

&& fieldMapper.syntheticSourceMode() == FieldMapper.SyntheticSourceMode.FALLBACK;
boolean fieldWithStoredArraySource = mapper instanceof FieldMapper fieldMapper
&& context.sourceKeepModeFromIndexSettings() == Mapper.SourceKeepMode.ARRAYS;
&& getSourceKeepMode(context, fieldMapper.sourceKeepMode()) != Mapper.SourceKeepMode.NONE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question, should this just be: getSourceKeepMode(context, fieldMapper.sourceKeepMode()) = ARRAYS;, since this method is for dealing with arrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to keep array source both for ARRAYS and ALL values, I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, maybe my assumption (which may be wrong) is that in this method we only deal with arrays, so I thought just checking for ARRAYS should be sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the definition of ALL, the source should be recorded when parsing an array. Maybe I'm missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I understand that arrays are included with ALL. I think this code will work. I just wondered whether getSourceKeepMode(context, fieldMapper.sourceKeepMode()) = ARRAYS; would work here too, since only arrays should be handled by the parseNonDynamicArray(...) method?

Copy link
Contributor Author

@kkrik-es kkrik-es Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parseNonDynamicArray method is called when parsing an array in a doc. In both ARRAYS and ALL cases, we want to record the source . If we just check for ARRAYS here, the array source won't get recorded in the ALL case - this is similar to the check above for objectMapper.storeArraySource() that applies to objects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see now. This makes sense now.

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/mapper/DocumentParser.java
#	server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java
#	test/framework/src/main/java/org/elasticsearch/logsdb/datageneration/datasource/DefaultMappingParametersHandler.java
# Conflicts:
#	modules/data-streams/src/javaRestTest/java/org/elasticsearch/datastreams/logsdb/qa/StandardVersusLogsIndexModeRandomDataChallengeRestIT.java
@kkrik-es kkrik-es added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) and removed auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Sep 19, 2024
for (var field : ignoredFieldsMissingValues) {
fields.put(field.name(), field);
}
context.deduplicateIgnoredFieldValues(fields.keySet());
Copy link
Contributor Author

@kkrik-es kkrik-es Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fyi, I added a yaml test and it caught a bug where there may be duplicate values, as a leaf array may be recorded separately and before identifying that all leaf elements need to be recorded in the second pass. This fixes it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example? I don't get it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kkrik-es kkrik-es added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 19, 2024
@elasticsearchmachine elasticsearchmachine merged commit e244216 into elastic:main Sep 19, 2024
@kkrik-es kkrik-es deleted the synthetic-source/keep-field-param branch September 19, 2024 13:29
kkrik-es added a commit to kkrik-es/elasticsearch that referenced this pull request Sep 19, 2024
Introduces per-field param `synthetic_source_keep` that overrides the
behavior for keeping the field source in synthetic source mode:  -
`none` : no source is stored  - `arrays`: the incoming source is
recorded as-is for arrays of a given field  - `all`: the incoming source
is recorded as is for both singleton and array values of a given field

Related to elastic#112012
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

elasticsearchmachine pushed a commit that referenced this pull request Sep 19, 2024
Introduces per-field param `synthetic_source_keep` that overrides the
behavior for keeping the field source in synthetic source mode:  -
`none` : no source is stored  - `arrays`: the incoming source is
recorded as-is for arrays of a given field  - `all`: the incoming source
is recorded as is for both singleton and array values of a given field

Related to #112012
salvatore-campagna added a commit to elastic/rally-tracks that referenced this pull request Sep 30, 2024
This PR introduces a new track parameter, `synthetic_source_keep` which is used to control the
behaviour of synthetic source for all field types. It can have values `none`, `arrays` or `all` (`all`
not usable when set at index level).
See elastic/elasticsearch#112706 to understand the effect of each value.

Later on we will use this to change the behaviour in our nightlies and run benchmarks on both `elastic/logs`
and `elastic/security` using value `arrays`.
salvatore-campagna added a commit to salvatore-campagna/rally-tracks that referenced this pull request Sep 30, 2024
…tic#682)

This PR introduces a new track parameter, `synthetic_source_keep` which is used to control the
behaviour of synthetic source for all field types. It can have values `none`, `arrays` or `all` (`all`
not usable when set at index level).
See elastic/elasticsearch#112706 to understand the effect of each value.

Later on we will use this to change the behaviour in our nightlies and run benchmarks on both `elastic/logs`
and `elastic/security` using value `arrays`.
salvatore-campagna added a commit to elastic/rally-tracks that referenced this pull request Oct 1, 2024
#684)

This PR introduces a new track parameter, `synthetic_source_keep` which is used to control the
behaviour of synthetic source for all field types. It can have values `none`, `arrays` or `all` (`all`
not usable when set at index level).
See elastic/elasticsearch#112706 to understand the effect of each value.

Later on we will use this to change the behaviour in our nightlies and run benchmarks on both `elastic/logs`
and `elastic/security` using value `arrays`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement :StorageEngine/Mapping The storage related side of mappings Team:StorageEngine v8.16.0 v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants