Configure keeping source in FieldMapper #112706
elasticsearchmachine merged 22 commits into elastic:main
Hi @kkrik-es, I've created a changelog YAML for you.
…m' into synthetic-source/keep-field-param
  private boolean sourceMatchesExactly(MappingTransforms.FieldMapping mapping, List<Object> expectedValues) {
-     return mapping.parentMappingParameters().stream().anyMatch(m -> m.getOrDefault("enabled", "true").equals("false"));
+     return mapping.parentMappingParameters().stream().anyMatch(m -> m.getOrDefault("enabled", "true").equals("false"))
+         || mapping.mappingParameters().getOrDefault("synthetic_source_keep", "none").equals("all");
@lkts any hints on how to cover arrays too?
I think that was the reason I passed expectedValues in here, but it doesn't really work.
In theory this should work:
expectedValues.size() > 1 && mapping.mappingParameters().getOrDefault("synthetic_source_keep", "none").equals("arrays");
but it's also true when a field has a single value inside a higher-level object array, which won't trigger storing it in ignored source (at least in the current impl).
Maybe we should embrace the pattern of "always try to exact match first and then do the custom match if that fails". I initially thought about it as a temporary measure, but it does make some things easier.
What do you think?
Sounds good, there should be no complaints if we accidentally produce the same source.
But isn't this the case today, based on the code you call out above?
Yes, you can just remove sourceMatchesExactly.
Sorry I misunderstood your suggestion. I think the current form is correct, the test should report an error if the source is expected to be an exact copy but it's not - even if relaxed matching succeeds.
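The "exact match first, then custom match" pattern discussed in this thread, combined with the point that a relaxed match must not hide a failed exact match, could look roughly like the sketch below. All names here (`SourceMatchSketch`, `Result`, `mustMatchExactly`) are hypothetical and not taken from the Elasticsearch test framework, and the relaxed rule (ignore order and duplicates) is only an assumed stand-in for the real custom matchers.

```java
import java.util.List;
import java.util.Objects;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch; none of these names come from the Elasticsearch test framework.
final class SourceMatchSketch {

    enum Result { EXACT, RELAXED, MISMATCH }

    /**
     * @param mustMatchExactly true when the mapping implies the source is kept verbatim,
     *                         e.g. synthetic_source_keep is "all" or a parent object is disabled
     */
    static Result match(List<Object> expected, List<Object> actual, boolean mustMatchExactly) {
        // Step 1: always try the exact comparison first.
        if (Objects.equals(expected, actual)) {
            return Result.EXACT;
        }
        // Step 2: an exact copy was required, so relaxed matching must not mask the difference.
        if (mustMatchExactly) {
            return Result.MISMATCH;
        }
        // Step 3: relaxed comparison (assumed rule: order and duplicates do not matter).
        return normalize(expected).equals(normalize(actual)) ? Result.RELAXED : Result.MISMATCH;
    }

    // Sorted, de-duplicated string view of the values.
    private static SortedSet<String> normalize(List<Object> values) {
        SortedSet<String> out = new TreeSet<>();
        for (Object v : values) {
            out.add(String.valueOf(v));
        }
        return out;
    }
}
```

With this shape, accidentally producing an identical source is never an error, while a field that is expected to be an exact copy still fails even if the relaxed comparison would have passed.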
Pinging @elastic/es-storage-engine (Team:StorageEngine)
lkts left a comment
I have questions about changes in StringStoredFieldFieldLoader and CompositeSyntheticFieldLoader.
    this.parts = parts;
    this.storedFieldLoadersHaveValues = false;
    this.docValuesLoadersHaveValues = false;
    // In text mappers, the leaf name is a prefix of the full name and we want to use the leaf name, e.g. `myfield.sdfge` => 'myfield'
I think we should fix this specifically in text/keyword mapper which is the only place where this pattern is used. No need to complicate this generic code.
Moved the logic to the keyword mapper, along with tests.
server/src/main/java/org/elasticsearch/index/mapper/CompositeSyntheticFieldLoader.java
  @Override
  public String fieldName() {
-     return name;
+     return simpleName;
I don't think this is correct when a field is inside an object.
Good call, we need something better for this:
Yes, either use a composite or store "actual field path" in this loader.
Used an approach similar to the keyword mapper, PTAL.
    }

    @Override
    protected SyntheticSourceSupport syntheticSourceSupportForKeepTests(boolean ignoreMalformed) {
This came up recently and it sounds like we don't intend to support BigDecimal in xcontent #111937. I wonder if this is a valid test.
I tried adding casting for BigDecimal, still an issue. It's freakin' JSON.
# Conflicts:
#   test/framework/src/main/java/org/elasticsearch/logsdb/datageneration/datasource/DefaultMappingParametersHandler.java
        return fullPath.substring(0, fullPath.lastIndexOf(simpleName) + simpleName.length());
    }

    public SourceLoader.SyntheticFieldLoader syntheticFieldLoader(String simpleName, boolean trimLeafNameSuffix) {
Why don't we take String fullPath as a parameter here? Then keyword and text mapper just pass fullPath() in here.
Yes sorry, simpler.
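For reference, the suffix trimming in the hunk above can be reproduced in isolation. The sketch below wraps the same `substring`/`lastIndexOf` expression in a hypothetical helper (`FieldPathSketch.trimToLeaf` is an illustrative name, and the guard for a missing leaf name is an added assumption) to show what it returns for nested paths.

```java
// Hypothetical helper; only the substring/lastIndexOf expression comes from the diff.
final class FieldPathSketch {

    /** Keeps everything up to and including the last occurrence of simpleName. */
    static String trimToLeaf(String fullPath, String simpleName) {
        int idx = fullPath.lastIndexOf(simpleName);
        if (idx < 0) {
            // Assumption: leave the path unchanged when the leaf name is absent.
            return fullPath;
        }
        return fullPath.substring(0, idx + simpleName.length());
    }
}
```

`trimToLeaf("obj.myfield.sdfge", "myfield")` yields `obj.myfield`, which keeps the parent object path intact. The expression is sensitive to the suffix, though: `trimToLeaf("a.f.f_x", "f")` yields `a.f.f` because the last occurrence of the leaf name sits inside the suffix. Passing the full path in directly, as suggested above, avoids this kind of string surgery.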
server/src/main/java/org/elasticsearch/index/mapper/StringStoredFieldFieldLoader.java
    type: keyword
  kw_arrays:
    type: keyword
    synthetic_source_keep: arrays
What happens if _source.mode is stored and we use the synthetic_source_keep parameter? I expect that setting to be ignored completely. Mappings and settings are composed in templates so it is not unlikely that we will have mixed situations.
Which setting is set to stored? I think I didn't get that part.
This thing is called synthetic_source_keep, but you can have an index that is not using synthetic source, right? In that case you would have a mapping parameter that is about synthetic source, but with no synthetic source (stored source). This might happen because of template and component template composition, and it looks a bit odd to me. Also, nothing prevents a user from setting this parameter explicitly without using synthetic source.
Moreover, some users might adopt an index mode (like logsdb or tsdb) that uses synthetic source without actually knowing that synthetic source is used under the hood. Then you have to explain to them that they need a parameter called synthetic_source_keep to preserve arrays, which requires telling them that synthetic source is used under the hood.
Some index modes might also have both source modes...
I don't know, TBH, what the right thing is here. It just looks a bit odd to me.
Another thing is the distinction between arrays and singletons. Users might not know whether a field is single-valued or multi-valued, because they do not necessarily control the mapping. Think about fields coming from OTel or some integration: we are asking them to make a choice, and most of the time, just to be on the safe side, they will need to choose all because they don't know if the field is single-valued or multi-valued.
IMO we should have just two choices: preserve_array, true or false. I understand that not making a distinction between singletons and arrays might result in a performance penalty. Under stored source, preserve_array is a no-op.
We had a long thread on this topic in #112397. Names are hard; we can argue for and against each option.
all should be used very deliberately, and probably rarely. The option for arrays is more widely applicable and easier to grasp, since this is a known limitation of synthetic source. It's thus better to have separate options for arrays and [arrays + singletons].
I think this needs to be documented with some examples; there might be corner cases where using this is not trivial. Is the documentation updated in another PR?
Yeah, there will be another PR updating objects; I'll probably add documentation then. We also need to document the index-level setting.
martijnvg left a comment
LGTM - left one question for my understanding.
      && fieldMapper.syntheticSourceMode() == FieldMapper.SyntheticSourceMode.FALLBACK;
  boolean fieldWithStoredArraySource = mapper instanceof FieldMapper fieldMapper
-     && context.sourceKeepModeFromIndexSettings() == Mapper.SourceKeepMode.ARRAYS;
+     && getSourceKeepMode(context, fieldMapper.sourceKeepMode()) != Mapper.SourceKeepMode.NONE;
Question: should this just be getSourceKeepMode(context, fieldMapper.sourceKeepMode()) == ARRAYS, since this method is for dealing with arrays?
We want to keep array source both for ARRAYS and ALL values, I think.
Right, maybe my assumption (which may be wrong) is that in this method we only deal with arrays, so I thought just checking for ARRAYS should be sufficient.
Based on the definition of ALL, the source should be recorded when parsing an array. Maybe I'm missing something?
Right, I understand that arrays are included with ALL. I think this code will work. I just wondered whether getSourceKeepMode(context, fieldMapper.sourceKeepMode()) == ARRAYS would work here too, since only arrays should be handled by the parseNonDynamicArray(...) method?
The parseNonDynamicArray method is called when parsing an array in a doc. In both the ARRAYS and ALL cases, we want to record the source. If we just check for ARRAYS here, the array source won't get recorded in the ALL case. This is similar to the check above for objectMapper.storeArraySource() that applies to objects.
Ok, I see now. This makes sense now.
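The conclusion of this thread can be written down as a small decision table. The sketch below is illustrative only: `SourceKeepSketch` and its methods are hypothetical names, and modelling the field-level value as an `Optional` that overrides the index-level mode is an assumption based on the PR description, not a copy of `getSourceKeepMode`.

```java
import java.util.Optional;

// Hypothetical sketch of the keep-mode decisions discussed above.
final class SourceKeepSketch {

    enum SourceKeepMode { NONE, ARRAYS, ALL }

    // Assumption: a field-level value, when present, overrides the index-level one.
    static SourceKeepMode effectiveMode(SourceKeepMode indexLevel, Optional<SourceKeepMode> fieldLevel) {
        return fieldLevel.orElse(indexLevel);
    }

    // Array source is recorded for both ARRAYS and ALL, hence "!= NONE" rather than "== ARRAYS".
    static boolean storeArraySource(SourceKeepMode mode) {
        return mode != SourceKeepMode.NONE;
    }

    // Singleton values are recorded as-is only under ALL.
    static boolean storeSingletonSource(SourceKeepMode mode) {
        return mode == SourceKeepMode.ALL;
    }
}
```

Checking `== ARRAYS` in the array-parsing path would silently skip fields configured with `all`, which is the case the `!= NONE` comparison is meant to cover.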
# Conflicts:
#   server/src/main/java/org/elasticsearch/index/mapper/DocumentParser.java
#   server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java
#   test/framework/src/main/java/org/elasticsearch/logsdb/datageneration/datasource/DefaultMappingParametersHandler.java
# Conflicts:
#   modules/data-streams/src/javaRestTest/java/org/elasticsearch/datastreams/logsdb/qa/StandardVersusLogsIndexModeRandomDataChallengeRestIT.java
    for (var field : ignoredFieldsMissingValues) {
        fields.put(field.name(), field);
    }
    context.deduplicateIgnoredFieldValues(fields.keySet());
FYI, I added a YAML test and it caught a bug where there may be duplicate values: a leaf array may be recorded separately, before the second pass identifies that all leaf elements need to be recorded. This fixes it.
Do you have an example? I don't get it.
Values 1000, 2000 were showing up in the beginning of the list too.
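One way to picture the fix: values recorded in the first pass are dropped for any field that the second pass records again in full, so they cannot show up twice. The sketch below is a guess at the general idea only. `IgnoredValuesSketch.dropSuperseded` and the name/value entries are hypothetical, and the real `deduplicateIgnoredFieldValues` may work differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; it does not mirror the actual DocumentParserContext internals.
final class IgnoredValuesSketch {

    /**
     * Drops first-pass entries whose field name is recorded again in the second pass,
     * so the same values are not emitted twice.
     */
    static List<Map.Entry<String, String>> dropSuperseded(
        List<Map.Entry<String, String>> recorded,
        Set<String> rerecordedFields
    ) {
        List<Map.Entry<String, String>> kept = new ArrayList<>();
        for (Map.Entry<String, String> entry : recorded) {
            if (!rerecordedFields.contains(entry.getKey())) {
                kept.add(entry);
            }
        }
        return kept;
    }
}
```

In the scenario described above, an early entry holding `[1000, 2000]` for a leaf field would be removed once that field name appears in the second-pass key set, leaving only the complete second-pass value.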
Introduces per-field param `synthetic_source_keep` that overrides the behavior for keeping the field source in synthetic source mode:
- `none`: no source is stored
- `arrays`: the incoming source is recorded as-is for arrays of a given field
- `all`: the incoming source is recorded as-is for both singleton and array values of a given field

Related to elastic#112012
💚 Backport successful
This PR introduces a new track parameter, `synthetic_source_keep`, which is used to control the behaviour of synthetic source for all field types. It can have the values `none`, `arrays` or `all` (`all` is not usable when set at index level). See elastic/elasticsearch#112706 to understand the effect of each value. Later on we will use this to change the behaviour in our nightlies and run benchmarks on both `elastic/logs` and `elastic/security` using the value `arrays`.
Introduces per-field param `synthetic_source_keep` that overrides the behavior for keeping the field source in synthetic source mode:
- `none`: no source is stored
- `arrays`: the incoming source is recorded as-is for arrays of a given field
- `all`: the incoming source is recorded as-is for both singleton and array values of a given field

Related to #112012