-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Draft] Potential ECS mappings in Elasticsearch #89743
Conversation
# Matching all long fields | ||
# This matches the following fields: bytes, packets, port, duration, severity, pid | ||
# No path matching is needed as it is assumed these fields are already in the json doc a long | ||
# Elasticsearch would do this mapping automatically but it is specified here for documentation purpose |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personal preference but I think it's easier to reason about when we only add things that differ from the default. It makes the mapping more concise. I do see that it's less explicit and requires readers to know what dynamic: runtime
does. Overall, IMHO, it's a better trade-off to not include redundant mappings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The part that is important for me here is line 70 which documents which ECS fields it should match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW maybe a middle ground that could be found is leave the comments in but comment out the full block for documentation purpose. My assumption is that as soon as this would get loaded into ES / cluster state, comments would be kicked out anyways.
# TODO: Does this also match the top level labels? | ||
match: labels | ||
runtime: | ||
type: flattened |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Runtime fields don't support flattened.
See https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime-mapping-fields.html for a list of supported field types.
Just relying on dynamic: runtime seems sufficient for labels. But if we want to index them, either flattened or subobjects: false may be a good choice.
Creating too many fields doesn't seem unique to labels, there may also be many top-level fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Labels by nature have a lot of fields as these are for example k8s labels and not fields that are directly user created.
@jsoriano I remember you were involved by the time flattened was introduced but it seems we don't use it for labels today but some other places. Did we just never switch over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had this [private] https://github.com/elastic/obs-infraobs-team/issues/461 conversation. The general output was to wait till there is better general support, and evaluate case per case. Now using subobjects
may be also a good choice, except on places where we are also worried about fields explosion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we do want to use flattened, these would be indexed though, not runtime. How often are these labels queried / aggregated on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect in a k8s use case, filter queries on these could happen frequently, not sure about aggregations. But I would prefer to not have these indexed by default but also allow it in runtime.
Is the following correct: If the field is runtime, keyword vs flattened does not really make a difference. Only when indexed there is a difference between flattened and keyword. So basically as long as it is a runtime, we keep it keyword, only for index we recommend flattened. If yes, this leaves us still with the topic that the number of fields could explode because of lots of labels but this is something that could be overcome by the user based on the setting or with the idea from @felixbarny to have a soft limit for # of fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the field is runtime, keyword vs flattened does not really make a difference.
Runtime fields don't support flattened. Flattened implies a specific way of indexing the fields. If we want to use runtime fields for labels, the best option is to just not map them explicitly and rely on dynamic: runtime.
runtime: | ||
# TODO: This is a field type that does not exist today. What ECS does is that for all name fields, it | ||
# has a subfield .text to also do text indexing | ||
type: text_keyword |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if text_keyword will be created, it won't be supported for runtime fields. As runtime fields are not analyzed, there's only one type that's valid for string runtime fields, which is keyword
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should we do about these text fields? Just not have them by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, our only option is to map them as keyword
. Therefore, we could remove this block and rely on dynamic: runtime
.
# Matching all long fields | ||
# This matches the following fields: bytes, packets, port, duration, severity, pid | ||
# No path matching is needed as it is assumed these fields are already in the json doc a long | ||
# Elasticsearch would do this mapping automatically but it is specified here for documentation purpose |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The part that is important for me here is line 70 which documents which ECS fields it should match.
# TODO: Does this also match the top level labels? | ||
match: labels | ||
runtime: | ||
type: flattened |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Labels by nature have a lot of fields as these are for example k8s labels and not fields that are directly user created.
@jsoriano I remember you were involved by the time flattened was introduced but it seems we don't use it for labels today but some other places. Did we just never switch over?
runtime: | ||
# TODO: This is a field type that does not exist today. What ECS does is that for all name fields, it | ||
# has a subfield .text to also do text indexing | ||
type: text_keyword |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should we do about these text fields? Just not have them by default?
match_mapping_type: string | ||
match: message | ||
runtime: | ||
type: match_only_text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixbarny Based on https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime-mapping-fields.html it seems this will not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, only keyword
is supported. I'd just leave out this section and rely on dynamic: runtime
. It is somewhat unfortunate that runtime fields don't support full-text searches, only exact matches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can prioritize #76261 if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great if we could prioritise it as it is kind of add that one of the most common types in ES is not supported by runtime fields. Even if there would be huge limitations compared to index, I would argue this is still good enough.
# This matches the following fields: bytes, packets, port, duration, severity, pid | ||
# No path matching is needed as it is assumed these fields are already in the json doc a long | ||
# Elasticsearch would do this mapping automatically but it is specified here for documentation purpose | ||
# DISCUSS: If a field "7" as string comes in, will it still be mapped correctly? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it would be mapped as a runtime field of type keyword
. However, if users override the mapping so that the runtime field has a type of long
, I think ES will coerce the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes me think we should have the path match to have a good out of box experience. I assume the following would do the trick for bytes?
- bytes_match:
path_match: bytes
runtime:
type: long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many bytes fields are there? One? Is it always there or often or rarely?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not just about bytes but all these: bytes, packets, port, duration, severity, pid
. There are quite a few, I would say 20-30 (can do an exact count if it helps). If these are there or not heavily depends on the data. If you have anything related to processes, you will have always a pid, if you have access logs, like you will always have a port etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the following would do the trick for bytes?
dynamic: runtime
should be good enough for all these cases. I don't think we need to (and should) map any of these explicitly.
|
||
# Keyword fields to discuss if should be indexed or not. Proposed rule for discussion. Quorum needs to be reached | ||
# for a field to be indexed. Reason is going from runtime to index is easy, otherway around could be a breaking | ||
# It should also be discussed if logs-*, metrics-* and traces-* have exact same default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose each type will come with their own defaults. For example, log.level
doesn't make much sense in the context of metrics and traces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll start some grouping of the fields below to make the triaging easier.
# Match location fields to geopoint | ||
- match_location: | ||
# TODO: What is the correct matching type here? | ||
match_mapping_type: object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: when you have name conventions, I think you can safely drop the match_mapping_type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I had it more as a safeguard. What will happen, if a field is location
and it is keyword instead of an object? As it is a runtime field, ES will just ignore it as conversion fails or is this the script error case?
match_mapping_type: string | ||
match: strings | ||
runtime: | ||
type: wildcard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wildcard is also not supported by runtime fields, this would be keyword already with dynamic:runtime, unless you want it indexed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any plans to support wildcard?
A more general thought: There are cases like wildcard or text where runtime field will be less powerful then the indexed variant which I would argue is kind of expected. Would it be possible that instead of the person implementing the template having to think this through, ES would internally do a fallback? Lets say wildcard is internally just treated like a keyword field if used as runtime and if the user sends a query with *
inside it might return a warning. The reason for the above is that with this we keep the nice property of getting from runtime to index is just moving around the definition instead of having to also select a different type.
- match_ecs.version: | ||
path_match: ecs.version | ||
# DISCUSS: In ECS this is still a keyword field | ||
type: version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mapped as an indexed field but is under the ### Runtime mappings
section. The version type is currently not supported on runtime fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the ECS version should be an indexed field as I have not seen lots of query on it. I'll add it as runtime field.
Even in ECS today it is a keyword as by the time it was introduced, version did not exist in Elasticsearch. The simplest is to just map it to keyword as this is being supported by Elasticsearch and is what is in ECS. But is it future proof?
# still be adjusted but not something we should have the user do by default. | ||
# TODO: if path_match would be an array, this could be defined much nicer and more compact | ||
- match_bytes: | ||
path_match: bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these use match
instead of path_match
so that not only the top-level bytes
field but also foo.bytes
matches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I keep mixing up match
and patch_match
. The idea is that it should match all *.bytes
. Same is true for all the other match_*
runtime: | ||
type: ip | ||
- match_forwarded_ip: | ||
match: forwarded_ip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we anticipate more ip fields that end with _ip
match: forwarded_ip | |
match: *_ip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea. Can't see a reason at the moment why something would be called _ip but not be an ip ...
As a first step, this is now implemented in elastic-package with dynamic templates: elastic/integrations#5055 This allows each integration to adopt it when ready. Eventually, it should become the default in Elasticsearch and be available directly from Elasticsearch. I'm temporarily closing this PR but I expect it to be reopened / fresh started again as soon as elastic/integrations#5055 is completed. |
This PR is intended to have a conversation about potential mappings for ECS in Elasticsearch. It might not ever get merged into the ES code based but allows to keep a history of the conversation and changes. The core idea of the PR are as following:
@timestamp
are runtime fields by default and not indexed"7"
in JSON needs to be mapped to long.subobject: false
is set to flatten all the documents that are ingested. This ensures no conflicts can happen on fields likehost
andhost.name
Implementation
Even though this PR contains the potential template that could be used for ECS it does not contain any code around the implementation. Implementation could happen in multipel phases:
Discussions