Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Draft] Potential ECS mappings in Elasticsearch #89743

Closed
wants to merge 9 commits into from

Conversation

ruflin
Copy link
Contributor

@ruflin ruflin commented Aug 31, 2022

This PR is intended to have a conversation about potential mappings for ECS in Elasticsearch. It might not ever get merged into the ES code based but allows to keep a history of the conversation and changes. The core idea of the PR are as following:

  • Map all core ECS fields through dynamic mappings on the paths. This will make sure only the fields that are contained in each index will be part of the mapping itself
  • All fields except @timestamp are runtime fields by default and not indexed
  • long and double values are matched with a path to ensure also string values are mapped correct. For example "7" in JSON needs to be mapped to long.
  • subobject: false is set to flatten all the documents that are ingested. This ensures no conflicts can happen on fields like host and host.name

Implementation

Even though this PR contains the potential template that could be used for ECS it does not contain any code around the implementation. Implementation could happen in multipel phases:

  1. ECS component template for each ECS version exists in Elasticsearch that can easily be used
  2. ECS config flag in data stream / mapping to use ECS template. This heavily simplifies the usage

Discussions

  • ECS versions: Should this template be versioned? The current suggestion is yes to prevent any unexpected changes to users. At the same time not versioning it would have the benefits of users get the benefits automatically when upgrading Elasticsearch.
  • Fields indexed by default: Which fields should be indexed by default?
  • How do we handle keyword_text multifields?
  • What about extended ECS fields?

@elasticsearchmachine elasticsearchmachine added v8.5.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Aug 31, 2022
# Matching all long fields
# This matches the following fields: bytes, packets, port, duration, severity, pid
# No path matching is needed as it is assumed these fields are already in the json doc a long
# Elasticsearch would do this mapping automatically but it is specified here for documentation purpose
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal preference but I think it's easier to reason about when we only add things that differ from the default. It makes the mapping more concise. I do see that it's less explicit and requires readers to know what dynamic: runtime does. Overall, IMHO, it's a better trade-off to not include redundant mappings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part that is important for me here is line 70 which documents which ECS fields it should match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW maybe a middle ground that could be found is leave the comments in but comment out the full block for documentation purpose. My assumption is that as soon as this would get loaded into ES / cluster state, comments would be kicked out anyways.

x-pack/plugin/core/src/main/resources/ecs-template.yml Outdated Show resolved Hide resolved
x-pack/plugin/core/src/main/resources/ecs-template.yml Outdated Show resolved Hide resolved
x-pack/plugin/core/src/main/resources/ecs-template.yml Outdated Show resolved Hide resolved
# TODO: Does this also match the top level labels?
match: labels
runtime:
type: flattened
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime fields don't support flattened.

See https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime-mapping-fields.html for a list of supported field types.

Just relying on dynamic: runtime seems sufficient for labels. But if we want to index them, either flattened or subobjects: false may be a good choice.

Creating too many fields doesn't seem unique to labels, there may also be many top-level fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Labels by nature have a lot of fields as these are for example k8s labels and not fields that are directly user created.

@jsoriano I remember you were involved by the time flattened was introduced but it seems we don't use it for labels today but some other places. Did we just never switch over?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had this [private] https://github.com/elastic/obs-infraobs-team/issues/461 conversation. The general output was to wait till there is better general support, and evaluate case per case. Now using subobjects may be also a good choice, except on places where we are also worried about fields explosion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we do want to use flattened, these would be indexed though, not runtime. How often are these labels queried / aggregated on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect in a k8s use case, filter queries on these could happen frequently, not sure about aggregations. But I would prefer to not have these indexed by default but also allow it in runtime.

Is the following correct: If the field is runtime, keyword vs flattened does not really make a difference. Only when indexed there is a difference between flattened and keyword. So basically as long as it is a runtime, we keep it keyword, only for index we recommend flattened. If yes, this leaves us still with the topic that the number of fields could explode because of lots of labels but this is something that could be overcome by the user based on the setting or with the idea from @felixbarny to have a soft limit for # of fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the field is runtime, keyword vs flattened does not really make a difference.

Runtime fields don't support flattened. Flattened implies a specific way of indexing the fields. If we want to use runtime fields for labels, the best option is to just not map them explicitly and rely on dynamic: runtime.

runtime:
# TODO: This is a field type that does not exist today. What ECS does is that for all name fields, it
# has a subfield .text to also do text indexing
type: text_keyword
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if text_keyword will be created, it won't be supported for runtime fields. As runtime fields are not analyzed, there's only one type that's valid for string runtime fields, which is keyword.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do about these text fields? Just not have them by default?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, our only option is to map them as keyword. Therefore, we could remove this block and rely on dynamic: runtime.

x-pack/plugin/core/src/main/resources/ecs-template.yml Outdated Show resolved Hide resolved
# Matching all long fields
# This matches the following fields: bytes, packets, port, duration, severity, pid
# No path matching is needed as it is assumed these fields are already in the json doc a long
# Elasticsearch would do this mapping automatically but it is specified here for documentation purpose
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part that is important for me here is line 70 which documents which ECS fields it should match.

x-pack/plugin/core/src/main/resources/ecs-template.yml Outdated Show resolved Hide resolved
x-pack/plugin/core/src/main/resources/ecs-template.yml Outdated Show resolved Hide resolved
# TODO: Does this also match the top level labels?
match: labels
runtime:
type: flattened
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Labels by nature have a lot of fields as these are for example k8s labels and not fields that are directly user created.

@jsoriano I remember you were involved by the time flattened was introduced but it seems we don't use it for labels today but some other places. Did we just never switch over?

runtime:
# TODO: This is a field type that does not exist today. What ECS does is that for all name fields, it
# has a subfield .text to also do text indexing
type: text_keyword
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do about these text fields? Just not have them by default?

match_mapping_type: string
match: message
runtime:
type: match_only_text
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, only keyword is supported. I'd just leave out this section and rely on dynamic: runtime. It is somewhat unfortunate that runtime fields don't support full-text searches, only exact matches.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can prioritize #76261 if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great if we could prioritise it as it is kind of add that one of the most common types in ES is not supported by runtime fields. Even if there would be huge limitations compared to index, I would argue this is still good enough.

# This matches the following fields: bytes, packets, port, duration, severity, pid
# No path matching is needed as it is assumed these fields are already in the json doc a long
# Elasticsearch would do this mapping automatically but it is specified here for documentation purpose
# DISCUSS: If a field "7" as string comes in, will it still be mapped correctly?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it would be mapped as a runtime field of type keyword. However, if users override the mapping so that the runtime field has a type of long, I think ES will coerce the type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes me think we should have the path match to have a good out of box experience. I assume the following would do the trick for bytes?

      - bytes_match:
          path_match: bytes
          runtime:
            type: long

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many bytes fields are there? One? Is it always there or often or rarely?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not just about bytes but all these: bytes, packets, port, duration, severity, pid. There are quite a few, I would say 20-30 (can do an exact count if it helps). If these are there or not heavily depends on the data. If you have anything related to processes, you will have always a pid, if you have access logs, like you will always have a port etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the following would do the trick for bytes?

dynamic: runtime should be good enough for all these cases. I don't think we need to (and should) map any of these explicitly.


# Keyword fields to discuss if should be indexed or not. Proposed rule for discussion. Quorum needs to be reached
# for a field to be indexed. Reason is going from runtime to index is easy, otherway around could be a breaking
# It should also be discussed if logs-*, metrics-* and traces-* have exact same default.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose each type will come with their own defaults. For example, log.level doesn't make much sense in the context of metrics and traces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll start some grouping of the fields below to make the triaging easier.

@javanna javanna added :Search Foundations/Mapping Index mappings, including merging and defining field types and removed external-contributor Pull request authored by a developer outside the Elasticsearch team labels Sep 1, 2022
@javanna javanna self-assigned this Sep 1, 2022
# Match location fields to geopoint
- match_location:
# TODO: What is the correct matching type here?
match_mapping_type: object
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: when you have name conventions, I think you can safely drop the match_mapping_type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I had it more as a safeguard. What will happen, if a field is location and it is keyword instead of an object? As it is a runtime field, ES will just ignore it as conversion fails or is this the script error case?

match_mapping_type: string
match: strings
runtime:
type: wildcard
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wildcard is also not supported by runtime fields, this would be keyword already with dynamic:runtime, unless you want it indexed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any plans to support wildcard?

A more general thought: There are cases like wildcard or text where runtime field will be less powerful then the indexed variant which I would argue is kind of expected. Would it be possible that instead of the person implementing the template having to think this through, ES would internally do a fallback? Lets say wildcard is internally just treated like a keyword field if used as runtime and if the user sends a query with * inside it might return a warning. The reason for the above is that with this we keep the nice property of getting from runtime to index is just moving around the definition instead of having to also select a different type.

- match_ecs.version:
path_match: ecs.version
# DISCUSS: In ECS this is still a keyword field
type: version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mapped as an indexed field but is under the ### Runtime mappings section. The version type is currently not supported on runtime fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the ECS version should be an indexed field as I have not seen lots of query on it. I'll add it as runtime field.

Even in ECS today it is a keyword as by the time it was introduced, version did not exist in Elasticsearch. The simplest is to just map it to keyword as this is being supported by Elasticsearch and is what is in ECS. But is it future proof?

# still be adjusted but not something we should have the user do by default.
# TODO: if path_match would be an array, this could be defined much nicer and more compact
- match_bytes:
path_match: bytes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these use match instead of path_match so that not only the top-level bytes field but also foo.bytes matches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I keep mixing up match and patch_match. The idea is that it should match all *.bytes. Same is true for all the other match_*

runtime:
type: ip
- match_forwarded_ip:
match: forwarded_ip
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we anticipate more ip fields that end with _ip

Suggested change
match: forwarded_ip
match: *_ip

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea. Can't see a reason at the moment why something would be called _ip but not be an ip ...

@ruflin
Copy link
Contributor Author

ruflin commented Feb 20, 2023

As a first step, this is now implemented in elastic-package with dynamic templates: elastic/integrations#5055 This allows each integration to adopt it when ready. Eventually, it should become the default in Elasticsearch and be available directly from Elasticsearch.

I'm temporarily closing this PR but I expect it to be reopened / fresh started again as soon as elastic/integrations#5055 is completed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Search Foundations/Mapping Index mappings, including merging and defining field types
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants