Skip to content

Allow dots in field names.#5983

Open
iinuwa wants to merge 2 commits intoGraylog2:masterfrom
iinuwa:allow-nested-objects
Open

Allow dots in field names.#5983
iinuwa wants to merge 2 commits intoGraylog2:masterfrom
iinuwa:allow-nested-objects

Conversation

@iinuwa
Copy link

@iinuwa iinuwa commented May 31, 2019

Description

This removes the features that filter and rename field names on ingestion.

Motivation and Context

Now that Elasticsearch supports dots in field names to delimit nested objects and Graylog 3.0 has dropped support for Elasticsearch versions without this feature, this PR re-enables dots in field names.

Fixes. #4583

How Has This Been Tested?

The tests should all pass. I have also done an informal test in my environment, and the nested object mappings are created in Elasticsearch as expected.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly. (Cf. Graylog2/documentation@e60ad962)
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

I am not sure how the documentation should be modified, but I created a proposed file change in the documentation repo to be published if this is accepted in Graylog 3.1.0.

@CLAassistant
Copy link

CLAassistant commented May 31, 2019

CLA assistant check
All committers have signed the CLA.

@kroepke
Copy link
Member

kroepke commented Jun 4, 2019

We do plan on allowing nested objects in the future, however we cannot simply revert the existing behavior and change extractors because that will break existing users' setups. Pipeline functions might need changes as well.

In addition displaying nested objects opens up all kinds of issues, too, which aren't addressed here.

Data semantics are also an issue, for example to make sense of arrays in nested objects, we will need to support nested aggregations, which is not straightforward to implement because the query engine needs the exact mapping type or it will generated invalid queries.

I'm reluctant to close this PR because I support the goal, however I believe that nested object support is much more involved than just allowing indexing them.

I'll leave it open for now for the sake of discussion, but please don't be disappointed if this does not get merged any time soon.

@iinuwa
Copy link
Author

iinuwa commented Jun 4, 2019

No, I completely understand; this PR was mostly to get the ball rolling on a feature that we need in my organization. Glad there's a plan for it!

Some thoughts I had:

We haven't implemented Graylog yet, so I'm happy to help with greenfield testing, but I imagine that the harder part will be supporting existing setups as you say. I wonder if it may be useful to publish a new spec for GELF with support for arrays and objects (cf #668). That may provide a migration path by allowing users to configure GELF inputs to use a specific GELF spec. That way, the inputs defined with the 1.1 spec would still process as expected, and 1.2 inputs could use the new functionality.

IMO, displaying nested objects with dot-separated field names in the UI is pretty natural. I believe that is how Elasticsearch queries reference the fields and how Kibana displays them as well. That may cause issues, though, because there wouldn't be a visible difference between field names with dots in them and fields of nested objects.

I ran into the problem with defining the data structure up front pretty quickly while testing this. It would most likely have to be defined up front for aggregation. I saw a proposal on IRC (I believe from @gimmic) to have a way to define the Elasticsearch index templates from the web app. That may help to make it easier to define the data structure for the indices up front so that the log messages are ingested correctly.

I guess we'd also have to come up with a syntax for doing advanced queries of nested objects.(Elasticsearch/Lucene doesn't even have that (cf. elastic/elasticsearch#11322)

Thanks for opening it up for discussion!

@gimmic
Copy link

gimmic commented Jun 4, 2019

I do think it is strange having graylog rename fields for backwards compatibility which is no longer required. The field rename has confused me with parsing rules now more than once until I look at the raw input/field value(because of the way the graylog web interface shows the renamed value).

As backwards compatibility is dropped, it would make sense to sunset this transparent-renaming feature as well, maybe invert it with a flag, or provide a checkbox in the UI to enable/disable this feature.

I also obviously support editing of templates via the web UI but that's an entire new feature request.

This would also allow for use of the actual Elastic Common Schema (ECS) which uses nested field names as well via periods.

@kroepke
Copy link
Member

kroepke commented Jun 6, 2019

Sorry, this turned into an essay :/

Indeed, ECS support (or something similar to it) is one of the motivations for supporting them.
Let me explain the rationale (this will get philosophical quickly, please bear with me):
Initially Graylog didn't care what you sent, as long as it could be mapped to what GELF said, this predates our use of elasticsearch even (so pre 2012 IIRC).
Now, Graylog never supported you in querying or displaying those fields well.
When in the Elasticsearch 2.x time frame dots where disallowed and allowed again, we had to make an impossible choice: either mangle the field names or drop the data entirely. Neither is a good option, so we chose the former.

This was bad then and is bad now, IMHO, because it needs to be done just prior to sending it Elasticsearch due to backwards compatibility, however no one ever came up with a better way of doing it, so we never changed it again.

Now the philosophy part:
What do people even want to accomplish with (nested) objects in the context of log management?
AFAICS there are two things:

  1. Several fields are so closely related that they are (almost) always used or manipulated together, such as geo information, ip/port, results of certain lookups like whois results
  2. There's an unknown or repeating part in the event/message that one does not want to see as a top level field (or even worse a series of fields suffixed with an index), because since it's unknown it really is only useful for displaying it.

Nested objects as in Elasticsearch's mapping type is rarely what people really want, at least not in logging, as it has to create one document per array index and comes with a few restrictions.

That being said, scenarios where I've heard requests for nested field support are ECS support, generic JSON documents and messages that have fields containing a list of things, such as a stack trace.

ECS support falls into scenario 1 from above, it provides a logical grouping that enables treating certain attributes as one, which totally makes sense from a processing and display point of view.

Generic JSON support is something that sounds great on paper, but then it's rare that people can actually articulate how they intend to make sense of it. Querying is difficult, aggregations are hopelessly complex to express (do you need JSONPath to extract something, how about arrays, objects, what about data types like IP addresses that are represented as simple strings etc). The same goes for XML messages.

Things like stack traces are interesting, because they are a combination of "contains interesting information" and "I just need to display it properly for analysis".
It's probably the closest in spirit to log management use cases.

Now, where does this leave us?
We know that something like ECS support is a sensible request and could benefit us a lot.
Also various beats send some kind of nested data, that follow the same rationale (rtt fields in heartbeat for example).
Generic JSON/XML document support sounds nice, but in reality it's sort of an edge case, because not all searches that seem natural actually make sense (and don't really do what people think they do).
Lastly stacktraces that are delivered as a list of lines really are a string and should probably be treated as such. In general anything that relies on a specific order in an array is pretty much an anti-pattern.
Essentially in the last two cases what you really want is to store verbatim what you received and do some kind of preprocessing to extract the valuable data that you might be searching on. For stacktraces that's likely the exception that was the root cause and maybe its line information. Certainly not the entire stack.

Hope that sheds some light on what we are thinking about this.

@kroepke
Copy link
Member

kroepke commented Jun 6, 2019

I do think it is strange having graylog rename fields for backwards compatibility which is no longer required. The field rename has confused me with parsing rules now more than once until I look at the raw input/field value(because of the way the graylog web interface shows the renamed value).

The backwards compatibility concern really is that we cannot silently change behavior in running setups.
I fully understand that for new users this seems silly, but that's the nature of supporting decisions from the past.
Please note that I'm not trying to argue that we don't want to allow fields like these, but we need to make sure that we don't screw up existing users, which is very important to us.

@gimmic
Copy link

gimmic commented Jun 6, 2019

Yeah, I get it. At some point you will simply have to introduce a breaking change. Thankfully, this is a fairly minor breaking change compared to some of the stunts Elastic have pulled over the years. I think given sufficient notification this should be something the community could generally fix or otherwise have an option to toggle. The old functionality could be emulated on upgraded instances.

@iinuwa
Copy link
Author

iinuwa commented Jun 7, 2019

@kroepke, thanks for taking the time to explain your thoughts.

My eventual goal is to be able to send messages using ECS as well. I would be fine with just having dotted field names for now until we can get actual nested objects (fake it 'til you make it.)

For this PR, would it be enough to add an option to rename . to _ in field names and set it to be on by default? That would preserve the current functionality while allowing new users to take the new functionality. Maybe later on, it could be marked as deprecated, and then removed if/when we have a clear way to migrate to use the new functionality. (I can also remove the commit for parsing array values as keywords—didn't mean to include that here.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants