[Log input] Forever growing registry file with kubernetes autodiscovery #13140

Closed
marqc opened this issue Aug 1, 2019 · 30 comments · May be fixed by #41747
Labels: bug, containers, Filebeat, Stalled, Team:Elastic-Agent, Team:Elastic-Agent-Data-Plane

Comments

@marqc
Contributor

marqc commented Aug 1, 2019

When using the kubernetes autodiscover provider, the registry file tends to grow over time, leaving a lot of entries with TTL=-2. These entries are never removed from the registry. For example:

sample config:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      cleanup_timeout: 5m
      hints.enabled: true
      templates.config:
        - type: container
          paths:
            - "/var/lib/docker/containers/${data.kubernetes.container.id}/*-json.log"
          scan_frequency: 3s
          max_bytes: 1000000
          clean_removed: true
cat data.json | jq -r .[].ttl | sort | uniq -c
    660 -1
   2957 -2

When pods are stopped, their inputs are stopped/disabled and the states are marked with TTL=-2. The log files often get removed from disk only some time after that (for example, jobs from a CronJob can keep stopped docker containers around for a long time), so with no active input tracking the file, the state is never updated and never removed from the registry.

For a state to be removed from the registry, the "states.Update" method must be called on it, but with an autodiscover path pattern containing the container id, no input will ever keep track of those files again and trigger their removal from the registry.

I think that kubernetes autodiscovery should always remove the state from the registry when the final cleanup_timeout "stop" event is sent, because kubernetes will never re-run the same already stopped container (it always creates a new one).
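
For reference, a quick way to see what those dead entries still point at (this is only a sketch and assumes the same data.json layout as the jq output above, i.e. a plain JSON array of states where each entry carries its path in a "source" field):

# list the paths still referenced by dead (ttl == -2) states
jq -r '.[] | select(.ttl == -2) | .source' data.json | head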

jsoriano added the Team:Integrations, bug, containers, and libbeat labels on Aug 8, 2019
@jsoriano
Member

Hi @marqc, thanks for the report, we are investigating the issue.

In the meantime, I think that your configuration is not doing what you expect. There are two ways of configuring filebeat autodiscover, one with hints and another one with templates; mixing them is possible, but it can lead to some unexpected behaviours. In your case you are enabling hints-based configuration with hints.enabled: true, and you are also trying to define a template.
In this case I think that the template is ignored; it should be defined as:

      templates:
        - config:
            type: container
            paths:
              - "/var/lib/docker/containers/${data.kubernetes.container.id}/*-json.log"
            scan_frequency: 3s
            max_bytes: 1000000
            clean_removed: true

BUT, this configuration will apply to all containers, as well as hints-based autodiscover, so you would have the default configuration of hints-based autodiscover, and this template working at the same time for any container.

If you want to override some options (like scan_frequency, max_bytes...) while using hints-based autodiscover, you can do it by overriding the default settings with hints.default_config. Something like this:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      cleanup_timeout: 5m
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - "/var/lib/docker/containers/${data.kubernetes.container.id}/*-json.log"
        scan_frequency: 3s
        max_bytes: 1000000
        clean_removed: true

@marqc
Contributor Author

marqc commented Aug 13, 2019

@jsoriano thanks, I have already done that and overriding attributes works as expected. The original issue is not affected by this change: it still leaves entries in the registry if the log file is not deleted from disk within 5 minutes after the container is stopped (crashed, pod evicted, job finished).

jsoriano removed the Team:Integrations and [zube]: Investigate labels on Aug 13, 2019
jsoriano assigned urso and unassigned jsoriano on Aug 13, 2019
@jsoriano
Member

@marqc we can confirm that there is some issue cleaning the state of files that are not owned by any input. There is an ongoing effort to refactor the filebeat registry that will probably help here.

In the meantime, the only solution would be to stop filebeat and clean up the registry file with some script.
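
For example, a minimal sketch of such a cleanup (assuming the data.json registry format shown above, a plain JSON array of states with a ttl field; stop filebeat and back up the file before touching it):

# keep only states that are not marked for removal (ttl != -2)
jq 'map(select(.ttl != -2))' data.json > data.json.tmp && mv data.json.tmp data.json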

jsoriano added the Filebeat label and removed the libbeat label on Aug 13, 2019
@silenceper

Is there any newer solution for cleaning up the stale state in the registry?

@boernd
Contributor

boernd commented Nov 5, 2020

FYI, leaking registry entries (in our case with > 15k registry entries) also caused filebeat to stall and rarely send any events. After cleanup the performance was ok again (version 7.9.3).

@trnl

trnl commented Nov 6, 2020

We suffer from the same issue as well.

Log delivery based on filebeat is not really stable.

@jsoriano
Member

jsoriano commented Nov 6, 2020

@trnl is this also happening to you with 7.9?

@trnl @boernd approximately, how many files are you collecting at a given moment?

@trnl

trnl commented Nov 9, 2020

@jsoriano 7.9.2

We have entries in the registry from May 2020, even though the containers and related folders have been gone from the system for quite a long time.

@boernd
Contributor

boernd commented Nov 9, 2020

@trnl is this also happening to you with 7.9?

@trnl @boernd approximately, how many files are you collecting at a given moment?

@jsoriano Hard to tell, Kibana tells me ~2k unique log.file.path for the last couple of minutes. We have 212 pods running atm, so roughly 10 * 5 (docker logs including the rotated ones) logs per node.

The following screen shows the registry growing averaged per filebeat:
[graph: registry size, averaged per filebeat]

The drop in the graph is where I did a manual cleanup of some pods.

@hukaixuan

Any updates on this issue? We are hitting the same issue here:

We have about 200 containers per k8s node and use filebeat to collect their logs, but the registry file gets really big (up to 20M, ~50k lines), which makes the performance of filebeat unstable.
[screenshot: registry file size]

filebeat performance:
[screenshot]

filebeat version: 7.11.2
configuration of input part:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      host: ${NODE_NAME}
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/log/containers/*${data.kubernetes.container.id}.log

filebeat.registry.flush: 10s

By the way, could anyone explain why the registry file size affects the performance of filebeat so much?

jsoriano added the Team:Elastic-Agent label on Apr 26, 2021
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@hukaixuan

hukaixuan commented Apr 27, 2021

After reading the registry code, I found the reason why the registry file size affects the performance of filebeat so much:
the update of the in-memory states and the write to the registry file happen in the same select block, so they cannot execute in parallel.
But it looks like the registry write method commitStateUpdates is safe to run in parallel with r.onEvents(states) (since gcStates locks the states, and the following operations work on a copy of the states).
So I moved commitStateUpdates to an independent goroutine.
[screenshots of the change]
And the performance of filebeat looks better and more stable:
[screenshot]

I want to know whether it is all right to make this change or whether it could cause some problems?

@alexandervasylev

Did someone find a solution without changing the source code? We are facing the same problem with Filebeat in a Kubernetes cluster.

@exekias
Contributor

exekias commented Aug 13, 2021

We have been working on a new input that may help solve this issue, as it is able to clean up registry entries that are no longer used. I've created an issue to test and validate the approach: elastic/integrations#1526

@srhb

srhb commented Jan 19, 2022

Any news on mitigations here?

ruflin added the Team:Elastic-Agent-Data-Plane label on Jan 19, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@stephan-erb-by

stephan-erb-by commented Apr 1, 2022

I think the question is whether we can get hints-based autodiscover working together with the new filestream input. Has anyone attempted this yet?

@faec
Contributor

faec commented Apr 4, 2022

It looks to me like @MichaelKatsoulis checked in the switch to the filestream input in elastic/integrations#2139, so this might already be done?

@stephan-erb-by

I think the integrations are not using or supporting hints.enabled, but I might be mistaken. So the new integrations would fix it for autodiscover, but not for all use cases supported by the old mechanism.

@fdartayre
Contributor

As the issue comes from the log input, the suggested workaround is to use a filestream input instead (GA since 7.14):

filebeat.autodiscover:
  providers:
    - type: kubernetes
      cleanup_timeout: 5m
      hints.enabled: true
      hints.default_config:
        type: filestream
        id: "my-id-${data.kubernetes.container.id}"
        paths:
          - "/var/lib/docker/containers/${data.kubernetes.container.id}/*-json.log"
        scan_frequency: 3s
        message_max_bytes: 1000000
        clean_removed: true
        parsers:
        - container: ~

Note: without a dynamic id (id: "my-id-${data.kubernetes.container.id}"), the provider would auto-generate an id, which could lead to duplicated data (#31239).

@stephan-erb-by

stephan-erb-by commented Jul 1, 2022

thanks @fdartayre!

We are heavy users of Kubernetes pod annotations to configure the log input, such as

  "co.elastic.logs.mycontainer/json.add_error_key": "true"
  "co.elastic.logs.mycontainer/json.keys_under_root": "true"
  "co.elastic.logs.mycontainer/json.message_key": "message"
  "co.elastic.logs.mycontainer/json.ignore_decoding_error": "true"
  "co.elastic.logs.mycontainer/json.expand_keys": "true"

or

  "co.elastic.logs.myothercontainer/multiline.type": "pattern"
  "co.elastic.logs.myothercontainer/multiline.pattern": "^[[:space:]]"
  "co.elastic.logs.myothercontainer/multiline.negate": "false"
  "co.elastic.logs.myothercontainer/multiline.match": "after"

To my knowledge this will not work correctly with the new filestream input. Or should this still work?

jlind23 changed the title from "Forever growing registry file with kubernetes autodiscovery" to "[Log input] Forever growing registry file with kubernetes autodiscovery" on Jul 8, 2022
@Iatbzh

Iatbzh commented Jul 26, 2022

Has this problem been solved? I still have this problem in filebeat 7.9.2

@jsoriano
Member

Has this problem been solved? I still have this problem in filebeat 7.9.2

Have you tried to upgrade to a more recent version? As mentioned in #13140 (comment) you may try to use the filestream input to mitigate this issue.

@Iatbzh

Iatbzh commented Jul 26, 2022

Has this problem been solved? I still have this problem in filebeat 7.9.2

Have you tried upgrading to a more recent version? As mentioned in #13140 (comment), you can try using the filestream input to mitigate this issue.

Thank you for resolving

@asazallesmilner

Is this validated and functional config for running filestream with autodiscover and hints? #13140 (comment)

filebeat.autodiscover:
  providers:
    - type: kubernetes
      cleanup_timeout: 5m
      hints.enabled: true
      hints.default_config:
        type: filestream
        id: "my-id-${data.kubernetes.container.id}"
        paths:
          - "/var/lib/docker/containers/${data.kubernetes.container.id}/*-json.log"
        scan_frequency: 3s
        message_max_bytes: 1000000
        clean_removed: true
        parsers:
        - container: ~

@eedugon
Contributor

eedugon commented Jan 2, 2023

@jsoriano, @fdartayre: why do we use and suggest the scan_frequency option, which belongs to the legacy log input, instead of the prospector.scanner.check_interval option, which is the one documented for the filestream input? Are both valid?

@asazallesmilner

Want to put a note here about what we found.
Filestream is currently INCOMPATIBLE with hints-based annotations. This means all of the hints-based annotations our users were using broke when we moved to Filestream, and we are having to roll back.

@bigpigeon

bigpigeon commented Mar 24, 2023

I fixed this issue with PR #34904.
Just set filebeat.yaml to something similar to the below:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      cleanup_timeout: 5m
      templates.config:
        - type: container
          paths:
            - "/var/lib/docker/containers/${data.kubernetes.container.id}/*-json.log"
          close_removed: true
          clean_removed: true

@botelastic

botelastic bot commented Mar 27, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

botelastic bot added the Stalled label on Mar 27, 2024
botelastic bot closed this as completed on Sep 23, 2024
@rsafonseca

Can this be re-opened? It is still an issue
