Drop events after the spooler is blocked for a certain time #3091

Closed
wants to merge 1 commit into from

Conversation

@ruflin commented Dec 1, 2016

In case Logstash or Elasticsearch is not available, the filebeat output gets stuck. This has the side effect that the `close_*` options no longer apply, so files are kept open. Under Linux, file rotation continues but the disk space is still used. Under Windows, this can mean that file rotation does not work and a single file gets very large.

The goal of this change is to keep the client node sane in case the output is blocked. Filebeat follows the at-least-once principle and tries to make sure all log lines are delivered whenever possible. Using `drop_after` breaks this principle and will lead to data loss!

When setting `drop_after` to 5 minutes, for example, events start to be dropped after the output has been unavailable for 5 minutes. Filebeat then drops all events until the output is available again. As soon as the output becomes available again, the last batch in the publisher is sent and filebeat continues sending all new events that arrive in the log files from that point on. All events between the batch that was still in the publisher and the first event that is sent again are lost.

The registry file is not updated while events are dropped. So in case filebeat is stopped and started again, it continues at the position it was at before the output became unavailable. As soon as the output becomes available again, the registry file is updated. But for files where no new events appear, the old position stays in the registry.

Dropping all events has the advantage that when the publisher starts sending events again, it will not overwhelm the receiver with all the queued events.

This implementation differs from turning guaranteed off or using UDP in that events only start to be dropped after a certain time. If the output is only blocked for less than `drop_after`, all events are still sent by filebeat and no events are dropped. `drop_after` is only an "emergency switch" in case the output is unavailable for a longer time.
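
To make the intended behaviour concrete, here is a minimal, self-contained Go sketch of the idea. The `Spooler` type, its field names, and the one-second retry tick are hypothetical and only for illustration; this is not the actual spooler code.

package main

import (
	"fmt"
	"time"
)

type Event struct{ Line string }

// Spooler is a simplified stand-in for filebeat's spooler, used only to
// illustrate the proposed drop_after behaviour.
type Spooler struct {
	output       chan []Event  // channel towards the publisher
	dropAfter    time.Duration // the proposed drop_after setting
	blockedSince time.Time     // zero value means "output not blocked"
}

// flush tries to hand a batch to the publisher. Once the output has been
// blocked for longer than dropAfter, batches are dropped instead.
func (s *Spooler) flush(batch []Event) bool {
	for {
		// Emergency switch: drop the batch, at-least-once is intentionally broken.
		if !s.blockedSince.IsZero() && time.Since(s.blockedSince) > s.dropAfter {
			return false
		}
		select {
		case s.output <- batch:
			s.blockedSince = time.Time{} // output accepted events again
			return true
		case <-time.After(time.Second):
			if s.blockedSince.IsZero() {
				s.blockedSince = time.Now() // remember when the output became blocked
			}
		}
	}
}

func main() {
	// dropAfter is kept short here so the example returns quickly;
	// the description above uses 5 minutes as the example value.
	s := &Spooler{output: make(chan []Event), dropAfter: 2 * time.Second}
	// No reader on output, so the batch is dropped shortly after dropAfter elapses.
	fmt.Println("batch sent:", s.flush([]Event{{Line: "example"}}))
}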

Alternative implementations

  1. Drop only events older than a certain age

An alternative implementation would be to only drop events whose timestamp is older than the predefined period. The advantage of this would be that not necessarily all events are dropped until the output becomes available again, but only the oldest ones. This implementation is a little more complex and I don't think it is needed.

  2. Drop events in the publisher

Instead of dropping events in the spooler, events could be dropped in the publisher. The advantage of doing it in the publisher is that the registry file would also be updated, so no resending of events would happen on restart.

Questions

  • Should the registry also be updated so no resending happens? -> publisher implementation needed

TODO:

  • Add tests
  • Add stats for events which were dropped
  • Add docs

@ruflin added the discuss (Issue needs further discussion.) and Filebeat labels Dec 1, 2016
@ruflin commented Dec 7, 2016

Closing as @urso is working on a potentially better / cleaner solution.

@ruflin closed this Dec 7, 2016
@justinyeh3112

Hi. Any update on solutions for this? Is there another ticket open tracking this?

@ruflin commented Feb 20, 2017

@justinyeh3112 This problem was solved here: #3511. It's already available in the nightly builds. If you test it, it would be great to get some feedback on it.

@justinyeh3112

great!

@justinyeh3112

Hey @ruflin, so I gave this a shot and it doesn't seem to be working how I expect. Here's what I did:

  • Configure filebeat to start tracking a file with close_timeout: 1m
  • Start writing to that file
  • Verified I saw the contents make it to logstash
  • Take down logstash
  • Stop writing to the file
  • Delete the file
    RESULT: A lsof -ap <pid> | grep testfile.log shows filebeat still hanging on to the file after the close timeout elapses.

@ruflin deleted the spooler-drop-events branch February 23, 2017 07:08
@ruflin commented Feb 23, 2017

@justinyeh3112 Thanks for testing.

  • Which build did you test?
  • What is your OS?
  • Can you share your full config?
  • Could you run filebeat with -e -d "*" and share the output?

I tried to reproduce this but couldn't so far. The last step "delete the file" should not be needed.

@justinyeh3112

got this working, thanks!

@ruflin commented Feb 23, 2017

@justinyeh3112 What caused the problem initially?

@justinyeh3112 commented Feb 28, 2017

Hey, I should probably loop back here. I guess I'm seeing it sometimes work and sometimes not work. Here are some answers to your questions:

  • Commit: 820310b (I grabbed the head of 5.3.0)
  • Centos 6.8
filebeat:
  config.prospectors:
    path: <path-to-live-reload>/filebeat-config/*.yml
    reload:
      enabled: true
      period: 10s
  config_dir: <dir>
  registry_file: <registry>
logging:
  level: error
  to_files: true
  to_stderr: false
output:
  logstash:
    hosts:
    - <host>
    ssl:
      certificate: <cert>
      certificate_authorities:
      - <ca>
      key: <key>
    timeout: 15
path:
  home: <path>

@justinyeh3112 commented Feb 28, 2017

here's the prospector config:

- paths:
  - <pathtolog>
  ignore_older: 24h
  close_timeout: 10s  # for testing
  fields:
    <some fields>
  fields_under_root: true
  multiline:
    pattern: <multiline>

@justinyeh3112 commented Feb 28, 2017

One slight amendment to the repro steps: I actually did not do the "stop writing to the file" step, since it seems there would be a race condition with beats reaching EOF and closing the file before I could delete it. Here are the full steps:

  • Write continuously to test.log file
  • Configure filebeat to start tracking said file with close_timeout: 10s
  • Verified I saw the contents make it to logstash
  • Take down logstash
  • Delete the file
    RESULT: A lsof -ap <pid> | grep test.log shows filebeat still hanging on to the file after the close timeout elapses.

<process> 59466 <user> 16r REG 253,0 27301 218552883 /<path>/test.log (deleted)

@ruflin commented Feb 28, 2017

@justinyeh3112 So you don't see it happening every time, but only in some cases? What's interesting is that you are using it in combination with prospector reloading. Are you changing the config file during the time you see the error (I'm wondering if it could be related to prospector shutdown)?

@justinyeh3112 commented Feb 28, 2017

Hmm, no, the configuration isn't changing. Of note, we are running filebeat in-process by vendoring the code. I want to try now with just the filebeat binary to make sure it's not something we're doing.

@justinyeh3112 commented Feb 28, 2017

Interestingly, it seems that when it does work, the file is actually released right when I delete it, as opposed to later on when I would expect the close_timeout to kick in.

setting close_timeout to 30s
running "lsof -ap 129193 | grep test.log"
RESULT: beatwrapper 129193 <user>   51r   REG   253,0     11248 213561540 /<path>/test.log
Stopping service 'logstash'
<sleep>
removing test.log
running "lsof -ap 129193 | grep test.log"
RESULT:
sleeping 30s
running "lsof -ap 129193 | grep test.log"
RESULT:

@justinyeh3112

here is the same test output where it fails:

setting close_timeout to 30s
running "lsof -ap 84624 | grep test.log"
RESULT: beatwrapper 84624 <user>   20r   REG   253,0     3520 214894923 /<path>/test.log
Stopping service 'logstash'
<sleep>
removing test.log
running "lsof -ap 84624 | grep test.log"
RESULT: beatwrapper 84624 <user>   20r   REG   253,0     3650 214894923 /<path>/test.log (deleted)
sleeping 30s
running "lsof -ap 84624 | grep test.log"
RESULT: beatwrapper 84624 <user>   20r   REG   253,0     4040 214894923 /<path>/test.log (deleted)

@justinyeh3112 commented Feb 28, 2017

FYI, I was able to repro this using the latest nightly build of filebeat.

waiting for filebeat to be tracking test.log
filebeat pid: 66544
-- RUNNING TEST --
running "lsof -ap 66544 | grep test.log"
RESULT: filebeat 66544 <user>   10r   REG  253,0     8851 214647331 /<path>/test.log
Stopping service 'logstash'
waiting 10s
running "lsof -ap 66544 | grep test.log"
RESULT: filebeat 66544 <user>   10r   REG  253,0     9112 214647331 /<path>/test.log
removing test.log
running "lsof -ap 66544 | grep test.log"
RESULT: filebeat 66544 <user>   10r   REG  253,0     9112 214647331 /<path>/test.log (deleted)
sleeping 30s
running "lsof -ap 66544 | grep test.log"
RESULT: filebeat 66544 <user>   10r   REG  253,0    14413 214647331 /<path>/test.log (deleted)

@justinyeh3112

hi, we think this line is the problem:

https://github.com/elastic/beats/blob/master/filebeat/harvester/log.go#L148

This is a synchronous call and has no way of receiving signals to give up. If a harvester hits the line where it tries to send an event onto the publisher channel while the publisher is blocked, this function call will never return and will block the harvester forever. The file being harvested will never be properly closed even if the close_timeout and close_inactive durations have already been exceeded.

On the publisher, https://github.com/elastic/beats/blob/master/libbeat/outputs/mode/single/single.go#L166 seems to indicate that for filebeat it will never give up.
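
To illustrate the point with standalone Go (made-up names, not the actual harvester code): an unconditional channel send cannot be interrupted, while a `select` on an extra done channel gives the sender a way to bail out.

package main

import (
	"fmt"
	"time"
)

type Event struct{ Line string }

// blockingSend mirrors the problematic pattern: if nothing ever reads from ch,
// this call never returns and the calling goroutine is stuck forever.
func blockingSend(ch chan<- Event, e Event) {
	ch <- e
}

// cancellableSend gives up as soon as done is closed, so a caller that
// honours close_timeout could stop waiting on a blocked publisher.
func cancellableSend(ch chan<- Event, e Event, done <-chan struct{}) bool {
	select {
	case ch <- e:
		return true
	case <-done:
		return false
	}
}

func main() {
	ch := make(chan Event) // no reader: simulates a blocked publisher
	done := make(chan struct{})

	go func() { // pretend close_timeout fires after one second
		time.Sleep(time.Second)
		close(done)
	}()

	fmt.Println("sent:", cancellableSend(ch, Event{Line: "hello"}, done)) // prints "sent: false"
}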

Question: what is the reasoning behind not allowing the guaranteed flag to be set for filebeat?

@ruflin commented Mar 1, 2017

In sendEvent the OnEventSignal should return if close_timeout is reached: https://github.com/elastic/beats/blob/master/filebeat/harvester/log.go#L172. Be aware that this only works with close_timeout and not close_inactive. Another important note is that the file handler should be closed, but the harvester will not finish. It will be blocked here (on purpose): https://github.com/elastic/beats/blob/master/filebeat/harvester/log.go#L183
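
A rough sketch of that behaviour in standalone Go (hypothetical names, not the actual harvester code): a close_timeout timer releases the file handle even though the goroutine blocked on the send never finishes.

package main

import (
	"log"
	"os"
	"time"
)

type Event struct{ Line string }

// harvest opens the file and then blocks trying to publish. A timer closes
// the file handle after closeTimeout, so the descriptor is released even
// though this goroutine itself stays blocked on the send.
func harvest(path string, out chan<- Event, closeTimeout time.Duration) {
	f, err := os.Open(path)
	if err != nil {
		log.Println("open:", err)
		return
	}

	time.AfterFunc(closeTimeout, func() {
		f.Close() // close_timeout: give the file handle back to the OS
	})

	out <- Event{Line: "first line"} // blocks forever while the publisher is stuck
}

func main() {
	out := make(chan Event) // no reader: simulates a blocked output
	go harvest(os.DevNull, out, 2*time.Second)
	time.Sleep(3 * time.Second) // by now the handle is closed, the goroutine is still blocked
}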

Can you try to reproduce the above without using the reloading feature, using just the normal prospector definitions? It should not make a difference, but just to make sure.

For the guaranteed flag: We always try to keep the "at-least-once" principle for logs and so far there was no request for it to be changed.

I will also do some additional testing on my side to try to reproduce the above.

@grantwu commented Mar 2, 2017

@ruflin Hey, I'm working with @justinyeh3112, and after digging a bit more I found that you're right. The timeout definitely works there. I narrowed it down a little further. This line in particular gets hairy when the publisher channel gets blocked:

https://github.com/elastic/beats/blob/master/filebeat/prospector/prospector.go#L318

If a new file is opened by the harvester in h.Setup() and the publisher is spinning trying to fulfill its guaranteed delivery with another event already sitting in the channel, then this call to send the event through the channel blocks indefinitely. At least that's what I've found. Let me know if you need more info.
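
A condensed, standalone sketch of that sequence (hypothetical names, not the actual prospector/harvester code): the file is opened first, and only afterwards is the initial state pushed through the channel, so a blocked publisher leaves the freshly opened handle dangling.

package main

import (
	"log"
	"os"
)

type State struct {
	Source string
	Offset int64
}

type Harvester struct {
	path string
	file *os.File
}

// Setup opens the file; this is where the handle is acquired.
func (h *Harvester) Setup() error {
	f, err := os.Open(h.path)
	if err != nil {
		return err
	}
	h.file = f
	return nil
}

func main() {
	stateCh := make(chan State) // no reader: simulates a blocked publisher

	h := &Harvester{path: os.DevNull}
	if err := h.Setup(); err != nil { // the file handle is now open
		log.Fatal(err)
	}

	// The synchronous initial state update blocks while the publisher is
	// stuck, so the harvester never starts and close_timeout never applies.
	// (Run standalone, the Go runtime reports this as a deadlock.)
	stateCh <- State{Source: h.path}
}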

@ruflin commented Mar 3, 2017

@grantwu Good find. I think you are right, and this can definitely lead to the problem that the file was opened but, as the harvester was never started, it is never closed. The state update here was introduced to do the first state update of a harvester synchronously, to prevent two harvesters for the same file from being started. It's a bit of a chicken-and-egg problem here.

I need to dig deeper into this to think through how to solve this issue without introducing other issues.

Thanks a lot for digging deeper here.

@ruflin commented Mar 3, 2017

@justinyeh3112 @grantwu A first attempt to fix it: #3715

ruflin added a commit to ruflin/beats that referenced this pull request Mar 3, 2017
If close_timeout is set, a file handler inside the harvester is closed after `close_timeout` is reached, even if the output is blocked. For new files which were found for harvesting while the output was blocked, it was possible that the Setup of the file happened but the initial state could not be sent. So the harvester was never started, the file handler was open, but `close_timeout` did not apply. Setup and state update are now turned around. This has the effect that, in case of an error during the setup phase, the state must be reverted again.

See elastic#3091 (comment)
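
A rough, illustrative sketch of the reordering the commit message describes (hypothetical names, standalone Go, not the actual patch): publish the initial state before opening the file, and revert that state again if Setup then fails.

package main

import (
	"fmt"
	"os"
)

type State struct {
	Source   string
	Offset   int64
	Finished bool
}

type Harvester struct {
	path  string
	file  *os.File
	state State
}

// Setup opens the file; it may fail, e.g. if the file disappeared meanwhile.
func (h *Harvester) Setup() error {
	f, err := os.Open(h.path)
	if err != nil {
		return err
	}
	h.file = f
	return nil
}

// start sends the initial state before Setup opens the file. If the publisher
// is blocked we still wait, but without holding a file handle. If Setup then
// fails, the previously sent state is reverted (marked finished) again.
func start(h *Harvester, states chan<- State) error {
	states <- h.state // may block, but no file is open yet

	if err := h.Setup(); err != nil {
		h.state.Finished = true
		states <- h.state // revert: mark this source as done again
		return fmt.Errorf("setup failed: %w", err)
	}

	// In the real code path the harvester loop would be started here.
	return nil
}

func main() {
	states := make(chan State, 2) // buffered so the example runs standalone
	h := &Harvester{path: os.DevNull, state: State{Source: os.DevNull}}
	fmt.Println("started:", start(h, states) == nil)
}
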
urso pushed a commit that referenced this pull request Mar 3, 2017
If close_timeout is set, a file handler inside the harvester is closed after `close_timeout` is reached, even if the output is blocked. For new files which were found for harvesting while the output was blocked, it was possible that the Setup of the file happened but the initial state could not be sent. So the harvester was never started, the file handler was open, but `close_timeout` did not apply. Setup and state update are now turned around. This has the effect that, in case of an error during the setup phase, the state must be reverted again.

See #3091 (comment)
ruflin added a commit to ruflin/beats that referenced this pull request Mar 3, 2017
If close_timeout is set, a file handler inside the harvester is closed after `close_timeout` is reached, even if the output is blocked. For new files which were found for harvesting while the output was blocked, it was possible that the Setup of the file happened but the initial state could not be sent. So the harvester was never started, the file handler was open, but `close_timeout` did not apply. Setup and state update are now turned around. This has the effect that, in case of an error during the setup phase, the state must be reverted again.

See elastic#3091 (comment)
(cherry picked from commit e13a7f2)
@ruflin commented Mar 3, 2017

@justinyeh3112 @grantwu #3715 was just merged into master and will be backported to the 5.3 branch. Any chance you could test master to see if you still hit the issue?

urso pushed a commit that referenced this pull request Mar 3, 2017
If close_timeout is set, a file handler inside the harvester is closed after `close_timeout` is reached, even if the output is blocked. For new files which were found for harvesting while the output was blocked, it was possible that the Setup of the file happened but the initial state could not be sent. So the harvester was never started, the file handler was open, but `close_timeout` did not apply. Setup and state update are now turned around. This has the effect that, in case of an error during the setup phase, the state must be reverted again.

See #3091 (comment)
(cherry picked from commit e13a7f2)
@justinyeh3112

@ruflin this did not seem to address the problem for me. Here is roughly the test script I'm using, with filebeat configured with a 30s close_timeout:

echo "running \"lsof -ap $pid | grep test.log\""
output=$(lsof -ap $pid  | grep test.log)
echo "RESULT: $output"
# <command to stop your logstash server>
echo "waiting 10s"
sleep 10s
echo "running \"lsof -ap $pid | grep test.log\""
output=$(lsof -ap $pid  | grep test.log)
echo "RESULT: $output"
echo "removing test.log"
rm test.log
sleep 5s
echo "running \"lsof -ap $pid | grep test.log\""
output=$(lsof -ap $pid  | grep test.log)
echo "RESULT: $output"
echo "sleeping 30s"
sleep 30s
echo "running \"lsof -ap $pid | grep test.log\""
output=$(lsof -ap $pid  | grep test.log)
echo "RESULT: $output"

and the output

-- RUNNING TEST --
running "lsof -ap 72888 | grep test.log"
RESULT: filebeat 72888 jyeh   42r   REG   253,0     11500 214831017 /opt/testing/test.log
Stopping service 'logstash'
waiting 10s
running "lsof -ap 72888 | grep test.log"
RESULT: filebeat 72888 jyeh   42r   REG   253,0     16397 214831017 /opt/testing/test.log
removing test.log
running "lsof -ap 72888 | grep test.log"
RESULT: filebeat 72888 jyeh   42r   REG   253,0     16527 214831017 /opt/testing/test.log (deleted)
sleeping 30s
running "lsof -ap 72888 | grep test.log"
RESULT: filebeat 72888 jyeh   42r   REG   253,0     21108 214831017 /opt/testing/test.log (deleted)
-- DONE --

@ruflin commented Mar 6, 2017

@justinyeh3112 :-( I will be busy the next two weeks with Elastic{ON}, so I will not have too much time to look into it. But I will definitely keep it on my list. Can you open a new issue for it so this becomes more visible to others?
