Drops event after spooler is blocked for a certain time #3091
Conversation
In case Logstash or Elasticsearch is not available, the filebeat output gets stuck. This has the side effect that the close_* options no longer apply. Because of this, files are kept open. Under Linux, file rotation continues but the disk space is still used. Under Windows, this can mean that file rotation does not work and a single file gets very large.

The goal of this change is to keep the client node sane in case the output is blocked. Filebeat follows the at-least-once principle and tries to make sure all log lines arrive by resending whenever possible. Using `drop_after` breaks this principle and will lead to data loss! When setting `drop_after` to 5 minutes, for example, events start to be dropped once the output has been unavailable for 5 minutes, and all events are dropped until the output is available again. As soon as the output becomes available again, the last batch in the publisher is sent, and filebeat continues sending all new events which arrive in the log files from then on. All events between the batch which was still in the publisher and the first event which is sent again are lost.

The registry file is not updated while events are dropped. So in case filebeat is stopped and started again, it continues at the position it was at before the output became unavailable. As soon as the output becomes available again, it updates the registry file. But for files where no new events appear, the old position stays in the registry.

Dropping all events has the advantage that when the publisher starts sending events again, it will not overwhelm the receiver with all the queued events.

This implementation is different from turning guaranteed off or using UDP in that events only start to be dropped after a certain time. If the output is only blocked for a time shorter than `drop_after`, all events will still be sent by filebeat and no events are dropped. `drop_after` is only an "emergency switch" in case the output is not available for a longer time.
Alternative implementations

1. Drop only events older than a given age: An alternative implementation could be to only drop events whose timestamp is older than the predefined period. The advantage of this would be that not necessarily all events are dropped until the output becomes available again, but only the oldest ones. This implementation is a little more complex and I don't think it is needed.
2. Drop events in the publisher: Instead of dropping events in the spooler, events could be dropped in the publisher. The advantage of having it in the publisher is that the registry file would also be updated, so no resending of events would happen on restart.

Questions

* Should the registry also be updated so no resending happens? -> publisher implementation needed

TODO:

* Add tests
* Add stats for events which were dropped
* Add docs
Closing as @urso is working on a potentially better / cleaner solution.
Hi. Any update on solutions for this? Is there another ticket open tracking this?
@justinyeh3112 This problem was solved here: #3511 It's already available in the nightly builds. In case you test it, it would be great to get some feedback on it.
great!
Hey @ruflin, so I gave this a shot and it doesn't seem to be working how I expect. Here's what I did:
@justinyeh3112 Thanks for testing.
I tried to reproduce this but couldn't so far. The last step "delete the file" should not be needed.
Got this working, thanks!
@justinyeh3112 What caused the problem initially?
Hey, should probably loop back here. So i guess i'm seeing it sometimes work and sometimes not work. Here are some answers to your questions:
Here's the prospector config:
one slight amendment to the repro steps. i actually did not do the "stop writing to the file", since it seems there would be a race condition with beats reaching EOF and closing the file before i could delete it. Here are the full steps:
@justinyeh3112 So you don't see it happening every time, but only in some cases? Interesting is that you are using it in combination with prospector reloading. Are you changing the config file during the time you see the error (wondering if it could be related to prospector shutdown)?
Hmm, no, the configuration isn't changing. Of note, we are running filebeat in process via vendoring the code. I want to try now with just the filebeat binary to make sure it's not something we're doing.
Interestingly, it seems when it does work the file is actually being released right when i delete it, as opposed to later on when i would expect the close_timeout to kick in.
here is the same test output where it fails:
FYI, I was able to repro this using the latest nightly build of filebeat.
Hi, we think this line is the problem: https://github.com/elastic/beats/blob/master/filebeat/harvester/log.go#L148 This is a synchronous call and has no way of receiving signals to give up. If a harvester hits the line where it tries to send an event onto the publisher channel while the publisher is blocked, this function call will never return and will block the harvester forever. The file being harvested will never be properly closed, even if the close_timeout and close_inactive durations have already been exceeded. On the publisher side, https://github.com/elastic/beats/blob/master/libbeat/outputs/mode/single/single.go#L166 seems to indicate that for filebeat it will never give up. Question: what is the reasoning behind not allowing the guaranteed flag to be set for filebeat?
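The usual Go pattern to make such a blocking send interruptible is a select that also watches a shutdown channel, so the harvester can give up (and close its file handle) even while the output is stuck. A minimal sketch, with illustrative names rather than the actual harvester code:

```go
package main

import (
	"fmt"
)

// sendEvent tries to push an event onto the publisher channel but also
// listens on a done channel, so shutdown can interrupt a send that would
// otherwise block forever while the publisher is stuck.
func sendEvent(event string, publisher chan<- string, done <-chan struct{}) bool {
	select {
	case publisher <- event:
		return true
	case <-done:
		return false // harvester is shutting down; give up on the send
	}
}

func main() {
	publisher := make(chan string) // no reader: simulates a blocked publisher
	done := make(chan struct{})
	close(done) // signal shutdown
	fmt.Println("sent:", sendEvent("line", publisher, done))
}
```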
Can you try to reproduce the above without using the reloading feature and just the normal prospector definitions? It should not make a difference, but just to make sure. For the guaranteed flag: we always try to keep the "at-least-once" principle for logs, and so far there was no request for it to be changed. I will also do some additional testing on my side to try to reproduce the above.
@ruflin Hey, I'm working with @justinyeh3112 and after digging a bit more I found that you're right. The timeout definitely works there. I narrowed it down a little further. This line in particular gets hairy when the publisher channel gets blocked: If a new file is opened by the harvester in h.Setup() and the publisher is spinning trying to fulfill its guaranteed delivery with another event already sitting in the channel, then this call to send the event through the channel blocks indefinitely. At least that's what I've found. Let me know if you need more info.
@grantwu Good find. I think you are right, and this can definitely lead to the problem that the file was opened but, as the harvester was never started, it is not closed anymore. The state update here was introduced to have the first state update of a harvester happen in a synchronous way, to prevent two harvesters for the same file from being started. It's a bit of a chicken-and-egg problem here. I need to dig deeper into this to think through how to solve this issue without introducing other issues. Thanks a lot for digging deeper here.
@justinyeh3112 @grantwu A first attempt to fix it: #3715 |
If close_timeout is set, a file handler inside the harvester is closed after `close_timeout` is reached, even if the output is blocked. For new files which were found for harvesting while the output was blocked, it was possible that the setup of the file happened but the initial state could not be sent. So the harvester was never started, the file handler stayed open, and `close_timeout` did not apply. Setup and state update are now turned around. This has the effect that, in case of an error during the setup phase, the state must be reverted again. See elastic#3091 (comment)
@justinyeh3112 @grantwu #3715 was just merged into master and will be backported to the 5.3 branch. Any chance you could test master to see if you still hit the issue?
If close_timeout is set, a file handler inside the harvester is closed after `close_timeout` is reached, even if the output is blocked. For new files which were found for harvesting while the output was blocked, it was possible that the setup of the file happened but the initial state could not be sent. So the harvester was never started, the file handler stayed open, and `close_timeout` did not apply. Setup and state update are now turned around. This has the effect that, in case of an error during the setup phase, the state must be reverted again. See #3091 (comment) (cherry picked from commit e13a7f2)
@ruflin This did not seem to address the problem for me. Here is roughly the test script I'm using, with filebeat configured with a 30s close_timeout:
and the output
@justinyeh3112 :-( I will be busy the next two weeks with Elastic{ON}, so will not have too much time to look into it. But I will definitely keep it on my list. Can you open a new issue for it so this becomes more visible to others?