Filebeat not resuming after EOF #878
Which filebeat version are you using? Can you share your filebeat configuration? Can you run filebeat in debug mode by running filebeat with …?
Yeah, sorry for the lack of information.

Filebeat version: 1.0.1

Configuration:

```yaml
################### Filebeat Configuration #########################
filebeat:
  prospectors:
    -
      #
      # HTTP apache log files
      #
      paths:
        - "/logs/httpd/201601/*httpd_access.log"
      input_type: log
      fields:
        syslog_format: true
        http_proto: "http"
      fields_under_root: true
      document_type: httpd
    -
      #
      # HTTPS apache log files
      #
      paths:
        - "/logs/httpd/201601/*httpd_ssl_access.log"
      input_type: log
      fields:
        syslog_format: true
        http_proto: "https"
      fields_under_root: true
      document_type: httpd
    -
      #
      # Dovecot log files
      #
      paths:
        - "/logs/mail/201601/*mail.log"
      input_type: log
      fields:
        syslog_format: true
      fields_under_root: true
      document_type: dovecot
    -
      #
      # Postfix log files
      #
      paths:
        - "/logs/mail/201601/*mail.log"
      input_type: log
      fields:
        syslog_format: true
      fields_under_root: true
      document_type: postfix
    -
      #
      # Radius log files
      #
      paths:
        - "/logs/radius/201601/*radius.log"
      input_type: log
      fields:
        syslog_format: true
      fields_under_root: true
      document_type: radiusd
    -
      #
      # dhcpd log files
      #
      paths:
        - "/logs/dhcpd/201601/*dhcpd.log"
      input_type: log
      fields:
        syslog_format: true
      fields_under_root: true
      document_type: dhcpd
  registry_file: /var/lib/filebeat/registry
  spool_size: 4096
  #config_dir:

output:
  logstash:
    hosts: [ "logstash2:5055", "logstash1:5055" ]
    worker: 2
    loadbalance: true
    tls:
      certificate_authorities: ["/etc/pki/tls/certs/filebeat.crt"]
      certificate: "/etc/pki/tls/certs/filebeat.crt"
      certificate_key: "/etc/pki/tls/private/filebeat.key"
      # Controls whether the client verifies server certificates and host name.
      # If insecure is set to true, all server host names and certificates will be
      # accepted. In this mode TLS based connections are susceptible to
      # man-in-the-middle attacks. Use only for testing.
      #insecure: true

  # Debug output
  file:
    path: "/tmp/filebeat"
    filename: filebeat
    rotate_every_kb: 10000
    number_of_files: 2

shipper:
  name: filebeat

logging:
  to_syslog: false
  to_files: true
  files:
    path: /var/log/beats
    name: filebeat.log
    rotateeverybytes: 10485760 # = 10MB
    keepfiles: 5
  # Other available selectors are beat, publish, service
  #selectors: [ ]
  # Available log levels are: critical, error, warning, info, debug
  level: info
```

I did run it in debug mode (level: debug in the config file), and what I see in the logs is a bunch of:
I can also run …

Matej
The window size becomes 1 if filebeat failed to send over a 'long' period of time. In this case we probe whether logstash is operative by sending only 1 event and waiting for an ACK, which obviously failed (0 of 200 events sent). Does this repeat, or will filebeat stop logging these messages at some point? There should be some info messages regarding EOF in between, right?
@matejzero there's an issue in 1.0.1: the window size does not grow anymore once it has reached a size of 1, potentially affecting throughput. Maybe you want to test the 1.1 build.
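The shrink-and-probe behavior described above can be modeled roughly like this (an illustrative Python sketch, not filebeat's actual Go code; the class and method names are made up for this example):

```python
class WindowModel:
    """Illustrative model of a send window that collapses on failure
    and (when working correctly) grows back on success."""

    def __init__(self, max_size=2048):
        self.max_size = max_size
        self.size = max_size

    def on_failure(self):
        # Collapse to a single probe event: send 1 event and wait for an
        # ACK to check whether logstash is responsive at all.
        self.size = 1

    def on_success(self):
        # Grow the window exponentially again, up to the maximum.
        # (The 1.0.1 bug: this growth step stopped happening once the
        # window had reached size 1, hurting throughput permanently.)
        self.size = min(self.size * 2, self.max_size)


w = WindowModel(max_size=8)
w.on_failure()       # logstash unresponsive -> probe with 1 event
assert w.size == 1
w.on_success()       # probe ACKed -> window may grow again
w.on_success()
assert w.size == 4
```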
This config generates 4 output workers (2 workers per logstash host). The certificate and certificate_key options are not required, as logstash does not yet support client authentication.
This logging continues until I'm forced to restart filebeat (sometimes after 1h, sometimes after 20h); it never stops. As for debug output and info messages about EOF, I usually noticed the problem too late and the log files had already rotated. I'll try to catch it next time. I tried restarting logstash on both nodes (in case the beats input plugin had crashed), but filebeat did not resume sending, even though logstash was receiving events from redis in the meantime. I will change the number of workers in case logstash really can't ingest all the data (although it should; I'm mostly sending somewhere between 200-400 events/s to two 8-core nodes). I will also test the 1.1 build.
It EOFed again today. It did not resume sending, even though both logstash nodes were working (they were receiving events via redis, and I was able to telnet to the beats port on both logstash nodes). I will upgrade to the 1.2 nightly build today and see how it goes.
Yeah, testing 1.2 might be helpful, as we've added some more debug output. The fact that the debug output never stops rules out the output being deadlocked; still no idea why the reconnects fail. Thanks for testing. When testing 1.2 you can start filebeat with …
I've updated to 1.2 and so far it has been working for 3 days without a problem. I will close the issue for now and reopen it if the problem reoccurs. Thanks for the help!
@matejzero Thanks. Let us know if you have any issues with 1.2.
Will do.
Hello, the problem occurred again today and yesterday.

$ filebeat -version

In the log files, all I see is:
I did run it with -httpprof and here is the debug output:
I can telnet to the logstash port without problems:
In logstash, I see the following errors around the time filebeat stopped sending logs:
According to grafana metrics, logs stopped flowing to logstash at around 13:16:00. The last error message in logstash logs is at 13:50:50:
Let me know if you need more information.
Thanks for the details. Very helpful, as I haven't been able to reproduce this problem myself yet. Skimming the trace, it looks like a deadlock in the filebeat publisher, caused by a number of (potentially) temporary send failures due to logstash queues being clogged. I assume filebeat stops writing this message: It would be interesting to take a second trace like 10 seconds later and do a diff, to confirm we're really dealing with a deadlock. A similar error happened about 3 hours before (11:15:45); did filebeat recover in that case? Do we have some logs?
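One way to compare two such traces (a sketch, assuming filebeat was started with -httpprof, which exposes Go's goroutine dump in the `?debug=2` text format): take two dumps about 10 seconds apart and intersect the stacks after dropping the `goroutine N [state]:` header lines, whose wait times change between snapshots. Goroutines parked on exactly the same stack in both dumps are deadlock suspects. All function names in the sample dumps below are hypothetical.

```python
def stacks(dump: str):
    """Split a Go goroutine dump (pprof ?debug=2 text format) into
    individual goroutine stack blocks, which are blank-line separated."""
    return [b.strip() for b in dump.split("\n\n") if b.strip()]

def strip_header(block: str) -> str:
    """Drop the 'goroutine N [state]:' header so stacks compare equal
    across dumps even when wait times ('2 minutes') differ."""
    return "\n".join(block.splitlines()[1:])

def stuck_candidates(dump_a: str, dump_b: str):
    """Stacks present verbatim in both dumps: goroutines that did not
    move between the two snapshots -- deadlock suspects."""
    a = {strip_header(b) for b in stacks(dump_a)}
    b = {strip_header(b) for b in stacks(dump_b)}
    return a & b

# Two tiny synthetic dumps, taken "10 seconds apart":
dump1 = """goroutine 7 [chan send, 12 minutes]:
main.publish(...)
\t/go/src/app/publish.go:42"""

dump2 = """goroutine 7 [chan send, 12 minutes]:
main.publish(...)
\t/go/src/app/publish.go:42

goroutine 9 [running]:
main.worker(...)
\t/go/src/app/worker.go:10"""

# goroutine 7 sits on the same stack in both dumps -> one suspect.
assert len(stuck_candidates(dump1, dump2)) == 1
```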
Hey there, filebeat never stops writing that message. Next time it hangs, I will take 2 traces so you can diff them. According to the logstash logs, filebeat recovered the first time. So yesterday's scenario was the following:
Hm... This is a sign that filebeat is not actually deadlocked, but failing to reconnect.
As you are running in debug mode (given you have the '*' or 'logstash' selector set), do you still see 'connect' and 'close connection' messages in your logs?
As far as selectors go, this is what I have in my config file:
I forgot to make a copy of the log files when I restarted filebeat, so I can't check whether I saw those messages. I also tried restarting logstash before restarting filebeat, just in case there was a problem with logstash's beats plugin, but even after that, filebeat did not reconnect. On the other hand, it always reconnects and starts sending data if I restart filebeat.
OK, thanks.
@matejzero Can you tell me the version of logstash you are using and the version of the beats input? I have seen concurrency errors in pre-2.1.2 versions of the plugin, but not causing the kind of deadlock you are currently seeing. The concurrency handling was revisited in 2.1.2 and it might solve your issue.
Also, what is your logstash configuration in that case?
Logstash version: 2.1.1. Beats plugin version:
I need some time to put the logstash configuration together, since it's rather big and I need to remove some entries... Will post it later.
@matejzero One more thing: can you take a thread dump when this situation happens?
A thread dump of logstash or of filebeat?
The thread dump of logstash.
@urso ^ this is weird though, no?
It happened again, and oddly, at the same time?! According to the logstash logs, these are the times when I see slowdown logs in logstash:
I can't send you the output of -httpprof, because I upgraded filebeat to 1.1.0 today and forgot to add the -httpprof parameter. I will update the init script. As far as jstack goes, I tried getting a thread dump, but I see the following error:
I tried with -F and got the following:
At the same time? That is odd. Nothing scheduled on the box?
Your config would still be useful here, to narrow down where in the code the problem is. Is that the complete output from jstack?
Did some more tests with the filebeat->LS->ES setup. I made the pipeline in LS clog up by killing ES for short periods every so often. After a while I was able to reproduce this:
No more INFO messages, with the window size shrinking. Checking with netstat, the filebeat->LS connections are still established, but no more data is transmitted. Interestingly (in my tests), new publish requests still seem to be sent to LS (at least no error is returned when writing to the socket), but no ACK is ever received until the timeout. Given the many restarts plus bad timing, in some cases the logstash output module can get into this weird state, which is supposed to be broken by:
Interestingly, restarting logstash did work for me (the connection was closed in filebeat due to the error code and new connections were created), but then I ran all my tests on localhost. I think I found a bug in case 2 but need to do some more testing. This one might explain why filebeat used to recover once, but not the second time. Follow the progress in PR #927. I started improving the unit tests and added a more aggressive connection-closing policy. The fix for the possible bug is not yet included, as I first want to try to reproduce it in a unit test.
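The "more aggressive connection closing" idea can be sketched like this (an illustrative Python model, not the actual Go code from PR #927; the class name, `dial` callback, and connection interface are all hypothetical): if no ACK arrives within a timeout, tear the connection down and redial, instead of writing forever into a socket that still looks established but whose peer has silently stopped responding.

```python
import time

class LumberjackClientSketch:
    """Illustrative sketch: close and redial when an ACK does not arrive
    within `ack_timeout` seconds, instead of writing into a half-dead
    connection. All names here are hypothetical."""

    def __init__(self, dial, ack_timeout=30.0):
        self.dial = dial            # callable returning a fresh connection
        self.ack_timeout = ack_timeout
        self.conn = dial()
        self.reconnects = 0

    def publish(self, events):
        self.conn.send(events)
        deadline = time.monotonic() + self.ack_timeout
        while time.monotonic() < deadline:
            ack = self.conn.try_read_ack()
            if ack is not None:
                return ack          # peer is alive; events acknowledged
            time.sleep(0.01)
        # No ACK in time: assume the connection is dead even though the
        # TCP socket may still look established, and force a reconnect.
        self.conn.close()
        self.conn = self.dial()
        self.reconnects += 1
        return None                 # caller retries on the new connection
```

A real implementation would then resend the unacknowledged batch on the fresh connection; the point of the sketch is only the close-on-ACK-timeout policy.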
@ph: JVM:
If I restart filebeat, it reconnects and starts sending data, but if I restart logstash, nothing happens; I HAVE to restart filebeat. I will put my config together when I get to work. @urso …
Here is my config file:
@urso in your scenario, to make it crash you used the following configuration? You said filebeat is in a weird state, and from your comment, restarting LS unblocks filebeat. But if you restart filebeat in this scenario, is logstash still in a weird state? We could have a bug on both ends of the pipe.
@ph no, I used filebeat, as it will block the spooler if the output blocks, whereas generatorbeat does not use the guaranteed mode. I really have … Restarting either filebeat or LS did work for me. Sometimes it took a while for LS to recover though, like 2+ minutes until new connections were accepted.
Next time the system hangs, I will restart logstash and wait for a longer period. I must admit that when I restarted logstash, I only waited half a minute and then restarted filebeat. If I had waited longer, I might have gotten the same results as @urso.
I restarted logstash when filebeat locked up and checked netstat. There was no connection from filebeat to the logstash nodes, even after 15 minutes. After restarting filebeat, the connection was back up right away.
I installed the snapshot and we will see. I'll report the results.
I will make the backoff time configurable.
Hello, I am experiencing the same error. It seems the logstash beats input plugin (at the receiving end) stops working when a node from ES gets disconnected, and the flow doesn't resume; it stays stuck. This is really critical.
It looks like you have a different problem than me. If you restart logstash, does receiving resume, or do you have to restart filebeat?
Actually, the bug found can be triggered by the logstash plugin's internal timeout, given that a socket close on the TCP layer is not communicated correctly to filebeat (e.g. some firewall or NAT in between). If elasticsearch indexing takes too long (e.g. one 'long' pause due to garbage collection), the pipe in logstash will hit the timeout, disconnecting filebeat. Besides breaking filebeat reconnects (hopefully fixed), this has a negative effect on throughput too. Increasing congestion_threshold in logstash from its default of 5s to e.g. 30s might relieve the situation a little. In general I'd set …

Logstash generates back-pressure internally; that is, if an output is down or becomes unresponsive, the pipe will block and logstash will disconnect filebeat. There's no option in filebeat to stop logstash from doing so.
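For reference, the congestion_threshold option mentioned above is set on the beats input in the logstash pipeline configuration. A minimal sketch (the port is taken from the config posted earlier in this thread; verify the option against your installed logstash-input-beats version, as it was later deprecated):

```
input {
  beats {
    port => 5055
    congestion_threshold => 30
  }
}
```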
Is this fix merged into the 1.1.1 release?
Yes.
Hello, I was still experiencing the problem, even with Filebeat 1.1.1, in my last test. With the new update to Filebeat, after 10-20 seconds the error is not thrown anymore, but the logs still don't start to come in. I will do a more detailed test and provide more information.
I am using filebeat version 1.2.3. We frequently see this message in the log:

Jul 9 04:08:36 c4t08458 /usr/bin/filebeat[24986]: balance.go:248: Error publishing events (retrying): EOF
@sumanthakannantha For questions please open a topic here: https://discuss.elastic.co/c/beats/filebeat |
Hello,
I'm having trouble with filebeat not resuming after EOF. My filebeat is sending logs to 2 logstash nodes in loadbalance mode. Every day, at a random time, it stops sending, with an EOF error in the log file:
If I look at the logstash logs, I see this:
The "pipeline blocked" logs continue for another 20s and then stop. Logstash is still reading logs from redis, but filebeat never resumes. The only way to get logs flowing again is to restart filebeat.
Shouldn't filebeat resume sending logs? At least that is the idea I got on the filebeat IRC channel.