
Filebeat on Windows never recovers from a timeout or EOF #905

Closed
Cylindric opened this issue Feb 2, 2016 · 10 comments

@Cylindric

I'm using Filebeat on a bunch of Windows web servers to ship IIS log files to logstash.

It looks like when my logstash/elastic servers get a bit backed up and start stalling the pipeline for a while, Filebeat on the web servers correctly fails to send some log entries, but then never tries again.

Normally, I see this in the Filebeat logs:

2016-01-28T15:57:55Z INFO Events sent: 74
2016-01-28T15:57:55Z INFO Registry file updated. 327 states written.
2016-01-28T15:58:00Z INFO Run prospector
2016-01-28T15:58:00Z INFO Run prospector
2016-01-28T15:58:00Z INFO Run prospector
2016-01-28T15:58:00Z INFO Run prospector
2016-01-28T15:58:01Z INFO Run prospector
2016-01-28T15:58:10Z INFO Run prospector
2016-01-28T15:58:10Z INFO Run prospector
2016-01-28T15:58:10Z INFO Run prospector
2016-01-28T15:58:11Z INFO Run prospector
2016-01-28T15:58:11Z INFO Run prospector
2016-01-28T15:58:20Z INFO Run prospector
2016-01-28T15:58:20Z INFO Run prospector
2016-01-28T15:58:21Z INFO Run prospector
2016-01-28T15:58:21Z INFO Run prospector
2016-01-28T15:58:21Z INFO Run prospector
2016-01-28T15:58:25Z INFO Events sent: 125
2016-01-28T15:58:25Z INFO Registry file updated. 327 states written.
2016-01-28T15:58:30Z INFO Run prospector
2016-01-28T15:58:30Z INFO Run prospector
2016-01-28T15:58:31Z INFO Run prospector
2016-01-28T15:58:31Z INFO Run prospector
2016-01-28T15:58:31Z INFO Run prospector
2016-01-28T15:58:40Z INFO Run prospector
2016-01-28T15:58:40Z INFO Run prospector
2016-01-28T15:58:41Z INFO Run prospector
2016-01-28T15:58:41Z INFO Run prospector
2016-01-28T15:58:41Z INFO Run prospector
2016-01-28T15:58:50Z INFO Events sent: 127
2016-01-28T15:58:50Z INFO Registry file updated. 327 states written.
2016-01-28T15:58:50Z INFO Run prospector

Eventually, the backend blocks for whatever reason, and I'll get this, followed by infinite "Run prospector" messages until I restart filebeat.

2016-02-02T10:04:45Z INFO Run prospector
2016-02-02T10:04:46Z INFO Error publishing events (retrying): read tcp 10.117.1.11:55417->10.10.0.19:5044: wsarecv: An existing connection was forcibly closed by the remote host.
2016-02-02T10:04:47Z INFO Run prospector
2016-02-02T10:04:47Z INFO Run prospector
2016-02-02T10:04:49Z INFO Run prospector
2016-02-02T10:04:52Z INFO Run prospector
2016-02-02T10:04:55Z INFO Run prospector
2016-02-02T10:04:57Z INFO Run prospector
2016-02-02T10:04:57Z INFO Run prospector
2016-02-02T10:04:59Z INFO Run prospector
2016-02-02T10:05:02Z INFO Run prospector
2016-02-02T10:05:02Z INFO Error publishing events (retrying): EOF
2016-02-02T10:05:05Z INFO Run prospector
2016-02-02T10:05:07Z INFO Run prospector
2016-02-02T10:05:07Z INFO Run prospector
2016-02-02T10:05:09Z INFO Run prospector
2016-02-02T10:05:10Z INFO Error publishing events (retrying): EOF
2016-02-02T10:05:12Z INFO Run prospector
2016-02-02T10:05:15Z INFO Run prospector
2016-02-02T10:05:17Z INFO Run prospector
2016-02-02T10:05:17Z INFO Run prospector
2016-02-02T10:05:19Z INFO Run prospector
2016-02-02T10:05:22Z INFO Run prospector
2016-02-02T10:05:24Z INFO Error publishing events (retrying): EOF
2016-02-02T10:05:24Z INFO Events sent: 10000
2016-02-02T10:05:24Z INFO Registry file updated. 359 states written.
2016-02-02T10:05:25Z INFO Run prospector

That's a particularly odd example, because there's even one last send of 10000 before it goes silent. Usually a string of EOF errors is the last attempt I see Filebeat making.

My config is as follows. I added the max_retries in an attempt to make Filebeat retry forever, but that made no difference.

filebeat:
  spool_size: 10000
  idle_timeout: 10s

  prospectors:
    -
      type: log
      document_type: iis
      paths:
        - "C:/inetpub/logs/LogFiles/*/*.log"


  registry_file: "C:/ProgramData/filebeat/registry"

output:
  logstash:
    max_retries: -1
    hosts: ["logstash:5044", "elastic1:5044", "elastic2:5044", "elastic3:5044"]
    loadbalance: true

shipper:

logging:
  level: info
  to_files: true
  to_syslog: false

  files:
    path: C:/Filebeat
    name: filebeat.log
    keepfiles: 5

I'm currently trying the nightly build filebeat-1.2.0-nightly160126172227-windows.

@ruflin (Member) commented Feb 3, 2016:

Would it be possible for you to enable the log level debug to get some more information on what goes wrong during the publishing?

@urso commented Feb 3, 2016:

Please update your Filebeat. This might be related to #872, which was fixed after your build. Why not use Filebeat 1.1?

@Cylindric (Author) commented:

Because I've had this problem since pre-1.0, so I've been trying nightlies ever since, on the assumption that 1.1 is newer than 1.0 so would have more fixes in it. I'll try the latest nightly.
@ruflin I've updated it to the latest nightly now, and set logging to debug. Will report back.

@ruflin (Member) commented Feb 3, 2016:

@Cylindric Thanks

@sebbarg commented Feb 4, 2016:

I'm seeing the same here on Windows 2012R2 / filebeat 1.1.0. Here's a log snippet (INFO level):

2016-02-04T12:21:37Z INFO Events sent: 333
2016-02-04T12:21:37Z INFO Registry file updated. 3 states written.
2016-02-04T12:21:42Z INFO Events sent: 267
2016-02-04T12:21:42Z INFO Registry file updated. 3 states written.
2016-02-04T12:21:49Z INFO Events sent: 466
2016-02-04T12:21:49Z INFO Registry file updated. 3 states written.
2016-02-04T12:21:57Z INFO Events sent: 391
2016-02-04T12:21:57Z INFO Registry file updated. 3 states written.
2016-02-04T12:22:09Z INFO Error publishing events (retrying): EOF
2016-02-04T12:22:09Z INFO send fail
2016-02-04T12:22:09Z INFO backoff retry: 1s
2016-02-04T12:22:15Z INFO Error publishing events (retrying): EOF
2016-02-04T12:22:15Z INFO send fail
2016-02-04T12:22:15Z INFO backoff retry: 2s
2016-02-04T12:22:22Z INFO Events sent: 490
2016-02-04T12:22:22Z INFO Registry file updated. 3 states written.
2016-02-04T12:22:27Z INFO Error publishing events (retrying): EOF
2016-02-04T12:22:27Z INFO send fail
2016-02-04T12:22:27Z INFO backoff retry: 1s
2016-02-04T12:22:33Z INFO Error publishing events (retrying): EOF
2016-02-04T12:22:33Z INFO send fail
2016-02-04T12:22:33Z INFO backoff retry: 2s
2016-02-04T12:22:37Z INFO Events sent: 300
2016-02-04T12:22:37Z INFO Registry file updated. 3 states written.
2016-02-04T12:22:42Z INFO Error publishing events (retrying): EOF
2016-02-04T12:22:42Z INFO send fail
2016-02-04T12:22:42Z INFO backoff retry: 1s
2016-02-04T12:24:13Z INFO Error publishing events (retrying): read tcp 10.0.1.101:61582->10.0.1.10:5044: i/o timeout
2016-02-04T12:24:13Z INFO send fail
2016-02-04T12:24:13Z INFO backoff retry: 2s
2016-02-04T12:24:16Z INFO Events sent: 410
2016-02-04T12:24:16Z INFO Registry file updated. 3 states written.
2016-02-04T12:24:21Z INFO Error publishing events (retrying): EOF
2016-02-04T12:24:21Z INFO send fail
2016-02-04T12:24:21Z INFO backoff retry: 1s
2016-02-04T12:38:51Z INFO Stopping Registrar

The back-off is working... until suddenly it isn't. I restarted the service at 12:38 and all was good again.

@urso urso self-assigned this Feb 4, 2016
@urso commented Feb 9, 2016:

@Cylindric (Author) commented:

Thanks, I'm giving that a go both on my Windows IIS servers and my Debian Varnish servers, both of which seem to have similar issues.

Then I'll just need to work out what my bottleneck is on the server-side of things, so my logstash doesn't keep stalling :(

thanks,

@Cylindric (Author) commented:

13 hours in, and so-far, so-good. It usually would have died by now. Will try to update tomorrow after IIS cycles the log files.


@urso commented Feb 11, 2016:

Seems to be a duplicate of #878. Also see my comment there about congestion_threshold in the beats plugin. If indexing occasionally requires more than 5s, raise congestion_threshold. Also consider increasing timeout in the logstash output section of filebeat.yml. Not erroring all too often may have positive effects on throughput.
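A hedged sketch of the two settings mentioned above (the values are illustrative, not recommendations; congestion_threshold is an option of the logstash-input-beats plugin, and timeout is an option of the Filebeat 1.x logstash output):

```
# Logstash side (logstash-input-beats plugin):
input {
  beats {
    port => 5044
    # Default is 5s; raise it if indexing can occasionally block longer.
    congestion_threshold => 30
  }
}
```

```yaml
# Filebeat side (filebeat.yml), in the existing logstash output section:
output:
  logstash:
    hosts: ["logstash:5044"]
    # Network read/write timeout in seconds; a larger value reduces
    # spurious EOF / i/o timeout errors when the backend is slow.
    timeout: 60
```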

@Cylindric (Author) commented:

Hmm, it does look very similar, although I don't remember the window size, and my logs have now rotated out. I'll keep an eye open. The congestion_threshold tip looks interesting, though; I'll check that out for sure.


5 participants