
Filebeat on Windows never recovers from a timeout or EOF #905

Closed
Cylindric opened this issue Feb 2, 2016 · 10 comments

@Cylindric

I'm using Filebeat on a bunch of Windows web servers to ship IIS log files to logstash.

It looks like when my logstash/elastic servers get a bit backed up and start stalling the pipeline for a while, Filebeat on the web servers correctly fails to send some log entries, but then never tries again.

Normally, I see this in the Filebeat logs:

2016-01-28T15:57:55Z INFO Events sent: 74
2016-01-28T15:57:55Z INFO Registry file updated. 327 states written.
2016-01-28T15:58:00Z INFO Run prospector
2016-01-28T15:58:00Z INFO Run prospector
2016-01-28T15:58:00Z INFO Run prospector
2016-01-28T15:58:00Z INFO Run prospector
2016-01-28T15:58:01Z INFO Run prospector
2016-01-28T15:58:10Z INFO Run prospector
2016-01-28T15:58:10Z INFO Run prospector
2016-01-28T15:58:10Z INFO Run prospector
2016-01-28T15:58:11Z INFO Run prospector
2016-01-28T15:58:11Z INFO Run prospector
2016-01-28T15:58:20Z INFO Run prospector
2016-01-28T15:58:20Z INFO Run prospector
2016-01-28T15:58:21Z INFO Run prospector
2016-01-28T15:58:21Z INFO Run prospector
2016-01-28T15:58:21Z INFO Run prospector
2016-01-28T15:58:25Z INFO Events sent: 125
2016-01-28T15:58:25Z INFO Registry file updated. 327 states written.
2016-01-28T15:58:30Z INFO Run prospector
2016-01-28T15:58:30Z INFO Run prospector
2016-01-28T15:58:31Z INFO Run prospector
2016-01-28T15:58:31Z INFO Run prospector
2016-01-28T15:58:31Z INFO Run prospector
2016-01-28T15:58:40Z INFO Run prospector
2016-01-28T15:58:40Z INFO Run prospector
2016-01-28T15:58:41Z INFO Run prospector
2016-01-28T15:58:41Z INFO Run prospector
2016-01-28T15:58:41Z INFO Run prospector
2016-01-28T15:58:50Z INFO Events sent: 127
2016-01-28T15:58:50Z INFO Registry file updated. 327 states written.
2016-01-28T15:58:50Z INFO Run prospector

Eventually, the backend blocks for whatever reason, and I'll get this, followed by infinite "Run prospector" messages until I restart filebeat.

2016-02-02T10:04:45Z INFO Run prospector
2016-02-02T10:04:46Z INFO Error publishing events (retrying): read tcp 10.117.1.11:55417->10.10.0.19:5044: wsarecv: An existing connection was forcibly closed by the remote host.
2016-02-02T10:04:47Z INFO Run prospector
2016-02-02T10:04:47Z INFO Run prospector
2016-02-02T10:04:49Z INFO Run prospector
2016-02-02T10:04:52Z INFO Run prospector
2016-02-02T10:04:55Z INFO Run prospector
2016-02-02T10:04:57Z INFO Run prospector
2016-02-02T10:04:57Z INFO Run prospector
2016-02-02T10:04:59Z INFO Run prospector
2016-02-02T10:05:02Z INFO Run prospector
2016-02-02T10:05:02Z INFO Error publishing events (retrying): EOF
2016-02-02T10:05:05Z INFO Run prospector
2016-02-02T10:05:07Z INFO Run prospector
2016-02-02T10:05:07Z INFO Run prospector
2016-02-02T10:05:09Z INFO Run prospector
2016-02-02T10:05:10Z INFO Error publishing events (retrying): EOF
2016-02-02T10:05:12Z INFO Run prospector
2016-02-02T10:05:15Z INFO Run prospector
2016-02-02T10:05:17Z INFO Run prospector
2016-02-02T10:05:17Z INFO Run prospector
2016-02-02T10:05:19Z INFO Run prospector
2016-02-02T10:05:22Z INFO Run prospector
2016-02-02T10:05:24Z INFO Error publishing events (retrying): EOF
2016-02-02T10:05:24Z INFO Events sent: 10000
2016-02-02T10:05:24Z INFO Registry file updated. 359 states written.
2016-02-02T10:05:25Z INFO Run prospector

That's a particularly odd example, because there's even one last send of 10000 before it goes silent. Usually a string of EOF errors is the last attempt I see Filebeat making.

My config is as follows. I added the max_retries in an attempt to make Filebeat retry forever, but that made no difference.

filebeat:
  spool_size: 10000
  idle_timeout: 10s

  prospectors:
    -
      type: log
      document_type: iis
      paths:
        - "C:/inetpub/logs/LogFiles/*/*.log"


  registry_file: "C:/ProgramData/filebeat/registry"

output:
  logstash:
    max_retries: -1
    hosts: ["logstash:5044", "elastic1:5044", "elastic2:5044", "elastic3:5044"]
    loadbalance: true

shipper:

logging:
  level: info
  to_files: true
  to_syslog: false

  files:
    path: C:/Filebeat
    name: filebeat.log
    keepfiles: 5

I'm currently trying the nightly build filebeat-1.2.0-nightly160126172227-windows.

@ruflin (Member) commented Feb 3, 2016:

Would it be possible for you to enable the log level debug to get some more information on what goes wrong during the publishing?

@urso commented Feb 3, 2016:

Please update your Filebeat. This might be related to #872, which was fixed after your build. Why not use Filebeat 1.1?

@Cylindric (Author) commented:

Because I've had this problem since pre-1.0, so I've been trying nightlies ever since, on the assumption that 1.1 is newer than 1.0 so would have more fixes in it. I'll try the latest nightly.
@ruflin I've updated it to the latest nightly now, and set logging to debug. Will report back.

@ruflin (Member) commented Feb 3, 2016:

@Cylindric Thanks

@sebbarg commented Feb 4, 2016:

I'm seeing the same here on Windows 2012R2 / filebeat 1.1.0. Here's a log snippet (INFO level):

2016-02-04T12:21:37Z INFO Events sent: 333
2016-02-04T12:21:37Z INFO Registry file updated. 3 states written.
2016-02-04T12:21:42Z INFO Events sent: 267
2016-02-04T12:21:42Z INFO Registry file updated. 3 states written.
2016-02-04T12:21:49Z INFO Events sent: 466
2016-02-04T12:21:49Z INFO Registry file updated. 3 states written.
2016-02-04T12:21:57Z INFO Events sent: 391
2016-02-04T12:21:57Z INFO Registry file updated. 3 states written.
2016-02-04T12:22:09Z INFO Error publishing events (retrying): EOF
2016-02-04T12:22:09Z INFO send fail
2016-02-04T12:22:09Z INFO backoff retry: 1s
2016-02-04T12:22:15Z INFO Error publishing events (retrying): EOF
2016-02-04T12:22:15Z INFO send fail
2016-02-04T12:22:15Z INFO backoff retry: 2s
2016-02-04T12:22:22Z INFO Events sent: 490
2016-02-04T12:22:22Z INFO Registry file updated. 3 states written.
2016-02-04T12:22:27Z INFO Error publishing events (retrying): EOF
2016-02-04T12:22:27Z INFO send fail
2016-02-04T12:22:27Z INFO backoff retry: 1s
2016-02-04T12:22:33Z INFO Error publishing events (retrying): EOF
2016-02-04T12:22:33Z INFO send fail
2016-02-04T12:22:33Z INFO backoff retry: 2s
2016-02-04T12:22:37Z INFO Events sent: 300
2016-02-04T12:22:37Z INFO Registry file updated. 3 states written.
2016-02-04T12:22:42Z INFO Error publishing events (retrying): EOF
2016-02-04T12:22:42Z INFO send fail
2016-02-04T12:22:42Z INFO backoff retry: 1s
2016-02-04T12:24:13Z INFO Error publishing events (retrying): read tcp 10.0.1.101:61582->10.0.1.10:5044: i/o timeout
2016-02-04T12:24:13Z INFO send fail
2016-02-04T12:24:13Z INFO backoff retry: 2s
2016-02-04T12:24:16Z INFO Events sent: 410
2016-02-04T12:24:16Z INFO Registry file updated. 3 states written.
2016-02-04T12:24:21Z INFO Error publishing events (retrying): EOF
2016-02-04T12:24:21Z INFO send fail
2016-02-04T12:24:21Z INFO backoff retry: 1s
2016-02-04T12:38:51Z INFO Stopping Registrar

The back-off is working... until suddenly it isn't. I restarted the service at 12:38 and all was good again.

@urso urso self-assigned this Feb 4, 2016
@urso commented Feb 9, 2016:

@Cylindric (Author) commented:

Thanks, I'm giving that a go both on my Windows IIS servers and my Debian Varnish servers, both of which seem to have similar issues.

Then I'll just need to work out what my bottleneck is on the server-side of things, so my logstash doesn't keep stalling :(

thanks,

@Cylindric (Author) commented:

13 hours in, and so-far, so-good. It usually would have died by now. Will try to update tomorrow after IIS cycles the log files.


@urso commented Feb 11, 2016:

Seems to be a duplicate of #878. Also see my comment there about congestion_threshold in the beats plugin. If indexing occasionally requires more than 5s, raise congestion_threshold. Also consider increasing timeout in the logstash output section of filebeat.yml. Not erroring all too often may have positive effects on throughput.
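A hedged sketch of the two settings mentioned above (the values are illustrative, not recommendations; congestion_threshold is an option of the logstash-input-beats plugin, and timeout is an option of the Filebeat 1.x logstash output):

```
# Logstash side (logstash-input-beats plugin):
input {
  beats {
    port => 5044
    # Default is 5s; raise it if indexing can occasionally block longer.
    congestion_threshold => 30
  }
}
```

```yaml
# Filebeat side (filebeat.yml), in the existing logstash output section:
output:
  logstash:
    hosts: ["logstash:5044"]
    # Network read/write timeout in seconds; a larger value reduces
    # spurious EOF / i/o timeout errors when the backend is slow.
    timeout: 60
```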

@Cylindric (Author) commented:

Hmm, it does look very similar, although I don't remember the window size, and my logs have now rotated out. I'll keep an eye open. The congestion_threshold tip looks interesting, though; I'll check that out for sure.


5 participants