Filebeat Fails After Power Failure #3537
Is there another log file, like /var/log/filebeat/filebeat.1, containing data from before the ungraceful shutdown? It may not contain anything useful since the logs may not have been synced to disk. Filebeat writes registry data to a temporary file, calls fsync, then uses a move to achieve something that's close to an atomic update. So we are a bit surprised to see a zero-byte file, since fsync is supposed to block until the bytes are written to disk. |
Sure is. Added the .1 → .6 log files for completeness.
Reviewed the other server that experienced the same blank registry file, and it is pretty much the same as the logs provided below. Nothing additional. I could possibly try to reproduce the issue by enabling a higher level of verbose logging within filebeat (if that's possible) and causing a power failure? Would that possibly yield better information?
[—REDACT--@—REDACT-- filebeat]# cat filebeat.1
2017-02-05T21:39:16+13:00 INFO Metrics logging every 30s
[—REDACT--@—REDACT-- filebeat]# cat filebeat.2
2017-02-05T21:39:16+13:00 INFO Metrics logging every 30s
2017-02-05T21:39:16+13:00 INFO Max Retries set to: 3
2017-02-05T21:39:16+13:00 INFO Activated logstash as output plugin.
2017-02-05T21:39:16+13:00 INFO Publisher name: [email protected]
2017-02-05T21:39:16+13:00 INFO Flush Interval set to: 1s
2017-02-05T21:39:16+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T21:39:16+13:00 INFO filebeat start running.
2017-02-05T21:39:16+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Total non-zero values:
2017-02-05T21:39:16+13:00 INFO Uptime: 15.5132ms
2017-02-05T21:39:16+13:00 INFO filebeat stopped.
2017-02-05T21:39:16+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF
[—REDACT--@—REDACT-- filebeat]# cat filebeat.3
2017-02-05T21:39:16+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T21:39:16+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
[—REDACT--@—REDACT-- filebeat]# cat filebeat.4
2017-02-05T21:39:16+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T21:39:16+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
2017-02-05T21:39:16+13:00 INFO Max Retries set to: 3
2017-02-05T21:39:16+13:00 INFO Activated logstash as output plugin.
2017-02-05T21:39:16+13:00 INFO Publisher name: [email protected]
2017-02-05T21:39:16+13:00 INFO Flush Interval set to: 1s
2017-02-05T21:39:16+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T21:39:16+13:00 INFO filebeat start running.
2017-02-05T21:39:16+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Total non-zero values:
2017-02-05T21:39:16+13:00 INFO Uptime: 11.1636ms
2017-02-05T21:39:16+13:00 INFO filebeat stopped.
2017-02-05T21:39:16+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF
[—REDACT--@—REDACT-- filebeat]# cat filebeat.5
2017-02-05T21:39:16+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T21:39:16+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
2017-02-05T21:39:16+13:00 INFO Max Retries set to: 3
2017-02-05T21:39:16+13:00 INFO Activated logstash as output plugin.
2017-02-05T21:39:16+13:00 INFO Publisher name: [email protected]
2017-02-05T21:39:16+13:00 INFO Flush Interval set to: 1s
2017-02-05T21:39:16+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T21:39:16+13:00 INFO filebeat start running.
2017-02-05T21:39:16+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T21:39:16+13:00 INFO Total non-zero values:
2017-02-05T21:39:16+13:00 INFO Uptime: 14.0138ms
2017-02-05T21:39:16+13:00 INFO filebeat stopped.
2017-02-05T21:39:16+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF
[—REDACT--@—REDACT-- filebeat]# cat filebeat.6
2017-02-05T21:39:15+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T21:39:15+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
2017-02-05T21:39:15+13:00 INFO Max Retries set to: 3
2017-02-05T21:39:15+13:00 INFO Activated logstash as output plugin.
2017-02-05T21:39:15+13:00 INFO Publisher name: [email protected]
2017-02-05T21:39:15+13:00 INFO Flush Interval set to: 1s
2017-02-05T21:39:15+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T21:39:15+13:00 INFO filebeat start running.
2017-02-05T21:39:15+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T21:39:15+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T21:39:15+13:00 INFO Total non-zero values:
2017-02-05T21:39:15+13:00 INFO Uptime: 39.7352ms
2017-02-05T21:39:15+13:00 INFO filebeat stopped.
2017-02-05T21:39:15+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF
|
I'm experiencing (or rather experienced) a similar issue: after updating filebeat to the latest (5.0.2) version, the filebeat service wouldn't start anymore; however, after removing /var/lib/filebeat/registry the filebeat service starts without any issues. |
I experienced something like this in a VMWare Virtual Machine of CentOS 7 64-Bit. When I power off the VM or revert to a snapshot, filebeat will not start again. The fix for me was to delete the registry file. I made a forum post about it with more information, such as the log files/syslogs. |
Thanks for the response; removal of the registry file has resolved the issue for me as well.
Is a possible fix to perform an error check that determines whether the EOF is present in the log file and, if true, checks the integrity of the registry file; if it is blank, remove it and reattempt starting the service, else continue?
Identifying the reason for the agent failing would be the optimal solution in my mind, but maybe error checking is an appropriate way to handle this specific issue if the root cause cannot be determined? I'm not a developer by day, so this is just random thinking on possible solutions.
The provided workaround was successful, thanks for that. I haven't seen the issue occur with Ubuntu VMs, which also suffered the same critical failure (also running filebeat) but had no issues starting back up. Are there differences in how CentOS and Ubuntu handle ungraceful failure that may be contributing to the issue?
|
Since version 5.0, filebeat performs some checks on the registry file to verify it is in a valid state, and with 5.1 additional checks were introduced. The problem before was that the above issues were not even detected. Now, if there is an invalid registry file, filebeat will not start. If we just reset the registry file, this would lead to duplicated events. @andynz2017 About the log files you posted: it seems you tried to start filebeat several times, so each time a new log file was written, and the one we are interested in no longer exists. So far it seems everyone reporting an issue with restarts and the registry file is using virtual machines. Could you share what file systems you are using? Did anyone find a "reliable" way to reproduce this? |
Nicolas - That would make a lot of sense; we must have lost the key log files you were looking for. Damn.
The file system on both systems that experienced the failure was XFS for the OS and ext3 for the drive where the ES indexes are stored. OS - sda1, ES storage - sdb1 (ELK server). The other server was a CentOS 7 server running Open Monitoring Distribution; its OS drive was also using XFS.
The OS partition on both servers uses the LVM2 format, as per vgdisplay output.
I haven't been able to reliably reproduce the issue to date, but as this is a lab environment, I can try to do so over the next couple of days to help us gain some traction.
I have a couple of ideas on where to start with the testing but am hoping for some input on the order: whether the suspicion is with the file system type, the partition format, the way the OS handles ungraceful failure, or somewhere else. It may be worthwhile coming up with a plan of attack to save time.
Andy
|
Just an update here.
The host failed last night. The following information was written over a 13-minute period prior to the power failure on one of the servers. Comparing the output to another log such as /var/log/messages, it looks like nothing at all was written at the time the server died.
It appears to indicate that whenever there is a major catastrophic power event, the VMs on the host do not have time to write anything to disk. This only affects CentOS servers. Ubuntu servers running filebeat do not experience the same issues, and filebeat gracefully starts fine when those VMs boot up. The Ubuntu 16.04 VM runs the default file system, ext4 – I didn't have my 2FA token handy to elevate to root to verify this, but I will do that tomorrow.
Looking through the actual host logs, nothing was written to disk there either – no kernel power alerts, no dump file, no logs in application or system prior to the event. I can't explain this one. Any ideas on what other logs/information may be useful in helping track down the root of this issue? I may need to use a process of elimination and build some test servers with different file systems etc. to see if I observe similar behaviour?
//Last extract from /var/log/messages
Feb 13 00:32:01 —Redact-- systemd[1]: Removed slice user-997.slice.
Feb 13 00:32:01 —Redact-- systemd[1]: Stopping user-997.slice.
Feb 13 00:32:01 —Redact-- systemd[1]: Removed slice user-996.slice.
Feb 13 00:32:01 —Redact-- systemd[1]: Stopping user-996.slice.
Feb 13 06:44:56 —Redact-- rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="930" x-info="http://www.rsyslog.com"] start
//Last 13 or so minutes from /var/log/filebeat/filebeat
2017-02-13T00:19:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 publish.events=14 registrar.writes=1 libbeat.publisher.published_events=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1244 libbeat.logstash.published_and_acked_events=14 libbeat.logstash.publish.read_bytes=35
2017-02-13T00:19:37+13:00 INFO Non-zero metrics in the last 30s: publish.events=12 registrar.states.update=12 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=1320 libbeat.logstash.published_and_acked_events=12 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 registrar.writes=1
2017-02-13T00:20:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 registrar.writes=1 libbeat.logstash.published_and_acked_events=14 libbeat.logstash.publish.read_bytes=35 publish.events=14 libbeat.logstash.publish.write_bytes=1312 libbeat.logstash.call_count.PublishEvents=1 libbeat.publisher.published_events=14
2017-02-13T00:20:37+13:00 INFO Non-zero metrics in the last 30s: publish.events=8 libbeat.logstash.publish.write_bytes=938 libbeat.publisher.published_events=8 libbeat.logstash.published_and_acked_events=8 registrar.writes=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35 registrar.states.update=8
2017-02-13T00:21:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=14 libbeat.logstash.publish.read_bytes=35 registrar.states.update=14 registrar.writes=1 libbeat.logstash.publish.write_bytes=1231 publish.events=14 libbeat.publisher.published_events=14 libbeat.logstash.call_count.PublishEvents=1
2017-02-13T00:21:37+13:00 INFO Non-zero metrics in the last 30s: registrar.writes=1 libbeat.logstash.publish.write_bytes=1253 libbeat.logstash.publish.read_bytes=35 publish.events=12 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=12 libbeat.publisher.published_events=12 registrar.states.update=12
2017-02-13T00:22:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1297 libbeat.publisher.published_events=14 registrar.writes=1 publish.events=14 registrar.states.update=14 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=14
2017-02-13T00:22:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1103 registrar.states.update=8 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=8 libbeat.logstash.publish.read_bytes=35 registrar.writes=1 libbeat.publisher.published_events=8 publish.events=8
2017-02-13T00:23:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=14 registrar.states.update=14 libbeat.logstash.publish.write_bytes=1312 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 publish.events=14
2017-02-13T00:23:37+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=12 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=1221 publish.events=12 libbeat.logstash.published_and_acked_events=12
2017-02-13T00:24:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35 publish.events=14 registrar.states.update=14 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 libbeat.logstash.publish.write_bytes=1244 registrar.writes=1
2017-02-13T00:24:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=8 libbeat.publisher.published_events=8 libbeat.logstash.call_count.PublishEvents=1 registrar.states.update=8 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=927 publish.events=8 registrar.writes=1
2017-02-13T00:25:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 publish.events=14 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 registrar.writes=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1253
2017-02-13T00:25:37+13:00 INFO Non-zero metrics in the last 30s: publish.events=12 libbeat.logstash.published_and_acked_events=12 registrar.states.update=12 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 registrar.writes=1 libbeat.logstash.publish.write_bytes=1264 libbeat.logstash.publish.read_bytes=35
2017-02-13T00:26:07+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=14 registrar.writes=1 publish.events=14 libbeat.logstash.publish.read_bytes=35 libbeat.publisher.published_events=14 libbeat.logstash.publish.write_bytes=1301 libbeat.logstash.published_and_acked_events=14 libbeat.logstash.call_count.PublishEvents=1
2017-02-13T00:26:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 registrar.states.update=8 registrar.writes=1 libbeat.logstash.published_and_acked_events=8 publish.events=8 libbeat.publisher.published_events=8 libbeat.logstash.publish.write_bytes=948 libbeat.logstash.publish.read_bytes=35
2017-02-13T00:27:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1228 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 publish.events=14 libbeat.logstash.publish.read_bytes=35 registrar.writes=1 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14
2017-02-13T00:27:37+13:00 INFO Non-zero metrics in the last 30s: registrar.writes=1 libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 publish.events=12 libbeat.logstash.published_and_acked_events=12 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.publish.write_bytes=1229 registrar.states.update=12
2017-02-13T00:28:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 publish.events=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1263 registrar.states.update=14
2017-02-13T00:28:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=955 registrar.states.update=8 libbeat.publisher.published_events=8 libbeat.logstash.publish.read_bytes=35 registrar.writes=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=8 publish.events=8
2017-02-13T00:29:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1304 publish.events=14 libbeat.logstash.publish.read_bytes=35 libbeat.logstash.call_count.PublishEvents=1 libbeat.publisher.published_events=14 registrar.writes=1 libbeat.logstash.published_and_acked_events=14 registrar.states.update=14
2017-02-13T00:29:37+13:00 INFO Non-zero metrics in the last 30s: registrar.states.update=12 libbeat.logstash.publish.write_bytes=1225 libbeat.publisher.published_events=12 libbeat.logstash.publish.read_bytes=35 publish.events=12 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=12 registrar.writes=1
2017-02-13T00:30:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1322 publish.events=14 registrar.writes=1 libbeat.logstash.publish.read_bytes=35 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14
2017-02-13T00:30:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=949 libbeat.publisher.published_events=8 registrar.writes=1 publish.events=8 libbeat.logstash.published_and_acked_events=8 registrar.states.update=8 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35
2017-02-13T00:31:07+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.write_bytes=1249 libbeat.logstash.published_and_acked_events=14 publish.events=14 registrar.writes=1 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.read_bytes=35 libbeat.publisher.published_events=14
2017-02-13T00:31:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=12 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1345 libbeat.logstash.published_and_acked_events=12 libbeat.logstash.publish.read_bytes=35 registrar.states.update=12 registrar.writes=1 publish.events=12
2017-02-13T00:32:07+13:00 INFO Non-zero metrics in the last 30s: registrar.writes=1 libbeat.logstash.published_and_acked_events=14 libbeat.publisher.published_events=14 publish.events=14 registrar.states.update=14 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=1266 libbeat.logstash.publish.read_bytes=35
2017-02-13T00:32:37+13:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.published_and_acked_events=8 registrar.writes=1 libbeat.logstash.publish.write_bytes=960 libbeat.publisher.published_events=8 libbeat.logstash.publish.read_bytes=35 registrar.states.update=8 publish.events=8
Andrew
|
@andynz2017 Thanks a lot for digging deeper into this one. I went through all the logs above; unfortunately there isn't anything inside that would even indicate that filebeat was stopped. But this is not unexpected, as the machine was not shut down properly. Here are the notes I made when going through the issue:
Could it be that something "strange" happens when a process has an open file handle when the OS crashes? |
Hi folks, On a hunch, I'm going to suggest the problem is a lack of … (see beats/filebeat/registrar/registrar.go, lines 296 to 328 at 0af759a).
Event timeline: …
The fix is probably to open the file with … |
Apologies, I haven't had time to read the whole thread, but I think I have experienced this issue as well. I got an ELK dev stack up and running on my laptop, installed filebeat, and had it working yesterday. Then when I booted this morning and checked Kibana - no logs since last night. My laptop didn't have an ungraceful shutdown, so I'm not sure what caused the issue.
2017-02-24T10:10:27Z INFO filebeat start running.
2017-02-24T10:10:27Z INFO Registry file set to: C:\ProgramData\filebeat\registry
2017-02-24T10:10:27Z INFO Loading registrar data from C:\ProgramData\filebeat\registry
2017-02-24T10:10:27Z ERR Error decoding old state: invalid character '\x00' looking for beginning of value
2017-02-24T10:10:27Z INFO Total non-zero values:
2017-02-24T10:10:27Z INFO Uptime: 35.9984ms
2017-02-24T10:10:27Z INFO filebeat stopped.
2017-02-24T10:10:27Z CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value
EDIT: Sorry, I was in a bit of a rush earlier - I'm running Windows 10, build 14393.726 - JDK v8 u121. |
@leinad13 That seems to be a different issue. Can you open a topic on discuss? https://discuss.elastic.co/c/beats/filebeat |
Hey,
Out of interest, was the registry file blank in your case? I noticed you're running filebeat on Windows, whereas the initial thread was based on CentOS 64-bit... this issue is possibly OS-agnostic, and the common information I have gathered to date seems to indicate that the issue is triggered during shutdown/power-off. I'm not sure if it's just a coincidence or if there is actually something going on with how filebeat writes to the registry during the shutdown/power-off scenario.
I'm guessing that removing the registry file in your case and starting the filebeat service resolves the issue?
|
@avleen The PR just got merged. I really appreciate the work you put into this one to find the issue and point it out. |
Thanks @ruflin ! |
Closes elastic#3537 (cherry picked from commit e2e58c0)
I was having the same error as @leinad13 after a BSOD, namely: "2017-09-07T10:51:51+03:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: invalid character '\x00' looking for beginning of value". I am using filebeat 5.5.2. The fix was to delete the files in "C:\ProgramData\filebeat", if that helps someone reading this thread. |
We had the same issue; this was on 5.2.2. |
I just had the same issue after a host failure, filebeat v 5.6.3 |
@cosminpopescu @JoshSchreuder @trondhindenes There must be something more there, as the above fix is in all 5.x branches. Any chance you could provide some more information, such as:
|
I am still able to reproduce it with filebeat 6.1.1 and 6.2.1 on Linux and FreeBSD.
Start fails with empty registry
|
I suggest not catting … |
@ruflin I've also had the same issue with filebeat not starting after a hard shutdown on multiple systems. Here are some details:
Let me know if there's anything else I can get from these systems to get to the bottom of this. |
Here is the Filebeat issue for it: #6792 |
Issue: Filebeat running on CentOS Linux release 7.3.1611 (Core), filebeat 5.1. The server suffered a power failure due to the host failing: an ungraceful shutdown, with the VM running off a remote QNAP via iSCSI. The VM recovers fine, but filebeat doesn't start.
Host OS: Windows Server 2016 Datacenter Evaluation.
HyperVisor: Hyper-V 10.0.14393.0.
VM OS Affected: CentOS Linux release 7.3.1611 (Core), x64 Server Edition.
Filebeat: Version 5.1.1
Filebeat Log: Default logging verbosity.
File before failure:
This is the only file that was in use. /var/log/filebeat/* yields just the standard EOF error, with no details around the timestamp or what action the filebeat agent was taking at the time of the power failure. Is there an alternate location that might tell me this information?
2017-02-05T19:21:25+13:00 INFO Home path: [/usr/share/filebeat] Config path: [/etc/filebeat] Data path: [/var/lib/filebeat] Logs path: [/var/log/filebeat]
2017-02-05T19:21:25+13:00 INFO Setup Beat: filebeat; Version: 5.1.1
2017-02-05T19:21:25+13:00 INFO Max Retries set to: 3
2017-02-05T19:21:25+13:00 INFO Activated logstash as output plugin.
2017-02-05T19:21:25+13:00 INFO Publisher name: mon.testdomain.com
2017-02-05T19:21:25+13:00 INFO Flush Interval set to: 1s
2017-02-05T19:21:25+13:00 INFO Max Bulk Size set to: 2048
2017-02-05T19:21:25+13:00 INFO filebeat start running.
2017-02-05T19:21:25+13:00 INFO Registry file set to: /var/lib/filebeat/registry
2017-02-05T19:21:25+13:00 INFO Loading registrar data from /var/lib/filebeat/registry
2017-02-05T19:21:25+13:00 INFO Total non-zero values:
2017-02-05T19:21:25+13:00 INFO Uptime: 9.9703ms
2017-02-05T19:21:25+13:00 INFO filebeat stopped.
2017-02-05T19:21:25+13:00 CRIT Exiting: Could not start registrar: Error loading state: Error decoding states: EOF
*** This issue has been moved to Github from https://discuss.elastic.co/t/filebeat-fails-after-power-failure/73967 ***
Happy to provide any additional detail; please just provide steps to acquire it if it's not in obvious log files.