This repository has been archived by the owner on Nov 8, 2022. It is now read-only.

Plugin randomly dies with a connection shutdown error message #28

Closed
gregleroux opened this issue Dec 27, 2016 · 9 comments
@gregleroux

Snap daemon version (use snapteld -v):
./bin/snapteld -v
snapteld version master-91ea244

Environment:

  • Cloud provider or hardware configuration:

  • OS (e.g. from /etc/os-release):
    /etc/debian_version
    8.6

  • Kernel (e.g. uname -a):
    4.4.14-3

  • Relevant tools (e.g. plugins used with Snap):
    NAME VERSION TYPE SIGNED STATUS LOADED TIME
    df 5 collector false loaded Wed, 21 Dec 2016 16:10:27 CET
    meminfo 3 collector false loaded Wed, 21 Dec 2016 16:10:28 CET
    processes 7 collector false loaded Wed, 21 Dec 2016 16:10:28 CET
    psutil 10 collector false loaded Wed, 21 Dec 2016 16:10:29 CET
    psutil 9 collector false loaded Wed, 21 Dec 2016 16:10:29 CET
    file 2 publisher false loaded Wed, 21 Dec 2016 16:10:29 CET
    opentsdb 9 publisher false loaded Wed, 21 Dec 2016 16:10:30 CET

  • Others (e.g. deploying with Ansible):

What happened:
It works fine for hours or days, then the plugin stops with the error messages below. Other plugins keep working.

Task configuration:

{
    "version": 1,
    "schedule": {
        "type": "simple",
        "interval": "45s"
    },
    "deadline":"45s",
    "max-failures":-1,
    "workflow": {
        "collect": {
            "metrics": {
                "/intel/psutil/load/load1": {},
                "/intel/psutil/load/load5": {},
                "/intel/psutil/load/load15": {},
                "/intel/psutil/cpu/cpu-total/user": {},
                "/intel/psutil/cpu/cpu-total/iowait": {},
                "/intel/psutil/cpu/cpu-total/system": {},
                "/intel/psutil/net/*/bytes_sent": {},
                "/intel/psutil/net/*/packets_sent": {},
                "/intel/psutil/net/*/bytes_recv": {},
                "/intel/psutil/net/*/packets_recv": {},
                "/intel/psutil/net/*/errin": {},
                "/intel/psutil/net/*/errout": {},
                "/intel/psutil/vm/available": {},
                "/intel/psutil/vm/free": {},
                "/intel/psutil/vm/used": {},
                "/intel/psutil/vm/used_percent": {},
                "/intel/psutil/vm/total": {},
                "/intel/procfs/meminfo/mem_used_perc": {},
                "/intel/procfs/meminfo/swap_cached_perc": {},
                "/intel/procfs/meminfo/mem_used": {}
            },
            "config": {
                "/intel/procfs": {
                    "proc_path": "/proc"
                }
          },
		  "publish": [
              {
              "plugin_name": "opentsdb",
                  "config": {
                      "host": "https://xxx:[email protected]"
                      }
              }
          ]

        }
    }
}

WARN[2016-12-11T13:54:50+01:00] Task failed _block=spin _module=scheduler-task consecutive failure limit=-1 consecutive failures=8 error=Publish call error: Post https://xxx:[email protected]/api/put: net/http: request canceled (Client.Timeout exceeded while awaiting headers) task-id=405b8374-fae6-456c-bdbb-a57a65c01cf2 task-name=Task-405b8374-fae6-456c-bdbb-a57a65c01cf2
WARN[2016-12-11T13:54:52+01:00] heartbeat missed _module=control-aplugin block=check-health plugin_name=publisher:opentsdb:v9:id3
WARN[2016-12-11T13:54:57+01:00] heartbeat missed _module=control-aplugin block=check-health plugin_name=publisher:opentsdb:v9:id3
ERRO[2016-12-11T13:54:58+01:00] error with publisher job _module=scheduler-job block=run error=connection is shut down job-type=publisher plugin-config=map[host:{Value:https://xxx:[email protected]}] plugin-name=opentsdb plugin-version=-1
WARN[2016-12-11T13:54:58+01:00] Publish job failed _block=submit-publish-job _module=scheduler-workflow parent-node-type=collector publish-name=opentsdb publish-version=-1 task-id=ec78a8f7-701b-44d2-83a9-4436dec1390d task-name=Task-ec78a8f7-701b-44d2-83a9-4436dec1390d
WARN[2016-12-11T13:54:58+01:00] Task failed _block=spin _module=scheduler-task consecutive failure limit=-1 consecutive failures=54 error=connection is shut down task-id=ec78a8f7-701b-44d2-83a9-4436dec1390d task-name=Task-ec78a8f7-701b-44d2-83a9-4436dec1390d
WARN[2016-12-11T13:55:02+01:00] heartbeat missed _module=control-aplugin block=check-health plugin_name=publisher:opentsdb:v9:id3
WARN[2016-12-11T13:55:02+01:00] heartbeat failed _module=control-aplugin block=check-health plugin_name=publisher:opentsdb:v9:id3
WARN[2016-12-11T13:55:02+01:00] handling dead available plugin event _block=handle-events _module=control-runner aplugin=publisher:opentsdb:v9:id3 event=Control.AvailablePluginDead
ERRO[2016-12-11T13:55:08+01:00] error selecting _module=control-routing block=select error=could not select a plugin strategy=least-recently-used
ERRO[2016-12-11T13:55:08+01:00] error with publisher job _module=scheduler-job block=run error=could not select a plugin job-type=publisher plugin-config=map[host:{Value:https://xxx:[email protected]}] plugin-name=opentsdb plugin-version=-1
WARN[2016-12-11T13:55:08+01:00] Publish job failed _block=submit-publish-job _module=scheduler-workflow parent-node-type=collector publish-name=opentsdb publish-version=-1 task-id=ec78a8f7-701b-44d2
What you expected to happen:

Steps to reproduce it (as minimally and precisely as possible):

Anything else do we need to know (e.g. issue happens only occasionally):
I've tried numerous different timeouts and retry configurations and nothing seems to work. Once the error message is shown the plugin does not recover regardless of the max-failures.

@bjray

bjray commented Jan 3, 2017

Hey @gregleroux, sorry for the delay. Someone will take a look at it and get back to you.

@nanliu
Collaborator

nanliu commented Jan 4, 2017

Hey @gregleroux, I didn't go through the whole opentsdb setup, but I simulated a server endpoint using netcat. I loaded the opentsdb plugin and ran the example task with max-failures: -1:

{
  "version": 1,
  "schedule": {
    "type": "simple",
    "interval": "10s"
  },
  "max-failures": -1,
  "workflow": {
    "collect": {
      "metrics": {
        "/intel/psutil/load/load1": {},
        "/intel/psutil/load/load15": {},
        "/intel/psutil/load/load5": {},
        "/intel/psutil/vm/available": {},
        "/intel/psutil/vm/free": {},
        "/intel/psutil/vm/used": {}
      },
      "publish": [
        {
          "plugin_name": "opentsdb",
          "config": {
            "host": "127.0.0.1",
            "port": 4242
          }
        }
      ]
    }
  }
}

When I start this task, it fails because there is no opentsdb service running (and I see a similar net/http Client.Timeout failure in the logs):

$ snaptel task list
ID 					 NAME 						 STATE 		 HIT 	 MISS 	 FAIL 	 CREATED 		 LAST FAILURE
95601b52-a99e-4563-b935-c13370630646 	 Task-95601b52-a99e-4563-b935-c13370630646 	 Running 	 2 	 0 	 2 	 3:52PM 1-03-2017 	 Publish call error: Post http://127.0.0.1:4242/api/put: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

However, as soon as I enable the fake opentsdb service (using netcat), it starts to publish:

DEBUG[2017-01-03T16:25:37-08:00] Publish job completed                         _block=submit-publish-job _module=scheduler-workflow parent-node-type=collector publish-name=opentsdb publish-version=-1 task-id=95601b52-a99e-4563-b935-c13370630646 task-name=Task-95601b52-a99e-4563-b935-c13370630646

And using snaptel task list, I can see there are metric hits, and the failure counter no longer increments:

$ snaptel task list
ID 					 NAME 						 STATE 		 HIT 	 MISS 	 FAIL 	 CREATED 		 LAST FAILURE
95601b52-a99e-4563-b935-c13370630646 	 Task-95601b52-a99e-4563-b935-c13370630646 	 Running 	 175 	 0 	 137 	 3:52PM 1-03-2017

If I terminate the fake opentsdb service, the failure shows up again, but it will recover and start to publish as soon as the service comes back online.

Here's the command to simulate the fake opentsdb service (gnu netcat, flags might differ depending on os):

while true; do echo -e "HTTP/1.0 200 OK\r\n\r\nWelcome." | /usr/local/bin/nc -l -p 4242 ; done

The netcat fake http service shows the data posted by the opentsdb publisher:

Host: 127.0.0.1:4242
User-Agent: Go-http-client/1.1
Content-Length: 1006
Content-Type: application/json
Accept-Encoding: gzip

[{"metric":"intel.psutil.load.load15","timestamp":1483490037,"value":2.81},{"metric":"intel.psutil.load.load5","timestamp":1483490037,"value":2.19},{"metric":"intel.psutil.load.load1","timestamp":1483490037,"value":1.91},{"metric":"intel.psutil.vm.available","timestamp":1483490037,"value":4746481664},{"metric":"intel.psutil.vm.free","timestamp":1483490037,"value":237142016},{"metric":"intel.psutil.vm.used","timestamp":1483490037,"value":12433387520}]

Based on this test, the behavior should be:

  1. when the opentsdb service is not available, the snap task keeps running, but both the hit and fail counters increment.
  2. when the opentsdb service becomes available, only the hit counter increments for the snap task, and you should see debug log messages indicating successful publishing.

Let me know if this doesn't match what you see, and how it behaves on your system.

@nanliu nanliu added the question label Jan 4, 2017
@gregleroux
Author

Hey @nanliu, I think it's related to intelsdi-x/snap#1448.
Since I opened this issue I have not encountered the problem again :(, so snap and the openTSDB plugin work fine.

I agree with your comment; that's what I see when the plugin works.
But when I get this message in the log:
error=connection is shut down task-id=ec78a8f7-701b-44d2-83a9-4436dec1390d
the openTSDB plugin does not recover, and snap keeps running (without publishing to opentsdb).

Let me know what I can do to help debug when it starts happening again.

@nanliu
Collaborator

nanliu commented Jan 4, 2017

So after a bunch of testing, I was able to reproduce a similar error message. If I kill the publisher plugin process several times, it starts generating the error message error="could not select a plugin". Under normal Snap operation, it likely got to this point because:

  1. plugin heartbeat missed
  2. plugin terminated after repeated heartbeat misses (3 consecutive)
  3. plugin no longer restarted because MaxPluginRestartCount was reached (3 max)

Right now we appear to have a hard limit of 3 plugin restarts. So in this case the task's max-failures=-1 doesn't stop the task, but the snap framework no longer attempts to restart the publisher plugin.
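The restart decision described above can be sketched as follows. This is illustrative pseudologic, not Snap's actual code; the function name is hypothetical, and the negative-limit case reflects the "set it to -1 to disable the cap" semantics discussed later in this thread:

```go
// Illustrative sketch (not Snap's actual code) of the restart policy above:
// a plugin killed after repeated heartbeat misses is restarted only while
// its restart count is below the limit; a negative limit means "no cap".
package main

import "fmt"

// shouldRestart is a hypothetical helper modeling the decision.
func shouldRestart(restartCount, maxRestarts int) bool {
	if maxRestarts < 0 {
		return true // a negative limit disables the cap entirely
	}
	return restartCount < maxRestarts
}

func main() {
	// With the hard-coded limit of 3, the plugin's fourth death is final.
	for count := 0; count <= 3; count++ {
		fmt.Printf("deaths=%d restart=%v\n", count, shouldRestart(count, 3))
	}
	fmt.Println(shouldRestart(100, -1)) // unlimited restarts
}
```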

To confirm this is the same issue you are seeing:

  1. The publisher plugin process is no longer running on the system when this occurs: pgrep opentsdb does not return a PID.
  2. The snap log indicates the plugin has already restarted three times: level=warning msg="plugin restarted" _block=handle-events _module=control-runner aplugin=9 event=opentsdb restart_count=3, and the failure occurs several consecutive heartbeat misses after restart_count=3.

I'll write an update in the Snap repo regarding how we plan to fix this. For now, you can change MaxPluginRestartCount to a large number (or the max int value) and recompile: https://github.com/intelsdi-x/snap/blame/master/control/runner.go#L59.

@nanliu
Collaborator

nanliu commented Jan 6, 2017

@gregleroux, please review intelsdi-x/snap#1456 and see whether setting the daemon control setting max_plugin_restarts to a high value resolves this. If you are able to try the latest build, you can set max_plugin_restarts: -1 to disable this behavior entirely. We also added a log message to indicate when a plugin has been disabled due to repeated failures:

level=warning msg="plugin disabled due to exceeding restart limit: 10"
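For reference, the daemon config fragment would look roughly like this (a sketch only; the exact nesting in the snapteld config file may differ, so check the config docs for your build):

```yaml
control:
  max_plugin_restarts: -1
```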

A linux test build of the latest version of snap is available at:
https://s3-us-west-2.amazonaws.com/snap.ci.snap-telemetry.io/snap/latest/linux/x86_64/snapteld

@gregleroux
Author

@nanliu, thanks, I'm testing this right now and will give you feedback ASAP.

@nanliu
Collaborator

nanliu commented Jan 12, 2017

@gregleroux let us know when you have an update from your test results.

@gregleroux
Author

@nanliu, it works for me: the plugin runs fine with no errors and never stops sending metrics :)

@nanliu
Collaborator

nanliu commented Jan 13, 2017

Perfect, thanks for the update!
