sometimes metrictank pods that use the kafka-mdm plugin remain not ready because they return 503 on /, despite having all of their backlog replayed
2018/06/20 19:53:15 [DEBUG] memberlist: Initiating push/pull sync with: 172.22.128.7:7946
[Macaron] 2018-06-20 19:53:18: Started GET / for 172.19.64.0
[Macaron] 2018-06-20 19:53:18: Completed / 503 Service Unavailable in 867.431µs
[Macaron] 2018-06-20 19:53:18: Started GET /node for 172.19.64.0
[Macaron] 2018-06-20 19:53:18: Completed /node 200 OK in 906.048µs
[Macaron] 2018-06-20 19:53:28: Started GET / for 172.19.64.0
[Macaron] 2018-06-20 19:53:28: Started GET /node for 172.19.64.0
[Macaron] 2018-06-20 19:53:28: Completed / 503 Service Unavailable in 809.084µs
[Macaron] 2018-06-20 19:53:28: Completed /node 200 OK in 854.825µs
however their priority is good: metrictank.stats.$environment.$instance.cluster.self.priority.gauge32 is at 0, and:
curl doesn't show the body, but:
so that confirms we trigger the 503 in appStatus:
which means this returns false:
as we can see in the curl http://localhost:6060/node output above, the problem is .State
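to make that chain concrete, here's a minimal sketch of the kind of check being described: the handler behind / answers 503 until the local node's State is ready. this uses plain net/http and made-up names (Node, localNode), not metrictank's actual Macaron handler:

```go
package main

import "net/http"

// illustrative node state; metrictank's real type lives in its cluster package
type NodeState int

const (
	StateNotReady NodeState = iota
	StateReady
)

type Node struct {
	State NodeState
}

// localNode stands in for the cluster manager's view of this instance
var localNode = &Node{State: StateNotReady}

// appStatus returns 503 until the node's State is ready; this is the 503
// seen in the Macaron log above
func appStatus(w http.ResponseWriter, r *http.Request) {
	if localNode.State != StateReady {
		http.Error(w, "node not ready", http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", appStatus)
	http.ListenAndServe(":6060", nil)
}
```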
however, metrictank.stats.$environment.$instance.cluster.self.state.ready.gauge1 becomes 1 shortly after startup.
and from looking at the code, that metric (nodeReady) should always be in sync with the .State property on the node.
the one exception i can see is in MemberlistManager.NotifyUpdate(): an update may come in for a node and we'll update the state of that node, but if that node has the same name as our nodeName, we could set its state to something else without updating the nodeReady metric.
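a condensed sketch of that suspected path, with assumed type and field names (clusterNode, members, Updated) rather than metrictank's real ones: the incoming state is stored for every node, including the local one, and nothing in this path touches the nodeReady gauge:

```go
package main

import (
	"sync"
	"time"
)

type nodeState int

const (
	stateNotReady nodeState = iota
	stateReady
)

// clusterNode is an illustrative stand-in for the node info gossiped around
type clusterNode struct {
	Name    string
	State   nodeState
	Updated time.Time
}

type MemberlistManager struct {
	sync.Mutex
	nodeName string                 // name of the local node
	members  map[string]clusterNode // last known state per node, keyed by name
}

// NotifyUpdate stores whatever state the gossip update carries. When
// update.Name == m.nodeName the local node's State gets overwritten too,
// but the nodeReady metric (set elsewhere, when the node marks itself
// ready) is not touched here, so the two can diverge.
func (m *MemberlistManager) NotifyUpdate(update clusterNode) {
	m.Lock()
	defer m.Unlock()
	m.members[update.Name] = update
}

func main() {
	m := &MemberlistManager{nodeName: "node-a", members: map[string]clusterNode{}}
	// the local node has marked itself ready...
	m.members["node-a"] = clusterNode{Name: "node-a", State: stateReady, Updated: time.Now()}
	// ...then a stale gossip update arrives that still says not-ready,
	// and it overwrites the ready state without the metric noticing
	m.NotifyUpdate(clusterNode{Name: "node-a", State: stateNotReady, Updated: time.Now().Add(-time.Minute)})
}
```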
@woodsaj since you're much more familiar with that code, can you think of scenarios where MemberlistManager.NotifyUpdate() may receive an update about itself (the local node) being non-ready? it seems like a race condition where the node marks itself as ready but shortly afterwards gets a stale update from when it was not-ready.
a race condition seems like the most likely issue. We can solve this easily enough by using the "updated" timestamp to drop updates that are older than the state we already have.
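a minimal sketch of that timestamp guard, reusing the illustrative types from the sketch above (the Updated field name is an assumption):

```go
// NotifyUpdate, amended to drop updates that are not newer than what we
// already have for that node
func (m *MemberlistManager) NotifyUpdate(update clusterNode) {
	m.Lock()
	defer m.Unlock()
	if existing, ok := m.members[update.Name]; ok && !update.Updated.After(existing.Updated) {
		// stale gossip: we already hold newer (or equal) information, drop it
		return
	}
	m.members[update.Name] = update
}
```

dropping anything not strictly newer than what we already hold means a stale gossip message about the pre-ready state can no longer overwrite the local node's ready state.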