Detect node is behind #3373

yacovm · 2024-09-09T13:17:36Z

Why this should be merged

Currently, the bootstrapping mechanism is only active upon node boot, but not at a later point.

This commit adds a mechanism that detects that our node is behind the majority of the stake.
The intent is to later have this mechanism be the trigger for the bootstrapping mechanism.

How this works

The mechanism works in the following manner:

It intercepts the snowman engine's Chits message handling, and upon every reception of the Chits message,
the mechanism that detects if the node is a straggler (a node with a ledger height behind the rest) may be invoked,
if it wasn't invoked too recently.
The mechanism draws statistics from the validators known to it, and computes the latest accepted block for each validator.
The mechanism then proceeds to determine which blocks are pending to be processed (a block pending to be processed was not accepted).
The mechanism then collects a snapshot of all blocks it hasn't accepted yet, and the amount of stake that has accepted this block.
The mechanism then waits for its next invocation, in order to see if it has accepted blocks correlated with enough stake.
If there is too much stake that has accepted blocks by other nodes correlated to it that the node hasn't accepted,
then the mechanism announces the node is behind, and returns the time period between the two invocations.
The mechanism sums the total time it has detected the node is behind, until a sampling concludes it is not behind, and then the total time is nullified.

How this was tested

Apart from unit tests, I conducted an experiment by running a node on Fuji.
I blocked egress communication with the other nodes in the network, and let the node sit idle and useless for a while as it cannot communicate with any other node.

Then I let the node resume its communication to the other nodes in the network, and observed that it reported that it is behind the rest of the network, before it syncs up with the rest of the network.

[09-13|23:50:40.199] INFO health/worker.go:261 check started passing {"name": "health", "name": "bootstrapped", "tags": ["application"]}
[09-13|23:50:40.199] INFO health/worker.go:261 check started passing {"name": "readiness", "name": "bootstrapped", "tags": ["application"]}
[09-13|23:55:10.199] WARN health/worker.go:252 check started failing {"name": "health", "name": "X", "tags": ["11111111111111111111111111111111LpoYY"], "error": "not connected to enough stake: connected to 38.094644%; required at least 80.000000%"}
[09-13|23:55:10.199] WARN health/worker.go:252 check started failing {"name": "health", "name": "P", "tags": ["11111111111111111111111111111111LpoYY"], "error": "not connected to enough stake: connected to 38.094644%; required at least 80.000000%"}
[09-13|23:55:10.199] WARN health/worker.go:252 check started failing {"name": "health", "name": "C", "tags": ["11111111111111111111111111111111LpoYY"], "error": "not connected to enough stake: connected to 38.094644%; required at least 80.000000%"}
[09-14|00:00:12.154] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 1.004133461}
[09-14|00:00:13.156] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 1.004133461}
[09-14|00:00:14.167] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 2.014581489}
[09-14|00:00:15.171] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 2.014581489}
[09-14|00:00:16.172] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 3.016257856}
[09-14|00:00:17.174] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 3.016257856}
[09-14|00:00:18.176] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 4.019044379}
[09-14|00:00:22.244] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 1.005194513}
[09-14|00:00:23.250] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 1.005194513}
[09-14|00:00:24.257] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 2.012517576}
[09-14|00:00:40.199] INFO health/worker.go:261 check started passing {"name": "health", "name": "C", "tags": ["11111111111111111111111111111111LpoYY"]}
[09-14|00:00:40.199] INFO health/worker.go:261 check started passing {"name": "health", "name": "X", "tags": ["11111111111111111111111111111111LpoYY"]}
[09-14|00:00:40.199] INFO health/worker.go:261 check started passing {"name": "health", "name": "P", "tags": ["11111111111111111111111111111111LpoYY"]}

snow/engine/snowman/engine_decorator.go

snow/engine/snowman/straggler_detect.go

snow/engine/snowman/straggler_detect_test.go

snow/networking/handler/health.go

snow/engine/snowman/engine_decorator.go

This commit adds a mechanism that detects that our node is behind the majority of the stake. The intent is to later have this mechanism be the trigger for the bootstrapping mechanism. Currently, the bootstrapping mechanism is only active upon node boot, but not at a later point. The mechanism works in the following manner: - It intercepts the snowman engine's Chits message handling, and upon every reception of the Chits message, the mechanism that detects if the node is a straggler (a node with a ledger height behind the rest) may be invoked, if it wasn't invoked too recently. - The mechanism draws statistics from the validators known to it, and computes the latest accepted block for each validator. - The mechanism then proceeds to determine which blocks are pending to be processed (a block pending to be processed was not accepted). - The mechanism then collects a snapshot of all blocks it hasn't accepted yet, and the amount of stake that has accepted this block. - The mechanism then waits for its next invocation, in order to see if it has accepted blocks correlated with enough stake. - If there is too much stake that has accepted blocks by other nodes correlated to it that the node hasn't accepted, then the mechanism announces the node is behind, and returns the time period between the two invocations. - The mechanism sums the total time it has detected the node is behind, until a sampling concludes it is not behind, and then the total time is nullified. Signed-off-by: Yacov Manevich <[email protected]>

Signed-off-by: Yacov Manevich <[email protected]>

snow/engine/snowman/straggler_detect_test.go

darioush · 2024-09-26T14:21:11Z

snow/engine/snowman/straggler_detect.go

+
+	validators := nodeWeights(sd.connectedValidators().List())
+
+	nodeWeight2lastAccepted := make(nodeWeights2Blocks, len(validators))


could we change the name of this variable as well eg nodeWeightToLastAccepted?

darioush · 2024-09-26T14:21:56Z

snow/engine/snowman/straggler_detect.go

+
+	ratio := float64(totalWeightWeKnowItsLastAcceptedBlock) / float64(totalValidatorWeight)
+
+	if ratio < 0.8 {


should we use the constant here as well?

darioush · 2024-09-26T14:25:21Z

snow/engine/snowman/straggler_detect.go

+			zap.Float64("ratio", ratio))
+		return nil, 0, 0
+	}
+	return nodeWeight2lastAccepted, totalValidatorWeight, totalWeightWeKnowItsLastAcceptedBlock


It seems we do the calculation of requiring 0.8 of the stake both in getNetworkSnapshot and getNetworkInfo, could this computation be simplified somehow?

darioush · 2024-09-26T14:26:50Z

snow/engine/snowman/engine_decorator.go

+	f  func(time.Duration)
+}
+
+func NewDecoratedEngine(e *Engine, time func() time.Time, f func(time.Duration)) common.Engine {


should we call this something else like WithStragglerDetection or NewStragglerDetector, sine Decorated seems more generic.
Also seems we could unexport the decoratedEngine struct, since we're returning common.Engine

It seems we do the calculation of requiring 0.8 of the stake both in getNetworkSnapshot and getNetworkInfo, could this computation be simplified somehow?

I can do it if you feel strongly about it, it's quite easy to do that.

However after thinking about it, I'm not sure that's ideal. I know the code can be made simpler by not repeating the computation, but, I structured it because I wanted the logs to distinguish between the following two cases:

We do not know yet the last accepted block for enough nodes (stake)

We know the last accepted block for enough nodes (stake) but most of them is exactly like our own last accepted block.

I can easily remove the code handling the first case, but then if we don't proceed checking if we're behind the rest, the logs may falsely state it is because our last accepted block is as the rest, but it might be that we simply haven't gathered enough info.

Signed-off-by: Yacov Manevich <[email protected]>

ids/node_weight.go

ceyonur · 2024-09-27T11:24:43Z

snow/engine/snowman/engine_decorator.go

+	f  func(time.Duration)
+}
+
+func NewDecoratedEngineWithStragglerDetector(e *Engine, time func() time.Time, f func(time.Duration)) common.Engine {


can we use mockable.Clock instead of time func() time.Time if we require it for testing? We might not even need it to be here and mockable.Clock can be just a field in stragglerDetectorConfig and test can set it directly.

Sure, I changed the tests to use mockable.Clock, good idea.

However I kept the function pointer func() time.Time() as I don't want a mock to be part of the production code.

ceyonur · 2024-09-27T11:30:01Z

snow/engine/snowman/engine_decorator.go

+		de.Engine.Config.Ctx.Log.Info("We are behind the rest of the network", zap.Float64("seconds", behindDuration.Seconds()))
+	}
+	de.Engine.metrics.stragglingDuration.Set(float64(behindDuration))
+	de.f(behindDuration)


is this for testing only? If so, I feel like this hook is equivalent of directly testing CheckIfWeAreStragglingBehind(). So not sure if we really need it here. If we really want to test this maybe we should test the set metrics?

This is not for testing. This is a decorator, so it invokes the method it decorates but also invokes a function, which is pluggable. The pluggable function it invokes, if this f.

I still slightly think this is a bit extra (since we don't actually use it in production). but if you think we should keep it, can we rename it and add comment so that it's visible for consumers?

ceyonur · 2024-09-27T11:46:32Z

snow/engine/snowman/straggler_detect.go

+	getTime func() time.Time,
+	log logging.Logger,
+	minConfirmationThreshold float64,
+	lastAccepted func() (ids.ID, uint64),


can we rather take tracker.Accepted interface here? (or any sub-interface that provides this function)

Why is an interface superior to a function pointer in this case? If I put an interface I would need to make a mock structure for it in the tests. Can you please explain?

Generally interfaces provides more "defined" and flexible structures rather than pointer functions. I don't particularly like function pointers because they lock the function signature and all functions expecting the function pointer needs to be changed if it is modified. Also Mocking complex functions as a single interface can be much more easily done through mockgens. So in the end it's all readability, code standards and flexibility. IMHO having one or very specific pointer function is fine, but repeating it for the sake of easy testing is generally an issue.

ceyonur · 2024-09-27T11:47:15Z

snow/engine/snowman/straggler_detect.go

+	log logging.Logger,
+	minConfirmationThreshold float64,
+	lastAccepted func() (ids.ID, uint64),
+	connectedValidators func() set.Set[ids.NodeWeight],


can we take a single tracker.Peers interface for this and below (connectedPercent) funcs?

I made some changes to the struct, let me know if it looks better now.

snow/engine/snowman/straggler_detect.go

ceyonur · 2024-09-27T13:24:33Z

snow/engine/snowman/straggler_detect.go

+			newProcessingWeight, err := safemath.Add(processingWeight, nw.Weight)
+			if err != nil {
+				sd.log.Error("Cumulative weight overflow", zap.Uint64("cumulative", processingWeight), zap.Uint64("added", nw.Weight))
+				return false


should we instead just return err and handle it in either CheckIfWeAreStragglingBehind or again return err from there and handle wherever CheckIfWeAreStragglingBehind is called.

This error means the total weight of the subnet has overflowed uint64. I don't think it's worth to change the API to propagate this further up, but if you feel strongly about it, I'll change it accordingly.

I'd say we can return err from failedCatchingUp and handle in CheckIfWeAreStragglingBehind, but really does not super matter if we dont want to return err from CheckIfWeAreStragglingBehind

the failedCatchingUp method (now called areWeBehindTheRest ) returns true or false whether it managed to detect the caller node is behind. Handling an error returned by it by the calling method is simply logging it. Therefore I don't see why we need to propagate the error further upwards, if we can handle it at a lower level and simplify the method signature of failedCatchingUp.

ceyonur · 2024-09-27T13:35:05Z

snow/engine/snowman/straggler_detect.go

+
+	res.connStakePercent = sd.connectedPercent()
+
+	validators := nodeWeights(sd.connectedValidators().List())


This seems racy with connectedPercent.

it does not matter if the validators change after the connectedPercent was read.

We only use the connected stake to know if to abort early or continue the assessment.

Let's look at two possibilities:

The validators have more stake than connected percent, and we abort early in validateNetInfo because connected percent was below the threshold: Then, we will simply not detect whether we are behind in this probe. This scheduling is equivalent to a scheduling where we read the connected percent and the validators in closer temporal proximity.

We read a connected percent above the threshold, but afterwards, we read validators such that the connected percent is below the threshold: This is equivalent to the logic ignoring the connected percent altogether, and just relying on the data from connectedValidators().

Initially, in the first version of this PR, the connected percent was the first thing we compute, and we wouldn't continue computing anything if it was too low.

However, after the first code review, I changed the computation to first collect data, and then validate it.

I can remove the connected percent, as it might make the implementation simpler.

I'd not oppose to have a simpler implementation if you say this is not ultra necessary.

I removed it in the follow-up commit

snow/engine/snowman/straggler_detect.go

ceyonur · 2024-09-27T13:50:14Z

snow/engine/snowman/straggler_detect.go

+	return nodeWeights(maps.Keys(nwb)).totalWeight()
+}
+
+func dropHeight(f func() (ids.ID, uint64)) func() ids.ID {


this function is a bit difficult to follow, and not sure if we %100 need it.

can we add comment or prefably drop the function? IMHO we should be using tracker.Accepted interface anyway

I added a comment.

Could we remove this function? Seems like overkill to use in a single place

Signed-off-by: Yacov Manevich <[email protected]>

aaronbuchwald

Left a few comments

I think the case we'd care about most is when the chain has fallen sufficiently far behind that the last accepted block reported by most peers is a block that we have not verified or marked as processing. This seems to only catch the case that we have a long processing chain that is up to date with the latest accepted block of the network.

aaronbuchwald · 2024-10-01T11:58:26Z

snow/engine/snowman/metrics.go

+		stragglingDuration: prometheus.NewGauge(prometheus.GaugeOpts{
+			Name: "straggling_duration",
+			Help: "For how long we have been straggling behind the rest, in nano-seconds.",
+		}),


nit: could we put this at the end ?

aaronbuchwald · 2024-10-01T12:04:12Z

snow/engine/common/tracker/peers.go

@@ -37,10 +36,10 @@ type Peers interface {
 	SampleValidator() (ids.NodeID, bool)
 	// GetValidators returns the set of all validators
 	// known to this peer manager
-	GetValidators() set.Set[ids.NodeID]
+	GetValidators() set.Set[ids.NodeWeight]


Should we return a map instead of set if we need both? The nodeIDs should be unique within the set, but this change makes it possible to add the nodeID twice with two different weights. This shouldn't happen because we're creating it on demand, but it seems like an odd use of set to me.

Would it make sense to use the validator set type on the snowman config type rather than making this change to Peers ? https://github.com/ava-labs/avalanchego/blob/master/snow/engine/snowman/config.go#L23

aaronbuchwald · 2024-10-01T13:18:39Z

chains/manager.go

+	engine := smeng.NewDecoratedEngineWithStragglerDetector(sme, time.Now, func(_ time.Duration) {})
+


Could we include a comment on why the anonymous function here is empty?

Why make this a decorator instead of part of the engine?

aaronbuchwald · 2024-10-01T15:03:41Z

snow/engine/snowman/straggler_detect.go

+	return nodeWeights(maps.Keys(nwb)).totalWeight()
+}
+
+func dropHeight(f func() (ids.ID, uint64)) func() ids.ID {


Could we remove this function? Seems like overkill to use in a single place

aaronbuchwald · 2024-10-01T15:08:30Z

snow/engine/snowman/straggler_detect.go

+
+	// Ensure we have collected last accepted blocks that are not our own last accepted block.
+	nodeWeightToLastAccepted = nodeWeightToLastAccepted.filter(func(id ids.ID) bool {
+		return ourLastAcceptedBlock.Compare(id) != 0


Suggested change

return ourLastAcceptedBlock.Compare(id) != 0

return ourLastAcceptedBlock != id

aaronbuchwald · 2024-10-01T15:09:40Z

snow/engine/snowman/straggler_detect.go

+
+	totalValidatorWeight, err := validators.totalWeight()
+	if err != nil {
+		s.log.Error("Failed computing total weight", zap.Error(err))


Rather than emitting error logs in each place where we return an error, would it make sense to leave logging the error to the caller? Seems this is only called from one place.

aaronbuchwald · 2024-10-01T15:24:10Z

snow/engine/snowman/straggler_detect.go

+	return res, nil
+}
+
+type netInfo struct {


Could we include some comments explaining the phases of filtering?

The process seems to be:

Gather the last accepted block from each validator

Filter out any blocks that match our own last accepted block in getNetworkInfo

Filter out any blocks that we have marked as processing in areWeBehindTheRest

but this requires reading from the bottom to top of the file to understand the flow.

How do we handle blocks that are accepted by the network, but we have not added to processing yet because we have not fetched the full ancestry?

Since we are using consensus.Processing(...), we'll only count blocks that we have already verified, but I thought the intent behind this PR was to detect scenarios where we are sufficiently far behind that the blocks reported as the last accepted block by our peers are far enough ahead that we have not verified them.

yacovm self-assigned this Sep 9, 2024

yacovm requested a review from StephenButtolph as a code owner September 9, 2024 13:17

yacovm marked this pull request as draft September 9, 2024 13:17

yacovm force-pushed the detectWeAreBehind branch 17 times, most recently from 73769f0 to 4c8303c Compare September 14, 2024 00:04

yacovm added the DO NOT MERGE This PR must not be merged in its current state label Sep 14, 2024

yacovm force-pushed the detectWeAreBehind branch from 4c8303c to 9ef0c38 Compare September 14, 2024 00:05

yacovm marked this pull request as ready for review September 14, 2024 00:35

yacovm changed the title ~~[WIP] detect node is behind~~ Detect node is behind Sep 14, 2024

yacovm force-pushed the detectWeAreBehind branch 2 times, most recently from b2c9a19 to 91cc26e Compare September 16, 2024 18:36

michaelkaplan13 requested review from marun, ceyonur and tsachiherman September 24, 2024 16:14

darioush reviewed Sep 24, 2024

View reviewed changes

yacovm force-pushed the detectWeAreBehind branch from abda6fd to f732a03 Compare September 24, 2024 19:02

Address code review comments

670aa53

Signed-off-by: Yacov Manevich <[email protected]>

yacovm force-pushed the detectWeAreBehind branch from f732a03 to 670aa53 Compare September 24, 2024 19:06

darioush reviewed Sep 26, 2024

View reviewed changes

yacovm added 2 commits September 26, 2024 17:35

Address code review comments II

3c21d9e

Signed-off-by: Yacov Manevich <[email protected]>

Address code review comments III

f355866

Signed-off-by: Yacov Manevich <[email protected]>

marun mentioned this pull request Sep 27, 2024

Implement traversal based early termination #3337

Open

ceyonur reviewed Sep 27, 2024

View reviewed changes

yacovm force-pushed the detectWeAreBehind branch 2 times, most recently from f04499f to 78901b7 Compare September 28, 2024 00:41

Address code review comments IV

ce5cb7c

Signed-off-by: Yacov Manevich <[email protected]>

yacovm force-pushed the detectWeAreBehind branch from 78901b7 to ce5cb7c Compare September 28, 2024 00:45

aaronbuchwald requested changes Oct 1, 2024

View reviewed changes


		validators := nodeWeights(sd.connectedValidators().List())

		nodeWeight2lastAccepted := make(nodeWeights2Blocks, len(validators))


		ratio := float64(totalWeightWeKnowItsLastAcceptedBlock) / float64(totalValidatorWeight)

		if ratio < 0.8 {


		res.connStakePercent = sd.connectedPercent()

		validators := nodeWeights(sd.connectedValidators().List())

		engine := smeng.NewDecoratedEngineWithStragglerDetector(sme, time.Now, func(_ time.Duration) {})

	return ourLastAcceptedBlock.Compare(id) != 0
	return ourLastAcceptedBlock != id

Detect node is behind #3373

Are you sure you want to change the base?

Detect node is behind #3373

Conversation

yacovm commented Sep 9, 2024 • edited Loading

Why this should be merged

How this works

How this was tested

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aaronbuchwald left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yacovm commented Sep 9, 2024 •

edited

Loading