Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect node is behind #3373

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

yacovm
Copy link
Contributor

@yacovm yacovm commented Sep 9, 2024

Why this should be merged

Currently, the bootstrapping mechanism is only active upon node boot, but not at a later point.

This commit adds a mechanism that detects that our node is behind the majority of the stake.
The intent is to later have this mechanism be the trigger for the bootstrapping mechanism.

How this works

The mechanism works in the following manner:

  • It intercepts the snowman engine's Chits message handling, and upon every reception of the Chits message,
    the mechanism that detects if the node is a straggler (a node with a ledger height behind the rest) may be invoked,
    if it wasn't invoked too recently.
  • The mechanism draws statistics from the validators known to it, and computes the latest accepted block for each validator.
  • The mechanism then proceeds to determine which blocks are pending to be processed (a block pending to be processed was not accepted).
  • The mechanism then collects a snapshot of all blocks it hasn't accepted yet, and the amount of stake that has accepted this block.
  • The mechanism then waits for its next invocation, in order to see if it has accepted blocks correlated with enough stake.
  • If there is too much stake that has accepted blocks by other nodes correlated to it that the node hasn't accepted,
    then the mechanism announces the node is behind, and returns the time period between the two invocations.
  • The mechanism sums the total time it has detected the node is behind, until a sampling concludes it is not behind, and then the total time is nullified.

How this was tested

Apart from unit tests, I conducted an experiment by running a node on Fuji.
I blocked egress communication with the other nodes in the network, and let the node sit idle and useless for a while as it cannot communicate with any other node.

Then I let the node resume its communication to the other nodes in the network, and observed that it reported that it is behind the rest of the network, before it syncs up with the rest of the network.

[09-13|23:50:40.199] INFO health/worker.go:261 check started passing {"name": "health", "name": "bootstrapped", "tags": ["application"]}
[09-13|23:50:40.199] INFO health/worker.go:261 check started passing {"name": "readiness", "name": "bootstrapped", "tags": ["application"]}
[09-13|23:55:10.199] WARN health/worker.go:252 check started failing {"name": "health", "name": "X", "tags": ["11111111111111111111111111111111LpoYY"], "error": "not connected to enough stake: connected to 38.094644%; required at least 80.000000%"}
[09-13|23:55:10.199] WARN health/worker.go:252 check started failing {"name": "health", "name": "P", "tags": ["11111111111111111111111111111111LpoYY"], "error": "not connected to enough stake: connected to 38.094644%; required at least 80.000000%"}
[09-13|23:55:10.199] WARN health/worker.go:252 check started failing {"name": "health", "name": "C", "tags": ["11111111111111111111111111111111LpoYY"], "error": "not connected to enough stake: connected to 38.094644%; required at least 80.000000%"}
[09-14|00:00:12.154] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 1.004133461}
[09-14|00:00:13.156] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 1.004133461}
[09-14|00:00:14.167] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 2.014581489}
[09-14|00:00:15.171] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 2.014581489}
[09-14|00:00:16.172] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 3.016257856}
[09-14|00:00:17.174] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 3.016257856}
[09-14|00:00:18.176] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 4.019044379}
[09-14|00:00:22.244] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 1.005194513}
[09-14|00:00:23.250] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 1.005194513}
[09-14|00:00:24.257] INFO <C Chain> snowman/engine_decorator.go:29 We are behind the rest of the network {"seconds": 2.012517576}
[09-14|00:00:40.199] INFO health/worker.go:261 check started passing {"name": "health", "name": "C", "tags": ["11111111111111111111111111111111LpoYY"]}
[09-14|00:00:40.199] INFO health/worker.go:261 check started passing {"name": "health", "name": "X", "tags": ["11111111111111111111111111111111LpoYY"]}
[09-14|00:00:40.199] INFO health/worker.go:261 check started passing {"name": "health", "name": "P", "tags": ["11111111111111111111111111111111LpoYY"]}

@yacovm yacovm self-assigned this Sep 9, 2024
@yacovm yacovm marked this pull request as draft September 9, 2024 13:17
@yacovm yacovm force-pushed the detectWeAreBehind branch 17 times, most recently from 73769f0 to 4c8303c Compare September 14, 2024 00:04
@yacovm yacovm added the DO NOT MERGE This PR must not be merged in its current state label Sep 14, 2024
@yacovm yacovm marked this pull request as ready for review September 14, 2024 00:35
@yacovm yacovm changed the title [WIP] detect node is behind Detect node is behind Sep 14, 2024
@yacovm yacovm force-pushed the detectWeAreBehind branch 2 times, most recently from b2c9a19 to 91cc26e Compare September 16, 2024 18:36
snow/engine/snowman/engine_decorator.go Outdated Show resolved Hide resolved
snow/engine/snowman/straggler_detect.go Outdated Show resolved Hide resolved
snow/engine/snowman/straggler_detect.go Outdated Show resolved Hide resolved
snow/engine/snowman/straggler_detect.go Outdated Show resolved Hide resolved
snow/engine/snowman/straggler_detect.go Outdated Show resolved Hide resolved
snow/engine/snowman/straggler_detect_test.go Outdated Show resolved Hide resolved
snow/engine/snowman/straggler_detect_test.go Show resolved Hide resolved
snow/engine/snowman/straggler_detect_test.go Show resolved Hide resolved
snow/networking/handler/health.go Outdated Show resolved Hide resolved
snow/engine/snowman/engine_decorator.go Outdated Show resolved Hide resolved
This commit adds a mechanism that detects that our node is behind the majority of the stake.
The intent is to later have this mechanism be the trigger for the bootstrapping mechanism.
Currently, the bootstrapping mechanism is only active upon node boot, but not at a later point.

The mechanism works in the following manner:

- It intercepts the snowman engine's Chits message handling, and upon every reception of the Chits message,
  the mechanism that detects if the node is a straggler (a node with a ledger height behind the rest) may be invoked,
  if it wasn't invoked too recently.
- The mechanism draws statistics from the validators known to it, and computes the latest accepted block for each validator.
- The mechanism then proceeds to determine which blocks are pending to be processed (a block pending to be processed was not accepted).
- The mechanism then collects a snapshot of all blocks it hasn't accepted yet, and the amount of stake that has accepted this block.
- The mechanism then waits for its next invocation, in order to see if it has accepted blocks correlated with enough stake.
- If there is too much stake that has accepted blocks by other nodes correlated to it that the node hasn't accepted,
  then the mechanism announces the node is behind, and returns the time period between the two invocations.
- The mechanism sums the total time it has detected the node is behind, until a sampling concludes it is not behind, and then
  the total time is nullified.

Signed-off-by: Yacov Manevich <[email protected]>
Signed-off-by: Yacov Manevich <[email protected]>
snow/engine/snowman/straggler_detect_test.go Show resolved Hide resolved

validators := nodeWeights(sd.connectedValidators().List())

nodeWeight2lastAccepted := make(nodeWeights2Blocks, len(validators))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we change the name of this variable as well eg nodeWeightToLastAccepted?


ratio := float64(totalWeightWeKnowItsLastAcceptedBlock) / float64(totalValidatorWeight)

if ratio < 0.8 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use the constant here as well?

zap.Float64("ratio", ratio))
return nil, 0, 0
}
return nodeWeight2lastAccepted, totalValidatorWeight, totalWeightWeKnowItsLastAcceptedBlock
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we do the calculation of requiring 0.8 of the stake both in getNetworkSnapshot and getNetworkInfo, could this computation be simplified somehow?

f func(time.Duration)
}

func NewDecoratedEngine(e *Engine, time func() time.Time, f func(time.Duration)) common.Engine {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we call this something else like WithStragglerDetection or NewStragglerDetector, sine Decorated seems more generic.
Also seems we could unexport the decoratedEngine struct, since we're returning common.Engine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we do the calculation of requiring 0.8 of the stake both in getNetworkSnapshot and getNetworkInfo, could this computation be simplified somehow?

I can do it if you feel strongly about it, it's quite easy to do that.

However after thinking about it, I'm not sure that's ideal. I know the code can be made simpler by not repeating the computation, but, I structured it because I wanted the logs to distinguish between the following two cases:

  • We do not know yet the last accepted block for enough nodes (stake)
  • We know the last accepted block for enough nodes (stake) but most of them is exactly like our own last accepted block.

I can easily remove the code handling the first case, but then if we don't proceed checking if we're behind the rest, the logs may falsely state it is because our last accepted block is as the rest, but it might be that we simply haven't gathered enough info.

ids/node_weight.go Outdated Show resolved Hide resolved
f func(time.Duration)
}

func NewDecoratedEngineWithStragglerDetector(e *Engine, time func() time.Time, f func(time.Duration)) common.Engine {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use mockable.Clock instead of time func() time.Time if we require it for testing? We might not even need it to be here and mockable.Clock can be just a field in stragglerDetectorConfig and test can set it directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I changed the tests to use mockable.Clock, good idea.

However I kept the function pointer func() time.Time() as I don't want a mock to be part of the production code.

de.Engine.Config.Ctx.Log.Info("We are behind the rest of the network", zap.Float64("seconds", behindDuration.Seconds()))
}
de.Engine.metrics.stragglingDuration.Set(float64(behindDuration))
de.f(behindDuration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this for testing only? If so, I feel like this hook is equivalent of directly testing CheckIfWeAreStragglingBehind(). So not sure if we really need it here. If we really want to test this maybe we should test the set metrics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not for testing. This is a decorator, so it invokes the method it decorates but also invokes a function, which is pluggable. The pluggable function it invokes, if this f.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still slightly think this is a bit extra (since we don't actually use it in production). but if you think we should keep it, can we rename it and add comment so that it's visible for consumers?

getTime func() time.Time,
log logging.Logger,
minConfirmationThreshold float64,
lastAccepted func() (ids.ID, uint64),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we rather take tracker.Accepted interface here? (or any sub-interface that provides this function)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is an interface superior to a function pointer in this case? If I put an interface I would need to make a mock structure for it in the tests. Can you please explain?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally interfaces provides more "defined" and flexible structures rather than pointer functions. I don't particularly like function pointers because they lock the function signature and all functions expecting the function pointer needs to be changed if it is modified. Also Mocking complex functions as a single interface can be much more easily done through mockgens. So in the end it's all readability, code standards and flexibility. IMHO having one or very specific pointer function is fine, but repeating it for the sake of easy testing is generally an issue.

log logging.Logger,
minConfirmationThreshold float64,
lastAccepted func() (ids.ID, uint64),
connectedValidators func() set.Set[ids.NodeWeight],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we take a single tracker.Peers interface for this and below (connectedPercent) funcs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some changes to the struct, let me know if it looks better now.

snow/engine/snowman/straggler_detect.go Outdated Show resolved Hide resolved
newProcessingWeight, err := safemath.Add(processingWeight, nw.Weight)
if err != nil {
sd.log.Error("Cumulative weight overflow", zap.Uint64("cumulative", processingWeight), zap.Uint64("added", nw.Weight))
return false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we instead just return err and handle it in either CheckIfWeAreStragglingBehind or again return err from there and handle wherever CheckIfWeAreStragglingBehind is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error means the total weight of the subnet has overflowed uint64. I don't think it's worth to change the API to propagate this further up, but if you feel strongly about it, I'll change it accordingly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say we can return err from failedCatchingUp and handle in CheckIfWeAreStragglingBehind, but really does not super matter if we dont want to return err from CheckIfWeAreStragglingBehind

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the failedCatchingUp method (now called areWeBehindTheRest ) returns true or false whether it managed to detect the caller node is behind. Handling an error returned by it by the calling method is simply logging it. Therefore I don't see why we need to propagate the error further upwards, if we can handle it at a lower level and simplify the method signature of failedCatchingUp.


res.connStakePercent = sd.connectedPercent()

validators := nodeWeights(sd.connectedValidators().List())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems racy with connectedPercent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does not matter if the validators change after the connectedPercent was read.

We only use the connected stake to know if to abort early or continue the assessment.

Let's look at two possibilities:

  1. The validators have more stake than connected percent, and we abort early in validateNetInfo because connected percent was below the threshold: Then, we will simply not detect whether we are behind in this probe. This scheduling is equivalent to a scheduling where we read the connected percent and the validators in closer temporal proximity.
  2. We read a connected percent above the threshold, but afterwards, we read validators such that the connected percent is below the threshold: This is equivalent to the logic ignoring the connected percent altogether, and just relying on the data from connectedValidators().

Initially, in the first version of this PR, the connected percent was the first thing we compute, and we wouldn't continue computing anything if it was too low.

However, after the first code review, I changed the computation to first collect data, and then validate it.

I can remove the connected percent, as it might make the implementation simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd not oppose to have a simpler implementation if you say this is not ultra necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it in the follow-up commit

snow/engine/snowman/straggler_detect.go Show resolved Hide resolved
return nodeWeights(maps.Keys(nwb)).totalWeight()
}

func dropHeight(f func() (ids.ID, uint64)) func() ids.ID {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is a bit difficult to follow, and not sure if we %100 need it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add comment or prefably drop the function? IMHO we should be using tracker.Accepted interface anyway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove this function? Seems like overkill to use in a single place

@yacovm yacovm force-pushed the detectWeAreBehind branch 2 times, most recently from f04499f to 78901b7 Compare September 28, 2024 00:41
Signed-off-by: Yacov Manevich <[email protected]>
Copy link
Collaborator

@aaronbuchwald aaronbuchwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments

I think the case we'd care about most is when the chain has fallen sufficiently far behind that the last accepted block reported by most peers is a block that we have not verified or marked as processing. This seems to only catch the case that we have a long processing chain that is up to date with the latest accepted block of the network.

Comment on lines +45 to +48
stragglingDuration: prometheus.NewGauge(prometheus.GaugeOpts{
Name: "straggling_duration",
Help: "For how long we have been straggling behind the rest, in nano-seconds.",
}),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we put this at the end ?

@@ -37,10 +36,10 @@ type Peers interface {
SampleValidator() (ids.NodeID, bool)
// GetValidators returns the set of all validators
// known to this peer manager
GetValidators() set.Set[ids.NodeID]
GetValidators() set.Set[ids.NodeWeight]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return a map instead of set if we need both? The nodeIDs should be unique within the set, but this change makes it possible to add the nodeID twice with two different weights. This shouldn't happen because we're creating it on demand, but it seems like an odd use of set to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to use the validator set type on the snowman config type rather than making this change to Peers ? https://github.com/ava-labs/avalanchego/blob/master/snow/engine/snowman/config.go#L23

Comment on lines +1348 to +1349
engine := smeng.NewDecoratedEngineWithStragglerDetector(sme, time.Now, func(_ time.Duration) {})

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we include a comment on why the anonymous function here is empty?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make this a decorator instead of part of the engine?

return nodeWeights(maps.Keys(nwb)).totalWeight()
}

func dropHeight(f func() (ids.ID, uint64)) func() ids.ID {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we remove this function? Seems like overkill to use in a single place


// Ensure we have collected last accepted blocks that are not our own last accepted block.
nodeWeightToLastAccepted = nodeWeightToLastAccepted.filter(func(id ids.ID) bool {
return ourLastAcceptedBlock.Compare(id) != 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ourLastAcceptedBlock.Compare(id) != 0
return ourLastAcceptedBlock != id


totalValidatorWeight, err := validators.totalWeight()
if err != nil {
s.log.Error("Failed computing total weight", zap.Error(err))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than emitting error logs in each place where we return an error, would it make sense to leave logging the error to the caller? Seems this is only called from one place.

return res, nil
}

type netInfo struct {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we include some comments explaining the phases of filtering?

The process seems to be:

  1. Gather the last accepted block from each validator
  2. Filter out any blocks that match our own last accepted block in getNetworkInfo
  3. Filter out any blocks that we have marked as processing in areWeBehindTheRest

but this requires reading from the bottom to top of the file to understand the flow.

How do we handle blocks that are accepted by the network, but we have not added to processing yet because we have not fetched the full ancestry?

Since we are using consensus.Processing(...), we'll only count blocks that we have already verified, but I thought the intent behind this PR was to detect scenarios where we are sufficiently far behind that the blocks reported as the last accepted block by our peers are far enough ahead that we have not verified them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DO NOT MERGE This PR must not be merged in its current state
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants