Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Improve P2P event handling #1388

Merged
merged 2 commits into from
Apr 24, 2023
Merged

Conversation

fredcarle
Copy link
Collaborator

@fredcarle fredcarle commented Apr 21, 2023

Relevant issue(s)

Resolves #1334
Resolves #1335
Resolves #1336
Resolves #1337
Resolves #1338
Resolves #1375

Description

This PR seems to have fixed all known flakyness from the P2P integration tests. It does so at different levels:

  1. We prevent pre peer connection and pre replicator configuration actions from having their events received by the wait function by adding a 100ms sleep call pre setup.
  2. We ensure that the wait go routines are ready before continuing with the actions.
  3. We add a docQueue to the PushLog function so that we don't handle multiple updates to the same document at once. This helps avoid unnecessary transaction conflicts. This would also sometimes cause the PushLog function to hang and the RPC call hitting a timeout.
  4. We handle pubsub topic subscriptions a little better by subscribing only to the collection OR the documents. Subscribing to documents and the collection would cause double the event handling.

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

Integration test (a 6 hour run without errors)

Specify the platform(s) on which this was tested:

  • Github CI
  • MacOS

@fredcarle fredcarle added area/testing Related to any test or testing suite area/p2p Related to the p2p networking system action/no-benchmark Skips the action that runs the benchmark. labels Apr 21, 2023
@fredcarle fredcarle added this to the DefraDB v0.5.1 milestone Apr 21, 2023
@fredcarle fredcarle requested a review from a team April 21, 2023 18:00
@fredcarle fredcarle self-assigned this Apr 21, 2023
@@ -140,6 +140,9 @@ func connectPeers(
nodes []*node.Node,
addresses []string,
) chan struct{} {
// If we have some database actions prior to connecting the peers, we want to ensure that they had time to
// complete before we connect. Otherwise we might wrongly catch them in our wait function.
time.Sleep(100 * time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: Nice spot - I didn't notice this looking through your wip branch earlier :)

@codecov
Copy link

codecov bot commented Apr 21, 2023

Codecov Report

Merging #1388 (93b74ce) into develop (8f853d2) will decrease coverage by 0.11%.
The diff coverage is 51.32%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1388      +/-   ##
===========================================
- Coverage    70.76%   70.65%   -0.11%     
===========================================
  Files          185      185              
  Lines        17884    17934      +50     
===========================================
+ Hits         12655    12671      +16     
- Misses        4279     4305      +26     
- Partials       950      958       +8     
Impacted Files Coverage Δ
net/peer.go 49.01% <26.00%> (-1.87%) ⬇️
net/server.go 64.06% <71.42%> (+0.51%) ⬆️

... and 2 files with indirect coverage changes

Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great work Fred - thanks a whole bunch for sorting these out :)

Left a couple of small comments which would be good to resolve before merge.

@@ -629,10 +627,16 @@ func (p *Peer) handleDocCreateLog(evt events.Update) error {
return errors.Wrap("failed to get DocKey from broadcast message", err)
}

// We need to register the document before pushing to the replicators if we want to
// ensure that we have subscribed to the topic.
err = p.RegisterNewDocument(p.ctx, dockey, evt.Cid, evt.Block, evt.SchemaID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: This change is not noted in the description, is it still needed, or is this left here accidentally?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still needed for the situation where the target would receive the push to replicator and then the target getting an update before the source registers to the topic. This probably would never happen in real life but it did happen in the tests.

@@ -804,6 +808,30 @@ func (p *Peer) AddP2PCollections(collections []string) error {
addedTopics = append(addedTopics, col)
}

// If adding the collection topics succeeds, we remove the collections' documents
// from the pubsub topics to avoid receiving duplicate events.
for _, col := range collections {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: This is potentially very expensive - how long do you expect this code to live? It might be worth noting this in the public documentation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment I can't think of a better way to handle this. It would only be expensive if a dev decides to subscribe to a given collection after it has millions of documents. It's also a one-off event if someone was to do it so I don't think we need to worry about speed too much. Not at this time anyways.

I can add a warning to the doc though.

return err
}
for key := range keyChan {
err := p.server.removePubSubTopic(key.Key.String())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This will not be rolled back if the txn commit fails, but it should be. I do not see this as a new issue however (just expanded), as the addedTopics block also suffers from this. Strongly suggest creating a ticket to roll this back properly on commit error (might already be one, I'm not sure though).

Same issue exists in RemoveP2PCollections

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ticket opened 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cheers :)

if err != nil {
log.Info(ctx, "could not decode the peer id of the log creator", logging.NewKV("Error", err.Error()))
}
err = s.pushLogEmitter.Emit(EvtReceivedPushLog{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: I like that this is in a defer now

@@ -307,7 +317,7 @@ func configureReplicator(
docIDsSyncedToSource[currentdocID] = struct{}{}
}

if action.NodeID.HasValue() && action.NodeID.Value() == cfg.SourceNodeID {
if !action.NodeID.HasValue() || action.NodeID.Value() == cfg.SourceNodeID {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Are you sure this is correct?

If !action.NodeID.HasValue() then the document will be created synchronously and locally by the test framework in all nodes. Does the P2P system still transmit sync events if they already exist on the node?

todo: If this is correct, can you please add a comment documenting this :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the P2P system still transmit sync events if they already exist on the node?

Yes. The Replicator doesn't know that the target has the document already so it still sends the created document which will cause the target to send a received push log event.

@fredcarle fredcarle force-pushed the fredcarle/fix/p2p-create-order branch from 953bd35 to 93b74ce Compare April 24, 2023 22:57
@fredcarle fredcarle merged commit 322807f into develop Apr 24, 2023
@fredcarle fredcarle deleted the fredcarle/fix/p2p-create-order branch April 24, 2023 23:08
shahzadlone pushed a commit to shahzadlone/defradb that referenced this pull request Feb 23, 2024
Description
This PR seems to have fixed all known flakyness from the P2P integration tests. It does so at different levels:

- We prevent pre peer connection and pre replicator configuration actions from having their events received by the wait function by adding a 100ms sleep call pre setup.
- We ensure that the wait go routines are ready before continuing with the actions.
- We add a docQueue to the PushLog function so that we don't handle multiple updates to the same document at once. This helps avoid unnecessary transaction conflicts. This would also sometimes cause the PushLog function to hang and the RPC call hitting a timeout.
- We handle pubsub topic subscriptions a little better by subscribing only to the collection OR the documents. Subscribing to documents and the collection would cause double the event handling.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment