Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more tests for receiving events while resyncing for a partial join #419

Merged

Conversation

squahtx
Copy link
Contributor

@squahtx squahtx commented Jul 21, 2022

Add tests for three cases where a homeserver receives a latest event with missing prev events while it is still resyncing state after a partial join.

Namely:

... <-- M <-- A <-- B
        +---------+
        v          \
... <-- M <-- A <-- B

and

        +---------+
        v          \
... <-- M <-- A <-- B <-- C

where event M is the last event the homeserver knows about and the rightmost event is the one it receives.

The last case tests the code path mentioned in matrix-org/synapse#13002

Best reviewed commit by commit.

Sean Quah added 3 commits July 21, 2022 16:26
`testReceiveEventDuringPartialStateJoin()` sends an event over
federation, checks that a client can see it and checks the state at the
event.
@squahtx squahtx requested a review from a team July 21, 2022 19:06
Comment on lines +791 to +798
var prevEvents interface{}
if prevEventIDs == nil {
prevEvents = nil
} else {
prevEvents = prevEventIDs
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do this because interface{}(nil) is not equal to []string(nil)

Comment on lines 635 to 641
// since we have not yet given the homeserver the full state at the join event and allowed
// the partial join to complete, it can't possibly know the full state at the last event.
// While it may be possible for the response to be correct by some accident of state res,
// the homeserver is still wrong in spirit.
t.Fatalf("/state_ids request for event %s did not block when it should have", event.EventID())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part tests matrix-org/synapse#13002 . If Synapse does not correctly flag that it does not have partial state in the CanReceiveEventsWithMissingPrevWithHalfMissingPrevsDuringPartialStateJoin test case, it will erroneously return state (which was derived via state res and only happens to be correct) instead of blocking.

})

// we should be able to receive events with missing prevs over federation during the resync
t.Run("CanReceiveEventsWithMissingPrevWithHalfMissingPrevsDuringPartialStateJoin", func(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions for better names are welcome.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "can receive events with some missing grandparents" or something?

@squahtx squahtx force-pushed the squah/faster_room_joins_calculate_partial_state_flag_for_backfill branch from 6b43730 to 5e38528 Compare July 21, 2022 20:43
@richvdh richvdh self-assigned this Jul 22, 2022
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks generally good, a few suggestions.

@@ -143,66 +143,8 @@ func TestPartialStateJoin(t *testing.T) {

// derek sends an event in the room
event := psjResult.CreateMessageEvent(t, "derek", nil)
psjResult.Server.MustSendTransaction(t, deployment, "hs1", []json.RawMessage{event.JSON()}, nil)
t.Logf("Derek sent event event ID %s", event.EventID())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/sent/created/ here

Comment on lines +493 to +497
// fire off a /state_ids request for the last event.
// it must either:
// * block because the homeserver does not have full state at the last event
// * or 403 because the homeserver does not have full state yet and does not consider the
// Complement homeserver to be in the room
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we know which, and test for that specifically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now only the last test case blocks instead of 403s. I'm somewhat reluctant to be more specific in the tests, because it feels like an implementation detail of synapse (which will change when we get around to matrix-org/synapse#13288).

If you're really keen on being more specific, I'm happy to add a boolean parameter or suchlike. (It's not easy to move this part of the test out, since it needs to happen after the /event request above has succeeded)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah right. tbh I don't think either blocking or returning a 403 is correct behaviour, though I'm not quite sure what is.

It would be very useful to add a link to #13288 and say that this is somewhat temporary until we solve it properly.

Copy link
Contributor Author

@squahtx squahtx Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, is there a better way to test whether synapse thinks an event has partial state?
this part of the test was intended to check that _resolve_state_at_missing_prevs calculates the partial state flag correctly and it sounds like it's not a good idea to rely on /state_ids

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none that I can immediately think of. The ideal thing would be to check that after the resync, /state_ids returns some result that would have been impossible without the information conveyed during the resync - but that sounds rather fiddly for a test (and, probably, a hard-to-understand test failure if it later turns out not to be true).

select {
case <-time.After(1 * time.Second):
t.Logf("/state_ids request for event %s blocked as expected", event.EventID())
defer func() { <-stateIdsResponseChan }()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this defer thing do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the /state_ids request blocks, the send of the response happens some time later, perhaps even after the test ends.
If we don't read from the channel, the goroutine above blocks on sending into the channel and complement gets very unhappy 10 minutes later because it's still not finished.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, so we're making sure we read from the channel so that the goroutine doesn't block. Can you add a comment to that effect?

Comment on lines 508 to 518
if err := psjResult.Server.SendFederationRequest(deployment, stateReq, &respStateIDs); err != nil {
httpErr, ok := err.(gomatrix.HTTPError)
t.Logf("%v", httpErr)
if ok && httpErr.Code == 403 {
stateIdsResponseChan <- nil
return
}
t.Errorf("/state_ids request returned non-200: %s", err)
return
}
stateIdsResponseChan <- &respStateIDs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than matching for the 403 here and sending a nil magic value down the channel, I suggest you define

type stateIDsResult struct {
    RespStateIDs gomatrixserverlib.RespStateIDs
    Error error
}

... and then send that back down the channel, and do the matching in the main thread.

result.ServerRoom,
result.fedStateIdsRequestReceivedWaiter,
result.fedStateIdsSendResponseWaiter,
result.fedStateIdsAllowedEvents,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be easier to flip this around, and have handleStateIdsRequests only match /state_ids requests for the last event before the join event? That's what I did in #418, in 687ad36.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks a lot cleaner. I've borrowed that commit and it works great. Do we want to split that change into a separate PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I'll do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

})

// we should be able to receive events with missing prevs over federation during the resync
t.Run("CanReceiveEventsWithMissingPrevWithHalfMissingPrevsDuringPartialStateJoin", func(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "can receive events with some missing grandparents" or something?

Comment on lines 248 to 250
t.Logf("Derek sent event A with ID %s", eventA.EventID())
t.Logf("Derek sent event B with ID %s", eventB.EventID())
t.Logf("Derek sent event C with ID %s", eventC.EventID())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/sent/created/. He hasn't sent them to hs1 yet.

@richvdh richvdh removed their assignment Jul 22, 2022
@squahtx squahtx requested a review from richvdh July 22, 2022 14:06
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment on lines 635 to 636
// read from the channel and discard the result, so that the goroutine above doesn't block
// indefinitely on the send
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'm a bit surprised that closing the channel isn't sufficient for this, but fair enough)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We get a panic: send on closed channel without this. To clarify:

no close and no read -> complement times out after 10 minutes
defer close + no read -> panic
defer close + defer read -> complement is happy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see why this is done in a goroutine at all - is this because SendFederationRequest can block indefinitely? There is a default 30s timeout - if that is too long then you need to send a custom context with a timeout here -

return httpClient.DoRequestAndParseResponse(context.Background(), httpReq, resBody)
- currently this is context.Background() so it won't ever expire, but you can change that easily by doing:

ctx, cancel := context.WithTimeout(context.Background(), time.Second)
defer cancel()
httpClient.DoRequestAndParseResponse(ctx, httpReq, resBody)

This will require a change to Complement's API to:

func (s *Server) SendFederationRequest(ctx context.Context, deployment *docker.Deployment, req gomatrixserverlib.FederationRequest, resBody interface{}) error {

but that's fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, that's a lot cleaner!

@squahtx
Copy link
Contributor Author

squahtx commented Jul 22, 2022

// fire off a /state_ids request for the last event.
// it must either:
//   * block because the homeserver does not have full state at the last event
//   * or 403 because the homeserver does not have full state yet and does not consider the
//     Complement homeserver to be in the room

hm, is there a better way to test whether synapse thinks an event has partial state? this part of the test was intended to check that _resolve_state_at_missing_prevs calculates the partial state flag correctly and it sounds like it's not a good idea to rely on /state_ids

Before we merge, is there a better way to check that Synapse has calculated the partial state flag for an event correctly?

if time.Since(start) > time.Second {
t.Fatalf("timeout waiting for received event to be visible")
}
res := alice.DoFunc(t, "GET", []string{"_matrix", "client", "r0", "rooms", psjResult.ServerRoom.RoomID, "event", event.EventID()})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can now use WithRetryUntil in your DoFunc so you can remove the for loop etc. It would look a bit like:

res := alice.DoFunc(t, "GET", []string{"_matrix", "client", "r0", "rooms", psjResult.ServerRoom.RoomID, "event", event.EventID()},
client.WithRetryUntil(time.Second, func(res *http.Response) bool {
    if res.StatusCode == 200 {
        return true
    }
    eventResBody := client.ParseJSON(t, res)
    if res.StatusCode == 404 && gjson.GetBytes(eventResBody, "errcode").String() == "M_NOT_FOUND" {
        return false
    }
    t.Fatalf("GET /event failed with %d: %s", res.StatusCode, string(eventResBody))
    return false
})

Copy link
Member

@kegsay kegsay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use client.WithRetryUntil and add a context to SendFederationRequest and you should be able to remove a fair chunk of boilerplate here, and help other test authors.

@squahtx squahtx requested a review from a team as a code owner July 25, 2022 21:44
@squahtx squahtx requested a review from kegsay July 25, 2022 23:37
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm otherwise.

Comment on lines 1 to 5
{
"go.testEnvVars": {
"COMPLEMENT_BASE_IMAGE": "complement-synapse"
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if we should be commiting this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally not, as it will make Dendrite people sad.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot! I must have added it by accident while merging in main.

@squahtx
Copy link
Contributor Author

squahtx commented Jul 26, 2022

Thank you both for the reviews!

@squahtx squahtx merged commit 3d09f94 into main Jul 26, 2022
@squahtx squahtx deleted the squah/faster_room_joins_calculate_partial_state_flag_for_backfill branch July 26, 2022 11:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants