
Add mechanism for recovering from disconnections in conferences #1069

Merged 1 commit on Sep 5, 2018

Conversation

@zugz commented Aug 12, 2018


@CLAassistant commented Aug 12, 2018

CLA assistant check
All committers have signed the CLA.

@iphydf (Member) left a comment:

I'll review the logic more in-depth later.

toxcore/group.c Outdated
return -1;
}

memcpy(&temp[g->numpeers], &g->frozen[frozen_index], sizeof(Group_Peer));
Member:
This can be a regular assignment.
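The suggested change can be sketched as follows. The `Peer` struct here is a hypothetical stand-in for toxcore's `Group_Peer`, just to make the pattern self-contained:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for toxcore's Group_Peer, for illustration only. */
typedef struct Peer {
    uint8_t pk[32];
    uint64_t last_active;
} Peer;

/* For two lvalues of the same struct type, plain assignment copies all
 * members, is type-checked by the compiler, and cannot get the size wrong,
 * unlike memcpy(dst, src, sizeof(Peer)). */
void copy_peer(Peer *dst, const Peer *src)
{
    *dst = *src;
}
```

So `memcpy(&temp[g->numpeers], &g->frozen[frozen_index], sizeof(Group_Peer))` would become `temp[g->numpeers] = g->frozen[frozen_index]`.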

toxcore/group.c Outdated
g->frozen = nullptr;
} else {
if (g->numfrozen != (uint32_t)frozen_index) {
memcpy(&g->frozen[frozen_index], &g->frozen[g->numfrozen], sizeof(Group_Peer));
Member:
Also regular assignment.

toxcore/group.c Outdated
{
Group_c *g = get_group_c(g_c, groupnumber);

if (!g) {
return -1;
}

// TODO(irungentoo):
int peer_index = peer_in_chat(g, real_pk);
int peer_index = -1;
Member:
This can be "const int" if you use a ternary expression.
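A sketch of the "const int plus ternary" pattern being suggested. `peer_in_chat_stub` is a toy stand-in for toxcore's `peer_in_chat()`; the real code would pass the group and public key:

```c
#include <assert.h>

/* Toy stand-in for toxcore's peer_in_chat(), for illustration only. */
static int peer_in_chat_stub(int known_index)
{
    return known_index;
}

int resolve_peer_index(int known_index, int have_peer)
{
    /* Initialize once via a ternary instead of declare-then-reassign,
     * so the variable can be const. */
    const int peer_index = have_peer ? peer_in_chat_stub(known_index) : -1;

    return peer_index;
}
```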

toxcore/group.c Outdated
}

memcpy(&temp[g->numfrozen], &g->group[peer_index], sizeof(Group_Peer));
Member:
Use assignment.

toxcore/group.c Outdated
return -1;
}

int friendcon_id = getfriend_conn_id_pk(g_c->fr_c, real_pk);
Member:
const

Author:
done.

toxcore/group.c Outdated
return -1;
}

int close_index = add_conn_to_groupchat(g_c, friendcon_id, groupnumber, 0, 1);
Member:
const

@iphydf iphydf added this to the v0.2.x milestone Aug 12, 2018
@zugz (Author) commented Aug 12, 2018 via email

docs/minpgc.md Outdated

# Discussion
## Overview
The intention is to recover seemlessly from splits in the group, the most
Member:
seamlessly

@zugz (Author) commented Aug 13, 2018 via email

@iphydf (Member) left a comment:

Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @iphydf and @zugz)


auto_tests/conference_test.c, line 107 at r4 (raw file):

    uint16_t disconnect_toxes[NUM_DISCONNECT];

    for (uint16_t i = 0; i < NUM_DISCONNECT; ++i) {

NUM_DISCONNECT is actually MAX_DISCONNECT? Since numbers may repeat?


auto_tests/conference_test.c, line 112 at r4 (raw file):

    }

    for (uint16_t j = 0; j < 70 * 5; ++j) {

What's 70 * 5? It looks like we wait for a fixed amount of time. Perhaps we should wait for events to happen? I.e. wait until all the toxes we expect to be disconnected are in fact disconnected.
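The event-driven wait the reviewer proposes could be sketched like this: poll a predicate until it holds, keeping an iteration cap only as a safety net. In the real test, `step()` would call `tox_iterate` on every instance followed by `c_sleep`; the counter predicate below is just a demo:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Poll done(ctx) until it returns true, calling step(ctx) between polls.
 * Returns false if the event never happened within max_iterations. */
bool wait_until(bool (*done)(void *), void (*step)(void *), void *ctx,
                uint32_t max_iterations)
{
    for (uint32_t i = 0; i < max_iterations; ++i) {
        if (done(ctx)) {
            return true; /* the event happened; stop early */
        }

        step(ctx);
    }

    return false; /* gave up: the event never happened */
}

/* Tiny demo predicate: "done" once the counter reaches its target. */
struct counter {
    uint32_t n;
    uint32_t target;
};

static bool counter_done(void *ctx)
{
    const struct counter *c = (const struct counter *)ctx;
    return c->n >= c->target;
}

static void counter_step(void *ctx)
{
    ++((struct counter *)ctx)->n;
}
```

The benefit over a fixed `70 * 5` loop is that the test finishes as soon as the expected state is reached, and a failure to reach it is reported as a timeout rather than silently passing or hanging.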


auto_tests/conference_test.c, line 128 at r4 (raw file):

        }

        c_sleep(200);

I think the test can be a little faster by using 100 here. I think we have an ITERATION_INTERVAL constant for that, which was chosen to iterate as few times as possible while giving about the same test timing. 200, I think (I might be wrong), was increasing test times significantly. The reason I kept 200 here is because in some loops we print something on every loop iteration. That can be fixed, and then we can iterate more frequently, speeding up this test (which currently times out).


auto_tests/conference_test.c, line 141 at r4 (raw file):

    printf("reconnecting toxes\n");

    for (uint16_t j = 0; j < 120 * 5; ++j) {

Should we also wait for an event here?

@zugz zugz force-pushed the minpgc branch 5 times, most recently from 1fc620c to 93b6d54 Compare August 19, 2018 10:51
bool disconnected[NUM_GROUP_TOX];

for (uint16_t i = 0; i < NUM_GROUP_TOX; ++i) {
disconnected[i] = false;
Member:
I don't know.. bool disconnected[NUM_GROUP_TOX] = {0}; does the same. You may have a reason to have this explicit loop. What's the reason?
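For reference, C guarantees that an aggregate initializer zero-fills every element not explicitly listed, so `= {false}` (or `= {0}`) replaces the explicit loop. The `NUM_GROUP_TOX` value here is assumed for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_GROUP_TOX 16 /* assumed value, for illustration */

size_t count_disconnected_at_start(void)
{
    bool disconnected[NUM_GROUP_TOX] = {false}; /* all elements start false */
    size_t count = 0;

    for (size_t i = 0; i < NUM_GROUP_TOX; ++i) {
        if (disconnected[i]) {
            ++count;
        }
    }

    return count;
}
```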

Author:
No good reason. Fixed.

uint16_t disconnect;

do {
disconnect = random_u16() % NUM_GROUP_TOX;
Member:
It took me about 30 seconds to understand this loop. I might be incompetent, but more incompetent people are going to read this and not understand it within 10 seconds. Perhaps you can factor this out into a function with a name (and maybe a comment) to clarify what this does.
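One way to apply the suggestion: move the do/while into a helper whose name states the intent. `random_u16` is stubbed with `rand()` here; the real test uses toxcore's RNG, and `NUM_GROUP_TOX` is an assumed value:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_GROUP_TOX 16 /* assumed value, for illustration */

/* Stand-in for toxcore's random_u16(). */
static uint16_t random_u16_stub(void)
{
    return (uint16_t)rand();
}

/* Pick a random group index that has not already been chosen.
 * Assumes at least one index is still unchosen, otherwise this loops. */
uint16_t random_unchosen_index(const bool already_chosen[])
{
    uint16_t index;

    do {
        index = random_u16_stub() % NUM_GROUP_TOX;
    } while (already_chosen[index]);

    return index;
}
```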

Author:
Done.

toxcore/group.c Outdated
g->peer_on_join(g->object, groupnumber, g->numpeers - 1);
}

--g->numfrozen;
Member:
It might be worth asserting that this isn't 0 before decrementing. I'm conflicted, because I soon want assert to mean "provably true" and this assertion won't be easy to prove. Just something to consider.
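The defensive decrement under discussion could look like the following hypothetical wrapper (not toxcore's actual code): assert the counter is nonzero before `--`, so an underflow is caught in debug builds rather than wrapping to `UINT32_MAX`. Note that `assert` compiles away under `NDEBUG`, so it documents the invariant without a release-build cost:

```c
#include <assert.h>
#include <stdint.h>

/* Decrement an unsigned counter, trapping underflow in debug builds. */
uint32_t checked_decrement(uint32_t numfrozen)
{
    assert(numfrozen > 0); /* would wrap to UINT32_MAX otherwise */
    return numfrozen - 1;
}
```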

Author:
Instead, I rearranged things to call the callbacks after we deal with g->frozen. So now we know g->numfrozen > 0 because we just checked that get_frozen_index didn't return -1. It makes more sense this way anyway.

@iphydf (Member) left a comment:

Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @iphydf)


toxcore/group.c, line 1269 at r1 (raw file):

Previously, iphydf wrote…

const

No const?

@zugz (Author) commented Aug 19, 2018 via email

@anthonybilinski

I tried running it now and ran into a few problems:

  1. On start of client, I see these warning logs:
Warning: "ERROR:Messenger.c:m_send_message_generic:508: Message length 1374 is too large"
Warning: "ERROR:Messenger.c:m_send_message_generic:508: Message length 1378 is too large"
Warning: "ERROR:Messenger.c:m_send_message_generic:508: Message length 1378 is too large"
Warning: "ERROR:Messenger.c:m_send_message_generic:508: Message length 1378 is too large"
  2. I see this warning log very frequently on my client running non-pgc toxcore, which makes sense, but I see it occasionally on clients running pgc toxcore as well, apparently at times when I would expect a Peer Query:
core/toxlogger.cpp:57 : Warning: "WARN :network.c:networking_poll:654: [33] -- Packet has no handler"

I also see the ID warning during auto_conference_test after "letting random toxes timeout".

  3. I see a lot of this during the auto_conference_test, but it seems less serious:
WARNING network.c:550	sendpacket:	unknown address type: 0
  4. qTox segfaults on close (minpgc toxcore), only if there are still others in the group; see:
double free or corruption (out)
Aborted (core dumped)
  5. Probably a qTox issue: qTox segfaults on leaving a group (old toxcore, pgc group members).

@zugz (Author) commented Aug 20, 2018 via email

@anthonybilinski commented Aug 22, 2018

> Is it really introduced by minpgc?

Hard to say, but going back and forth I do not see it with master but do see it with this branch. I will test a bit more to make sure.

> Do you get either with master?

No, I didn't see either with master. I think the "Packet has no handler" warning is actually blocking functionality: if I close and re-open my client using this branch, I don't rejoin the group, but I see those warnings instead (I think). I have some non-default network settings in the client, i.e. IPv4 only; I will test with full defaults and see if I can still repro.

> whatever peers do. I can't imagine what could be causing it.

Agreed; I just got a repro on stock toxcore, so it's definitely a qTox issue.

@zugz (Author) commented Aug 22, 2018 via email

@anthonybilinski

Ok more testing and some root causing:

  1. I do see this with stock toxcore master tip. I didn't see it before, I guess, because my toxcore was older. Created "Message length is too large log spam" #1116.
  2. Same; created "Add empty handler for LAN discovery packets when LAN discovery is disabled" #1115.
  3. May have been the same as 5? Saw this on stock toxcore.
  4. Looks like it actually is toxcore, not qTox; I see it on stock toxcore too. Issue opened: "Segfault on group quit, free of invalid audio_decoder" #1114.

I tested again stopping one of the clients in the group with gdb until they were viewed as offline instead of restarting the client, and you're right that that works. I was under the impression that the other minpgc peer would re-include the re-connecting minpgc peer, but it makes sense that the peer who left needs to maintain the state as well.

In that case testing looks good from my end. I'll continue reading through the changes tomorrow.

@zugz (Author) commented Aug 23, 2018 via email

@zugz zugz force-pushed the minpgc branch 3 times, most recently from cb731c6 to e3441ea Compare August 31, 2018 16:44
@zugz zugz force-pushed the minpgc branch 2 times, most recently from c718ae0 to 14a58e4 Compare September 1, 2018 16:51
@codecov (bot) commented Sep 1, 2018

Codecov Report

Merging #1069 into master will increase coverage by 0.5%.
The diff coverage is 94.1%.


@@           Coverage Diff            @@
##           master   #1069     +/-   ##
========================================
+ Coverage    82.5%   83.1%   +0.5%     
========================================
  Files          82      82             
  Lines       14476   14593    +117     
========================================
+ Hits        11948   12131    +183     
+ Misses       2528    2462     -66
Impacted Files Coverage Δ
auto_tests/overflow_sendq_test.c 100% <100%> (ø) ⬆️
auto_tests/run_auto_test.h 100% <100%> (ø) ⬆️
auto_tests/reconnect_test.c 100% <100%> (ø) ⬆️
auto_tests/conference_test.c 99.1% <100%> (+1.4%) ⬆️
auto_tests/friend_connection_test.c 100% <100%> (ø) ⬆️
auto_tests/overflow_recvq_test.c 100% <100%> (ø) ⬆️
auto_tests/send_message_test.c 94.1% <100%> (ø) ⬆️
auto_tests/conference_peer_nick_test.c 100% <100%> (ø) ⬆️
auto_tests/conference_double_invite_test.c 100% <100%> (ø) ⬆️
auto_tests/lossy_packet_test.c 100% <100%> (ø) ⬆️
... and 13 more


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6872c14...1b23222.

@zugz (Author) commented Sep 1, 2018

This is now ready for final review.

@sudden6 left a comment:

Reviewed 3 of 15 files at r6.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @iphydf and @zugz)


toxcore/group.c, line 1320 at r6 (raw file):

 * return -1 otherwise.
 */
static int try_send_rejoin(Group_Chats *g_c, uint32_t groupnumber, const uint8_t *real_pk)

use true/false for success/failure?
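The true-for-success convention sudden6 suggests could be sketched like this. The helper below is hypothetical; `try_send_rejoin`'s real signature and logic live in toxcore/group.c and differ from this toy:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a real send over a friend connection. */
static bool transport_send_stub(const uint8_t *data, uint16_t length)
{
    return data != NULL && length > 0;
}

/* Returns true on success, false on failure, instead of 0 / -1. */
bool try_send_sketch(const uint8_t *data, uint16_t length)
{
    if (data == NULL || length == 0) {
        return false; /* failure is false, not -1 */
    }

    return transport_send_stub(data, length);
}
```

Callers then read naturally as `if (!try_send_sketch(...))` rather than comparing against -1.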


toxcore/group.c, line 1898 at r6 (raw file):

}

static int handle_packet_rejoin(Group_Chats *g_c, int friendcon_id, const uint8_t *data, uint16_t length,

use true/false for success/failure?

@zugz (Author) left a comment:

Reviewable status: 2 change requests, 0 of 1 approvals obtained (waiting on @iphydf, @zugz, and @sudden6)


auto_tests/conference_test.c, line 107 at r4 (raw file):

Previously, iphydf wrote…

NUM_DISCONNECT is actually MAX_DISCONNECT? Since numbers may repeat?

Not any more.


auto_tests/conference_test.c, line 112 at r4 (raw file):

Previously, iphydf wrote…

What's 70 * 5? It looks like we wait for a fixed amount of time. Perhaps we should wait for events to happen? I.e. wait until all the toxes we expect to be disconnected are in fact disconnected.

Done.


auto_tests/conference_test.c, line 128 at r4 (raw file):

Previously, iphydf wrote…

I think the test can be a little faster by using 100 here. I think we have an ITERATION_INTERVAL constant for that, which was chosen to iterate as few times as possible while giving about the same test timing. 200, I think (I might be wrong), was increasing test times significantly. The reason I kept 200 here is because in some loops we print something on every loop iteration. That can be fixed, and then we can iterate more frequently, speeding up this test (which currently times out).

Done.


auto_tests/conference_test.c, line 141 at r4 (raw file):

Previously, iphydf wrote…

Should we also wait for an event here?

Done.


toxcore/group.c, line 1320 at r6 (raw file):

Previously, sudden6 wrote…

use true/false for success/failure?

Done.


toxcore/group.c, line 1898 at r6 (raw file):

Previously, sudden6 wrote…

use true/false for success/failure?

Making this bool would introduce an ugly mismatch with g_handle_packet. The whole file should be converted to use boolean returns where possible, but this is outside the scope of this PR.

@sudden6 left a comment:

Reviewable status: 2 change requests, 0 of 1 approvals obtained (waiting on @iphydf)


toxcore/group.c, line 1898 at r6 (raw file):

Previously, zugz (zugz) wrote…

Making this bool would introduce an ugly mismatch with g_handle_packet. The whole file should be converted to use boolean returns where possible, but this is outside the scope of this PR.

ok

@robinlinden (Member) left a comment:

Reviewed 2 of 5 files at r1, 1 of 4 files at r3, 2 of 3 files at r5, 14 of 15 files at r6, 1 of 1 files at r7, 1 of 1 files at r8.
Reviewable status: 2 change requests, 0 of 1 approvals obtained (waiting on @iphydf)

@sudden6 left a comment:

:lgtm_strong:

Reviewable status: :shipit: complete! 2 change requests, 1 of 1 approvals obtained (waiting on @iphydf)

toxcore/group.c (Outdated; resolved)
@zugz zugz changed the title minimal conference persistence Add mechanism for recovering from disconnections in conferences Sep 5, 2018
  * add freezing and unfreezing of peers
  * add rejoin packet
  * revise handling of temporary invited connections
  * rename "peer kill" packet to "peer leave" packet
  * test rejoining in conference test
  * use custom clock in conference test
6 participants