-
Notifications
You must be signed in to change notification settings - Fork 287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add mechanism for recovering from disconnections in conferences #1069
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll review the logic more in-depth later.
toxcore/group.c
Outdated
return -1; | ||
} | ||
|
||
memcpy(&temp[g->numpeers], &g->frozen[frozen_index], sizeof(Group_Peer)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a regular assignment.
toxcore/group.c
Outdated
g->frozen = nullptr; | ||
} else { | ||
if (g->numfrozen != (uint32_t)frozen_index) { | ||
memcpy(&g->frozen[frozen_index], &g->frozen[g->numfrozen], sizeof(Group_Peer)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also regular assignment.
toxcore/group.c
Outdated
{ | ||
Group_c *g = get_group_c(g_c, groupnumber); | ||
|
||
if (!g) { | ||
return -1; | ||
} | ||
|
||
// TODO(irungentoo): | ||
int peer_index = peer_in_chat(g, real_pk); | ||
int peer_index = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be "const int" if you use a ternary expression.
toxcore/group.c
Outdated
} | ||
|
||
memcpy(&temp[g->numfrozen], &g->group[peer_index], sizeof(Group_Peer)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use assignment.
toxcore/group.c
Outdated
return -1; | ||
} | ||
|
||
int friendcon_id = getfriend_conn_id_pk(g_c->fr_c, real_pk); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
toxcore/group.c
Outdated
return -1; | ||
} | ||
|
||
int close_index = add_conn_to_groupchat(g_c, friendcon_id, groupnumber, 0, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
Done all. Thanks. I also made some more things const.
|
docs/minpgc.md
Outdated
|
||
# Discussion | ||
## Overview | ||
The intention is to recover seemlessly from splits in the group, the most |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seamlessly
a14947c
to
1c1ebd5
Compare
+The intention is to recover seemlessly from splits in the group, the
most
seamlessly
Oops! Done.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @iphydf and @zugz)
auto_tests/conference_test.c, line 107 at r4 (raw file):
uint16_t disconnect_toxes[NUM_DISCONNECT]; for (uint16_t i = 0; i < NUM_DISCONNECT; ++i) {
NUM_DISCONNECT is actually MAX_DISCONNECT? Since numbers may repeat?
auto_tests/conference_test.c, line 112 at r4 (raw file):
} for (uint16_t j = 0; j < 70 * 5; ++j) {
What's 70 * 5? It looks like we wait for a fixed amount of time. Perhaps we should wait for events to happen? I.e. wait until all the toxes we expect to be disconnected are in fact disconnected.
auto_tests/conference_test.c, line 128 at r4 (raw file):
} c_sleep(200);
I think the test can be a little faster by using 100 here. I think we have an ITERATION_INTERVAL constant for that, which was chosen to iterate as few times as possible while giving about the same test timing. 200, I think (I might be wrong), was increasing test times significantly. The reason I kept 200 here is because in some loops we print something on every loop iteration. That can be fixed, and then we can iterate more frequently, speeding up this test (which currently times out).
auto_tests/conference_test.c, line 141 at r4 (raw file):
printf("reconnecting toxes\n"); for (uint16_t j = 0; j < 120 * 5; ++j) {
Should we also wait for an event here?
1fc620c
to
93b6d54
Compare
auto_tests/conference_test.c
Outdated
bool disconnected[NUM_GROUP_TOX]; | ||
|
||
for (uint16_t i = 0; i < NUM_GROUP_TOX; ++i) { | ||
disconnected[i] = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know.. bool disconnected[NUM_GROUP_TOX] = {0};
does the same. You may have a reason to have this explicit loop. What's the reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No good reason. Fixed.
auto_tests/conference_test.c
Outdated
uint16_t disconnect; | ||
|
||
do { | ||
disconnect = random_u16() % NUM_GROUP_TOX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me about 30 seconds to understand this loop. I might be incompetent, but more incompetent people are going to read this and not understand it within 10 seconds. Perhaps you can factor this out into a function with a name (and maybe a comment) to clarify what this does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
toxcore/group.c
Outdated
g->peer_on_join(g->object, groupnumber, g->numpeers - 1); | ||
} | ||
|
||
--g->numfrozen; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth asserting that this isn't 0 before decrementing. I'm conflicted, because I soon want assert
to mean "provably true" and this assertion won't be easy to prove. Just something to consider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, I rearranged things to call the callbacks after we deal with g->frozen. So now we know g->numfrozen > 0 because we just checked that get_frozen_index didn't return -1. It makes more sense this way anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @iphydf)
toxcore/group.c, line 1269 at r1 (raw file):
Previously, iphydf wrote…
const
No const?
No const?
Done.
|
I tried running it now and ran into a few problems:
I also see the ID warning during auto_conference_test after "letting random toxes timeout".
|
Thanks for testing.
Warning: "ERROR:Messenger.c:m_send_message_generic:508: Message length 1378 is too large"
I don't see what would cause this. Is it really introduced by minpgc?
core/toxlogger.cpp:57 : Warning: "WARN :network.c:networking_poll:654: [33] -- Packet has no handler"
WARNING network.c:550 sendpacket: unknown address type: 0
Interesting. I don't see either of these during the
auto_conference_test, but travis gets the latter at least. Do you get
either with master?
4. qTox segfaults on close (minpgc toxcore)
double free or corruption (out)
Aargh. That one was my fault. Thanks for finding it! Should be fixed
now.
5. Probably a qTox issue, qTox segfaults on leaving group (old toxcore, pgc group members)
Well that certainly isn't minpgc's fault - the client shouldn't segfault
whatever peers do. I can't imagine what could be causing it.
|
hard to say, but going back and forth I do not see it with master but do see it with this branch. I will test a bit more to make sure.
no, didn't see either with master. I think the "Packet has no handler" is actually blocking functionality, if I close and re-open my client using this branch, I don't rejoin the group, but I see those warnings instead (I think). I have some non-default network settings in client, i.e. ipv4 only, I will test full default and see if I can still repro.
Agreed, just got a repro on stock toxcore, guaranteed qTox issue. |
no, didn't see either with master. [...] "Packet has no handler"
That one really makes no sense to me. It should be triggered only if you
receive a packet with an unknown packet kind. minpgc doesn't add any new
packet kinds (only a new crypto packet id), and nor can I imagine why
anything it does would solicit any non-standard packets from other
peers.
if I close and re-open my client using this branch, I don't rejoin the
group
That's expected! You need to have a friend in the group who's also
running minpgc for the persistence to work, and even then it doesn't
work if you restart the client rather than just lose your network
connection, because saving and loading isn't implemented (yet).
|
Ok more testing and some root causing:
I tested again stopping one of the clients in the group with gdb until they were viewed as offline instead of restarting the client, and you're right that that works. I was under the impression that the other minpgc peer would re-include the re-connecting minpgc peer, but it makes sense that the peer who left needs to maintain the state as well. In that case testing looks good from my end. I'll continue reading through the changes tomorrow. |
* Wednesday, 2018-08-22 at 23:44 -0700 - Anthony Bilinski <[email protected]>:
4. May have been the same as 5? Saw this on stock toxcore.
Ah yes, that would make sense. Well, it made me find and fix a bug in
minpgc anyway!
5. Looks like it actually is toxcore
Yes. #1117 should fix it.
In that case testing looks good from my end. I'll continue reading
through the changes tomorrow.
Great. Thanks for testing and reading! I've just pushed a couple more
little changes, and there's one more thing I know needs doing (there's
a FIXME in the code for it), but otherwise to my knowledge it's bugfree
and working.
|
cb731c6
to
e3441ea
Compare
c718ae0
to
14a58e4
Compare
Codecov Report
@@ Coverage Diff @@
## master #1069 +/- ##
========================================
+ Coverage 82.5% 83.1% +0.5%
========================================
Files 82 82
Lines 14476 14593 +117
========================================
+ Hits 11948 12131 +183
+ Misses 2528 2462 -66
Continue to review full report at Codecov.
|
This is now ready for final review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 15 files at r6.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @iphydf and @zugz)
toxcore/group.c, line 1320 at r6 (raw file):
* return -1 otherwise. */ static int try_send_rejoin(Group_Chats *g_c, uint32_t groupnumber, const uint8_t *real_pk)
use true/false
for success/failure
?
toxcore/group.c, line 1898 at r6 (raw file):
} static int handle_packet_rejoin(Group_Chats *g_c, int friendcon_id, const uint8_t *data, uint16_t length,
use true/false
for success/failure
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 change requests, 0 of 1 approvals obtained (waiting on @iphydf, @zugz, and @sudden6)
auto_tests/conference_test.c, line 107 at r4 (raw file):
Previously, iphydf wrote…
NUM_DISCONNECT is actually MAX_DISCONNECT? Since numbers may repeat?
Not any more.
auto_tests/conference_test.c, line 112 at r4 (raw file):
Previously, iphydf wrote…
What's 70 * 5? It looks like we wait for a fixed amount of time. Perhaps we should wait for events to happen? I.e. wait until all the toxes we expect to be disconnected are in fact disconnected.
Done.
auto_tests/conference_test.c, line 128 at r4 (raw file):
Previously, iphydf wrote…
I think the test can be a little faster by using 100 here. I think we have an ITERATION_INTERVAL constant for that, which was chosen to iterate as few times as possible while giving about the same test timing. 200, I think (I might be wrong), was increasing test times significantly. The reason I kept 200 here is because in some loops we print something on every loop iteration. That can be fixed, and then we can iterate more frequently, speeding up this test (which currently times out).
Done.
auto_tests/conference_test.c, line 141 at r4 (raw file):
Previously, iphydf wrote…
Should we also wait for an event here?
Done.
toxcore/group.c, line 1320 at r6 (raw file):
Previously, sudden6 wrote…
use
true/false
forsuccess/failure
?
Done.
toxcore/group.c, line 1898 at r6 (raw file):
Previously, sudden6 wrote…
use
true/false
forsuccess/failure
?
Making this bool would introduce an ugly mismatch with g_handle_packet. The whole file should be converted to use boolean returns where possible, but this is outside the scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 change requests, 0 of 1 approvals obtained (waiting on @iphydf)
toxcore/group.c, line 1898 at r6 (raw file):
Previously, zugz (zugz) wrote…
Making this bool would introduce an ugly mismatch with g_handle_packet. The whole file should be converted to use boolean returns where possible, but this is outside the scope of this PR.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 5 files at r1, 1 of 4 files at r3, 2 of 3 files at r5, 14 of 15 files at r6, 1 of 1 files at r7, 1 of 1 files at r8.
Reviewable status: 2 change requests, 0 of 1 approvals obtained (waiting on @iphydf)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 2 change requests, 1 of 1 approvals obtained (waiting on @iphydf)
* add freezing and unfreezing of peers * add rejoin packet * revise handling of temporary invited connections * rename "peer kill" packet to "peer leave" packet * test rejoining in conference test * use custom clock in conference test
This change is