
Try ipv6 connections even after udp timeout #1086

Merged: 1 commit, Aug 27, 2018

@zugz zugz commented Aug 13, 2018

Currently this just adds a test demonstrating the problem: if a friend
connection is established for at least 2 seconds, then one friend stops
iterating for at least 31 seconds, then the friend connection goes down and
requires around two minutes to come back up. This is clearly excessive, and
poses a problem for tests which want to use this method for testing network
disconnections, as well as for actual clients which are ^Z suspended.

Experience with actual clients suggests suspension of toxcore is necessary to
trigger this bug - disconnections due to actual network failure are typically
re-established much faster. I have no real idea currently what the problem
could be.

See also zugz/c-toxcore:reconnect_slow:timeshifting for a version of the test
which executes without delays, by simulating the sleeps.
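The idea behind simulating the sleeps can be sketched with a fake clock that the test advances by hand instead of waiting. This is a minimal sketch that assumes a hypothetical `Fake_Clock` type and an illustrative `PEER_TIMEOUT_MS`. Neither comes from toxcore or from the timeshifting branch.

```c
#include <stdbool.h>
#include <stdint.h>

#define PEER_TIMEOUT_MS 8000 /* illustrative value, not toxcore's timeout */

/* Hypothetical fake clock: the test advances it instead of sleeping. */
typedef struct {
    uint64_t now_ms;
} Fake_Clock;

/* Hypothetical per-peer state: when we last heard from the peer. */
typedef struct {
    uint64_t last_recv_ms;
} Peer;

static void fake_clock_advance(Fake_Clock *fc, uint64_t ms)
{
    fc->now_ms += ms;
}

/* Record that a packet from the peer was processed "now". */
static void peer_received(Peer *peer, const Fake_Clock *fc)
{
    peer->last_recv_ms = fc->now_ms;
}

static bool peer_timed_out(const Peer *peer, const Fake_Clock *fc)
{
    return fc->now_ms - peer->last_recv_ms > PEER_TIMEOUT_MS;
}
```

For the scenario above, a test would advance the clock 2000 ms while connected, then 31000 ms without delivering packets. It then checks that `peer_timed_out` reports the disconnect, and none of this costs real time.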


@iphydf iphydf added this to the v0.2.x milestone Aug 13, 2018
@zugz zugz changed the title WIP: make friend connections reconnect faster after suspend try ipv6 connections even after udp timeout Aug 14, 2018
@zugz zugz changed the title try ipv6 connections even after udp timeout Try ipv6 connections even after udp timeout Aug 14, 2018
@zugz zugz commented Aug 14, 2018

It looks like I found the problem, but please review carefully to make sure this is the correct solution.

@zugz zugz commented Aug 14, 2018

@GrayHatter could you take a look at this? Could there have been a reason for the previous behaviour?

@GrayHatter

What part are you asking me to take a look at?

Verify the correctness of the tests? Or explain the timeout behavior?

@GrayHatter

What are you asking here: what the point of timing out connections is, or whether the tests are correct?

Re: The change to netcrypto. Generally, I don't approve. If you want to disable the timeout checks, that's something I'm happy to discuss the merits of, but this still does the checks, then ignores them after the fact.

I'd have to look into exactly what calls return_ipp_conn... But why would you want to keep trying a connection that has already timed out? I myself have always thought the timeouts were way too lazy. IMO toxcore should be culling connections after something closer to 10s. (But that change would have to wait for some network optimizations; toxcore is still too network-heavy for that.)

Before I can answer intelligently about why toxcore does what it does, can you describe the problem you're coming up against a bit better?

Additional thought: I don't know the TCP/IP stack that well. Does the system hold packets for suspended apps? Suppose the kernel is kind enough to queue UDP packets for a suspended app with an open socket. When you resume the app, the paused client will think its peer is still online, because the kernel will then deliver all the packets its peer sent in the meantime. The suspended app will therefore update the timeout to when it first got the packets, not when they were actually sent. With a network outage, by contrast, both clients will rightly see the other as disconnected, because the timeouts on both ends will be accurate.
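The asymmetry described above can be shown with a toy model. It assumes the kernel keeps buffering datagrams for a stopped process, bounded by the socket's receive buffer. It also assumes the app records "last received" as the local time at which it processes a packet. `Conn`, `drain_queue`, `conn_alive` and `TIMEOUT_S` are hypothetical names, and this is not toxcore's timeout code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TIMEOUT_S 8 /* illustrative timeout in seconds, not toxcore's value */

typedef struct {
    uint64_t last_recv; /* local time (s) at which we last processed a packet */
} Conn;

/* On resume, the app drains `queued` buffered datagrams at local time `now`.
 * Each one refreshes last_recv to `now`: the receiver only knows when it
 * processed the packet, not when the peer sent it. */
static void drain_queue(Conn *conn, size_t queued, uint64_t now)
{
    for (size_t i = 0; i < queued; ++i) {
        conn->last_recv = now;
    }
}

static bool conn_alive(const Conn *conn, uint64_t now)
{
    return now - conn->last_recv <= TIMEOUT_S;
}
```

In this model the resumed side considers its peer online right after draining the queue. The peer, which heard nothing from the stopped process, has already timed it out. In a real outage neither side receives anything, so both time out.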

@zugz zugz commented Aug 15, 2018 via email

#include "../toxcore/util.h"
#include "check_compat.h"

#define NUM_TOXES 2

Member:

This needs a different name, or needs #undef at the end: https://travis-ci.org/TokTok/c-toxcore/jobs/416043872#L1251
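Both options the reviewer names can be sketched briefly. This sketch assumes the clash comes from another source file defining the same macro within one build (the linked CI log is not reproduced here). `RECONNECT_NUM_TOXES` and `reconnect_tox_count` are hypothetical names.

```c
/* Option 1: a test-specific name that cannot collide with other tests. */
#define RECONNECT_NUM_TOXES 2

static unsigned reconnect_tox_count(void)
{
    return RECONNECT_NUM_TOXES;
}

/* Option 2: #undef at the end of the file so the macro does not leak
 * into anything compiled after it. */
#undef RECONNECT_NUM_TOXES
```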


printf("letting connections settle\n");

while ((int)(time(nullptr) - test_start_time) < 2) {

Member:

Why the int-cast here?
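The loop condition `(int)(time(nullptr) - test_start_time) < 2` converts a `time_t` difference to `int`. One cast-free alternative uses the standard `difftime()`, sketched minimally here. The helper name `still_settling` is hypothetical and does not come from the test.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: true while fewer than `seconds` have elapsed between
 * `start` and `now`. difftime() returns the difference as a double, so the
 * comparison needs no cast to int. */
static bool still_settling(time_t start, time_t now, double seconds)
{
    return difftime(now, start) < seconds;
}
```

The settle loop would then read `while (still_settling(test_start_time, time(nullptr), 2)) { ... }`.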

uint32_t connected_count = 0;

for (size_t j = 0; j < friend_count; j++) {
    if (tox_friend_get_connection_status(toxes[index], j, nullptr) != TOX_CONNECTION_NONE) {

Member:

Ideally we'd use the state and friend connection callbacks for this, but maybe we need a bit more auto_test framework to make that less repetitive.

Author:

Yes. I'm leaving it as is for now - quite a few of the tests use this deprecated function, best to fix them all at once.

return conn->ip_portv4;
}

if (net_family_is_ipv6(conn->ip_portv6.ip.family)) {

Member:

I'll think about this a bit more tomorrow. Before I think about it and read more code, maybe you can tell me what the purpose of lines 638-640 is after this change, i.e. why we have ip_is_lan between the two ipv6 returns.

@zugz zugz commented Aug 16, 2018 via email

@iphydf iphydf left a comment

Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @zugz)


auto_tests/Makefile.inc, line 3 at r3 (raw file):

if BUILD_TESTS

TESTS = disc_test bootstrap_test conference_double_invite_test conference_peer_nick_test conference_simple_test conference_test \

reconnect_test (and keep this sorted)


auto_tests/Makefile.inc, line 10 at r3 (raw file):

        tox_many_tcp_test tox_many_test tox_one_test tox_strncasecmp_test typing_test version_test

check_PROGRAMS = disc_test bootstrap_test conference_double_invite_test conference_peer_nick_test conference_simple_test \

reconnect_test


auto_tests/Makefile.inc, line 60 at r3 (raw file):

conference_test_LDADD = $(AUTOTEST_LDADD)

disc_test_SOURCES = ../auto_tests/bootstrap_test.c

Not bootstrap_test, reconnect_test.

@iphydf iphydf left a comment

Reviewed 1 of 1 files at r3, 2 of 3 files at r4.
Reviewable status: 1 change requests, 0 of 1 approvals obtained (waiting on @zugz)


auto_tests/reconnect_test.c, line 1 at r4 (raw file):

/* Auto Tests: Conferences.

Something else here. Perhaps copy/paste some of the PR description here for posterity.


auto_tests/reconnect_test.c, line 71 at r4 (raw file):

    printf("disconnecting #%u\n", state[disconnect].index);

    while (!all_disconnected_from(TOX_COUNT, toxes, state, disconnect)) {

Optional: perhaps use do-while, as all the other tests now do.
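Here is a minimal sketch of the do-while shape. `Fake_Net`, `iterate_all` and `all_disconnected` are hypothetical stand-ins for the test's state and for helpers such as `all_disconnected_from`.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the test's state. */
typedef struct {
    int iterations;
    int disconnect_after; /* iterations until the peer appears disconnected */
} Fake_Net;

static void iterate_all(Fake_Net *net)
{
    ++net->iterations;
}

static bool all_disconnected(const Fake_Net *net)
{
    return net->iterations >= net->disconnect_after;
}

/* do-while form: iterate first, then check the condition. */
static int wait_for_disconnect(Fake_Net *net)
{
    do {
        iterate_all(net);
    } while (!all_disconnected(net));

    return net->iterations;
}
```

Unlike the `while` form, the body runs at least once even when the condition already holds on entry.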

@codecov codecov bot commented Aug 26, 2018

Codecov Report

Merging #1086 into master will decrease coverage by <.1%.
The diff coverage is 95%.

@@           Coverage Diff            @@
##           master   #1086     +/-   ##
========================================
- Coverage    82.7%   82.6%   -0.1%     
========================================
  Files          81      82      +1     
  Lines       14438   14478     +40     
========================================
+ Hits        11945   11965     +20     
- Misses       2493    2513     +20
Impacted Files                  Coverage Δ
auto_tests/reconnect_test.c     100% <100%> (ø)
toxcore/net_crypto.c            93.5% <66.6%> (-0.1%) ⬇️
toxcore/crypto_core_test.cc     90.3% <0%> (-9.7%) ⬇️
toxav/audio.c                   65.7% <0%> (-7.8%) ⬇️
toxav/toxav.c                   68.3% <0%> (-0.4%) ⬇️
toxcore/Messenger.c             85.7% <0%> (-0.1%) ⬇️
toxcore/DHT.c                   76.7% <0%> (ø) ⬆️
toxav/msi.c                     65% <0%> (+0.5%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0075374...66ab386.

@zugz zugz commented Aug 26, 2018 via email

@zugz zugz force-pushed the reconnect_slow branch 5 times, most recently from 0453144 to 27a22a0, August 27, 2018 16:08
@iphydf iphydf commented Aug 27, 2018

<@irungentoo> it looks ok
<@irungentoo> though for consistency it might be better to put the  if (net_family_is_ipv4(conn->ip_portv4.ip.family)) {...} before: if (net_family_is_ipv6(conn->ip_portv6.ip.family)) {...}
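The selection policy named in the PR title can be sketched in simplified form, using irungentoo's suggested ordering (IPv4 check before IPv6). The sketch prefers an address heard from recently. When both addresses have timed out, it still returns a known address, IPv6 included, rather than none. `Endpoint`, `Conn`, `pick_endpoint` and `DIRECT_TIMEOUT` are hypothetical stand-ins, not net_crypto.c's types or the merged code. The LAN preference discussed in the review is omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { FAM_NONE, FAM_V4, FAM_V6 } Family;

typedef struct {
    Family family;      /* FAM_NONE if this address is unknown */
    uint64_t last_recv; /* local time (s) we last received from it */
} Endpoint;

typedef struct {
    Endpoint v4;
    Endpoint v6;
} Conn;

#define DIRECT_TIMEOUT 8 /* illustrative value in seconds */

static bool is_fresh(const Endpoint *e, uint64_t now)
{
    return e->family != FAM_NONE && now - e->last_recv < DIRECT_TIMEOUT;
}

/* Returns the endpoint to send to, or NULL if no address is known. */
static const Endpoint *pick_endpoint(const Conn *c, uint64_t now)
{
    if (is_fresh(&c->v4, now)) {
        return &c->v4;
    }

    if (is_fresh(&c->v6, now)) {
        return &c->v6;
    }

    /* Both timed out: keep trying any known address, IPv6 included,
     * instead of giving up on direct UDP. */
    if (c->v4.family == FAM_V4) {
        return &c->v4;
    }

    if (c->v6.family == FAM_V6) {
        return &c->v6;
    }

    return NULL;
}
```

With only an IPv6 address known, this sketch keeps returning it after the timeout. A version that dropped the stale-IPv6 fallback would return NULL in that case.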

@zugz zugz force-pushed the reconnect_slow branch 2 times, most recently from c1624c6 to c6b0073, August 27, 2018 20:05
Also adds a test (auto_reconnect_test) which fails without this change.
@zugz zugz merged commit 66ab386 into TokTok:master Aug 27, 2018
@zugz zugz deleted the reconnect_slow branch August 27, 2018 21:18
@robinlinden robinlinden modified the milestones: v0.2.x, v0.2.7 Aug 30, 2018
This pull request was closed.