-
Notifications
You must be signed in to change notification settings - Fork 722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempts to fix flakiness in session_ticket_test #3913
Conversation
tests/unit/s2n_session_ticket_test.c
Outdated
* We expect to choose the wrong key 0.02% of the time. This value is drawn from the weight of the expected key, | ||
* which does not change per test run. Therefore, the probability that the test chooses the wrong key | ||
* allowed_failures times is 0.0002 ^ 10, which is extremely unlikely to occur. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.02% of the time is 1/5000. My understanding is that we're seeing this test fail more often than that.
Letting it fail up to 10 times doesn't seem like the right solution. The logic would definitely be wrong if the test failed more often than it succeeded, but that would pass this test. Like, if we're choosing the wrong key 9/10 times, that is very clearly an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should mention that we're not seeing this test fail more often than expected. There was a flaw in my testing methodology when I got those results. I can change this to be 0.0002^2, that is still a really small probability, I just wasn't sure what number to choose.
Technically it is possible to choose the wrong key 9/10 times, its just very improbable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's important to identify what is being tested here. If I'm following the assertions correctly, it's wanting to show that when the expected key is selected, it behaves in a certain way. The original test even went as far to assume that it would never pick another one.
Now if we're wanting to test the distribution of it picking another one, we should have a more specialized test for that, as highlighted in #3922; ideally one that doesn't use self-talk but calls the API directly and has a large sample size to eliminate flakiness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The limit of 10
is mostly just an arbitrary number to be able to say that it's very unlikely this will fail in this way again but not have it spin indefinitely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked online and lowered allowed failures to 1. Also updated the comment.
Resolved issues:
N/A
Description of changes:
There are two types of flakiness seen with this test. One is that we expect the server to issue a new session ticket during the handshake and occasionally it does not. The other is that sometimes the ticket key used for encrypting the session ticket is not the expected ticket key.
We suspect that the cause of the first issue is that in order issue a session ticket the server needs to check if the ticket key is available. A key is valid if its intro time is earlier than the current time. Wall clocks don't guarantee montonicity, so it is possible that this check might not pass if the test executes quickly and the key was added close to when it is actually used. To fix this issue I changed the key intro time so that it is introduced in the past, so that by the time the key is used key_intro_time < now.
The second issue is caused by the fact that selecting a key is a random weighted selection, so we expect it to actually select the wrong key sometimes. To fix this I put the test in a loop and accepted some rate of failure.
It's very hard to verify that these fixes work since flakiness by nature is hard to replicate. I suspect we may need to iterate on this solution a bit if it gets merged and failures are still observed. But it's a start towards fixing the issue.
Also added an issue to beef up testing around the key selection algorithm: #3922
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.