
Attempts to fix flakiness in session_ticket_test #3913

Merged
maddeleine merged 11 commits into aws:main from flaky_test on Apr 11, 2023

Conversation

@maddeleine (Contributor) commented on Mar 30, 2023

Resolved issues:

N/A

Description of changes:

There are two types of flakiness seen in this test. One is that we expect the server to issue a new session ticket during the handshake, and occasionally it does not. The other is that the ticket key used to encrypt the session ticket is sometimes not the expected ticket key.
We suspect that the cause of the first issue is that, in order to issue a session ticket, the server needs to check that a ticket key is available. A key is valid if its intro time is earlier than the current time. Wall clocks don't guarantee monotonicity, so this check might not pass if the test executes quickly and the key was added close to when it is actually used. To fix this issue I changed the key intro time so that the key is introduced in the past, ensuring that key_intro_time < now by the time the key is used.
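
A minimal sketch of the intro-time fix, assuming the public s2n_config_add_ticket_crypto_key() API; the key name, key material, and the two-hour offset are placeholder values rather than the test's actual ones:

```c
#include <time.h>
#include <s2n.h>

/* Sketch: register a ticket key whose intro time is in the past, so that
 * key_intro_time < now holds even if the wall clock jumps slightly backwards.
 * The key name, key material, and two-hour offset are placeholders. */
static int add_backdated_ticket_key(struct s2n_config *config)
{
    uint8_t key_name[] = "backdated_key";
    uint8_t key_material[32] = { 0 };
    uint64_t intro_time = (uint64_t) time(NULL) - (2 * 60 * 60);

    return s2n_config_add_ticket_crypto_key(config, key_name, sizeof(key_name) - 1,
            key_material, sizeof(key_material), intro_time);
}
```
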
The second issue is caused by the fact that key selection is a random weighted selection, so we expect it to select the wrong key some of the time. To fix this I put the test in a loop and accepted some rate of failure, as sketched below.
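
A rough sketch of the retry loop in plain C (the real test uses s2n's test macros and a self-talk handshake; handshake_used_expected_key() is a hypothetical stand-in for that logic):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in: performs one handshake and reports whether the
 * expected ticket key encrypted the resulting session ticket. */
static bool handshake_used_expected_key(void)
{
    /* Placeholder: the real test runs a self-talk handshake here. The expected
     * key is picked roughly 99.98% of the time, modeled here with rand(). */
    return (rand() % 5000) != 0;
}

int main(void)
{
    /* Weighted key selection can legitimately pick a different key, so
     * tolerate a bounded number of misses instead of failing immediately. */
    const int allowed_failures = 1;
    bool expected_key_used = false;

    for (int attempt = 0; attempt <= allowed_failures; attempt++) {
        if (handshake_used_expected_key()) {
            expected_key_used = true;
            break;
        }
    }

    assert(expected_key_used);
    return 0;
}
```
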

It's very hard to verify that these fixes work, since flakiness is by nature hard to reproduce. I suspect we may need to iterate on this solution if it gets merged and failures are still observed, but it's a start towards fixing the issue.

Also opened an issue to beef up testing around the key selection algorithm: #3922

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@github-actions github-actions bot added the s2n-core team label Mar 30, 2023
@maddeleine maddeleine requested a review from camshaft April 5, 2023 22:16
Comment on lines 826 to 828
* We expect to choose the wrong key 0.02% of the time. This value is drawn from the weight of the expected key,
* which does not change per test run. Therefore, the probability that the test chooses the wrong key
* allowed_failures times is 0.0002 ^ 10, which is extremely unlikely to occur.
Contributor:

0.02% of the time is 1/5000. My understanding is that we're seeing this test fail more often than that.

Letting it fail up to 10 times doesn't seem like the right solution. The logic would definitely be wrong if the test failed more often than it succeeded, but that would pass this test. Like, if we're choosing the wrong key 9/10 times, that is very clearly an error.
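
(For illustration, assuming the loop passes as soon as one attempt picks the expected key: if the wrong key were chosen 9/10 times, the test would only fail when all 10 attempts miss, which happens with probability 0.9^10 ≈ 0.35, so that clearly broken behavior would still pass roughly 65% of the time.)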

Contributor (Author):


I should probably mention that we're not actually seeing this test fail more often than expected; there was a flaw in my testing methodology when I got those results. I can change this to 0.0002^2, which is still a really small probability; I just wasn't sure what number to choose.
Technically it is possible to choose the wrong key 9/10 times, it's just very improbable.
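
(For scale: at the 0.02% rate, two wrong picks in a row have probability 0.0002^2 = 4 × 10^-8, or roughly one spurious failure in 25 million runs.)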

Contributor:


I think it's important to identify what is being tested here. If I'm following the assertions correctly, the test wants to show that when the expected key is selected, it behaves in a certain way. The original test even went as far as to assume that it would never pick another one.

Now if we're wanting to test the distribution of it picking another one, we should have a more specialized test for that, as highlighted in #3922; ideally one that doesn't use self-talk but calls the API directly and has a large sample size to eliminate flakiness.
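
A rough sketch of what such a distribution test could look like, in plain C, with a hypothetical select_ticket_key() standing in for the internal weighted-selection routine and made-up weights:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the weighted key selection routine; a dedicated
 * test would call the real internal API directly instead. Weights here are
 * made up: key 0 with probability 0.9998, key 1 otherwise. */
static int select_ticket_key(void)
{
    return (rand() % 10000) < 9998 ? 0 : 1;
}

int main(void)
{
    const int samples = 1000000;
    int counts[2] = { 0, 0 };

    for (int i = 0; i < samples; i++) {
        counts[select_ticket_key()]++;
    }

    /* With a large sample the observed frequency should sit close to the
     * configured weight; a generous tolerance keeps the check non-flaky. */
    double observed = (double) counts[0] / samples;
    printf("expected key selected with frequency %f\n", observed);
    return observed > 0.99 ? 0 : 1;
}
```
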

Contributor:


The limit of 10 is mostly just an arbitrary number, chosen so we can say it's very unlikely the test will fail this way again without letting it spin indefinitely.

Contributor (Author):


Talked online and lowered allowed failures to 1. Also updated the comment.

@maddeleine maddeleine requested a review from lrstewart April 6, 2023 21:39
@maddeleine maddeleine enabled auto-merge (squash) April 11, 2023 18:25
@maddeleine maddeleine merged commit 49097f4 into aws:main Apr 11, 2023
@maddeleine maddeleine deleted the flaky_test branch April 11, 2023 19:47