Report coverage stability in the dashboard #5
Comments
A silly question: what is coverage stability? I'd guess some Jaccard-similarity-derived metric of coverage between all runs of the same input, such that "100%" is deterministic and anything else is not?
Fraction of (repeated) inputs for which we observe identical coverage each time; it's pretty easy to measure. Concretely, I'd replay from our seed pool at a diminishing rate (say 1% of executions, halving the rate each time through the pool, over-scheduling new pool entries to catch up, and occasionally replaying entries twice as a cache-hit heuristic), so we get decent confidence fairly promptly but with low overhead over time.

We're really just reporting an observation here rather than estimating some latent property, partly because it's unclear what that property would be: conceptually we might have different flake rates per input, and we may or may not care about that. A particularly common case is observing different coverage only the first time we play some input; it's ambiguous whether we actually want such a seed in our pool, since mutants of it are less likely to help.
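A minimal sketch of what that bookkeeping could look like, assuming coverage is reduced to something hashable per execution; `StabilityTracker`, `record`, and `replay_rate` are illustrative names for this sketch, not anything in the codebase:

```python
from collections import defaultdict


class StabilityTracker:
    """Rough bookkeeping for coverage stability (illustrative names only).

    Coverage is assumed to be hashable, e.g. a frozenset of covered branches.
    """

    def __init__(self):
        self._first_seen = {}              # input key -> coverage on first execution
        self._replay_count = defaultdict(int)
        self._diverged = set()             # keys where any replay differed

    def record(self, key, coverage):
        """Call once per execution of `key`, with the coverage it produced."""
        if key not in self._first_seen:
            self._first_seen[key] = coverage
            return
        self._replay_count[key] += 1
        if coverage != self._first_seen[key]:
            self._diverged.add(key)

    @property
    def stability(self):
        """Fraction of replayed inputs whose coverage matched every time."""
        replayed = [k for k in self._first_seen if self._replay_count[k]]
        if not replayed:
            return None  # nothing replayed yet; the dashboard can show "unknown"
        return sum(k not in self._diverged for k in replayed) / len(replayed)


def replay_rate(passes_through_pool: int, initial: float = 0.01) -> float:
    """Share of executions to spend on replay: start at ~1%, halve each pass."""
    return initial / (2 ** passes_through_pool)
```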
We should also report the coverage stability fraction, along with a rating: stable (100%), unstable (85%–100%), or serious problem (<85%); and explain the difference between stability (= coverage) and flakiness (= outcome). Stability is mostly an efficiency concern; flakiness means your test is broken.
This mostly requires measuring both of these on the backend and then plumbing the data around; it's not hugely involved.
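For the rating, a small helper using the thresholds above might look like this (the function name is a placeholder):

```python
def stability_rating(fraction: float) -> str:
    """Map a coverage-stability fraction to a dashboard rating."""
    if fraction >= 1.0:
        return "stable"
    if fraction >= 0.85:
        return "unstable"
    return "serious problem"
```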