Report coverage stability in the dashboard #5

Open
Zac-HD opened this issue Oct 26, 2022 · 2 comments

Zac-HD (Owner) commented Oct 26, 2022

We should also report the coverage stability fraction, including a rating: stable (100%), unstable (85%–100%), or serious problem (<85%); and explain the difference between stability (= coverage) and flakiness (= outcome). Stability is mostly an efficiency thing; flakiness means your test is broken.

This mostly requires measuring both of these on the backend and then plumbing the data around; it's not hugely involved.
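
For illustration, a minimal sketch of the proposed rating thresholds, assuming the stability fraction has already been measured on the backend (the `Rating` enum and function name are hypothetical, not an existing API):

```python
from enum import Enum


class Rating(Enum):
    STABLE = "stable"
    UNSTABLE = "unstable"
    SERIOUS_PROBLEM = "serious problem"


def stability_rating(stable_fraction: float) -> Rating:
    """Map the fraction of inputs with identical coverage on replay to a rating."""
    assert 0.0 <= stable_fraction <= 1.0
    if stable_fraction == 1.0:
        return Rating.STABLE
    if stable_fraction >= 0.85:
        return Rating.UNSTABLE
    return Rating.SERIOUS_PROBLEM
```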

tybug (Collaborator) commented Jan 7, 2025

A silly question: what is coverage stability? I'd guess some Jaccard-similarity-derived metric of coverage between all runs of the same input, such that "100%" is deterministic and anything else is not?

Zac-HD (Owner, Author) commented Jan 8, 2025

The fraction of (repeated) inputs for which we observe identical coverage each time; it's pretty easy to measure. Concretely, I'd replay from our seed pool at a diminishing rate (say, 1% of executions, halving the rate each time through the pool, over-scheduling new pool entries to catch up, and occasionally replaying entries twice as a cache-hit heuristic), so we get decent confidence fairly promptly but with low overhead over time.
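
To make that concrete, here's a rough sketch of such a schedule. Everything below is an illustrative assumption rather than actual HypoFuzz code: the class name and bookkeeping are invented, probability is halved per replay as an approximation of halving per pass through the pool, and the occasional double replay is omitted for brevity.

```python
import random


class ReplayScheduler:
    """Hypothetical sketch: replay pool entries at a diminishing rate.

    Each pool entry starts at the base replay rate (~1% of executions) and has
    its replay probability halved after every replay, so new entries are
    over-scheduled relative to long-standing ones and confidence accumulates
    quickly at first with low overhead later.
    """

    def __init__(self, base_rate: float = 0.01) -> None:
        self.base_rate = base_rate
        self.replay_prob: dict[bytes, float] = {}
        self.baseline_coverage: dict[bytes, frozenset] = {}
        self.identical_so_far: dict[bytes, bool] = {}

    def add_seed(self, seed: bytes, coverage: frozenset) -> None:
        """Register a new pool entry and the coverage first observed for it."""
        self.replay_prob[seed] = self.base_rate
        self.baseline_coverage[seed] = coverage

    def pick_replay(self) -> bytes | None:
        """Pick a seed to replay this execution, or None to fuzz as usual."""
        for seed, prob in self.replay_prob.items():
            if random.random() < prob:
                self.replay_prob[seed] = prob / 2  # diminishing replay rate
                return seed
        return None

    def record_replay(self, seed: bytes, coverage: frozenset) -> None:
        """Mark a seed unstable if any replay's coverage differs from baseline."""
        same = coverage == self.baseline_coverage[seed]
        self.identical_so_far[seed] = self.identical_so_far.get(seed, True) and same

    def stability_fraction(self) -> float:
        """Fraction of replayed seeds whose coverage was identical every time."""
        if not self.identical_so_far:
            return 1.0
        return sum(self.identical_so_far.values()) / len(self.identical_so_far)
```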

We're really just reporting an observation here rather than estimating some latent property, partly because it's unclear what that latent property would be: conceptually we might have a different flake rate per input, may or may not care about that, etc. A particularly common case is observing different coverage only the first time we replay some input; it's ambiguous whether we actually want such a seed in our pool, since mutants of it are less likely to help.
