Update flaky test script to print more details and detect flaky exceptions and timeouts#14731
Conversation
…detect flaky, unexpected test errors like exceptions and timeouts. Signed-off-by: Randy Miller <rmiller14@gmail.com>
sunjayBhatia
left a comment
There was a problem hiding this comment.
Big fan of this change, looks like it gives much more relevant content in the flake report 👍🏽
Might drive by and do some more comments on the implementation details later but as an overview looks good
lizan
left a comment
There was a problem hiding this comment.
This is really awesome. Thanks!
A bonus point would be log those with AZP commands: https://github.com/microsoft/azure-pipelines-tasks/blob/master/docs/authoring/commands.md
and set CI status to SucceededWithIssues. That can be in a follow up PR.
…dback. Signed-off-by: Randy Miller <rmiller14@gmail.com>
That sounds cool. I'll do this in a follow-up PR. |
|
There seems to be a bug in shell_utils.sh, so I'll fix in the next commit. |
Signed-off-by: Randy Miller <rmiller14@gmail.com>
Signed-off-by: Randy Miller <rmiller14@gmail.com>
Signed-off-by: Randy Miller <rmiller14@gmail.com>
|
The script isn't compatible with python 3.6. Updating. |
… available (even for Windows) Signed-off-by: Randy Miller <rmiller14@gmail.com>
Fixed and tested. |
Signed-off-by: Randy Miller <rmiller14@gmail.com>
a7888dd to
5a05a40
Compare
|
@lizan CI is passing now. There aren't any major changes to the script since you last reviewed. |
|
Ping. |
Commit Message:
Update the flaky test script to print more details and detect flaky, unexpected test errors like exceptions and timeouts, with the goal of making the notifications more actionable.
Signed-off-by: Randy Miller rmiller14@gmail.com
Additional Description:
The flaky test script, ci/flaky/process_xml.py, is executed on every CI run, delivering a notification to the Slack channel "test-flaky" if there are any flaky test failures. Those notifications aren't as useful as they could be though, for a number of reasons:
The goal of this PR is to make these flaky test notifications more actionable by addressing the 4 bullets above. Below is what a notification would look like should this PR get merged. The last 2 flakes are not captured at all today by the current state of the script, as those flakes are unexpected test "errors" (eg, exceptions or timeouts) rather than test "failures" (eg, test assert failed).
Risk Level: N/A for code/test, low for the flaky test script due to the amount of churn.
Testing: Ran locally many, many times. For a portion of those runs, I treated normal failures as flakes to get better coverage on the parsing helpers. Not sure how to test the changes to bazel.yml though.