Conversation
tools/flakiness_checker.py
Outdated
"MXNET_TEST_COUNT, defaults to 500") | ||
|
||
parser.add_argument("-s", "--seed", | ||
help="random seed, passed as MXNET_TEST_SEED") |
Maybe also elaborate that if no seed is provided, a random seed is used.
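For example, the help string could be extended along these lines (a sketch, not the PR's actual wording):

parser.add_argument("-s", "--seed", type=int,
    help="random seed, passed as MXNET_TEST_SEED; "
         "if omitted, a random seed is used for each trial")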
tools/flakiness_checker.py
Outdated
test_file += ".py"
test_path = test_file
top = str(subprocess.check_output(["git", "rev-parse", "--show-toplevel"]),
    errors= "strict").strip()
align the arguments
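That is, aligning the continuation line with the opening parenthesis:

top = str(subprocess.check_output(["git", "rev-parse", "--show-toplevel"]),
          errors="strict").strip()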
tools/flakiness_checker.py
Outdated
new_env["MXNET_TEST_SEED"] = seed | ||
|
||
code = subprocess.call(["nosetests", "-s","--verbose",test_path], env = new_env) | ||
print("nosetests completed with return code " + str(code)) |
I'm not sure if we need to unset the environment variables after the run? Zero effect on the user's environment variables is preferred.
No need, the environment variables passed via env= are local to the spawned process; the user's actual environment is not modified.
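A quick standalone sketch to convince yourself (not part of the PR):

import os
import subprocess

new_env = os.environ.copy()
new_env["MXNET_TEST_SEED"] = "42"
# The child process sees the variable...
subprocess.call(["python", "-c",
                 "import os; print(os.environ['MXNET_TEST_SEED'])"],
                env=new_env)
# ...but the parent environment is untouched: this prints False,
# assuming the variable was not already set before the run.
print("MXNET_TEST_SEED" in os.environ)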
@marcoabreu Please take a look and share your comments on this; it could be a useful tool for contributors to check for flakiness with an out-of-the-box experience.
tools/flakiness_checker.py
Outdated
parser.add_argument("test", action=NameAction,
    help="file name and function name of test, "
         "provided in the format: file_name.test_name")
I think there are two formats? One is <path-to-file>:<test-name>, while the other is <name-of-test-file>.<test_name>?
I've added this in my most recent commit and updated the help message accordingly.
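So usage looks like this for both formats (the test name is borrowed from the discussion below; the path is illustrative):

python tools/flakiness_checker.py tests/python/unittest/test_operator.py:test_sigmoid -s 42
python tools/flakiness_checker.py test_operator.test_sigmoid -s 42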
I'm currently on vacation and will be back next Friday. I'll do a review then.
@marcoabreu We are not deploying this yet but using it as a developer tool. Since this is the work of our intern @cetsai, who is only with us for a short period of time, is it okay if we do a review within our team and merge it?
tools/flakiness_checker.py
Outdated
import argparse

DEFAULT_NUM_TRIALS = 500
A flakiness test should be at least 10000 runs.
0.99^500 < 0.01, which means that a flaky test that succeeds with 99% of probability has less than 1% chance of succeeding 500 runs.
BTW, this is just the default value; for this tool you can always provide an argument to specify how many trials you want.
Our biggest issues are random seeds as well as race conditions; both cases are not directly related to the number of runs but are really just pure randomness. My experience has shown that sometimes we only reveal a problem on the 8000th iteration.

My concern is that people trust the default values (and they should). If everyone has to change the value themselves, we should just increase it directly. I don't know of anybody using 500 as an iteration count.
Actually, some unit tests may take more than 30 seconds each; if you run one 10000 times by default, that's 300000 seconds, which is ~3.5 days. If a test is sensitive to random seeds, it fails on input values generated from a randomly drawn seed, so each trial fails with a fixed probability. BTW, 0.999^7000 < 0.001, so we don't even need 10000 runs, I guess...
Our software runs in services that serve millions of requests; even a 0.1% chance can be devastating. We should err on the side of caution as a default.
On the other hand, I think the main purpose of running tests is to check your own implementation rather than making tests pass all the time. For discovering problems in your own code, a smaller number of trials is sufficient. In addition, when we deploy this on CI we don't want it to introduce much overhead. As stated in the design doc of Carl's project, the gold standard for checking flakiness is running a test an infinite number of times, or at least with all possible seeds (0-2147483647), but that is impossible in practice.
Well, there are a lot of unit tests that last more than 30 seconds... even if a test only takes 3.5 seconds, 10000 runs would cost ~10 hours.
Well, the purpose of this script is specifically to detect flaky tests, right? I'd expect the expectation would then be that a test passes all the time.

Sure, we won't achieve perfect coverage and at some point you get diminishing returns, but I have seen that even you usually use 10000 runs as the standard value for your tests. Time should also not be that much of a factor (even a few hours are fine) because this is intended to be run in the background. Robustness tests are a time-intensive task, and they should be designed to leave as little room for error as possible (with costs in mind).

For CI you don't have to worry: this will run asynchronously, and we'll make a proper assessment of what the best values are for each case - it will probably be a time-boxed execution. But we are not at that stage yet.

FYI, we have 455 tests < 0.1s, 509 < 1s, 565 < 3s and 63 > 3s; 628 tests in total.
But is this supposed to be a required check for every PR? If it takes too long, it becomes an obstacle for PRs.
tools/flakiness_checker.py
Outdated
code = subprocess.call(["nosetests", "-s", "--verbose", test_path], env=new_env)
print("nosetests completed with return code " + str(code))

def find_test_path(test_file):
Could you add some documentation? From the name alone it was not clear to me what exactly this does. Also, how is it going to handle duplicates and the fact that we import tests into other test files, e.g. test_operator_gpu?
If you take a closer look at the code, you'll see that the file name and test name are both provided by the user. So if you would like to run the GPU version of test_sigmoid, you'd do <path-to-test_operator_gpu.py>:test_sigmoid or test_operator_gpu.test_sigmoid. The problem you're asking about is probably "if we modified some tests in test_operator.py, how do we detect that we also need to run the GPU version?", which is designed to be handled by other components of the project (this is stated in the design doc too).
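For the documentation request, a docstring along these lines would cover it (a sketch based on the behavior described in this thread, not the PR's final text):

def find_test_path(test_file):
    """Searches for the test file and returns its full path.

    Walks the repository tree (rooted at the output of
    `git rev-parse --show-toplevel`) and returns the path of the
    directory containing test_file, joined with the file name.
    """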
tools/flakiness_checker.py
Outdated
for (path, names, files) in os.walk(top):
    if test_file in files:
        test_path = path + "/" + test_path
Use os.path.join.
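That is:

test_path = os.path.join(path, test_path)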
Sure, my review is non-blocking. Please just address my comments.
tools/flakiness_checker.py
Outdated
else:
    new_env["MXNET_TEST_SEED"] = seed

code = subprocess.call(["nosetests", "-s", "--verbose", test_path],
We should consider passing the logging level as an argument and setting it here, or use a default logging-level argument.
Good point, I'll add that in my next commit.
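Something along these lines would work (a sketch; the flag name and choices are illustrative, not the PR's final interface):

import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("-l", "--log-level", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                    help="logging level, defaults to INFO")
args = parser.parse_args()
logging.basicConfig(level=getattr(logging, args.log_level))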
tools/flakiness_checker.py
Outdated
def run_test_trials(args):
    test_path = args.test_path + ":" + args.test_name
    print("testing: " + test_path)
Try to use logging instead of print. This allows you to make log messages without having to concatenate strings
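For example (a sketch of the suggestion, reusing the names from the snippet above):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Lazy %-style formatting: the message is only rendered if the level
# is enabled, and there is no manual string concatenation.
logger.info("testing: %s", test_path)
logger.info("nosetests completed with return code %d", code)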
If a directory was provided as part of the argument, the directory will be
joined with cwd unless it was an absolute path, in which case, the
absolute path will be used instead.
"""
I don't know if the behaviour described in this doc actually matches the code. I'm a bit concerned about the absolute path, because you always join with cwd. Also, you should only append the .py if it isn't already present.
os.path.join will only keep the last absolute path provided, so if the user provides an absolute path it will override cwd; I've tested this and it does work this way. Currently I'm using re.split to process the command-line argument containing the test file and test name, and I remove .py when it's present.
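A quick demonstration of that os.path.join behavior (paths are made up):

import os

# os.path.join discards everything before the last absolute component,
# so an absolute user-supplied path overrides the cwd prefix.
print(os.path.join("/home/user/mxnet", "tests/test_example.py"))
# -> /home/user/mxnet/tests/test_example.py
print(os.path.join("/home/user/mxnet", "/somewhere/else/test_example.py"))
# -> /somewhere/else/test_example.py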
Awesome! Very nice catch and approach!
Thanks a lot! This is a great step towards more stable tests!
@marcoabreu Thanks for the quick reviews!
Could you do something more sophisticated that can be automated? With nose internals you could grab all the tests, maybe blacklist some, run them and time how long they take, then set a time budget and run each test for a number of iterations with different seeds to detect flakiness. Example of using nose internals:

In [4]: from nose import loader
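A rough sketch of that idea (the nose calls are from memory of its API, and the test directory is hypothetical):

from nose import loader

# Enumerate the tests under a directory so they can be timed and then
# re-run for a number of iterations with different seeds.
test_loader = loader.TestLoader()
suite = test_loader.loadTestsFromDir("tests/python/unittest")
for test in suite:
    print(test)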
@larroy I would like to clarify the scope and use case of this script: it is only part of the final automated checker bot for flaky tests, and it's being submitted now as a standalone helper tool so developers have an easier way to check their tests. The reason we're submitting it is that we have seen the need for such a tool during the flaky test fixes. What you suggest is nice to have, but it probably does not fit the scope of this tool. I can share the link to the project doc with you if you would like more details on the whole project and where this small tool stands within it.
Force-pushed from 5e08b67 to 41f5d33.
@larroy Are we good to move ahead?
@larroy @marcoabreu Is this good for merge? The second batch of flakiness fixes is about to start, and we'd like to make this available to all developers involved.
* add flakiness checker
* fixed style and argument parsing
* added verbosity option, further documentation, etc.
* added logging
* added check for invalid argument
* updated error message for specificity
* fixed help message
Description
Added a new script under tools/ called flakiness_checker.py, which runs a test a large number of times to check for flakiness. @haojin2 @eric-haibin-lin @azai91 @anirudh2290