Fix multiple problems with views-of-views #10505
Conversation
Due to what was probably a copy/paste error, the tests were grabbing a global variable `A` (rather than an argument) that happened to be a plain array. So we weren't really testing views-of-views. Unsurprisingly, fixing this problem uncovered bugs. Many "simple" things worked, but sufficiently complicated constructs like `A = rand(20, 20, 20); B = slice(A, 1:5, 6, 6:9); C = sub(B, :)` (specifically, those that use linear indexing in creation of a view-of-view) certainly had problems. Drat. Well, hopefully this has been rare up until now; it seems rather likely that folks would have gotten BoundsErrors if anyone had actually been using this. But once we start returning views from getindex it will become common, so better that it was caught now.

In debugging and fixing the problems, I took the opportunity to do some cleanup to make this more maintainable, specifically: adopting a consistent naming convention, improving many variable names, and adding comments.
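A hypothetical illustration of the kind of mistake being described, not the actual test code: the function names are made up, and `view` is used in place of the pre-0.4 `sub`/`slice`.

```julia
A = rand(20, 20, 20)                 # global plain array defined at the top of the test file

function check_indexing(SB)          # SB is meant to be the view under test...
    @assert A[1, 1, 1] == A[1]       # ...but the body accidentally uses the global A,
end                                  # so only plain-Array indexing ever gets exercised

function check_indexing_fixed(SB)
    @assert SB[1, 1, 1] == SB[1]     # the argument is used, so SubArray indexing really gets tested
end

check_indexing_fixed(view(A, 1:5, 6, 6:9))
```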
The appveyor failure looks like a timeout (unsurprisingly).
If we really need to run a substantially slower set of tests on every commit and PR, an easy option would be to split up the existing tests into more entries in the build matrix. This will impact our queue time (esp. on appveyor), but not much else.

While our tests have been gradually growing in coverage lately, it also seems there have been some regressions in codegen speed. Profiling and taking a few performance passes on codegen might be called for soon, particularly if we're aiming to upgrade LLVM before 0.4, which AIUI will only make this problem worse.

An outstanding todo has been to move more testing (the perf suite and otherwise) onto the buildbots. Other than Elliot and myself, I don't think many people are paying very close attention to those at the moment, though.
Where does one go to look at buildbot results, other than the coveralls page (which I do check now and then)? Indeed, a puzzling gap in coverage helped me discover the problem here.

Interesting failure in the bitarray tests here; I haven't seen that one before.
http://buildbot.e.ip.saba.us:8010/builders - I have a hard time remembering that myself; it would be great if build.julialang.org could be made to redirect there.
The redirect can be set up by @StefanKarpinski. It is probably time to have a comprehensive test suite and a quick test suite.
I saw a very similar failure on Win32 during the parallel tests on #10512. It was new to me, too. Glad that it isn't an issue with either of these PRs, but sad that it looks like we've got another non-deterministic type error on our hands.
OK, I think I've come up with a good solution: run a small random fraction of the subarray tests. There's an environment variable you can set to specify the fraction (including all of them). Strategy or name bikeshedding welcome.
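A minimal sketch of that strategy, assuming current Julia syntax; the index values, array size, and the lower-case `testfraction` name are placeholders, not the actual test code.

```julia
# Each combination of index types is run with probability `testfraction`,
# which can be overridden (e.g. set to 1 to run everything) via the environment.
testfraction = parse(Float64, get(ENV, "testfraction", "0.025"))

index_choices = (Colon(), 3, 2:7, 1:2:9, [4, 1, 5])   # Colon, Int, UnitRange, StepRange, Vector
A = rand(10, 10, 10)
for i1 in index_choices, i2 in index_choices, i3 in index_choices
    rand() <= testfraction || continue                # randomly skip most combinations
    V = view(A, i1, i2, i3)
    @assert sum(V) ≈ sum(A[i1, i2, i3])               # stand-in for the real SubArray checks
end
```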
I'm not a fan of the idea of running a random subset of tests. Nondeterministic test failures are probably the single most frustrating thing about working on/with Julia. Unless you're confident that the percentage you've chosen would be enough to always immediately catch any problem that might get introduced that would cause any of these tests to fail...

If we're going to choose based on an environment variable, wouldn't it be more customary to give it an all-caps name?
I appreciate the feedback as always, @tkelman. It's great to have such a conscientious steward of our CI infrastructure.

I'm not confident that 2.5% of the tests would be enough to catch any problem, so the issue you are worried about is real. But I think this is in a different category from the nondeterministic failures that plague other julia development (BTW, on Linux ...)

The alternative is to hand-pick a subset of tests to run consistently. That's not bad either, but I'm not confident of my ability to pick a subset that would catch any and all possible errors. Therefore, a potential outcome is that a bug might be introduced that affects a subset that isn't tested. Indeed, developers will change their code until the test suite passes, so a consistent, biased subset of the full set of tests might encourage this outcome. In that scenario, if no one is watching the buildbots, then it could go a long time without being caught.

Anyway, I'm willing to go either way here, but to me the random subset seems the safer solution. Indeed I can change to all-caps. I'll do that if this strategy ends up being approved, or delete it if not.
The reason I do not like picking a random subset of tests is that you can easily have a failure on one subset and not have it occur again. You never know then whether it is some CI environment issue or a real failure. I would much rather pick a small set (even imperfect) for regular testing where failures are deterministic, and put in efforts towards building a comprehensive test suite that runs nightly.
One option would be to pick a fixed random subset of these tests. That is deterministic but arbitrary.
Yep. Random tests are dangerous; better to pick a fixed random set. Would it be possible to achieve this with a fixed seed, or is the random sequence generated by a given seed hardware-specific?
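A sketch of the fixed-seed variant in current Julia syntax; for a given seed the MersenneTwister stream is the same on all hardware, although it is not guaranteed to stay identical across Julia releases.

```julia
using Random

rng  = MersenneTwister(1234)                      # fixed seed, so the "random" subset is reproducible
keep = randsubseq(rng, collect(1:125), 0.025)     # indices of the view combinations to actually test
```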
I dislike the "Write fewer tests" option, or never running all of them. If we make the choice of test permutations to run random, we should at least make it trivial to rerun the same set of tests (in the same order). The test failure message must be very clear that a randomly chosen test failed, and give detailed steps to rerun the exact same test run (e.g. print the seed used for the RNG).

Another way to pick tests would be to run every n-th one. A convenient interface might be:

```julia
@test_subset 5 for I_type in subtypes(Integer), F_type in subtypes(FloatingPoint)
    testfun(zero(I_type), one(F_type))
end
```

Where the `5` means that only every 5th combination is actually run.

PS: What is the recursive option for `subtypes`?
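A rough sketch of the every-n-th idea behind the proposed `@test_subset` interface, written as a plain function (with made-up names and concrete types) rather than a macro:

```julia
# Run `testfun` on every `stride`-th combination of the given choices,
# so the tested subset is fixed and deterministic rather than random.
function test_subset(testfun, stride::Int, choices...)
    for (i, combo) in enumerate(Iterators.product(choices...))
        i % stride == 0 || continue
        testfun(combo...)
    end
end

int_types   = (Int8, Int16, Int32, Int64)
float_types = (Float16, Float32, Float64)
test_subset(5, int_types, float_types) do I, F
    @assert zero(I) + one(F) == one(F)   # stand-in for a real test body
end
```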
What if I were to run all the tests inside a ...

@ivarne, your interface is a good idea. One challenge, though: currently, to run 2.5% of the total test suite, I construct a view from the array using 15.8% of the possible choices, and then a view from these views with another 15.8% of the choices.
This allows the tests to complete in a reasonable time. One can force all tests to run by setting the environment variable `testfraction=1`.
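For what it's worth, the arithmetic implied by those percentages (my reconstruction, not text from the PR):

```julia
per_level = sqrt(0.025)   # ≈ 0.158: fraction of index choices kept at each of the two view levels
total     = per_level^2   # ≈ 0.025: fraction of all view-of-view combinations, i.e. 2.5%
```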
It feels simpler and more robust to me to have an extensive test suite that would be run nightly and on demand, and to keep a (hopefully complete enough) subset to run on every PR. With a random subset, one would never be sure anyway that the next build wouldn't trigger a failure.

At least with a deterministic test suite, we can decide to have the most typical use cases in the default subset, so that 99% of mistakes are caught before merging. And we'll still be able to run the full test suite when a PR looks like it could break something subtle.

Finally, if some subset was known to be run only nightly, we could make it run very complete tests, like going over all possible combinations of types from Base. If tests were chosen at random, we would still be inclined to avoid overly time-consuming checks like that, for fear of diluting essential ones in an ocean of "just in case" tests.
The consensus seems clearly against random subsets, so I will change this as soon as I can.
There's another option, which is to craft a subset of tests that is hard to "game" and designed so that it covers most of the corner cases. Of course, that's easier said than done, but it would be ideal.
We don't anticipate people writing intentionally broken code that passes the first level of tests. The goal is rather that we should have a high probability of catching inadvertently introduced problems.
109991d to 9a86698: Fix multiple problems with views-of-views
OK, I went with ...
Thought I'd mention here that the parallel tests also have conditional execution of tests dependent on the env variable PTEST_FULL (https://github.com/JuliaLang/julia/blob/master/test/parallel.jl#L185).
@amitmurthy I think most of us had missed that comment and were unaware of that, otherwise we would probably have more people running with PTEST_FULL set.

(Also, when linking to lines of code, please hit `y` first so you get a permanent link.)
It's been introduced quite a ways back, which is probably the reason it has not been noticed. We can standardize on JULIA_TESTFULL. As written, the additional parallel tests launch and tear down up to 90 workers via ssh, which takes around 7-8 GB of additional RAM.
I think that number should be reduced a bit if we want more people running the full parallel test.
Thanks for bringing this to our attention, @amitmurthy. I agree with all of your proposals.
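A minimal sketch of the gating being discussed, in current Julia (where `addprocs`/`rmprocs` live in `Distributed`); the `JULIA_TESTFULL` name follows the standardization proposed above, the worker count is a made-up reduced value, and this is not the actual `test/parallel.jl` code.

```julia
using Distributed

# Run the heavier parallel tests only when explicitly requested via the environment.
if haskey(ENV, "JULIA_TESTFULL") || haskey(ENV, "PTEST_FULL")
    addprocs(30)            # a reduced worker count, instead of launching ~90 ssh workers
    # ... run the extended parallel tests here ...
    rmprocs(workers())      # tear the extra workers back down
end
```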
This is a public service announcement from your local "oh, crap" department.
Due to what was probably a copy/paste error, the SubArray tests were grabbing a global variable `A` rather than an argument `SB`. `A` happened to be a plain array, with the consequence that the SubArray tests were not thoroughly testing views-of-views. I note that, had I passed that file through Lint, it would have caught the problem.
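For reference, the construct referred to as "the one above" is the one quoted in the summary at the top of this thread, reproduced here with the pre-0.4 `slice`/`sub` names (the predecessors of today's `view`):

```julia
A = rand(20, 20, 20)
B = slice(A, 1:5, 6, 6:9)   # a view with the middle dimension sliced out
C = sub(B, :)               # a view of that view, created via linear indexing
```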
Unsurprisingly, fixing this test problem uncovered bugs in my implementation of SubArrays. Many "simple" (hah!) things worked because I had tested quite a lot via the command line, and also copied in some old hand-written tests. But constructs like the one above (specifically, those that use linear indexing in creation of a view-of-view, when one of the uber-parent dimensions had been sliced out) certainly had problems. I think people would have quickly gotten `BoundsError`s if this were being commonly used, so the fact that this hasn't been noticed until now is a good sign. Once we start returning views from indexing operations, views-of-views are going to be really common, so it surely would have turned up then.

One major downside to this fix is that, now that the tests are running properly, the `subarray` test takes (on my machine) 14 minutes to complete. I think that's probably too long. However, I also don't think the test is excessive in its thoroughness: the views are created using indexes chosen from one `Colon`, one `Int`, one `UnitRange{Int}`, one `StepRange{Int}`, and one `Vector{Int}`. It uses all possible combinations of these, leading to 125 views. It then creates views-of-views using indexes of the same types, for 1, 2, and 3-dimensional indexing. That's a lot of combinations (and a lot of code to compile), but the different choices have really important consequences for both cartesian indexing and inference of `LD` (the maximum dimension with a uniform stride), which is key to our current SubArrays' ability to perform fast linear indexing in many situations. The logic behind this inference is quite complicated, and seems to demand thorough testing.

I suspect there's no alternative but to hand-pick a subset of these combinations for routine testing, and perhaps save the exhaustive tests for a rarely-run test suite. But I didn't want to start decreasing the stringency of our tests without discussing it first, especially when an accidental lack of stringency led to bugs going unnoticed.
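A sketch of the combinatorial scheme described above, in current syntax (`view` instead of `sub`/`slice`); the index values and array size are made up, and the second level is abbreviated to the linear-indexing case that exposed the bugs.

```julia
index_choices = (Colon(), 3, 2:7, 1:2:9, [4, 1, 5])   # one Colon, Int, UnitRange, StepRange, Vector
A = rand(10, 10, 10)
for i1 in index_choices, i2 in index_choices, i3 in index_choices
    V = view(A, i1, i2, i3)                 # 5^3 = 125 first-level views
    ndims(V) == 0 && continue               # the all-scalar corner case is omitted from this sketch
    W = view(V, 1:length(V))                # a view-of-view built with a linear (1-d) index
    @assert collect(W) == vec(collect(V))   # the real tests also use 2-d and 3-d indexes here
end
```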
[Note that the large size of this commit partly reflects cleanup as well as bugfixes.]
Sorry, all!