Upgrade to LLVM 3.7.1 and switch over CI #14623
Both win32 and win64 are failing on appveyor with |
Now merged into |
Wow, huge! How carefully have the perf implications been looked into for this? I recall a lot of work on bringing codegen times down to a relatively small regression, but how about the generated code on nontrivial code? |
The plan is to get this in for now, and then tackle the rest going forward on master as undoubtedly we will find more issues. |
Perhaps this is also where the benchmarking infrastructure will come in handy. |
No time like the present?
The CI tracker currently only has a subset of the array benchmarks in Base, so some manual perf testing would also be helpful. |
FWIW I see the same performance in my packages and some of them (NearestNeighbors.jl) have previously been quite good at detecting performance regressions, at least in the array code. Just one sample, but hey. |
@staticfloat we're going to need a newer gtar on the linux buildbots, apparently they don't understand |
Looks like some of the raw time regressions could be attributable to GC, but some might go away if johnmyleswhite/Benchmarks.jl#40 was in play. |
Ok, fine, I'll write us a better memory manager. |
Also, do people know what other JITs do to avoid RWX pages? I looked at v8 and openjdk, but both seem to have RWX pages. |
LuaJIT does RW first and then switches to RX. (related: http://lists.llvm.org/pipermail/llvm-dev/2012-July/051841.html) |
Right, that's what LLVM 3.7 does too, but we run into some fragmentation problems which artificially bloat our memory usage. |
What we could do is go RW->RX and put it back into RW when we want to append. There is a problem with multi-threaded code, but I'm sure we could solve that just by holding the thread until we're done writing to the page. I'm not sure that's significantly better than just having an RWX page though. An attacker just has to time the write to the page correctly. Also, I'm not necessarily arguing that it should be our job to protect against this, just exploring if there's a reasonable default. |
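The RW->RX->RW cycle described above can be sketched with POSIX `mmap` and `mprotect`. This is a minimal illustration, not the memory manager from this PR or LLVM's: the `code_region` type and the `region_*` functions are hypothetical names, the sketch assumes a POSIX system with `MAP_ANONYMOUS`, and it leaves out the thread-holding step mentioned in the comment.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical code region: one mapping that is filled incrementally. */
typedef struct {
    unsigned char *base; /* start of the mapping */
    size_t size;         /* total mapped bytes */
    size_t used;         /* bytes already emitted */
} code_region;

/* Map `npages` pages read+write; nothing is executable yet. */
int region_init(code_region *r, size_t npages) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0) return -1;
    size_t size = npages * (size_t)page;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    r->base = p;
    r->size = size;
    r->used = 0;
    return 0;
}

/* Append bytes: flip the region to RW, copy, then flip it to RX.
 * The region is never writable and executable at the same time.
 * A multi-threaded runtime would have to hold any thread executing
 * from this region while it is in the RW state. */
int region_append(code_region *r, const void *code, size_t len) {
    if (len > r->size - r->used) return -1;
    if (mprotect(r->base, r->size, PROT_READ | PROT_WRITE) != 0) return -1;
    memcpy(r->base + r->used, code, len);
    r->used += len;
    if (mprotect(r->base, r->size, PROT_READ | PROT_EXEC) != 0) return -1;
    return 0;
}

/* Release the mapping. */
void region_free(code_region *r) {
    munmap(r->base, r->size);
}
```

A caller would run `region_init(&r, 1)` once and then `region_append` for each newly emitted function. On architectures without coherent instruction caches, the appended range would also need an instruction-cache flush before it is executed.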
We already rely on segfaults for multithreaded GC, so we could easily add the logic for catching not-executable faults, as long as LLVM (or codegen) is able to tell whether an address is being written to by LLVM.
I always thought the whole point is to make the attack window smaller? The attacker can always write to the same page when we are doing codegen (edit: for multi-threading), unless there's some verification pass after we set the page to not-writable, in which case we can also do that when we set the page back (edit: to RX). |
dunno. I should remember #14430 (comment) and add in |
@yuyichao That was my thought as well (i.e. holding the thread on an NX fault). The fact that the attack window is already there is a fair point. The only thing that is worse here is that a potential attacker might more easily be able to determine the allocation address. In any case, I think I actually found a bug in my patch which caused us to waste some memory. Let's see if just fixing that is sufficient. We should still think about a better memory allocator in the future though. |
We could checksum the portion of the page that already exists and crash if it was modified. |
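A minimal sketch of that checksum idea: record a hash of the already-emitted bytes when the page is sealed, and verify it before the page is made writable again. FNV-1a is used only for brevity and is not cryptographic, so this is an assumption made for the example. The `page_guard` type and the `guard_*` functions are hypothetical names, not part of LLVM or Julia.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* FNV-1a, 64-bit: a simple non-cryptographic hash. */
uint64_t fnv1a64(const unsigned char *data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL; /* FNV prime */
    }
    return h;
}

/* Hypothetical bookkeeping for the emitted part of one code page. */
typedef struct {
    const unsigned char *base; /* start of the emitted code */
    size_t used;               /* number of bytes already emitted */
    uint64_t checksum;         /* hash of base[0..used) at seal time */
} page_guard;

/* Record the checksum of the bytes emitted so far. */
void guard_seal(page_guard *g, const unsigned char *base, size_t used) {
    g->base = base;
    g->used = used;
    g->checksum = fnv1a64(base, used);
}

/* Return 1 if the emitted bytes still match the recorded checksum. */
int guard_intact(const page_guard *g) {
    return fnv1a64(g->base, g->used) == g->checksum;
}

/* Crash if the existing code was modified since it was sealed. */
void guard_verify_or_abort(const page_guard *g) {
    if (!guard_intact(g)) abort();
}
```

In an append cycle, `guard_verify_or_abort` would run after the page is switched back to non-writable, and `guard_seal` would then be called again with the new length.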
If detecting the address is an issue, we can hide that by making two maps of the physical pages that we might still emit code into and never set the RX one back. Not sure if this is possible to do on Windows or if it is supported by the LLVM API though. |
Also, my patch does indeed seem to fix the windows issue. Nice! Will commit upstream and add it to the patch set. |
Yeah, Windows allows it. LLVM's memory APIs may not currently support it, but that's what a custom memory manager is for. |
A very minor issue is what happens if the attacker can modify the checksum on the stack... probably not important compared to the one it solves. |
Is the bug Windows only or will it also be worth rebuilding a new revision of the homebrew bottle with the modified patch? |
This bug is windows-only, but there was also a MachO patch that was forgotten (but that one is only a problem in LLVM_ASSERTIONS mode which the bottles may not be in?) |
not sure whether bottles have assertions on, but I think that would be good to do on CI? |
Yeah, probably. |
If the attacker can write the stack you have generally lost because they can overwrite the return address. |
Patch committed upstream as llvm-mirror/llvm@1f644cd. |
Right now can't anybody just (also, can we please, please add the llvm-shlib patch to our patchset so we can support |
Set CXXFLAGS=-DUSE_ORCJIT
(possibly temporary if we need these for something?)
Fewer failures this time around, but still more than you might hope: https://gist.github.com/cf1da5230c275eedaa67 MbedTLS and dependents look like they were broken by #14667, so these aren't all regressions relative to latest master. The nightly binary is old because the centos and osx buildbots have been constantly failing for 2 weeks. There are some strange-looking bounds errors and segfaults in here that may be legit regressions though. (or more likely due to #14474) Enough other things have changed on master in the last few weeks that it's probably best to just merge this and work on fixing things, bringing back osx testing, etc as we go. |
+1 Let's get this merged and address failures as they come in |
Upgrade to LLVM 3.7.1 and switch over CI
🎆 Congrats, big milestone here! |
Am I reading correctly that this makes total travis CI time 2 hours longer? And to think I was holding off on merging jb/functions since it makes CI time 20 minutes longer... |
No, something is going wrong with Travis. |
Ok, I suspected it must be something like that. A single AV build is ~20 minutes longer; is that expected? |
Looks like it's recompiling OpenBLAS for some reason. @tkelman take a look? 20 minutes is possible though moderately more than expected. |
It's how the docker caching works on travis now. We aren't using the ppa any more, we do a source build from scratch but cache the built deps. It only takes so long after clearing out the cache. When the cache is populated it's about AppVeyor is slower as a real perf regression. OSX was so much slower it timed out constantly and we had to turn it off for the time being. |
@JeffBezanson yes, I noted that prior to merging. Total CI build time regression appeared to be about 2x |
@Keno can you look into the |
Will take a look. |
🎉 |
If you could take a look at Brim, Gadfly, Mamba, OpenStreetMap, and/or Winston that'd be awesome. They're all hitting BoundsErrors. JuMP, NamedTuples, and Persist are failing for reasons that don't make sense to me yet, but they don't look BoundsError related. |
So, trying a, b, c, d = (1,2,3) and caused by 53ecbaa changing the number of arguments returned by |
Ah good call, sorry about the false alarm. Does anyone else have commit access to Showoff.jl? If not we may need to redirect metadata to use a mirror/fork for a while if it becomes urgent. |
Ah, I guess I am on 0201437, which doesn't include this PR. But, at that point |
There is definitely something pretty weird going on with Brim related to bounds checks. The test which fails is:
using Brim
A = [1 0; 0 1]
M = partition_lp(A)
When launched from a julia session with
Without that command line option it works with a deprecation warning. |
FWIW, I've just switched the Copr RPM nightlies to LLVM 3.7.1, and all the tests pass. The build is only slightly slower than before (~26 minutes vs. 18 minutes in the best cases). This should increase the testing of this code since there are about 500 downloads each week. |
Thanks to @Keno for doing most of the preparation work, and @staticfloat for handling mac packaging.
There's one ugly git line in the .travis.yml that can go away once staticfloat/homebrew-julia@9642485 and staticfloat/homebrew-julia@e6b8d2d are merged into master of homebrew-julia.
This probably calls for a PkgEval run before merging. Closes #9336.
Fixes: (not yet)
#10595, #12671, #10444, #9222, #9085, #4905, #4418, #3596, #10301, #11037, #11083
^^ these should all be checked manually, and add tests before closing wherever possible
todo: