WIP: Incremental GC #5227
Conversation
Wow, this is really impressive. It would be great to have the option to build with a low-pause GC for users who might want that. Having multiple GCs available is really cool. I'm planning to work on a new default GC as well. It also requires a write barrier, so that work will not be wasted. What I want to try is taking advantage of the fact that garbage is very likely to have been referenced only from the stack (i.e. temporary values). So I'd set a bit on heap-referenced values, then during a "minor" GC I'd only mark objects currently on the stack, then do a full sweep. That skips tracing all heap objects, which takes most of the time in my experiments. It seems like you'd be an ideal person to evaluate this idea.
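For concreteness, here is a minimal sketch of that scheme as described (illustrative names, not Julia's actual GC structures): the write barrier sets an "escaped" bit on anything stored into a heap object, and a minor collection marks only stack roots before sweeping.

```c
#include <stddef.h>

typedef struct gc_obj {
    unsigned escaped : 1;     /* set once any heap object points at us */
    unsigned marked  : 1;
    struct gc_obj *field;     /* toy payload: a single out-pointer */
} gc_obj_t;

/* write barrier: every store of a pointer into a heap object tags the
 * stored value as having escaped to the heap */
static void gc_store_field(gc_obj_t *parent, gc_obj_t *child)
{
    parent->field = child;
    if (child)
        child->escaped = 1;
}

/* minor collection: mark only what the stack references; the sweep
 * (not shown) then frees anything that is neither marked nor escaped,
 * without ever tracing through heap objects */
static void gc_minor_mark(gc_obj_t **stack_roots, size_t nroots)
{
    for (size_t i = 0; i < nroots; i++)
        if (stack_roots[i])
            stack_roots[i]->marked = 1;
}
```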
Very cool work. I worry about having multiple GCs because each one might not get enough testing to be really reliable. It would be ideal, imo, to have a single standard GC that can meet everyone's needs, perhaps with a little assist from the high-level coding style (e.g. if you have real-time requirements, always pre-allocate temporary arrays). On the other hand, GC might be an area where it's not actually possible to satisfy everyone with one system.
Whether you care about pause times is a pretty hard dividing line in the GC design space. You can't address this by changing how user code is written, because you also call libraries, and our compiler also allocates objects if it wants to. Even if our GC is so good it takes 1ms 99.9% of the time, that still isn't low-latency. The 0.1% ruins it. Limited pause time is a harsh requirement.

@carnaval Is there an argument that this eventually collects all garbage? I believe it might but it's not obvious. For example, say you constantly add new nodes to a large tree, and on rare occasion unreference a big chunk of it. Will the GC always use up its allotted time dealing with the new updates?
Yes, that means the requirements on the "one GC" would be pretty severe: high enough throughput for general usage and guaranteed low latency for real-time requirements. I'm not even sure that's doable; the alternative would be having a small set (maybe only two) of GCs that can be swapped out.
It should collect all garbage eventually because I increase the timeout if the mark queue is not shrinking, up to a cutoff value where I do a full collection.

I'd be happy to do some tests on different ways to do quick collections after this. A first step would be having some basic infrastructure to measure differences in GC pause/throughput. ATM I have some basic scripts that need some cleanup, but I believe it would be beneficial to automate it to spot regressions (maybe an entry on codespeed?).

About having multiple GCs: in this PR I duplicated the codepath because it allowed me to trust the verification (which is inline to the marking code, meh), but it could be merged quite easily, and I don't think there should be any penalty in saying that a full collection is an incremental one with infinite timeout.

Btw, congrats on the codebase, it's really clean and quite nice to dive into.
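A sketch of that policy (hypothetical helper names, not the PR's actual code): each slice marks under a deadline, the deadline grows when the queue makes no net progress, and past a cutoff the collection degenerates into a full one.

```c
#include <math.h>     /* INFINITY */
#include <stddef.h>

/* hypothetical hooks standing in for the collector's internals */
extern double clock_now(void);              /* current time, in seconds */
extern size_t mark_queue_len(void);
extern void   mark_until(double deadline);  /* mark objects until deadline */

#define TIMEOUT_CUTOFF_MS 100.0

static double timeout_ms = 1.0;

void gc_mark_slice(void)
{
    /* past the cutoff, a full collection is just an incremental one
     * with an infinite timeout */
    if (timeout_ms >= TIMEOUT_CUTOFF_MS) {
        mark_until(INFINITY);
        timeout_ms = 1.0;
        return;
    }
    size_t before = mark_queue_len();
    mark_until(clock_now() + timeout_ms / 1000.0);
    /* no net progress on the queue: allow a longer pause next time,
     * which is what guarantees all garbage is eventually collected */
    if (mark_queue_len() >= before)
        timeout_ms *= 2.0;
}
```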
Thanks. It would be nice if we could somehow avoid calling `clock_now` so often. I assume all the verification stuff is just for debugging? It doesn't all seem to get disabled by `GC_VERIFY`.
About `clock_now`: I had a counter at some point to only check once every N objects but didn't see measurable improvements. However, a lot has changed since then so it may very well be worth it now. I will have a look. You're right, it seems like I forgot to disable the verification macros when GC_VERIFY is not defined.
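For reference, the counter approach mentioned here would look something like this (a sketch; the interval and names are made up):

```c
extern double clock_now(void);   /* the comparatively expensive time read */

#define CLOCK_CHECK_INTERVAL 256 /* illustrative value */

static int objs_since_check = 0;

/* called once per marked object; only touches the clock every N objects */
static int deadline_passed(double deadline)
{
    if (++objs_since_check < CLOCK_CHECK_INTERVAL)
        return 0;
    objs_since_check = 0;
    return clock_now() > deadline;
}
```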
A couple of formatting points to fix:
This one isn't actually a style point; it's required for correctly declaring a zero-argument C function.
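The point being that in C, unlike C++, an empty parameter list does not mean "no arguments":

```c
int f();       /* declares a function with *unspecified* parameters:
                  f(1, 2, 3) compiles without complaint */
int g(void);   /* declares a genuinely zero-argument function:
                  g(1) is a compile-time error */
```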
This is really impressive, and would enable lots of new applications.
Let me know if I missed anything.
I'm really excited about bringing max latency down to enable realtime audio applications, so reducing the GC delay would be a huge step. Thanks for the work on this! -s
Yes, I plan to merge this.
I finally got around to spending a little more time on this.
@ssfrr I have no experience in realtime audio; do you happen to know the order of magnitude of tolerable pause times?

@JeffBezanson About the escape bit: if I understood you correctly, it would not be too hard to implement on top of what I have now. We would still have to sweep the whole heap though. The only way around this is to implement moving, and LLVM does not seem really suited for this. In fact, I only see one way to avoid moving an object while in the middle of some computation using its address: it involves reloading every root after every potentially-allocating operation, which does not strike me as cheap and would probably confuse LLVM opt passes.
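To illustrate why (a sketch with hypothetical helpers, not actual codegen output): with a moving collector, any root cached in a register or raw local can be stale after an allocation, so it has to be re-read from a slot the GC knows about.

```c
typedef struct jl_value_t jl_value_t;     /* opaque, as in julia.h */
extern jl_value_t **gc_root_slot(void);   /* hypothetical GC-visible slot */
extern jl_value_t  *make_object(void);
extern jl_value_t  *allocate(void);
extern void         use(jl_value_t *, jl_value_t *);

void example(void)
{
    jl_value_t **slot = gc_root_slot();
    *slot = make_object();
    jl_value_t *tmp = allocate();   /* may move the object *slot refers to */
    use(*slot, tmp);                /* so it is reloaded through the slot,
                                       never cached across the allocation */
}
```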
Realtime audio deadlines are around 10-20ms max if you're processing live sound (~5ms is not uncommon). Typically there's an audio callback that gets called with every new frame of incoming audio, and within that context the application processes the sound and returns a new frame to be output. So given a 20ms deadline, ideally as much of that time as possible can be used for processing (if the system is blocked for 10ms of the 20ms, that only allows 50% CPU utilization). This is the main reason that audio processing software is almost always written in C or C++, with all the memory allocation happening up front and blocking operations like disk access in a separate lower-priority thread.

I think this is a niche enough use case that I wouldn't expect the GC design to make huge compromises to meet it. It would be nice if it were at least possible to write performant audio processing code in Julia, though, even if it required a bit of care on the developer's part. I've got a few audio-related projects and have been toying with the idea of using Julia to implement the engine.
What's the status of this? I would like to see it get merged before I start working on #2818, since I expect they will have a fairly high degree of overlap.
Taken from Jeff's comments on a recent pull request: #5227 (comment)
I'm cautiously confident that the incremental marking is correct (I've been using it exclusively for some time now).
I intend to run a more serious benchmark later today (morning here) comparing master and all combinations of (inc_mark, track_esc) on runtime, pause times, and max heap size.
@JeffBezanson By the way, is there any reason for the non-power-of-two GC page size on 64-bit CPUs? Git blame gives a very old commit.
@carnaval, this is really impressive; faster throughput with short pauses sounds almost too good to be true. (I oh-so-vaguely understand generational GC and why it's not impossible, but still...) Is the Travis error on the Clang build anything to be worried about? Naively it does look like a potential memory corruption problem.
This is awesome. We should plan to merge it at the start of the 0.4 cycle.
Had one tried to make it optimal from the outset, Julia would still be stuck with about 4 users (the perfect being the enemy of the good and all that). But it's really great to see this getting attention; it seems very timely given the improvements being made in other areas.
How about merging it now but only enabled with a switch?
In a little non-rigorous testing just now this is performing way better than the standard GC with AudioIO. I made a sloppy audio node that does a buffer allocate/free on every block of audio rather than iterating. With the normal GC I hear frequent periodic dropouts when I have more than 4-5 of the nodes running at once. With the incremental GC I can get 200 of them running with occasional dropouts, and 100 was running with no dropouts (again, in my very quick testing). Granted I'm operating at pretty large buffer sizes right now, but this definitely seems like a huge win.

There was one compile-time issue: I had to change MAP_ANONYMOUS to MAP_ANON in gc.c. It looks like a Linux/OSX compatibility issue (this is an OSX box), perhaps just a

```c
#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif
```
So what's in the way of merging this?
I'm not sure we can expect a speedup on this benchmark, since the generational hypothesis doesn't model the behavior of the program well. If, for example, you throw away those strings quickly instead of storing them in a huge array, you should get an improvement over current results.

This effect is made much worse by the fact that, even if the string array gets promoted to the old gen, it must be remarked fully every time because we are stuffing it with young objects. The way around this is to use a card table alongside (or as a replacement of) the remembered set to allow partial invalidation of large arrays. This will make the write barrier more complicated because, given a pointer, it's not straightforward (i.e. O(1) for a very small value of 1 :-) ) to know how it is stored in memory.

That being said, I'm having a look right now to see if there are ways to help this specific case.
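Roughly, the card-table variant would look like this (sizes and names illustrative): the write barrier dirties a small fixed-size region ("card") covering the updated slot, and the collector rescans only the dirty cards of an old array instead of remarking the whole thing.

```c
#include <stdint.h>

#define CARD_SHIFT 9                /* 512-byte cards, an arbitrary choice */

extern uint8_t  *card_table;        /* one dirty byte per card */
extern uintptr_t heap_base;         /* start of the GC-managed range */

static inline void write_barrier(void **slot, void *newval)
{
    *slot = newval;
    /* O(1) only because this flat heap layout makes card lookup trivial */
    card_table[((uintptr_t)slot - heap_base) >> CARD_SHIFT] = 1;
}
```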
Good point --- the benchmark should be repeated with the new objects being discarded quickly instead of stored.
That makes sense, thanks for the detailed explanation. You are right that the new GC is much better under the conditions you outlined (although both tests represent extremes of "normal" allocation behavior).

```
Julia/julia-dev-33 [master] » ./julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+944 (2014-10-04 23:57 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit c5f136a (1 day old master)
|__/                   |  x86_64-apple-darwin13.4.0

julia> using DataFrames

julia> function test(N)
           for _ in 1:N
               reverse(utf16("this is a test"))
           end
       end
test (generic function with 1 method)

julia> @time test(5_000_000)
elapsed time: 1.410968703 seconds (800865016 bytes allocated, 35.24% gc time)

julia> le = @time open(deserialize, "/Users/jacobbolewski/Julia/benchmarks/labevents.jls");
elapsed time: 9.167030852 seconds (1326598072 bytes allocated, 5.80% gc time)

julia> @time test(5_000_000)
elapsed time: 2.39665995 seconds (800000080 bytes allocated, 58.28% gc time)

julia> @time test(5_000_000)
elapsed time: 6.629903708 seconds (800000080 bytes allocated, 87.44% gc time)

julia> @time test(5_000_000)
elapsed time: 6.533877549 seconds (800000080 bytes allocated, 87.42% gc time)

julia> @time test(5_000_000)
elapsed time: 6.931448025 seconds (800000080 bytes allocated, 88.15% gc time)

julia> @time test(5_000_000)
elapsed time: 6.193674193 seconds (800000080 bytes allocated, 86.91% gc time)
```

This PR:

```
Julia/julia-dev-33 [newgc] » ./julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+525 (2014-09-28 22:51 UTC)
 _/ |\__'_|_|_|\__'_|  |  newgc/8c54091 (fork: 13 commits, 27 days)
|__/                   |  x86_64-apple-darwin13.4.0

julia> using DataFrames

julia> function test(N)
           for _ in 1:N
               reverse(utf16("this is a test"))
           end
       end
test (generic function with 1 method)

julia> @time test(5_000_000)
elapsed time: 1.03702919 seconds (763 MB allocated, 9.41% gc time)

julia> le = @time open(deserialize, "/Users/jacobbolewski/Julia/benchmarks/labevents.jls");
elapsed time: 9.971573486 seconds (1265 MB allocated, 6.24% gc time)

julia> @time test(5_000_000)
elapsed time: 2.389991338 seconds (762 MB allocated, 53.33% gc time)

julia> @time test(5_000_000)
elapsed time: 1.740331145 seconds (762 MB allocated, 43.69% gc time)

julia> @time test(5_000_000)
elapsed time: 1.314867743 seconds (762 MB allocated, 29.68% gc time)

julia> @time test(5_000_000)
elapsed time: 1.404829552 seconds (762 MB allocated, 32.67% gc time)

julia> @time test(5_000_000)
elapsed time: 1.397260517 seconds (762 MB allocated, 33.38% gc time)
```
Excellent!
@carnaval I just added you to the contributors team. You can move this branch to the main repository, under a name like
Thanks! I'll try and do this without breaking everything. We all know what comes with great power... :-)
Just don't push to master by accident, and especially don't force-push to master.
This also includes various allocation changes which should improve performance. There is also a start of generational behavior for <2k objects. This broke the heuristics in the process; still pretty much a WIP.
…nsients, bugfixes... Still in a pretty broken state (at the very least incremental codepath isn't working) @trrousse :-)
…arrier. A bit more cleanup too. Also some missing write barrier in new code.
…t. A bit more tweaking of the collection heuristics. We are now faster/less memory hungry on almost every benchmark of the micro, kernel & shootout suite.
…llection. Slight cleanup. Address some of Jeff's comments.
```c
static inline int gc_setmark(void *o, int sz, int mark_mode)
```
In some cases, I think this is being called with a `sz` value that doesn't include the `jl_value_t *type` tag (e.g. when it comes from `jl_datatype_size`).
This is work I started after hitting long GC pauses in a soft realtime application. No matter how low you bring the allocation count (and the language admittedly makes it easy), there is still a point where you must wait for tens of ms, which is a pain in, e.g., render code at 60 fps.
To be clear, this is only incremental marking, but incremental sweeping should be much easier.
This patch:
Some points:
This is ugly and could be resolved with a macro (something like `gc_store(a, b, ...);`) but I was not sold on changing such pervasive syntax.

To enable, build with `-DGC_INC`. `GC_VERIFY` enables the verifier and `GC_TIME` prints pause timings.
Of course, feedback is much appreciated. Cheers!