command line flag to limit heap memory usage? #17987

Open
tanmaykm opened this issue Aug 12, 2016 · 33 comments · Fixed by #21135
Labels
GC Garbage collector

Comments

@tanmaykm
Member

Should there be a command line flag to limit the heap memory used by a Julia process, similar to -Xms and -Xmx for the JVM?

This may be useful when running Julia inside containers (e.g. Docker) with memory limits, or when running multiple Julia processes on the same machine.

@yuyichao
Contributor

What kind of memory should the options limit?

@tanmaykm
Member Author

A limit on the heap would be the most useful, I suppose.
Would setrlimit with RLIMIT_AS and RLIMIT_STACK do the trick on Linux?
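
(For concreteness, a minimal C sketch of what the setrlimit route would look like — the 2 GB figure is just a hypothetical cap, and this is plain POSIX, not an existing Julia option:)

    #include <sys/resource.h>   /* setrlimit, RLIMIT_AS */
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical 2 GB cap on the process's total address space. */
        struct rlimit lim = { .rlim_cur = 2UL << 30, .rlim_max = 2UL << 30 };
        if (setrlimit(RLIMIT_AS, &lim) != 0) {
            perror("setrlimit");
            return 1;
        }
        /* From here on, mmap/malloc requests that would push the address
           space past the cap fail with ENOMEM. */
        return 0;
    }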

@yuyichao
Contributor

yuyichao commented Aug 13, 2016

RLIMIT_STACK doesn't affect memory usage at all, only how much recursion you can have before a stack overflow happens. (You can have multiple stacks, for example, and the stack space is pretty small anyway.)

RLIMIT_AS is a pretty terrible setting: it doesn't limit actual memory usage, and it makes memory management harder and less efficient by preventing us from reserving a large flat address space.

A limit on the heap would be the most useful, I suppose.

Which heap, though? A GC option to collect more aggressively when approaching a user-defined memory limit is possible, although there's a lot of memory allocated by user code, C libraries, and LLVM that we can't control.

@ViralBShah
Member

Yes, more aggressive GC when approaching a user-specified limit would significantly reduce memory consumption on JuliaBox.

@ViralBShah
Member

Also useful for running Julia in many shared settings, where we would otherwise end up consuming all the memory before GC pressure kicks in.

@nkottary
Contributor

nkottary commented Sep 1, 2016

I think the limit that we pass as a command-line argument should be checked here:

    /**
     * src/gc.c:826
     */

    if (__unlikely((gc_num.allocd += osize) >= 0) || gc_debug_check_pool()) {
        //gc_num.allocd -= osize;
        jl_gc_collect(0);
        //gc_num.allocd += osize;
    }

And here:

    /**
     * src/gc.c:553
     */

    #define should_collect() (__unlikely(gc_num.allocd>0))

Is that right?

@yuyichao
Contributor

yuyichao commented Sep 1, 2016

No. It should be folded into how allocd is updated.
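
(For illustration, a standalone sketch of that "fold it into allocd" idea: allocd counts up from a negative starting point, so the single >= 0 test quoted above enforces the interval — and could enforce a user-supplied limit if the interval were derived from one. Names and numbers here are illustrative, not a patch:)

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int64_t interval = 1 << 20;   /* pretend collection interval: 1 MiB */
        int64_t allocd = -interval;   /* budget counts up from -interval */

        for (int i = 0; i < 2048; i++) {
            int64_t osize = 1024;     /* pretend allocation size */
            allocd += osize;
            if (allocd >= 0) {        /* same shape as the gc.c check quoted above */
                printf("would collect after %d allocations\n", i + 1);
                allocd = -interval;   /* reset the budget after "collecting" */
            }
        }
        return 0;
    }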

@ViralBShah
Member

ViralBShah commented Sep 2, 2016

Based on a conversation with @yuyichao, the idea is to try to add a condition that detects memory pressure to the heuristic for full collection at gc.c:1768.

        (full || large_frontier ||
        ((not_freed_enough || promoted_bytes >= gc_num.interval) &&
         (promoted_bytes >= default_collect_interval || prev_sweep_full)) ||
        gc_check_heap_size(live_sz_ub, live_sz_est)) &&
       gc_num.pause > 1
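
(A hedged sketch of what such a memory-pressure clause could look like as a standalone predicate — the 10% headroom and both parameter names are illustrative, not actual gc.c fields:)

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative only: force a full collection once the estimated live
       heap is within 10% of a user-supplied limit (0 means "no limit"). */
    static int under_memory_pressure(uint64_t live_bytes, uint64_t user_limit)
    {
        return user_limit != 0 && live_bytes >= user_limit - user_limit / 10;
    }

    int main(void)
    {
        uint64_t limit = (uint64_t)2048 << 20;  /* hypothetical 2 GiB limit */
        printf("%d\n", under_memory_pressure((uint64_t)1900 << 20, limit)); /* 1 */
        printf("%d\n", under_memory_pressure((uint64_t)500  << 20, limit)); /* 0 */
        return 0;
    }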

@yuyichao
Contributor

yuyichao commented Sep 2, 2016

Also the choice of collection interval a few lines below (the assignment to gc_num.interval).

@tkelman added the GC Garbage collector label Sep 2, 2016
@JeffBezanson
Member

We call uv_get_total_memory to determine the system memory size, which is used in turn to set the max collect interval. My impression is that the new option should manually replace the value returned by uv_get_total_memory.
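
(A hedged sketch of that idea — uv_get_total_memory is the real libuv call, but the user_limit parameter and surrounding function are illustrative, not existing code:)

    #include <uv.h>       /* uv_get_total_memory */
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: size the max collect interval from either the detected system
       memory or a user-supplied limit, whichever is smaller (0 = no limit). */
    static uint64_t effective_total_memory(uint64_t user_limit)
    {
        uint64_t mem = uv_get_total_memory();
        if (user_limit != 0 && user_limit < mem)
            mem = user_limit;
        return mem;
    }

    int main(void)
    {
        printf("%llu\n", (unsigned long long)effective_total_memory(0));
        return 0;
    }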

We use RLIMIT_AS as a hint for how much vmem we try to map up front, but we also try smaller amounts until the mapping succeeds, so we probably don't need to change that code path.

@vtjnash
Member

vtjnash commented Sep 7, 2016

I think we can also update uv_get_total_memory to chase the crazy Linux cgroup information. My current understanding is that it needs to do the following (falling back to the current lookup if any of these steps fail):

  • parse /proc/self/mountinfo to locate the cgroup mount root(s) for the memory subsystem(s)
  • parse /proc/self/cgroup to find the cgroup path(s) for the memory subsystem(s), or find them by the names obtained from parsing mountinfo, and determine which combination provides the memory limit hierarchy (I think only one combination should match?)
  • read <root>/<cgroup-path>/memory.limit_in_bytes for that hierarchy
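
(A simplified, hedged sketch of the final read — it skips the mountinfo and /proc/self/cgroup parsing and just assumes the conventional cgroup-v1 mount point, which a real uv_get_total_memory change could not do:)

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* Returns the cgroup-v1 memory limit in bytes, or 0 if none was found.
       Assumes the memory controller is mounted at the conventional path
       instead of locating it via mountinfo as described above. */
    static uint64_t cgroup_memory_limit(void)
    {
        FILE *f = fopen("/sys/fs/cgroup/memory/memory.limit_in_bytes", "r");
        uint64_t limit = 0;
        if (!f)
            return 0;
        if (fscanf(f, "%" SCNu64, &limit) != 1)
            limit = 0;
        fclose(f);
        return limit;
    }

    int main(void)
    {
        printf("limit: %" PRIu64 " bytes\n", cgroup_memory_limit());
        return 0;
    }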

@vtjnash added the help wanted label Sep 7, 2016
@aviks
Member

aviks commented Sep 23, 2016

There are requests on the mailing lists for something similar: https://groups.google.com/d/msg/julia-users/CNE3xBjpCEk/Je6zi-mxBAAJ

@joshjob42

joshjob42 commented Mar 11, 2017

I don't think I'm knowledgeable enough about any of this to make this PR, but I just thought I'd comment to say that there are definitely still people who can't use Julia in their HPC environment because of the sorts of problems referenced in #10390 / here. If I run 'julia -p 8' on either 0.5.1 or master, on a 64GB machine, the job hangs, or I get "libsuitesparseconfig.so: failed to map segment from shared object: Cannot allocate memory", or reports that the processes didn't connect to master within 60 seconds, etc.

It was easy enough before, when I could just recompile from scratch after deleting the "16*" in gc.c. Is there some sort of nasty quick fix like that I could do to just force it to work, or am I just going to have to pretend that every Julia process is going to eat 8GB of RAM? :P

Edit: I just found the 16* in gc-pages.c: #define DEFAULT_REGION_PG_COUNT (16 * 8 * 4096) // 8 GB. I'm going to try compiling from source with the "16*" removed. Provided it works, I will reiterate the question asked under #10390 about why it's still the default if it seems to routinely cause problems in HPC environments. Would a PR be accepted that simply deleted it and made that always 512MB instead of 8GB?
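
(For context, the arithmetic behind those two figures, assuming the 16 KiB GC page size that gc-pages.c used at the time:)

    #include <stdio.h>

    int main(void)
    {
        const unsigned long long page_sz    = 1ULL << 14;        /* 16 KiB GC page */
        const unsigned long long with_16    = 16ULL * 8 * 4096;  /* DEFAULT_REGION_PG_COUNT */
        const unsigned long long without_16 =  8ULL * 4096;      /* proposed change */
        printf("region with 16*:    %llu MiB\n", with_16    * page_sz >> 20); /* 8192 */
        printf("region without 16*: %llu MiB\n", without_16 * page_sz >> 20); /* 512  */
        return 0;
    }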

@aviks
Member

aviks commented Apr 17, 2017

This is marked as fixed by #21135, but I did not see any user-facing options in that PR. How is this meant to work, @vtjnash?

@tkelman reopened this Apr 17, 2017
@vtjnash closed this as completed Apr 18, 2017
@vtjnash
Member

vtjnash commented Apr 18, 2017

That PR changes the GC to only allocate heap memory (RLIMIT_AS) as you use it. Thus the user-facing option is now simply to not allocate larger arrays than you intended.

@yuyichao
Contributor

The two are actually unrelated, since the option is about heap usage, not about pool size.

@yuyichao reopened this Apr 18, 2017
@yuyichao
Contributor

And FWIW #17987 (comment) is NOT what this issue is about.

@aviks
Member

aviks commented Apr 18, 2017

Thanks for the explanation, Jameson. That helps in many situations. But for this issue, the following is what I had in mind:

Say I own a machine with 8GB RAM. I start a Julia process, and in it I run a program written by someone else. What I want to control is: if that program allocates anything more than 2GB, throw an OutOfMemory error. Also, if the program has allocated 1.8GB, run the GC. How can I achieve that?

@joshjob42

@yuyichao You said in #10390 that the issue remaining there last September (at the end of the thread) was no longer #10390 but was instead this one, #17987. Yet deleting the 16* as I described in my comment above (and as was given as a "fix" in #10390) fixed the issue that I (and others on that thread) was having. I came here with my comment only because of your comment on #10390 when you closed that issue. If #21135 resolves any/all problems potentially stemming from the 16* in gc-pages.c (by making it unnecessary, etc.), then it solves #10390 (or whatever lingering problems remained there, which you identified as being this issue). I'm sorry if this issue isn't actually the one now covering the problem people were having in #10390; it wasn't intentional.

@vtjnash
Member

vtjnash commented Apr 18, 2017

RLIMIT_AS can now be used to hard-limit the max allocation size (although we might need to add a bit of slack during configuration to account for OpenBLAS). Are we not already using it to configure the GC threshold also?

@yuyichao
Contributor

Are we not already using it to configure the GC threshold also?

No. It's also pretty wrong, since the GC would have to guess how much memory other libraries use. OpenBLAS might allocate a lot at startup, but it's very far from the library that can allocate the most memory.

@vtjnash
Member

vtjnash commented Apr 18, 2017

That would also roughly apply to the current usage of uv_get_total_memory.

@yuyichao
Contributor

Kind of, but not really. The current usage of total memory only sets the max collection interval, which is very different from total memory consumption.

@floswald

Just chipping in to say that I still face exactly the same problem as in my comment on #10390. I have Julia showing up as consuming about 4GB of resident RAM, but the virtual memory required is more than twice that, which kills my job on an 8GB compute node.

@s2maki
Contributor

s2maki commented Mar 4, 2021

Bumping this back to awareness. In particular, the OOM killer of Docker/Kubernetes is nasty. Obviously it would be nicer if the attempt to allocate memory over a certain limit failed at the OS level, but it seems that requires changes to the Linux kernel that no one is interested in making. In lieu of that, having arguments or autodetection in Julia that let it know when to get more aggressive in GC and/or throw an exception when the memory limit is exceeded would be a godsend. Any idea where this is on the priority list? I would volunteer to implement it myself, but the learning curve of that area of the Julia source is rather steep.

@mkschleg

I have also run into this on a Slurm cluster. It would be nice to have some way to set the max memory of a node at process creation.

@tyleransom

I am running into a similar issue as @mkschleg and @s2maki on an HPC cluster.

@Moelf
Contributor

Moelf commented Oct 2, 2021

since the GC would have to guess how much memory other libraries use.

While this is impossible to address perfectly (ccall-ed code or run(...) can do whatever it wants), it wouldn't hurt if Julia could at least set an upper bound for pure Julia code. That way, at least, a user with negligible external subroutines can set a memory limit fairly accurately.

@oscardssmith
Member

Added in #45369.

@oscardssmith removed the help wanted label May 19, 2022
@paulmelis
Contributor

To be fair, the new option --heap-size-hint added by #45369 does not limit the amount of heap memory used, right? It merely triggers a collection if current usage is above the given threshold. So the live set can still exceed the size value passed, which seems different from the original intent of this issue (and from the way that Java's -Xmx works, which throws an out-of-memory error when the heap would grow past the set threshold, so the heap can never exceed it).
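
(For reference, the hint is passed on the command line as a size with a unit suffix, e.g.:)

    julia --heap-size-hint=2G script.jl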

@oscardssmith
Member

It's a little stronger than "trigger a collection if current memory usage is higher than the limit". What it does (currently) is make the GC collect much more frequently when memory usage is above the limit. I wouldn't be surprised if it becomes more forceful in the coming months, but we aren't there yet.

@fingolfin
Member

That's fair enough, but doesn't this still mean that this issue is not resolved (yet), and so should not be closed just yet?

@oscardssmith reopened this May 20, 2022
@vtjnash
Member

vtjnash commented Feb 10, 2024

heap-size-hint is now stronger. But perhaps it should be made even stronger by prohibiting any attempt to directly jl_gc_tracked_malloc or jl_gc_allocobj any chunk that would exceed 90% of that size?
