
Realm: More robust handling of case when -ll:nsize allocation fails #1785

Open
manopapad opened this issue Oct 29, 2024 · 4 comments
Assignees
eddy16112
Labels
Realm Issues pertaining to Realm

Comments

@manopapad
Contributor

It looks like currently, when we ask for an -ll:nsize that is too large, we get a warning and the run continues without that allocation at all.

For example, running legate with:

--cpus=1 --gpus=1 --omps=1 --ompthreads=28 --utility=2 --sysmem=256 --numamem=308442 --fbmem=14184 --zcmem=128 --regmem=0

we see:

[0 - 7f6783618000]    0.000000 {4}{numa}: insufficient memory in NUMA node 0 (323424878592 > 67007963136 bytes) - skipping allocation
[0 - 7f6783618000]    0.000000 {4}{numa}: insufficient memory in NUMA node 1 (323424878592 > 56011509760 bytes) - skipping allocation
[0 - 7f6783618000]    0.000736 {4}{threads}: reservation ('OMP0 proc 1d00000000000003 (worker 16)') cannot be satisfied

then later the available memories are:

[0 - 7fe255c00000]    0.231816 {2}{legate.mapper}: Memories on rank 0:
[0 - 7fe255c00000]    0.231842 {2}{legate.mapper}:   1e00000000000000 (SYSTEM_MEM): 268435456 bytes
[0 - 7fe255c00000]    0.231848 {2}{legate.mapper}:   1e00000000000001 (SYSTEM_MEM): 0 bytes
[0 - 7fe255c00000]    0.231855 {2}{legate.mapper}:   1e00000000000002 (GPU_FB_MEM): 14873001984 bytes
[0 - 7fe255c00000]    0.231867 {2}{legate.mapper}:   1e00000000000003 (GPU_DYNAMIC_MEM): 15655829504 bytes
[0 - 7fe255c00000]    0.231876 {2}{legate.mapper}:   1e00000000000004 (Z_COPY_MEM): 134217728 bytes
[0 - 7fe255c00000]    0.231882 {2}{legate.mapper}:   1e00000000000005 (FILE_MEM): 0 bytes

I think this should either be an error, or Realm should proceed with the allocation but allow it to span multiple NUMA domains.
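
For illustration, here is a minimal sketch of the second option (not Realm code; it calls libnuma directly, and `alloc_numa_with_fallback` is a hypothetical name): try the requested NUMA node first, and if the request does not fit, fall back to an allocation interleaved across all nodes instead of silently skipping it.

```cpp
// Illustrative only (not Realm code): try to place the -ll:nsize allocation on
// the requested NUMA node; if it does not fit, interleave it across all nodes
// instead of silently skipping it. Assumes libnuma (link with -lnuma);
// alloc_numa_with_fallback is a hypothetical name.
#include <numa.h>
#include <cstdio>

void *alloc_numa_with_fallback(size_t bytes, int preferred_node) {
  if (numa_available() < 0)
    return nullptr;  // no NUMA support at all - caller decides what to do
  long long free_bytes = 0;
  numa_node_size64(preferred_node, &free_bytes);
  if ((long long)bytes <= free_bytes)
    return numa_alloc_onnode(bytes, preferred_node);  // fits on one node
  // Fallback: spread pages across all allowed nodes so the request is still
  // satisfied, at the cost of some non-local accesses.
  fprintf(stderr,
          "node %d too small (%zu > %lld bytes) - interleaving across nodes\n",
          preferred_node, bytes, free_bytes);
  return numa_alloc_interleaved(bytes);
}
```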

@apryakhin apryakhin added the Realm Issues pertaining to Realm label Oct 29, 2024
@eddy16112
Contributor

eddy16112 commented Oct 29, 2024

@manopapad Do the following behaviors sound good to you?

  1. If -ll:nsize or -ll:ncpu is set but NUMA is not available on the machine, we treat it as an error.
  2. If -ll:nsize is too large, we treat it as an error. What about -ll:ncpu? Should we oversubscribe, as we do for -ll:cpu? (A rough sketch of both checks follows below.)
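
As a rough illustration of the checks in (1) and (2) (not Realm's actual code; `validate_numa_config` and the exact messages are hypothetical, and it assumes libnuma is available):

```cpp
// Illustrative only (not Realm code): validate NUMA-related flags at startup
// and fail hard instead of skipping the allocation. Assumes libnuma; link
// with -lnuma. validate_numa_config is a hypothetical helper name.
#include <numa.h>
#include <cstdio>
#include <cstdlib>

void validate_numa_config(size_t nsize_bytes, int ncpu) {
  // case 1: -ll:nsize / -ll:ncpu given but NUMA is not available
  if (numa_available() < 0) {
    fprintf(stderr, "fatal: -ll:nsize/-ll:ncpu given but NUMA is unavailable\n");
    exit(1);
  }
  // open question from above: error out, or oversubscribe like -ll:cpu?
  int nodes = numa_num_configured_nodes();
  if (ncpu > nodes) {
    fprintf(stderr, "fatal: -ll:ncpu %d exceeds %d NUMA nodes\n", ncpu, nodes);
    exit(1);
  }
  // case 2: -ll:nsize larger than any single NUMA node can hold
  bool fits_somewhere = false;
  for (int n = 0; n <= numa_max_node(); n++) {
    long long free_bytes = 0;
    if (numa_node_size64(n, &free_bytes) > 0 &&
        (long long)nsize_bytes <= free_bytes) {
      fits_somewhere = true;
      break;
    }
  }
  if (!fits_somewhere) {
    fprintf(stderr, "fatal: -ll:nsize of %zu bytes does not fit in any NUMA node\n",
            nsize_bytes);
    exit(1);
  }
}
```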

@manopapad
Contributor Author

Let's bring it up for discussion at an upcoming Realm/Legion meeting

@eddy16112
Contributor

Based on the discussion in the meeting, we will let Realm crash in both cases.

@muraj

muraj commented Oct 30, 2024

Just to be clear, I do not approve of crashing inside Realm unless it is a true bug that Realm engineering needs to deal with. That said, we have cases today (like specifying -ll:gpu when the CUDA module is not loaded, or specifying too large a value for -ll:fsize) where we crash. To keep behavior consistent and predictable, these different cases should respond in the same way: for now, crash with an error message. Later, once we reach better consensus on how to propagate errors while maintaining compatibility, we can return an error, turn off the feature, and otherwise limp on, or use some other mechanism we work out.
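
As a sketch of that "respond in the same way" idea (purely illustrative; `fatal_config_error` is a hypothetical helper, not an existing Realm API): route every unsatisfiable command-line request through one function that currently aborts with a clear message and could later be switched to report an error and disable the feature instead.

```cpp
// Illustrative only: one shared failure path for unsatisfiable configuration
// requests (-ll:gpu without CUDA, oversized -ll:fsize, oversized -ll:nsize, ...).
// fatal_config_error is a hypothetical helper, not an existing Realm API.
#include <cstdarg>
#include <cstdio>
#include <cstdlib>

// Today: print a clear message and abort. Later this single choke point could
// be changed to record an error, disable the feature, and keep going.
[[noreturn]] void fatal_config_error(const char *fmt, ...) {
  va_list args;
  va_start(args, fmt);
  fprintf(stderr, "fatal configuration error: ");
  vfprintf(stderr, fmt, args);
  fprintf(stderr, "\n");
  va_end(args);
  abort();
}

// Example call site mirroring one of the existing cases mentioned above.
void request_gpus(int requested_gpus, bool cuda_module_loaded) {
  if (requested_gpus > 0 && !cuda_module_loaded)
    fatal_config_error("-ll:gpu %d requested but the CUDA module is not loaded",
                       requested_gpus);
}
```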

@eddy16112 eddy16112 self-assigned this Nov 22, 2024