Profiler: Show memory state on deferred allocation OOM #1797

manopapad · 2024-11-26T23:40:24Z

Separating out a side discussion from #1739.

Honestly it might not even be necessary to visualize the "history" of mapping-stage allocations, for the purposes of OOM debugging. Just a visualization of the deferred memory state at the point of OOM might be enough. That gives enough information to understand what valid deferred allocations are stopping the incoming allocation from succeeding. No need to even visualize the invalid instances.

Right, so perhaps that deserves a different visualization, perhaps one built with matplotlib or something similar that generates a static picture of all the instances: where they are in memory and how much memory they take up, so you can see the holes, the fragmentation, and which instances are currently valid (uncollectable).

In order to get a full picture of memory usage, we would need to visualize a number of different objects that take up space on a Realm memory, some of which are only visible internally to the Runtime:

valid PhysicalInstances
DeferredBuffers / DeferredValues
upper bound eager reservations (in the one-pool world)
Future instances
other?
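As a rough sketch of what a dump covering these categories might record, the Python snippet below tags each allocation with a kind and totals the bytes per kind. The `Kind` names and the `Allocation` record are illustrative assumptions taken from the list above, not an existing Legion or Realm format.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    # Categories mirror the list above; the enum itself is hypothetical.
    PHYSICAL_INSTANCE = "valid PhysicalInstance"
    DEFERRED = "DeferredBuffer / DeferredValue"
    EAGER_RESERVATION = "upper bound eager reservation"
    FUTURE_INSTANCE = "Future instance"
    OTHER = "other"


@dataclass(frozen=True)
class Allocation:
    kind: Kind
    offset: int  # byte offset within the memory
    size: int    # size in bytes


def bytes_by_kind(allocations):
    """Return the total number of bytes occupied by each allocation kind."""
    totals = defaultdict(int)
    for alloc in allocations:
        totals[alloc.kind] += alloc.size
    return dict(totals)
```

A per-kind total like this would show at a glance whether the memory is dominated by valid instances or by runtime-internal allocations.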

We also need a way to let the mapper request this logging (today e.g. the DefaultMapper simply aborts on deferred allocation failure).


lightsighter · 2024-11-27T02:30:12Z

@manopapad I think the real challenge here is picking a visualization tool. I can dump all the data out of Legion to make that picture with, say, graphviz or matplotlib, but there are going to be hundreds, if not thousands, of instances and holes to report. I think we need a more dynamic visualization tool for rendering that, because a static rendering at that level of detail is not going to be comprehensible to a human: they need to be able to see the big picture first and then zoom in on the things they want to look at. Do you have thoughts on how you'd want to do that? Alternatively, we can do a text-based representation for now and just have a tool that reports the largest holes in sorted order and the total size of all holes.
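The text-based alternative could look roughly like the Python sketch below. It assumes the dump can be reduced to a memory capacity plus a list of `(offset, size)` pairs for live allocations. That input shape and the function names are assumptions for illustration; no dump format has been settled in this thread.

```python
def find_holes(capacity, allocations):
    """Return (offset, size) for each free gap in [0, capacity).

    `allocations` is an iterable of (offset, size) pairs in bytes.
    """
    holes = []
    cursor = 0  # end of the furthest-reaching allocation seen so far
    for offset, size in sorted(allocations):
        if offset > cursor:
            holes.append((cursor, offset - cursor))
        cursor = max(cursor, offset + size)
    if cursor < capacity:
        holes.append((cursor, capacity - cursor))
    return holes


def hole_report(capacity, allocations, top=10):
    """Format the total free space and the `top` largest holes, largest first."""
    holes = find_holes(capacity, allocations)
    largest = sorted(holes, key=lambda h: h[1], reverse=True)[:top]
    total = sum(size for _, size in holes)
    lines = [f"total free: {total} bytes in {len(holes)} holes"]
    lines += [f"  offset={off:#x} size={size}" for off, size in largest]
    return "\n".join(lines)
```

Comparing the largest hole against the size of the failed request would distinguish fragmentation (enough total free space, no single hole large enough) from genuine exhaustion.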

manopapad · 2024-11-27T19:55:24Z

Yes, we can start with a text dump for now, and iterate on the actual visualization. Maybe @bryevdv has a good idea.

One more thing to note: in Legate we would also like to include additional information in this visualization, e.g. which user-level object corresponds to each field, so we would need to dump additional information on top of this.

lightsighter · 2024-11-27T21:38:21Z

So my plan was to add the following method to the mapper runtime:

void MapperRuntime::dump_memory_state(Memory m, const char *filename);

Any mapper could invoke that at any time to dump the state of a particular memory. You don't have to wait until you are OOM; you can do it as many times as you want throughout your run. I'm not promising that it will be fast, as it will finish writing to the file and close the file before returning, but there's nothing stopping you from using it periodically.

What would you add to that function call to record what you want and then how would you write the tool to parse it?

manopapad · 2024-12-03T09:13:10Z

What would you add to that function call to record what you want and then how would you write the tool to parse it?

I don't think we would add extra information to the call directly, but would possibly include extra information in the output file. In particular, we'd want to record which Legate-level Stores correspond to which Legion fields, and record relevant information on the Stores that would help a user track values back to their code:

information on the Store(s) that an Instance is (partially) covering: (global) shape, type, transformation (e.g. "slice [1:,1:]")
what partition(s) an Instance corresponds to (e.g. "2x4 Tiling of the root Store in 300x150 tiles")
provenance of operation that created the Store (e.g. x = np.empty(...) creates the Store in user-land, even though the actual Instance allocation happens later)
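One way to layer this information on top of a runtime dump is sketched below in Python. It assumes each dumped instance record is a dict with a `fields` list of field ids, and that Legate keeps a separate table of per-field Store annotations. Both the record layout and the `StoreAnnotation` type are hypothetical, not an existing Legate or Legion interface.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StoreAnnotation:
    field_id: int    # Legion field backing the Store (assumed join key)
    shape: tuple     # global shape of the Store
    dtype: str       # element type
    transform: str   # e.g. "slice [1:,1:]"
    partition: str   # e.g. "2x4 Tiling of the root Store in 300x150 tiles"
    provenance: str  # where the Store was created in user code


def annotate(instances, annotations):
    """Attach Store annotations to dumped instance records by field id."""
    by_field = {a.field_id: asdict(a) for a in annotations}
    return [
        {**inst, "stores": [by_field[f] for f in inst["fields"] if f in by_field]}
        for inst in instances
    ]


def to_json(instances, annotations):
    """Serialize the annotated records so a later tool can parse them."""
    return json.dumps(annotate(instances, annotations), indent=2)
```

Keeping the annotations in a separate table joined by field id means the runtime call itself would not need extra arguments, which matches the preference stated above.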

lightsighter self-assigned this Nov 27, 2024

lightsighter added the enhancement label Nov 27, 2024

lightsighter mentioned this issue Nov 27, 2024

Profiler: Show "truly-in-use" memory usage line #1739

Open
