Profiler: Show memory state on deferred allocation OOM #1797

Open
manopapad opened this issue Nov 26, 2024 · 4 comments

@manopapad
Contributor

Separating out a side discussion from #1739.

Honestly it might not even be necessary to visualize the "history" of mapping-stage allocations, for the purposes of OOM debugging. Just a visualization of the deferred memory state at the point of OOM might be enough. That gives enough information to understand what valid deferred allocations are stopping the incoming allocation from succeeding. No need to even visualize the invalid instances.

Right, so perhaps that deserves a different visualization, perhaps one made with matplotlib or something similar that generates a static picture of all the instances: where they are in memory and how much memory they take up, so you can see the holes, the fragmentation, and which instances are currently valid (uncollectable).

In order to get a full picture of memory usage, we would need to visualize a number of different objects that take up space on a Realm memory, some of which are only visible internally to the Runtime:

  • valid PhysicalInstances
  • DeferredBuffers / DeferredValues
  • upper bound eager reservations (in the one-pool world)
  • Future instances
  • other?
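
As a rough illustration of what a dump covering these object kinds might carry, here is a minimal sketch of a per-allocation record and a per-kind byte summary. The `Kind` labels, the `Allocation` fields, and the sample numbers are all hypothetical; nothing here reflects an existing Legion or Realm data structure.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    # Hypothetical labels mirroring the object kinds listed above.
    PHYSICAL_INSTANCE = "valid PhysicalInstance"
    DEFERRED_BUFFER = "DeferredBuffer / DeferredValue"
    EAGER_RESERVATION = "eager reservation (upper bound)"
    FUTURE_INSTANCE = "Future instance"
    OTHER = "other"


@dataclass(frozen=True)
class Allocation:
    kind: Kind
    offset: int  # assumed: byte offset within the memory
    size: int    # size in bytes


def bytes_by_kind(allocations):
    """Sum the bytes held by each kind of object."""
    totals = defaultdict(int)
    for alloc in allocations:
        totals[alloc.kind] += alloc.size
    return dict(totals)


# Made-up snapshot of one memory, for illustration only.
snapshot = [
    Allocation(Kind.PHYSICAL_INSTANCE, 0, 4096),
    Allocation(Kind.DEFERRED_BUFFER, 4096, 1024),
    Allocation(Kind.PHYSICAL_INSTANCE, 8192, 2048),
    Allocation(Kind.FUTURE_INSTANCE, 10240, 8),
]
totals = bytes_by_kind(snapshot)
for kind, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{kind.value}: {total} bytes")
```

Tagging every record with its kind would let a report separate the space held by runtime-internal objects from the space held by mapper-visible instances.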

We also need a way to let the mapper request this logging (today e.g. the DefaultMapper simply aborts on deferred allocation failure).

[Image: Slide1]

@lightsighter
Contributor

@manopapad I think the real challenge here is picking a visualization tool. I can dump all the data out of Legion to make that picture with, say, graphviz or matplotlib, but there are going to be hundreds, if not thousands, of instances and holes to report. So I think we need a more dynamic visualization tool for rendering that, because the zoomed-in close representation is not going to be comprehensible to a human: they aren't going to be able to see what they need to see in the large and then zoom in on things to look at. Do you have thoughts on how you'd want to do that? Alternatively, we can do a text-based representation for now and just have a tool that reports the largest holes in sorted order and the total size of all holes.
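
For illustration, the text-based report could be as small as the sketch below. It assumes the dump can be reduced to the memory's capacity plus a list of `(offset, size)` pairs for the live allocations; that input shape and the function names are assumptions made for this example, not anything Legion emits today.

```python
def find_holes(capacity, allocations):
    """Return the free (offset, size) ranges left by (offset, size) allocations."""
    holes = []
    cursor = 0  # first byte not yet covered by an allocation seen so far
    for offset, size in sorted(allocations):
        if offset > cursor:
            holes.append((cursor, offset - cursor))
        cursor = max(cursor, offset + size)
    if cursor < capacity:
        holes.append((cursor, capacity - cursor))
    return holes


def hole_report(capacity, allocations):
    """Format the holes, largest first, preceded by their total size."""
    holes = sorted(find_holes(capacity, allocations),
                   key=lambda hole: hole[1], reverse=True)
    total = sum(size for _, size in holes)
    lines = [f"total free: {total} bytes in {len(holes)} holes"]
    lines += [f"  hole at offset {off}: {size} bytes" for off, size in holes]
    return "\n".join(lines)


# Made-up example: a 100-byte memory holding three allocations.
print(hole_report(100, [(10, 20), (50, 10), (60, 30)]))
```

In this made-up example 40 bytes are free in total but the largest hole is only 20 bytes, so a 30-byte request would fail. That gap between total free space and largest hole is the fragmentation signal such a report is meant to expose.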

@manopapad
Contributor Author

Yes, we can start with a text dump for now, and iterate on the actual visualization. Maybe @bryevdv has a good idea.

One more thing to note: in Legate we would also like to include additional information in this visualization, e.g. which user-level object corresponds to each field, so we would need to dump additional information on top of this.

@lightsighter
Contributor

So my plan was to add the following method to the mapper runtime:

void MapperRuntime::dump_memory_state(Memory m, const char *filename);

Any mapper could invoke that at any time to dump the memory state of a particular memory. You don't have to wait until you are OOM, but can do it as many times as you want throughout your run. I'm not promising that it will be fast, as it will finish writing to the file and close the file before returning, but there's nothing stopping you from using it periodically.

What would you add to that function call to record what you want and then how would you write the tool to parse it?

@manopapad
Contributor Author

What would you add to that function call to record what you want and then how would you write the tool to parse it?

I don't think we would add extra information to the call directly, but would possibly include extra information in the output file. In particular, we'd want to record which Legate-level Stores correspond to which Legion fields, and record relevant information on the Stores that would help a user track values back to their code:

  • information on the Store(s) that an Instance is (partially) covering: (global) shape, type, transformation (e.g. "slice [1:,1:]")
  • what partition(s) an Instance corresponds to (e.g. "2x4 Tiling of the root Store in 300x150 tiles")
  • provenance of the operation that created the Store (e.g. x = np.empty(...) creates the Store in user-land, even though the actual Instance allocation happens later)
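
One way the extra information might be layered on top of a raw dump is sketched below. It assumes a purely hypothetical output format with one JSON object per instance and a `fields` list of field IDs, plus a Legate-side table mapping field IDs to Store metadata. The record keys, the field ID, and the provenance string are invented for illustration; the actual dump format has not been decided in this thread.

```python
import json


def annotate(dump_lines, store_info):
    """Attach Store metadata to each instance record, matched by field ID."""
    for line in dump_lines:
        record = json.loads(line)
        record["stores"] = [
            store_info[field] for field in record["fields"] if field in store_info
        ]
        yield record


# Hypothetical raw dump: one JSON object per instance.
dump = ['{"instance": 1, "offset": 0, "size": 360000, "fields": [10001]}']

# Hypothetical Legate-side table: field ID -> information about the Store.
store_info = {
    10001: {
        "shape": [600, 600],
        "dtype": "float64",
        "partition": "2x4 Tiling of the root Store in 300x150 tiles",
        "provenance": "example.py:12",  # made-up source location
    }
}

annotated = list(annotate(dump, store_info))
for record in annotated:
    for store in record["stores"]:
        print(record["instance"], record["size"], store["partition"], store["provenance"])
```

Keeping the Store table separate from the raw dump, joined only by field ID, is one possible design: it would let the same dump be read with or without the Legate-level annotations.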
