
[Core][Distributed] add same-node detection #5369

Merged
simon-mo merged 13 commits into vllm-project:main from youkaichao:single_host
Jun 11, 2024

Conversation

@youkaichao
Member

This PR adds a function to detect whether all processes inside a process group live on the same node. It should take over #4903.

The idea is to test whether all processes can access the same shared memory segment.

Our CI can currently only run on a single node. I manually tested the correctness of the multi-node case; in the future, we should add a multi-node test.
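The detection idea described above might be sketched roughly as follows (a hypothetical helper, not the PR's actual code): rank 0 creates a named shared memory segment and shares its name; every other rank tries to attach to it, and attaching fails with `OSError` on any rank that lives on a different node. The distributed broadcast is stubbed out so the sketch runs in a single process.

```python
# Sketch of the same-node check: attach-by-name only works on the node
# that created the segment. The torch.distributed broadcast of the name
# is omitted here so the example is runnable standalone.
from multiprocessing.shared_memory import SharedMemory

MAGIC = b"same_node_check"

def create_segment() -> str:
    """Rank 0: create a named segment and write a magic payload."""
    shm = SharedMemory(create=True, size=len(MAGIC))
    shm.buf[: len(MAGIC)] = MAGIC
    shm.close()
    return shm.name  # this name would be broadcast to all ranks

def can_attach(name: str) -> bool:
    """Other ranks: attaching only succeeds if the segment is visible locally."""
    try:
        shm = SharedMemory(name=name)
    except OSError:
        return False  # segment not visible -> assume a different node
    ok = bytes(shm.buf[: len(MAGIC)]) == MAGIC
    shm.close()
    return ok

def cleanup(name: str) -> None:
    shm = SharedMemory(name=name)
    shm.close()
    shm.unlink()

if __name__ == "__main__":
    name = create_segment()
    print(can_attach(name))  # True: same process, hence same node
    cleanup(name)
```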

@youkaichao youkaichao requested review from esmeetu and removed request for esmeetu June 10, 2024 03:38
@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

I think the code may be simplified a bit using multiprocessing.managers.SharedMemoryManager? Or is that not compatible with torch.distributed?

Edit: Also, we might want to log the exceptions that are being raised instead of silently ignoring them.

@youkaichao
Member Author

I think the code may be simplified a bit using multiprocessing.managers.SharedMemoryManager?

Do you have any idea how to use it? I checked the docs, and it seems quite difficult to use; we would have to select a port to bind.

we might want to log the exceptions that are being raised

The exception occurs when a process on another node cannot access the shared memory segment, which is exactly how we test whether processes are on the same node. We don't need to expose this expected "exception" to users.

@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

I think the code may be simplified a bit using multiprocessing.managers.SharedMemoryManager?

Do you have any idea how to use it? I checked the docs, and it seems quite difficult to use; we would have to select a port to bind.

Hmm, now that you've mentioned it, I don't see how the low-level SharedMemory is shared over the network. So we may have to use SharedMemoryManager anyway because it provides this functionality.

we might want to log the exceptions that are being raised

The exception occurs when a process on another node cannot access the shared memory segment, which is exactly how we test whether processes are on the same node. We don't need to expose this expected "exception" to users.

Is there any way to only suppress that specific class of exceptions? Otherwise I guess the current way is fine.

@youkaichao
Member Author

Is there any way to only suppress that specific class of exceptions?

Limited to OSError now.

@youkaichao
Member Author

Hmm, now that you've mentioned it, I don't see how the low-level SharedMemory is shared over the network. So we may have to use SharedMemoryManager anyway because it provides this functionality.

can you elaborate on this? I don't get it.

@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

Hmm, now that you've mentioned it, I don't see how the low-level SharedMemory is shared over the network. So we may have to use SharedMemoryManager anyway because it provides this functionality.

can you elaborate on this? I don't get it.

From my understanding, you said that we need to use the address parameter of SharedMemoryManager, otherwise the memory cannot be shared over the network, which is required in the multi-node case. However, there is no mention of this in the docs for SharedMemory. So I think we can't use SharedMemory for the multi-node case; the reason your test still succeeded might be that you caught the exception that was raised because of this, which has nothing to do with torch.distributed.

If we are relying on this behaviour for the test, then we can also just omit the address parameter for SharedMemoryManager.

@youkaichao
Member Author

If I understand correctly, SharedMemoryManager will start a server process on the node where it lives, and whenever a connection comes in to request shared memory, it creates a shared memory segment on that same node and returns a wrapper/proxy of that memory. That said, I don't think it can be used to test whether all processes live on the same node.

We can only test this by manually sharing the filename of the SharedMemory segment. This will cause an exception in multi-node cases, which is exactly what we want to detect.
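For reference, a minimal local-only SharedMemoryManager sketch (hypothetical, not the PR's code) shows the manager-server pattern described above: the manager spawns a server process and allocates segments on the node where that server runs, so a remote rank would be handed memory on the server's node rather than failing to attach, which is why it cannot serve as a same-node test.

```python
# Minimal SharedMemoryManager usage: segments are allocated by the
# manager's server process and cleaned up when the manager exits.
from multiprocessing.managers import SharedMemoryManager

def demo() -> bytes:
    with SharedMemoryManager() as smm:
        shm = smm.SharedMemory(size=16)  # allocated by the manager's server
        shm.buf[:5] = b"hello"
        return bytes(shm.buf[:5])
    # all segments created via smm are released here automatically

if __name__ == "__main__":
    print(demo())  # b'hello'
```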

@DarkLight1337
Member

DarkLight1337 commented Jun 10, 2024

We can only test this by manually sharing the filename of the SharedMemory segment. This will cause an exception in multi-node cases, which is exactly what we want to detect.

Alright, I understand your intention more clearly now. In that case, it's probably not worth the effort to use SharedMemoryManager.

Could you suppress errors only for the relevant parts of the code so that the above becomes clear? Remember to add a try/except to ensure that the cleanup work is done even when unexpected exceptions occur.

@youkaichao
Member Author

Could you suppress errors only for the relevant parts of the code so that the above becomes clear? Remember to add a try/except to ensure that the cleanup work is done even when unexpected exceptions occur.

added.

@DarkLight1337
Member

Could you suppress errors only for the relevant parts of the code so that the above becomes clear?

By this I mean that you should minimize the number of lines suppressed. (I assume that the error only occurs when constructing the SharedMemory object?)

added.

How about the code involving unlink?

@youkaichao
Member Author

I assume that the error only occurs when constructing the SharedMemory object?

It is difficult to assume that. A Python exception might appear for some random reason. I prefer to use contextlib.suppress(OSError) over a larger scope, to ensure this function will not crash execution.
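A small sketch of what scoping the suppression to `OSError` looks like (an illustration, not the PR's code): the expected cross-node attach failure is swallowed, while unrelated exceptions still propagate and surface in tests.

```python
# contextlib.suppress narrows the ignored errors to OSError only;
# any other exception type still propagates out of the block.
import contextlib
from multiprocessing.shared_memory import SharedMemory

def try_attach(name: str) -> bool:
    found = False
    with contextlib.suppress(OSError):
        shm = SharedMemory(name=name)  # OSError here means "not this node"
        shm.close()
        found = True
    return found

if __name__ == "__main__":
    print(try_attach("definitely_missing_segment"))  # False
```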

How about the code involving unlink?

The unlink part does not need try-finally. It is already a cleanup step. If unlink fails, we don't need to unlink again.

@DarkLight1337
Member

I assume that the error only occurs when constructing the SharedMemory object?

It is difficult to assume that. A Python exception might appear for some random reason. I prefer to use contextlib.suppress(OSError) over a larger scope, to ensure this function will not crash execution.

We can add an except to the try-finally block to handle those errors (such as logging them)

How about the code involving unlink?

The unlink part does not need try-finally. It is already a cleanup step. If unlink fails, we don't need to unlink again.

I see, that is fine then.

@youkaichao
Member Author

We can add an except to the try-finally block to handle those errors (such as logging them)

Added in 72961a5, please take a look.
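The pattern the discussion settled on might look roughly like this (a sketch under stated assumptions, not the exact code in 72961a5): suppress only the expected `OSError`, log anything unexpected, and always run cleanup in `finally`.

```python
# try/except/finally shape: expected OSError -> "different node";
# unexpected errors are logged and re-raised; close() runs on every path.
import logging
from multiprocessing.shared_memory import SharedMemory

logger = logging.getLogger(__name__)

def probe(name: str) -> bool:
    shm = None
    try:
        shm = SharedMemory(name=name)  # fails with OSError on another node
        return True
    except OSError:
        return False                   # expected in the multi-node case
    except Exception:
        logger.exception("unexpected error during same-node check")
        raise
    finally:
        if shm is not None:
            shm.close()                # cleanup runs even on errors
```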

@DarkLight1337
Member

Looks good, let's get this merged then.

@youkaichao youkaichao enabled auto-merge (squash) June 11, 2024 03:26
@simon-mo simon-mo disabled auto-merge June 11, 2024 17:53
@simon-mo simon-mo merged commit c4bd03c into vllm-project:main Jun 11, 2024
@youkaichao youkaichao deleted the single_host branch June 11, 2024 18:00
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 12, 2024
joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 27, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 8, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024

3 participants