[RFC]: Add runtime weight update API #5723

Open
lyuqin-scale opened this issue Jun 20, 2024 · 4 comments
Motivation.

In online RL training, vLLM can significantly accelerate the rollout stage. To achieve this, weights must be synced from the main training process to the vLLM worker processes, after which the existing vLLM API can load them:
model_runner.model.load_weights
An example of such an implementation can be found in OpenRLHF: https://github.com/OpenLLMAI/OpenRLHF/blob/main/openrlhf/trainer/ray/vllm_worker_wrap.py

However, users currently have to monkey-patch the vLLM worker to introduce this behavior. It would be great if vLLM natively supported weight sync at runtime.

Proposed Change.

  1. Add an NCCL-based weight-sync process group during vLLM initialization, so that the main process can later dist.broadcast weights to the vLLM worker processes.
  2. Expose a weight-sync API, for example:
    def update_weight(self, name, dtype, shape)

Then, in the master process, the user can sync weights as follows (modified from OpenRLHF):

params = list(model.named_parameters())
num_params = len(params)
for count, (name, param) in enumerate(params, start=1):
    # Fire all vLLM engines for broadcast
    if torch.distributed.get_rank() == 0:
        # Under DeepSpeed ZeRO-3, the full (unpartitioned) shape is in param.ds_shape
        shape = param.shape if self.strategy.args.zero_stage != 3 else param.ds_shape
        refs = [
            engine.update_weight.remote(
                name, dtype=param.dtype, shape=shape,
                empty_cache=count == num_params,  # free the cache after the last weight
            )
            for engine in self.vllm_engines
        ]

        torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
        ray.get(refs)
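Worker-side, the corresponding update_weight endpoint could look roughly like the OpenRLHF monkey patch; this is a hedged sketch, not vLLM's actual API — _model_update_group and model_runner follow OpenRLHF/vLLM naming, while the device parameter is added here purely for illustration:

```python
import torch
import torch.distributed as dist

# Hedged sketch of the worker-side endpoint, modeled on OpenRLHF's
# vllm_worker_wrap.py. self._model_update_group is the extra weight-sync
# process group; self.model_runner is vLLM's model runner.
class WorkerWeightSyncMixin:
    def update_weight(self, name, dtype, shape, device="cuda", empty_cache=False):
        # Allocate a temp tensor, receive the broadcast from the trainer
        # (rank 0 of the weight-sync group), then reuse vLLM's existing
        # load_weights entry point.
        weight = torch.empty(shape, dtype=dtype, device=device)
        dist.broadcast(weight, src=0, group=self._model_update_group)
        self.model_runner.model.load_weights(weights=[(name, weight)])
        del weight
        if empty_cache and torch.cuda.is_available():
            torch.cuda.empty_cache()
```

The temp tensor is freed immediately after load_weights so at most one extra weight-sized buffer is live per worker at any time.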

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

@youkaichao

thanks for the information! can you describe the processes involved, which tensor lives on which device and in which process, and what the desired transfer is?

also, cc @hijkzzz from #5477

@lyuqin-scale (Author)

@youkaichao
processes: vLLM worker process(es) on GPUs 0 and 1, main training process on GPU 2
tensors: HF weights live in the main training process on GPU 2; they are dist.broadcast into temp tensors of the same size on the vLLM workers on GPUs 0 and 1, and then, within each vLLM worker process:
model_runner.model.load_weights(weights=[(name, weight)])
where weight is the temp tensor holding one of the weights broadcast from the main process to the vLLM worker process

the transfer is preferably done via NCCL

@hijkzzz

hijkzzz commented Jun 21, 2024

More importantly, we need to support establishing an NCCL group between DeepSpeed and vLLM engines.
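One way such a group could be established is a shared TCP rendezvous that the DeepSpeed trainer joins as rank 0 and each vLLM engine joins at its own rank via an engine-side hook. This is a hedged sketch under those assumptions — join_weight_sync_group is a hypothetical helper, not an existing vLLM or DeepSpeed API:

```python
import torch.distributed as dist

# Hypothetical helper: trainer (rank 0) and the vLLM engines (ranks 1..N)
# all call this with the same rendezvous address and world_size = 1 + N,
# yielding one process group that spans both frameworks.
def join_weight_sync_group(rank, world_size, master_address, master_port,
                           backend="nccl"):
    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://{master_address}:{master_port}",
        rank=rank,
        world_size=world_size,
    )
    return dist.group.WORLD
```

Note that if a process has already initialized a default torch.distributed group (as vLLM workers do for tensor parallelism), the extra group has to be created without re-initializing the default one — OpenRLHF works around this with its own init helper.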

@hijkzzz

hijkzzz commented Jun 27, 2024

Update:
This API should support LoRA weight updates as much as possible.

See:
https://github.com/OpenLLMAI/OpenRLHF/pull/335/files
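One way to keep a dense update_weight path working for LoRA is to merge each adapter into its base weight before broadcasting. A hypothetical helper (not OpenRLHF's actual code; scaling is LoRA's alpha / r factor):

```python
import torch

# Hypothetical helper: merge a LoRA adapter into its base weight so the
# merged dense tensor can be broadcast through an unchanged update_weight
# path. Standard LoRA merge: W' = W + scaling * (B @ A).
def merge_lora_weight(base: torch.Tensor,
                      lora_a: torch.Tensor,
                      lora_b: torch.Tensor,
                      scaling: float) -> torch.Tensor:
    return base + scaling * (lora_b @ lora_a)
```

Merging trades extra trainer-side compute for zero changes on the vLLM side; pushing the A/B factors separately would need LoRA-aware handling in the worker.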
