Your current environment
```python
from vllm import LLM
import torch

llm = LLM(model=model1_path, tensor_parallel_size=torch.cuda.device_count())
llm = LLM(model=model2_path, tensor_parallel_size=torch.cuda.device_count())
```
Executing the second line causes a CUDA out-of-memory error.
How would you like to use vllm
I want to run two models as a pipeline in a single Python script. After finishing inference with the first model, how can I release it and free its GPU memory before loading the second one? Loading the second model directly causes CUDA out of memory, because the first model is never released.
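For illustration, here is a minimal sketch of the kind of teardown sequence I am looking for. The import path of destroy_model_parallel, the exact cleanup steps, and the model paths are assumptions on my side (they seem to vary between vLLM versions), so this is only a sketch, not a confirmed API:

```python
import gc
import torch
from vllm import LLM
# Assumption: in recent vLLM versions destroy_model_parallel lives here;
# older versions expose it under a different module path.
from vllm.distributed.parallel_state import destroy_model_parallel

model1_path = "path/to/first/model"   # placeholder paths for illustration
model2_path = "path/to/second/model"

# Load and run the first model.
llm = LLM(model=model1_path, tensor_parallel_size=torch.cuda.device_count())
outputs1 = llm.generate(["Hello from model 1"])

# Tear down the first model before loading the second one.
destroy_model_parallel()     # release the parallel process groups
del llm                      # drop the reference to the engine
gc.collect()                 # let Python collect the engine objects
torch.cuda.empty_cache()     # return cached GPU memory to the driver

# Ideally the second model can now be loaded without CUDA OOM.
llm = LLM(model=model2_path, tensor_parallel_size=torch.cuda.device_count())
outputs2 = llm.generate(["Hello from model 2"])
```

I am also not sure whether this in-process cleanup is reliable when tensor_parallel_size > 1, since vLLM may spawn separate worker processes in that case; if so, running each model in its own subprocess might be the safer alternative.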