[Core] Support LoRA on quantized models #4012
Conversation
@Yard1 All checks have passed. Could I trouble you to please review it again?
Very nice -- ty for closing out the last chores 🎉 |
```python
output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)

del llm_tp1
cleanup()
```
@Yard1 @jeejeelee I'm not able to release the GPU memory using cleanup or del llm_tp1. Are you sure this would work (if CI had multiple GPUs and we could enable this test)?
On version 0.4.2, once I start vllm.LLM I can't find any way to release the GPU memory again; I have to kill the process. Do you know any other way?
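For context, here is a minimal sketch of what a cleanup() helper along these lines might look like, assuming vLLM ~0.4.x; the import path of destroy_model_parallel differs between vLLM versions, so treat the import as an assumption:

```python
import contextlib
import gc

import ray
import torch
# NOTE: assumption -- in older vLLM releases this lives under
# vllm.model_executor.parallel_utils.parallel_state instead.
from vllm.distributed.parallel_state import destroy_model_parallel


def cleanup():
    # Tear down the model-parallel groups created by the LLM engine.
    destroy_model_parallel()
    # The default process group may not exist for tensor_parallel_size=1.
    with contextlib.suppress(AssertionError):
        torch.distributed.destroy_process_group()
    # Drop dangling Python references, then release cached CUDA blocks.
    gc.collect()
    torch.cuda.empty_cache()
    # Workers for tensor_parallel_size > 1 run as Ray actors.
    ray.shutdown()
```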
I don't remember the details, but I'm quite sure test_baichuan.py can be run with tp=2; I've tested it myself.
I have tried it with the following code, and the GPU memory is released normally. However, I didn't use the released 0.4.2 version; I used f12c3b5 instead.
```python
import pytest
import vllm

# do_sample, cleanup, MODELS, and the tinyllama_lora_files fixture are
# provided by the surrounding LoRA test module.


@pytest.mark.parametrize("model", MODELS)
# @pytest.mark.skip("Requires multiple GPUs")
def test_quant_model_tp_equality(tinyllama_lora_files, model):
    # Cannot use as it will initialize torch.cuda too early...
    # if torch.cuda.device_count() < 2:
    #     pytest.skip(f"Not enough GPUs for tensor parallelism {2}")
    llm_tp1 = vllm.LLM(model=model.model_path,
                       enable_lora=True,
                       max_num_seqs=16,
                       max_loras=4,
                       tensor_parallel_size=1,
                       quantization=model.quantization,
                       gpu_memory_utilization=0.8,
                       trust_remote_code=True)
    output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)

    del llm_tp1
    cleanup()

    llm_tp1 = vllm.LLM(model=model.model_path,
                       enable_lora=True,
                       max_num_seqs=16,
                       max_loras=4,
                       tensor_parallel_size=1,
                       quantization=model.quantization,
                       gpu_memory_utilization=0.8,
                       trust_remote_code=True)
    output_tp1 = do_sample(llm_tp1, tinyllama_lora_files, lora_id=1)

    del llm_tp1
    cleanup()
```
I don't have any multi-GPU machines available right now; I will test with tp=2 again later.
Great, thanks for checking that. I believe tensor_parallel_size=2 uses multiprocessing, so the behavior is different from what I experimented with. I'll open a new issue.
Building upon the excellent work done in #2828.
Since there hasn't been much progress on #2828, I'd like to continue and complete this feature.
Compared to #2828, the main improvement is the addition of tensor parallelism support. @fmmoret Please feel free to contact me if anything here doesn't look right.
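As a rough illustration of what this PR enables, here is a minimal usage sketch; the model name and adapter path are placeholders, and AWQ is just one example of a supported quantization scheme:

```python
import vllm
from vllm.lora.request import LoRARequest

# Placeholder model and adapter; substitute a real quantized checkpoint
# and a LoRA adapter trained for it.
QUANTIZED_MODEL = "TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ"
LORA_ADAPTER_PATH = "/path/to/lora-adapter"

llm = vllm.LLM(
    model=QUANTIZED_MODEL,
    quantization="awq",        # base weights are quantized
    enable_lora=True,          # LoRA adapters are applied on top of them
    max_loras=4,
    tensor_parallel_size=2,    # the main addition compared to #2828
    gpu_memory_utilization=0.8,
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Hello, my name is"],
    vllm.SamplingParams(temperature=0.0, max_tokens=32),
    lora_request=LoRARequest("example-adapter", 1, LORA_ADAPTER_PATH),
)
print(outputs[0].outputs[0].text)
```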