[Feature] Support dynamic loading and unloading of Lora adapters #2891
Conversation
Hi. I'm looking forward to this feature getting added. Any updates on progress? Thanks.
@mitchklusty Thanks for noticing. We are currently adding support for other LoRA features such as unified paging and tensor parallelism, which will cause huge changes to the LoRA code. I'm afraid the dynamic loading/unloading feature has to wait for those features, so that mass conflicts can be avoided. Really sorry for that.
@Fridge003 That's ok, I completely understand. Any idea on a rough timeline for when it might get implemented, or is it still too early to say?
We have added this feature to the half-year plan, so it should be implemented before the end of June. If everything goes smoothly, it could ideally be done before the end of April.
Ok, great! Thanks for adding this feature.
excited to see this |
This pull request has been automatically closed due to inactivity. Please feel free to reopen it if needed. |
hi! so what's the progress? we're really waiting for this feature |
Sorry to keep you waiting... We are short of developers, and I'm also really busy with other tasks.
Thanks! Just wanted to clarify - this merge request will be closed and we need to wait for another one from @lifuhuang, right? Maybe we can help somehow to speed up the process? It seems like the main changes in the code have already been made.
Hi @kdduha, I discussed with @Fridge003 offline. From what I learned, the change in this PR was branched off main in Jan, so it has become somewhat outdated due to the changes introduced over the past months, and indeed we would need a separate PR. I plan to start working on this feature roughly in a week, after wrapping up a small task I have, and should be able to finish in June. But if you are interested in collaborating or taking a stab at it yourself, let me know; you can find me on Slack (Lifu).
@mitchklusty @binarycrayon @kdduha |
@mitchklusty @binarycrayon @kdduha Currently #7446 is still in review, but I have added usage descriptions. Please feel free to check out that branch and test it out, and let me and @Fridge003 know if you have any feedback or questions. Please also be aware of the few usability limitations of this first version; I am working on addressing them at the moment and will send separate PRs after the current one is merged. Stay tuned 🥂
Motivation
This PR aims to implement the dynamic LoRA feature mentioned in #2686.
This PR is still under development; please comment if the code can be improved.
Modifications
Current implementation of LoRA modules
Current LoRA features are implemented under the folder `python/sglang/srt/lora`, which contains three files: `lora.py`, `lora_manager.py`, and `lora_config.py`. See #1307 for the initial support.

In the `__init__` function of `ModelRunner`, a `LoraManager` is created if a valid `lora_path` is passed in `server_args`. The initialization of `LoraManager` has two parts: first `init_loras` is called to load the Hugging Face LoRA weights to CPU and replace the targeted layers with `BaseLayerWithLoRA` instances, then `init_lora_memory_pool` is called to preallocate the memory pool for S-LoRA. The LoRA modules defined in `lora.py` are based on the vLLM implementation.
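A minimal sketch of this initialization flow, under assumed constructor arguments; only `init_loras`, `init_lora_memory_pool`, and `BaseLayerWithLoRA` come from the description above, everything else is illustrative:

```python
# Simplified sketch of the LoraManager initialization described above.
# Constructor arguments and attribute names are assumptions for illustration.
class LoraManager:
    def __init__(self, base_model, lora_paths, max_loras_per_batch, device):
        self.base_model = base_model
        self.lora_paths = lora_paths          # e.g. {"adapter_a": "/path/to/a"}
        self.max_loras_per_batch = max_loras_per_batch
        self.device = device

        self.init_loras()             # load HF LoRA weights to CPU, patch target layers
        self.init_lora_memory_pool()  # preallocate the S-LoRA style memory pool

    def init_loras(self):
        # For each adapter: read its config, load its weights onto CPU, and
        # replace the targeted layers with BaseLayerWithLoRA wrappers.
        ...

    def init_lora_memory_pool(self):
        # Allocate a fixed-size GPU buffer that can hold up to
        # max_loras_per_batch adapters at a time.
        ...
```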
Before forwarding a batch, `LoraManager` calls its `prepare_lora_batch` method to load the active LoRA adapters from the memory pool. During loading, LoRA weights not used in the current batch can be evicted from the buffer if necessary.
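For illustration, the per-batch flow could be wired up roughly like this; only `prepare_lora_batch` is taken from the description, the surrounding names are assumptions:

```python
# Illustrative only: LoRA preparation hooked into the forward pass.
def forward_batch(model_runner, batch):
    if model_runner.lora_manager is not None:
        # Bring the adapters referenced by this batch into the GPU buffer,
        # evicting adapters that the current batch does not use if needed.
        model_runner.lora_manager.prepare_lora_batch(batch)
    return model_runner.model.forward(batch)
```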
Unit tests are located in `test/srt/models/test_lora.py`. The test for inference passes, but the test for serving is skipped, so the LoRA serving feature might need further checking. The benchmark code can be found in `benchmark/lora/lora_bench.py`.
Implementation of dynamic serving LoRA
Dynamic serving of LoRA means that LoRA adapters can be loaded and unloaded on the user's command during server runtime. This feature is already supported in vLLM (vllm lora doc). As mentioned in #1433, the current implementation supports multi-LoRA serving, but loading and unloading of LoRA modules can only be done at server initialization.
The design of loading and unloading LoRA at the API side can be similar to the `update_weights_from_disk` API, since both change the weights the server is currently running on. In this design, the two APIs are named `load_lora_adapter` and `unload_lora_adapter`, as in vLLM.
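For illustration, client-side usage of such endpoints might look like the sketch below; the endpoint names follow the proposal above, while the payload fields are assumptions rather than a finalized API:

```python
# Hypothetical client calls; the lora_name / lora_path payload fields are assumed.
import requests

base_url = "http://localhost:30000"

# Load a new adapter at runtime.
requests.post(
    f"{base_url}/load_lora_adapter",
    json={"lora_name": "my_adapter", "lora_path": "/path/to/my_adapter"},
)

# Unload it when it is no longer needed.
requests.post(
    f"{base_url}/unload_lora_adapter",
    json={"lora_name": "my_adapter"},
)
```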
After the user sends a `LoadLoraAdapterReq`/`UnloadLoraAdapterReq` request to the server, the server grabs a write lock and waits for the requests currently in progress to finish. The request is then transmitted to `ModelRunner` through several passes and handled by the `LoraManager` owned by `ModelRunner`.
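A rough sketch of the server-side handling, assuming an async request loop; the lock and forwarding helper below are placeholders, and only the lock-then-forward behavior comes from the description above:

```python
class LoadLoraAdapterReq:
    """Placeholder request object carrying the adapter name and path."""
    def __init__(self, lora_name: str, lora_path: str):
        self.lora_name = lora_name
        self.lora_path = lora_path

async def handle_load_lora_adapter(server, req: LoadLoraAdapterReq):
    # Grab the write lock so new inference requests are blocked and the
    # requests already in progress finish before the adapter set changes.
    async with server.model_update_lock:  # placeholder lock name
        # Forwarded through several passes (scheduler / TP workers) down to
        # ModelRunner, whose LoraManager performs the actual load.
        return await server.send_to_model_runner(req)  # placeholder helper
```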
On the `LoraManager` side, loading a new LoRA adapter follows the same process as initialization: collect the new target modules, initialize the new LoRA weights on CPU, and open new space in the memory buffer if needed. The implementation of unloading and the testing scripts are still to be done.
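A rough sketch of what the loading path on `LoraManager` could look like; the helper names are hypothetical, and only the three steps mirror the plan above:

```python
class LoraManager:
    # Sketch only: helper methods (load_lora_config, load_lora_weights_to_cpu,
    # memory_pool_fits, expand_memory_pool) are hypothetical names.
    def load_lora_adapter(self, lora_name: str, lora_path: str) -> None:
        # 1. Collect any new target modules introduced by this adapter.
        config = self.load_lora_config(lora_path)
        self.target_modules |= set(config.target_modules)

        # 2. Initialize the new LoRA weights on CPU.
        self.loras[lora_name] = self.load_lora_weights_to_cpu(lora_path, config)

        # 3. Open new space in the memory buffer if the preallocated pool
        #    cannot accommodate the new adapter (e.g. a larger rank).
        if not self.memory_pool_fits(config):
            self.expand_memory_pool(config)
```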
Checklist