Reduce time it takes to import SGLang #12510
I will continue working on this; there are more improvements to be made.
for name, cls in _CONFIG_REGISTRY.items():
    with contextlib.suppress(ValueError):
        AutoConfig.register(name, cls)
def _register_custom_configs():
I think a lot of the changes in this file do not really improve the times, and they are overly invasive. However, the transformers-related code accounts for a large share of the import time. I think those imports may be impossible to remove, since we initialize many transformers-related objects in the scheduler.
OK, so moving the configs into a function to load them later does massively improve the times. This, and also moving the _CUSTOMIZED_MM_PROCESSOR import drastically improves the import time.
aff6929 to
9adc75a
Compare
Signed-off-by: Raayan Dhar [email protected] <[email protected]>
Signed-off-by: Raayan Dhar [email protected] <[email protected]>
9adc75a to
6ed8399
Compare
Signed-off-by: Raayan Dhar [email protected] <[email protected]>
Signed-off-by: Raayan Dhar [email protected] <[email protected]>
def add(self, req: Req, is_retracted: bool = False) -> None:
    """Add a request to the pending queue."""
    from sglang.srt.managers.schedule_batch import RequestStage
This is a very repetitive pattern, but the from sglang.srt.managers.schedule_batch import is somewhat expensive. RequestStage is just an enum, so in my opinion we should not pay for that massive import just to get an enum. Maybe we can move the enum to a different file, or find some other solution with the same effect, so we can import it at the top level.
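One way the suggestion above could look, as a sketch only: the enum lives in a small module that imports nothing heavy. The module name in the comment and the member names are placeholders, not the real RequestStage definition.

```python
# Contents of a hypothetical lightweight module, e.g. request_stage.py.
# It depends only on the standard library, so importing it is cheap.
from enum import Enum, auto


class RequestStage(Enum):
    # Placeholder members; the real enum's members are not shown in this PR page.
    QUEUED = auto()
    RUNNING = auto()


def stage_label(stage: RequestStage) -> str:
    """Tiny helper showing that callers need only the enum itself."""
    return stage.name.lower()
```

The heavy module could then re-export the enum from the light one, so existing imports keep working while hot paths import from the light module at the top level.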
Signed-off-by: Raayan Dhar [email protected] <[email protected]>
)
from sglang.utils import LazyImport

MoeRunner = LazyImport("sglang.srt.layers.moe.moe_runner.runner", "MoeRunner")
I think this (LazyImport(...)) is something we can apply to many other files as well. I'm not sure if there's a downside?
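To make the trade-off concrete, here is a minimal sketch of how a lazy-import proxy can work in general. `LazyAttr` is a hypothetical class written for illustration; it is not the implementation of sglang.utils.LazyImport, and the standard-library target is only a stand-in for a heavy module.

```python
import importlib


class LazyAttr:
    """Proxy that imports `module_name` only when first used."""

    def __init__(self, module_name, attr_name):
        self._module_name = module_name
        self._attr_name = attr_name
        self._target = None

    def _load(self):
        # The import happens here, on first use, and the result is cached.
        if self._target is None:
            module = importlib.import_module(self._module_name)
            self._target = getattr(module, self._attr_name)
        return self._target

    def __getattr__(self, name):
        # Only reached for attributes not found on the proxy itself.
        return getattr(self._load(), name)

    def __call__(self, *args, **kwargs):
        return self._load()(*args, **kwargs)


# Stand-in target: nothing is resolved until the proxy is called.
OrderedDict = LazyAttr("collections", "OrderedDict")
d = OrderedDict(a=1)
```

A possible downside of this style is that the name refers to a proxy rather than the class itself, so code that relies on the real class object (for example `isinstance` checks or subclassing) may not behave the same way, and the first use pays the import cost at call time.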
    disable_overlap_schedule: bool,
    offload_tags: set[str],
):
    from sglang.srt.layers.moe import TboDPAttentionPreparer
It may not be worth lazy-loading this one here. It saves 600 ms of import time, but this code gets called much more often in the event loop. I'm not sure it matters, though, because the import may be cached.
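On the caching question: after the first execution, a function-level import finds the module in `sys.modules`, so repeated calls skip the module's initialization and pay only for the lookup and name binding. The sketch below measures that residual cost with a standard-library module as a stand-in; absolute numbers depend on the machine, so none are shown.

```python
import json
import sys
import timeit


def with_local_import():
    # After the first call this is a sys.modules lookup, not a re-import.
    from json import dumps
    return dumps


def with_global_import():
    # Uses the module object bound at import time.
    return json.dumps


with_local_import()
assert "json" in sys.modules  # the module is cached after first import

local_t = timeit.timeit(with_local_import, number=100_000)
global_t = timeit.timeit(with_global_import, number=100_000)
print(f"local import: {local_t:.4f}s, global import: {global_t:.4f}s")
```

Comparing the two timings on the target machine would show whether the per-call overhead is significant relative to the work done in the event loop.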
Hi experts, At this point, looking at the profiling, there has been some pretty good improvement in times. Looking at The changes are largely just moving imports into functions so they are lazy-loaded, or moving imports so they only run during type checking. As I've commented earlier, not all of the changes are very pretty. My rationale is the following: nearly all of the functions that I moved imports into will probably only run very intermittently, or even just once at object init (e.g. a lot of the functions in Also, this is largely specific to the import path and object in the issue (Scheduler). I think these changes should help other paths as well (i.e. the changes to
So these two changes could be used more broadly than just this path. At this point I'm going to open this PR for review. I'm not sure it exactly tackles what the original issue was getting at, so I'd appreciate any clarification on the direction this PR should take. I appreciate the time reviewers take to look at this PR!
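The "only run when we type check" technique mentioned in this comment can be sketched as follows. `decimal.Decimal` is only a stand-in for a class from an expensive module, and `describe` is a hypothetical function.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by static type checkers only; skipped at runtime because
    # TYPE_CHECKING is False when the program actually runs.
    from decimal import Decimal  # stand-in for a heavy import


def describe(value: "Decimal") -> str:
    # The quoted annotation is never evaluated at runtime, so the
    # name Decimal does not need to exist in this module.
    return repr(value)
```

This works when the imported name is needed only in annotations; if the function body uses the class at runtime, the import has to stay available there.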


Motivation
Importing SGLang takes a long time (#10492). I have been looking into what takes up most of that time and whether there are simple ways to reduce it. From the original issue, we want to reduce:
which is what I have been focusing my efforts on. However, I think there are things we can do to reduce time for other imports. This is more of a V1 to get community feedback from experts.
Modifications
There are some heavy imports. For example, the module-level import of the quantization methods is heavy. By moving some imports to the function level (the only place they are used), we can reduce module import time. However, I can see how this can easily become an antipattern. In fact, it can hurt performance if the import sits in a function that is called often. I tried to do this only in functions that we expect to run once or a small number of times. Still, I can understand the argument against this kind of code. I also don't think all the changes to hf_transformer_utils.py help, so I will be taking a deeper look, since those changes are a bit invasive.
Accuracy Tests
These changes should not affect model outputs.
Benchmarking and Profiling
Running
for i in {1..100}; do (time python -B -c "from sglang.srt.managers.scheduler import Scheduler") 2>&1 | grep "^real"; done | python calc_avg.py
(calc_avg.py)
With these changes:
Compared to top-of-main:
so we have a ~1.5 second improvement. Not the best, so I am going to keep working on it. I mostly targeted improving the time spent creating the ModelConfig object. The difference so far is largely from removing the quantization import:
Without these changes import_sglang_tom.log

With these changes import_sglang_new.log
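The contents of calc_avg.py are not shown on this page. As an assumption about what such a helper might do, the sketch below averages the `real` lines printed by the shell's `time` in the benchmark loop; the function names and the regular expression are hypothetical.

```python
import re
import sys

# Matches lines such as "real\t0m3.456s" produced by bash's `time`.
_REAL = re.compile(r"^real\s+(?:(\d+)m)?([\d.]+)s")


def parse_real(line):
    """Return elapsed seconds from one `real` line, or None if it does not match."""
    match = _REAL.match(line.strip())
    if match is None:
        return None
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def average(lines):
    """Average all parseable `real` timings in an iterable of lines."""
    values = [t for t in map(parse_real, lines) if t is not None]
    if not values:
        raise ValueError("no 'real' timing lines found")
    return sum(values) / len(values)


def main():
    # Intended for use at the end of the pipeline: ... | python calc_avg.py
    print(f"average real time: {average(sys.stdin):.3f} s")
```

When used as a script, `main()` would be called at the bottom of the file so the timings piped in from the loop are read from standard input.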

Machine:
Checklist