@raayandhar

Motivation

Importing from SGLang is slow (#10492). I have been looking into what dominates the import time and whether there are simple ways to reduce it. Following the original issue, the target is:

time python -c "from sglang.srt.managers.scheduler import Scheduler"

which is what I have been focusing my efforts on. However, I think there are things we can do to reduce the time for other imports as well. This is a first version (V1) to get feedback from the community.

Modifications

Some imports are heavy. For example, importing the quantization methods at module level is expensive. Moving such imports to function level, at the point where they are actually used, reduces module import time. I can see how this easily becomes an antipattern, and it can hurt performance if the import sits inside a frequently called function, so I tried to do this only in functions that we expect to run once or a small number of times. Still, I understand the argument against this kind of code. I also don't think all of the changes to hf_transformers_utils.py help, so I will take a deeper look, since those changes are a bit invasive.
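A minimal sketch of the deferral described above, not code from this PR. The module `heavy_quant_demo` and the function `resolve_quantization` are hypothetical stand-ins: a throwaway module with an artificial delay plays the role of a heavy dependency, and the function-level import postpones loading it until the first call.

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# Create a throwaway module standing in for a heavy dependency.
# The name `heavy_quant_demo` is hypothetical and exists only for this demo.
tmp_dir = tempfile.mkdtemp()
Path(tmp_dir, "heavy_quant_demo.py").write_text(
    textwrap.dedent(
        """
        import time
        time.sleep(0.05)  # simulate an expensive import

        def get_method(name):
            return "method:" + name
        """
    )
)
sys.path.insert(0, tmp_dir)


def resolve_quantization(name: str) -> str:
    # Deferred import: the cost is paid on the first call,
    # not when the enclosing module is imported.
    from heavy_quant_demo import get_method

    return get_method(name)


loaded_before = "heavy_quant_demo" in sys.modules
result = resolve_quantization("fp8")
loaded_after = "heavy_quant_demo" in sys.modules
print(loaded_before, loaded_after, result)  # False True method:fp8
```

The trade-off is the one noted above: the first caller pays the import cost, so this only makes sense for code paths that run rarely.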

Accuracy Tests

These changes should not affect model outputs.

Benchmarking and Profiling

Benchmark command, with the timings summarized by calc_avg.py:

for i in {1..100}; do (time python -B -c "from sglang.srt.managers.scheduler import Scheduler") 2>&1 | grep "^real"; done | python calc_avg.py
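The linked calc_avg.py is not reproduced in this thread. As a rough sketch of what such a helper could look like (an assumption, not the author's script), the following parses the `real` lines printed by bash `time` and computes the same kind of summary. The function names and the use of the sample standard deviation are choices made here for illustration.

```python
import re
import statistics

# Matches bash `time` output such as "real\t0m8.308s".
REAL_RE = re.compile(r"^real\s+(\d+)m([\d.]+)s")


def parse_real_seconds(lines):
    """Return wall-clock seconds for every `real` line produced by bash `time`."""
    seconds = []
    for line in lines:
        match = REAL_RE.match(line.strip())
        if match:
            seconds.append(int(match.group(1)) * 60 + float(match.group(2)))
    return seconds


def summarize(seconds):
    """Compute run count, mean, median, sample std dev, min and max."""
    return {
        "runs": len(seconds),
        "mean": statistics.mean(seconds),
        "median": statistics.median(seconds),
        "stdev": statistics.stdev(seconds) if len(seconds) > 1 else 0.0,
        "min": min(seconds),
        "max": max(seconds),
    }


# Illustrative input only; these are not measurements from this PR.
sample = ["real\t0m8.000s", "real\t0m9.000s", "real\t0m10.000s"]
stats = summarize(parse_real_seconds(sample))
for key, value in stats.items():
    print(f"{key}: {value}")
```

To use it in the pipeline above, the same functions could be fed `sys.stdin` instead of the sample list.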

With these changes:

===========Timing Statistics============                                                                            
Number of runs: 100                                                                                                 
Mean:     8.308s                                                                                                    
Median:   8.236s                                                                                                    
Std Dev:  0.297s                                                                                                    
Min:      7.999s                                                                                                    
Max:      9.329s                                                                                                    
========================================

Compared to top-of-main:

===========Timing Statistics============                                                                            
Number of runs: 100                                                                                                 
Mean:     9.836s                                                                                                    
Median:   9.790s                                                                                                    
Std Dev:  0.332s                                                                                                    
Min:      8.655s                                                                                                    
Max:      11.801s                                                                                                   
======================================== 

so we have a ~1.5 second improvement in the mean (9.836s to 8.308s). Not the best, so I am going to keep working on it. I mostly targeted the time spent creating the ModelConfig object. The difference so far comes largely from removing the quantization import:

Without these changes (import_sglang_tom.log):
[Screenshot, 2025-11-01]

With these changes (import_sglang_new.log):
[Screenshot, 2025-11-01]

Machine:

  • AMD EPYC 7343 16-Core Processor
  • L40S GPU

@raayandhar

I will continue working on this; there are more improvements to be made.

for name, cls in _CONFIG_REGISTRY.items():
    with contextlib.suppress(ValueError):
        AutoConfig.register(name, cls)
def _register_custom_configs():

I think many of the changes in this file do not really improve the times, and they are overly invasive. However, the transformers-related code accounts for a large share of the import time. It may be impossible to remove, since we initialize many transformers-related objects in the scheduler.

OK, so moving the config registration into a function, so that it runs later, does massively improve the times. Deferring the _CUSTOMIZED_MM_PROCESSOR import also drastically reduces the import time.

@hnyls2002 hnyls2002 self-assigned this Nov 3, 2025

def add(self, req: Req, is_retracted: bool = False) -> None:
    """Add a request to the pending queue."""
    from sglang.srt.managers.schedule_batch import RequestStage

This pattern is very repetitive, but importing from sglang.srt.managers.schedule_batch is somewhat expensive. RequestStage is just an enum, so in my opinion we shouldn't pay for that large import just to get an enum. Maybe we can move the enum into a different file, or find some other solution with the same effect, so that it can be imported at the top level.
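One way the suggestion above could look, sketched in a single file for brevity. The file name `request_stage.py`, the enum member names, and the `PendingQueue` class are all placeholders for illustration and do not reflect the real SGLang definitions.

```python
from enum import Enum, auto


# --- hypothetical lightweight module, e.g. request_stage.py ---
# It imports nothing heavy, so importing it at module top level is cheap.
class RequestStage(Enum):
    # Member names are placeholders, not the real SGLang values.
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()


# --- consumer module ---
# With the enum in its own file this would be a plain top-level import:
#     from request_stage import RequestStage
class PendingQueue:
    def __init__(self) -> None:
        self.items = []

    def add(self, req_id: str) -> None:
        """Add a request to the pending queue (no function-level import needed)."""
        self.items.append((req_id, RequestStage.WAITING))


queue = PendingQueue()
queue.add("req-1")
print(queue.items)
```

The heavier module could keep re-exporting the enum from the new file so that existing `from ... import RequestStage` statements continue to work.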


)
from sglang.utils import LazyImport

MoeRunner = LazyImport("sglang.srt.layers.moe.moe_runner.runner", "MoeRunner")

I think this (LazyImport(...)) is something we can apply to many other files as well. I'm not sure if there's a downside?
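To make the mechanism concrete, here is a minimal sketch of a lazy-import proxy. It is an illustration only: the class `LazyAttr` is hypothetical and is not the implementation of `sglang.utils.LazyImport`, and `fractions.Fraction` is used as a stand-in target instead of MoeRunner.

```python
import importlib


class LazyAttr:
    """Proxy that imports `module_name` and fetches `attr_name` on first use."""

    def __init__(self, module_name: str, attr_name: str) -> None:
        self._module_name = module_name
        self._attr_name = attr_name
        self._target = None

    def _load(self):
        # Import the module and resolve the attribute only once.
        if self._target is None:
            module = importlib.import_module(self._module_name)
            self._target = getattr(module, self._attr_name)
        return self._target

    def __call__(self, *args, **kwargs):
        # Calling the proxy constructs/calls the real object.
        return self._load()(*args, **kwargs)

    def __getattr__(self, name):
        # Only reached for attributes not found on the proxy itself.
        return getattr(self._load(), name)


# Stand-in target from the standard library instead of MoeRunner.
Fraction = LazyAttr("fractions", "Fraction")

half = Fraction(1, 2)  # first use triggers the real import
print(half)  # 1/2
print(Fraction.from_float(0.25))  # attribute access is forwarded: 1/4
```

One possible downside of this style of proxy: the name is bound to a proxy object rather than to the class itself, so `isinstance(x, Fraction)` checks or subclassing do not behave as they would with the real class. Whether that matters depends on how the imported name is used.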

    disable_overlap_schedule: bool,
    offload_tags: set[str],
):
    from sglang.srt.layers.moe import TboDPAttentionPreparer

@raayandhar raayandhar Nov 6, 2025

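Maybe it's not worth lazy loading this one here. It saves 600 ms of import time, but this function is called far more often, inside the event loop. I'm not sure how much that costs, though, because after the first call the import should be served from Python's module cache (sys.modules).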
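A small experiment for the caching question, assuming nothing about SGLang itself: `json` stands in for the already-loaded module, and the two functions differ only in where the import lives. After the first call, the function-level `from ... import` finds the module in `sys.modules`, so each later call pays for a cache lookup and a name binding rather than a re-import. The script prints per-call timings so the overhead can be measured on the target machine.

```python
import json  # stand-in for an already-imported heavy module
import sys
import timeit


def encode_top_level(value):
    return json.dumps(value)


def encode_function_level(value):
    # Executed on every call, but after the first call the module
    # is found in sys.modules and is not re-imported.
    from json import dumps

    return dumps(value)


encode_function_level({"warm": True})  # pay the first-call cost up front
print("json" in sys.modules)  # True

n = 100_000
top = timeit.timeit(lambda: encode_top_level(1), number=n)
lazy = timeit.timeit(lambda: encode_function_level(1), number=n)
print(f"top-level import:      {top / n * 1e9:.0f} ns/call")
print(f"function-level import: {lazy / n * 1e9:.0f} ns/call")
```

If the difference is small relative to the work done per call, keeping the lazy import in the event-loop path may be acceptable. Otherwise the import could be resolved once and stored on the object.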

@raayandhar

raayandhar commented Nov 6, 2025

Hi experts,

At this point, looking at the profiling, there has been a pretty good improvement in times. Measuring with time python -X importtime -c "from sglang.srt.managers.scheduler import Scheduler" 2> import_sglang.log, we started at ~6000 ms overall and are now down to ~4000 ms; see below:
Top-of-main, newly updated (import_sglang_tom.log):
[Image: importtime profile, top-of-main]
Improved version, with my changes (import_sglang_improved.log):
[Image: importtime profile, improved]
The images are easier to read at full size. In words, we see an improvement of around 33%. In the improved version, most of what's left is transformers / torch imports that are basically unavoidable (without some extremely invasive changes). Many of the other imports have shrunk massively, e.g. model_config went from 4300 ms to 550 ms. See the logs for more details. I also have an aggressively optimized version (just to see what's possible) that improves it even further, but those changes are really invasive and impractical.
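For reading logs like the ones above without the images, a short script can rank modules by cumulative import time. This is a sketch based on the line format that `python -X importtime` writes to stderr (`import time: <self us> | <cumulative us> | <module>`). The function name `top_imports` and the sample numbers are made up for illustration and are not taken from the attached logs.

```python
import re

# Each data line of `python -X importtime` looks like:
#   import time:       123 |        456 |   package.module
LINE_RE = re.compile(r"^import time:\s+(\d+)\s+\|\s+(\d+)\s+\|\s+(\S+)")


def top_imports(lines, limit=10):
    """Return (cumulative_ms, module) pairs sorted by cumulative time, descending."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # the header line has no numeric columns and is skipped
            cumulative_us = int(match.group(2))
            rows.append((cumulative_us / 1000.0, match.group(3)))
    rows.sort(reverse=True)
    return rows[:limit]


# Illustrative lines only, not taken from the logs in this PR.
sample_log = [
    "import time: self [us] | cumulative | imported package",
    "import time:       200 |       1500 |   json.decoder",
    "import time:       300 |       4000 | json",
]
for cumulative_ms, module in top_imports(sample_log):
    print(f"{cumulative_ms:8.1f} ms  {module}")
```

With a real log, `top_imports(open("import_sglang.log"))` would give the same kind of ranking. Note that cumulative times of parent modules include their children, so the rows are not additive.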

The changes largely consist of moving imports into functions so they are lazy-loaded, or moving imports so they only run during type checking. As I've commented earlier, not all of the changes are very pretty. My rationale is the following: nearly all of the functions that I moved imports into will probably run only intermittently, or even just once at object init (e.g. many of the functions in hf_transformers_utils.py are this way). So the lazy loading may look a bit uglier, but we reap good benefits for effectively no downside. I left some more comments with other thoughts on how best to do this.

Also, this is largely specific to the import path and object in the issue (Scheduler). I think these changes should help other paths as well (e.g. the changes to hf_transformers_utils.py should be useful, among others). If there are other paths that should be targeted, let me know and I will work on them. Furthermore, I think there's a lot of free lunch in two types of changes:

    1. For imports that are only used as types, moving them under an if TYPE_CHECKING block seems to have no downside. A lot of the code already does this, but some parts don't, since I was able to find such changes on this path.
    2. Using LazyImport where possible. This issue has been described before (Slow import #606), and LazyImport is currently only used in sglang/__init__.py. It seems like there's no downside to using it (but I could be misunderstanding, please let me know), so we could use it more broadly.

So these two changes could be used more broadly than just this path.
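The first of the two patterns can be sketched as follows. `fractions.Fraction` is only a stand-in for a heavy, type-only import, and `double` is a hypothetical function. The annotations are written as strings so that the name does not need to exist at runtime; `from __future__ import annotations` at the top of the file would have the same effect.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by static type checkers only; this branch is never executed
    # at runtime, so the import costs nothing when the module loads.
    from fractions import Fraction  # stand-in for a heavy, type-only import


def double(value: "Fraction") -> "Fraction":
    # String annotations are not evaluated, so `Fraction` need not be defined.
    return value * 2


print(TYPE_CHECKING)  # False
print(double.__annotations__)  # {'value': 'Fraction', 'return': 'Fraction'}
print(double(21))  # 42
```

The main caveat is code that inspects annotations at runtime, for example via `typing.get_type_hints`, which would fail to resolve a name that was only imported under `TYPE_CHECKING`.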

At this point I'm going to open this for review. I'm not sure this tackles exactly what the original issue was getting at, so I'd appreciate any clarification on what direction this PR should go. Thank you to the reviewers for taking the time to look at this PR!
