[rollout, perf] fix: support profiling for rollout via remote command in async mode #5001
bithighrr wants to merge 5 commits into verl-project:main
Code Review
This pull request introduces remote command interfaces to enable profiling for rollout processes in async mode, specifically addressing issues with SGLang implementations. The changes include adding new profiling-related functions and methods across verl/utils/profiler/__init__.py, verl/utils/profiler/rollout_profile.py, verl/workers/megatron_workers.py, verl/workers/rollout/base.py, verl/workers/rollout/sglang_rollout/http_server_engine.py, and verl/workers/rollout/sglang_rollout/sglang_rollout.py. The new rollout_profile_args function generates profiling parameters, and abstract methods for profiling control are added to BaseRollout and implemented in ServerAdapter. The integration in ActorRolloutRefWorker allows for starting and stopping profiling based on configuration. Overall, the changes enhance the profiling capabilities for distributed rollout systems.
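The pattern summarized in the review (abstract profiling hooks on the rollout base class, implemented by a server adapter that forwards them as remote commands) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: the class names, the method names start_profile and stop_profile, the command strings, and the send callable are all hypothetical stand-ins.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Awaitable, Callable

# Hypothetical transport: delivers (command, payload) to a remote rollout server.
SendFn = Callable[[str, dict[str, Any]], Awaitable[None]]


class RolloutSketch(ABC):
    """Stand-in for a rollout base class; only the profiling hooks are shown."""

    @abstractmethod
    async def start_profile(self, **profile_args: Any) -> None:
        ...

    @abstractmethod
    async def stop_profile(self) -> None:
        ...


class ServerAdapterSketch(RolloutSketch):
    """Forwards profiling control to the server as remote commands."""

    def __init__(self, send: SendFn) -> None:
        self._send = send
        self._profiling = False

    async def start_profile(self, **profile_args: Any) -> None:
        if self._profiling:
            return  # already profiling, so do not send a second start
        await self._send("start_profile", profile_args)
        self._profiling = True

    async def stop_profile(self) -> None:
        if not self._profiling:
            return  # nothing to stop
        await self._send("stop_profile", {})
        self._profiling = False


def run_demo() -> list[tuple[str, dict[str, Any]]]:
    """Drive the adapter with a fake transport and return what was sent."""
    sent: list[tuple[str, dict[str, Any]]] = []

    async def fake_send(command: str, payload: dict[str, Any]) -> None:
        sent.append((command, payload))

    async def scenario() -> None:
        adapter = ServerAdapterSketch(fake_send)
        await adapter.start_profile(output_dir="/path/to/save")
        await adapter.start_profile(output_dir="/path/to/save")  # ignored
        await adapter.stop_profile()
        await adapter.stop_profile()  # ignored

    asyncio.run(scenario())
    return sent
```

Keeping the hooks abstract on the base class lets a worker call them without knowing which rollout backend is in use. The guard flag in the adapter is one way to make repeated start or stop calls harmless.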
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Support in #4320
What does this PR do?
Fixes #4575.
Checklist Before Starting
- Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
  - {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, like [megatron, fsdp, doc]
  - {type} is in feat, fix, refactor, chore, test
  - For breaking changes, add [BREAKING] to the beginning of the title.
  - Example: [BREAKING][fsdp, megatron] feat: dynamic batching
Test
python3 -m verl.trainer.main_ppo \
    ...
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.mode=async \
    actor_rollout_ref.rollout.profiler.enable=True \
    actor_rollout_ref.rollout.profiler.tool=torch \
    actor_rollout_ref.rollout.profiler.ranks="[0]" \
    actor_rollout_ref.rollout.profiler.save_path="/path/to/save" \
    actor_rollout_ref.rollout.profiler.tool_config.torch.profile_by_stage=False \
    actor_rollout_ref.rollout.profiler.tool_config.torch.merge_profiles=False \
    actor_rollout_ref.rollout.profiler.tool_config.torch.step_start=1 \
    actor_rollout_ref.rollout.profiler.tool_config.torch.step_end=5 \
    global_profiler.steps='[0, 1, 2]' \
    ...
.../path/to/save.
API and Usage Example
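To make the test overrides easier to read, here is a rough sketch of how the enable, ranks, and global_profiler.steps settings might jointly decide whether a given rank profiles a given training step. The gating rule (all three conditions must hold) is an assumption for illustration and is not confirmed by this PR. The dataclass and function names are hypothetical, and the step_start and step_end options are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class ProfilerSettingsSketch:
    """Hypothetical mirror of a few overrides from the test command."""

    enable: bool = False
    ranks: list[int] = field(default_factory=list)  # ...rollout.profiler.ranks
    steps: list[int] = field(default_factory=list)  # global_profiler.steps


def should_profile(cfg: ProfilerSettingsSketch, rank: int, step: int) -> bool:
    # Assumed rule: profiling is enabled, this rank is selected,
    # and the current training step is one of the listed steps.
    return cfg.enable and rank in cfg.ranks and step in cfg.steps


# Values taken from the test command above.
cfg = ProfilerSettingsSketch(enable=True, ranks=[0], steps=[0, 1, 2])
profiled_steps = [s for s in range(5) if should_profile(cfg, rank=0, step=s)]
```

Under this assumed rule, a worker would start profiling at the beginning of a step where should_profile is true and stop it when the step ends.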
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- … ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- … recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.