
Conversation

@CharlieFRuan
Collaborator

@CharlieFRuan CharlieFRuan commented Jul 8, 2025

Overview

Before this PR, SGLang could only be used as a rollout backend through a remote server (see sglang_server.py).

This PR implements sglang_engine.py to allow using SGLang locally (e.g., colocated with the policy model).

We bump SGLang to 0.4.8.post1 for now. Bumping to 0.4.9.post1 causes weight sync to hang in the non-colocated case (while still using the local engine) -- i.e. the no_colocate_nccl_fsdp2_sglang test in test_policy_local_engines_e2e.py would fail. 0.4.8.post1 already supports two-stage wake-up: sgl-project/sglang#7099

Currently, we still do not support TP > 1 with the local engines; this is left as a future TODO.

Three quirks

  1. We use a remote task get_sglang_engine() to create SGLangInferenceEngine, since importing SGLang requires a GPU; without one, SGLang tries to import vLLM, which makes dependency management messy.
  2. To support weight sync via CUDA IPC, we need per-TP-worker code. Since SGLang does not support a worker-extension-cls like vLLM does, the only way I found is to use custom_weight_loader. We base64-encode the IPC handles into a tensor and reuse SGLang's update_weights_from_tensor() (see the sketch after this list).
  3. SGLang currently cannot sleep, wake up, and then start generating; an explicit weight sync is required, hence the no_sync parameter change in eval_weights_manager ([Bug][sleep] Create engine, sleep, wake up, generate --> gibberish sgl-project/sglang#7939)
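
For reference, a minimal sketch of the encoding in quirk 2, assuming pickle for the byte serialization (the function name is illustrative, not an actual helper in this PR; the end-marker constant matches the one in the diff below):

import base64
import pickle

import torch

END_MARKER = b"__END_OF_REQUEST__"

def encode_ipc_request_as_tensor(request: dict) -> torch.Tensor:
    # Pack a weight-sync request (including CUDA IPC handles) into a uint8 CPU
    # tensor so it can ride through SGLang's update_weights_from_tensor() path.
    payload = base64.b64encode(pickle.dumps(request)) + END_MARKER
    return torch.frombuffer(bytearray(payload), dtype=torch.uint8)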

Tests

  • Parametrized test_policy_vllm_e2e.py to also run with SGLang, and renamed the test accordingly. This test covers instantiating the engine, sleep, wake up, weight sync, and generation, across different config combinations.
  • Parametrized test_engine_generation.py, which now tests both remote and local SGLang.
  • See the E2E results below as well

Future TODO

  • Support TP > 1 for the non-remote SGLang engines, reaching parity with non-remote vLLM engines

E2E run_gsm8k.sh on 4xH100

Four runs in total: for each of vLLM and SGLang, one non-colocated run (2 TP=1 engines for inference, 2 for training) and one colocated run (4 TP=1 engines for inference, 4 for training).
Performance
[screenshot: performance comparison]

Metrics
[screenshot: training metrics]

"""Update named weights in SGLang engine."""
extras = request.get("extras")
if extras is not None and "ipc_handles" in extras:
# CUDA IPC -- Here we reuse SGLang's update_weights_from_tensor, but actually load the
Collaborator Author


For CUDA IPC weight syncing, an alternative is to use SGLang's Engine.collective_rpc() and patch an update_weights_cuda_ipc() onto sgl's Scheduler (the entity that carries out the RPC), mimicking our vLLM implementation.

Tried it, but I found it hard to patch a method onto SGLang's Scheduler without native support like vLLM's worker_extension_cls, since sgl instantiates the Scheduler in a subprocess and it can easily lose what we patched in the main process. Modifying the source code might work, but the current solution is likely better?
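
A generic illustration of the failure mode (not SGLang code; assumes a spawn-started subprocess): the child re-imports modules, so a method patched onto a class in the parent simply is not there in the child.

import multiprocessing as mp

class Scheduler:  # stand-in for SGLang's Scheduler
    pass

def child_check():
    # Runs in the spawned child: the patch applied inside the __main__ guard
    # below was never executed here, so the attribute is missing.
    print(hasattr(Scheduler, "update_weights_cuda_ipc"))  # prints False

if __name__ == "__main__":
    Scheduler.update_weights_cuda_ipc = lambda self, req: None  # patched in parent only
    p = mp.get_context("spawn").Process(target=child_check)
    p.start()
    p.join()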

Member


I do think this current solution is better than patching. Let's now chat with the SGLang folks to get their input in more detail on enabling per-TP-worker code.

deepspeed = [
    "deepspeed==0.16.5"
]
cpu_ci_test = [
Collaborator Author


This is added for CPU tests like test_models.py, which requires flash attention. We can remove this once vllm and sglang depend on the same flash attention (at which point it moves back into the main dependencies rather than being duplicated in each extra).

@CharlieFRuan CharlieFRuan changed the title [Generator] Add initial support for non-remote SGLang engine [Generator] Support non-remote (e.g. colocated) SGLang engine Jul 12, 2025
@tyler-griggs tyler-griggs self-requested a review July 13, 2025 20:21
Member

@tyler-griggs tyler-griggs left a comment


Adding a first round of comments

]
vllm = [
    "vllm==0.8.5",
    "flash-attn@https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl",
Member


I am going to momentarily ignore the pyproject file in this first pass of reviews because I assume it will be rebased on #73



# TODO(charlie): duplicate of setup_envvars_for_vllm, is it needed?
def setup_envvars_for_sglang(kwargs, bundle_indices):
Member


As far as I know, distributed_executor_backend is specific to vLLM's configuration. Also, it should only matter for TP > 1.

For noset_visible_devices, I have a feeling this would also only matter for TP > 1, so it's hard to test now whether something like this will be needed.

Collaborator Author


I see! I'll leave it here and address it when we support TP > 1, would that be fine? It does not seem to affect the current cases.

Member


sgtm

Member


By the way, not sure if you saw it, but we did have a similar test for sglang under tests/sglang. It seems like your changes here replace it and we should delete the tests/sglang folder. We originally split sglang into its own folder so we could just run uv run --isolated --extra dev --extra vllm pytest tests/gpu and run all tests without having to separately run vllm and sglang. I don't know the right way to structure this long-term, but for now we rarely run all gpu tests like this and more often we manually choose some subset of tests, so I think it's fine.

Collaborator Author


Ah yes, I'll delete tests/sglang. And what you said about being able to run an entire folder with --extra vllm makes sense. We could structure our future tests with marks=pytest.mark.sglang and pytest.mark.vllm, which should keep both groups of tests in the same folder while still letting us run the entire folder.
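
A rough sketch of what that could look like (illustrative test, not code from this PR; the markers would also need to be registered, e.g. under [tool.pytest.ini_options], to avoid warnings):

import pytest

@pytest.mark.parametrize(
    "backend",
    [
        pytest.param("vllm", marks=pytest.mark.vllm),
        pytest.param("sglang", marks=pytest.mark.sglang),
    ],
)
def test_generation_backends(backend):
    # Hypothetical body; the real tests instantiate the engine and generate.
    assert backend in ("vllm", "sglang")

Something like pytest -m sglang tests/gpu (or -m vllm) would then select just one group while both live in the same folder.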

@CharlieFRuan CharlieFRuan force-pushed the pr-0707-sglang-non-remote branch from 83844a7 to 55d58c0 Compare July 14, 2025 21:47
@CharlieFRuan
Collaborator Author

@tyler-griggs addressed comments, rebased to main, and tested with the GPU tests and run_gsm8k.sh. Ready for another round of review :)

Comment on lines +160 to +165
# Importing sglang requires a visible GPU, but Ray patches CUDA_VISIBLE_DEVICES for
# num_gpus=0 tasks (see discussion below), so temporarily expose GPU 0 for the import
# and then restore the original value.
before_cuda_visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "")
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from skyrl_train.inference_engines.sglang.sglang_engine import SGLangRayActor

os.environ["CUDA_VISIBLE_DEVICES"] = before_cuda_visible_devices

Member


Seems like another issue with CUDA_VISIBLE_DEVICES patching in Ray for num_gpus=0 cc @pcmoritz

Collaborator

@pcmoritz pcmoritz Jul 23, 2025


We are going to fix it, see ray-project/ray#54868

Member


Great!

request_tensor = [("ipc_request", request_tensor)]
obj = UpdateWeightsFromTensorReqInput(
    serialized_named_tensors=[
        MultiprocessingSerializer.serialize(request_tensor) for _ in range(self._tp_size)
Member


QQ: I don't know much about weight sync in sglang. Does MultiprocessingSerializer.serialize take a tensor on GPU and serialize it via pickle in sglang? If so, I'm wondering why the base64 encoding + .cuda() conversion was needed?

Collaborator Author

@CharlieFRuan CharlieFRuan Jul 23, 2025


  • serialize() has to take a list of (str, torch.Tensor) (i.e. named tensors) so that it matches what the SGLang codepath expects when it deserializes them
  • We want to pass our request (which includes the IPC handle info) down that codepath so our customized update_weight_cuda_ipc() can handle it in a per-TP-worker fashion
  • However, our request is not a tensor, so we base64-encode it and store it in a torch.uint8 tensor

Not sure if I answered your question...

Member


Okay got it.

I am wondering why the tensor needs to be on GPU though? I am not too familiar with sglang internals, but it looks like the Engine class runs the tokenizer manager while the TP workers are in a different process. Does MultiprocessingSerializer.serialize expect a GPU tensor?

Collaborator Author


Oh yes, you are right, it should be a CPU tensor. Changed and tested. Thanks for the catch!
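
For completeness, a sketch of the receiving side, assuming the same pickle + base64 packing as in the sketch in the PR description (the function name is illustrative; the real logic lives in the custom weight loader and also performs the actual CUDA IPC load):

import base64
import pickle

import torch

END_MARKER = b"__END_OF_REQUEST__"  # same marker the sender appends

def decode_ipc_request(tensor: torch.Tensor) -> dict:
    # The payload tensor arrives on CPU; strip the end marker, then reverse the
    # base64 + pickle packing to recover the request containing the IPC handles.
    tensor_bytes = tensor.numpy().tobytes()
    end_index = tensor_bytes.find(END_MARKER)
    if end_index == -1:
        raise ValueError("End marker not found in tensor data")
    return pickle.loads(base64.b64decode(tensor_bytes[:end_index]))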

@SumanthRH
Member

cc @tyler-griggs Let's try to get this PR in today

@SumanthRH
Member

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for using SGLang as a non-remote (colocated) inference engine. The changes include a new sglang_engine.py implementation with workarounds for SGLang's limitations, updates to the engine creation logic to support both vLLM and SGLang backends, and extensive refactoring of tests to be parameterized and cover both backends. My review focuses on improving robustness, maintainability, and cleaning up debugging artifacts. Key suggestions include ensuring correct GPU scheduling for Ray tasks, making data serialization more robust, and replacing debug print statements with structured logging.

Comment on lines +88 to +89
# TODO(charlie): duplicate of setup_envvars_for_vllm, is it needed?
def setup_envvars_for_sglang(kwargs, bundle_indices):
Contributor


medium

The TODO comment is valid. The function setup_envvars_for_sglang is very similar to setup_envvars_for_vllm in skyrl_train/inference_engines/vllm/vllm_engine.py. To improve maintainability and reduce code duplication, these two functions should be refactored into a single, shared utility function.

Comment on lines +124 to +127
end_marker = b"__END_OF_REQUEST__"
end_index = tensor_bytes.find(end_marker)
if end_index == -1:
    raise ValueError("End marker not found in tensor data")
Member


Nice!

Member

@SumanthRH SumanthRH left a comment


Thanks!

Left a nit - trying to understand why sglang import happens in a separate ray task - but beyond that LGTM.

Some CPU tests are failing. Let's make sure CPU tests and the new GPU tests pass.

@CharlieFRuan CharlieFRuan force-pushed the pr-0707-sglang-non-remote branch from 933ecaf to 26ac831 Compare August 12, 2025 07:54
@CharlieFRuan
Collaborator Author

This is ready again. I'm able to run the entire skyrl-train/tests/gpu/test_policy_local_engines_e2e.py. The CPU test fixed itself for some reason (not sure if it was flaky or fixed unintentionally); it was a missing sentencepiece/tiktoken dependency issue.

@CharlieFRuan
Collaborator Author

Current 1.5B Qwen2.5 gsm8k run on 4xH100:
[screenshot: training run results]

Member

@tyler-griggs tyler-griggs left a comment


I'm leaving one open thread to @SumanthRH, otherwise approve!

@SumanthRH SumanthRH merged commit 34e06da into NovaSky-AI:main Aug 12, 2025
3 checks passed
@CharlieFRuan CharlieFRuan deleted the pr-0707-sglang-non-remote branch August 12, 2025 22:20