Fix remote weight info nnode>1 and dp>1 #17389
Merged: ShangmingCai merged 39 commits into sgl-project:main from JD-ETH:feat/gloo-info-on-server-rank on Mar 31, 2026
Commits (39)
c461757 remote instance gloo comm (JD-ETH)
da95870 refactor design (JD-ETH)
8f54bc4 no longer global state (JD-ETH)
f5b3911 Merge upstream/main and resolve conflicts (JD-ETH)
99dbe44 Improve _sync_scheduler_infos_across_nodes: add retry logic, clean up (JD-ETH)
a67fa58 Reduce retry defaults and remove per-attempt warning (JD-ETH)
1107be1 Use 15-min TCPStore timeout instead of retry loop (JD-ETH)
99fcb85 test case pass (JD-ETH)
0a32f1b format (JD-ETH)
d7e1093 update (JD-ETH)
fd4d45d update (JD-ETH)
0ab2dcc Merge branch 'feat/gloo-info-on-server-rank' of https://github.com/JD… (JD-ETH)
1af71d9 Merge branch 'main' of https://github.com/JD-ETH/sglang into feat/glo… (JD-ETH)
e134c30 Merge upstream/main and resolve http_server.py conflict (JD-ETH)
6d96652 Refactor _register_to_engine_info_bootstrap to reuse NetworkAddress (JD-ETH)
4b6bcd1 reorder server starts (JD-ETH)
eb8c886 update (JD-ETH)
f3e2ffd bug fix (JD-ETH)
1d41b0f update bootstrap (JD-ETH)
b5256f5 deterministic port and fastapi (JD-ETH)
5ccf4e7 refactor piping away (JD-ETH)
7600596 Merge remote-tracking branch 'upstream/main' into feat/gloo-info-on-s… (JD-ETH)
263e724 address comments (JD-ETH)
8ed1559 format (JD-ETH)
490da34 Merge branch 'feat/gloo-info-on-server-rank' of https://github.com/JD… (JD-ETH)
9797567 Merge remote-tracking branch 'upstream/main' into feat/gloo-info-on-s… (JD-ETH)
741496c no get (JD-ETH)
824ea3b rename endpoint (JD-ETH)
2aad7d6 optional admin (JD-ETH)
1035d69 address comments (JD-ETH)
c3dd88a add file (JD-ETH)
9a9cedc Merge branch 'feat/gloo-info-on-server-rank' of https://github.com/JD… (JD-ETH)
366caf9 Merge branch 'main' into feat/gloo-info-on-server-rank (ShangmingCai)
dffea01 Auto-derive engine_info_bootstrap_port from --port to avoid conflicts (JD-ETH)
6a2d8cc Fix multi-node bootstrap port: derive from dist_init_addr not --port (JD-ETH)
ad00d26 Use NetworkAddress.parse instead of fragile string split for dist_ini… (JD-ETH)
85b03b1 Replace auto-derive engine_info_bootstrap_port with fixed default 6789 (JD-ETH)
3982849 fix test (JD-ETH)
1f1b1ab Merge branch 'main' into feat/gloo-info-on-server-rank (JD-ETH)
105 changes: 105 additions & 0 deletions
python/sglang/srt/entrypoints/engine_info_bootstrap_server.py
@@ -0,0 +1,105 @@

```python
# Copyright 2023-2024 SGLang Team
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import logging
import threading
from typing import Dict, Optional, Tuple

import uvicorn
from fastapi import FastAPI, HTTPException
from fastapi.responses import PlainTextResponse

logger = logging.getLogger(__name__)


class EngineInfoBootstrapServer:
    """Lightweight HTTP server for per-rank model info registration.

    Runs in a daemon thread on node_rank==0. Each ModelRunner registers its
    info via HTTP PUT after model initialization. The Engine
    accesses the collected info directly in-process; external consumers can
    query via HTTP GET.

    Currently supports transfer engine memory registration info.
    """

    def __init__(self, host: str, port: int):
        self.host = host
        self.port = port

        # Storage: {tp_rank: (session_id, weights_info_dict)}
        self.transfer_engine_info: Dict[int, Tuple] = {}
        self.lock = threading.Lock()

        app = FastAPI()

        @app.get("/health")
        def health():
            return PlainTextResponse("OK")

        @app.put("/register_transfer_engine_info")
        def register_transfer_engine_info(data: dict):
            try:
                tp_rank = data["tp_rank"]
                info = data["transfer_engine_info"]
                session_id = info["session_id"]
                weights_info_dict = info["weights_info_dict"]

                with self.lock:
                    self.transfer_engine_info[tp_rank] = (
                        session_id,
                        weights_info_dict,
                    )

                logger.info(
                    f"Registered transfer engine info for tp_rank={tp_rank}, "
                    f"session_id={session_id}"
                )
                return PlainTextResponse("OK")
            except Exception as e:
                logger.error(f"Failed to register engine info: {e}")
                raise HTTPException(status_code=400, detail=str(e))

        @app.get("/get_transfer_engine_info")
        def get_transfer_engine_info(rank: int):
            if rank < 0:
                raise HTTPException(status_code=400, detail="Invalid rank parameter")

            with self.lock:
                info = self.transfer_engine_info.get(rank)

            if info is None:
                raise HTTPException(
                    status_code=404,
                    detail=f"No transfer engine info for rank {rank}",
                )

            return {"rank": rank, "remote_instance_transfer_engine_info": list(info)}

        config = uvicorn.Config(app, host=host, port=port, log_level="warning")
        self._server = uvicorn.Server(config)
        self._thread = threading.Thread(
            target=self._server.run,
            daemon=True,
        )
        self._thread.start()
        logger.info(f"EngineInfoBootstrapServer started on {host}:{port}")

    def close(self):
        self._server.should_exit = True
        self._thread.join(timeout=5)

    def get_transfer_engine_info(self, rank: int) -> Optional[Tuple]:
        """Direct in-process access for co-located HTTP server (no HTTP round-trip)."""
        return self.transfer_engine_info.get(rank)
```
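For reviewers who want to exercise the endpoints by hand, here is a minimal client-side sketch of the wire format that `register_transfer_engine_info` and `get_transfer_engine_info` appear to expect, based only on the handler code in this file. The helper names (`build_register_request`, `build_query_url`, `parse_query_response`), the example session id, and the example `weights_info_dict` contents are hypothetical and not part of the PR. Port 6789 is taken from the commit title about the fixed default, and the host is assumed to be a hostname or IPv4 address.

```python
import json
from typing import Any, Dict, Tuple
from urllib.parse import urlencode
from urllib.request import Request


def build_register_request(
    host: str,
    port: int,
    tp_rank: int,
    session_id: str,
    weights_info_dict: Dict[str, Any],
) -> Request:
    """Build the PUT request a rank could send to register its info.

    The JSON layout mirrors the keys the handler reads: "tp_rank" and
    "transfer_engine_info" with "session_id" and "weights_info_dict".
    """
    payload = {
        "tp_rank": tp_rank,
        "transfer_engine_info": {
            "session_id": session_id,
            "weights_info_dict": weights_info_dict,
        },
    }
    return Request(
        url=f"http://{host}:{port}/register_transfer_engine_info",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_query_url(host: str, port: int, rank: int) -> str:
    """Build the GET URL; the handler takes `rank` as a query parameter."""
    query = urlencode({"rank": rank})
    return f"http://{host}:{port}/get_transfer_engine_info?{query}"


def parse_query_response(body: bytes) -> Tuple[Any, Any]:
    """Unpack the [session_id, weights_info_dict] pair from the GET response."""
    doc = json.loads(body)
    session_id, weights_info_dict = doc["remote_instance_transfer_engine_info"]
    return session_id, weights_info_dict


if __name__ == "__main__":
    # Placeholder values for illustration only; the real session id and
    # weights_info_dict format are defined elsewhere in the code base.
    req = build_register_request("127.0.0.1", 6789, 0, "sess-0", {"w": [1, 2, 3]})
    print(req.get_method(), req.full_url)
    print(build_query_url("127.0.0.1", 6789, 0))
```

Against a running server, the request object can be passed to `urllib.request.urlopen(req, timeout=5)`. Because the payload travels as JSON, any tuples inside `weights_info_dict` come back as lists and non-string keys come back as strings, so HTTP consumers may see a slightly different shape than the in-process accessor returns.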