-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[CI] [ROCm] Bugfix device environment issue #1984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -72,7 +72,7 @@ def set_stage_devices( | |
| else: | ||
| mapped_devices.append(str(idx)) | ||
| mapped_devices_str = ",".join(mapped_devices) | ||
| os.environ[env_var] = mapped_devices_str | ||
| current_omni_platform.set_device_control_env_var(mapped_devices_str) | ||
| if toks: | ||
| try: | ||
| selected_physical = int(mapped_devices[0]) | ||
|
|
@@ -99,7 +99,7 @@ def set_stage_devices( | |
| selected_physical = None | ||
| if selected_physical is None: | ||
| selected_physical = int(logical_idx) | ||
| os.environ[env_var] = str(selected_physical) | ||
| current_omni_platform.set_device_control_env_var(str(selected_physical)) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This assignment happens only after Useful? React with 👍 / 👎. |
||
| logger.debug( | ||
| "[Stage-%s] Logical index %d -> physical %s; set %s to single device", | ||
| stage_id, | ||
|
|
@@ -111,7 +111,7 @@ def set_stage_devices( | |
| logger.debug("[Stage-%s] Using default device visibility (devices=%s)", stage_id, devices) | ||
| else: | ||
| selected_physical = int(str(devices)) | ||
| os.environ[env_var] = str(selected_physical) | ||
| current_omni_platform.set_device_control_env_var(str(selected_physical)) | ||
| logger.debug("[Stage-%s] Set %s to single device %s (fallback)", stage_id, env_var, selected_physical) | ||
| except Exception as e: | ||
| logger.warning("Failed to interpret devices for stage %s: %s", stage_id, e) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -99,3 +99,17 @@ def synchronize(cls) -> None: | |
| def get_free_memory(cls, device: torch.device | None = None) -> int: | ||
| free, _ = torch.cuda.mem_get_info(device) | ||
| return free | ||
|
|
||
| @classmethod | ||
| def set_device_control_env_var(cls, devices: str | int | None) -> None: | ||
| import os | ||
|
|
||
| os.environ["HIP_VISIBLE_DEVICES"] = devices | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does ROCm also need
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @gcanlin yes. Some of the libraries like In the latest vLLM platform code, they synced/set both |
||
| os.environ["CUDA_VISIBLE_DEVICES"] = devices | ||
|
|
||
| @classmethod | ||
| def unset_device_control_env_var(cls) -> None: | ||
| import os | ||
|
|
||
| os.environ.pop("HIP_VISIBLE_DEVICES", None) | ||
| os.environ.pop("CUDA_VISIBLE_DEVICES", None) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On ROCm this block only snapshots
os.environ[device_control_env_var]before launch, butunset_device_control_env_var()now clears both HIP and CUDA. In environments that enter with onlyHIP_VISIBLE_DEVICESset (our AMD wrapper does this),previous_visible_devicesisNone, so the first stage launch drops the inherited HIP mask entirely. Any later stage/worker then inherits full-node visibility, and the multi-stage ROCm configs undertests/e2e/stage_configs/rocm/get remapped against physical GPUs instead of the shard-local subset.Useful? React with 👍 / 👎.