Isolation tests plugin #14904
Open: Eli-Siegel-nvidia wants to merge 7 commits into NVIDIA-NeMo:llmb-nemo-r2.5.0 from Eli-Siegel-nvidia:feat/es/isolation_plugin
Commits
abdf86d Eli-Siegel-nvidia: added script files to be run as part of the noise plugin
fe9c097 Eli-Siegel-nvidia: added isolation exposed as argument; use nccl noise
1bef7bd Eli-Siegel-nvidia: fixed variable to split nodes to pairs
6e5197c Eli-Siegel-nvidia: Apply isort and black reformatting
6ec6164 Eli-Siegel-nvidia: added missing import
d098060 Eli-Siegel-nvidia: Apply isort and black reformatting
a3a6e9d Eli-Siegel-nvidia: Fixed description for args
7 changes: 7 additions & 0 deletions
nemo/lightning/run/scripts/split_nodes/node_allocation/__init__.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| """Node allocation package for distributing compute nodes between workloads. | ||
|
|
||
| This package provides utilities and strategies for splitting allocated nodes | ||
| between workloads based on their topology information. | ||
| """ | ||
|
|
||
| __version__ = "1.0.0" |
199 changes: 199 additions & 0 deletions
nemo/lightning/run/scripts/split_nodes/node_allocation/parsers.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,199 @@ | ||
| """Parsers and utility functions for node allocation. | ||
|
|
||
| This module contains functions for parsing topology files, node lists, | ||
| and other utility functions needed for node allocation. | ||
| """ | ||
|
|
||
| import re | ||
| from typing import Dict, List, Optional, Set, Tuple | ||
|
|
||
|
|
||
| def expand_nodes(raw_string: str) -> List[str]: | ||
| """Expand a string containing node specifications into a list of node names. | ||
|
|
||
| Args: | ||
| raw_string: String containing node specifications, e.g. "pool1-1195,pool1-[2110-2111]" | ||
| or "hgx-isr1-[001-008]" | ||
|
|
||
| Returns: | ||
| List of expanded node names | ||
| """ | ||
| # Extract the nodes part after "Nodes=" when present; otherwise treat the input as a bare node list | ||
| nodes_part = raw_string.split("Nodes=", 1)[-1].split()[0] | ||
|
|
||
| expanded = [] | ||
| # Split on commas to handle each node specification | ||
| for spec in nodes_part.split(','): | ||
| # Handle range format: prefix-[start-end] | ||
| if '[' in spec and ']' in spec: | ||
| # Extract prefix (e.g., "pool1-" or "hgx-isr1-") | ||
| prefix = spec.split('[')[0] | ||
| range_part = spec.split('[')[1].split(']')[0] | ||
|
|
||
| if '-' in range_part: | ||
| start_str, end_str = range_part.split('-') | ||
| start = int(start_str) | ||
| end = int(end_str) | ||
| # Determine padding based on start number's string representation | ||
| padding = len(start_str) | ||
| for num in range(start, end + 1): | ||
| expanded.append(f"{prefix}{num:0{padding}d}") | ||
| else: | ||
| # Single number in brackets | ||
| num = int(range_part) | ||
| # Determine padding based on the number's length | ||
| padding = len(range_part) | ||
| expanded.append(f"{prefix}{num:0{padding}d}") | ||
| else: | ||
| # Handle individual node format without ranges | ||
| expanded.append(spec) | ||
|
|
||
| return expanded | ||
|
|
||
|
|
||
| def parse_topology_file(topology_file: str) -> Tuple[Dict[str, str], Dict[str, Dict[str, str]]]: | ||
| """Parse a topology file and return node-to-switch mapping and switch relationships. | ||
|
|
||
| Args: | ||
| topology_file: Path to the topology file | ||
|
|
||
| Returns: | ||
| Tuple containing: | ||
| - node_to_switch: Dictionary mapping node names to their switch | ||
| - switch_hierarchy: Dict containing switch parent-child relationships | ||
| """ | ||
| with open(topology_file) as f: | ||
| topo_output = f.read().strip().splitlines() | ||
|
|
||
| # Parse topology to map nodes to switches | ||
| node_to_switch: Dict[str, str] = {} | ||
| switch_hierarchy: Dict[str, Dict[str, str]] = { | ||
| 'parents': {}, # Maps each leaf (Level 0) switch to its comma-separated parent string | ||
| 'children': {}, # Maps each parent switch to the list of its leaf switches | ||
| } | ||
|
|
||
| current_switch = None | ||
|
|
||
| for line in topo_output: | ||
| # Look for switch definitions - match any switch name after SwitchName= | ||
| m = re.search(r'SwitchName=([^\s]+) Level=(\d+)', line) | ||
| if m: | ||
| current_switch = m.group(1) | ||
| switch_level = int(m.group(2)) | ||
|
|
||
| # For leaf switches (Level 0), find their parents | ||
| if switch_level == 0: | ||
| parent_match = re.search(r'Switches=(.*?)$', line) | ||
| if parent_match: | ||
| parents = parent_match.group(1).strip() | ||
| switch_hierarchy['parents'][current_switch] = parents | ||
|
|
||
| # Add this switch as a child of its parents | ||
| for parent in parents.split(','): | ||
| if parent not in switch_hierarchy['children']: | ||
| switch_hierarchy['children'][parent] = [] | ||
| switch_hierarchy['children'][parent].append(current_switch) | ||
|
|
||
| # Look for node definitions and map them to the current switch | ||
| if "Nodes=" in line and current_switch: | ||
| expanded = expand_nodes(line) | ||
| for node in expanded: | ||
| node_to_switch[node] = current_switch | ||
|
|
||
| return node_to_switch, switch_hierarchy | ||
|
|
||
|
|
||
| def parse_allocated_nodes(allocated_nodes_file: str) -> List[str]: | ||
| """Parse a file containing allocated nodes. | ||
|
|
||
| Args: | ||
| allocated_nodes_file: Path to the file containing the list of allocated nodes | ||
|
|
||
| Returns: | ||
| List of node names | ||
| """ | ||
| with open(allocated_nodes_file) as f: | ||
| allocated_nodes = f.read().strip().split() | ||
| return allocated_nodes | ||
|
|
||
|
|
||
| def parse_node_input(node_input: str, is_file: bool = False) -> List[str]: | ||
| """Parse node input, either from a file or directly from a string. | ||
|
|
||
| Args: | ||
| node_input: Either a file path or a direct node list string | ||
| is_file: Whether the input is a file path | ||
|
|
||
| Returns: | ||
| List of node names | ||
| """ | ||
| if is_file: | ||
| with open(node_input) as f: | ||
| nodes = f.read().strip().split() | ||
| return nodes | ||
| else: | ||
| # Direct input string, could be compressed, so don't split | ||
| return [node_input] | ||
|
|
||
|
|
||
| def group_nodes_by_switch(allocated_nodes: List[str], node_to_switch: Dict[str, str]) -> Dict[str, List[str]]: | ||
| """Group allocated nodes by their switch. | ||
|
|
||
| Args: | ||
| allocated_nodes: List of allocated node names | ||
| node_to_switch: Dictionary mapping nodes to switches | ||
|
|
||
| Returns: | ||
| Dictionary mapping switches to their list of allocated nodes | ||
| """ | ||
| switch_to_nodes: Dict[str, List[str]] = {} | ||
| missing_nodes: List[str] = [] | ||
|
|
||
| for node in allocated_nodes: | ||
| switch = node_to_switch.get(node) | ||
| if switch: | ||
| switch_to_nodes.setdefault(switch, []).append(node) | ||
| else: | ||
| missing_nodes.append(node) | ||
|
|
||
| if missing_nodes: | ||
| print(f"Warning: {len(missing_nodes)} node(s) not found in topology!") | ||
|
|
||
| return switch_to_nodes | ||
|
|
||
|
|
||
| def calculate_switch_distance(switch1: str, switch2: str, switch_hierarchy: Dict[str, Dict[str, str]]) -> int: | ||
| """Calculate the 'distance' between two switches based on topology. | ||
|
|
||
| Distance is defined as: | ||
| - 0 if switches are the same | ||
| - 1 if they share a direct parent | ||
| - 2 if they only share the core switch | ||
|
|
||
| Args: | ||
| switch1: First switch name | ||
| switch2: Second switch name | ||
| switch_hierarchy: Dict containing switch parent-child relationships | ||
|
|
||
| Returns: | ||
| Distance between the switches (0, 1, or 2) | ||
| """ | ||
| if switch1 == switch2: | ||
| return 0 | ||
|
|
||
| # Get parents | ||
| parents = switch_hierarchy.get('parents', {}) | ||
|
|
||
| # If we don't have parent info, assume maximum distance | ||
| if switch1 not in parents or switch2 not in parents: | ||
| return 2 | ||
|
|
||
| parent1 = parents.get(switch1) | ||
| parent2 = parents.get(switch2) | ||
|
|
||
| # If they share a direct parent, they're close | ||
| if parent1 and parent2 and parent1 == parent2: | ||
| return 1 | ||
|
|
||
| # Otherwise, they meet at the core | ||
| return 2 |
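Two behaviours of `parsers.py` above are worth illustrating. `expand_nodes` splits on every comma, so a bracket group with a comma inside it (for example `pool1-[2110-2111,2120]`) would be cut in the middle, and `calculate_switch_distance` compares the raw `Switches=` strings, so two leaf switches that list the same parents in a different order would not be treated as sharing a parent. The following is a minimal standalone sketch of one way to handle both cases. The names `expand_hostlist` and `switch_distance`, and the choice to store parents as sets, are assumptions made for illustration and are not part of this PR.

```python
import re
from typing import Dict, List, Set


def expand_hostlist(spec: str) -> List[str]:
    """Expand a compressed node list such as 'pool1-1195,pool1-[2110-2111,2120]'.

    Hypothetical helper: handles 'prefix[...]' groups only, not suffixes
    after the closing bracket or nested brackets.
    """
    nodes: List[str] = []
    # Split only on commas that are outside square brackets.
    for part in re.split(r",(?![^\[]*\])", spec):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]", part)
        if not match:
            # Plain node name (or an unsupported form): keep it unchanged.
            nodes.append(part)
            continue
        prefix, body = match.groups()
        for item in body.split(","):
            start_str, _, end_str = item.partition("-")
            end_str = end_str or start_str  # single number, e.g. [07]
            width = len(start_str)  # preserve the zero padding of the start value
            for num in range(int(start_str), int(end_str) + 1):
                nodes.append(f"{prefix}{num:0{width}d}")
    return nodes


def switch_distance(a: str, b: str, parents: Dict[str, Set[str]]) -> int:
    """Return 0 for the same switch, 1 if any direct parent is shared, else 2.

    Assumes `parents` maps each leaf switch to a *set* of parent names,
    so the order in which parents were listed does not matter.
    """
    if a == b:
        return 0
    if parents.get(a, set()) & parents.get(b, set()):
        return 1
    return 2


if __name__ == "__main__":
    print(expand_hostlist("pool1-1195,pool1-[2110-2111,2120]"))
    demo_parents = {
        "leaf1": {"spine1", "spine2"},
        "leaf2": {"spine2", "spine1"},
        "leaf3": {"spine3"},
    }
    print(switch_distance("leaf1", "leaf2", demo_parents))
    print(switch_distance("leaf1", "leaf3", demo_parents))
```

Note that the set-based variant treats switches with any common parent as distance 1, which is looser than the exact string match in the diff; whether that is the intended meaning of "share a direct parent" is for the author to confirm.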
8 GPUs per node is only true for H100 nodes. B200, GB200, and GB300 nodes usually have 4 GPUs per node.
We need to understand how to define a similar test for other systems, since it will need to be different for each.
I added a check for it on the llmb side.