Switch trtllm slurm scripts to use mpirun directly for better stability #6

Merged
Kipok merged 2 commits into main from igitman/trtllm-stability on Feb 21, 2024

Conversation

Kipok (Collaborator) commented Feb 20, 2024

This change is needed to avoid errors on some Slurm configurations.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
SeanNaren (Collaborator) left a comment

LGTM!

Kipok merged commit 79a013b into main Feb 21, 2024
Kipok deleted the igitman/trtllm-stability branch February 21, 2024 21:48
dgtm777 pushed a commit that referenced this pull request Apr 2, 2025
…ty (#6)

Signed-off-by: Igor Gitman <igitman@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Oct 1, 2025
…ty (#6)

Signed-off-by: Igor Gitman <igitman@nvidia.com>
gwarmstrong added a commit to gwarmstrong/NeMo-Skills that referenced this pull request Feb 14, 2026
Per review feedback: all benchmark-specific packages should go to core
for now since JIT install is not yet implemented. Previously only
PythonTool-specific deps were in core while benchmark deps like datasets,
sacrebleu, faiss-cpu, etc. were only in main.txt. This led to an
inconsistent boundary where math grader deps were in core but BFCL deps
were not, despite both being benchmark-specific.

Addresses review comments #1, NVIDIA-NeMo#4, NVIDIA-NeMo#6 on PR NVIDIA-NeMo#1229.

Signed-off-by: George Armstrong <georgea@nvidia.com>
gwarmstrong added a commit to gwarmstrong/NeMo-Skills that referenced this pull request Feb 14, 2026
Rewrite the dependency boundary section to:
- Define core as "everything needed for inference + evaluation" (not
  just PythonTool-specific deps)
- Remove references to deleted requirements/main.txt
- Clarify that all benchmark evaluator deps go to core until JIT
  install is implemented
- Improve dataset module separation guidance (pipeline = cluster I/O
  only, core = all local logic)
- Add note about summarize-results refactor (issue NVIDIA-NeMo#779)

Addresses review comments NVIDIA-NeMo#3, NVIDIA-NeMo#4, NVIDIA-NeMo#6, NVIDIA-NeMo#7 on PR NVIDIA-NeMo#1229.

Signed-off-by: George Armstrong <georgea@nvidia.com>