This repository was archived by the owner on Apr 20, 2026. It is now read-only.

Add setup script to fix deepep timeouts + add deepgemm fast warmup #133

Merged
ishandhanani merged 1 commit into main from trevor-m/timeouts on Feb 3, 2026

Conversation

@trevor-m
Contributor

@trevor-m trevor-m commented Feb 3, 2026

  1. Switches to an sglang branch based on v0.5.8, with "[DeepGemm] Add a flag for fast warmup" (sgl-project/sglang#18111) cherry-picked on top of it. Remember to set SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1 in your script.
  2. Patches DeepEP to increase the device timeout from 100s to 1000s.
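
The two figures in item 2 can be cross-checked against the NUM_TIMEOUT_CYCLES constant that the script edits. The following is a minimal sketch, assuming the timeout is a cycle count compared against a counter that ticks at a fixed rate. The rate itself is not stated in this PR; it is only implied by dividing the old constant by the quoted 100s.

```shell
#!/bin/bash
# Values taken from the patch: the old and new NUM_TIMEOUT_CYCLES constants.
old_cycles=200000000000
new_cycles=2000000000000
# Stated in the PR description: the old constant corresponds to about 100 s.
old_seconds=100

# Assumption: the counter ticks at a fixed rate, so seconds scale linearly
# with cycles. The rate below is inferred, not confirmed by the source.
implied_hz=$(( old_cycles / old_seconds ))
new_seconds=$(( new_cycles / implied_hz ))

echo "implied counter rate: ${implied_hz} Hz"
echo "new timeout: ${new_seconds} s"
```

Under that assumption the tenfold larger constant matches the 1000s quoted above. On hardware where the counter runs at a different rate, the wall-clock timeout would differ accordingly.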

Summary by CodeRabbit

  • Chores

    • Added an automated configuration script to streamline dependency installation and system configuration updates.
  • Bug Fixes

    • Increased device timeout limits to improve stability for long-running operations and reduce timeout-related failures.

@coderabbitai
Contributor

coderabbitai Bot commented Feb 3, 2026

Walkthrough

A new shell script automates Git dependency configuration, updates a device timeout constant in CUDA kernel configuration, and installs Python dependencies with specific environment variables and build optimizations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Shell Automation Script: `configs/fix-timeouts.sh` | New script that configures Git remotes to fetch from a fork, checks out the fastdg branch, updates the NUM_TIMEOUT_CYCLES constant in the CUDA kernel configs from 200000000000ull to 2000000000000ull, sets environment variables (SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1), and reinstalls Python dependencies with CUDA architecture and parallelism settings. |
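
One detail of the constant update is worth illustrating: sed exits successfully even when its pattern matches nothing, so a changed upstream default would leave the file unpatched without any error. The sketch below applies the same substitution to a throwaway file and then confirms the new value is present. The `patch_timeout` helper and the temporary files are hypothetical and not part of the PR, and `sed -i` is used in its GNU form.

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): apply the timeout patch, then verify it.
patch_timeout() {
    local file="$1"
    sed -i 's/#define NUM_TIMEOUT_CYCLES 200000000000ull/#define NUM_TIMEOUT_CYCLES 2000000000000ull/' "$file"
    # sed returns 0 even if nothing matched, so check for the new value explicitly.
    grep -q '#define NUM_TIMEOUT_CYCLES 2000000000000ull' "$file"
}

# Stand-in for csrc/kernels/configs.cuh containing the old value.
demo_file="$(mktemp)"
echo '#define NUM_TIMEOUT_CYCLES 200000000000ull' > "$demo_file"
if patch_timeout "$demo_file"; then
    echo "patched"
fi

# Stand-in for a file whose default differs, so the pattern does not match.
other_file="$(mktemp)"
echo '#define NUM_TIMEOUT_CYCLES 123ull' > "$other_file"
if ! patch_timeout "$other_file"; then
    echo "pattern not found: patch not applied"
fi
```

In the real script the same check could follow the sed line, so that a silent no-op fails the setup instead of producing a build with the old timeout.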

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A script hops in to fix the wait,
Adjusting timeouts, sealing fate,
With Git remotes and CUDA care,
Dependencies reinstalled with flair,
Faster warmups fill the air!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main changes: a setup script is added to fix deepep timeouts and enable deepgemm fast warmup, which matches the PR objectives. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@configs/fix-timeouts.sh`:
- Around line 6-10: The script currently runs "cd /sgl-workspace/sglang" without
verifying it succeeded, so the subsequent git commands (git remote remove origin,
git remote add origin ..., git fetch origin, git checkout origin/${BRANCH}) may
run in the wrong directory; fix by checking that cd succeeds (e.g., cd
"/sgl-workspace/sglang" || { echo "Failed to enter /sgl-workspace/sglang"; exit
1; }) before running the git commands, so the script aborts instead of operating
in an unintended repo.
- Around line 13-15: The script currently runs sed and pip immediately after `cd
/sgl-workspace/DeepEP`; add a guard on the cd so that sed -i and the pip install
line (TORCH_CUDA_ARCH_LIST="10.0;10.3" MAX_JOBS=$(nproc) pip install
--force-reinstall --no-build-isolation .) never run in the wrong directory. For
example, test the exit status of the cd command and exit with a non-zero status
and an error message if it fails (or enable errexit) before executing the sed
replacement of NUM_TIMEOUT_CYCLES and the pip install.
🧹 Nitpick comments (1)
configs/fix-timeouts.sh (1)

1-2: Consider adding set -e for safer script execution.

Adding set -e (or set -euo pipefail) would cause the script to exit immediately if any command fails, preventing cascading issues from undetected errors in git or pip commands.

Suggested improvement
 #!/bin/bash
+set -euo pipefail
+
 BRANCH="fastdg"
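
To illustrate the difference this makes, the sketch below runs the same two commands in child shells with and without the option. The directory name `/nonexistent-dir-for-demo` is a made-up path assumed not to exist; nothing here touches the real workspace.

```shell
#!/bin/bash
# Without errexit: the failed cd is ignored and the next command still runs.
without=$(bash -c 'cd /nonexistent-dir-for-demo 2>/dev/null; echo "still running"')

# With errexit: the failed cd aborts the child shell before echo is reached.
# "|| true" keeps this demo itself from stopping on the child's non-zero exit.
with=$(bash -c 'set -euo pipefail; cd /nonexistent-dir-for-demo 2>/dev/null; echo "still running"' || true)

echo "without errexit: '${without}'"
echo "with errexit: '${with}'"
```

The first child prints its message despite the failed cd, while the second produces no output, which is the behavior the suggestion relies on.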

Comment thread configs/fix-timeouts.sh
Comment on lines +6 to +10
cd /sgl-workspace/sglang
git remote remove origin
git remote add origin https://github.com/trevor-m/sglang.git
git fetch origin
git checkout origin/${BRANCH}

⚠️ Potential issue | 🟡 Minor

Handle cd failure to prevent commands running in wrong directory.

If /sgl-workspace/sglang doesn't exist, the script continues and git commands would execute in the current directory, potentially corrupting an unrelated repository.

Proposed fix
-cd /sgl-workspace/sglang
+cd /sgl-workspace/sglang || { echo "Failed to cd to /sgl-workspace/sglang"; exit 1; }
 git remote remove origin
 git remote add origin https://github.com/trevor-m/sglang.git
 git fetch origin
 git checkout origin/${BRANCH}
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 6-6: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


Comment thread configs/fix-timeouts.sh
Comment on lines +13 to +15
cd /sgl-workspace/DeepEP
sed -i 's/#define NUM_TIMEOUT_CYCLES 200000000000ull/#define NUM_TIMEOUT_CYCLES 2000000000000ull/' csrc/kernels/configs.cuh
TORCH_CUDA_ARCH_LIST="10.0;10.3" MAX_JOBS=$(nproc) pip install --force-reinstall --no-build-isolation .

⚠️ Potential issue | 🟡 Minor

Handle cd failure to prevent sed and pip install running in wrong directory.

If the cd fails, sed would attempt to modify a file that may not exist or modify an unintended file, and pip install would install from the wrong location.

Proposed fix
-cd /sgl-workspace/DeepEP
+cd /sgl-workspace/DeepEP || { echo "Failed to cd to /sgl-workspace/DeepEP"; exit 1; }
 sed -i 's/#define NUM_TIMEOUT_CYCLES 200000000000ull/#define NUM_TIMEOUT_CYCLES 2000000000000ull/' csrc/kernels/configs.cuh
 TORCH_CUDA_ARCH_LIST="10.0;10.3" MAX_JOBS=$(nproc) pip install --force-reinstall --no-build-isolation .
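
Since the same guard is proposed for both directories, it could be factored into a small function. This is only a sketch: the `enter_dir` name is hypothetical and does not appear in the PR, and the demo uses a temporary directory and a made-up missing path rather than the real /sgl-workspace locations.

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): cd into a directory or report failure.
enter_dir() {
    cd "$1" || { echo "Failed to cd to $1" >&2; return 1; }
}

# Success case: a directory that exists.
workdir="$(mktemp -d)"
if enter_dir "$workdir"; then
    echo "entered: ok"
fi

# Failure case: the helper returns non-zero, so the caller can abort.
if ! enter_dir "/nonexistent-dir-for-demo" 2>/dev/null; then
    echo "missing dir: rejected"
fi
```

In the setup script each call would be written as `enter_dir /sgl-workspace/DeepEP || exit 1`, which keeps the sed and pip lines from running anywhere else.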
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 13-13: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


@ishandhanani ishandhanani merged commit d31774c into main Feb 3, 2026
7 checks passed