[Fix] Add timeout and retry to MGSM data download to prevent CI timeout #21732

Closed
Fridge003 wants to merge 1 commit into main from fix/mgsm-download-timeout

@Fridge003
Collaborator

Summary

  • test_moe_eval_accuracy_large.py::test_mgsm_en has been consistently timing out in CI since ~Mar 30 12:00 UTC
  • Root cause: urllib.request.urlopen() in simple_eval_mgsm.py:get_lang_examples() is called without a timeout when downloading MGSM benchmark data from openaipublic.blob.core.windows.net. If the download stalls (for example, because of a network issue on the CI runners), the test blocks until the 1200s CI timeout kills it
  • Evidence: Server logs show zero Prefill/Decode batches during the entire 18+ minute hang, so the test is stuck in the data download, not in inference
  • Not a code regression: Bisecting shows that the 3 commits between the last pass (62a63eeff) and the first failure (b76730701) don't touch inference code for Mixtral

Changes

  • Add timeout=60 to urllib.request.urlopen() call
  • Add retry logic (3 attempts with exponential backoff)
  • Add explicit urllib.error and urllib.request imports
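
The changes above can be sketched roughly as follows. This is a minimal illustration, not the actual diff from this PR: the helper name fetch_with_retry, the base_delay value, and the injectable opener and sleep parameters are assumptions added here so the sketch can be exercised without network access. Only the 60s timeout and the 3 attempts come from the description.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(
    url,
    timeout=60,          # seconds, as stated in the PR description
    max_attempts=3,      # as stated in the PR description
    base_delay=1.0,      # hypothetical: the PR does not state the backoff base
    opener=urllib.request.urlopen,  # hypothetical hook for testing
    sleep=time.sleep,               # hypothetical hook for testing
):
    """Download url and return the body as bytes.

    Retries on network errors, waiting base_delay * 2**attempt seconds
    between attempts, and raises RuntimeError once all attempts fail.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, OSError) as exc:
            # URLError and socket timeouts are both OSError subclasses;
            # URLError is listed explicitly for readability.
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(
        f"Failed to download {url} after {max_attempts} attempts"
    ) from last_error
```

With these defaults a single stalled request is abandoned after the 60s timeout rather than blocking until the CI job limit, and a permanently unreachable host surfaces as a RuntimeError that names the URL. Note that the urlopen timeout applies to individual blocking socket operations, so a very slow but steadily trickling response could still take longer than 60s overall.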

Test plan

  • CI test_moe_eval_accuracy_large passes (if blob storage is reachable) or fails fast with a clear error (if not)
  • No functional change to test behavior when download succeeds

🤖 Generated with Claude Code

The test_moe_eval_accuracy_large.py test_mgsm_en has been consistently
timing out in CI because urllib.request.urlopen() had no timeout when
downloading MGSM benchmark data from openaipublic.blob.core.windows.net.
When the download hangs, the test hangs indefinitely until the 1200s CI
timeout kills it, with zero inference requests reaching the server.

Add a 60s timeout and 3 retries with exponential backoff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@Fridge003
Collaborator Author

/rerun-ut test_moe_ep.py

@github-actions
Contributor

2-gpu-h100: View workflow run

cd test/ && python3 registered/moe/test_moe_ep.py

@Fridge003
Collaborator Author

/rerun-ut test_moe_eval_accuracy_large.py

@github-actions
Contributor

2-gpu-h100: View workflow run

cd test/ && python3 registered/eval/test_moe_eval_accuracy_large.py
