[AMD/NPU] Pin transformers to stable 4.57.1 #14722

Closed
sunxxuns wants to merge 1 commit into sgl-project:main from sunxxuns:fix-amd-select-logprobs

@sunxxuns
Collaborator

@sunxxuns sunxxuns commented Dec 9, 2025

Summary

Fix test_hellaswag_select and test_select failures on AMD/ROCm in stage-a-test-1-amd.

Root Cause

transformers==5.0.0rc0 has a regression on AMD where input_token_logprobs returns only 1 aggregated entry with None values instead of per-token logprobs. This breaks the select functionality.

Solution

Change pyproject_other.toml (used by AMD and NPU) to use the stable 4.57.1 version instead of the release candidate 5.0.0rc0.
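
A quick way to confirm which build an environment actually resolved is to compare the installed version against the pin. The sketch below is not part of this PR: `check_pin` and `is_prerelease` are hypothetical helper names, and the pre-release check is a simplified pattern rather than a full PEP 440 parser.

```python
import re
from importlib import metadata

PINNED = "4.57.1"  # stable version named in this PR


def is_prerelease(version: str) -> bool:
    """Rough check for a/b/rc/dev suffixes (simplified, not full PEP 440)."""
    return re.search(r"(a|b|rc|dev)\d+", version) is not None


def check_pin(package: str = "transformers", pinned: str = PINNED) -> str:
    """Return 'ok', 'mismatch', or 'missing' for the installed package."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if installed == pinned else "mismatch"


if __name__ == "__main__":
    print(check_pin())
    print(is_prerelease("5.0.0rc0"), is_prerelease("4.57.1"))
```

In an environment built from pyproject_other.toml after this change, `check_pin()` would be expected to report `ok`; a `mismatch` would suggest the release candidate is still installed.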

Testing

Tested locally on MI300x:

  • transformers 4.57.1: works correctly (3 logprob entries per choice)
  • transformers 5.0.0rc0: broken (1 entry with None values)
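
The two outcomes above can be told apart mechanically. This is a minimal sketch, assuming each logprob entry is a tuple of fields such as (logprob, token_id, token_text); that layout, the helper name `looks_aggregated`, and the sample values are illustrative assumptions, not data captured from the MI300x run.

```python
from typing import Optional, Sequence

# Assumed entry layout for illustration: (logprob, token_id, token_text)
Entry = Sequence[Optional[object]]


def looks_aggregated(entries: Sequence[Entry]) -> bool:
    """True for the failure shape described above: one entry whose fields are all None."""
    return len(entries) == 1 and all(field is None for field in entries[0])


# Made-up examples of the two shapes
per_token = [(None, 101, None), (-0.5, 102, None), (-1.25, 103, None)]
aggregated = [(None, None, None)]

if __name__ == "__main__":
    print(looks_aggregated(per_token))
    print(looks_aggregated(aggregated))
```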

Test plan

  • stage-a-test-1-amd should pass
  • Existing tests on CUDA should still pass (they use the main pyproject.toml, which stays on 5.0.0rc0)

@gemini-code-assist
Contributor

Summary of Changes

Hello @sunxxuns, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the stability of logprob computations within the runtime_endpoint.py module. It specifically targets and resolves a ZeroDivisionError that could arise during the normalization of prompt logprobs, particularly in scenarios involving token healing where the number of tokens might lead to division by zero. The changes introduce conditional checks to safeguard against such errors and ensure that calculations are performed only when valid data is present, ultimately improving the robustness of the system and fixing a reported test failure.

Highlights

  • ZeroDivisionError Fix: Resolved a ZeroDivisionError that could occur in the compute_normalized_prompt_logprobs function when len(input_token_logprobs[i]) - 1 evaluated to zero.
  • Robustness Improvement: Added checks to ensure logprob calculations only proceed if input_token_logprobs is not empty and contains valid entries, preventing potential errors from missing data.
  • Test Failure Resolution: This fix addresses and resolves the stage-a-test-1-amd failure.
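
The guard described in the highlights can be sketched as follows. This is an illustration only, not the code in runtime_endpoint.py: `safe_normalized_logprob` is a hypothetical name, and returning None when there is nothing to average is an assumed design choice.

```python
from typing import Optional, Sequence


def safe_normalized_logprob(logprobs: Sequence[Optional[float]]) -> Optional[float]:
    """Average the logprobs after the first token, guarding the len - 1 divisor."""
    n = len(logprobs) - 1  # the first token carries no usable logprob here
    if n <= 0:
        # Nothing to average: avoid dividing by zero
        return None
    return sum(logprobs[1:]) / n


if __name__ == "__main__":
    print(safe_normalized_logprob([None, -1.0, -3.0]))
    print(safe_normalized_logprob([None]))
```

A caller comparing choices would then need to decide how to rank a choice whose score is None, for example by skipping it.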
@sunxxuns sunxxuns force-pushed the fix-amd-select-logprobs branch from 8341fbe to cb6c96a Compare December 9, 2025 09:29
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a ZeroDivisionError that occurs when recalculating normalized prompt log probabilities after removing a healed token. The fix correctly introduces a check to ensure the number of tokens is greater than one before division. Additionally, a new check is added to gracefully handle cases where logprobs may be empty, preventing potential IndexError exceptions. While the changes are good, I've identified a logic issue where the normalized log probability is not updated when the last remaining token is removed, which could lead to incorrect selections. I've provided a suggestion to address this.

@sunxxuns sunxxuns force-pushed the fix-amd-select-logprobs branch 6 times, most recently from b6556dc to 4cc4db3 Compare December 9, 2025 09:40
@sunxxuns sunxxuns changed the title [AMD] Fix ZeroDivisionError in compute_normalized_prompt_logprobs Fix ZeroDivisionError in compute_normalized_prompt_logprobs Dec 9, 2025
@sunxxuns sunxxuns added the run-ci label Dec 9, 2025
@sunxxuns sunxxuns force-pushed the fix-amd-select-logprobs branch 2 times, most recently from 50a2d82 to f23d2a8 Compare December 9, 2025 09:48
@sunxxuns sunxxuns changed the title Fix ZeroDivisionError in compute_normalized_prompt_logprobs [AMD] Fix ZeroDivisionError in compute_normalized_prompt_logprobs Dec 9, 2025
@sunxxuns sunxxuns force-pushed the fix-amd-select-logprobs branch 5 times, most recently from c43d687 to 58a6e9a Compare December 9, 2025 10:23
@sunxxuns sunxxuns changed the title [AMD] Fix ZeroDivisionError in compute_normalized_prompt_logprobs [CI test] amd Dec 9, 2025
@sunxxuns sunxxuns force-pushed the fix-amd-select-logprobs branch 2 times, most recently from 9683762 to b612dc6 Compare December 10, 2025 06:13
@github-actions github-actions bot added the amd label Dec 10, 2025
transformers 5.0.0rc0 has a regression on AMD where input_token_logprobs
returns only 1 aggregated entry with None values instead of per-token
logprobs. This breaks the select functionality (test_select and
test_hellaswag_select).

Change pyproject_other.toml (used by AMD and NPU) to use the stable
4.57.1 version instead of the release candidate 5.0.0rc0.

Tested locally on MI300x:
- transformers 4.57.1: works correctly (3 logprob entries per choice)
- transformers 5.0.0rc0: broken (1 entry with None values)
@sunxxuns sunxxuns force-pushed the fix-amd-select-logprobs branch from b612dc6 to b2dc76e Compare December 10, 2025 06:16
@sunxxuns sunxxuns changed the title [CI test] amd [AMD/NPU] Pin transformers to stable 4.57.1 Dec 10, 2025
@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Dec 10, 2025
@HaiShaw
Collaborator

HaiShaw commented Dec 10, 2025

#14801 merged

@HaiShaw HaiShaw closed this Dec 10, 2025
Labels

amd, dependencies (Pull requests that update a dependency file), run-ci
