
Conversation

@RyanMullins (Contributor) commented Oct 29, 2025

What does this PR do?

Related to #41922, this PR corrects the assignment of the rope_scaling dictionary present on some Gemma 3 pre-trained models on HF Hub when normalizing to the new rope_parameters value.
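
A minimal sketch of the intended normalization (not the actual Gemma3TextConfig code; the helper name and default values below are illustrative): a legacy rope_scaling dict should only override the RoPE parameters of the full-attention layers, while the sliding-attention layers keep default RoPE.

```python
# Hypothetical helper illustrating the fix; not the real transformers code path.
def normalize_rope_parameters(rope_scaling=None, rope_theta=1_000_000.0, rope_local_base_freq=10_000.0):
    """Build a per-layer-type rope_parameters dict from legacy config fields."""
    full_attention = {"rope_type": "default", "rope_theta": rope_theta}
    if rope_scaling is not None:
        # A legacy checkpoint may carry e.g. {"rope_type": "linear", "factor": 8.0};
        # it should override the full-attention entry only.
        full_attention.update(rope_scaling)

    return {
        "full_attention": full_attention,
        # Sliding-attention layers always use unscaled, default RoPE.
        "sliding_attention": {"rope_type": "default", "rope_theta": rope_local_base_freq},
    }
```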

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@zucchini-nlp PTAL since you have been handling the RoPE changes.

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: gemma3

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1 (Member) left a comment


I'm guessing the context for this is that rope_scaling clobbering rope_parameters (the old behaviour) gave the right config for older models but breaks on upcoming models that might have specific configs for e.g. sliding_attention?

@RyanMullins (Contributor, Author)

> I'm guessing the context for this is that rope_scaling clobbering rope_parameters (the old behaviour) gave the right config for older models but breaks on upcoming models that might have specific configs for e.g. sliding_attention?

Yes. We've observed rope_scaling being applied to both full_attention and sliding_attention, both in the Gemma 3 checkpoints on HF Hub and in our attempts to re-convert them from the original Orbax checkpoints. The correct behavior is for rope_scaling, if present in the config, to be applied only to full_attention; the sliding_attention config should always use default RoPE @ 10k.
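
For illustration, the corrected per-layer-type RoPE configuration should look roughly like the following (values are the ones typically found in Gemma 3 checkpoints; the exact key layout is an assumption here, not a quote of the config schema):

```python
# Sketch of the expected outcome after normalization for a checkpoint that
# ships rope_scaling = {"rope_type": "linear", "factor": 8.0}.
expected_rope_parameters = {
    "full_attention": {
        "rope_type": "linear",   # rope_scaling applies here only
        "factor": 8.0,
        "rope_theta": 1_000_000.0,
    },
    "sliding_attention": {
        "rope_type": "default",  # never scaled
        "rope_theta": 10_000.0,  # default RoPE @ 10k
    },
}
```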

@Rocketknight1 (Member)

Got it, thanks for the fix!

@Rocketknight1 merged commit 02c324f into huggingface:main Oct 30, 2025
17 checks passed
i3hz pushed a commit to i3hz/transformers that referenced this pull request Oct 30, 2025
* Fix: Gemma3TextConfig rope scaling assignments

* Fix: type annotation for rope_parameters
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* Fix: Gemma3TextConfig rope scaling assignments

* Fix: type annotation for rope_parameters