fix(rope): read original_max_position_embeddings from yarn validator's argument #45887

Merged
zucchini-nlp merged 4 commits into huggingface:main from bzantium:fix-yarn-validator-nested-rope
May 12, 2026

@bzantium
Contributor

What does this PR do?

_validate_yarn_rope_parameters is called by validate_rope once per per-attention-type sub-dict, with the sub-dict passed as the rope_parameters argument. The factor consistency check inside the function however reads original_max_position_embeddings from self.rope_parameters[...] instead of from the argument:

def _validate_yarn_rope_parameters(self, rope_parameters: dict, ignore_keys=None):
    ...
    original_max_position_embeddings = self.rope_parameters["original_max_position_embeddings"]
    #                                  ^^^^^^^^^^^^^^^^^^^^
    # Full nested dict here, not the per-type sub-dict.

This raises KeyError for any config that keeps the nested {full_attention: ..., sliding_attention: ...} shape — the per-type sub-dict containing original_max_position_embeddings is inside one of those top-level keys, not at the top level.

The fix is to read from the function argument that validate_rope already populates correctly.
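The difference between the two lookups can be reproduced without transformers. The sketch below is a toy stand-in, not the library code: `ToyConfig`, `validate_yarn_buggy`, and `validate_yarn_fixed` are hypothetical names, and only the dictionary access pattern is modelled.

```python
# Toy stand-in for the validator; NOT the transformers implementation.
class ToyConfig:
    def __init__(self, rope_parameters):
        self.rope_parameters = rope_parameters

    def validate_yarn_buggy(self, rope_parameters):
        # Reads from the attribute, i.e. the full nested dict.
        return self.rope_parameters["original_max_position_embeddings"]

    def validate_yarn_fixed(self, rope_parameters):
        # Reads from the argument, i.e. the per-type sub-dict.
        return rope_parameters["original_max_position_embeddings"]


nested = {
    "full_attention": {
        "rope_type": "yarn",
        "factor": 40.0,
        "original_max_position_embeddings": 4096,
    },
    "sliding_attention": {"rope_type": "default"},
}
config = ToyConfig(nested)
sub_dict = nested["full_attention"]  # what the caller passes per attention type

try:
    config.validate_yarn_buggy(sub_dict)
    buggy_outcome = "ok"
except KeyError as err:
    buggy_outcome = f"KeyError: {err}"

fixed_value = config.validate_yarn_fixed(sub_dict)
print(buggy_outcome)  # KeyError: 'original_max_position_embeddings'
print(fixed_value)    # 4096
```

With a flat `rope_parameters` dict the attribute and the argument are the same mapping, so both variants return the same value; that is why the bug only shows up with the nested shape.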

Why no in-tree model hits this today

Searched src/transformers/models/*/configuration_*.py for the nested shape (grep -l '"full_attention":'):

gemma3, gemma3n, gemma4, laguna, modernbert, modernbert_decoder, t5gemma2

None of them combine the nested shape with rope_type=yarn, so the bug stays dormant inside the repo. It surfaces for downstream models that do — e.g. one I encountered uses yarn for full_attention layers and default for sliding_attention to apply YaRN only to global-attention layers while keeping unscaled RoPE on sliding-attention layers (Gemma3-style split with the YaRN twist).
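For downstream configs, a quick way to tell whether a given `rope_parameters` dict falls into the affected combination is sketched below. `nested_yarn_layer_types` is a hypothetical helper, not part of transformers, and it assumes that a flat dict carries `rope_type` at the top level while a nested one does not.

```python
def nested_yarn_layer_types(rope_parameters):
    """Return the layer types whose sub-dict uses yarn in a nested rope_parameters dict."""
    # Assumption: a flat dict has "rope_type" at the top level, so it is not nested.
    if "rope_type" in rope_parameters:
        return []
    return [
        layer_type
        for layer_type, params in rope_parameters.items()
        if isinstance(params, dict) and params.get("rope_type") == "yarn"
    ]


downstream = {
    "full_attention": {"rope_type": "yarn", "factor": 40.0},
    "sliding_attention": {"rope_type": "default"},
}
gemma3_style = {
    "full_attention": {"rope_type": "linear", "factor": 8.0},
    "sliding_attention": {"rope_type": "default"},
}
print(nested_yarn_layer_types(downstream))    # ['full_attention']
print(nested_yarn_layer_types(gemma3_style))  # []
```

The `gemma3_style` values are illustrative only and are not copied from any in-tree config.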

Reproducer

from transformers import PreTrainedConfig


class _NestedRopeConfig(PreTrainedConfig):
    model_type = "_repro"

    def __init__(self, **kwargs):
        self.layer_types = ["full_attention", "sliding_attention"]
        self.num_hidden_layers = 2
        self.max_position_embeddings = 35000
        self.head_dim = 128
        self.hidden_size = 1280
        self.num_attention_heads = 32
        nested = {
            "full_attention": {
                "rope_type": "yarn",
                "rope_theta": 10000.0,
                "factor": 40.0,
                "original_max_position_embeddings": 4096,
            },
            "sliding_attention": {
                "rope_type": "default",
                "rope_theta": 10000.0,
            },
        }
        self.rope_parameters = nested
        # Snapshot before super().__init__ so convert_rope_params_to_dict
        # cannot pollute the top level with a `rope_theta` sibling key.
        snapshot = {k: dict(v) for k, v in nested.items()}
        super().__init__(**kwargs)
        self.rope_parameters = snapshot


_NestedRopeConfig().validate_rope()

Before the fix:

KeyError: 'original_max_position_embeddings'
  at src/transformers/modeling_rope_utils.py:879

After the fix: validate_rope returns cleanly (only the existing factor-mismatch info-warning fires, which is unrelated and preserved).
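The remaining factor-mismatch warning can be checked by hand. The arithmetic below assumes the validator compares `factor` with `max_position_embeddings / original_max_position_embeddings`; the values used in the PR's test (factor 2.0 with half of `max_position_embeddings`) are consistent with that reading, but the exact comparison is an assumption here.

```python
# Values taken from the reproducer above.
max_position_embeddings = 35000
original_max_position_embeddings = 4096
factor = 40.0

# Assumed form of the consistency check: implied factor vs. configured factor.
implied_factor = max_position_embeddings / original_max_position_embeddings
mismatch = implied_factor != factor

print(implied_factor)  # 8.544921875
print(mismatch)        # True
```

Under that assumption the reproducer's configured factor of 40.0 differs from the implied 8.544921875, which would explain why the warning still fires after the fix.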

Changes

  • src/transformers/modeling_rope_utils.py: one-line change, self.rope_parameters → rope_parameters inside _validate_yarn_rope_parameters. All sibling validators in the same file (_validate_default_rope_parameters, _validate_linear_rope_parameters, _validate_dynamic_rope_parameters, _validate_longrope_rope_parameters, _validate_llama3_rope_parameters) already read from the argument, so this brings yarn in line.

Tests

No new test added — the reproducer requires constructing a custom config with the nested shape, and there is no existing test fixture in tests/utils/test_modeling_rope_utils.py that exercises that path (no in-tree model uses nested + yarn). Happy to add a test if you'd prefer; let me know which directory you'd want it in (tests/utils/ or alongside one of the gemma3/modernbert configs that use the nested shape).

I ran the reproducer above against main (KeyError) and against this branch (clean) to confirm.

AI assistance disclosure

I used Claude Code to help draft this PR and the reproducer. I diagnosed the bug myself from a downstream model load failure, verified the fix in-place with the reproducer above, and reviewed the change line-by-line.

Who can review?

@Cyrilvallez @ArthurZucker — RoPE / config validation

…s argument

`_validate_yarn_rope_parameters` is called by `validate_rope` once per
per-attention-type sub-dict, with the sub-dict passed as the `rope_parameters`
argument. The `factor` consistency check inside the function however reads
`original_max_position_embeddings` from `self.rope_parameters[...]` instead
of from the argument, which raises `KeyError` for any config that keeps the
nested `{full_attention, sliding_attention, ...}` shape — the per-type
sub-dicts are inside one of those keys, not at the top level.

Other rope validators in the same file (`_validate_default_rope_parameters`,
`_validate_linear_rope_parameters`, etc.) all read from the function argument,
so this matches their pattern.
Member

@zucchini-nlp zucchini-nlp left a comment

Good catch, can you adjust tests in tests/utils/test_modeling_rope_utils (if you aren't an AI code agent)?

@bzantium bzantium force-pushed the fix-yarn-validator-nested-rope branch from 3612aa3 to fa4ef81 on May 11, 2026 13:05
@bzantium
Contributor Author

bzantium commented May 11, 2026

@zucchini-nlp Thanks for the review! Test added and no, a human :) I used Claude Code only as a drafting aid.

Comment thread tests/utils/test_modeling_rope_utils.py Outdated
Comment on lines +139 to +153
def test_yarn_validation_with_per_attention_type_nested_rope(self):
    """A yarn entry inside nested per-attention-type `rope_parameters` validates cleanly."""
    config = LlamaConfig()
    config.layer_types = ["full_attention", "sliding_attention"]
    config.rope_parameters = {
        "full_attention": {
            "rope_type": "yarn",
            "rope_theta": 10000.0,
            "factor": 2.0,
            "original_max_position_embeddings": int(config.max_position_embeddings / 2.0),
        },
        "sliding_attention": {"rope_type": "default", "rope_theta": 10000.0},
    }
    config.validate_rope()

Member

Niiice, let's extend it to all rope types. Above we have a test for validation without layer types, so we can mimic it, but this time add config.layer_types

Contributor Author

Extended in 4a08efc, mirroring test_rope_validation's two loops (missing-key and exclusive-param) with config.layer_types set.

@bzantium bzantium force-pushed the fix-yarn-validator-nested-rope branch from fa4ef81 to 1654714 on May 12, 2026 05:54
@bzantium bzantium force-pushed the fix-yarn-validator-nested-rope branch from 1654714 to 4a08efc on May 12, 2026 05:54
Member

@zucchini-nlp zucchini-nlp left a comment

Thank you!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp added this pull request to the merge queue May 12, 2026
Merged via the queue into huggingface:main with commit dfcb1a3 May 12, 2026
29 checks passed
@bzantium bzantium deleted the fix-yarn-validator-nested-rope branch May 14, 2026 00:55
jp1924 pushed a commit to jp1924/transformers that referenced this pull request May 18, 2026
…s argument (huggingface#45887)

* fix(rope): read original_max_position_embeddings from yarn validator's argument

* test(rope): mirror test_rope_validation for per-attention-type nested rope_parameters

* test(rope): apply ruff format to nested-rope test

---------

Co-authored-by: Raushan Turganbay <raushan@huggingface.co>