Merged
Commits
103 commits
30aaa21
update
zucchini-nlp Aug 1, 2025
0e5d07b
batch update model code
zucchini-nlp Aug 5, 2025
0712f62
typos
zucchini-nlp Aug 5, 2025
4fc7355
too many diffs, dump
zucchini-nlp Aug 7, 2025
b616240
dump again
zucchini-nlp Aug 8, 2025
06cd2a8
another dump
zucchini-nlp Aug 8, 2025
f66ad57
fix copies
zucchini-nlp Aug 8, 2025
7dc077f
make `rope_scaling_dict` self attr
zucchini-nlp Aug 8, 2025
4ac0f18
fix a few more tests
zucchini-nlp Aug 11, 2025
9ad42e9
another update
zucchini-nlp Aug 11, 2025
98944d5
fix a few more tests, hopefully last ones
zucchini-nlp Aug 11, 2025
1213769
fox copies
zucchini-nlp Aug 11, 2025
d787da7
a huuuge merge conflict resolved!
zucchini-nlp Aug 11, 2025
00d4b3d
fix copies again
zucchini-nlp Aug 11, 2025
f9d4de3
fix newly added models, I hate rebasing on main
zucchini-nlp Aug 11, 2025
d695f5a
update config files
zucchini-nlp Aug 11, 2025
303f218
modular files
zucchini-nlp Aug 12, 2025
3229fba
fix rope utils test
zucchini-nlp Aug 12, 2025
1914d82
docstring has to be indented more, why?
zucchini-nlp Aug 12, 2025
fccb637
oops forgot to update some modualr files
zucchini-nlp Aug 12, 2025
709c414
copy from doesn't copy decorators?
zucchini-nlp Aug 12, 2025
b00d90c
fix overriden test as well
zucchini-nlp Aug 12, 2025
c8120cf
add a new test
zucchini-nlp Aug 12, 2025
a352362
fix failing tests again
zucchini-nlp Aug 12, 2025
2f54cb3
update docstrings
zucchini-nlp Aug 14, 2025
11edd47
fix phi3
zucchini-nlp Aug 14, 2025
6bc850d
Merge branch 'main' into rope-refactor-version-2
zucchini-nlp Aug 14, 2025
206f03d
fix two models
zucchini-nlp Aug 14, 2025
8a9085f
fix copies
zucchini-nlp Aug 14, 2025
88328d2
forgot to add
zucchini-nlp Aug 14, 2025
5324748
stupid bug from modular conversion
zucchini-nlp Aug 18, 2025
62518de
Merge remote-tracking branch 'upstream/main' into rope-refactor-versi…
zucchini-nlp Aug 20, 2025
01e79a8
fix slow tests
zucchini-nlp Aug 21, 2025
1a4ccc7
update to call rotary emb once per model forward
zucchini-nlp Sep 22, 2025
13d30ad
3K tests failing?!
zucchini-nlp Sep 22, 2025
df6a3c5
update
zucchini-nlp Sep 22, 2025
5c129c5
update more models
zucchini-nlp Sep 23, 2025
c54bf4a
fix copies
zucchini-nlp Sep 23, 2025
9b2b357
fix the rest of tests hopefully
zucchini-nlp Sep 23, 2025
373a8f3
merge main
zucchini-nlp Sep 23, 2025
54ad1dc
fix after rebase
zucchini-nlp Sep 23, 2025
c576edb
fix the rope tests
zucchini-nlp Sep 23, 2025
dc9ad8d
fix docs omni
zucchini-nlp Sep 23, 2025
686b1a7
change a bit
zucchini-nlp Sep 24, 2025
182a600
models with layer types
zucchini-nlp Sep 24, 2025
fdf6818
why it was deleted?
zucchini-nlp Sep 24, 2025
c786413
fix a few tests
zucchini-nlp Sep 24, 2025
722541f
fix last test!
zucchini-nlp Sep 24, 2025
6a7321e
delete extra empty lines
zucchini-nlp Sep 24, 2025
5a9cb98
add a test case
zucchini-nlp Sep 24, 2025
403f56f
more changes
zucchini-nlp Sep 24, 2025
a6c4124
fix models
zucchini-nlp Sep 24, 2025
193eb23
typing hint for nested rope params
zucchini-nlp Sep 24, 2025
d1eeb42
merge main
zucchini-nlp Sep 24, 2025
235ef33
missed when resolving conflicts
zucchini-nlp Sep 24, 2025
6efc183
delete layer types and fix typo
zucchini-nlp Sep 24, 2025
cf7097d
fix copies
zucchini-nlp Sep 24, 2025
234f233
fix copies
zucchini-nlp Sep 24, 2025
a425778
update docs text
zucchini-nlp Oct 6, 2025
8886f85
docs
zucchini-nlp Oct 6, 2025
52192d9
huuge update all models
zucchini-nlp Oct 7, 2025
74a4b4f
fix copies
zucchini-nlp Oct 7, 2025
362e88d
rename attr to align with new format
zucchini-nlp Oct 7, 2025
f458fd0
delete redundant rope tests
zucchini-nlp Oct 7, 2025
a7cf992
trigger ci
zucchini-nlp Oct 7, 2025
05cd1ec
merge main
zucchini-nlp Oct 7, 2025
f3172fa
update the case
zucchini-nlp Oct 7, 2025
5739b6e
this is why i hate rebasing
zucchini-nlp Oct 7, 2025
ecde27c
maybe fixed?
zucchini-nlp Oct 7, 2025
9ecfa5e
oops
zucchini-nlp Oct 7, 2025
5e37957
now fix?
zucchini-nlp Oct 7, 2025
2b279e0
fix last tests and copies
zucchini-nlp Oct 7, 2025
9617211
merge main
zucchini-nlp Oct 8, 2025
878a933
fix copies?
zucchini-nlp Oct 8, 2025
617f1ae
fix minimax and gemma3n
zucchini-nlp Oct 8, 2025
b983014
update typo
zucchini-nlp Oct 8, 2025
f7c5043
deprecation end version
zucchini-nlp Oct 8, 2025
07fa630
final fix copies :fingers-crossed:
zucchini-nlp Oct 8, 2025
89beae3
oh my, add the docs in toctree
zucchini-nlp Oct 8, 2025
721de32
oke, this is really the last fix
zucchini-nlp Oct 8, 2025
d530a86
kill me please...
zucchini-nlp Oct 10, 2025
a5b397b
fix copies and hope that tests won't start failing again
zucchini-nlp Oct 10, 2025
c66337c
use rope scaling if saved
zucchini-nlp Oct 10, 2025
1bfc013
fix slow tests
zucchini-nlp Oct 10, 2025
baf23d9
fix cwm and unrelated deepseek
zucchini-nlp Oct 13, 2025
3de40c3
Merge branch 'main' into rope-refactor-version-2
zucchini-nlp Oct 13, 2025
87278a9
fix last
zucchini-nlp Oct 13, 2025
9dcd94d
update
zucchini-nlp Oct 13, 2025
dca1b94
hope it works now, it took so long
zucchini-nlp Oct 13, 2025
fcdff3b
lets keep None for now, I will try to remove after checking tests
zucchini-nlp Oct 14, 2025
9f6d963
some more fixes, i find and replace does not always find all cases
zucchini-nlp Oct 14, 2025
ff23a68
last fix of tests
zucchini-nlp Oct 14, 2025
e4617dc
Merge branch 'main' (commit b84c0b31) into rope-refactor-version-2
ydshieh Oct 14, 2025
3c955d2
arthur's comment for extra foreward kwargs
zucchini-nlp Oct 14, 2025
b4186d1
delete unused code
zucchini-nlp Oct 14, 2025
cd17598
Merge branch 'main' into rope-refactor-version-2
zucchini-nlp Oct 15, 2025
717ddc6
fix slow qwen tests
zucchini-nlp Oct 16, 2025
5fdae07
delete layer types from models
zucchini-nlp Oct 16, 2025
51ca43a
faulty modular conversion
zucchini-nlp Oct 16, 2025
080a742
fix qwen omni
zucchini-nlp Oct 16, 2025
1049305
merge main
zucchini-nlp Oct 16, 2025
5f126b7
fix copies and style
zucchini-nlp Oct 16, 2025
1cba3b8
address my comment
zucchini-nlp Oct 16, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1255,6 +1255,8 @@
title: Importing Utilities
- local: internal/time_series_utils
title: Utilities for Time Series
- local: internal/rope_utils
title: Rotary Embeddings Utilities
title: Internal helpers
- sections:
- local: reference/environment_variables
89 changes: 89 additions & 0 deletions docs/source/en/internal/rope_utils.md
@@ -0,0 +1,89 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Utilities for Rotary Embedding

This page explains how Rotary Position Embeddings (RoPE) are computed and applied in Transformers, and which RoPE types are supported.


## Overview

Rotary Position Embeddings are a technique used to inject positional information into attention mechanisms without relying on explicit position encodings.
Instead of adding position vectors to token embeddings, RoPE rotates query and key vectors in the complex plane according to their positions, enabling relative positional awareness and better extrapolation to unseen sequence lengths.
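
For intuition, here is a simplified, self-contained sketch of the rotate-half formulation of RoPE. It is illustrative only: it ignores batch and head dimensions, caching, and dtype handling.

```python
import torch

def rotate_half(x):
    # Split the last dimension in half and swap the halves with a sign flip.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, positions, rope_theta=10000.0):
    # q, k: (seq_len, head_dim); positions: (seq_len,)
    head_dim = q.shape[-1]
    inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    freqs = torch.outer(positions.float(), inv_freq)  # (seq_len, head_dim / 2)
    emb = torch.cat((freqs, freqs), dim=-1)            # (seq_len, head_dim)
    cos, sin = emb.cos(), emb.sin()
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

q, k = torch.randn(6, 64), torch.randn(6, 64)
q_rot, k_rot = apply_rope(q, k, torch.arange(6))
```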

The Transformers library provides a flexible and extensible implementation of various RoPE types, defined in [`~modeling_rope_utils.ROPE_VALIDATION_FUNCTIONS`], including both the default and scaled variants:

| Rope Type | Description |
|------------|-------------|
| `"default"` | Standard rotary embedding as in LLaMA. |
| `"linear"` | Linear-scaled RoPE which allows longer context windows. |
| `"dynamic"` | NTK-aware scaling computed by rescaling frequency base (`θ`) for longer context. |
| `"yarn"` | YaRN scaling variant providing smoother extrapolation and stability. |
| `"longrope"` | [LongRoPE](https://github.com/microsoft/LongRoPE) scaling as in Phi-2 model series. |
| `"llama3"` | RoPE scaling as in Llama3.1. |


## Configuration in Model Configs

To enable and customize rotary embeddings, add a `rope_parameters` field to your model’s configuration file (`config.json`). This field controls the RoPE behavior across model layers. Note that each RoPE variant defines its own set of expected keys, and missing keys will raise an error. See the example below, which creates a Llama config with default RoPE parameters:


```python
from transformers import LlamaConfig

config = LlamaConfig()
config.rope_parameters = {
"rope_type": "default", # type of RoPE to use
"rope_theta": 10000.0 # base frequency parameter
}

# If we want to apply a scaled RoPE type, we need to pass extra parameters
config.rope_parameters = {
"rope_type": "linear",
"rope_theta": 10000.0,
"factor": 8.0 # scale factor for context extension
}
```
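
In practice you will often start from an existing checkpoint and override its RoPE parameters, for example to extend the context window. Below is a hedged sketch of that workflow; the checkpoint name, `rope_theta`, and `factor` are placeholders, and the exact keys depend on the chosen `rope_type`.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "your-org/your-model"  # placeholder checkpoint
config = AutoConfig.from_pretrained(model_id)

# Override the RoPE parameters before loading the weights
config.rope_parameters = {
    "rope_type": "yarn",
    "rope_theta": 10000.0,
    "factor": 4.0,
    "original_max_position_embeddings": 4096,
}
config.max_position_embeddings = 4 * 4096

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```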

## Per-Layer-Type RoPE Configuration

Some models, such as Gemma-3, use different layer types with different attention mechanisms, i.e. "full attention" in some blocks and "sliding-window attention" in others. For these models, Transformers supports specifying distinct RoPE parameters per layer type. In this case, `rope_parameters` should be a nested dictionary, whose top-level keys correspond to `config.layer_types` and whose values are the per-type RoPE parameters. During model initialization, each decoder layer automatically looks up the matching RoPE configuration based on its declared layer type.


```python
from transformers import Gemma3Config

config = Gemma3Config()
config.rope_parameters = {
"full_attention": {
"rope_type": "dynamic",
"rope_theta": 1000000.0,
"factor": 8.0,
"original_max_position_embeddings": 8096,
},
"sliding_attention": {
"rope_type": "default",
"rope_theta": 10000.0,
}
}
```
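
Continuing the example above, the per-layer lookup can be pictured as follows. The layer pattern below is made up purely for illustration; real models read it from `config.layer_types`.

```python
# Made-up layer pattern, purely to illustrate the per-layer-type lookup
layer_types = ["full_attention", "sliding_attention", "sliding_attention", "full_attention"]

for layer_idx, layer_type in enumerate(layer_types):
    layer_rope = config.rope_parameters[layer_type]
    print(f"layer {layer_idx} ({layer_type}): rope_type={layer_rope['rope_type']}, theta={layer_rope['rope_theta']}")
```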

## Utilities

[[autodoc]] RopeParameters
- __call__


2 changes: 1 addition & 1 deletion docs/source/en/modular_transformers.md
@@ -288,7 +288,7 @@ class Olmo2DecoderLayer(OlmoDecoderLayer):
output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = False,
cache_position: Optional[torch.LongTensor] = None,
position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
**kwargs,
) -> tuple[torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]]:
residual = hidden_states
101 changes: 34 additions & 67 deletions examples/modular-transformers/configuration_duplicated_method.py
@@ -5,8 +5,10 @@
# modular_duplicated_method.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨

from typing import Optional

from ...configuration_utils import PreTrainedConfig
from ...modeling_rope_utils import rope_config_validation
from ...modeling_rope_utils import RopeParameters, rope_config_validation, standardize_rope_params


class DuplicatedMethodConfig(PreTrainedConfig):
@@ -65,45 +67,10 @@ class DuplicatedMethodConfig(PreTrainedConfig):
results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether to tie weight embeddings
rope_theta (`float`, *optional*, defaults to 10000.0):
The base period of the RoPE embeddings.
rope_scaling (`Dict`, *optional*):
Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
accordingly.
Expected contents:
`rope_type` (`str`):
The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
'duplicated_method3'], with 'default' being the original RoPE implementation.
`factor` (`float`, *optional*):
Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
most scaling types, a `factor` of x will enable the model to handle sequences of length x *
original maximum pre-trained length.
`original_max_position_embeddings` (`int`, *optional*):
Used with 'dynamic', 'longrope' and 'duplicated_method3'. The original max position embeddings used during
pretraining.
`attention_factor` (`float`, *optional*):
Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
computation. If unspecified, it defaults to value recommended by the implementation, using the
`factor` field to infer the suggested value.
`beta_fast` (`float`, *optional*):
Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
ramp function. If unspecified, it defaults to 32.
`beta_slow` (`float`, *optional*):
Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
ramp function. If unspecified, it defaults to 1.
`short_factor` (`list[float]`, *optional*):
Only used with 'longrope'. The scaling factor to be applied to short contexts (<
`original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
size divided by the number of attention heads divided by 2
`long_factor` (`list[float]`, *optional*):
Only used with 'longrope'. The scaling factor to be applied to long contexts (<
`original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
size divided by the number of attention heads divided by 2
`low_freq_factor` (`float`, *optional*):
Only used with 'duplicated_method3'. Scaling factor applied to low frequency components of the RoPE
`high_freq_factor` (`float`, *optional*):
Only used with 'duplicated_method3'. Scaling factor applied to high frequency components of the RoPE
rope_parameters (`RopeParameters`, *optional*):
Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain
a value for `rope_theta` and, optionally, the parameters used for scaling in case you want to use RoPE
with longer `max_position_embeddings`.
attention_bias (`bool`, *optional*, defaults to `False`):
Whether to use a bias in the query, key, value and output projection layers during self-attention.
attention_dropout (`float`, *optional*, defaults to 0.0):
@@ -146,28 +113,27 @@ class DuplicatedMethodConfig(PreTrainedConfig):

def __init__(
self,
vocab_size=32000,
hidden_size=4096,
intermediate_size=11008,
num_hidden_layers=32,
num_attention_heads=32,
num_key_value_heads=None,
hidden_act="silu",
max_position_embeddings=2048,
initializer_range=0.02,
rms_norm_eps=1e-6,
use_cache=True,
pad_token_id=None,
bos_token_id=1,
eos_token_id=2,
pretraining_tp=1,
tie_word_embeddings=False,
rope_theta=10000.0,
rope_scaling=None,
attention_bias=False,
attention_dropout=0.0,
mlp_bias=False,
head_dim=None,
vocab_size: Optional[int] = 32000,
hidden_size: Optional[int] = 4096,
intermediate_size: Optional[int] = 11008,
num_hidden_layers: Optional[int] = 32,
num_attention_heads: Optional[int] = 32,
num_key_value_heads: Optional[int] = None,
hidden_act: Optional[str] = "silu",
max_position_embeddings: Optional[int] = 2048,
initializer_range: Optional[float] = 0.02,
rms_norm_eps: Optional[int] = 1e-6,
use_cache: Optional[bool] = True,
pad_token_id: Optional[int] = None,
bos_token_id: Optional[int] = 1,
eos_token_id: Optional[int] = 2,
pretraining_tp: Optional[int] = 1,
tie_word_embeddings: Optional[bool] = False,
rope_parameters: Optional[RopeParameters | dict[RopeParameters]] = None,
attention_bias: Optional[bool] = False,
attention_dropout: Optional[float] = 0.0,
mlp_bias: Optional[bool] = False,
head_dim: Optional[int] = None,
**kwargs,
):
self.vocab_size = vocab_size
@@ -187,16 +153,17 @@ def __init__(
self.rms_norm_eps = rms_norm_eps
self.pretraining_tp = pretraining_tp
self.use_cache = use_cache
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling
self.attention_bias = attention_bias
self.attention_dropout = attention_dropout
self.mlp_bias = mlp_bias
self.head_dim = head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
# Try to set `rope_scaling` if available, otherwise use `rope_parameters`
rope_scaling = kwargs.pop("rope_scaling", None)
self.rope_parameters = rope_scaling or rope_parameters

# Validate the correctness of rotary position embeddings parameters
# BC: if there is a 'type' field, copy it it to 'rope_type'.
if self.rope_scaling is not None and "type" in self.rope_scaling:
self.rope_scaling["rope_type"] = self.rope_scaling["type"]
rope_theta = kwargs.get("rope_theta", 10000.0)
standardize_rope_params(self, rope_theta=rope_theta)
rope_config_validation(self)

super().__init__(
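
Note on the backward-compatibility path shown above: legacy `rope_theta` / `rope_scaling` kwargs are folded into the new `rope_parameters` attribute via `standardize_rope_params`. A minimal sketch of the intended behaviour, assuming the real model configs (e.g. `LlamaConfig`) follow the same pattern as this example config:

```python
from transformers import LlamaConfig

# Legacy-style kwargs: `rope_theta` and `rope_scaling` as separate fields
legacy_config = LlamaConfig(
    rope_theta=500000.0,
    rope_scaling={"rope_type": "linear", "factor": 4.0},
)

# New-style: a single `rope_parameters` dict
new_config = LlamaConfig(
    rope_parameters={"rope_type": "linear", "rope_theta": 500000.0, "factor": 4.0},
)

# Both are expected to end up with an equivalent standardized dict
print(legacy_config.rope_parameters)
print(new_config.rope_parameters)
```
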
71 changes: 37 additions & 34 deletions examples/modular-transformers/configuration_my_new_model.py
@@ -5,8 +5,10 @@
# modular_my_new_model.py file directly. One of our CI enforces this.
# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨

from typing import Optional

from ...configuration_utils import PreTrainedConfig
from ...modeling_rope_utils import rope_config_validation
from ...modeling_rope_utils import RopeParameters, rope_config_validation, standardize_rope_params


class MyNewModelConfig(PreTrainedConfig):
@@ -147,38 +149,30 @@ class MyNewModelConfig(PreTrainedConfig):

def __init__(
self,
vocab_size=32000,
hidden_size=4096,
intermediate_size=11008,
num_hidden_layers=32,
num_attention_heads=32,
num_key_value_heads=None,
hidden_act="silu",
max_position_embeddings=2048,
initializer_range=0.02,
rms_norm_eps=1e-6,
use_cache=True,
pad_token_id=None,
bos_token_id=1,
eos_token_id=2,
pretraining_tp=1,
tie_word_embeddings=False,
rope_theta=10000.0,
rope_scaling=None,
attention_bias=False,
attention_dropout=0.0,
vocab_size: Optional[int] = 32000,
hidden_size: Optional[int] = 4096,
intermediate_size: Optional[int] = 11008,
num_hidden_layers: Optional[int] = 32,
num_attention_heads: Optional[int] = 32,
num_key_value_heads: Optional[int] = None,
hidden_act: Optional[str] = "silu",
max_position_embeddings: Optional[int] = 2048,
initializer_range: Optional[float] = 0.02,
rms_norm_eps: Optional[int] = 1e-6,
use_cache: Optional[bool] = True,
pad_token_id: Optional[int] = None,
bos_token_id: Optional[int] = 1,
eos_token_id: Optional[int] = 2,
pretraining_tp: Optional[int] = 1,
tie_word_embeddings: Optional[bool] = False,
rope_parameters: Optional[RopeParameters | dict[RopeParameters]] = None,
attention_bias: Optional[bool] = False,
attention_dropout: Optional[float] = 0.0,
mlp_bias=True,
head_dim=None,
head_dim: Optional[int] = None,
new_param=0,
**kwargs,
):
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
@@ -196,15 +190,24 @@ def __init__(
self.rms_norm_eps = rms_norm_eps
self.pretraining_tp = pretraining_tp
self.use_cache = use_cache
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling
self.attention_bias = attention_bias
self.attention_dropout = attention_dropout
self.mlp_bias = mlp_bias
self.head_dim = head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
# Try to set `rope_scaling` if available, otherwise use `rope_parameters`
rope_scaling = kwargs.pop("rope_scaling", None)
self.rope_parameters = rope_scaling or rope_parameters

# Validate the correctness of rotary position embeddings parameters
# BC: if there is a 'type' field, copy it it to 'rope_type'.
if self.rope_scaling is not None and "type" in self.rope_scaling:
self.rope_scaling["rope_type"] = self.rope_scaling["type"]
rope_theta = kwargs.get("rope_theta", 10000.0)
standardize_rope_params(self, rope_theta=rope_theta)
rope_config_validation(self)

super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)
self.new_param = new_param