[dependencies] Upgrade transformers to >=5.0.0,<=5.3.0 #1426

erictang000 merged 15 commits into main
Conversation
```python
    mhc_expansion_rate: mHC expansion rate. Connectors are trainable when this is > 1.
    """

# Type hints for config attributes
```
Do we need to remove these? It would be good to keep them for documentation purposes if possible :)
For the tx backend, you will also need to adapt for the change that
```python
# Broadcast non-persistent buffers (e.g. inv_freq from RotaryEmbedding) that
# are excluded from state_dict. On non-rank-0 meta-init these are still on
# meta device with no data; rank 0 has the correctly computed values.
_sync_non_persistent_buffers(model, sharded_sd)
```
I'm curious do you know why upgrading transformers necessitates this change? Seems a little surprising :)
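For context, a minimal sketch of what such a non-persistent-buffer sync could look like. This is an illustration only, not the PR's actual implementation; `iter_non_persistent_buffers` and `sync_non_persistent_buffers` are hypothetical names:

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def iter_non_persistent_buffers(model: nn.Module):
    """Yield (qualified_name, owning_module, buffer_name) for every buffer
    registered with persistent=False, i.e. excluded from state_dict()."""
    for module_name, module in model.named_modules():
        for buf_name in module._non_persistent_buffers_set:
            full = f"{module_name}.{buf_name}" if module_name else buf_name
            yield full, module, buf_name

def sync_non_persistent_buffers(model: nn.Module, src_rank: int = 0):
    """Broadcast non-persistent buffers from src_rank so meta-initialized
    ranks receive the values rank 0 computed (e.g. inv_freq)."""
    for _, module, buf_name in iter_non_persistent_buffers(model):
        buf = getattr(module, buf_name)
        if buf is None:
            continue
        if buf.is_meta:
            # materialize storage before receiving rank 0's data
            buf = torch.empty(buf.shape, dtype=buf.dtype)
            module.register_buffer(buf_name, buf, persistent=False)
        if dist.is_available() and dist.is_initialized():
            dist.broadcast(buf, src=src_rank)
```

The key point the diff comment is making: because these buffers never appear in `state_dict()`, the normal sharded-state-dict load path cannot repair them, so they need a separate broadcast.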
```python
if hasattr(provider, "q_lora_rank") and hasattr(hf_config, "q_lora_rank"):
    provider.q_lora_rank = hf_config.q_lora_rank

# Workaround for transformers v5 moving rope_theta into rope_parameters
```
Curious why this is needed, since megatron-bridge already updated NVIDIA-NeMo/Megatron-Bridge#2068 -- if this is still needed, should we raise an issue against megatron-bridge so we can remove this workaround going forward?
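A version-tolerant accessor illustrates the shape of such a workaround. This is a sketch under the assumption that v5 stores the value in a `rope_parameters` dict while v4 exposed a top-level `rope_theta`; `get_rope_theta` is a hypothetical helper, not megatron-bridge API:

```python
def get_rope_theta(hf_config, default: float = 10000.0) -> float:
    """Read rope_theta across the v4 (flat attribute) and
    v5 (rope_parameters dict) config layouts."""
    rope_params = getattr(hf_config, "rope_parameters", None)
    if isinstance(rope_params, dict) and "rope_theta" in rope_params:
        return rope_params["rope_theta"]
    return getattr(hf_config, "rope_theta", default)
```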
`skyrl/tx/models/configs.py` (Outdated)
```diff
 def __init__(
     self,
-    config: PretrainedConfig | dict,
+    config: PretrainedConfig | dict | None = None,
```
Do you know why this is needed now? Who is calling this without passing in a config? I'm also concerned that the defaults in `max_lora_adapters: int = 0`, `max_lora_rank: int = 0`, `shard_attention_heads: bool = True` could cause trouble, and it would be better to not need the `**kwargs` part, since it can mask problems.
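The concern about `**kwargs` can be made concrete with a toy example (illustrative only, not SkyRL code): a strict signature turns a misspelled keyword into an immediate `TypeError`, while a `**kwargs` catch-all silently swallows it.

```python
def make_config_lenient(max_lora_rank: int = 0, **kwargs):
    """Accepts any keyword; a typo is absorbed into kwargs and ignored."""
    return {"max_lora_rank": max_lora_rank}

def make_config_strict(max_lora_rank: int = 0):
    """No catch-all; a typo fails loudly at the call site."""
    return {"max_lora_rank": max_lora_rank}

# Typo in the keyword name ("rnak"): silently produces the default.
cfg = make_config_lenient(max_lora_rnak=16)
assert cfg["max_lora_rank"] == 0  # the intended value 16 was lost

# The strict version raises TypeError for the same typo.
try:
    make_config_strict(max_lora_rnak=16)
    raised = False
except TypeError:
    raised = True
assert raised
```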
Seems like it's this PR in transformers 5.4.0: huggingface/transformers#41250

I'm pinning to <=5.3.0, so it actually isn't an issue right now (but I guess I was testing with 5.4.0 when originally changing this code). I can revert the changes here for now and we can revisit when upgrading to >=5.4.0.

It seems megatron-bridge caps at <=5.3.0 as well, and there's some relevant activity on transformers, so these changes may become unnecessary going forward: huggingface/transformers#45070
```diff
 tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
 hf_model = AutoModelForCausalLM.from_pretrained(
-    model_name, attn_implementation="eager", use_safetensors=True, trust_remote_code=True
+    model_name, attn_implementation="eager", use_safetensors=True, torch_dtype=torch.float32
```
This is needed now since v5 changed the default: instead of loading in float32, models load in their own default dtype, so the test must pin `torch_dtype=torch.float32` explicitly.
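A small torch-only illustration of why the test now pins the dtype, using `nn.Linear` as a stand-in for a bf16 checkpoint (the `from_pretrained` behavior itself is as described above):

```python
import torch
import torch.nn as nn

# Stand-in for a checkpoint whose stored dtype is bfloat16. Under the v5
# behavior described above, loading without an explicit torch_dtype can
# follow the model's dtype instead of float32.
layer = nn.Linear(4, 4).to(torch.bfloat16)
assert layer.weight.dtype == torch.bfloat16

# Pinning the dtype (analogous to torch_dtype=torch.float32) keeps
# numerical comparisons in tests deterministic across versions.
layer = layer.to(torch.float32)
assert layer.weight.dtype == torch.float32
```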

Upgrade to transformers v5

Summary

Upgrades `transformers` from `>=4.56.1,<5` to `>=5.0.0,<=5.3.0` and adapts SkyRL's model initialization, FSDP loading, and test code to accommodate v5 breaking changes.

CI

Round 2 CI: https://github.com/NovaSky-AI/SkyRL/actions/runs/23917102581 -> 10 failing from before
Megatron CI Round 2: https://github.com/NovaSky-AI/SkyRL/actions/runs/23959241150/job/69884903884 -> 1 failing from before
~~Round 1 CI: https://github.com/NovaSky-AI/SkyRL/actions/runs/23876002482~~ -> 17 still failing
Megatron CI: https://github.com/NovaSky-AI/SkyRL/actions/runs/23920479124

Key changes
Meta-device model initialization (`fsdp_utils.py`, `model_wrapper.py`, `fsdp_worker.py`)

v5 disallows `from_pretrained()` inside `accelerate.init_empty_weights()` (`TypeError: Parameter.__new__() got an unexpected keyword argument '_is_hf_initialized'`). Replaced with:

- rank 0: `from_pretrained()` (loads real weights)
- other ranks: `from_config()` inside `torch.device("meta")` (empty shell; weights broadcast by FSDP)

`rope_scaling`, `rope_theta`, and `_attn_implementation` are applied to the config before the branch so both paths are consistent.

FSDP2 non-persistent buffer sync (
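The rank-dependent construction described above can be sketched with the loading logic injected as callables (a simplified illustration; `build_model` is a hypothetical name and omits the config overrides and the FSDP weight broadcast):

```python
import torch

def build_model(rank: int, load_pretrained, build_from_config):
    """Rank 0 loads real weights via from_pretrained-style loading; other
    ranks build an empty meta-device shell whose weights FSDP later
    broadcasts from rank 0."""
    if rank == 0:
        return load_pretrained()
    with torch.device("meta"):
        # factory calls inside this context allocate on the meta device
        return build_from_config()
```

For example, with `nn.Linear` standing in for the model factory, rank 0 gets real storage while rank 1 gets a meta-device shell.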
`fsdp_utils.py`)

`from_config` on meta produces non-persistent buffers (`inv_freq` in RotaryEmbedding) with no data. These are excluded from `state_dict()` and never broadcast. Fixes:

- `_sync_non_persistent_buffers()` broadcasts these from rank 0 after state dict loading
- `offload_fsdp2_model_to_cpu()` now materializes only meta buffers instead of calling `model.to_empty()` (which wiped all loaded parameters → NaN)

CriticModel
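The offload fix can be illustrated as follows: instead of `model.to_empty()` (which reallocates every parameter, discarding loaded values), only buffers still on the meta device are materialized. This is a hypothetical sketch, not the PR's exact code:

```python
import torch
import torch.nn as nn

def materialize_meta_buffers(model: nn.Module, device: str = "cpu"):
    """Replace only meta-device buffers with real (uninitialized) tensors,
    leaving already-loaded parameters and buffers untouched."""
    for module in model.modules():
        for name, buf in list(module.named_buffers(recurse=False)):
            if buf is not None and buf.is_meta:
                persistent = name not in module._non_persistent_buffers_set
                module.register_buffer(
                    name,
                    torch.empty(buf.shape, dtype=buf.dtype, device=device),
                    persistent=persistent,
                )
```

Because it never touches parameters, the NaN-from-wiped-weights failure mode of `to_empty()` cannot occur.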
`post_init()` (`model_wrapper.py`)

v5 added `all_tied_weights_keys` in `PreTrainedModel.post_init()`. The dynamic `CriticModel` class now calls `self.post_init()`, and the meta-init path wraps construction in `no_init_weights()`.

Strict dataclass configs (
`configs.py`)

`PretrainedConfig` is now a strict dataclass. Made `ModelConfig.__init__` args optional with defaults; fixed the `get_text_config()` signature for v5.

VLM
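The defaults-for-everything pattern can be sketched as follows (a toy stand-in, not SkyRL's actual `ModelConfig`; the field names echo the review discussion above):

```python
from dataclasses import dataclass

@dataclass
class ModelConfigSketch:
    # Every field carries a default so a strict dataclass-style config
    # can still be constructed with no arguments, as v5 requires.
    max_lora_adapters: int = 0
    max_lora_rank: int = 0
    shard_attention_heads: bool = True
```

The trade-off flagged in review applies: defaults make no-arg construction work, but a forgotten or misspelled argument now silently yields the default instead of failing.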
`mm_token_type_ids` (`model_wrapper.py`, VLM tests)

v5 requires `mm_token_type_ids` for M-RoPE in multimodal models. Threaded through `HFModelWrapper.forward()` and tests.

Megatron
`rope_theta` (`megatron_worker.py`)

v5 moved `rope_theta` into the `rope_parameters` dict. Added a workaround to set `provider.rotary_base` from the new location.

Other fixes
- `cuda_ipc_strategy.py`: `.view(-1)` → `.reshape(-1)` for non-contiguous weight tensors
- `vllm_server.py`: guard `sock.close()` against uvloop `TransportSocket` `AttributeError`
- `test_remote_inference_client_chat_template.py`: use `render_chat_completion()` for prompt token verification
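The `.view(-1)` → `.reshape(-1)` fix rests on a standard torch distinction: `view` requires contiguous memory, while `reshape` falls back to a copy when a view is impossible. A minimal demonstration:

```python
import torch

t = torch.arange(6).reshape(2, 3).t()  # transpose -> non-contiguous strides
assert not t.is_contiguous()

try:
    t.view(-1)          # view cannot reinterpret non-contiguous memory
    view_ok = True
except RuntimeError:
    view_ok = False
assert not view_ok

flat = t.reshape(-1)    # reshape copies when a zero-copy view is impossible
assert flat.shape == (6,)
```

Note the copy means `flat` may no longer alias `t`'s storage, which is harmless for serializing weights over IPC but matters if in-place aliasing was assumed.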