
Conversation

@ysjprojects (Collaborator) commented May 17, 2025

from torch import nn
from transformers.activations import ACT2FN


class Qwen3MoeMLP(nn.Module):
    def __init__(self, config, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        # Fall back to the dense intermediate size unless an override is passed in
        # (e.g. config.moe_intermediate_size for the experts in the sparse MoE block).
        self.intermediate_size = intermediate_size if intermediate_size is not None else config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        # SwiGLU-style MLP: gated activation followed by the down projection.
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj

In Qwen3 MoE, the MLP module is instantiated with one of two intermediate sizes: the sparse MoE block builds each expert with config.moe_intermediate_size, while the decoder layer's dense MLP uses config.intermediate_size (see the sketch below).

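For context, a minimal sketch of how both sizes come into play upstream, assuming a loaded Qwen3 MoE config; the num_experts attribute name and the exact block structure are illustrative, not a verbatim copy of the transformers code:

# Illustrative only: the sparse MoE block builds its experts with the smaller
# per-expert size, while the dense decoder MLP keeps the default size.
experts = nn.ModuleList(
    [Qwen3MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(config.num_experts)]
)
dense_mlp = Qwen3MoeMLP(config)  # falls back to config.intermediate_size
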
The same pattern appears in DeepseekV3 and will likely appear in many MoE models to come, so this PR extends the same flexibility to LitGPT's own MLP modules; a sketch of the intended change follows.
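
A minimal sketch of the change for LitGPT's LLaMAMLP, assuming the attribute names (n_embd, intermediate_size, bias, fc_1/fc_2/proj) used in LitGPT's model code; the actual diff in this PR may differ in details:

import torch
import torch.nn as nn

class LLaMAMLP(nn.Module):
    # Sketch of the added flexibility: an optional intermediate_size argument
    # overrides config.intermediate_size, so MoE experts can reuse this module
    # with config.moe_intermediate_size while dense layers keep the default.
    def __init__(self, config, intermediate_size=None):
        super().__init__()
        self.intermediate_size = intermediate_size if intermediate_size is not None else config.intermediate_size
        self.fc_1 = nn.Linear(config.n_embd, self.intermediate_size, bias=config.bias)
        self.fc_2 = nn.Linear(config.n_embd, self.intermediate_size, bias=config.bias)
        self.proj = nn.Linear(self.intermediate_size, config.n_embd, bias=config.bias)

    def forward(self, x):
        # Same SwiGLU pattern as above: gated activation, then down projection.
        return self.proj(torch.nn.functional.silu(self.fc_1(x)) * self.fc_2(x))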

@Borda Borda enabled auto-merge (squash) May 22, 2025 12:15
@Borda Borda merged commit f99ca4e into Lightning-AI:main May 28, 2025
21 of 23 checks passed