Duplicate of question asked on the mutransformers repository (link)
Hi!
I was wondering whether (learned) positional embeddings should be MuReadout layers, since they map to a finite-dimensional space. Specifically:
https://github.com/microsoft/mutransformers/blob/480287ce7b18a07a3432e8f2fbc0f0e5b71e2599/mutransformers/models/bert/modeling_bert.py#L174
In addition to that, did you try using muP for sparse MoE models? I'm curious about any findings there. Specifically, I was wondering whether the routing gate (hdim, num_experts) would also be a MuReadout layer (if we don't scale the number of experts).
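To make that concrete, here is a minimal sketch of the kind of gate I have in mind, using MuReadout from the mup package instead of a plain nn.Linear. The Top1Gate module and the choice to use MuReadout are just my own assumptions for illustration, not anything taken from the repo:

```python
import torch
import torch.nn as nn
from mup import MuReadout  # from the microsoft/mup package

class Top1Gate(nn.Module):
    """Toy MoE routing gate: projects hidden states (hdim) to expert logits.
    hdim scales with width while num_experts stays fixed, which is why I
    suspect this projection behaves like a readout under muP."""

    def __init__(self, hdim, num_experts, use_mureadout=True):
        super().__init__()
        # The question: plain nn.Linear, or MuReadout (mup's layer for
        # width -> fixed-dim outputs, like the LM head)?
        proj_cls = MuReadout if use_mureadout else nn.Linear
        self.proj = proj_cls(hdim, num_experts, bias=False)

    def forward(self, x):
        # x: (batch, seq, hdim) -> routing probabilities over experts
        logits = self.proj(x)
        return torch.softmax(logits, dim=-1)
```

(As with any muP model, mup.set_base_shapes would still need to be called on the full model containing this gate before use.)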
Would be grateful for any advice :)
Thank you!