Python -VV
Pip Freeze
Reproduction Steps
N/A
Expected Behavior
N/A
Additional Context
We have a Mixtral implementation in JAX that works fine with the 8x7B model but generates garbage output with the 8x22B model. I couldn't find any paper or doc describing the detailed architecture; is there one?
The only difference I've noticed is that Mixtral-8x7B uses tokenizer v1 while Mixtral-8x22B uses tokenizer v3. I also understand they have different parameter counts (47B vs 141B). Beyond this, do the two models share essentially the same architecture? Are there subtle implementation differences I'm missing? Where can I find more details?
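One quick sanity check is to diff the two published configs to surface hyperparameter differences (hidden size, layer count, head counts, vocab size, rope settings, etc.). A minimal sketch, assuming the `transformers` package and the public Hub repo IDs `mistralai/Mixtral-8x7B-v0.1` and `mistralai/Mixtral-8x22B-v0.1`:

```python
# Sketch: diff the two published Mixtral configs and print any
# hyperparameters whose values differ between the models.
from transformers import AutoConfig

cfg_7b = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
cfg_22b = AutoConfig.from_pretrained("mistralai/Mixtral-8x22B-v0.1")

d7, d22 = cfg_7b.to_dict(), cfg_22b.to_dict()
for key in sorted(set(d7) | set(d22)):
    if d7.get(key) != d22.get(key):
        print(f"{key}: 8x7B={d7.get(key)!r}  8x22B={d22.get(key)!r}")
```

If the diff shows only size changes (hidden dim, layers, heads, vocab), the garbage output is more likely a shape or constant mismatch in the port than a structural change.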
Thanks
Suggested Solutions
No response