Python -VV
Pip Freeze
Reproduction Steps
N/A
Expected Behavior
N/A
Additional Context
We have a Mixtral implementation in JAX that works fine with the 8x7B model but generates garbage output with the 8x22B model. I couldn't find any paper or doc describing the detailed architecture; is there one?
The only difference I've noticed is that Mixtral-8x7B uses tokenizer v1 while Mixtral-8x22B uses tokenizer v3. I also understand they have different parameter counts (47B vs 141B). Beyond this, do the two models share essentially the same architecture? Are there subtle implementation differences I'm missing? Where can I find more details?
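One quick sanity check is to diff the two published configs to surface hyperparameter differences (hidden size, layer count, head counts, vocab size, rope settings, etc.). A minimal sketch, assuming the `transformers` package and the public Hub repo IDs `mistralai/Mixtral-8x7B-v0.1` and `mistralai/Mixtral-8x22B-v0.1`:

```python
# Sketch: diff the two published Mixtral configs and print any
# hyperparameters whose values differ between the models.
from transformers import AutoConfig

cfg_7b = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
cfg_22b = AutoConfig.from_pretrained("mistralai/Mixtral-8x22B-v0.1")

d7, d22 = cfg_7b.to_dict(), cfg_22b.to_dict()
for key in sorted(set(d7) | set(d22)):
    if d7.get(key) != d22.get(key):
        print(f"{key}: 8x7B={d7.get(key)!r}  8x22B={d22.get(key)!r}")
```

If the diff shows only size changes (hidden dim, layers, heads, vocab), the garbage output is more likely a shape or constant mismatch in the port than a structural change.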
Thanks
Suggested Solutions
No response