Skip to content

Conversation

@Nexesenex
Copy link
Owner

  • common : increase max number of experts to 128

  • common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn

  • gguf-py : add architecture-specific block mappings that override selected general block mappings

  • convert-hf : add model conversion support for ArcticForCausalLM

  • convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)

  • llama : add inference support for LLM_ARCH_ARCTIC


* common : increase max number of experts to 128

* common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn

* gguf-py : add architecture-specific block mappings that override selected general block mappings

* convert-hf : add model conversion support for ArcticForCausalLM

* convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)

* llama : add inference support for LLM_ARCH_ARCTIC

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
@Nexesenex Nexesenex merged commit 30ce1ed into Nexesenex:downstream May 24, 2024
Nexesenex pushed a commit that referenced this pull request Dec 22, 2024
* q6_k_r4: Better ARM implementation

PP-512(LLaMA-3.1-8B) is now 104.2 t/s up from 83.2 t/s.
I.e., q6_k_r4 now beats q6_0_r4.

* q5_k_r4: Better ARM implementation

PP-512(LLaMA-3.1-8B) is now 107.8 t/s up from 96.9 t/s.
I.e., q5_k_r4 now beats q5_0_r4.

* q4_k_r4: Better ARM implementation

PP-512(LLaMA-3.1-8B) is now 122.1 t/s up from 110 t/s.
I.e., q4_k_r4 is now (nearly) on par with q4_0_r4.

* iq4_xs_r4: Better ARM implementation

PP-512(LLaMA-3.1-8B) is now 131.3 t/s up from 115.8 t/s.
iq4_xs_r4 is now the prompt processing champion on ARM.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants