TG improvements for MoE models #404
Merged
+39
−20
This PR does 3 things:

- The `GGML_OP_GET_ROWS` op implementation did not consider disabled experts for float tensors. As a result, when combining the results of the experts, garbage weights were used for the disabled experts, which could lead to NaNs.
- The `ggml_cuda_op_mul_mat_vec_q_id` function did not consider that an expert may be disabled, and needlessly calculated the matrix-vector multiplication for disabled experts.

Prompt processing is not affected by these changes.
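A minimal host-side sketch of the first fix, assuming (as an illustration, not the actual ggml code) that a disabled expert is marked by a negative row id: instead of copying whatever happens to be at that index, the output row is zero-filled so the later weighted combine of expert outputs cannot pick up garbage and produce NaNs. The function name `get_rows_f32` is hypothetical.

```cpp
#include <cstring>
#include <cstddef>

// Hypothetical sketch of a get_rows over float rows that tolerates
// disabled experts. Assumption: a negative entry in row_ids marks a
// disabled expert. For such entries the copy is skipped and the output
// row is zero-filled, so combining expert results stays NaN-free.
void get_rows_f32(const float * src, int n_cols,
                  const int * row_ids, int n_ids,
                  float * dst) {
    for (int i = 0; i < n_ids; ++i) {
        float * out = dst + (size_t)i * n_cols;
        const int id = row_ids[i];
        if (id < 0) {
            // disabled expert: emit zeros instead of reading a garbage row
            std::memset(out, 0, n_cols * sizeof(float));
            continue;
        }
        std::memcpy(out, src + (size_t)id * n_cols, n_cols * sizeof(float));
    }
}
```

The same guard motivates the second fix: a matrix-vector kernel can simply return early for a disabled expert rather than computing a product whose result will be discarded.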
Here is a graph obtained with `sweep-bench` showing TG performance as a function of the number of tokens in the KV cache, `N_KV`. The model is DeepSeek-Lite quantized to `Q4_0`. The GPU is an RTX-4080. Black symbols are without SER, red symbols are with `-ser 4,1`. The command line is