UPSTREAM PR #18958: CUDA: use mmvq for mul-mat-id for small batch sizes (#979)
Conversation
Based on the analysis, no functions were identified with meaningful performance changes between the base and target versions. The function_insights_topk tool returned empty results for both the response-time and throughput metrics, indicating that the code changes in this version do not introduce a measurable performance impact on any function in the analyzed binary.
Conclusion: no performance review is warranted for this version comparison, as no function exhibits a detectable change in execution-time metrics. See the complete breakdown in Version Insights.
Mirrored from ggml-org/llama.cpp#18958
Currently, for batch sizes > 1 we immediately fall through to mmq, which is suboptimal for small batch sizes. This change brings batched-bench performance in line with expectations (previously there was a dip at n_tokens = 2).
Micro-benchmark for test-backend-ops