[Performance]: MoE memory usage much larger

Proposal to Enhance Memory Efficiency and Performance in MoE W8A8
Background:
Last week, [PR #580](https://github.com/vllm-project/vllm-ascend/pull/580) optimized memory usage for MoE W8A8, which significantly improved the handling of key-value caches (KV caches). However, a recent commit ([5c6d05a](https://github.com/vllm-project/vllm-ascend/commit/5c6d05a59e996ab0ce6b91e7d4e267d7be1157f8)) appears to have rolled back some of these optimizations, and memory usage for MoE is now much larger.

Observations:

Memory Management: Memory usage directly determines how many KV caches can be kept resident, and the number of KV caches is integral to model performance and scalability.
Performance Trade-offs: Speed matters, but excessive memory usage limits how large a workload the model can handle and how far it can scale.
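To make the memory observation concrete, here is a rough sketch of how extra peak memory in the MoE path translates into fewer KV cache blocks. It is a simplified model, not the vllm-ascend implementation: the function names, the device size, and every number below are hypothetical and chosen only for illustration.

```python
GIB = 1024 ** 3


def kv_block_bytes(num_layers, num_kv_heads, head_dim, block_size, dtype_bytes):
    """Bytes needed for one KV cache block across all layers.

    The factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes


def kv_cache_blocks(total_gib, utilization, weights_gib, peak_activation_gib, block_bytes):
    """Number of KV cache blocks that fit in the memory left over
    after weights and peak activation memory are accounted for."""
    budget = total_gib * GIB * utilization
    free = budget - (weights_gib + peak_activation_gib) * GIB
    return max(0, int(free // block_bytes))


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128,
# 128 tokens per block, 2-byte cache dtype.
block = kv_block_bytes(32, 8, 128, 128, 2)

# Hypothetical 64 GiB device at 90% utilization with 30 GiB of weights.
# Only the peak activation memory of the MoE path differs between the two runs.
blocks_optimized = kv_cache_blocks(64, 0.9, 30, 4, block)
blocks_regressed = kv_cache_blocks(64, 0.9, 30, 8, block)

print(blocks_optimized, blocks_regressed)
```

Under these assumed numbers, 4 GiB of additional peak memory costs a few hundred cache blocks, which is why a memory regression in the MoE path shows up directly as reduced KV cache capacity.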


[Performance]: MoE memory usage much larger #744
