[AutoDeploy][Bug]: Cutlass MOE kernel caused the accuracy drop #9184

### System Info

On an H100 (from cw), the CUTLASS MoE BF16 kernel causes an accuracy drop on GSM8K.

### Who can help?

@nzmora-nvidia 

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

`pytest tests/integration/defs/accuracy/test_llm_api_autodeploy.py::TestNemotronMOE -s -vv`
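
To help separate a kernel bug from ordinary BF16 rounding, one option is to compare captured kernel outputs against a float32 reference and check whether the error stays within what BF16 rounding alone would explain. The following is only a sketch under assumptions: `to_bf16` and `max_rel_error` are hypothetical helper names, not part of TensorRT-LLM, and capturing the actual CUTLASS and Triton outputs is not shown here.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Hypothetical helper for illustration; NaN/Inf handling is omitted.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits; add a bias so truncation rounds to nearest even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def max_rel_error(reference, candidate, eps=1e-6):
    """Largest element-wise relative error between two equal-length sequences."""
    return max(abs(r - c) / (abs(r) + eps) for r, c in zip(reference, candidate))


# Stand-in data: in a real check, `reference` would be float32 MoE outputs and
# `candidate` the outputs captured from the kernel under test.
reference = [0.1, -2.5, 3.14159, 100.7]
candidate = [to_bf16(v) for v in reference]

# A single BF16 rounding step contributes at most about 2**-8 relative error.
print(max_rel_error(reference, candidate) <= 2 ** -8)
```

If the error of the CUTLASS path against the reference is far above this bound while the Triton path stays close to it, that would point at the kernel rather than at BF16 precision itself. Accumulated rounding across many operations can legitimately exceed the single-step bound, so the threshold is a rough guide, not a pass/fail criterion.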


### Expected behavior

Fix the accuracy issue. If the CUTLASS kernel cannot support BF16, stick with the Triton kernel.

### actual behavior

N/A

### additional notes

N/A

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.
