[Performance][torch.compile]: FlashInfer Attention + quant fusion performance issue with TP=4 #27829

@ProExpertProg

Description

As seen in #27080, attention+quant fusion with the FlashInfer attention backend on 4xB200 (TP=4) performs worse than the unfused code path. This should be resolved so that the fusion can be turned on by default.

cc @nvpohanh @pavanimajety @zou3519 for visibility
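For reference, a minimal reproduction sketch for comparing fused vs. unfused runs. This is an assumption on my part, not the exact setup from #27080: the FP8 checkpoint is a stand-in, and the `VLLM_ATTENTION_BACKEND` env var plus the `enable_attn_fusion` / `custom_ops` compilation-config knobs are how I understand the fusion to be toggled; adjust to whatever #27080 actually used.

```python
import os

# Assumption: select the FlashInfer attention backend via env var,
# set before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Hypothetical FP8 checkpoint; the issue does not name the model used.
MODEL = "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8"

llm = LLM(
    model=MODEL,
    tensor_parallel_size=4,  # TP=4, matching the 4xB200 setup
    compilation_config={
        # Assumption: this pass-config flag and custom-op entry enable the
        # attention+quant fusion pass; set enable_attn_fusion to False
        # (or drop the dict) to get the unfused baseline for comparison.
        "pass_config": {"enable_attn_fusion": True},
        "custom_ops": ["+quant_fp8"],
    },
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```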
