[Performance][torch.compile]: Inductor partition performance issues

### Performance issues

As seen in #27080, Inductor partition is not always faster than Dynamo partition or no partition. There are two separate issues:
- [ ] On Blackwell, no partition is sometimes faster than Inductor partition (particularly TTFT with attention+quant fusion). This might just be attributable to the difference in performance between `FULL_AND_PIECEWISE` and `FULL_DECODE_ONLY` cudagraph modes.
- [ ] On Hopper, Inductor partition seems to outperform Dynamo partition, although we only have numbers for TP=4. We should try again with llama-8B TP=1.
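For anyone rerunning the comparison, here is a minimal sketch of the three setups as compilation-config dicts. The labels (`inductor_partition`, `dynamo_partition`, `no_partition`) and the helper are hypothetical. The key names `use_inductor_graph_partition`, `splitting_ops`, and `cudagraph_mode` are assumed to match vLLM's `CompilationConfig`, and pairing no partition with `FULL_DECODE_ONLY` is an assumption based on the first item above. Please check both against the vLLM version under test.

```python
import json

# Hypothetical labels for the three setups compared in this issue.
# Key names are assumed to match vLLM's CompilationConfig; verify before use.
CONFIGS = {
    "inductor_partition": {
        "use_inductor_graph_partition": True,
        "cudagraph_mode": "FULL_AND_PIECEWISE",
    },
    "dynamo_partition": {
        "use_inductor_graph_partition": False,
        "cudagraph_mode": "FULL_AND_PIECEWISE",
    },
    # Assumption: an empty splitting_ops list disables graph splitting,
    # so piecewise cudagraphs are unavailable and FULL_DECODE_ONLY is used.
    "no_partition": {
        "splitting_ops": [],
        "cudagraph_mode": "FULL_DECODE_ONLY",
    },
}


def compilation_config_json(name: str) -> str:
    """Serialize one setup to a JSON string, e.g. for a CLI config argument."""
    return json.dumps(CONFIGS[name], sort_keys=True)


if __name__ == "__main__":
    for name in CONFIGS:
        print(f"{name}: {compilation_config_json(name)}")
```

Each JSON string could then be passed as the compilation config for a separate benchmark run (for example llama-8B at TP=1 on Hopper), keeping everything else fixed so that TTFT and throughput differences can be attributed to the partitioning and cudagraph mode alone.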

cc @zou3519 @BoyuanFeng 

[Performance][torch.compile]: Inductor partition performance issues #27828