[Performance][torch.compile]: Inductor partition performance issues #27828

@ProExpertProg

Performance issues

As seen in #27080, Inductor partition is not always faster than Dynamo partition or no partition. There are two separate issues:

  • On Blackwell, no partition is sometimes faster than Inductor partition, particularly for TTFT with attention+quant fusion. This might simply be attributable to the performance difference between the FULL_AND_PIECEWISE and FULL_DECODE_ONLY cudagraph modes.
  • On Hopper, Dynamo partition seems to outperform Inductor partition, although we only have numbers for TP=4. We should try again with llama-8B at TP=1.
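
For anyone trying to reproduce the comparison, here is a rough sketch of how the three setups might be expressed as compilation configs. The key names (`use_inductor_graph_partition`, `splitting_ops`, `cudagraph_mode`) are my assumption about the relevant `CompilationConfig` fields and should be checked against the vLLM version under test. The helper `compilation_config` and the variant names are hypothetical, and the cudagraph mode chosen for each variant is a guess based on the modes mentioned above, not the setup used in #27080.

```python
import json

# Hypothetical helper: maps a variant name to a compilation-config dict.
# Field names are assumed to match vLLM's CompilationConfig; verify them
# against your installed version before relying on this.
def compilation_config(variant: str) -> dict:
    if variant == "inductor_partition":
        # Graph is split inside Inductor rather than in Dynamo.
        return {
            "use_inductor_graph_partition": True,
            "cudagraph_mode": "FULL_AND_PIECEWISE",
        }
    if variant == "dynamo_partition":
        # Graph is split at the Dynamo level (default splitting ops).
        return {
            "use_inductor_graph_partition": False,
            "cudagraph_mode": "FULL_AND_PIECEWISE",
        }
    if variant == "no_partition":
        # Empty splitting_ops means the graph is not split at all, so
        # piecewise cudagraphs do not apply (assumed: FULL_DECODE_ONLY).
        return {
            "use_inductor_graph_partition": False,
            "splitting_ops": [],
            "cudagraph_mode": "FULL_DECODE_ONLY",
        }
    raise ValueError(f"unknown variant: {variant}")


if __name__ == "__main__":
    for name in ("inductor_partition", "dynamo_partition", "no_partition"):
        # Each JSON string could be passed as the compilation config
        # when launching a benchmark run.
        print(name, json.dumps(compilation_config(name)))
```

Running each variant with the same model, TP size, and workload (for example llama-8B at TP=1) should help separate the partition effect from the cudagraph-mode effect.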

cc @zou3519 @BoyuanFeng

Labels: performance
