[TMP]: Disable layout rematerialization cache #1275
Conversation
```cpp
auto layoutIt = layout.find(v);
assert(layoutIt != layout.end());
// If we already have a remat value for this value, use it.
#if 0 // FIXME: Fails on LTS driver
```
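For context, the code gated by the `#if 0` implements a simple memoization pattern: reuse an already-rematerialized value for a given (value, layout) pair instead of recomputing it. Below is a minimal, hypothetical Python sketch of that idea, not the actual MLIR pass; all names and types are illustrative:

```python
# Hypothetical sketch of the remat-cache idea; the real pass operates on
# MLIR values and layout attributes in RemoveLayoutConversions.cpp.
remat_cache: dict[tuple[int, str], int] = {}


def rematerialize(value: int, layout: str) -> int:
    """Placeholder for the expensive rewrite that produces a new value."""
    return hash((value, layout)) & 0xFFFF


def get_remat_value(value: int, layout: str, use_cache: bool = True) -> int:
    # use_cache=False corresponds to the workaround in this PR: always
    # recompute instead of reusing a cached rematerialized value.
    key = (value, layout)
    if use_cache and key in remat_cache:
        return remat_cache[key]
    result = rematerialize(value, layout)
    remat_cache[key] = result
    return result
```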
As discussed offline, I think we should compile this code conditionally based on the target environment (e.g., driver version).
Hi @etiotto, I understand your concerns, but I don't think that is a good idea, for the following reasons:
- this change would affect many signatures in `RemoveLayoutConversions.cpp`, causing further divergence / merge conflicts with upstream
- this change is expected to be short-lived once we identify the regression and find a better workaround or get the LTS driver team to fix it
- none of our nightly CI uses the LTS driver, so the negative codepath would be essentially untested

If we object to having this change in llvm-target because of potential performance degradation, then I would suggest merging it directly to the PyTorch release branch. Of course, that precludes PyTorch upstream from using llvm-target until we find a better workaround or get the LTS driver fix in.
FWIW I actually did implement the change but can't test it because the only LTS machine is currently broken.
cc @vlad-penkin
There's a new development I am investigating... standby...
@etiotto I was able to recover the LTS machine, reproduce the issue, and add the LTS driver flag. Please take a look. Because of the regression in llvm-target, this branch is not up to date. After the PR is approved, I will update and merge, but I want to keep the branch where it is now locally so I can continue testing.
This phase does not affect the LTS driver.
```python
pm.enable_debug()
passes.ttir.add_convert_to_ttgpuir(pm, f"xpu:{device_arch}", opt.num_warps, opt.threads_per_warp, opt.num_ctas)

is_lts_driver = Version(metadata["target"].arch['driver_version']) == Version("1.3.27642")
```
NIT: Be consistent with string quote character within a file.
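For illustration, a minimal sketch of the version check with a consistent quote style, assuming `packaging.version` and the `metadata` structure from the diff above (the constant and helper names are hypothetical):

```python
from packaging.version import Version

# Known-problematic LTS driver release, taken from the diff above.
LTS_DRIVER_VERSION = Version("1.3.27642")


def is_lts_driver(metadata) -> bool:
    # Double quotes used consistently, per the review comment.
    return Version(metadata["target"].arch["driver_version"]) == LTS_DRIVER_VERSION
```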
Temporarily disables the layout rematerialization cache for backward slice rematerialization. The remat cache is exposing a bug on systems with the LTS driver, which results in an accuracy error with two HuggingFace models under the PyTorch Inductor benchmarks. Issue #1255 (cherry picked from commit eb51a81)
This reverts commit eb51a81.
The workarounds were added in #1275 and #1337. All HuggingFace training Float32 models pass with the LTS workaround removed: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9865658421 Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
Temporarily disables the layout rematerialization cache for backward slice rematerialization. The remat cache is exposing a bug on systems with the LTS driver, which results in an accuracy error with two HuggingFace models under the PyTorch Inductor benchmarks. Tracking down the driver issue and providing a reproducer to the driver team will take some time, so this is a stopgap solution. I ran the same Float32 / training HF benchmarks compared with llvm-target, and there is a small regression on the T5Small model, but all other models are relatively similar to llvm-target. I am running the accuracy tests locally for Float32 / training HF, and so far the results are promising, so I am marking this ready for review. Issue #1255