[TMP]: Disable layout rematerialization cache#1275

Merged
alexbaden merged 5 commits into llvm-target from alex/disable_remat_cache on Jun 8, 2024

Conversation

@alexbaden (Contributor) commented on Jun 6, 2024:

Temporarily disables the layout rematerialization cache for backward slice rematerialization. The remat cache exposes a bug on systems with the LTS driver, which results in an accuracy error in two HuggingFace models under the PyTorch Inductor benchmarks. Tracking down the driver issue and providing a reproducer to the driver team will take some time, so this is a stopgap solution. I ran the same Float32 / training HF benchmarks and compared against llvm-target: there is a small regression on the T5Small model, but all other models are roughly on par with llvm-target. I am running the accuracy tests locally for Float32 / training HF, and so far the results are promising, so I am marking this ready for review.

Issue #1255

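For context, the cache in question memoizes values that have already been rematerialized for a given (value, target layout) pair; disabling it simply forces the backward slice to be recomputed each time. Below is a minimal sketch of that pattern in Python. The actual implementation is C++ in RemoveLayoutConversions.cpp, and all names here (remat_mapping, rematerialize, get_remat_value) are illustrative, not the real identifiers.

# Minimal sketch of the remat-cache pattern; illustrative only, the real
# code is C++ in RemoveLayoutConversions.cpp and uses different names.
remat_mapping: dict[tuple[str, str], str] = {}

def rematerialize(value: str, layout: str) -> str:
    # Stand-in for recomputing the backward slice of `value` in the
    # target layout; the real pass clones and rewrites IR here.
    return f"{value}@{layout}"

def get_remat_value(value: str, layout: str, use_cache: bool) -> str:
    key = (value, layout)
    if use_cache and key in remat_mapping:
        # If we already have a remat value for this value, use it.
        return remat_mapping[key]
    remat = rematerialize(value, layout)
    remat_mapping[key] = remat
    return remat

With use_cache=False (what this PR does on the LTS driver), the lookup is skipped and every request recomputes the slice, trading compile time for correctness.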

@alexbaden marked this pull request as ready for review on June 6, 2024 19:37
auto layoutIt = layout.find(v);
assert(layoutIt != layout.end());
// If we already have a remat value for this value, use it.
#if 0 // FIXME: Fails on LTS driver
@etiotto (Contributor) commented:
As discussed offline, I think we should compile this code conditionally based on the target environment (e.g. driver version).

@alexbaden (Contributor, Author) commented on Jun 7, 2024:

Hi @etiotto I understand your concerns but I don't think that is a good idea for the following reasons:

  • this change would affect many signatures in RemoveLayoutConversions.cpp causing further divergence / merge conflict with upstream
  • this change is expected to be short-lived; it can be removed once we identify the regression and either find a better workaround or get the LTS driver team to fix it
  • none of our nightly CI uses the LTS driver so the negative codepath would be essentially untested

If we object to having this change in llvm-target because of potential performance degradation then I would suggest just merging it directly to the PyTorch release branch. Of course, that precludes PyTorch upstream from using llvm-target until we find a better workaround or get the LTS driver fix in.

FWIW I actually did implement the change but can't test it because the only LTS machine is currently broken.

cc @vlad-penkin

@alexbaden (Contributor, Author) commented:

There's a new development I am investigating... standby...

@alexbaden (Contributor, Author) commented:

@etiotto I was able to recover the LTS machine, reproduce the issue, and add the LTS driver flag. Please take a look. Because of the regression, this branch is not up to date with llvm-target; after the PR is approved, I will update and merge. For now I want to keep the branch where it is so I can continue testing locally.

@alexbaden force-pushed the alex/disable_remat_cache branch from 8e019df to 452f4ff on June 7, 2024 20:44
pm.enable_debug()
passes.ttir.add_convert_to_ttgpuir(pm, f"xpu:{device_arch}", opt.num_warps, opt.threads_per_warp, opt.num_ctas)

is_lts_driver = Version(metadata["target"].arch['driver_version']) == Version("1.3.27642")
A reviewer (Contributor) commented:
NIT: Be consistent with the string quote character within a file.
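As written, the version check is inlined at the call site. A minimal sketch of how the same gate could be factored into a helper, assuming packaging.version.Version and the metadata layout shown above (the helper name use_remat_cache is hypothetical, not part of the PR):

# Hypothetical helper; assumes the same metadata layout as above.
from packaging.version import Version

# The LTS driver release on which the remat cache exposes the accuracy
# bug tracked in issue #1255.
LTS_DRIVER_VERSION = Version("1.3.27642")

def use_remat_cache(metadata: dict) -> bool:
    # Disable the layout rematerialization cache only on the LTS driver.
    driver_version = Version(metadata["target"].arch["driver_version"])
    return driver_version != LTS_DRIVER_VERSION

A single helper would also keep the quoting consistent and give the workaround one obvious place to delete once the driver fix lands.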

@alexbaden merged commit eb51a81 into llvm-target on Jun 8, 2024
alexbaden added a commit that referenced this pull request Jun 8, 2024
Temporarily disables the layout rematerialization cache for backward
slice rematerialization. The remat cache is exposing a bug on systems
with the LTS driver which results in an accuracy error with two
HuggingFace models under the PyTorch Inductor benchmarks.

Issue #1255

(cherry picked from commit eb51a81)
whitneywhtsang added a commit that referenced this pull request Jun 12, 2024
etiotto pushed a commit that referenced this pull request Jul 10, 2024
The workarounds were added in #1275 and #1337.
All HuggingFace training Float32 models pass with the LTS workaround removed:
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9865658421

Signed-off-by: Whitney Tsang <whitney.tsang@intel.com>
@whitneywhtsang deleted the alex/disable_remat_cache branch on July 22, 2024 03:53