Skip to content

python3Packages.torchWithRocm: fix build#298609

Merged
ulrikstrid merged 1 commit intoNixOS:masterfrom
mschwaig:fix-torch-with-rocm
Mar 30, 2024
Merged

python3Packages.torchWithRocm: fix build#298609
ulrikstrid merged 1 commit intoNixOS:masterfrom
mschwaig:fix-torch-with-rocm

Conversation

@mschwaig
Copy link
Member

@mschwaig mschwaig commented Mar 24, 2024

Description of changes

torchWithRocm fails to build, since we merged ROCm 6.0 (#287846).
This PR together with #298451 should fix that.

I am still in the process of building this fix locally, to see if it works.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@github-actions github-actions bot added the 6.topic: python Python is a high-level, general-purpose programming language. label Mar 24, 2024
@ofborg ofborg bot added the 8.has: package (new) This PR adds a new package label Mar 24, 2024
@ofborg ofborg bot requested review from teh, thoughtpolice and tscholak March 24, 2024 12:10
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux. labels Mar 24, 2024
@mschwaig mschwaig removed the 8.has: package (new) This PR adds a new package label Mar 24, 2024
@mschwaig
Copy link
Member Author

torchWithRocmstill fails to build with this and the PR linked above:

FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/hip/detail/HIPHooks.cpp.o 
/nix/store/kvlhk0gpm2iz1asbw1xjac2ch0r8kyw9-gcc-wrapper-13.2.0/bin/g++ -DAT_PER_OPERATOR_HEADERS -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMAGMA_V2 ->
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
In file included from /build/source/aten/src/ATen/core/IListRef.h:631,
                 from /build/source/aten/src/ATen/DeviceGuard.h:3,
                 from /build/source/aten/src/ATen/hip/detail/HIPHooks.cpp:6:
/build/source/aten/src/ATen/core/IListRef_inl.h: In static member function ‘static c10::detail::IListRefConstRef<at::OptionalTensorRef> c10::detail::IListRefTagImpl<c10::IListRefTag::Boxed,>
/build/source/aten/src/ATen/core/IListRef_inl.h:171:17: warning: possibly dangling reference to a temporary [-Wdangling-reference]
  171 |     const auto& ivalue = (*it).get();
      |                 ^~~~~~
/build/source/aten/src/ATen/core/IListRef_inl.h:171:35: note: the temporary was destroyed at the end of the full expression ‘(& it)->c10::impl::ListIterator<std::optional<at::Tensor>, __gnu>
  171 |     const auto& ivalue = (*it).get();
      |                          ~~~~~~~~~^~
/build/source/aten/src/ATen/hip/detail/HIPHooks.cpp: In member function ‘virtual bool at::cuda::detail::CUDAHooks::isPinnedPtr(const void*) const’:
/build/source/aten/src/ATen/hip/detail/HIPHooks.cpp:158:15: error: ‘hipPointerAttribute_t’ {aka ‘struct hipPointerAttribute_t’} has no member named ‘memoryType’
  158 |   return attr.memoryType == hipMemoryTypeHost;
      |               ^~~~~~~~~~

Maybe the correct fix for now is to switch to rocmPackages_5.

@mschwaig
Copy link
Member Author

mschwaig commented Mar 24, 2024

I tried switching back to rcomPackages_5, and that succeeds at building torch, but my runtime test of running nanoGPT with nix develop fails at runtime:

 (Triggered internally at /build/source/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
  return node.target(*args, **kwargs)
/tmp/nix-shell.XPsinF/tmpqculhp7v/main.c:2:10: fatal error: hip/hip_runtime.h: No such file or directory
    2 | #include <hip/hip_runtime.h>
      |          ^~~~~~~~~~~~~~~~~~~
compilation terminated.
/tmp/nix-shell.XPsinF/tmpdyfzmrqg/main.c:2:10: fatal error: hip/hip_runtime.h: No such file or directory
    2 | #include <hip/hip_runtime.h>
      |          ^~~~~~~~~~~~~~~~~~~
compilation terminated.

I can fix this like so (not sure if it should be in torchWithRocm or downstream:

propagatedBuildInputs = [ pkgs.rocmPackages_5.hip-common ];

But, then it's complaining about not finding hip/hip_version.h, and that file is there on the master branch of https://github.com/rocm/hip, but not in the version tagged as 5.7.1 tag.

At the same time it does not seem like upstream torch supports ROCm 6.0 already.

@mschwaig
Copy link
Member Author

mschwaig commented Mar 24, 2024

The hip/hip_version.h file would normally be generated by a CMmakeLists.txt file, but that's not checked in in the release branches either.
I manually put together such a file to see how far that would go. The build for that is still running.

The branch for it is here: master...mschwaig:nixpkgs:fix-torch-with-rocm-2

@mschwaig mschwaig force-pushed the fix-torch-with-rocm branch from 8400be6 to 45437c9 Compare March 25, 2024 22:26
@mschwaig
Copy link
Member Author

Actually I found out that those runtime errors involving hip-common were expected, since I only had nanoGPT working with the --compile=False flag.

So the right solution for now should be to go back to rocmPackages_5 and leave HIP alone.
This PR also has some commits to fix missing or erroneous deprecation warnings for ROCm 6.0.

I'm running nixpkgs-review for this now.

@ofborg ofborg bot added the 8.has: package (new) This PR adds a new package label Mar 25, 2024
@mschwaig
Copy link
Member Author

mschwaig commented Mar 26, 2024

Result of nixpkgs-review pr 298609 run on x86_64-linux 1

3 packages marked as broken and skipped:
  • rocmPackages.mivisionx
  • rocmPackages.mivisionx-cpu
  • rocmPackages.mivisionx-hip
3 packages failed to build:
  • rocmPackages_5.mivisionx (rocmPackages_5.mivisionx-hip)
  • rocmPackages_5.mivisionx-cpu
  • rocmPackages_5.mivisionx-opencl
10 packages built:
  • python311Packages.torchWithRocm
  • python311Packages.torchWithRocm.cxxdev
  • python311Packages.torchWithRocm.dev
  • python311Packages.torchWithRocm.dist
  • python311Packages.torchWithRocm.lib
  • python312Packages.torchWithRocm
  • python312Packages.torchWithRocm.cxxdev
  • python312Packages.torchWithRocm.dev
  • python312Packages.torchWithRocm.dist
  • python312Packages.torchWithRocm.lib

rocmPackages_5.mivisionxneeds a fix for its version of libjpeg_turbo, which I plan to open another PR for.

@mschwaig mschwaig marked this pull request as ready for review March 26, 2024 17:37
@mschwaig mschwaig marked this pull request as draft March 26, 2024 17:47
@mschwaig mschwaig force-pushed the fix-torch-with-rocm branch from 347c3be to 45437c9 Compare March 28, 2024 11:08
@mschwaig
Copy link
Member Author

Result of nixpkgs-review pr 298609 run on x86_64-linux 1

6 packages marked as broken and skipped:
  • rocmPackages.mivisionx
  • rocmPackages.mivisionx-cpu
  • rocmPackages.mivisionx-hip
  • rocmPackages_6.mivisionx
  • rocmPackages_6.mivisionx-cpu
  • rocmPackages_6.mivisionx-hip
3 packages failed to build:
  • rocmPackages_5.mivisionx (rocmPackages_5.mivisionx-hip)
  • rocmPackages_5.mivisionx-cpu
  • rocmPackages_5.mivisionx-opencl
10 packages built:
  • python311Packages.torchWithRocm
  • python311Packages.torchWithRocm.cxxdev
  • python311Packages.torchWithRocm.dev
  • python311Packages.torchWithRocm.dist
  • python311Packages.torchWithRocm.lib
  • python312Packages.torchWithRocm
  • python312Packages.torchWithRocm.cxxdev
  • python312Packages.torchWithRocm.dev
  • python312Packages.torchWithRocm.dist
  • python312Packages.torchWithRocm.lib

@mschwaig mschwaig marked this pull request as ready for review March 28, 2024 11:10
@mschwaig mschwaig requested a review from ulrikstrid March 28, 2024 13:52
@ulrikstrid
Copy link
Member

Can we split this into 2 PRs? One for pytorch and one for deprecated packages. I don't see why they should be in the same one.

If you do that we can merge both quickly 😊 or if you can give a compelling reason

@mschwaig mschwaig force-pushed the fix-torch-with-rocm branch from 45437c9 to 1c92ae9 Compare March 30, 2024 11:51
Copy link
Member

@ulrikstrid ulrikstrid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@ulrikstrid ulrikstrid merged commit ae32f55 into NixOS:master Mar 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

6.topic: python Python is a high-level, general-purpose programming language. 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 11-100 This PR causes between 11 and 100 packages to rebuild on Linux.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants