Added fbgemm recipe building fbgemm, fbgemm-gpu, and fbgemm-gpu-genai#31820
Added fbgemm recipe building fbgemm, fbgemm-gpu, and fbgemm-gpu-genai#31820das-intensity wants to merge 6 commits into
Conversation
|
Hi! This is the staged-recipes linter and your PR looks excellent but I have some suggestions. File-specific lints and/or hints:
|
|
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR ( I do have some suggestions for making it better though... For recipes/fbgemm/meta.yaml:
For recipes/asmjit/meta.yaml:
This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/20698834647. Examine the logs at this URL for more detail. |
h-vetinari
left a comment
There was a problem hiding this comment.
Thanks for the work on this. This will need some more iteration. Perhaps also consider writing this as a v1 recipe (not required, but should be beneficial, e.g. for things like the git checkout stuff).
| number: 0 | ||
| skip: true # [py<38] | ||
| skip: true # [win] | ||
| skip: true # [aarch64] # git_url source requires git on build system, problematic for cross-compilation |
There was a problem hiding this comment.
Even if that were the case (which I doubt), we would be able to use $BUILD_PREFIX/bin/git
There was a problem hiding this comment.
So interestingly it uses git to clone, but then says git isn't available. See here: https://gist.github.com/das-intensity/7f5a5bc9d238bbcd63940863a6ab3404
There was a problem hiding this comment.
Try adding
- git
- git-lfs
to the build environment. For whatever reason, the checkout procedure tries to look in there
FileNotFoundError: [Errno 2] No such file or directory: '/home/conda/staged-recipes/build_artifacts/fbgemm_1767418241898/_build_env/bin/git'
git-lfs may not be strictly necessary, but addresses
git: 'lfs' is not a git command. See 'git --help'.
| - name: fbgemm | ||
| build: | ||
| script: | | ||
| git submodule update --init --recursive |
There was a problem hiding this comment.
conda should normally default to recursive checkouts of submodules? Did you test that this is required?
As an aside, why not start out with a v1 recipe right away?
There was a problem hiding this comment.
conda should normally default to recursive checkouts
Right you are, I think I just missed removing this output. Will drop.
why not start out with a v1 recipe right away?
I think I read this: https://github.com/conda-forge/staged-recipes/blob/main/recipes/example-v1/README.md?plain=1#L3
but is not yet fully supported by conda-forge's automation.
and figured I didn't know enough conda-forge to know whether what's "not yet fully supported" would come back to bite me (plus I was more familiar with legacy style).
There was a problem hiding this comment.
but is not yet fully supported by conda-forge's automation.
That comment is 2 years old. By now things work pretty much without a hitch.
| - python | ||
| - pip | ||
| - setuptools-git-versioning | ||
| - pytorch | ||
| - pytorch * *cuda* # [cuda_compiler_version != "None"] | ||
| - scikit-build | ||
| - tabulate | ||
| - jinja2 | ||
| - pyyaml | ||
| - cuda-cudart-dev # [cuda_compiler_version != "None"] | ||
| - cuda-nvrtc-dev # [cuda_compiler_version != "None"] | ||
| - cuda-nvtx-dev # [cuda_compiler_version != "None"] | ||
| - libcublas-dev # [cuda_compiler_version != "None"] | ||
| - libcusolver-dev # [cuda_compiler_version != "None"] | ||
| - libcusparse-dev # [cuda_compiler_version != "None"] | ||
| - libcurand-dev # [cuda_compiler_version != "None"] |
There was a problem hiding this comment.
all of this should be in host:, not build:
There was a problem hiding this comment.
I suspect you're probably right about SOME of these, but I switched from:
build:
- {{ compiler('c') }}
- {{ compiler('cxx') }}
- {{ compiler('cuda') }} # [cuda_compiler_version != "None"]
- {{ stdlib('c') }}
- cmake
- make
- ninja
- git
- python
- pip
- setuptools-git-versioning
- pytorch
- pytorch * *cuda* # [cuda_compiler_version != "None"]
- scikit-build
- tabulate
- jinja2
- pyyaml
- cuda-cudart-dev # [cuda_compiler_version != "None"]
- cuda-nvrtc-dev # [cuda_compiler_version != "None"]
- cuda-nvtx-dev # [cuda_compiler_version != "None"]
- libcublas-dev # [cuda_compiler_version != "None"]
- libcusolver-dev # [cuda_compiler_version != "None"]
- libcusparse-dev # [cuda_compiler_version != "None"]
- libcurand-dev # [cuda_compiler_version != "None"]
host:
- python
- pip
- setuptools
- setuptools-git-versioning
- wheel
- pytorch
- scikit-build
- numpy
- cuda-version {{ cuda_compiler_version }} # [cuda_compiler_version != "None"]
to
build:
- {{ compiler('c') }}
- {{ compiler('cxx') }}
- {{ compiler('cuda') }} # [cuda_compiler_version != "None"]
- {{ stdlib('c') }}
- cmake
- make
- ninja
- git
host:
- python
- pip
- setuptools
- setuptools-git-versioning
- wheel
- pytorch
- pytorch * *cuda* # [cuda_compiler_version != "None"]
- scikit-build
- numpy
- tabulate
- jinja2
- pyyaml
- cuda-version {{ cuda_compiler_version }} # [cuda_compiler_version != "None"]
- cuda-cudart-dev # [cuda_compiler_version != "None"]
- cuda-nvrtc-dev # [cuda_compiler_version != "None"]
- cuda-nvtx-dev # [cuda_compiler_version != "None"]
- libcublas-dev # [cuda_compiler_version != "None"]
- libcusolver-dev # [cuda_compiler_version != "None"]
- libcusparse-dev # [cuda_compiler_version != "None"]
- libcurand-dev # [cuda_compiler_version != "None"]
but the cuda build failed with:
[ 7%] Building CXX object CMakeFiles/asmjit.dir/home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/work/external/asmjit/src/asmjit/core/builder.cpp.o
/home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/_build_env/bin/x86_64-conda-linux-gnu-c++ -DPROTOBUF_USE_DLLS -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -isystem /home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/lib/python3.11/site-packages/torch/include -isystem /home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/include -isystem /home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/_build_env/targets/x86_64-linux/include -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -D_GLIBCXX_USE_CXX11_ABI=1 -MD -MT CMakeFiles/asmjit.dir/home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/work/external/asmjit/src/asmjit/core/builder.cpp.o -MF CMakeFiles/asmjit.dir/home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/work/external/asmjit/src/asmjit/core/builder.cpp.o.d -o CMakeFiles/asmjit.dir/home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/work/external/asmjit/src/asmjit/core/builder.cpp.o -c /home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/work/external/asmjit/src/asmjit/core/builder.cpp
In file included from /home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/include/ATen/cuda/CUDAContext.h:3,
from /home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/work/fbgemm_gpu/src/embedding_inplace_ops/embedding_inplace_update_gpu.cpp:11:
/home/conda/staged-recipes/build_artifacts/fbgemm_1767562352076/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/include/ATen/cuda/CUDAContextLight.h:7:10: fatal error: cusparse.h: No such file or directory
7 | #include <cusparse.h>
| ^~~~~~~~~~~~
compilation terminated.
| - test -f $PREFIX/lib/libfbgemm${SHLIB_EXT} # [unix] | ||
| - test -f $PREFIX/include/fbgemm/FbgemmBuild.h # [unix] | ||
|
|
||
| - name: fbgemm-gpu |
There was a problem hiding this comment.
I suspect it might be better to use the name fbgemm for this output (though we can keep fbgemm-gpu as an alias wrapper if necessary)
There was a problem hiding this comment.
That does sound like a logical answer, but fbgemm_gpu is the actual module name.
If we changed fbgemm -> libfbgemm like you suggested, it would free up the package name, but from a user standpoint, the docs clearly show 3 related but distinct pacakges:
- FBGEMM
- FBGEMM_GPU <--- this output package
- FBGEMM_GPU_GENAI
(don't shoot me, I'm just the messenger maintainer)
So given this, calling this fbgemm would be confusing to users.
There was a problem hiding this comment.
fbgemm_gpu_cpu is braindead naming for something CPU-only, and the rest is not much better. There's a fundamental difference between fbgemm (what I call libfbgemm), the C++ library, and fbgemm_gpu (what I want to call fbgemm), the python bindings. The upstream naming does not reflect that at all.
So given this, calling this
fbgemmwould be confusing to users.
I don't think so, or rather, it doesn't matter. If a user installs fbgemm (in my proposed naming), they'd get both the library that the want, as well as the python bindings (that they perhaps don't need). If that user cares about slimming down their environment, they can learn about the libfbgemm naming.
Summing up, this is what I mean. We can't stop upstream from loading a shotgun and blowing their feet off, but we don't have to follow suit on this particular aspect.
| thing | name upstream | name conda-forge (proposed) |
|---|---|---|
| library | fbgemm(not installable via PyPI) |
libfbgemm |
| python bindings (CUDA) |
fbgemm-gpu |
fbgemm (build string cuda*)(possible to keep upstream name as compat. wrapper) |
| python bindings (CPU) |
fbgemm-gpu-cpu |
fbgemm (build string cpu*)(possible to keep upstream name as compat. wrapper) |
| genAI extension | fbgemm-gpu-genai |
fbgemm-genai(depending on fbgemm) |
CC @conda-forge/pytorch-cpu for viz.
There was a problem hiding this comment.
I guess the lack of any fbgemm-gpu will ensure that users wanting that "perhaps look harder", and then perhaps we can start the description field with:
Upstream name FBGEMM_GPU
so it shows up on anaconda search pages e.g. https://anaconda.org/search?q=pytorch
Will make the change!
| - name: fbgemm-gpu-genai | ||
| build: | ||
| skip: true # [cuda_compiler_version == "None"] | ||
| script: | | ||
| cd fbgemm_gpu | ||
| python setup.py --package_variant=genai --package_channel=release install --prefix=$PREFIX --single-version-externally-managed --record=record.txt |
There was a problem hiding this comment.
This looks problematic to me; how does the genai variant interact with the cpu/cuda variants? I would have expected this output to depend on the the previous one. Otherwise, we would have to implement pretty complex mutex rules, which is definitely not the right approach here.
There was a problem hiding this comment.
See my above comment #31820 (comment), while it uses the same setup.py file it's really a whole different codebase.
I will say there's some overlap, but fbgemm-gpu and fbgemm-gpu-genai both have lots that the other doesn't.
| -DCMAKE_PREFIX_PATH=$PREFIX \ | ||
| -DCMAKE_BUILD_TYPE=Release \ | ||
| -DFBGEMM_LIBRARY_TYPE=shared \ | ||
| -DASMJIT_STATIC=OFF \ |
There was a problem hiding this comment.
Can we build this on top of #31498 (comment)? You can have several recipes per PR; if fbgemm depends on asmjit, CI here will determine the DAG and build the recipes in the correct order (so that fbgemm can depend on asmjit)
|
The cuda build timed out after 6hrs. Unfortunately I'm not surprised. It didn't take quite that long on my local machine IIRC, but my local is pretty powerful. How can I go about debugging the pipeline machines? E.g. how can I know if it's fully pegging the CPU, such that perhaps a cmake/etc flag might help. |
Most effective is reducing the GPU arches for now (to a single one). We can switch to cirun once the feedstock is created. |
Checklist
url) rather than a repo (e.g.git_url) is used in your recipe (see here for more details).