Add NVIDIA DCGM and DCGM-exporter (prometheus) #235024
Merged
7 commits
40f8d8b jsoncpp: fix enableStatic
6ba4923 libevent: fix sslSupport = false
85888a1 tclap: add 1.4 variant
6549776 cudatoolkit: fix builds for 10.*
5ba94f8 cudatoolkit: fix build for 12.0.1
1cdc375 dcgm: init at 3.1.8
b25101f prometheus-dcgm-exporter: init at 3.1.8-3.1.5
File renamed without changes.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,48 @@ | ||
| { lib | ||
| , stdenv | ||
| , fetchgit | ||
| , cmake | ||
| , doxygen | ||
| , python3 | ||
| }: | ||
| stdenv.mkDerivation { | ||
| pname = "tclap"; | ||
| | ||
| # This version is slightly newer than 1.4.0-rc1: | ||
| # See https://github.com/mirror/tclap/compare/1.4.0-rc1..3feeb7b2499b37d9cb80890cadaf7c905a9a50c6 | ||
| version = "1.4-3feeb7b"; | ||
| | ||
| src = fetchgit { | ||
| url = "git://git.code.sf.net/p/tclap/code"; | ||
| rev = "3feeb7b2499b37d9cb80890cadaf7c905a9a50c6"; # 1.4 branch | ||
| hash = "sha256-byLianB6Vf+I9ABMmsmuoGU2o5RO9c5sMckWW0F+GDM="; | ||
| }; | ||
| | ||
| postPatch = '' | ||
| substituteInPlace CMakeLists.txt \ | ||
| --replace '$'{CMAKE_INSTALL_LIBDIR_ARCHIND} '$'{CMAKE_INSTALL_LIBDIR} | ||
| substituteInPlace packaging/pkgconfig.pc.in \ | ||
| --replace '$'{prefix}/@CMAKE_INSTALL_INCLUDEDIR@ @CMAKE_INSTALL_FULL_INCLUDEDIR@ | ||
| ''; | ||
| | ||
| nativeBuildInputs = [ | ||
| cmake | ||
| doxygen | ||
| python3 | ||
| ]; | ||
| | ||
| # Installing docs is broken in this package+version so we stub out some files | ||
| preInstall = '' | ||
| touch docs/manual.html | ||
| ''; | ||
| | ||
| doCheck = true; | ||
| | ||
| meta = with lib; { | ||
| description = "Templatized C++ Command Line Parser Library (v1.4)"; | ||
| homepage = "https://tclap.sourceforge.net/"; | ||
| license = licenses.mit; | ||
| maintainers = teams.deshaw.members; | ||
| platforms = platforms.all; | ||
| }; | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,147 @@ | ||
| { lib | ||
| , callPackage | ||
| , gcc11Stdenv | ||
| , fetchFromGitHub | ||
| , addOpenGLRunpath | ||
| , catch2 | ||
| , cmake | ||
| , cudaPackages_10_2 | ||
| , cudaPackages_11_8 | ||
| , cudaPackages_12 | ||
| , fmt_9 | ||
| , git | ||
| , jsoncpp | ||
| , libevent | ||
| , plog | ||
| , python3 | ||
| , symlinkJoin | ||
| , tclap_1_4 | ||
| , yaml-cpp | ||
| }: | ||
| let | ||
| # Flags copied from DCGM's libevent build script | ||
| libevent-nossl = libevent.override { sslSupport = false; }; | ||
| libevent-nossl-static = libevent-nossl.overrideAttrs (super: { | ||
| CFLAGS = "-Wno-cast-function-type -Wno-implicit-fallthrough -fPIC"; | ||
| CXXFLAGS = "-Wno-cast-function-type -Wno-implicit-fallthrough -fPIC"; | ||
| configureFlags = super.configureFlags ++ [ "--disable-shared" "--with-pic" ]; | ||
| }); | ||
| | ||
| jsoncpp-static = jsoncpp.override { enableStatic = true; }; | ||
| | ||
| # DCGM depends on 3 different versions of CUDA at the same time. | ||
| # The runtime closure, thankfully, is quite small because most things | ||
| # are statically linked. | ||
| cudaPackageSetByVersion = [ | ||
| { | ||
| version = "10"; | ||
| # Nixpkgs cudaPackages_10 doesn't have redist packages broken out. | ||
| pkgSet = [ | ||
| cudaPackages_10_2.cudatoolkit | ||
| cudaPackages_10_2.cudatoolkit.lib | ||
| ]; | ||
| } | ||
| { | ||
| version = "11"; | ||
| pkgSet = getCudaPackages cudaPackages_11_8; | ||
| } | ||
| { | ||
| version = "12"; | ||
| pkgSet = getCudaPackages cudaPackages_12; | ||
| } | ||
| ]; | ||
| | ||
| # Select needed redist packages from cudaPackages | ||
| # C.f. https://github.com/NVIDIA/DCGM/blob/7e1012302679e4bb7496483b32dcffb56e528c92/dcgmbuild/scripts/0080_cuda.sh#L24-L39 | ||
| getCudaPackages = p: with p; [ | ||
| cuda_cccl | ||
| cuda_cudart | ||
| cuda_nvcc | ||
| cuda_nvml_dev | ||
| libcublas | ||
| libcufft | ||
| libcurand | ||
| ]; | ||
| | ||
| # Builds CMake code to add CUDA paths for include and lib. | ||
| mkAppendCudaPaths = { version, pkgSet }: | ||
| let | ||
| # The DCGM CMake assumes that the folder containing cuda.h contains all headers, so we must | ||
| # combine everything together for headers to work. | ||
| # It would be more convenient to use symlinkJoin on *just* the include subdirectories | ||
| # of each package, but not all of them have an include directory and making that work | ||
| # is more effort than it's worth for this temporary, build-time package. | ||
| combined = symlinkJoin { | ||
| name = "cuda-combined-${version}"; | ||
| paths = pkgSet; | ||
| }; | ||
| # The combined package above breaks the build for some reason so we just configure | ||
| # each package's library path. | ||
| libs = lib.concatMapStringsSep " " (x: ''"${x}/lib"'') pkgSet; | ||
| in '' | ||
| list(APPEND Cuda${version}_INCLUDE_PATHS "${combined}/include") | ||
| list(APPEND Cuda${version}_LIB_PATHS ${libs}) | ||
| ''; | ||
| | ||
| # gcc11 is required by DCGM's very particular build system | ||
| # C.f. https://github.com/NVIDIA/DCGM/blob/7e1012302679e4bb7496483b32dcffb56e528c92/dcgmbuild/build.sh#L22 | ||
| in gcc11Stdenv.mkDerivation rec { | ||
| pname = "dcgm"; | ||
| version = "3.1.8"; | ||
| | ||
| src = fetchFromGitHub { | ||
| owner = "NVIDIA"; | ||
| repo = "DCGM"; | ||
| rev = "refs/tags/v${version}"; | ||
| hash = "sha256-OXqXkP2ZUNPzafGIgJ0MKa39xB84keVFFYl+JsHgnks="; | ||
| }; | ||
| | ||
| # Add our paths to the CUDA paths so FindCuda.cmake can find them. | ||
| EXTRA_CUDA_PATHS = lib.concatMapStringsSep "\n" mkAppendCudaPaths cudaPackageSetByVersion; | ||
| prePatch = '' | ||
| echo "$EXTRA_CUDA_PATHS"$'\n'"$(cat cmake/FindCuda.cmake)" > cmake/FindCuda.cmake | ||
| ''; | ||
| | ||
| hardeningDisable = [ "all" ]; | ||
| | ||
| nativeBuildInputs = [ | ||
| addOpenGLRunpath | ||
| cmake | ||
| git | ||
| python3 | ||
| | ||
| jsoncpp-static | ||
| jsoncpp-static.dev | ||
| libevent-nossl-static | ||
| libevent-nossl-static.dev | ||
| plog.dev # header-only | ||
| tclap_1_4 # header-only | ||
| ]; | ||
| | ||
| buildInputs = [ | ||
| catch2 | ||
| fmt_9 | ||
| yaml-cpp | ||
| ]; | ||
| | ||
| # libcuda.so must be found at runtime because it is supplied by the NVIDIA | ||
| # driver. autoAddOpenGLRunpathHook breaks on the statically linked exes. | ||
| postFixup = '' | ||
| find "$out/bin" "$out/lib" -type f -executable -print0 | while IFS= read -r -d "" f; do | ||
| if isELF "$f" && [[ $(patchelf --print-needed "$f" || true) == *libcuda.so* ]]; then | ||
| addOpenGLRunpath "$f" | ||
| fi | ||
| done | ||
| ''; | ||
| | ||
| disallowedReferences = lib.concatMap (x: x.pkgSet) cudaPackageSetByVersion; | ||
| | ||
| meta = with lib; { | ||
| description = "Data Center GPU Manager (DCGM) is a daemon that allows users to monitor NVIDIA data-center GPUs."; | ||
| homepage = "https://developer.nvidia.com/dcgm"; | ||
| license = licenses.asl20; | ||
| maintainers = teams.deshaw.members; | ||
| mainProgram = "dcgmi"; | ||
| platforms = platforms.linux; | ||
| }; | ||
| } | ||
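For illustration, the path-generation step in `mkAppendCudaPaths` can be approximated with builtins alone. This is a simplified sketch, not the expression from the diff: `mkAppendLibPaths`, `libRoots`, and the `/placeholder/...` paths are invented for the example, and the include-path half (the `symlinkJoin`) is left out.

```nix
# Hypothetical stand-in for the library-path half of mkAppendCudaPaths.
# Uses only builtins so it can be evaluated without Nixpkgs.
# The paths below are placeholders, not real store paths.
let
  mkAppendLibPaths = { version, libRoots }:
    let
      # Quote each "<root>/lib" directory and join them with spaces,
      # mirroring the lib.concatMapStringsSep call in the diff.
      libs = builtins.concatStringsSep " " (map (x: ''"${x}/lib"'') libRoots);
    in
    "list(APPEND Cuda${version}_LIB_PATHS ${libs})";
in
# One CMake line per CUDA major version, joined with newlines.
builtins.concatStringsSep "\n" (map mkAppendLibPaths [
  { version = "11"; libRoots = [ "/placeholder/cudart-11" "/placeholder/cublas-11" ]; }
  { version = "12"; libRoots = [ "/placeholder/cudart-12" ]; }
])
```

Saved to a file and evaluated with `nix-instantiate --eval --strict`, this should produce one `list(APPEND ...)` line per version, which is the kind of text that `prePatch` prepends to `cmake/FindCuda.cmake`.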
66 changes: 66 additions & 0 deletions
pkgs/servers/monitoring/prometheus/dcgm-exporter/default.nix
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,66 @@ | ||
| { lib | ||
| , buildGoModule | ||
| , fetchFromGitHub | ||
| , cudaPackages | ||
| , dcgm | ||
| , linuxPackages | ||
| }: | ||
| buildGoModule rec { | ||
| pname = "dcgm-exporter"; | ||
| version = "3.1.8-3.1.5"; | ||
| | ||
| src = fetchFromGitHub { | ||
| owner = "NVIDIA"; | ||
| repo = pname; | ||
| rev = "refs/tags/${version}"; | ||
| hash = "sha256-Jzv3cU3gmGIXV+DV3wV/1zSWwz18s3Jax6JC7WZW7Z4="; | ||
| }; | ||
| | ||
| # Upgrade to go 1.17 during the vendoring FOD build because it fails otherwise. | ||
| overrideModAttrs = _: { | ||
| preBuild = '' | ||
| substituteInPlace go.mod --replace 'go 1.16' 'go 1.17' | ||
| go mod tidy | ||
| ''; | ||
| postInstall = '' | ||
| cp go.mod "$out/go.mod" | ||
| ''; | ||
| }; | ||
| | ||
| CGO_LDFLAGS = "-ldcgm"; | ||
| | ||
| buildInputs = [ | ||
| dcgm | ||
| ]; | ||
| | ||
| # gonvml and go-dcgm do not work with ELF BIND_NOW hardening because not all | ||
| # symbols are available on startup. | ||
| hardeningDisable = [ "bindnow" ]; | ||
| | ||
| # Copy the modified go.mod we got from the vendoring process. | ||
| preBuild = '' | ||
| cp vendor/go.mod go.mod | ||
| ''; | ||
| | ||
| vendorHash = "sha256-KMCV79kUY1sNYysH0MmB7pVU98r7v+DpLIoYHxyyG4U="; | ||
| | ||
| nativeBuildInputs = [ | ||
| cudaPackages.autoAddOpenGLRunpathHook | ||
| ]; | ||
| | ||
| # Tests try to interact with running DCGM service. | ||
| doCheck = false; | ||
| | ||
| postFixup = '' | ||
| patchelf --add-needed libnvidia-ml.so "$out/bin/dcgm-exporter" | ||
| ''; | ||
| | ||
| meta = with lib; { | ||
| description = "NVIDIA GPU metrics exporter for Prometheus leveraging DCGM"; | ||
| homepage = "https://github.com/NVIDIA/dcgm-exporter"; | ||
| license = licenses.asl20; | ||
| maintainers = teams.deshaw.members; | ||
| mainProgram = "dcgm-exporter"; | ||
| platforms = platforms.linux; | ||
| }; | ||
| } |
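One possible way to try both packages from a Nixpkgs checkout that contains this change is a small `shell.nix`. This is a sketch under assumptions: the attribute names `dcgm` and `prometheus-dcgm-exporter` are taken from the commit titles and are assumed to be the top-level attributes, and `allowUnfree` is set on the assumption that the CUDA build inputs are unfree.

```nix
# shell.nix (illustrative only; attribute names are assumed from the commit titles)
{ pkgs ? import <nixpkgs> {
    # Assumption: the CUDA packages used to build dcgm carry an unfree licence.
    config.allowUnfree = true;
  }
}:

pkgs.mkShell {
  # Puts dcgmi and dcgm-exporter (the mainProgram values in the diff) on PATH.
  packages = [
    pkgs.dcgm
    pkgs.prometheus-dcgm-exporter
  ];
}
```

Running `nix-shell` in the directory holding this file would build or fetch both packages. Talking to a GPU still needs the NVIDIA driver on the host, since `libcuda.so` is located at runtime as the `postFixup` comment in the dcgm expression notes.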
Why are these build flags necessary? If at all possible, it is preferable to use dependencies as they are packaged by default.
DCGM is very particular about its build. Its entire build system is actually deterministic, but it uses Docker to achieve that. This makes it extremely particular about the exact build flags used for every dependency. These dependencies are built with these specific flags and I ran into build failures in other configurations. While it might be theoretically possible to find a different configuration that succeeds, this one exactly matches what upstream does in its build system so it is the most likely to succeed and have matching behavior.
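As a rough sketch of the mechanism under discussion: `override` changes the arguments a package function is called with, while `overrideAttrs` changes attributes of the resulting derivation. The example assumes a Nixpkgs in which `libevent` accepts `sslSupport` (as it does after this PR's libevent commit). The names `libeventNoSsl` and `libeventStatic` are illustrative, and the `or [ ]` guard is a defensive addition that is not in the diff.

```nix
# Sketch only: mirrors the libevent handling in the dcgm expression.
let
  pkgs = import <nixpkgs> { };

  # override: re-call the package function with a different argument.
  libeventNoSsl = pkgs.libevent.override { sslSupport = false; };

  # overrideAttrs: adjust attributes of the derivation itself.
  libeventStatic = libeventNoSsl.overrideAttrs (old: {
    # Flags taken from the diff; `or [ ]` guards against a missing attribute.
    configureFlags = (old.configureFlags or [ ]) ++ [ "--disable-shared" "--with-pic" ];
  });
in
libeventStatic
```

Because the overridden derivation has different inputs, it is a separate store path from the default `libevent`. The stock package is left untouched for other consumers, at the cost of an extra build.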