Skip to content

Try GPU CI with cupy (DNM)#466

Closed
jaimergp wants to merge 73 commits into
gpu-testsfrom
gpu-tests-cupy
Closed

Try GPU CI with cupy (DNM)#466
jaimergp wants to merge 73 commits into
gpu-testsfrom
gpu-tests-cupy

Conversation

@jaimergp

@jaimergp jaimergp commented Jan 23, 2023

Copy link
Copy Markdown
Member

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

Same as #446

Issues:

@conda-forge-webservices

Copy link
Copy Markdown
Contributor

Hi! This is the friendly automated conda-forge-webservice.

It appears you are making a pull request from a branch in your feedstock and not a fork. This procedure will generate a separate build for each push to the branch and is thus not allowed. See our documentation for more details.

Please close this pull request and remake it from a fork of this feedstock.

Have a great day!

@conda-forge-webservices

Copy link
Copy Markdown
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@jaimergp

Copy link
Copy Markdown
Member Author

@leofang - seems to be working :D Can you take a look at the logs for 0d99872? I'll see what happens with qemu now.

@leofang

leofang commented Jan 27, 2023

Copy link
Copy Markdown
Member

Oh wow @jaimergp many thanks for the test drive! It seems to work fine! (cc: @kmaehashi, we're testing GPU CI for conda-forge!)
https://github.com/conda-forge/cf-autotick-bot-test-package-feedstock/actions/runs/3998064331/jobs/6860216209#step:3:4910

Jaime, can we turn on tests? Even running a subset of GPU tests is a great improvement.

@kmaehashi

Copy link
Copy Markdown
Member

Great, thank you @jaimergp for testing!

Comment thread recipe/run_test.py Outdated
@conda-forge-webservices

This comment was marked as outdated.

1 similar comment
@conda-forge-webservices

Copy link
Copy Markdown
Contributor

Hi! This is the friendly automated conda-forge-webservice.

It appears you are making a pull request from a branch in your feedstock and not a fork. This procedure will generate a separate build for each push to the branch and is thus not allowed. See our documentation for more details.

Please close this pull request and remake it from a fork of this feedstock.

Have a great day!

@jaimergp

Copy link
Copy Markdown
Member Author

Thanks for the tips! My goal is here is not so much making everything pass, but at least assure that the UX is nice, and having the machine die mid job is not one 😬 I'll try adding more deps and see if that passes, but in the meantime I wonder if this is a resource starvation issue. It passes on CUDA 11 but maybe CUDA 12 is heavier? How much disk/RAM are you using in your test boxes? Thanks!

@jaimergp

Copy link
Copy Markdown
Member Author

Interesting do we have any metrics on what the machine was doing before it stopped? Was there high memory usage or something else amiss?

Sadly it's an ephemeral VM and OpenStack doesn't offer any immediate way of keeping a history around as far as I can see, but it would be interesting to have some info, so I'll see what we can do.

@kmaehashi

Copy link
Copy Markdown
Member

How much disk/RAM are you using in your test boxes?

We are running under 8 cores CPU & 20 GB disk. As for RAM, we allocate 52 GB but this includes RAM disk space so I'm not sure how much is actually required for tests itself, unfortunately. With that said I guess the problem is elsewhere, as 99% of the test has passed. How about adding -v to pytest, or try with the subset (e.g., cupy_tests only) and see what happens?

@jaimergp

Copy link
Copy Markdown
Member Author

Hm, these machines are:

CPU  RAM Disk
4 12GB 60GB

@jaimergp

Copy link
Copy Markdown
Member Author

Ok if I don't run cupyx_tests it doesn't error out badly. I have a VM with grafana running the whole thing now so we'll see how it goes :)

We still see cufft errors though, despite the packages being in meta.yaml:

In file included from /tmp/tmpmfodjtfc/tmpnl7smisg/cupy_callback_c05c60196d1dae6877fd7811c4f34c5b.cpp:771:
/tmp/tmpmfodjtfc/tmpnl7smisg/cupy_cufft.h:11:10: fatal error: cufft.h: No such file or directory
   11 | #include <cufft.h>
      |          ^~~~~~~~~
compilation terminated.
_ Test1dCallbacks.test_fft_load[_param_2_{n=None, norm=None, shape=(10, 10)}] __

@leofang

leofang commented Oct 19, 2023

Copy link
Copy Markdown
Member

We still see cufft errors though, despite the packages being in meta.yaml:

I see. I think this is a package layout problem specific to conda. CuPy expects headers can be found in $CUDA_PATH/include, and so in the past we've set $CUDA_PATH to $CONDA_PREFIX in the activation script. But, with CUDA 12.0+ the CTK headers are only available in $CONDA_PREFIX/targets/x86_64-linux/include/, hence the missing headers.

We either have to patch CuPy or rework on the conda package layout, neither is trivial task. I suggest we note this issue and move on.

@leofang

leofang commented Oct 19, 2023

Copy link
Copy Markdown
Member

Also, the cuFFT callback support in CuPy was never expected to work with conda packages, due to static libcufft & headers not shipped in the past. NVIDIA is working on a new solution that would get Python libraries like CuPy lifted, so let's not bothered by this 🙂)

@jaimergp

Copy link
Copy Markdown
Member Author

76% of the test suite, this is how it's looking:

tests/cupyx_tests/scipy_tests/signal_tests/test_signaltools.py ......... [ 75%]
........................................................................ [ 75%]
........................................................................ [ 75%]
........................................................................ [ 75%]
........................................................................ [ 75%]
........................................................................ [ 76%]
........................................................................ [ 76%]
........................................................................ [ 76%]
........................................................................ [ 76%]
.................................................s...................... [ 76%]
image image image

@jaimergp

Copy link
Copy Markdown
Member Author

The VM test finished correctly, but it's true that we are dangerously close to the RAM limits:

image

Since this is not using the Github Runner (I ran the build manually), maybe it does OOM and hence the issues we are seeing? What do you think @aktech?

@jaimergp

Copy link
Copy Markdown
Member Author

CUDA 12 logs

@aktech

aktech commented Oct 24, 2023

Copy link
Copy Markdown

Since this is not using the Github Runner (I ran the build manually), maybe it does OOM and hence the issues we are seeing? What do you think @aktech?

I think that does explain all the jobs that failed without apparent reason. GitHub's message around lack of resources Memory/CPU was right. We also have a larger flavor available with 16GB RAM, if we know (or can find out) that it won't exceed that then we can give that a try as well.

@jakirkham

Copy link
Copy Markdown
Member

I don't think we know how much memory is used. So we probably need to collect more data first

Perhaps it is worth running something like pytest-monitor to collect and analyze data about memory usage in tests. This will store results in a SQLite DB, which can be analyzed later, but we would need to persist that database as an artifact for retrieval once the job has ended

Is there some way to do store artifacts from completed CI runs in our setup here? Can we store results even if the job fails?

@jaimergp

Copy link
Copy Markdown
Member Author

I haven't tried adding CI artifacts yet, actually. That's a good point. After increasing the VM RAM to 16GB it doesn't OOM. We are also investigating how we can protect the GHA runner from the OOM killer a bit so other processes are stopped instead of that one. That should alleviate the disconnection problems we've seen. If the runner process dies there's nothing we can do about sending CI artifacts or other "post-mortem" diagnosis steps via GHA.

@jaimergp

Copy link
Copy Markdown
Member Author

store_build_artifacts require 7z and/or zip, but these are not present in the VM image right now. We'll add this in the next update.

@leofang

leofang commented Oct 27, 2023

Copy link
Copy Markdown
Member

Q: I may have lost track the chronological order of the commits & above comments, was OOM hit only when cupyx_tests was enabled?

@jaimergp

Copy link
Copy Markdown
Member Author

Correct! There's a grafana plot a few messages above.

@leofang

leofang commented Oct 27, 2023

Copy link
Copy Markdown
Member

Not sure if there's any memory leak, what if we run the cupyx tests in a separate process?

@jaimergp jaimergp closed this Nov 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants