Update all ci- containers to reflect main #7995
The last time I tried to update the images, I ran into issues where tests no longer passed under the updated GPU and CPU images (see `docker pull tlcpack/ci-gpu:v0.73-t3`). Having chased down the image building, there are a few important things. First, disable caching in order to make sure you pull the freshest versions of everything. Second, there were some updates to the drivers which caused breakage; specifically, this test no longer works because it incorrectly believes that CUDA is available even though there is no GPU: https://github.com/apache/tvm/blob/main/tests/cpp/build_module_test.cc#L84. This test needs to be patched to actually check for the GPU's existence. There were also changes intended to fix Rust CI, which should be included in master today, allowing us to turn Rust CI back on.
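For illustration only (the failing check referenced above lives in C++ in build_module_test.cc; this is not that code), here is a hedged sketch of the kind of existence guard meant here, assuming TVM's Python API exposes `tvm.cuda(0)` with an `exist` attribute:

```python
# Hedged sketch: gate GPU checks on a device actually being present, instead of
# assuming that a CUDA-enabled build implies an attached GPU.
import tvm


def gpu_present() -> bool:
    # Device.exist asks the device API whether CUDA device 0 really exists; it
    # is False when TVM is built with CUDA support but the machine has no GPU.
    return tvm.cuda(0).exist


if __name__ == "__main__":
    if gpu_present():
        print("CUDA device 0 found; GPU checks can run")
    else:
        print("no CUDA device found; GPU checks should be skipped")
```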
@tkonolige and I will attempt to update to 18.04 (i.e., include #7970) with these updates.
#8031 contains the updated docker images.
@tkonolige is kindly rebuilding the containers. @tkonolige, can you document the
Built from:
We accidentally tested the wrong thing, so re-re-testing here: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/89/pipeline/6
It seems like the last run was unexpectedly unable to access the GPU in the "Frontend : GPU" phase. Retrying with printing
Apologies for the light updates here. We determined that the "Frontend : GPU" tests get into a state where either the GPU hardware becomes inaccessible after a while or TVM's existence check is wrong. Since we didn't change the CUDA version used here (we just updated to 18.04), the theory is that there is some interoperability problem between the CUDA runtime in the containers (10.0) and the CUDA driver loaded on the docker host side (either 10.2 or 11.0, depending on which CI node you hit). @tkonolige and I have spent the last couple of days running on a test TVM CI cluster using the same AMI (which has only CUDA 11.0). With CUDA 10.0 (ci-gpu) and 11.0 (host), we ran into another similar-looking bug during the GPU unit tests:
We then upgraded ci-gpu to use CUDA 11.0, and this test seemed to pass all the way to the end of the GPU integration tests, modulo a tolerance issue:
We'll try to push this CUDA 11.0 ci-gpu container through the test CI cluster to see how far we can get. Feel free to comment if there are concerns about updating to CUDA 11.0.
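To make the container-vs-host theory above concrete, here is a small diagnostic sketch (not part of the CI itself) that compares the CUDA runtime inside the container with the newest CUDA version the host driver supports, via the standard cudaRuntimeGetVersion and cudaDriverGetVersion calls. The library name is an assumption; inside a given image the fully versioned soname (e.g. libcudart.so.11.0) may be needed.

```python
# Hedged diagnostic sketch: report the container's CUDA runtime version and the
# maximum CUDA version supported by the host driver, as seen from inside the
# container. Assumes libcudart.so can be loaded by the dynamic linker.
import ctypes


def cuda_versions():
    cudart = ctypes.CDLL("libcudart.so")
    runtime, driver = ctypes.c_int(0), ctypes.c_int(0)
    # Both calls return 0 (cudaSuccess) on success; versions are encoded as
    # 1000 * major + 10 * minor, e.g. 11000 means CUDA 11.0.
    assert cudart.cudaRuntimeGetVersion(ctypes.byref(runtime)) == 0
    assert cudart.cudaDriverGetVersion(ctypes.byref(driver)) == 0
    return runtime.value, driver.value


if __name__ == "__main__":
    runtime, driver = cuda_versions()
    print(f"container CUDA runtime: {runtime}")
    print(f"max CUDA supported by host driver: {driver}")
    # A disagreement between these two across the container/host boundary is
    # the kind of interoperability problem suspected above.
```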
Update: #8130 is an alternate solution to the issues presented in #8108 which doesn't sacrifice accuracy. We have tested this in an instance of the TVM CI using CUDA 11 on both the host and the docker container side, and all tests pass. We'll now attempt to merge #8130. Following that, we'll disable all CI nodes running CUDA 10 and re-run ci-docker-staging against main using our CUDA 11 containers. Based on our experiments, we think this will pass, and we can then promote those containers to tlcpack and declare victory.
#8130 is merged; testing the containers again. Hoping we see green this time: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/92/pipeline
It looks like our most recent run passed enough to merge the containers. The failure here is actually a separate CI problem triggered by #8023. We'll submit another PR to fix the issue triggered there, but we should be able to proceed here.
@areusch Not sure, but could these failures be related to the container update?
@d-smirnov should be resolved by #8160.
Indeed, resolved. Thank you!
* These currently do not render due to readthedocs/sphinx_rtd_theme#1115 * Breakage was likely caused by #7995
This is a tracking issue for the process of updating the TVM ci- containers to reflect the following PRs:
It will also unblock PRs:
Steps:
1. Check out `main` and use that specific git hash for the following steps. Document it here.
2. Build the containers as `<username>/ci-<container>:v0.<ver>` (we may not need to build all of these if "Bumped Ubuntu version to 18.04 for ci_gpu" #7970 is not included here); see the sketch after this list.
3. Update the `Jenkinsfile` to point all containers at `<username>/ci-<container>:v0.<ver>`.
4. Push the containers to `tlcpack/ci-<container>:v0.<ver>`.
5. Update the `Jenkinsfile` to the new containers.

Let's use this tracking issue to record challenges we face in updating the drivers.
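Purely as a hedged sketch of steps 2 and 4 above (building the staging images and promoting them to tlcpack), assuming a TVM checkout with the docker/Dockerfile.ci_* files and the docker CLI on the PATH; the username, version tag, and container list below are placeholders, not the values used for this update:

```python
# Hedged sketch of the build/retag/push flow; all names below are placeholders.
import subprocess

USER = "someuser"     # hypothetical staging Docker Hub account
VERSION = "v0.74"     # hypothetical new container version tag
CONTAINERS = ["ci-lint", "ci-cpu", "ci-gpu", "ci-i386", "ci-qemu", "ci-arm", "ci-wasm"]


def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


for name in CONTAINERS:
    dockerfile = f"docker/Dockerfile.{name.replace('-', '_')}"
    staging = f"{USER}/{name}:{VERSION}"
    release = f"tlcpack/{name}:{VERSION}"
    # Build with --no-cache so stale cached layers are not reused (see the note
    # above about disabling caching), then retag under tlcpack and push. In the
    # real flow, the push happens only after the ci-docker-staging run is green.
    run("docker", "build", "--no-cache", "-t", staging, "-f", dockerfile, "docker")
    run("docker", "tag", staging, release)
    run("docker", "push", release)
```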
@jroesch can you please note down where you were with this?
cc @tqchen @u99127 @d-smirnov @leandron @tristan-arm