Update all ci- containers to reflect main #7995

Closed · 17 tasks done
areusch opened this issue May 6, 2021 · 17 comments · Fixed by #8155

areusch commented May 6, 2021

This is a tracking issue for the process of updating the TVM ci- containers to reflect the following PRs:

It will also unblock PRs:

Steps:

  • Pick the latest revision of main and use that specific git hash for the following steps. Document it here.
  • Build docker containers, tagging them as e.g. <username>/ci-<container>:v0.<ver> (we may not need to build all of these if "Bumped Ubuntu version to 18.04 for ci_gpu" #7970 is not included here); a rough command sketch follows this list
    • ci-arm
    • ci-cpu
    • ci-gpu
    • ci-i386
    • ci-lint
    • ci-qemu
    • ci-wasm
  • Push docker containers to Docker Hub
  • Create a draft PR modifying Jenkinsfile to point all containers at <username>/ci-<container>:v0.<ver>
  • Force-push the PR to ci-docker-staging branch
  • Jenkins will notice the push and start a build here
  • Debug the build and repeat these steps until the build passes
  • Push the valid containers to tlcpack/ci-<container>:v0.<ver>
  • Update the PR to point Jenkinsfile to the new containers.
  • Merge the PR.
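
For reference, here is a rough sketch of the build/tag/push loop described above (TVM's docker/build.sh helper is the usual route; the Dockerfile paths, username, and version tag below are placeholders rather than the exact commands used):

```python
# Rough sketch only; placeholders, not the exact commands used for this issue.
# Assumes Dockerfiles named docker/Dockerfile.ci_<name> in the TVM checkout and
# a Docker Hub login for USERNAME.
import subprocess

USERNAME = "someuser"   # placeholder Docker Hub account
VERSION = "v0.74"       # placeholder; pick the next v0.<ver>
CONTAINERS = ["ci-arm", "ci-cpu", "ci-gpu", "ci-i386", "ci-lint", "ci-qemu", "ci-wasm"]

for name in CONTAINERS:
    tag = f"{USERNAME}/{name}:{VERSION}"
    dockerfile = f"docker/Dockerfile.{name.replace('-', '_')}"
    # Build from the docker/ context and tag for the staging push.
    subprocess.run(["docker", "build", "-t", tag, "-f", dockerfile, "docker/"], check=True)
    subprocess.run(["docker", "push", tag], check=True)
```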

Let's use this tracking issue to record challenges we face in updating the drivers.

@jroesch can you please note down where you were with this?

cc @tqchen @u99127 @d-smirnov @leandron @tristan-arm

jroesch commented May 6, 2021

So last time, I tried to update the images but ran into issues where tests no longer passed under the updated GPU and CPU images (see docker pull tlcpack/ci-gpu:v0.73-t3).

I had chased down the image-building process; there are a few important things to note.

First, disable caching to make sure you pull the freshest apt metadata: if you have built before, you can have a stale layer that pulled an apt registry which is no longer valid.
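
As an illustration of that point, the rebuild can be forced to skip every cached layer and re-pull the base image (the tag and Dockerfile path are placeholders):

```python
# Minimal sketch: --no-cache avoids stale apt layers, --pull refreshes the base image.
import subprocess

subprocess.run(
    ["docker", "build", "--no-cache", "--pull",
     "-t", "someuser/ci-gpu:v0.74",             # placeholder tag
     "-f", "docker/Dockerfile.ci_gpu", "docker/"],
    check=True,
)
```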

Second, there were some updates to the drivers which caused breakage; specifically, this test no longer works because it incorrectly believes that CUDA is available (even though there is no GPU):

https://github.com/apache/tvm/blob/main/tests/cpp/build_module_test.cc#L84

This needs to be patched to actually check for the GPU's existence.
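
(The test in question is C++; purely as an illustration of the kind of guard meant here, a Python test can skip unless a physical device is actually present. This assumes a TVM build where tvm.cuda(0).exist, or tvm.gpu(0) on older releases, reports runtime device presence.)

```python
# Illustrative sketch, not the actual patch: gate the test on a real device,
# not on whether TVM was merely compiled with CUDA support.
import pytest
import tvm

has_gpu = tvm.cuda(0).exist  # queries the runtime for an actual CUDA device

@pytest.mark.skipif(not has_gpu, reason="no CUDA-capable device detected")
def test_heterogeneous_build():
    ...  # body elided; the point is the runtime existence check
```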

There were also changes intended to fix Rust CI which should be included in main today, allowing us to turn Rust CI back on.

areusch commented May 11, 2021

@tkonolige and I will attempt to update to 18.04 (i.e. include #7970) along with these updates.

@tkonolige

#8031 contains the updated docker images

areusch commented May 12, 2021

@tkonolige is kindly rebuilding the containers. @tkonolige, can you document the main hash used, following the steps above?

areusch commented May 13, 2021

Built from: 55471632d97bb9d0d4535bd977ca4fb7e7cdcc28

areusch commented May 14, 2021

We accidentally tested the wrong thing, so re-re-testing here: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/89/pipeline/6

areusch commented May 17, 2021

Seems like the last run was unexpectedly unable to access the GPU in the "Frontend : GPU" phase. Retrying with nvidia-smi printed before running those tests: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/91/pipeline/
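
(For reference, a hypothetical way to get that nvidia-smi output into the logs is an autouse session fixture in a conftest.py; this is a sketch, not what was actually committed.)

```python
# Hypothetical conftest.py addition: dump `nvidia-smi` once per session so the
# CI log shows whether the container can see the GPU before any test runs.
import subprocess
import pytest

@pytest.fixture(scope="session", autouse=True)
def log_nvidia_smi():
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        print(result.stdout or result.stderr)
    except FileNotFoundError:
        print("nvidia-smi not found inside the container")
    yield
```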

areusch commented May 20, 2021

Apologies for the light updates here. We determined that the "Frontend : GPU" tests get into a state where either the GPU hardware becomes inaccessible after a while or TVM's existence check is wrong. Since we didn't change the CUDA version used here (we just updated to 18.04), the theory is that there is some interoperability problem between CUDA running in the containers (at 10.0) and the CUDA driver loaded on the Docker host side (either 10.2 or 11.0, depending on which CI node you hit).

@tkonolige and I have spent the last couple of days running on a test TVM CI cluster using the same AMI (which has only CUDA 11.0). With CUDA 10.0 (ci-gpu) and 11.0 (host), we ran into another similar-looking bug during the GPU unit tests:

[ RUN      ] BuildModule.Heterogeneous
[22:11:58] /workspace/src/target/opt/build_cuda_on.cc:89: Warning: cannot detect compute capability from your device, fall back to compute_30.
unknown file: Failure
C++ exception with description "[22:11:58] /workspace/src/runtime/cuda/cuda_device_api.cc:117: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: no CUDA-capable device is detected

We then upgraded ci-gpu to use CUDA 11.0, and the run made it all the way to the end of the GPU integration tests, modulo a tolerance issue:

tests/python/contrib/test_cublas.py::test_batch_matmul FAILED
// ...
>       np.testing.assert_allclose(actual, desired, rtol=rtol, atol=atol, verbose=True)
E       AssertionError: 
E       Not equal to tolerance rtol=1e-05, atol=1e-07
E       
E       Mismatched elements: 2875175 / 3866624 (74.4%)
E       Max absolute difference: 0.00541687
E       Max relative difference: 0.00015383
E        x: array([[[29.647408, 31.88966 , 33.90233 , ..., 34.673954, 32.908764,
E                31.219051],
E               [30.993076, 30.78019 , 33.67124 , ..., 36.1395  , 29.176218,...
E        y: array([[[29.646427, 31.889557, 33.900528, ..., 34.673126, 32.90791 ,
E                31.21726 ],
E               [30.991737, 30.780437, 33.67001 , ..., 36.139397, 29.174744,...

We'll try to push this CUDA 11.0 ci-gpu container through the test CI cluster to see how far we can get. Feel free to comment if there are concerns about updating to CUDA 11.0.
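
(A hedged diagnostic sketch for the version-mismatch theory above: compare the CUDA version the host driver reports via nvidia-smi with the toolkit inside the container via nvcc. The parsing below assumes the usual output format of those tools.)

```python
# Rough diagnostic: the host driver must support at least the container's CUDA
# runtime, otherwise device initialization can fail in confusing ways.
import re
import subprocess

def _version(cmd, pattern):
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    match = re.search(pattern, out)
    return match.group(1) if match else "unknown"

driver_cuda = _version(["nvidia-smi"], r"CUDA Version:\s*([\d.]+)")
toolkit_cuda = _version(["nvcc", "--version"], r"release\s*([\d.]+)")
print(f"host driver supports CUDA {driver_cuda}; container toolkit is CUDA {toolkit_cuda}")
```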

areusch commented May 25, 2021

Update: #8130 is an alternate solution to the issues presented in #8108, and it doesn't sacrifice accuracy. We have tested this in an instance of the TVM CI using CUDA 11 on both the host and the Docker container side, and all tests pass. We'll now attempt to merge #8130.

Following that, we'll disable all CI nodes running CUDA 10 and re-run ci-docker-staging against main using our CUDA 11 containers. Based on our experiments, we think this will pass, and we can then promote those containers to tlcpack and declare victory.

areusch added a commit that referenced this issue May 26, 2021
areusch commented May 26, 2021

#8130 is merged; testing the containers again. Hoping we see green this time: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/92/pipeline

areusch added a commit that referenced this issue May 26, 2021
areusch commented May 27, 2021

It looks like our most recent run passed well enough to merge the containers. The failure here is actually a separate CI problem triggered by #8023. I will submit another PR to fix the issue triggered there, but we should be able to proceed here.

areusch added a commit to areusch/tvm that referenced this issue May 27, 2021
areusch added a commit to areusch/tvm that referenced this issue May 27, 2021
tqchen pushed a commit that referenced this issue May 29, 2021

areusch commented Jun 1, 2021

@d-smirnov should be resolved by #8160

mehrdadh pushed a commit to mehrdadh/tvm that referenced this issue Jun 3, 2021
areusch added a commit to areusch/tvm that referenced this issue Jun 16, 2021
  * These currently do not render due to readthedocs/sphinx_rtd_theme#1115
  * Breakage was likely caused due to apache#7995
areusch added a commit to areusch/tvm that referenced this issue Jun 16, 2021
trevor-m pushed a commit to trevor-m/tvm that referenced this issue Jun 17, 2021
trevor-m pushed a commit to neo-ai/tvm that referenced this issue Jun 17, 2021
areusch added a commit to areusch/tvm that referenced this issue Jun 22, 2021
tqchen pushed a commit that referenced this issue Jun 22, 2021
ylc pushed a commit to ylc/tvm that referenced this issue Sep 29, 2021
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this issue Mar 4, 2022