Update all ci- containers to reflect main #7995
The last time I tried to update the images, I ran into issues where tests no longer passed under the updated GPU and CPU images (see `docker pull tlcpack/ci-gpu:v0.73-t3`). Having chased down the image building, there are a few important things. First, disable caching in order to make sure you pull the freshest versions of everything. Second, there were some updates to the drivers which caused breakage; specifically, this test no longer works because it incorrectly believes that CUDA is available even though there is no GPU: https://github.com/apache/tvm/blob/main/tests/cpp/build_module_test.cc#L84. This test needs to be patched to actually check for the GPU's existence. There were also changes intended to fix Rust CI, which should be included in master today, allowing us to turn Rust CI back on.
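For illustration only (the failing check referenced above lives in C++ in build_module_test.cc; this is not that code), here is a hedged sketch of the kind of existence guard meant here, assuming TVM's Python API exposes `tvm.cuda(0)` with an `exist` attribute:

```python
# Hedged sketch: gate GPU checks on a device actually being present, instead of
# assuming that a CUDA-enabled build implies an attached GPU.
import tvm


def gpu_present() -> bool:
    # Device.exist asks the device API whether CUDA device 0 really exists; it
    # is False when TVM is built with CUDA support but the machine has no GPU.
    return tvm.cuda(0).exist


if __name__ == "__main__":
    if gpu_present():
        print("CUDA device 0 found; GPU checks can run")
    else:
        print("no CUDA device found; GPU checks should be skipped")
```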
@tkonolige and I will attempt to update to 18.04 (i.e., include #7970) with these updates.
#8031 contains the updated docker images.
@tkonolige is kindly rebuilding the containers. @tkonolige, can you document the
Built from:
We accidentally tested the wrong thing, so re-re-testing here: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/89/pipeline/6
It seems like the last run was unexpectedly unable to access the GPU in the "Frontend : GPU" phase. Retrying with printing
Apologies for the light updates here. We determined that the "Frontend : GPU" tests get into a state where either the GPU hardware becomes inaccessible after a while or TVM's existence check is wrong. Since we didn't change the CUDA version used here (we just updated to 18.04), the theory is that there is some interoperability problem between the CUDA runtime in the containers (10.0) and the CUDA driver loaded on the docker host side (either 10.2 or 11.0, depending on which CI node you hit). @tkonolige and I have spent the last couple of days running on a test TVM CI cluster using the same AMI (which has only CUDA 11.0). With CUDA 10.0 (ci-gpu) and 11.0 (host), we ran into another similar-looking bug during the GPU unit tests:
We then upgraded ci-gpu to use CUDA 11.0, and this test seemed to pass all the way to the end of the GPU integration tests, modulo a tolerance issue:
We'll try to push this CUDA 11.0 ci-gpu container through the test CI cluster to see how far we can get. Feel free to comment if there are concerns about updating to CUDA 11.0.
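To make the container-vs-host theory above concrete, here is a small diagnostic sketch (not part of the CI itself) that compares the CUDA runtime inside the container with the newest CUDA version the host driver supports, via the standard cudaRuntimeGetVersion and cudaDriverGetVersion calls. The library name is an assumption; inside a given image the fully versioned soname (e.g. libcudart.so.11.0) may be needed.

```python
# Hedged diagnostic sketch: report the container's CUDA runtime version and the
# maximum CUDA version supported by the host driver, as seen from inside the
# container. Assumes libcudart.so can be loaded by the dynamic linker.
import ctypes


def cuda_versions():
    cudart = ctypes.CDLL("libcudart.so")
    runtime, driver = ctypes.c_int(0), ctypes.c_int(0)
    # Both calls return 0 (cudaSuccess) on success; versions are encoded as
    # 1000 * major + 10 * minor, e.g. 11000 means CUDA 11.0.
    assert cudart.cudaRuntimeGetVersion(ctypes.byref(runtime)) == 0
    assert cudart.cudaDriverGetVersion(ctypes.byref(driver)) == 0
    return runtime.value, driver.value


if __name__ == "__main__":
    runtime, driver = cuda_versions()
    print(f"container CUDA runtime: {runtime}")
    print(f"max CUDA supported by host driver: {driver}")
    # A disagreement between these two across the container/host boundary is
    # the kind of interoperability problem suspected above.
```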
Update: #8130 is an alternate solution to the issues presented in #8108 which doesn't sacrifice accuracy. We have tested this in an instance of the TVM CI using CUDA 11 on both the host and the docker container side, and all tests pass. We'll now attempt to merge #8130. Following that, we'll disable all CI nodes running CUDA 10 and re-run ci-docker-staging against main using our CUDA 11 containers. Based on our experiments, we think this will pass, and we can then promote those containers to tlcpack and declare victory.
#8130 is merged; testing the containers again. Hoping we see green this time: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/92/pipeline
It looks like our most recent run passed enough to merge the containers. The failure here is actually a separate CI problem triggered by #8023. We'll submit another PR to fix the issue triggered there, but we should be able to proceed here.
@areusch Not sure, but could these failures be related to the container update?
@d-smirnov should be resolved by #8160.
Indeed, resolved. Thank you!
* These currently do not render due to readthedocs/sphinx_rtd_theme#1115 * Breakage was likely caused by #7995
This is a tracking issue for the process of updating the TVM ci- containers to reflect the following PRs:
It will also unblock PRs:
Steps:
1. Check out `main` and use that specific git hash for the following steps. Document it here.
2. Build the containers as `<username>/ci-<container>:v0.<ver>` (we may not need to build all of these if "Bumped Ubuntu version to 18.04 for ci_gpu" #7970 is not included here); see the sketch after this list.
3. Update the `Jenkinsfile` to point all containers at `<username>/ci-<container>:v0.<ver>`.
4. Push the containers to `tlcpack/ci-<container>:v0.<ver>`.
5. Update the `Jenkinsfile` to the new containers.

Let's use this tracking issue to record challenges we face in updating the drivers.
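Purely as a hedged sketch of steps 2 and 4 above (building the staging images and promoting them to tlcpack), assuming a TVM checkout with the docker/Dockerfile.ci_* files and the docker CLI on the PATH; the username, version tag, and container list below are placeholders, not the values used for this update:

```python
# Hedged sketch of the build/retag/push flow; all names below are placeholders.
import subprocess

USER = "someuser"     # hypothetical staging Docker Hub account
VERSION = "v0.74"     # hypothetical new container version tag
CONTAINERS = ["ci-lint", "ci-cpu", "ci-gpu", "ci-i386", "ci-qemu", "ci-arm", "ci-wasm"]


def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


for name in CONTAINERS:
    dockerfile = f"docker/Dockerfile.{name.replace('-', '_')}"
    staging = f"{USER}/{name}:{VERSION}"
    release = f"tlcpack/{name}:{VERSION}"
    # Build with --no-cache so stale cached layers are not reused (see the note
    # above about disabling caching), then retag under tlcpack and push. In the
    # real flow, the push happens only after the ci-docker-staging run is green.
    run("docker", "build", "--no-cache", "-t", staging, "-f", dockerfile, "docker")
    run("docker", "tag", staging, release)
    run("docker", "push", release)
```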
@jroesch can you please note down where you were with this?
cc @tqchen @u99127 @d-smirnov @leandron @tristan-arm