
Conversation

@mehrdadh (Member) commented Jun 21, 2022

This would prevent rebuilding multiple images on each update to CMSIS, ethosu, etc.
cc @Mousius @alanmacd @areusch @gromero @leandron

@github-actions github-actions bot requested review from Mousius, areusch and leandron June 21, 2022 22:36
@mehrdadh mehrdadh force-pushed the microtvm/refactor_cpu_tests branch from d9b08dc to 9b789b4 on June 23, 2022 16:42
@github-actions github-actions bot requested a review from gromero June 23, 2022 16:42
@areusch (Contributor) left a comment:

Let's see what @u99127 @Mousius @manupa-arm have to say here first. I think this is mainly to make it operationally easier for us to ensure the microTVM upstream tests use the same deps as we install in Reference VM. I think that as we get towards supporting tvmc in a release and doing further testing with microTVM against tvmc in its own venv, with e.g. Zephyr installed separately, we could probably focus on that in place of Reference VM. Anyway, the ci_ images remain the test authority, so this is again mainly an operational convenience for folks using Reference VM as we do to test on-device at OctoML.

```dockerfile
COPY install/ubuntu_install_caffe.sh /install/ubuntu_install_caffe.sh
RUN bash /install/ubuntu_install_caffe.sh

# Github Arm(R) Ethos(TM)-N NPU driver
```
Contributor:

I think this one should stay in ci_cpu; it's unrelated to microTVM.

Member Author (@mehrdadh):

If my understanding is correct, this is used for running this command:

```sh
run_pytest ctypes ${TVM_INTEGRATION_TESTSUITE_NAME}-contrib-test_ethosu tests/python/contrib/test_ethosu -n auto
```

I suggest we also move this to ci_qemu to have all microTVM tests in one place.

@manupak (Contributor) commented Jun 24, 2022

In order to reason about this, it would be great to understand why we maintain two separate images for microTVM and TVM, especially when QEMU is software that runs on the CPU. (Sorry, I might have missed the discussion on why we went on to create two separate images.)

In an alternate world, we could consolidate all testing dependencies into ci_cpu and keep TVM and microTVM aligned in terms of testing environment, unless we need conflicting dependencies in the testing for TVM vs microTVM, not just additive ones.

@areusch (Contributor) commented Jun 27, 2022

Hmm, yeah, so here's what I know about the history here:

  1. Originally, when we added QEMU to the TVM CI, USE_MICRO was an optional CMake setting, and having ci_qemu separate from ci_cpu provided some diversity in testing the various config settings (a build sketch follows this list). MLF accidentally made USE_MICRO required; however, by that time we had already completed a refactor that substantially reduced the code footprint of USE_MICRO, so I haven't exactly prioritized fixing it.
    • The folks who might still care about USE_MICRO=OFF are those who use libtvm_runtime.so without using Python. I'm absolutely happy to prioritize a fix (or review one) if the runtime impact turns out to be significant for anyone.
  2. QEMU was pretty large, and we do want, operationally, to avoid requiring any one user to download a huge docker image just to repro the CI. ci_gpu is 25GB right now, but I don't want that to become the norm. Keeping Zephyr installed separately was a way to avoid that.
  3. I don't think there is a reason why the ci_cpu deps are incompatible with anything in ci_qemu, but obviously this could be a reason to separate them if so.
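For context on that flag, a minimal sketch of how USE_MICRO is toggled, assuming the standard TVM config.cmake build flow (the commands here are illustrative, not taken from this PR):

```sh
# Sketch: enable/disable the microTVM code paths at build time.
# Flipping USE_MICRO between images is what gave the CI some
# diversity in tested configurations.
mkdir -p build && cp cmake/config.cmake build/
echo 'set(USE_MICRO ON)' >> build/config.cmake   # OFF for runtime-only users
cd build && cmake .. && make -j"$(nproc)"
```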

ci_cpu is 7GB now, so it's not small, but it has a ways to go before it's the long pole. I could go either way here, though if folks want to continue adding micro toolchains, those could become somewhat sizeable.

@mehrdadh (Member Author):

Adding to Andrew's point, I think we can remove USE_MICRO since it's almost always ON. Also, I prefer to keep ci_qemu separate and also to rename it ci_microtvm. I believe this image could grow larger in the future, and if we combine it with ci_cpu we end up with one large monolithic image.

@Mousius (Member) commented Jun 28, 2022

> Adding to Andrew's point, I think we can remove USE_MICRO since it's almost always ON. Also, I prefer to keep ci_qemu separate and also to rename it ci_microtvm. I believe this image could grow larger in the future, and if we combine it with ci_cpu we end up with one large monolithic image.

I support this approach as well; having two smaller containers, each with a specific responsibility, rather than one monolithic container should make it easier to manage in the long term as the needs evolve. The only downside is the typical container hopping for development, though I think we can figure out how to get that to work also.

Minor note: I think ci_micro rather than ci_microtvm, as it's already identified as a tvm image?

@manupak (Contributor) commented Jun 28, 2022

I agree that having modular (smaller) containers aids reproducibility from a developer PoV.

What I'm worried about with this approach is that it creates room for a set of divergent dependencies to be introduced between microTVM and TVM (the deps that are not specific to micro), especially when they are not separate projects. Right now, the only thing that keeps the common deps aligned is that we use the same helper scripts (ubuntu_install*.sh) to construct the images.

E.g., this could open up the possibility of advancing the version of tensorflow supported by the TVM project while microTVM lags behind, because they will now be tested in two environments.

Maybe we could create a base container with such common deps to make such divergence harder to introduce? (and extend that image into ci_cpu and ci_qemu)
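To illustrate the proposal, a minimal sketch of what such a layering could look like; the file name Dockerfile.ci_base, the ubuntu:18.04 base, and the particular scripts chosen here are illustrative assumptions, not part of this PR:

```dockerfile
# Dockerfile.ci_base -- hypothetical shared base holding the common deps.
FROM ubuntu:18.04

# The shared helper scripts (ubuntu_install*.sh) run once, here, so
# ci_cpu and ci_qemu cannot drift apart on these dependencies.
COPY install/ubuntu_install_core.sh /install/ubuntu_install_core.sh
RUN bash /install/ubuntu_install_core.sh
COPY install/ubuntu_install_python.sh /install/ubuntu_install_python.sh
RUN bash /install/ubuntu_install_python.sh
```

ci_cpu and ci_qemu would then start FROM this image and add only their own layers.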

@manupak (Contributor) commented Jun 28, 2022

cc: @leandron

@leandron (Contributor) commented Jun 28, 2022

Looking at the alternatives proposed so far, I feel that for our current type of usage it makes more sense to unify ci_qemu into ci_cpu, for two reasons:

  1. According to our Jenkinsfile, the jobs that require ci_cpu and ci_qemu will run on the same target machines, so the expectation is that our Jenkins nodes will commonly pull both ci_cpu and ci_qemu in terms of storage.
  2. Creating a base image with common dependencies would also generate a side problem: a time dependency when rebuilding our images (we can't build all of them at the same time), and the need to create a generic label, e.g. latest or last-successful, for this base image to be picked up in the dependent images. I'd like to avoid this if possible.

Given our current Docker update practices, which don't require updating all images at once, I feel we'd better consolidate the dependencies in one single image until we can devise better rules for Docker updating.

@areusch (Contributor) commented Jun 28, 2022

While I'm sympathetic to reducing the overall image disk size, in practice:

  • DockerHub stores these for us, and we don't have a concern there.
  • We have the problem of docker image bloat hogging disk on our CI nodes anyway due to image revisions, and have already solved that.

So I don't think we need to make that a priority in this decision. I think then that it comes down to what's operationally easier, and I think that's the POV @Mousius and @manupa-arm were advocating. I tend to agree here. Another way that we could get into trouble with consolidating ci_cpu and ci_qemu is that two different docker/install scripts could add conflicting packages. We can mitigate that by keeping microTVM separated, and I'd vote for continuing that, with renaming ci_qemu to ci_micro.

cc @driazati @konturn in case they have other thoughts from a CI perspective.

@Mousius (Member) commented Jun 29, 2022

> According to our Jenkinsfile, the jobs that require ci_cpu and ci_qemu will run on the same target machines, so the expectation is that our Jenkins nodes will commonly pull both ci_cpu and ci_qemu in terms of storage.

Provided the images have the same initial layers (base image or copy-paste, both should work), the storage cost should be the same, as it'll re-use the layers when pulling the containers. If a node already had ci_cpu, it should have a bunch of layers it can re-use for ci_micro.
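To make the layer-sharing point concrete, here is one way to observe it locally (the image tags are illustrative):

```sh
# Pull both images; layers already present from the first pull are
# reported as "Already exists" rather than downloaded again.
docker pull tlcpack/ci-cpu:v0.01
docker pull tlcpack/ci-qemu:v0.01
# Compare the layer stacks; identical leading layers are stored once.
docker history tlcpack/ci-cpu:v0.01
docker history tlcpack/ci-qemu:v0.01
```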

> Creating a base image with common dependencies would also generate a side problem: a time dependency when rebuilding our images (we can't build all of them at the same time), and the need to create a generic label, e.g. latest or last-successful, for this base image to be picked up in the dependent images. I'd like to avoid this if possible.

As far as I can tell, we'd build ci_base, tag it with the same tag as the rest of the images, and use that in the FROM of the subsequent images:

```dockerfile
ARG TAG="latest"
FROM tlcpackstaging/ci-base:${TAG}
```

Therefore, apart from a small network upload/download to propagate the image, it should take a similar time to build as we have now? Sorry if I missed something here, @leandron?
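A sketch of how the tag would propagate through such a build, with the base built first and the same tag fed into the ARG/FROM lines above (the command shape and file names are assumptions, not TVM's actual build scripts):

```sh
TAG=20220628-060000   # hypothetical tag shared by all images for one rebuild
# Build the base first, then reference it from the dependent images.
docker build -f Dockerfile.ci_base -t tlcpackstaging/ci-base:${TAG} .
docker build -f Dockerfile.ci_cpu  --build-arg TAG=${TAG} -t tlcpackstaging/ci-cpu:${TAG} .
docker build -f Dockerfile.ci_qemu --build-arg TAG=${TAG} -t tlcpackstaging/ci-qemu:${TAG} .
```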

@manupak (Contributor) commented Jun 29, 2022

> Another way that we could get into trouble with consolidating ci_cpu and ci_qemu is that two different docker/install scripts could add conflicting packages.

Actually, I was arguing that we should avoid the need for conflicting packages, as micro is another backend of TVM. Such a divergence is worrying for a compiler that aims to support both micro and non-micro compilation flows.

NOTE: I do acknowledge that micro and non-micro have different dependency requirements, but the common ones should not diverge IMO.

@driazati (Member):

I also support keeping the images focused on their actual use instead of everything under the sun. With #11768 it should be less common that we're using different tags for the images (everything should be built and updated at once if we follow that process, though it is still possible to do manual updates), and we can avoid dependency drift by being careful in reviewing Docker image changes.

Docker image size for us is a problem (more than storage) that makes CI slower and harder to reproduce, so images should contain only what they need to run their specific tests.

@mehrdadh (Member Author):

I'm in favor of having a base image with common dependencies to eliminate human error in the build process for each docker image, but even with that, people might add any script to any of the docker files, so as reviewers we need to keep an eye on changes to those files. Also, I think there shouldn't be any divergence between TVM and microTVM dependencies, and I would like to second @manupa-arm's point on this.
For now, I think we can try to keep all microTVM testing in a single docker image and rename it to ci_micro to make it more readable. We can think about the other proposals and maybe discuss them in one of the TVM community meetings.

@manupak (Contributor) commented Jul 1, 2022

Thanks all for the discussion!

> we can avoid dependency drift by being careful in reviewing Docker image changes.

If we are solely relying on the reviewers, we would need better-written guidance in an OSS project to aid both reviewers (to cite) and authors (to follow) on what could be added to ci_base (if we are taking this approach), ci_micro, and ci_cpu.
NOTE: I'm scoping this discussion to the specific issue being discussed here.

> but even with that, people might add any script to any of the docker files, so as reviewers we need to keep an eye on changes to those files.

I suppose the way ci_base would help here is that the above guidance could be simplified to say that all the layers added by the ci_micro Dockerfile have to be micro-specific.

@mehrdadh (Member Author) commented Jul 1, 2022

@manupa-arm I agree with you. Both better documentation and having a ci_base image with the common dependencies are great ideas.
It would be great if we could make an action-item list based on the discussion here.

@areusch (Contributor) commented Jul 8, 2022

I think most of the discussion here has aligned around purpose-built CI images. @leandron, are your concerns assuaged?

Regarding ci_base: I'm somewhat supportive of this, but currently unsure whether it would move the needle very much in practice, and there are some operational concerns. We currently build images mostly automatically, around the same time, via the nightly rebuild and also in CI. The common deps we're talking about here are mostly a mix of apt packages plus some things built from sources downloaded from (I think) version-pinned URLs. There are some other things in here, to be sure; rust packages, for instance. I do agree that ci_base would enforce that we don't diverge, but given the above, I don't know that I'd expect a huge effect from creating it. I agree with @Mousius that the layers would ideally be reused; I wouldn't underestimate the network transfer time, however.

I do want to note that we also mitigate drift in other ways, e.g. via common install scripts and (soon) via locked python dependencies (this more directly addresses the tensorflow divergence example). However, I acknowledge the drawback that this is merely convention and dependent on reviewers.

For verticals like microTVM, I think a separate image makes sense. For e.g. backends, it's less clear: what if you want to test cuda with paddlepaddle? There is no paddlepaddle dep for arm64, so it can't go in the base image. Does this mean base images need to be arch-specific? If so, where's the line (e.g. what if a dep is available for 20.04 but not 18.04)?

As far as action items, I think we should capture the need for either a common base image or documentation in a GH issue. We should also update our docs accordingly. Would love to hear more thoughts @Mousius @manupa-arm @leandron @driazati

@manupak (Contributor) left a comment:

I'm just suggesting a mandatory change orthogonal to the conversation.

```dockerfile
ENV CMSIS_PATH=/opt/arm/ethosu/cmsis/

# Arm(R) Ethos(TM)-U NPU driver
COPY install/ubuntu_install_ethosu_driver_stack.sh /install/ubuntu_install_ethosu_driver_stack.sh
```
Contributor:

```sh
run_pytest ctypes ${TVM_INTEGRATION_TESTSUITE_NAME}-contrib-test_ethosu tests/python/contrib/test_ethosu -n auto
```

We need this line moved alongside as well.

Member Author (@mehrdadh):

We should move that after we've updated the qemu image with the dependencies for this test.

Member Author (@mehrdadh), Jul 19, 2022:

I can update this PR by adding the Arm(R) Ethos(TM)-U NPU driver to the qemu image and merge this. Then we will update the qemu image and send another PR to remove test_ethosu and move it to qemu.

Member Author (@mehrdadh):

Sorry, we already have that. I meant the Github Arm(R) Ethos(TM)-N NPU driver; I believe that is required, correct me if I'm wrong.

Member Author (@mehrdadh):

I tested it in qemu and it looks fine, so I moved it.

@areusch (Contributor) commented Jul 25, 2022

@manupa-arm can you take a look again?

@manupak (Contributor) left a comment:

Thanks! Looking good.

It would be great to capture the steps discussed and agreed here in the issue (as per @areusch) and link it to the PR.

@ashutosh-arm, could you take a look as well, from the CMSIS tests' side, at whether they would all be run in ci_qemu now?

@asparkhi (Contributor):

> Thanks! Looking good.
>
> It would be great to capture the steps discussed and agreed here in the issue (as per @areusch) and link it to the PR.
>
> @ashutosh-arm, could you take a look as well, from the CMSIS tests' side, at whether they would all be run in ci_qemu now?

AFAIK the CMSIS-NN tests do not run on ci_qemu at present. In this code change, I don't see the CMSIS-NN tests moved to tests/scripts/task_python_microtvm.sh. I am referring to:

```sh
run_pytest ctypes ${TVM_INTEGRATION_TESTSUITE_NAME}-contrib tests/python/contrib --ignore=tests/python/contrib/test_ethosu
```

Am I missing something @mehrdadh / @manupa-arm?

@manupak (Contributor) commented Jul 25, 2022

I see.

Yes, none of the contrib tests are run in ci_qemu (that is how the CMSIS tests are run in ci_cpu today).
We need to add that as well then.
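For concreteness, a sketch of what the added line in tests/scripts/task_python_microtvm.sh might look like, mirroring the existing ci_cpu invocation quoted above (the suite name and the -n auto flag are assumptions, not the final change):

```sh
# tests/scripts/task_python_microtvm.sh -- run the CMSIS-NN contrib
# tests in the qemu image alongside the other microTVM suites.
run_pytest ctypes ${TVM_INTEGRATION_TESTSUITE_NAME}-contrib-test_cmsisnn tests/python/contrib/test_cmsisnn -n auto
```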

@mehrdadh (Member Author):

@ashutosh-arm thanks for catching this. I moved the test_cmsisnn tests to qemu; there's also test_ethosn, which we should move, but we need to update the qemu image for that.
I think I need to send another PR for that, along with sharding the qemu tests, since it has a timeout issue right now.

@manupak (Contributor) commented Jul 25, 2022

@mehrdadh test_ethosn (unlike test_ethosu) is not related to microTVM.

@mehrdadh (Member Author):

@manupa-arm thanks for the clarification! Then I think this PR is ready; I just need to figure out the timeout issue.
cc @areusch @driazati

@areusch (Contributor) commented Jul 25, 2022

Just musing about naming: previously we called this image ci_qemu because it contained qemu. It seems like with this effort + riscv32 + the hexagon effort, we are gravitating more towards purpose-driven CI images (we discussed this before). There is already a ci_arm; shall we consider a container rename such as:

  • ci_arm -> ci_arm_aarch64
  • ci_qemu -> ci_arm_cortexm

@mehrdadh (Member Author):

@areusch I'd like to change the name, but I think ci_arm_cortexm is too specific. We could have riscv support with Zephyr in the future, and I assume that testing is going to be part of this image; then the name would be confusing. I think ci_micro works better here. wdyt?

@u99127 commented Jul 25, 2022

> Just musing about naming: previously we called this image ci_qemu because it contained qemu. It seems like with this effort + riscv32 + the hexagon effort, we are gravitating more towards purpose-driven CI images (we discussed this before). There is already a ci_arm; shall we consider a container rename such as:
>
>   • ci_arm -> ci_arm_aarch64
>   • ci_qemu -> ci_arm_cortexm

I'd suggest that ci_arm is renamed to ci_aarch64.

Ramana

@areusch (Contributor) commented Jul 26, 2022

@u99127 I'm fine with that; should we then use ci_cortexm instead?

@mehrdadh mehrdadh force-pushed the microtvm/refactor_cpu_tests branch from 0cf89c7 to 74de280 on August 2, 2022 19:52
@areusch (Contributor) commented Aug 2, 2022

@manupa-arm please take another look when you have a minute

@mehrdadh (Member Author) commented Aug 2, 2022

Just rebased with main; since https://github.com/apache/tvm/pull/12258/files is merged, the timeout issue should be resolved.

@mehrdadh (Member Author) commented Aug 5, 2022

@manupa-arm friendly reminder about this PR. thanks!

@manupak (Contributor) left a comment:

Thanks! LGTM!

@mehrdadh mehrdadh merged commit 1f97f1f into apache:main Aug 5, 2022
@mehrdadh mehrdadh deleted the microtvm/refactor_cpu_tests branch August 5, 2022 22:39
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* Move scripts

* Address comments

* move ethosu tests

* move cmsisnn tests to qemu