
Conversation

@mehrdadh (Member) commented Jun 21, 2022

This would prevent rebuilding multiple images on each update to CMSIS, ethosu, etc.
cc @Mousius @alanmacd @areusch @gromero @leandron

@github-actions github-actions bot requested review from Mousius, areusch and leandron June 21, 2022 22:36
@mehrdadh mehrdadh force-pushed the microtvm/refactor_cpu_tests branch from d9b08dc to 9b789b4 on June 23, 2022 16:42
@github-actions github-actions bot requested a review from gromero June 23, 2022 16:42
@areusch (Contributor) left a comment:

Let's see what @u99127 @Mousius @manupa-arm have to say here first. I think this is mainly to make it operationally easier for us to ensure the microTVM upstream tests use the same deps as we install in Reference VM. I think that as we get towards supporting tvmc in a release and doing further testing with microTVM against tvmc in its own venv, with e.g. Zephyr installed separately, we could probably focus on that in place of Reference VM. Anyway, the ci_ images remain the test authority, so this is again mainly an operational convenience for folks using Reference VM as we do to test on-device at OctoML.

```dockerfile
COPY install/ubuntu_install_caffe.sh /install/ubuntu_install_caffe.sh
RUN bash /install/ubuntu_install_caffe.sh

# Github Arm(R) Ethos(TM)-N NPU driver
```
Contributor:

I think this one should stay in ci_cpu; it's unrelated to microTVM.

Member Author (@mehrdadh):

If my understanding is correct, this is used for running this command:

```sh
run_pytest ctypes ${TVM_INTEGRATION_TESTSUITE_NAME}-contrib-test_ethosu tests/python/contrib/test_ethosu -n auto
```

I suggest we also move this to ci_qemu to have all microTVM tests in one place.

@manupak (Contributor) commented Jun 24, 2022

In order to reason about this, it would be great to understand why we maintain two separate images for microTVM and TVM, especially when QEMU is software that runs on the CPU. (Sorry, I might have missed the discussion on why we went on to create two separate images.)

In an alternate world, we could consolidate all testing dependencies into ci_cpu and keep TVM and microTVM aligned in terms of testing environment, unless we need conflicting dependencies in the testing for TVM vs microTVM, not just additive ones.

@areusch (Contributor) commented Jun 27, 2022

Hmm, yeah, so here's what I know about the history here:

  1. Originally, when we added QEMU to the TVM CI, USE_MICRO was an optional CMake setting, and having ci_qemu separate from ci_cpu provided some diversity in testing the various config settings (a build sketch follows this list). MLF accidentally made USE_MICRO required; however, by that time we had already completed a refactor that substantially reduced the code footprint of USE_MICRO, so I haven't exactly prioritized fixing it.
    • The folks who might still care about USE_MICRO=OFF are those who use libtvm_runtime.so without using Python. I'm absolutely happy to prioritize a fix (or review one) if the runtime impact turns out to be significant for anyone.
  2. QEMU was pretty large, and we do want, operationally, to avoid requiring any one user to download a huge docker image just to repro the CI. ci_gpu is 25GB right now, but I don't want that to become the norm. Keeping Zephyr installed separately was a way to avoid that.
  3. I don't think there is a reason why the ci_cpu deps are incompatible with anything in ci_qemu, but obviously this could be a reason to separate them if so.
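For context on that flag, a minimal sketch of how USE_MICRO is toggled, assuming the standard TVM config.cmake build flow (the commands here are illustrative, not taken from this PR):

```sh
# Sketch: enable/disable the microTVM code paths at build time.
# Flipping USE_MICRO between images is what gave the CI some
# diversity in tested configurations.
mkdir -p build && cp cmake/config.cmake build/
echo 'set(USE_MICRO ON)' >> build/config.cmake   # OFF for runtime-only users
cd build && cmake .. && make -j"$(nproc)"
```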

ci_cpu is 7GB now, so it's not small, but it has a ways to go before it's the long pole. I could go either way here, though if folks want to continue adding micro toolchains, those could become somewhat sizeable.

@mehrdadh (Member Author):

Adding to Andrew's point, I think we can remove USE_MICRO since it's almost always ON. Also, I prefer to keep ci_qemu separate and also to rename it ci_microtvm. I believe this image could grow larger in the future, and if we combine it with ci_cpu we end up with one large monolithic image.

@Mousius (Member) commented Jun 28, 2022

> Adding to Andrew's point, I think we can remove USE_MICRO since it's almost always ON. Also, I prefer to keep ci_qemu separate and also to rename it ci_microtvm. I believe this image could grow larger in the future, and if we combine it with ci_cpu we end up with one large monolithic image.

I support this approach as well; having two smaller containers, each with a specific responsibility, rather than one monolithic container should make it easier to manage in the long term as the needs evolve. The only downside is the typical container hopping for development, though I think we can figure out how to get that to work also.

Minor note: I think ci_micro rather than ci_microtvm, as it's already identified as a tvm image?

@manupak (Contributor) commented Jun 28, 2022

I agree that having modular (smaller) containers aids reproducibility from a developer PoV.

What I'm worried about with this approach is that it creates room for a set of divergent dependencies to be introduced between microTVM and TVM (the deps that are not specific to micro), especially when they are not separate projects. Right now, the only thing that keeps the common deps aligned is that we use the same helper scripts (ubuntu_install*.sh) to construct the images.

E.g., this could open up the possibility of advancing the version of tensorflow supported by the TVM project while microTVM lags behind, because they will now be tested in two environments.

Maybe we could create a base container with such common deps to make such divergence harder to introduce? (and extend that image into ci_cpu and ci_qemu)
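To illustrate the proposal, a minimal sketch of what such a layering could look like; the file name Dockerfile.ci_base, the ubuntu:18.04 base, and the particular scripts chosen here are illustrative assumptions, not part of this PR:

```dockerfile
# Dockerfile.ci_base -- hypothetical shared base holding the common deps.
FROM ubuntu:18.04

# The shared helper scripts (ubuntu_install*.sh) run once, here, so
# ci_cpu and ci_qemu cannot drift apart on these dependencies.
COPY install/ubuntu_install_core.sh /install/ubuntu_install_core.sh
RUN bash /install/ubuntu_install_core.sh
COPY install/ubuntu_install_python.sh /install/ubuntu_install_python.sh
RUN bash /install/ubuntu_install_python.sh
```

ci_cpu and ci_qemu would then start FROM this image and add only their own layers.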

@manupak (Contributor) commented Jun 28, 2022

cc: @leandron

@leandron (Contributor) commented Jun 28, 2022

Looking at the alternatives proposed so far, I feel that for our current type of usage it makes more sense to unify ci_qemu into ci_cpu, for two reasons:

  1. According to our Jenkinsfile, the jobs that require ci_cpu and ci_qemu will run on the same target machines, so the expectation is that our Jenkins nodes will commonly pull both ci_cpu and ci_qemu in terms of storage.
  2. Creating a base image with common dependencies would also generate a side problem: a time dependency when rebuilding our images (we can't build all of them at the same time), and the need to create a generic label, e.g. latest or last-successful, for this base image to be picked up in the dependent images. I'd like to avoid this if possible.

Given our current Docker update practices, which don't require updating all images at once, I feel we'd better consolidate the dependencies in one single image until we can devise better rules for Docker updating.

@areusch (Contributor) commented Jun 28, 2022

While I'm sympathetic to reducing the overall image disk size, in practice:

  • DockerHub stores these for us, and we don't have a concern there.
  • We have the problem of docker image bloat hogging disk on our CI nodes anyway due to image revisions, and have already solved that.

So I don't think we need to make that a priority in this decision. I think then that it comes down to what's operationally easier, and I think that's the POV @Mousius and @manupa-arm were advocating. I tend to agree here. Another way that we could get into trouble with consolidating ci_cpu and ci_qemu is that two different docker/install scripts could add conflicting packages. We can mitigate that by keeping microTVM separated, and I'd vote for continuing that, with renaming ci_qemu to ci_micro.

cc @driazati @konturn in case they have other thoughts from a CI perspective.

@Mousius (Member) commented Jun 29, 2022

> According to our Jenkinsfile, the jobs that require ci_cpu and ci_qemu will run on the same target machines, so the expectation is that our Jenkins nodes will commonly pull both ci_cpu and ci_qemu in terms of storage.

Provided the images have the same initial layers (base image or copy-paste, both should work), the storage cost should be the same, as it'll re-use the layers when pulling the containers. If a node already had ci_cpu, it should have a bunch of layers it can re-use for ci_micro.
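To make the layer-sharing point concrete, here is one way to observe it locally (the image tags are illustrative):

```sh
# Pull both images; layers already present from the first pull are
# reported as "Already exists" rather than downloaded again.
docker pull tlcpack/ci-cpu:v0.01
docker pull tlcpack/ci-qemu:v0.01
# Compare the layer stacks; identical leading layers are stored once.
docker history tlcpack/ci-cpu:v0.01
docker history tlcpack/ci-qemu:v0.01
```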

> Creating a base image with common dependencies would also generate a side problem: a time dependency when rebuilding our images (we can't build all of them at the same time), and the need to create a generic label, e.g. latest or last-successful, for this base image to be picked up in the dependent images. I'd like to avoid this if possible.

As far as I can tell, we'd build ci_base, tag it with the same tag as the rest of the images, and use that in the FROM of the subsequent images:

```dockerfile
ARG TAG="latest"
FROM tlcpackstaging/ci-base:${TAG}
```

Therefore, apart from a small network upload/download to propagate the image, it should take a similar time to build as we have now? Sorry if I missed something here, @leandron?
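A sketch of how the tag would propagate through such a build, with the base built first and the same tag fed into the ARG/FROM lines above (the command shape and file names are assumptions, not TVM's actual build scripts):

```sh
TAG=20220628-060000   # hypothetical tag shared by all images for one rebuild
# Build the base first, then reference it from the dependent images.
docker build -f Dockerfile.ci_base -t tlcpackstaging/ci-base:${TAG} .
docker build -f Dockerfile.ci_cpu  --build-arg TAG=${TAG} -t tlcpackstaging/ci-cpu:${TAG} .
docker build -f Dockerfile.ci_qemu --build-arg TAG=${TAG} -t tlcpackstaging/ci-qemu:${TAG} .
```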

@manupak (Contributor) commented Jun 29, 2022

> Another way that we could get into trouble with consolidating ci_cpu and ci_qemu is that two different docker/install scripts could add conflicting packages.

Actually, I was arguing that we should avoid the need for conflicting packages, as micro is another backend of TVM. Such a divergence is worrying for a compiler that aims to support both micro and non-micro compilation flows.

NOTE: I do acknowledge that micro and non-micro have different dependency requirements, but the common ones should not diverge IMO.

@driazati (Member):

I also support keeping the images focused on their actual use instead of everything under the sun. With #11768 it should be less common that we're using different tags for the images (everything should be built and updated at once if we follow that process, though it is still possible to do manual updates), and we can avoid dependency drift by being careful in reviewing Docker image changes.

Docker image size for us is a problem (more than storage) that makes CI slower and harder to reproduce, so images should contain only what they need to run their specific tests.

@mehrdadh (Member Author):

I'm in favor of having a base image with common dependencies to eliminate human error in the build process for each docker image, but even with that, people might add any script to any of the docker files, so as reviewers we need to keep an eye on changes to those files. Also, I think there shouldn't be any divergence between TVM and microTVM dependencies, and I would like to second @manupa-arm's point on this.
For now, I think we can try to keep all microTVM testing in a single docker image and rename it to ci_micro to make it more readable. We can think about the other proposals and maybe discuss them in one of the TVM community meetings.

@manupak (Contributor) commented Jul 1, 2022

Thanks all for the discussion!

> we can avoid dependency drift by being careful in reviewing Docker image changes.

If we are solely relying on the reviewers, we would need better-written guidance in an OSS project to aid both reviewers (to cite) and authors (to follow) on what could be added to ci_base (if we are taking this approach), ci_micro, and ci_cpu.
NOTE: I'm scoping this discussion to the specific issue being discussed here.

> but even with that, people might add any script to any of the docker files, so as reviewers we need to keep an eye on changes to those files.

I suppose the way ci_base would help here is that the above guidance could be simplified to say that all the layers added by the ci_micro Dockerfile have to be micro-specific.

@mehrdadh (Member Author) commented Jul 1, 2022

@manupa-arm I agree with you. Both better documentation and having a ci_base image with the common dependencies are great ideas.
It would be great if we could make an action-item list based on the discussion here.

@areusch (Contributor) commented Jul 8, 2022

I think most of the discussion here has aligned around purpose-built CI images. @leandron, are your concerns assuaged?

Regarding ci_base: I'm somewhat supportive of this, but currently unsure whether it would move the needle very much in practice, and there are some operational concerns. We currently build images mostly automatically, around the same time, via the nightly rebuild and also in CI. The common deps we're talking about here are mostly a mix of apt packages plus some things built from sources downloaded from (I think) version-pinned URLs. There are some other things in here, to be sure; rust packages, for instance. I do agree that ci_base would enforce that we don't diverge, but given the above, I don't know that I'd expect a huge effect from creating it. I agree with @Mousius that the layers would ideally be reused; I wouldn't underestimate the network transfer time, however.

I do want to note that we also mitigate drift in other ways, e.g. via common install scripts and (soon) via locked python dependencies (this more directly addresses the tensorflow divergence example). However, I acknowledge the drawback that this is merely convention and dependent on reviewers.

For verticals like microTVM, I think a separate image makes sense. For e.g. backends, it's less clear: what if you want to test cuda with paddlepaddle? There is no paddlepaddle dep for arm64, so it can't go in the base image. Does this mean base images need to be arch-specific? If so, where's the line (e.g. what if a dep is available for 20.04 but not 18.04)?

As far as action items, I think we should capture the need for either a common base image or documentation in a GH issue. We should also update our docs accordingly. Would love to hear more thoughts @Mousius @manupa-arm @leandron @driazati

@manupak (Contributor) left a comment:

I'm just suggesting a mandatory change orthogonal to the conversation.

```dockerfile
ENV CMSIS_PATH=/opt/arm/ethosu/cmsis/

# Arm(R) Ethos(TM)-U NPU driver
COPY install/ubuntu_install_ethosu_driver_stack.sh /install/ubuntu_install_ethosu_driver_stack.sh
```
Contributor:

```sh
run_pytest ctypes ${TVM_INTEGRATION_TESTSUITE_NAME}-contrib-test_ethosu tests/python/contrib/test_ethosu -n auto
```

We need this line moved alongside as well.

Member Author (@mehrdadh):

We should move that after we've updated the qemu image with the dependencies for this test.

Member Author (@mehrdadh), Jul 19, 2022:

I can update this PR by adding the Arm(R) Ethos(TM)-U NPU driver to the qemu image and merge this. Then we will update the qemu image and send another PR to remove test_ethosu and move it to qemu.

Member Author (@mehrdadh):

Sorry, we already have that. I meant the Github Arm(R) Ethos(TM)-N NPU driver; I believe that is required, correct me if I'm wrong.

Member Author (@mehrdadh):

I tested it in qemu and it looks fine, so I moved it.

@areusch (Contributor) commented Jul 25, 2022

@manupa-arm can you take a look again?

@manupak (Contributor) left a comment:

Thanks! Looking good.

It would be great to capture the steps discussed and agreed here in the issue (as per @areusch) and link it to the PR.

@ashutosh-arm, could you take a look as well, from the CMSIS tests' side, at whether they would all be run in ci_qemu now?

@asparkhi (Contributor):

> Thanks! Looking good.
>
> It would be great to capture the steps discussed and agreed here in the issue (as per @areusch) and link it to the PR.
>
> @ashutosh-arm, could you take a look as well, from the CMSIS tests' side, at whether they would all be run in ci_qemu now?

AFAIK the CMSIS-NN tests do not run on ci_qemu at present. In this code change, I don't see the CMSIS-NN tests moved to tests/scripts/task_python_microtvm.sh. I am referring to:

```sh
run_pytest ctypes ${TVM_INTEGRATION_TESTSUITE_NAME}-contrib tests/python/contrib --ignore=tests/python/contrib/test_ethosu
```

Am I missing something @mehrdadh / @manupa-arm?

@manupak (Contributor) commented Jul 25, 2022

I see.

Yes, none of the contrib tests are run in ci_qemu (that is how the CMSIS tests are run in ci_cpu today).
We need to add that as well then.
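For concreteness, a sketch of what the added line in tests/scripts/task_python_microtvm.sh might look like, mirroring the existing ci_cpu invocation quoted above (the suite name and the -n auto flag are assumptions, not the final change):

```sh
# tests/scripts/task_python_microtvm.sh -- run the CMSIS-NN contrib
# tests in the qemu image alongside the other microTVM suites.
run_pytest ctypes ${TVM_INTEGRATION_TESTSUITE_NAME}-contrib-test_cmsisnn tests/python/contrib/test_cmsisnn -n auto
```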

@mehrdadh (Member Author):

@ashutosh-arm thanks for catching this. I moved the test_cmsisnn tests to qemu; there's also test_ethosn, which we should move, but we need to update the qemu image for that.
I think I need to send another PR for that, along with sharding the qemu tests, since it has a timeout issue right now.

@manupak (Contributor) commented Jul 25, 2022

@mehrdadh test_ethosn (unlike test_ethosu) is not related to microTVM.

@mehrdadh (Member Author):

@manupa-arm thanks for the clarification! Then I think this PR is ready; I just need to figure out the timeout issue.
cc @areusch @driazati

@areusch (Contributor) commented Jul 25, 2022

Just musing about naming: previously we called this image ci_qemu because it contained qemu. It seems like with this effort + riscv32 + the hexagon effort, we are gravitating more towards purpose-driven CI images (we discussed this before). There is already a ci_arm; shall we consider a container rename such as:

  • ci_arm -> ci_arm_aarch64
  • ci_qemu -> ci_arm_cortexm

@mehrdadh (Member Author):

@areusch I'd like to change the name, but I think ci_arm_cortexm is too specific. We could have riscv support with Zephyr in the future, and I assume that testing is going to be part of this image; then the name would be confusing. I think ci_micro works better here. wdyt?

@u99127 commented Jul 25, 2022

> Just musing about naming: previously we called this image ci_qemu because it contained qemu. It seems like with this effort + riscv32 + the hexagon effort, we are gravitating more towards purpose-driven CI images (we discussed this before). There is already a ci_arm; shall we consider a container rename such as:
>
>   • ci_arm -> ci_arm_aarch64
>   • ci_qemu -> ci_arm_cortexm

I'd suggest that ci_arm is renamed to ci_aarch64.

Ramana

@areusch (Contributor) commented Jul 26, 2022

@u99127 I'm fine with that; should we then use ci_cortexm instead?

@mehrdadh mehrdadh force-pushed the microtvm/refactor_cpu_tests branch from 0cf89c7 to 74de280 on August 2, 2022 19:52
@areusch (Contributor) commented Aug 2, 2022

@manupa-arm please take another look when you have a minute

@mehrdadh (Member Author) commented Aug 2, 2022

Just rebased with main; since https://github.com/apache/tvm/pull/12258/files is merged, the timeout issue should be resolved.

@mehrdadh (Member Author) commented Aug 5, 2022

@manupa-arm friendly reminder about this PR. thanks!

@manupak (Contributor) left a comment:

Thanks! LGTM!

@mehrdadh mehrdadh merged commit 1f97f1f into apache:main Aug 5, 2022
@mehrdadh mehrdadh deleted the microtvm/refactor_cpu_tests branch August 5, 2022 22:39
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
* Move scripts

* Address comments

* move ethosu tests

* move cmsisnn tests to qemu