[Test] Improve Katib CI/CD GitHub Actions #2024

andreyvelich · 2022-11-18T18:42:52Z

/kind feature
/area testing

Recently, we switched to GitHub Actions for our CI/CD pipelines, thanks a lot again @tenzen-y for driving this.

Since we currently have limitations (20 concurrent jobs, and we haven't set up AWS EC2 instances for our workers yet), we need to make some improvements to reduce execution time.

I think we can try the following:

- Should we run the postgres test only for the Random search experiment? We run 3 Trials for the Random experiment, so that is enough to verify that the DB works properly.
- Can we build only the required suggestion images for each e2e test? As far as I can see, the build step takes around 15 min, which is more than half of the e2e run.
- @tenzen-y Is there a specific reason why we clean the cache for our build images after each e2e run?
- Do we need to build images for linux/amd64 if that is verified as part of e2e?

In the long term, or in a separate tracking issue, we can also do this:

- Run only the required Experiments when the relevant source code has been changed (as we have done with the Katib UI).
- Run all experiment tests periodically, e.g. once a day. For Pull Request tests we can use only a few e2e experiments.
- Use the Katib SDK instead of this script to run e2e, similar to the Training Operator, so we can verify that our SDK is working.

@kubeflow/wg-training-leads @tenzen-y @anencore94 Are there any other improvements that you have in mind?

GitHub Actions improvements checklist

I can identify the following improvements:

- Run postgres e2e only for random search.
- Use the Katib SDK to create the E2E script.
- Cancel in-progress workflow runs when a new commit is pushed, using the `cancel-in-progress` setting.
- Use the Docker cache when building our images.
- Remove the linux/amd64 build from the pre-commit check, since we verify this in the E2E test.
- Identify the experiments to run in E2E from the relevant code changes.
- Run all E2E tests only on pre-releases or periodically.
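
To illustrate the item about picking E2E experiments from code changes, here is a minimal sketch that uses the `paths` filter of the `pull_request` trigger. The workflow name, the listed paths, and the script are assumptions for illustration only, not the actual Katib layout or tooling.

```yaml
# Sketch only: run this per-suggestion E2E workflow when related files change.
name: E2E Test with hyperopt (sketch)

on:
  pull_request:
    paths:
      # Assumed paths; adjust to wherever the suggestion code really lives.
      - "pkg/suggestion/v1beta1/hyperopt/**"
      - "cmd/suggestion/hyperopt/**"
      - ".github/workflows/**"

jobs:
  e2e:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      # Hypothetical script name; stands in for whatever runs the experiment.
      - run: ./hypothetical/run-e2e.sh hyperopt
```

One thing to keep in mind with this approach: if such a workflow is a required status check, a pull request that does not touch the listed paths leaves the check pending, so required checks and path filters need to be planned together.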

Please let me know if we should add more items @johnugeorge @anencore94 @terrytangyuan @tenzen-y @gaocegege

Love this feature? Give it a 👍 We prioritize the features with the most 👍

terrytangyuan · 2022-11-18T21:09:17Z

One thing that might help is to avoid concurrent builds on the same PR (in case people pushed multiple commits which trigger separate builds). https://github.com/argoproj/argo-workflows/blob/master/.github/workflows/ci-build.yaml#L12-L14
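
As a rough sketch of that idea, a workflow-level `concurrency` block along these lines cancels an older run when a newer commit arrives on the same ref. The group key here is an assumption and is not copied from the linked file.

```yaml
# Sketch only: one active run per workflow and ref; newer pushes cancel older runs.
concurrency:
  # Assumed group key: workflow name plus the pull request's ref.
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```

Because the group includes the ref, runs for different pull requests do not cancel each other.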

We should utilize the cache on GitHub Actions.

If the image builds are time-consuming, we should consider pre-building the image cache that Docker can use.
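
One possible way to do this is the GitHub Actions cache backend of Docker Buildx, shown below as a sketch. The job name, Dockerfile path, and image tag are placeholders, and the action versions are assumptions.

```yaml
# Sketch only: reuse Docker layers between runs through the GitHub Actions cache.
jobs:
  build-image:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-buildx-action@v2
      - uses: docker/build-push-action@v4
        with:
          context: .
          # Placeholder path and tag, not the real Katib values.
          file: hypothetical/path/to/Dockerfile
          tags: example/suggestion:e2e
          push: false
          # Make the built image available to later steps in the same job.
          load: true
          # Read layers from, and write layers to, the Actions cache.
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

`mode=max` also caches intermediate layers, which tends to help multi-stage builds at the cost of more cache storage.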

anencore94 · 2022-11-21T01:44:17Z

> Should we run postgres test only for Random search experiment? We run 3 Trials for Random experiment, so we can verify that DB works properly.

I agree. Running the postgres test for only one general experiment makes sense.

> In the longterm/separate tracking issue we can also do this:

Also, I think we could separate the e2e tests into two stages: on pull_request and on pre-release. It is reasonable to narrow the e2e tests that run on pull requests. However, I think it is safer to test every possible case at least once before a release, since some combinations can produce unexpected results.
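
A sketch of how the two stages might look in one workflow is below. The job names, the script, and the cron time are assumptions; only the split between pull requests and pre-release or periodic runs comes from the discussion above.

```yaml
# Sketch only: a small E2E subset on pull requests, the full set otherwise.
name: E2E Test (sketch)

on:
  pull_request:
  release:
    types: [prereleased]
  schedule:
    - cron: "0 0 * * *"   # once a day; the exact time is an assumption

jobs:
  e2e-subset:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      # Hypothetical script and argument; runs a single representative experiment.
      - run: ./hypothetical/run-e2e.sh random

  e2e-full:
    if: github.event_name != 'pull_request'
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      # Hypothetical script and argument; runs every experiment.
      - run: ./hypothetical/run-e2e.sh all
```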

tenzen-y · 2022-11-28T16:25:18Z

@andreyvelich Thanks for creating this issue.

> Can we build only the required suggestions images for each e2e test? As I can see, build step takes around 15 min which is more than half of e2e.

Makes sense. When I migrated the e2e tests to gh-actions, I made every e2e test build all suggestion images to avoid complicated shell scripts. But as you say, we can avoid the complex scripts by rewriting the e2e tests with the Katib Python client.

> @tenzen-y Are there any specific requirements why we clean cache for our build image after each e2e run?

I added that step to avoid the error `write /var/lib/docker/tmp/GetImageBlob424493410: no space left on device`.
But we might be able to remove the cache-cleaning step if each e2e test builds only the suggestion images it requires, as mentioned above.

> Do we need to build images for linux/amd64 if that is verified as part of e2e?

I added the platform linux/amd64 to verify if we can build multi-platform images. In e2e, we only build single-platform images.
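
For context, a multi-platform build check with Docker Buildx could look roughly like the sketch below. The second platform (`linux/arm64`), the Dockerfile path, and the action versions are assumptions; only `linux/amd64` is mentioned in this thread.

```yaml
# Sketch only: verify that an image builds for more than one platform.
jobs:
  build-multi-platform:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v3
      # QEMU lets the runner build for non-native architectures.
      - uses: docker/setup-qemu-action@v2
      - uses: docker/setup-buildx-action@v2
      - uses: docker/build-push-action@v4
        with:
          context: .
          # Placeholder path, not the real Katib Dockerfile.
          file: hypothetical/path/to/Dockerfile
          # linux/arm64 is an assumed second platform for illustration.
          platforms: linux/amd64,linux/arm64
          # Build only; nothing is pushed to a registry.
          push: false
```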

github-actions · 2023-08-24T00:17:06Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich · 2023-08-24T13:22:12Z

Since @tenzen-y made our E2E actions very stable, we can close this issue. Thanks again for this effort!
Let's track additional improvements separately.

google-oss-prow bot added kind/feature area/testing labels Nov 18, 2022

andreyvelich mentioned this issue Nov 23, 2022

[Test] Reduce Katib GitHub Action Runs #2036

Merged

tenzen-y mentioned this issue Dec 13, 2022

Upgrade Python version to 3.10 #2057

Merged

1 task

andreyvelich mentioned this issue Jan 4, 2023

[SDK] Use Katib SDK for E2E Tests #2075

Merged

github-actions bot added the lifecycle/stale label Aug 24, 2023

andreyvelich closed this as completed Aug 24, 2023
