Skip to content

Conversation

@potiuk
Copy link
Member

@potiuk potiuk commented Mar 26, 2023

Previously all tests for providers were executed as single test type, even if only subset of provider tests were executed. However when tests for all providers are executed (for example when main tests are run or when you change some important files and add new dependencies) "Providers" test run all the tests for all providers sequentially which makes them longest running test type by far.

This has the side effect that often for those cases Providers test type is running alone where all the other test types had already completed. This means that parallelism of the CI instances is not used to a full extent.

This change leverages the current "parallel" framework of breeze that runs tests and rather than running single "Providers[...]" test type, it breakes the Provider test type into separate test type for each provider (if all providers are tested, this means that there are 70+ provider tests/docker composes running - not all in parallel, but up to CPU number ones are run in parallel via pooling mechanism).

This required a number of small changes, that resulted not only faster runs in CI and better use of multi-CPU instances, but also it allows much easier to utilise parallelism of local machines of develiopers to run complete set of Airflow tests using as many CPUS as they have allocated to their docker engine.

The changes include:

  • splitting the Provider type into as many providers are being tested ("Providers[nnn]")
  • escaping test type to avoid rich treating the provider names as not-defined markup and allows to display status in coloured form
  • pruning of docker-compose networks before, after and during the tests in order to avoid network exhaustion - each test type has its own docker-compose created network
  • avoiding unnecessary environment checks before each test in case of parallel test execution (produced unnecessary output)
  • amazon and google providers are always put at the beginning of the list if they are in the list - they are by far the longest so by starting them early we give them a chance to not be the last running test type, using parallelism / multiple CPUS more efficiently

^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk
Copy link
Member Author

potiuk commented Mar 26, 2023

Example output:

Screenshot 2023-03-26 at 13 45 47

Also - it allows to run all the tests locally (including all providers) much faster (if you have multiple CPUS and a lot of memory) if you want to see if you have not broken any of the providers.

Plus as a side effect, if any of the provider tests fail, you know exactly which one and seeing the output of that test is much easier (unfolding the result of single provider is much faster and it's easier to scroll through the output of only that provider.

@potiuk potiuk force-pushed the split-providers-tests-in-ci branch 2 times, most recently from c548f3b to 6ad546d Compare March 26, 2023 12:17
@potiuk potiuk requested review from kaxil and mik-laj March 26, 2023 12:19
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will make Providers tests safe to run for Public runners - each of them will require much less of memory and the way it is done where each "provider" test will run in their separate docker container that will be torn down immediately when the test type completes, we will continuously reclaim resources from each test type while we are running the tests on Public runners.

@potiuk
Copy link
Member Author

potiuk commented Mar 26, 2023

I had to clean the networks a bit less aggressively (selectively each network rather than by prunning all unused networks - due to race conditions that were caused by massive parallelism we have :D

@potiuk potiuk requested a review from eladkal March 26, 2023 12:43
@potiuk potiuk force-pushed the split-providers-tests-in-ci branch from 6ad546d to 972b9e5 Compare March 26, 2023 12:48
@potiuk
Copy link
Member Author

potiuk commented Mar 26, 2023

BTW. It seems we might save a LOT on more parallelisation. The full tests take ~ 13 minutes instead of ~ 17 which is whooping ~25%.

I just need to work a bit on stability, as it seems we might put a bit too much pressure on the machines (and I need to check it with public runners)

@potiuk potiuk added the use public runners Makes sure that Public runners are used even if commiters creates the PR (useful for testing) label Mar 26, 2023
@potiuk potiuk force-pushed the split-providers-tests-in-ci branch from 972b9e5 to 8dec067 Compare March 26, 2023 13:22
@potiuk
Copy link
Member Author

potiuk commented Mar 26, 2023

OK. For public runners, it's as expected - slower than running all Provider tests together (34m with the split vs 26m without) - the overhead of startting docker-composes (including the DB) there is significantly bigger than parallelism gain (there are just 2 CPUS). So I Will disable this split for public runners, and see if I can stabilize it for the self-hosted ones.

@potiuk potiuk removed the use public runners Makes sure that Public runners are used even if commiters creates the PR (useful for testing) label Mar 26, 2023
Previously all tests for providers were executed as single test type,
even if only subset of provider tests were executed. However when
tests for all providers are executed (for example when main tests
are run or when you change some important files and add new
dependencies) "Providers" test run all the tests for all providers
sequentially which makes them longest running test type by far.

This has the side effect that often for those cases Providers test
type is running alone where all the other test types had already
completed. This means that parallelism of the CI instances is not
used to a full extent.

This change leverages the current "parallel" framework of breeze that
runs tests and rather than running single "Providers[...]" test type,
it breakes the Provider test type into separate test type for each
provider (if all providers are tested, this means that there are
70+ provider tests/docker composes running - not all in parallel, but
up to CPU number ones are run in parallel via pooling mechanism).

This required a number of small changes, that resulted not only
faster runs in CI and better use of multi-CPU instances, but also
it allows much easier to utilise parallelism of local machines
of develiopers to run complete set of Airflow tests using as many
CPUS as they have allocated to their docker engine.

The changes include:

* splitting the Provider type into as many providers are being
  tested ("Providers[nnn]")
* escaping test type to avoid rich treating the provider names
  as not-defined markup and allows to display status in coloured
  form
* pruning of docker-compose networks before, after and during
  the tests in order to avoid network exhaustion - each test type
  has its own docker-compose created network
* avoiding unnecessary environment checks before each test in
  case of parallel test execution (produced unnecessary output)
* amazon provider, general providers and WWW  are always put at the
  beginning of the list if they are in the list - they are by far the
  longest so by starting them early we give them a chance to not be
  the last running test type, using parallelism / multiple CPUS
  more efficiently.
@potiuk potiuk force-pushed the split-providers-tests-in-ci branch from 8dec067 to 18b6e3c Compare March 26, 2023 16:01
@potiuk
Copy link
Member Author

potiuk commented Mar 26, 2023

Actually it turned out that just sorting the test types (and having Providers and WWW first) is enough to get similar gains (with much more stable results). Closing in favour of that .

@potiuk potiuk closed this Mar 26, 2023
@potiuk potiuk deleted the split-providers-tests-in-ci branch June 4, 2023 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant