Skip to content

Restructure CI tests as a single workflow#869

Merged
adcroft merged 5 commits into
NOAA-GFDL:dev/gfdlfrom
marshallward:gh_ci_artifacts
Apr 14, 2025
Merged

Restructure CI tests as a single workflow#869
adcroft merged 5 commits into
NOAA-GFDL:dev/gfdlfrom
marshallward:gh_ci_artifacts

Conversation

@marshallward
Copy link
Copy Markdown
Member

@marshallward marshallward commented Mar 26, 2025

This commit consolidates many of the CI tests into a single workflow. In principle, it replaces multiple builds of the model with a single workflow. In many respects, this resembles our regression suite on Gaea.

(Example: https://github.com/NOAA-GFDL/MOM6/actions/runs/14092259355)

In practice, it is only marginally faster by some metrics, and slower by others.

  • Although we don't rebuild FMS and the models, we do have to set up the ubuntu environment on every stage which has a cost (at least 30s).

    If we can eliminate this overhead somehow (e.g. container?) then this method may be much faster.

  • MacOS is now slower, but we are running more tests. Previously, we were only doing a small handful of them.

  • Setting up the artifacts is delicate (paths, for example), and there is some overhead in transfer, but it appears to be small (<10s).

  • Cleanup relies on a third-party GitHub Actions "app", which may not be supported over long term.

    Also, a failed test will not delete its artifacts. If these accumulate, then we may hit a storage limit.

    Worst case: Artifacts are wiped after 1 day. This is the minimum setting.

  • Minor point: As the number of tests increase, the graph scales to fit in the same box, and the actual stages become almost unreadable without zooming in. Not really the effect I was hoping for.

Notes on MacOS (aarch64) builds:

  • Disable fp-contract

    Seems like aarch64 (ARM) needs -ffp-contract=off to disable FMAs. -mno-fma is not supported!

  • Optimization is reduced to -O1 for DEBUG-REPRO equivalence.

    This is only an issue in the tidal forcing, and most likely due to trigonometric function evaluation. But I can't yet see any other solution.

There are also some very minor changes to the .testing Makefile:

  • Log output is now consistently set to 100 lines for all tests.

  • Builtin rule flags were replaced with explicit long forms

  • Empty SUFFIX: rule was added, to belabor the point.

@marshallward
Copy link
Copy Markdown
Member Author

Never seen this error:

Run geekyeggo/delete-artifact@v4
Error: HttpError: Resource not accessible by integration

Might need some time or help to work out this one...

@marshallward marshallward force-pushed the gh_ci_artifacts branch 2 times, most recently from ac6fa45 to 470c871 Compare March 26, 2025 18:58
@marshallward
Copy link
Copy Markdown
Member Author

marshallward commented Mar 26, 2025

The final cleanup stage needed permissions. This is what finally worked:

    permissions:
      id-token: write
      contents: write
      pull-requests: write
      repository-projects: write

The only one that I am certain is required is id-token: write. I will iterate a bit on the others and see what works. (The others may already default to write.)

This commit consolidates many of the CI tests into a single workflow.
In principle, it replaces multiple builds of the model with a single
workflow.  In many respects, this resembles our regression suite on
Gaea.

In practice, it is only marginally faster by some metrics, and slower by
others.

* Although we don't rebuild FMS and the models, we do have to set up the
  ubuntu environment on every stage which has a cost (at least 30s).

  If we can eliminate this overhead somehow (e.g. container?) then this
  method may be much faster.

* MacOS is now slower, but we are running more tests.  Previously, we
  were only doing a small handful of them.

* Setting up the artifacts is delicate (paths, for example), and there
  is some overhead in transfer, but it appears to be small (<10s).

* Cleanup relies on a third-party GitHub Actions "app", which may not be
  supported over long term.

  Also, a failed app will also fail to delete its artifacts.  If these
  accumulate, then we may hit a storage limit.

  Worst case: Artifacts are wiped after 1 day.  This is the minimum
  setting.

* Minor point: As the number of tests increase, the graph scales to fit
  in the same box, and the actual stages become almost unreadable
  without zooming in.  Not really the effect I was hoping for.

Notes on MacOS (aarch64) builds:

* Disable fp-contract

  Seems like aarch64 (ARM) needs -ffp-contract=off to disable FMAs.
  -mno-fma is not supported!

* Optimization is reduced to -O1 for DEBUG-REPRO equivalence.

  This is only an issue in the tidal forcing, and most likely due to
  trigonometric function evaluation.  But I can't yet see any other
  solution.

There are also some very minor changes to the .testing Makefile:

* Log output is now consistently set to 100 lines for all tests.

* Builtin rule flags were replaced with explicit long forms

* Empty SUFFIX: rule was added, to belabor the point.
@marshallward
Copy link
Copy Markdown
Member Author

This was the only necessary change.

    permissions:
      id-token: write

I think this is now ready for review.

@marshallward
Copy link
Copy Markdown
Member Author

Oh, I did remove the old test files, which means that the "mandatory" tests can no longer be run.

Some options (assuming this were to be accepted):

  • Restore them and then remove them in a second PR.
  • Change the settings, then accept the PR.
  • Accept the PR with admin override, then change the settings.

The first is probably the most correct but I don't think there is a big difference.

Any removal of existing tests, or change of required tests, will cause existing PRs to break until a rebase to the latest codebase. The timing is probably does not change things much.

@adcroft adcroft merged commit 1dae3f0 into NOAA-GFDL:dev/gfdl Apr 14, 2025
51 checks passed
@marshallward marshallward deleted the gh_ci_artifacts branch April 15, 2025 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants