Skip to content

[DRAFT] [Fix] Restore NullComm for spoofed cube-sphere run#386

Closed
FlorianDeconinck wants to merge 1 commit into
NOAA-GFDL:developfrom
FlorianDeconinck:fix/restore_NullComm
Closed

[DRAFT] [Fix] Restore NullComm for spoofed cube-sphere run#386
FlorianDeconinck wants to merge 1 commit into
NOAA-GFDL:developfrom
FlorianDeconinck:fix/restore_NullComm

Conversation

@FlorianDeconinck
Copy link
Copy Markdown
Collaborator

Description

We definitely took NullComm out in #350 and replace it's usage by LocalComm.

One use case that cannot be covered with this replacement is running a single process spoofing 6 ranks cube-sphere pyfv3 model.

Their is two problem:

  1. Halo exchange fail. LocalComm can do halo exchange but you have to code for it (see unit test) and pyFV3 isn't.
  2. AllReduce raises with LocalComm which is also needed for pyFV3 (grid generation & tracer advection)

In this PR we restore NullComm to re-authorize the behavior.

Other fixes that would not bring NullComm back are:

  • Halo exchange disabled when running spoof 6-ranks (NullComm makes it nonsense anyway) + no-op allreduce on LocalComm
  • Exploring an easier setup of CachingComm for the 6-ranks spoof (which should work but requires running true 6-ranks first)
  • Pushing doubly periodic pyfv3 proper + allreduce on LocalComm

How has this been tested?

N/A

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (e.g. add new modules to docs/docstrings/)
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • New check tests, if applicable, are included

@FlorianDeconinck
Copy link
Copy Markdown
Collaborator Author

Poke @oelbert

@FlorianDeconinck
Copy link
Copy Markdown
Collaborator Author

I didn't see/rember that NullComm has been made local to Pace -> https://github.com/NOAA-GFDL/pace/blob/fd67bfae4f2b694168c65cfccc36624f625c64d9/pace/comm.py#L30

That's another solution.

@romanc
Copy link
Copy Markdown
Collaborator

romanc commented Feb 24, 2026

I didn't see/rember that NullComm has been made local to Pace -> https://github.com/NOAA-GFDL/pace/blob/fd67bfae4f2b694168c65cfccc36624f625c64d9/pace/comm.py#L30

I moved a copy of NullComm to pace in NOAA-GFDL/pace#168 exactly because usage of NullComm in pace was non trivial (see discussion there).

Spoofing a 6-rank setup on one rank sounds scary. I don't think the DSL should support/promote this behavior. I can see limited use in tests, though NDSL tests work fine without it. I'd rather not have the general DSL user know about this.

If pace wants to support such workflows (e.g. for development), I think we should use/keep NullComm in pace and not re-introduce it here in NDSL.

@FlorianDeconinck
Copy link
Copy Markdown
Collaborator Author

Closing this.

Next steps for the underlying issues is:

  • in dynamics forbid one-rank cube-sphere and redirect to a working single-time doubly periodic grid
  • retire NullComm then because we can run a single rank Pace consistently
  • optionally - have a ready-to-consume C24 CachingComm capable of doing one-rank dyanmics by feeding the other 5 pre-recorded ranks

@FlorianDeconinck FlorianDeconinck deleted the fix/restore_NullComm branch February 24, 2026 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants