Skip to content

[Fix]: Removing race condition within DiagManagerMonitor testing#459

Merged
fmalatino merged 50 commits into
NOAA-GFDL:developfrom
fmalatino:fix/dm_monitor
Jun 4, 2026
Merged

[Fix]: Removing race condition within DiagManagerMonitor testing#459
fmalatino merged 50 commits into
NOAA-GFDL:developfrom
fmalatino:fix/dm_monitor

Conversation

@fmalatino
Copy link
Copy Markdown
Contributor

Description

This PR introduces changes that prevent a race condition that exists in test_dm_monitor_single. This PR builds off the work introduced by PR 450.

How has this been tested?

Tested using current CI

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (e.g. add new modules to docs/docstrings/)
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • New check tests, if applicable, are included

@fmalatino fmalatino requested a review from romanc May 7, 2026 17:44
Copy link
Copy Markdown
Collaborator

@romanc romanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need to find a more central place to configure the dask scheduler.

Comment thread .github/workflows/unit_tests.yaml Outdated
Comment thread .github/workflows/unit_tests.yaml Outdated
Comment thread .github/workflows/unit_tests.yaml Outdated
Comment thread .github/workflows/unit_tests.yaml Outdated
Comment thread tests/monitor/test_dm_monitor_single.py Outdated
Comment thread tests/monitor/test_dm_monitor_single.py Outdated
Comment on lines +222 to +223
with dask.config.set(scheduler="synchronous"):
ds = xr.open_dataset("diag_manager_single_tile.nc", decode_times=True)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting that any xarray dataset read in a non-parallel environment would need to be guarded by a with statement setting the scheduler to "synchronous"? That seems like something to be configured in a more central place to me, no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I looked up the errors that were being produced I was pointed in the direction of using this method, albeit I am not sure why this is the case given that it is not in a parallel environment. I think the issue comes down to the call to decode_times parameter within open_dataset, this is based on the information I have seen thus far on what might be the problem. The most recent commit has removed the use of the dask scheduler and instead will use a call to MPIComm()._comm.Barrier(), given that traceback makes references to calls regarding threading.

@fmalatino
Copy link
Copy Markdown
Contributor Author

Per the issues we are seeing in this PR, I think it is necessary to re-work the installation of pyfms, such that it will use the pip installed netcdf for installation. I will move this PR to draft and note when a PR for updating the install of pyfms is available.

@fmalatino fmalatino marked this pull request as draft May 8, 2026 15:51
@romanc romanc mentioned this pull request May 12, 2026
7 tasks
Copy link
Copy Markdown
Collaborator

@romanc romanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While there is now a combination of system dependencies and user code that makes the tests pass, I'm still a bit uneasy on this feature. In particular, I'm wondering if there's anything we can learn from this message

Image

which shows up in pyFMS tests now that they run. Tracing that message back to FMS, I get here

https://github.com/NOAA-GFDL/FMS/blob/222f3c6de852b6ba565bd5d6bac55893e92ea106/diag_manager/diag_manager.F90#L4221-L4222

which indicates that something in the setup of the diag_manager isn't as expected. And indeed, in NDSL, diag_manager.init() is called without the time_init argument

diag_manager.init(diag_model_subset=diag_manager.DIAG_ALL)

which pyFMS does provide as an option:

https://github.com/NOAA-GFDL/pyFMS/blob/2dd98e645e99bfedd610d057e5aaf9541763ea3d/pyfms/py_diag_manager/diag_manager.py#L77-L80

Do you think the above error message could be related to the crashes we have seen previously? If not, do you think there's a way to configure pyFMS or the pyFMS-monitor to avoid the message (i.e. not use prepend_date or supply a time_init argument)? It looks like prepend_date is an option read from a namelist and it defaults to true

https://github.com/NOAA-GFDL/FMS/blob/222f3c6de852b6ba565bd5d6bac55893e92ea106/diag_manager/diag_manager.F90#L191-L197

Comment thread .github/workflows/unit_tests.yaml Outdated
@fmalatino
Copy link
Copy Markdown
Contributor Author

While there is now a combination of system dependencies and user code that makes the tests pass, I'm still a bit uneasy on this feature. In particular, I'm wondering if there's anything we can learn from this message

Image which [shows up in pyFMS tests](https://github.com/NOAA-GFDL/NDSL/actions/runs/25697320515/job/75448715648?pr=459) now that they run. Tracing that message back to `FMS`, I get here

https://github.com/NOAA-GFDL/FMS/blob/222f3c6de852b6ba565bd5d6bac55893e92ea106/diag_manager/diag_manager.F90#L4221-L4222

which indicates that something in the setup of the diag_manager isn't as expected. And indeed, in NDSL, diag_manager.init() is called without the time_init argument

diag_manager.init(diag_model_subset=diag_manager.DIAG_ALL)

which pyFMS does provide as an option:

https://github.com/NOAA-GFDL/pyFMS/blob/2dd98e645e99bfedd610d057e5aaf9541763ea3d/pyfms/py_diag_manager/diag_manager.py#L77-L80

Do you think the above error message could be related to the crashes we have seen previously? If not, do you think there's a way to configure pyFMS or the pyFMS-monitor to avoid the message (i.e. not use prepend_date or supply a time_init argument)? It looks like prepend_date is an option read from a namelist and it defaults to true

https://github.com/NOAA-GFDL/FMS/blob/222f3c6de852b6ba565bd5d6bac55893e92ea106/diag_manager/diag_manager.F90#L191-L197

Looping @rem1776 in here as he brought this feature in and generated this test.

@fmalatino fmalatino requested a review from rem1776 May 12, 2026 14:32
@rem1776
Copy link
Copy Markdown
Contributor

rem1776 commented May 12, 2026

@romanc Yeah skipping the init_time argument was intentional. It's more of an optional feature to change the name of your output files so that they include the model start date.

The init_time argument used in the diag_manager_init routine is only used to prepend the starting date to the output filename. If theres no init_time present and prepend_date is true, after outputting that note message FMS will set prepend_date to false so that it doesn't cause a problem.

We could add a line to set prepend_date to false in the created namelists, which would silence the message but not change any other behavior, if thats preferable. We may want to use this option the future, but I think it would be better to do that via a flag in the monitor's init routine instead of doing it every run.

@fmalatino fmalatino requested a review from romanc May 12, 2026 16:16
@romanc
Copy link
Copy Markdown
Collaborator

romanc commented May 12, 2026

@rem1776 thanks for the background on that init_time flag. I was asking because @fmalatino had to modify the test code to change

- ds = xr.open_mfdataset(filename, decode_times=True)
+ ds = netCDF4.Dataset(filename)

and then use num2date() in the test code to make them run on CI. Reading via xarray would end up in a segfault and traces would point to date/time routines. Since I have no experience with FMS, I found this suspicious and I was looking for anything that would explain why this would happen. Would you expect reading via xarray to fail?

We could add a line to set prepend_date to false in the created namelists, which would silence the message but not change any other behavior, if thats preferable. We may want to use this option the future, but I think it would be better to do that via a flag in the monitor's init routine instead of doing it every run.

I agree that this would be better placed in the monitor's init routine.

@fmalatino
Copy link
Copy Markdown
Contributor Author

@rem1776 thanks for the background on that init_time flag. I was asking because @fmalatino had to modify the test code to change

- ds = xr.open_mfdataset(filename, decode_times=True)
+ ds = netCDF4.Dataset(filename)

and then use num2date() in the test code to make them run on CI. Reading via xarray would end up in a segfault and traces would point to date/time routines. Since I have no experience with FMS, I found this suspicious and I was looking for anything that would explain why this would happen. Would you expect reading via xarray to fail?

We could add a line to set prepend_date to false in the created namelists, which would silence the message but not change any other behavior, if thats preferable. We may want to use this option the future, but I think it would be better to do that via a flag in the monitor's init routine instead of doing it every run.

I agree that this would be better placed in the monitor's init routine.

Do we want this change in this PR or a subsequent one?

Copy link
Copy Markdown
Collaborator

@romanc romanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we are a bit stuck here with this PR. Let me propose the following

  1. Follow-up issue for the init_time argument
  2. Follow-up issue for the crash when using xarray.open_dataset()
  3. and then we merge this PR as-is since it's anyway an improvement over the current state where pyfms-tests don't even run.

Comment thread .github/workflows/unit_tests.yaml Outdated
Comment on lines +56 to +58
run: |
pip3 install .[${{matrix.extra}}]
pip3 list
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove debug pip list ...

Suggested change
run: |
pip3 install .[${{matrix.extra}}]
pip3 list
run: pip3 install .[${{matrix.extra}}]

@fmalatino
Copy link
Copy Markdown
Contributor Author

Issues 467 and 468 have been added to track the xarray use issue and default arguments to DiagManagerMonitor respectively.

@fmalatino
Copy link
Copy Markdown
Contributor Author

Checking if this can be merged?

@romanc
Copy link
Copy Markdown
Collaborator

romanc commented Jun 4, 2026

Checking if this can be merged?

Fine by me and since Ryan also approved, I don't see a reason to wait any longer. Florian won't have time to review.

@fmalatino fmalatino added this pull request to the merge queue Jun 4, 2026
Merged via the queue into NOAA-GFDL:develop with commit 768386f Jun 4, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants