[Fix]: Removing race condition within DiagManagerMonitor testing #459
romanc
left a comment
I think we'll need to find a more central place to configure the dask scheduler.
with dask.config.set(scheduler="synchronous"):
    ds = xr.open_dataset("diag_manager_single_tile.nc", decode_times=True)
Are you suggesting that any xarray dataset read in a non-parallel environment would need to be guarded by a with statement setting the scheduler to "synchronous"? That seems like something to be configured in a more central place to me, no?
When I looked up the errors being produced, I was pointed towards this method, although I am not sure why it is needed given that this is not a parallel environment. Based on what I have seen so far, I think the issue comes down to the decode_times parameter of open_dataset. The most recent commit removes the use of the dask scheduler and instead calls MPIComm()._comm.Barrier(), given that the traceback references threading calls.
Per the issues we are seeing in this PR, I think it is necessary to rework the installation of pyfms so that it uses the pip-installed netcdf. I will move this PR to draft and note when a PR for updating the install of pyfms is available.
romanc
left a comment
While there is now a combination of system dependencies and user code that makes the tests pass, I'm still a bit uneasy about this feature. In particular, I'm wondering if there's anything we can learn from this message
which shows up in pyFMS tests now that they run. Tracing that message back to FMS, I get here
which indicates that something in the setup of the diag_manager isn't as expected. And indeed, in NDSL, diag_manager.init() is called without the time_init argument
NDSL/ndsl/monitor/diag_manager_monitor.py
Line 35 in 4d43f77
which pyFMS does provide as an option:
Do you think the above error message could be related to the crashes we have seen previously? If not, do you think there's a way to configure pyFMS or the pyFMS-monitor to avoid the message (i.e. not use prepend_date or supply a time_init argument)? It looks like prepend_date is an option read from a namelist and it defaults to true
Looping @rem1776 in here as he brought this feature in and generated this test.
@romanc Yeah skipping the The We could add a line to set prepend_date to false in the created namelists, which would silence the message but not change any other behavior, if that's preferable. We may want to use this option in the future, but I think it would be better to do that via a flag in the monitor's init routine instead of doing it every run.
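A minimal sketch of the suggestion above, assuming the monitor writes its own namelist file: a flag on the init routine controls whether prepend_date is set to false in the diag_manager_nml group. The names render_diag_manager_nml and MonitorNamelistWriter are hypothetical and are not part of NDSL or pyFMS; the input.nml file name is also an assumption.

```python
from pathlib import Path


def render_diag_manager_nml(prepend_date: bool = False) -> str:
    """Render a diag_manager_nml namelist group as text (hypothetical helper)."""
    # Fortran namelists spell logical values as .true. / .false.
    flag = ".true." if prepend_date else ".false."
    return f"&diag_manager_nml\n    prepend_date = {flag}\n/\n"


class MonitorNamelistWriter:
    """Hypothetical stand-in for the monitor's init routine, not the NDSL class."""

    def __init__(self, workdir: Path, prepend_date: bool = False) -> None:
        # Assumption: the namelist is read from a file named input.nml
        # in the working directory.
        self.nml_path = Path(workdir) / "input.nml"
        self.nml_path.write_text(render_diag_manager_nml(prepend_date))
```

With the default prepend_date=False the message would be silenced on every run, and a caller that later supplies a start time could opt back in by passing prepend_date=True.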
@rem1776 thanks for the background on that
- ds = xr.open_mfdataset(filename, decode_times=True)
+ ds = netCDF4.Dataset(filename)
and then use
I agree that this would be better placed in the monitor's init routine.
Do we want this change in this PR or a subsequent one?
romanc
left a comment
It looks like we are a bit stuck here with this PR. Let me propose the following:
- Follow-up issue for the init_time argument
- Follow-up issue for the crash when using xarray.open_dataset()
- and then we merge this PR as-is since it's anyway an improvement over the current state where pyfms-tests don't even run.
run: |
  pip3 install .[${{matrix.extra}}]
  pip3 list
please remove debug pip list ...
- run: |
-   pip3 install .[${{matrix.extra}}]
-   pip3 list
+ run: pip3 install .[${{matrix.extra}}]
Checking if this can be merged?
Fine by me and since Ryan also approved, I don't see a reason to wait any longer. Florian won't have time to review.

Description
This PR introduces changes that prevent a race condition that exists in test_dm_monitor_single. This PR builds off the work introduced by PR 450.
How has this been tested?
Tested using current CI