open_mfdataset parallel=True failing with netcdf4 >= 1.6.1 #7079
Comments
I ran into this problem yesterday reading netcdf files on our HPC with a known good script and netcdf files. Unfortunately just trying to open the files again in a try..except block did not work for me. Looking back through my environment update history I found that the netcdf4 library had been updated since I'd last successfully run the script. The current version installed was |
I believe you are hitting Unidata/netcdf4-python#1192. The verdict is not out on that one yet. Your parallelization may not be thread safe, which would make the 1.6.1 failures expected. For now, if you can, downgrade to 1.6.0 or use an engine that is thread safe. Maybe h5netcdf (not sure!)? |
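To check whether an environment falls in the range named in the issue title (netcdf4 >= 1.6.1), a small version test can help. This is only a sketch: the helper names are hypothetical, the distribution name "netCDF4" passed to importlib.metadata is an assumption about how the package is installed, and the thread does not say whether any later release changes the behavior.

```python
import re
from importlib import metadata

# First netcdf4-python release reported as problematic in this thread.
FIRST_AFFECTED = (1, 6, 1)


def parse_release(version: str) -> tuple:
    """Return the leading numeric part of a version string as a 3-tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    parts = [int(p) for p in match.group(0).split(".")][:3]
    # Pad short versions such as "1.6" with zeros.
    return tuple(parts + [0] * (3 - len(parts)))


def is_affected(version: str) -> bool:
    """True if the version is at or above the first release reported here."""
    return parse_release(version) >= FIRST_AFFECTED


def installed_netcdf4_version():
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version("netCDF4")  # assumed distribution name
    except metadata.PackageNotFoundError:
        return None
```

Calling is_affected(installed_netcdf4_version()) after a None check gives a quick yes/no before deciding whether to downgrade or switch schedulers.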
Also, you can try:
import dask
dask.config.set(scheduler="single-threaded")
That would ensure you don't use threads when reading with netcdf-c (netcdf4).
Edit: this is not an xarray problem and I recommend closing this issue and following up with the one already opened upstream. |
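If changing the scheduler globally is too broad, dask.config.set can also be used as a context manager so the setting applies only around the read. The sketch below assumes the data fits in memory, because it calls load() inside the block; data that stays lazy would be read later under whatever scheduler is active at that point. The function name is hypothetical.

```python
def open_without_threads(paths, **kwargs):
    """Open and load files with dask's single-threaded scheduler (sketch)."""
    # Local imports so the helper can be defined without dask/xarray installed.
    import dask
    import xarray as xr

    # The single-threaded scheduler is active only inside this block.
    with dask.config.set(scheduler="single-threaded"):
        ds = xr.open_mfdataset(paths, engine="netcdf4", parallel=True, **kwargs)
        # Read the data now, while threads are still disabled.
        return ds.load()
```

As noted later in the thread, this restricts the read to serial compute.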
@ocefpaf and all: thank you! What a mysterious error this has been. Using the workaround
did indeed avoid the issue for me. |
Note that this is not a bug per se: netcdf-c was never thread safe and, when the workaround was removed in netcdf4-python, this issue surfaced. The right fix is to disable threads, like in my example above, or to wait for a netcdf-c release that is thread safe. I don't think the workaround will be re-added in netcdf4-python. |
This fix will restrict you to serial compute. You can also parallelize across processes using something like PBSCluster(
...,
cores=1,
processes=2,
) or |
I was waiting for someone who does stuff on clusters to comment on that. Thanks! (My workflow is my own laptop only, so I'm quite limited on that front 😄) |
Use LocalCluster! ;) |
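A rough sketch of the process-based approach on a single machine, assuming dask.distributed is installed. Each worker is a separate process with one thread, so no two threads in the same process call into netcdf-c at once. The helper names are hypothetical, and load() assumes the data fits in memory.

```python
def process_cluster_kwargs(n_workers: int) -> dict:
    """Keyword arguments for a cluster whose workers each run one thread."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return {"n_workers": n_workers, "threads_per_worker": 1, "processes": True}


def open_with_processes(paths, n_workers=2, **kwargs):
    """Open and load files using worker processes instead of threads (sketch)."""
    # Local imports so the helper can be defined without these packages.
    from dask.distributed import Client, LocalCluster
    import xarray as xr

    # Creating the Client makes this cluster the default scheduler
    # for dask computations inside the block.
    with LocalCluster(**process_cluster_kwargs(n_workers)) as cluster, Client(cluster):
        ds = xr.open_mfdataset(paths, engine="netcdf4", parallel=True, **kwargs)
        return ds.load()
```

On platforms that spawn worker processes, the call to open_with_processes usually needs to sit under an if __name__ == "__main__": guard.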
From conda-forge/netcdf4-feedstock#141:
@pydata/xarray EDIT: We already have locks: xarray/xarray/backends/netCDF4_.py Lines 363 to 383 in 6e77f5e
|
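For readers unfamiliar with the locking idea mentioned above, here is a generic illustration using only the standard library. It is not xarray's actual implementation (the referenced lines are not reproduced in this thread); it only shows how one shared lock serializes calls into a library that is not thread safe. All names are hypothetical.

```python
import functools
import threading
import time

# One process-wide lock guarding every call into the non-thread-safe library.
_LIBRARY_LOCK = threading.Lock()


def locked(func):
    """Decorator: run func while holding the shared library lock."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _LIBRARY_LOCK:
            return func(*args, **kwargs)
    return wrapper


def run_demo(n_threads: int = 8) -> int:
    """Return the largest number of threads seen inside the guarded call."""
    stats = {"active": 0, "max_active": 0}

    @locked
    def fake_read():
        # Stand-in for a call into a library that is not thread safe.
        stats["active"] += 1
        stats["max_active"] = max(stats["max_active"], stats["active"])
        time.sleep(0.001)
        stats["active"] -= 1

    threads = [threading.Thread(target=fake_read) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["max_active"]
```

A lock like this protects only the calls that go through it; any code path that reaches the library without taking the lock can still run concurrently with the guarded ones.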
It would be great if someone could put together an MCVE that reproduces the issue here. We have multiple tests in our test suite that use |
Turns out we're running tests on an older working version (logs) even though we don't have a pin.
|
|
Note that this will hopefully be removed soon - SciTools/iris#5095 - but the reviewer has been assigned to other urgent work so it's paused right now. |
I've opened #7488, which I think has actually exposed a few other failures. I doubt I'll have much time to put into this issue in the near term, so anyone should feel free to jump in here. |
Update: I pushed two new tests to #7488. They are not failing in our test env. If someone who has reported this issue could try running the test suite, that would be super helpful in terms of confirming where the problem lies. |
@cefect, @pnorton-usgs, @kthyng - Is this still an issue for you? If so, could you try to run the xarray test suite in #7079 and report back? We haven't been able to trigger the error reported here so we could use some help running the test suite in an "offending" environment. |
@jhamman Sorry for my delay — I started this the other day and got waylaid. I'll try to get back to it today or tomorrow. |
I was able to reproduce the error with the current version of xarray and then have it work with the new version. Here is what I did: Make new environment
Check version
In python:
returns the following the first time
Next I used the PR version of xarray and reran the code above and then it was able to read in ok on the first try. Note: after a week or so those files won't work and will have to be updated with something more current but the pattern to use is clear from the file names. |
@kthyng - any difference when running with |
@jhamman Yes, using the PR version of xarray, with |
@kthyng those files are on a remote server and that may not be the segfault from the original issue here. It may be a server that is not happy with parallel access. Can you try that with local files? PS: you can also try with |
Ok I downloaded the two files and indeed there is no error with |
I'm not really sure what to think any more — we have had a real, consistent issue that seemed to fit the description of this issue which went away with one of the fixes above (using single threading), but using local files at the moment seems to remove the error even with the current version of xarray and either |
* temporarily remove iris from ci, trying to reproduce #7079
* add parallel=True test when using dask cluster
* lint
* add local scheduler test
* pin netcdf version >= 1.6.1
* Update ci/requirements/environment-py311.yml
* Update ci/requirements/environment.yml
* Update ci/requirements/environment.yml
---------
Co-authored-by: Deepak Cherian <[email protected]>
What happened?
When using the parallel=True key, open_mfdataset fails with NetCDF: Unknown file format. Running the same command again (with try+except), or with parallel=False, executes as expected.
works:
works:
fails:
all with engine='netcdf4'
Any help is highly appreciated, as I'm a bit lost on how to investigate this further.
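The "run it again with try+except" pattern described above could look roughly like the sketch below. It assumes the failure surfaces as an OSError, and the helper name is hypothetical. Another commenter in this thread reported that retrying did not help in their environment, so treat it as a stopgap at best.

```python
import time


def retry(func, attempts=3, delay=0.0, exceptions=(OSError,)):
    """Call func() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
```

A call might then be wrapped as retry(lambda: xr.open_mfdataset(paths, engine="netcdf4", parallel=True)), where paths stands for the list of files being opened.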
What did you expect to happen?
No response
Minimal Complete Verifiable Example
No response
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment