Fix getting filenames out of netCDF datasets #8

charles-turner-1 · 2025-05-13T06:54:41Z

Mostly closes #7

As of right now, I'm not sure that we can get all the filenames out of a multifile DataArray, only the first. That said, it's probably unimportant in nearly all situations. I've added an xfailing test for it, but in practice I think the function will give the right answer if a multifile DataArray is passed to it.

@jemmajeffree think this might be up your alley - no rush with the review though.

charles-turner-1 · 2025-05-14T02:27:16Z

See also the xarray discussion I've opened on this topic.

jemmajeffree · 2025-05-20T23:50:06Z

src/access_intake_utils/chunking/_chunking.py

+            if hasattr(bound_method, "__self__"):
+                file_handles.append(Path(bound_method.__self__._filename))
+
+    return file_handles


Overall structure looks good and I can follow what the function does, though I'm not familiar with the specific functions/classes/attributes used (I can infer what they do from use, but can't confirm that they do what they seem to).
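For anyone else unfamiliar with the `__self__` check in the hunk above, here is a minimal sketch of the Python mechanism it relies on. `FakeBackendStore` and `path_from_callable` are hypothetical stand-ins invented for illustration, not the classes or helpers used in `_chunking.py`; the only thing taken from the diff is the `hasattr(..., "__self__")` test and the `_filename` attribute name.

```python
from pathlib import Path


class FakeBackendStore:
    """Hypothetical stand-in for an object that remembers the file it was opened from."""

    def __init__(self, filename):
        self._filename = filename

    def acquire(self):
        # Placeholder method; only its binding to the instance matters here.
        return None


def path_from_callable(func):
    """Return the owning object's path if `func` is a bound method, else None."""
    # A bound method carries a reference to its instance in `__self__`;
    # a plain function or lambda has no such attribute.
    if hasattr(func, "__self__"):
        return Path(func.__self__._filename)
    return None


store = FakeBackendStore("/tmp/example.nc")
from_bound = path_from_callable(store.acquire)   # bound method: path is recovered
from_plain = path_from_callable(lambda: None)    # plain function: nothing to recover
```

Note that built-in functions such as `len` also expose `__self__` (pointing at the `builtins` module), so the check identifies "has an owner object", not strictly "is a method of a datastore".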

jemmajeffree · 2025-05-20T23:52:27Z

versioneer.py

        files.append(versioneer_file)
    present = False
    try:
-        with open(".gitattributes", "r") as fobj:


I am curious why the removal of "r" is beneficial. Wouldn't it be more futureproof to specify read only?

This is just a versioneer artefact, not sure what's caused this to change tbh.
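On the `"r"` question: the built-in `open` defaults to `mode="r"`, so dropping the argument should not change behaviour. A small sketch using a throwaway temporary file (the file contents are made up for illustration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, ".gitattributes")
    with open(path, "w") as fobj:
        fobj.write("* text=auto\n")

    # Open the same file with and without an explicit mode.
    with open(path) as implicit, open(path, "r") as explicit:
        modes = (implicit.mode, explicit.mode)
        same_content = implicit.read() == explicit.read()
```

Both handles report read-only text mode, so whether to spell out `"r"` is a style choice rather than a functional one.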

jemmajeffree · 2025-05-20T23:58:48Z

tests/test_chunking.py

+            engine="netcdf4",
+        )
+    else:
+        ds = xr.open_dataset(fpath, decode_timedelta=False, engine="netcdf4")


open_mfdataset seems to cope with a single path (as best I can tell, it uses it as a glob string that returns a single file); is there a reason to use open_dataset explicitly for only one filepath?

Good call, fixed

did you get both instances?

nope. Working on updating the related examples now anyway so will fix in there
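A quick illustration of why a single path can work as a glob string. This uses only the standard-library `glob` module and does not call xarray; whether `open_mfdataset` takes exactly this route internally is an assumption based on the comment above. The filename is hypothetical.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    single = os.path.join(tmp, "ocean_2000.nc")
    open(single, "w").close()  # empty placeholder file, not a real netCDF file

    # A path with no wildcard characters matches itself if it exists.
    matches = glob.glob(single)
    # A wildcard pattern over the same directory finds the same file.
    pattern_matches = sorted(glob.glob(os.path.join(tmp, "*.nc")))
```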

jemmajeffree · 2025-05-21T00:04:19Z

tests/test_chunking.py

+    fhandles = _get_file_handles(ds)
+
+    if isinstance(fpath, list):
+        assert fhandles == [Path(f) for f in fpath]


Would this work if fpath contains the same paths but in a different order to fhandles? Would you want this to pass or fail in that instance?

In this case it'd fail. You could replace it with a set comparison to make it pass. Whether you'd want it to pass or fail is more of a philosophical question, I guess.
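To make the options concrete, here is a small sketch comparing the strict list assertion with two order-insensitive alternatives. The paths are made up, and `fhandles` is written out by hand rather than obtained from `_get_file_handles`.

```python
from pathlib import Path

fpath = ["b.nc", "a.nc"]                  # order the caller passed
fhandles = [Path("a.nc"), Path("b.nc")]   # same files, different order

expected = [Path(f) for f in fpath]

strict = fhandles == expected                    # order-sensitive list comparison
as_sets = set(fhandles) == set(expected)         # ignores order and duplicates
as_sorted = sorted(fhandles) == sorted(expected) # ignores order, keeps duplicates
```

Comparing sorted lists is the stricter of the two relaxed checks, since it still fails if a file appears a different number of times on each side.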

jemmajeffree · 2025-05-21T00:10:49Z

tests/test_chunking.py

+    else:
+        ds = xr.open_dataset(fpath, decode_timedelta=False, engine="netcdf4")
+    with warnings.catch_warnings():
+        chunk_dict = validate_chunkspec(


Not sure if this is the right place to mention this, but does validate_chunkspec care if the chunks are not an integer divisor of the file size in that dimension? I.e., if files have 3 timesteps, with on-disk chunks of 1, is it okay with chunks of 2? In that case you'd end up with every second chunk being half as big as the others. Are we okay with that?

Good question. I actually don't know whether this would cause performance issues. I suspect not, but can't say for certain.
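The arithmetic behind that scenario can be sketched without xarray or dask. `chunk_sizes` is a hypothetical helper, not part of `validate_chunkspec`, and the combined layout assumes chunks are applied per file and never span a file boundary.

```python
def chunk_sizes(dim_length, chunk):
    """Split a dimension of `dim_length` into chunks of `chunk`, plus any remainder."""
    full, remainder = divmod(dim_length, chunk)
    return (chunk,) * full + ((remainder,) if remainder else ())


per_file = chunk_sizes(3, 2)   # 3 timesteps per file, requested chunk of 2
combined = per_file * 2        # two such files concatenated along time
even = chunk_sizes(4, 2)       # chunk divides the file length exactly
```

With 3 timesteps and a chunk of 2, each file contributes one full chunk and one half-sized chunk, which is the alternating pattern described above; when the chunk divides the file length there is no remainder.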

Fix getting filenames out of netCDF datasets

8016434

charles-turner-1 requested a review from jemmajeffree May 13, 2025 06:54

charles-turner-1 added 5 commits May 13, 2025 14:55

Pre-commit

1655d08

actually fix pre commit

daa9918

Add missing test back in (refactor related)

6583caf

Pre-commit

0c6d4aa

Can this even work with a multifile dataarray? Moved test to xfailing

c92f28b

jemmajeffree approved these changes May 21, 2025


charles-turner-1 added 2 commits May 21, 2025 13:50

@jemmajeffree's suggestion on using open_mfdataset for single files

0a7536d

Pre-commit

25e41c1

charles-turner-1 merged commit f559460 into main May 21, 2025
14 checks passed

charles-turner-1 deleted the fix-ds-handles branch May 21, 2025 05:58

charles-turner-1 mentioned this pull request May 21, 2025

Get all file handles in xr.DataArray, not just first. #9

Open
