Skip to content

Conversation

@charles-turner-1
Copy link
Collaborator

@charles-turner-1 charles-turner-1 commented May 13, 2025

Mostly closes #7

As of right now I'm not sure that we can get all the filenames out of a multifile dataarray, only the first. With that said, probably unimportant in nearly all situations. I've added an xfailing test for it, but in practice I think it'll give the right answer if a multifile datarray is passed to this function.

@jemmajeffree think this might be up your alley - no rush with the review though.

@charles-turner-1
Copy link
Collaborator Author

if hasattr(bound_method, "__self__"):
file_handles.append(Path(bound_method.__self__._filename))

return file_handles

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall structure looks good and I can follow what the function does, though I'm not familiar with the specific functions/classes/attributes used (can infer what they do from use, but not confirm that they do what they seem to)

files.append(versioneer_file)
present = False
try:
with open(".gitattributes", "r") as fobj:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious why the removal of "r" is beneficial. Wouldn't it be more futureproof to specify read only?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a versioneer artefact, not sure what's caused this to change tbh.

engine="netcdf4",
)
else:
ds = xr.open_dataset(fpath, decode_timedelta=False, engine="netcdf4")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open_mfdataset seems to cope with a single path (as best I can tell, it uses it as a glob string that returns a single file); is there a reason to use open_dataset explicitly for only one filepath?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, fixed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you get both instances?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope. Working on updating the related examples now anyway so will fix in there

fhandles = _get_file_handles(ds)

if isinstance(fpath, list):
assert fhandles == [Path(f) for f in fpath]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this work if fpath contains the same paths but in a different order to fhandles? Would you want this to pass or fail in that instance

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case it'd fail, you could replace it with a set intersection to make it pass. Whether you'd want it to pass/fail is more of a philosophical question I guess

else:
ds = xr.open_dataset(fpath, decode_timedelta=False, engine="netcdf4")
with warnings.catch_warnings():
chunk_dict = validate_chunkspec(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this is the right place to mention this, but does validate_chunkspec care if the chunks are not an integer divisor of file size in that dimension? Ie, if files have 3 timesteps, in disc chunks of 1, is it okay with chunks of two? in which case you'd end up with every second chunk being half as big as the other. Are we okay with that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I actually don't know whether this would cause performance issues. I suspect not but can't say for certain..

@charles-turner-1 charles-turner-1 merged commit f559460 into main May 21, 2025
14 checks passed
@charles-turner-1 charles-turner-1 deleted the fix-ds-handles branch May 21, 2025 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

validate_chunkspec on an xarray dataset/dataarray object

2 participants