-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix restaring from files not containing the derefinement counter (from #1073) #1089
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is gross but we gotta do what we gotta do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a reasonable fix to me, sorry for not thinking of that before. A possibly cleaner choice, but that would require more work, would be to include the derefinement count in a separate dataset in the hdf5 file.
Given that the other PR is just 8 days in, I'm kind of in favor of the clean fix rather than carrying around a work-around (that might inspire similar extension in the future). |
(We could still merge this in now for an immediate resolution) |
Let me see if I can get this to work this morning. |
My plan would be to extend the |
@pgrete and @Yurlungur: I just switched to making a new dataset and leaving |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick fix, @lroberts36 . This cleaner solution seems good to me. Some nitpicks/hdf5 foo below.
src/outputs/restart_hdf5.cpp
Outdated
try { | ||
mesh_info.derefinement_count = ReadDataset<int>("/Blocks/derefinement_count"); | ||
} catch (const std::runtime_error &e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
knowing HDF5 I bet there's voluminous output here... I'm fine with this solution, but if we wanted to reduce the amount that HDF5 complains, you could instead check for the existence of "/Blocks/derefinement_count"
with, e.g., H5Lexists
(https://docs.hdfgroup.org/hdf5/v1_12/group___h5_l.html#title11). Despite the function being documented as looking at links, it works for any "name" in an HDF5 file: https://forum.hdfgroup.org/t/check-if-dataset-exists/1454/5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I am a little confused by the output. The beginning of ReadDataset
does
// make sure dataset exists
auto status = PARTHENON_HDF5_CHECK(H5Oexists_by_name(fh_, name.c_str(), H5P_DEFAULT));
PARTHENON_REQUIRE_THROWS(
status > 0, "Dataset '" + name + "' does not exist in HDF5 file " + filename_);
and that throw is what I am catching here I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. That makes sense. In which case yeah I wouldn't expect HDF5 to complain at all. Bizarre.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah in that case disregard my suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to leave it as is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that the (quite extensive) error message is printed by every rank, I now went for the H5Lexists
option as @Yurlungur suggested (and we already use for the SparseInfo
field).
I will leave it to @pgrete to click automerge when he is happy with this. |
src/outputs/restart_hdf5.cpp
Outdated
mesh_info.derefinement_count = ReadDataset<int>("/Blocks/derefinement_count"); | ||
} catch (const std::runtime_error &e) { | ||
// File does not contain this dataset, so must be older. Set to default value of zero | ||
if (Globals::my_rank == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extend to the check to only print with mesh refinement enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think because of the HDF5 spew that comes out when the dataset doesn't exist, we should probably always leave the warning in to provide some explanation to the user of what is going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that I removed the hdf5 error message, I now added the additional check.
hsize_t loc_cnt[2] = {num_blocks_local, 1}; | ||
hsize_t glob_cnt[2] = {max_blocks_global, 1}; | ||
std::vector<int> tmpID = OutputUtils::ComputeDerefinementCount(pm); | ||
HDF5Write2D(gBlocks, "derefinement_count", tmpID.data(), &loc_offset[0], &loc_cnt[0], | ||
&glob_cnt[0], pl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we always write (i..e, a bunch of zeros for static runs) or move the if (pm->pmr)
from ComputeDerefinementCount
here to only write the deref counter for multilevel sims?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of lean towards always writing to file, with or without refinement, just to keep the structure of the files the same. I think this should happen with things written as they are now.
PR Summary
In #1073 the per-block derefinement counter was stored in restart files (and restored on restarts).
This changes the number of entries in the flattened
loc.level-gid-lid-cnghost-gflag
vector (from 5 per block to 6 per block).If one currently restarts with from an older file this is not taken into account and logical locations (and potentially more are off).
I came across this because of a
on restart.
Effectively, this renders restart any older file impossible/erroneous right now.
The proposed fix is not necessarily super-clean but it circumvents bumping the output format version (and introducing similar logic anyway).
But if there's a different/cleaner solution, I'm also open to suggestions.
Note, that this still doesn't fully fix a restart problem I'm tracking down (also in context of reviewing #1055) but it fixes the error above.
It might also be that the restart I'm trying to fix are broken because I used some Parthenon version in between that didn't contain all the mesh/tree related bugfixes that went in over the past weeks.
I'm still investigating, but I'm fairly sure that this PR and the proposed fix are necessary no matter what.
PR Checklist