
Fix restarting from files not containing the derefinement counter (from #1073) #1089

Merged
merged 10 commits into develop from pgrete/fix-deref-counter-for-restarts on May 29, 2024

Conversation

@pgrete pgrete commented May 23, 2024

PR Summary

In #1073 the per-block derefinement counter was stored in restart files (and restored on restarts).
This changes the number of entries in the flattened loc.level-gid-lid-cnghost-gflag vector (from 5 per block to 6 per block).
If one currently restarts from an older file, this is not taken into account and the logical locations (and potentially more) are off.
I came across this because of a

### PARTHENON ERROR
  Message:     Somehow didn't find a tree.
  File:        /project/project_465000552/env/src/athenapk/external/parthenon/src/mesh/forest/forest.hpp
  Line number: 128
Somehow didn't find a tree.

on restart.

Effectively, this currently renders restarting from any older file impossible/erroneous.
The proposed fix is not necessarily super clean, but it circumvents bumping the output format version (which would require introducing similar compatibility logic anyway).
But if there's a different/cleaner solution, I'm also open to suggestions.
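
For reference, a minimal sketch of what the work-around amounts to. Everything here is illustrative rather than the actual Parthenon code (the block-major layout, nbtotal, and the function name are assumptions): infer the per-block entry count from the dataset size and keep the counter at zero for the old 5-entry layout.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: derive per-block derefinement counters from the flattened
// loc.level-gid-lid-cnghost-gflag data, tolerating the pre-#1073 5-entry layout.
std::vector<int> DerefCountFromFlat(const std::vector<std::int64_t> &flat,
                                    std::size_t nbtotal) {
  std::vector<int> deref_count(nbtotal, 0);  // default for older files
  const std::size_t entries_per_block = flat.size() / nbtotal;  // 5 (old) or 6 (new)
  if (entries_per_block == 6) {
    for (std::size_t b = 0; b < nbtotal; ++b) {
      // assuming block-major storage, the 6th entry per block is the counter
      deref_count[b] = static_cast<int>(flat[b * 6 + 5]);
    }
  }
  return deref_count;
}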

Note that this still doesn't fully fix a restart problem I'm tracking down (also in the context of reviewing #1055), but it fixes the error above.
It might also be that the restarts I'm trying to fix are broken because I used an intermediate Parthenon version that didn't contain all the mesh/tree-related bugfixes that went in over the past weeks.
I'm still investigating, but I'm fairly sure that this PR and the proposed fix are necessary no matter what.

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@pgrete pgrete added the bug Something isn't working label May 23, 2024
@Yurlungur Yurlungur left a comment

This is gross but we gotta do what we gotta do

@lroberts36 lroberts36 left a comment

This looks like a reasonable fix to me, sorry for not thinking of that before. A possibly cleaner choice, but that would require more work, would be to include the derefinement count in a separate dataset in the hdf5 file.

pgrete commented May 23, 2024

> This looks like a reasonable fix to me, sorry for not thinking of that before. A possibly cleaner choice, but that would require more work, would be to include the derefinement count in a separate dataset in the hdf5 file.

Given that the other PR is just 8 days in, I'm kind of in favor of the clean fix rather than carrying around a work-around (that might inspire similar extensions in the future).
I should be able to get to that early next week.

pgrete commented May 23, 2024

(We could still merge this in now for an immediate resolution)

@lroberts36

> (We could still merge this in now for an immediate resolution)

Let me see if I can get this to work this morning.

pgrete commented May 23, 2024

My plan would be to extend the mesh_info object with a new vector containing the counters (which would by default be set to 0 if not present).
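
A minimal sketch of that plan, with illustrative names only (the actual mesh_info type in Parthenon differs):

struct MeshInfoSketch {
  // ... existing per-block metadata (levels, gids, lids, ...) ...
  std::vector<int> derefinement_count;  // new: one counter per block
};

// If the restart file does not contain the data, fall back to zeros, e.g.
// mesh_info.derefinement_count.assign(num_blocks, 0);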

lroberts36 commented May 23, 2024

@pgrete and @Yurlungur: I just switched to making a new dataset and leaving level-gid-lid... as it was prior to #1073. I think this should be a clean fix. If the new dataset is not present in the file, HDF5 writes some complaints to screen, but the exception is caught and it runs through.

@lroberts36 lroberts36 requested a review from Yurlungur May 23, 2024 18:54
@Yurlungur Yurlungur left a comment

Thanks for the quick fix, @lroberts36. This cleaner solution seems good to me. Some nitpicks/HDF5 foo below.

Comment on lines 137 to 139
try {
mesh_info.derefinement_count = ReadDataset<int>("/Blocks/derefinement_count");
} catch (const std::runtime_error &e) {
Collaborator

knowing HDF5 I bet there's voluminous output here... I'm fine with this solution, but if we wanted to reduce the amount that HDF5 complains, you could instead check for the existence of "/Blocks/derefinement_count" with, e.g., H5Lexists (https://docs.hdfgroup.org/hdf5/v1_12/group___h5_l.html#title11). Despite the function being documented as looking at links, it works for any "name" in an HDF5 file: https://forum.hdfgroup.org/t/check-if-dataset-exists/1454/5
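
For concreteness, a hedged sketch of that suggestion (fh_ is the file handle name from the snippets in this thread; the helper is otherwise illustrative): probe for the link before reading, so HDF5 never has to raise, and print, an error.

#include <hdf5.h>

// H5Lexists returns a positive value if the link exists, 0 if it does not,
// and a negative value on error. For a nested path this assumes the parent
// group ("/Blocks") is present, which it is in Parthenon restart files.
bool DatasetExists(hid_t fh, const char *path) {
  return H5Lexists(fh, path, H5P_DEFAULT) > 0;
}

// e.g. if (DatasetExists(fh_, "/Blocks/derefinement_count")) { /* read it */ }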

Collaborator

Honestly, I am a little confused by the output. The beginning of ReadDataset does

    // make sure dataset exists
    auto status = PARTHENON_HDF5_CHECK(H5Oexists_by_name(fh_, name.c_str(), H5P_DEFAULT));
    PARTHENON_REQUIRE_THROWS(
        status > 0, "Dataset '" + name + "' does not exist in HDF5 file " + filename_);

and that throw is what I am catching here I think.

Collaborator

Ah. That makes sense. In which case yeah I wouldn't expect HDF5 to complain at all. Bizarre.

Collaborator

yeah in that case disregard my suggestion.

Collaborator

I am going to leave it as is for now.

Collaborator Author

Given that the (quite extensive) error message is printed by every rank, I now went for the H5Lexists option as @Yurlungur suggested (and which we already use for the SparseInfo field).

src/outputs/restart_hdf5.cpp: outdated review thread (resolved)
@lroberts36

I will leave it to @pgrete to click automerge when he is happy with this.

mesh_info.derefinement_count = ReadDataset<int>("/Blocks/derefinement_count");
} catch (const std::runtime_error &e) {
// File does not contain this dataset, so must be older. Set to default value of zero
if (Globals::my_rank == 0)
Collaborator Author

Extend the check to only print with mesh refinement enabled?

Collaborator

I think because of the HDF5 spew that comes out when the dataset doesn't exist, we should probably always leave the warning in to provide some explanation to the user of what is going on.

Collaborator Author

Given that I removed the HDF5 error message, I now added the additional check.
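
Putting the pieces of this thread together, the read path roughly looks like the sketch below (illustrative, not the committed code; ReadDataset, fh_, and Globals::my_rank appear in the excerpts above, while num_blocks and mesh_refinement_enabled stand in for however those are obtained in restart_hdf5.cpp):

if (H5Lexists(fh_, "/Blocks/derefinement_count", H5P_DEFAULT) > 0) {
  mesh_info.derefinement_count = ReadDataset<int>("/Blocks/derefinement_count");
} else {
  // Older restart file without the dataset: default the counters to zero and
  // only warn when mesh refinement is actually in use.
  if (Globals::my_rank == 0 && mesh_refinement_enabled) {
    std::cout << "Restart file contains no derefinement counts; defaulting to zero."
              << std::endl;
  }
  mesh_info.derefinement_count.assign(num_blocks, 0);
}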

Comment on lines +609 to +613
hsize_t loc_cnt[2] = {num_blocks_local, 1};
hsize_t glob_cnt[2] = {max_blocks_global, 1};
std::vector<int> tmpID = OutputUtils::ComputeDerefinementCount(pm);
HDF5Write2D(gBlocks, "derefinement_count", tmpID.data(), &loc_offset[0], &loc_cnt[0],
&glob_cnt[0], pl);
Collaborator Author

Shall we always write (i.e., a bunch of zeros for static runs), or move the if (pm->pmr) from ComputeDerefinementCount here to only write the deref counter for multilevel sims?

Collaborator

I kind of lean towards always writing to file, with or without refinement, just to keep the structure of the files the same. I think that is already what happens with things written as they are now.
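
A hedged illustration of that choice (not the committed helper body; num_blocks_local, the call to OutputUtils::ComputeDerefinementCount, and the HDF5Write2D arguments are taken from the excerpt above): the writer always receives a full-size vector, which is simply all zeros when refinement is disabled, so the file layout stays identical for all runs.

std::vector<int> counts(num_blocks_local, 0);  // zeros for non-AMR runs
if (pm->pmr) {
  // with mesh refinement enabled, fill from the per-block counters
  counts = OutputUtils::ComputeDerefinementCount(pm);
}
HDF5Write2D(gBlocks, "derefinement_count", counts.data(), &loc_offset[0],
            &loc_cnt[0], &glob_cnt[0], pl);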

@pgrete pgrete enabled auto-merge (squash) May 27, 2024 13:46
@pgrete pgrete disabled auto-merge May 27, 2024 13:46
@pgrete pgrete enabled auto-merge (squash) May 29, 2024 10:31
@pgrete pgrete merged commit bd615b7 into develop May 29, 2024
50 checks passed
@pgrete pgrete deleted the pgrete/fix-deref-counter-for-restarts branch May 29, 2024 11:48