Skip to content

param restarts #887

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 25 commits into from
Jun 21, 2023
Merged

param restarts #887

merged 25 commits into from
Jun 21, 2023

Conversation

Yurlungur
Copy link
Collaborator

@Yurlungur Yurlungur commented Jun 12, 2023

PR Summary

We've discussed previously that we now use Params for many purposes, including storing mesh-level data. In this PR, I implement scaffolding for this use case. In particular, I add:

  • The ability to output Kokkos::View and ParArray*D params to disk
  • The ability to read a mutable param from a restart file to set the value. In particular, a param can now be marked Immutable, Mutable, or Restart. Here Mutable is as it was before (though now this is an enum instead of a bool.) Restart is a superset of Mutable. A Restart param is mutable, but also re-read from the restart file upon restart. It is first set to whatever the package initialization provides, then overwritten.
  • Unit tests for the new Param machinery.

This new machinery enables some important use-cases. For Phoebus, it lets us stash spacetime information in Params, which is useful in some cases. I imagine that for things like turbulent forcing, or derived integral quantities, the Athena-PK team may also be interested.

Limitations, technical questions

  • I needed to enumerate the types that can be written to/read from file. So it's a limited list.
  • Because now more params are output, some regression tests are now failing, as the gold file contains less information. I can modify the tests to be less strict (I believe in large part the unit tests now cover the code paths they were checking), or I can create a new gold file. I don't know which is better. I'd like input. Please let me know which option you would prefer.
  • Outputting Params or reading Params that differ on different MPI ranks is undefined behavior. This was already true, but I now document it. I think permitting different Params on different ranks would be a pretty big lift, and I recommend against. At least for now.
  • On a related note, since params are treated as attributes in the HDF5 file, they must be written and read collectively and identically.

@bprather I know I've talked to you about this design. Please take a look at this and let me know if it meets your constraints. If I recall, there was some use-case you had, where you did not want the infrastructure to modify your mutable params, which is why I added the three mutability levels.

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@Yurlungur Yurlungur self-assigned this Jun 12, 2023
@Yurlungur Yurlungur added the enhancement New feature or request label Jun 12, 2023
@Yurlungur Yurlungur linked an issue Jun 12, 2023 that may be closed by this pull request
@Yurlungur
Copy link
Collaborator Author

Issue #683 discusses this path vs an alternative path adding variables at the Mesh level. I think the Mesh level infrastructure has advantages for sure. But Params currently work, and this has a relatively clean implementation.

Copy link
Collaborator

@lroberts36 lroberts36 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is passing tests, it LGTM. That being said, I am a little queasy about the general philosophy of putting more and more stuff into params, especially things that may be better thought of as simulation state (e.g. quantities that are evolved but are not directly tied to the grid). I haven't thought carefully about how it would work, but it seems like it would make more logical sense to have a way of registering fields at the mesh level of arbitrary shape and being able to write those as hdf5 datasets (instead of as metadata). It seems worth it to have a discussion at the Parthenon meeting about this.

Edit: Hadn't carefully read #683 before writing this. I think I generally agree with @pgrete suggestion there.

@Yurlungur
Copy link
Collaborator Author

If this is passing tests

It is not currently passing tests. Both for the regression test reason mentioned above, and for some other reason on the cuda machines. I need to debug. The fully general functionality for writing/reading arbitrary shapes was tricky to get right.

I am a little queasy about the general philosophy of putting more and more stuff into params, especially things that may be better thought of as simulation state (e.g. quantities that are evolved but are not directly tied to the grid). I haven't thought carefully about how it would work, but it seems like it would make more logical sense to have a way of registering fields at the mesh level of arbitrary shape and being able to write those as hdf5 datasets (instead of as metadata). It seems worth it to have a discussion at the Parthenon meeting about this.

This is fair. Yes I'd like to discuss at the parthenon meeting. I think my solution is significantly more flexible than @pgrete 's suggestion, as it's type aware, capable of looking at shape, and knows how to handle both host- and device-side data, but this may or may not be a good thing.

I do think my solution could be relatively straightforwardly translated into a Mesh level data solution... just by sticking a Params-like object at the mesh level.

Copy link
Collaborator

@forrestglines forrestglines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the AddParams from the documentation needs to be added to StateDescriptor

Comment on lines +237 to +238
template <typename T, REQUIRES(implements<scalar(T)>::value)>
void HDF5WriteAttribute(const std::string &name, const T &value, hid_t location) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it this what might enable writing arbitrary types to as attributes? I'm not sure how to understand REQUIRES(implements<scalar(T)>::value)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's using @lroberts36 concepts. It means it knows how to write things like a single double or an integer, but it doesn't do arbitrary types. You'll see I also have specializations for std::vector, ParArray and Kokkos::View. To do other types, one would need to add the appropriate template specialization.

@forrestglines
Copy link
Collaborator

Restarts with AthenaPK seem to work fine, AthenaPK tests are passing so far

@Yurlungur
Copy link
Collaborator Author

Big picture are you happy with the design @forrestglines ?

@forrestglines
Copy link
Collaborator

Big picture are you happy with the design @forrestglines ?

I think it's good, I see it as an extension of the existing design and not any significant departure.

If it's possible to add our own custom types to Param restarts in the future then it might make some of the usage of Params in my cluster pgen cleaner.

@Yurlungur
Copy link
Collaborator Author

If it's possible to add our own custom types to Param restarts in the future then it might make some of the usage of Params in my cluster pgen cleaner.

I think this is possible, but I'd rather leave it to a future PR. I think it requires an additional interface to register custom I/O ops

@Yurlungur
Copy link
Collaborator Author

@forrestglines how do things look now? I fixed the missing function in StateDescriptor. I have now updated the gold files, so regression tests should pass.

Also, because I always forget how to add regression tests, I added a doc page for it. And I fixed some errors I got when building the docs locally.

|| ncycle_out_mesh || 0 || int || Number of cycles between printing the mesh structure (e.g., total number of MeshBlocks and number of MeshBlocks per level) to standard out. Use a negative number to also print every time the mesh was modified. Default: 0 (i.e, off). |
|| ncrecv_bdry_buf_timeout_sec || -1.0 || Real || Timeout in seconds for the `ReceiveBoundaryBuffers` tasks. Disabed (negative) by default. Typically no need in production runs. Useful for debugging MPI calls. |
+------------------------------+---------+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+------------------------------+---------+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sphinx didn't seem to like this ultrawide table. I had to shrink it to get it to render.

@Yurlungur
Copy link
Collaborator Author

That being said, I am a little queasy about the general philosophy of putting more and more stuff into params, especially things that may be better thought of as simulation state (e.g. quantities that are evolved but are not directly tied to the grid). I haven't thought carefully about how it would work, but it seems like it would make more logical sense to have a way of registering fields at the mesh level of arbitrary shape and being able to write those as hdf5 datasets (instead of as metadata). It seems worth it to have a discussion at the Parthenon meeting about this.

Just to make sure comment is here for posterity: We discussed in the last parthenon call and decided to go with this design ans there are open questions about a variable based approach, and it was decided an in-hand solution was desired. That said the whole team agreed with the concern about putting more stuff into params.

@Yurlungur
Copy link
Collaborator Author

Clang tests currently failing due to #630 .

@Yurlungur
Copy link
Collaborator Author

Discussed with @bprather and currently he's happy. Calling it good assuming tests pass after the fix and @forrestglines approves.

Copy link
Collaborator

@bprather bprather left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Mostly I'm happy that this is opt-in (at least, the reading -- iiuc it will make restart files bigger regardless, which I think is a small enough price to pay).

I don't think I carry around the kind of global state that would benefit from this, but it may prove useful to me at some point.

@forrestglines
Copy link
Collaborator

forrestglines commented Jun 21, 2023

The merge seems to have broken compilation for my build of Parthenon within Athenapk:

/vast/home/forrestglines/code/athenapk-project/athenapk/external/parthenon/src/interface/variable.cpp(217): error: identifier "mem_size" is undefined

1 error detected in the compilation of "/vast/home/forrestglines/code/athenapk-project/athenapk/external/parthenon/src/interface/variable.cpp".
make[2]: *** [parthenon/src/CMakeFiles/parthenon.dir/build.make:412: parthenon/src/CMakeFiles/parthenon.dir/interface/variable.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....

EDIT: It's a simple fix, only happens when building without sparse

Copy link
Collaborator

@forrestglines forrestglines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonahm-LANL I pushed a tiny fix, I think it's ready to merge

@Yurlungur
Copy link
Collaborator Author

Oops good catch. Thanks for the fix @forrestglines !

@Yurlungur
Copy link
Collaborator Author

Forcing merge since tests pass and the cuda/clang tests are currently broken. Thanks @pgrete for confirming.

@Yurlungur Yurlungur merged commit b8efc2d into develop Jun 21, 2023
@Yurlungur Yurlungur deleted the jmm/param-restarts branch June 21, 2023 19:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Store arbitrary/mesh data in output
5 participants