Add support for in-situ histograms/profiles #962

pgrete · 2023-10-19T15:02:29Z

PR Summary

Add support for in-situ histograms/profiles

Detailed doc here: https://parthenon-hpc-lab.github.io/parthenon/pgrete/histogram-outputs/src/outputs.html#histograms

PR Checklist

Code passes cpplint
New features are documented.
Adds a test for any bugs fixed. Adds tests for new features.
Code is formatted
Changes are summarized in CHANGELOG.md
CI has been triggered on Darwin for performance regression tests.
Docs build
(@lanl.gov employees) Update copyright on changed files

…utputs

BenWibking

This will be very nice to have!

My major comments are that it would be nice to have:

less ugly runtime parameters (not having to use the histX_ prefixes)
having a runtime option to not discard values if they are outside the given range, and
an option to output in a text-based (csv, json, ...?) format.

In particular, regarding the first point, is it necessary to have more than one histogram per output file?

doc/sphinx/src/outputs.rst

BenWibking · 2023-10-19T21:12:38Z

doc/sphinx/src/outputs.rst

+   num_histograms = 2    # required, specifies how many histograms are defined in this block
+
+   # 1D histogram ("standard", i.e., counting occurance in bin)
+   hist0_ndim = 1


The hist0_, hist1_, etc. prefixes are very ugly. Is it necessary to support more than 1 histogram per output file?

I'm also not a big fan of this style, but it's the best I came up with so I'm open for suggestions (that do not involve moving to yaml/json/... for parsing inputs, which now seems to come up more often wink @bprather

The main motivation for this approach is to not clobber the (shared) filesystem with many new files.
I can see that one easily calculate 10+ histograms in a simulation at high cadence, which would result in many tiny files (that'd need to be managed again).
To mitigate this, I initially even tried to put all histograms of all times in a single file, but gave up because of HDF5 (deleting/updating data cannot be easily done).

You could append to a yaml/json output file to accomplish that... is avoiding an extra library dependency the issue?

This might be what you tried, but it is supposed to be possible to do streaming writes using HDF5 with the SWMR interface: https://docs.h5py.org/en/stable/swmr.html

To mitigate this, I initially even tried to put all histograms of all times in a single file, but gave up because of HDF5 (deleting/updating data cannot be easily done).

This might be what you tried, but it is supposed to be possible to do streaming writes using HDF5 with the SWMR interface: https://docs.h5py.org/en/stable/swmr.html

I have in the past used extendable datasets to output all data at all times in a single file in HDF5. I do not recommend it. The main issue is that this strategy is not robust against failure. If the code crashes during an output, the entire file is corrupted, not just the time you're trying to output. It's also less robust against the kind of thing we often do, where we restart a simulation and steer it differently upon restart, for example. For these reasons I am a strong believer in 1 file per output time.

I'm also not a big fan of this style, but it's the best I came up with so I'm open for suggestions

What about simply assuming one output block per histogram, but combining them into a single file if the dt is the same?

I gave this some more thought and am now sure how we should go ahead.
Originally, I liked @Yurlungur idea to have a default file_index and multiple output blocks as it looked clean.
Though when thinking about the implementation, I realized that this might introduce some arbitrary ordering in the resulting output file depending on the order of the output# blocks and number of different file_index given, which may result in additional fine tuning/logic when going from input file to postprocessing.

As ugly as the currently solution is, it's from that point of view unambiguous. All histograms in one output block are in one file and in the same order they are defined.
So I'm currently leaning towards keeping this verbose option but am also open to further suggestions (or going with the solution that @Yurlungur suggested if people prefer that).

I'm obviously biased towards my own solution :P. I also think if one relies on ordering in an hdf5 file, one is asking for trouble.

That said, I don't feel that strongly and am fine to defer to implementers choice :)

I'm obviously biased towards my own solution :P. I also think if one relies on ordering in an hdf5 file, one is asking for trouble.

"Order" here referred to the explicit number used in the input file that is also being used as the hdf5 group name in the output file, so it's unique and does not depend on the literal order written down in the input file.

Having said that, it just occurred to me that we could enforce that each histogram has a unique string name (rather than a number) which is then used as prefix (both in the input file as well as the output file).

How about going from

num_histograms = 3 # 1D histogram of a variable, binned by a variable hist0_ndim = 1 hist0_x_variable = advected hist0_x_variable_component = 0 hist0_x_edges_type = list hist0_x_edges_list = 1e-9, 1e-4,1e-1, 2e-1, 5e-1 ,1.0 hist0_binned_variable = advected hist0_binned_variable_component = 0 # 2D histogram of a variable, binned by a coordinate and a different variable hist1_ndim = 2 hist1_x_variable = HIST_COORD_X1 hist1_x_edges_type = lin hist1_x_edges_num_bins = 4 ...

to

hist_names = myname, rhoT # 1D histogram of a variable, binned by a variable myname_ndim = 1 myname_x_variable = advected myname_x_variable_component = 0 myname_x_edges_type = list myname_x_edges_list = 1e-9, 1e-4,1e-1, 2e-1, 5e-1 ,1.0 myname_binned_variable = advected myname_binned_variable_component = 0 # 2D histogram of a variable, binned by a coordinate and a different variable rhoT_ndim = 2 rhoT_x_variable = HIST_COORD_X1 rhoT_x_edges_type = lin rhoT_x_edges_num_bins = 4 ...

This (using names) would, of course, also work in the context of spreading the info over multiple output blocks (each with a unique name for a histogram), but I'm leaning towards keeping the current overall logic (one output block for one output file for a given dt) over adding additional content from different blocks.

What do people think?

I like the new convention with names instead of indices. 👍

New convention is implemented and documented.

src/outputs/histogram.cpp

Yurlungur

This is really cool! I'm very excited to take advantage of it. A few comments/questions below.

Also, should we have something in phdf.py for reading histograms? Maybe not as they seem to be pretty self-descriptive in file?

Yurlungur · 2023-10-20T19:28:12Z

doc/sphinx/src/outputs.rst

+   output corresponding to the histogram number.
+- ``hist#_ndim=INT`` (either ``1`` or ``2``)
+   Dimensionality of the histogram.
+- ``hist#_x_variable=STRING`` (variable name or special coordinate string ``HIST_COORD_X1``, ``HIST_COORD_X2``, ``HIST_COORD_X3`` or ``HIST_COORD_R``)


I'm a little bit confused by this interface. Is the following accurate?

In 2D hist_x_variable is the x axis and hist_binned_variable is the y axis? With hist_weight as the weighting quantity?

In 1D is it just hist_binned_variable with hist_x_variable no longer used?

Hmm, not quite. I might need to update the documentation (I'm happy for suggestions to make it more clear).

When I wrote this/chose the names, I thought about it in terms of plots.
So for both 1D and 2D histograms the "x" variable is the one on the bottom axis, i.e., always used.
Then the "binned" variable is the "data" variable to be shown (or put into "bins"), i..e, for a 1D histogram this variable will correspond to the line (i.e., the y values) and for the 2D histogram this will be the "z" dim.
The "y" variable itself is only used for 2D histograms and is the variable on the left/y axis.

I realize this might be/is confusing, though I didn't come up with better naming scheme that works for 1D and 2D histograms while not changing the meaning of an input parameter.

Hmm I'm still not sure I understand. Maybe a few examples would help me reason through it. In 1D, I often want to compute something like the top panel of this figure:

Note that each "curve" is basically "mass in a bin" vs bins of Ye. What variables should I pick in your framework to make this happen? My intuition would be hist_x_variable set to electron_fraction, hist_binned_variable set to density, and volume-weighting turned on... is that right?

Exactly right.
And to add a 2D example,

Here, the x variable is radius, the y variable plasma beta, and the "data" is again mass (calculated from binning the density field and apply volume weighting).

[edit] the line is the mass weighted mean plasma beta derived from the histogram data a posteriori [/edit]

Ok thanks for the explanation. This makes sense to me---although it might be useful to embed some of these examples in the docs to clarify. I think my head was stuck on what matplotlib requests of you when you build a histogram, which is slightly different, as it conflates some of these variables.

If it's okay to add these illustrative "not-pure-math/numerics" examples, I'm happy to add those to the docs.

I guess I don't know for sure---but they're from published works right? It seems like it should be ok in the context of documentation? (Mine is from DOI 10.3847/2041-8213/acba16)

I now added two examples.

doc/sphinx/src/outputs.rst

src/outputs/histogram.cpp

src/utils/sort.hpp

tst/regression/CMakeLists.txt

Yurlungur · 2023-10-26T16:19:10Z

src/outputs/histogram.cpp

+    const auto x_var_pack_string = x_var_type == VarType::Var
+                                       ? std::vector<std::string>{hist.x_var_name}
+                                       : std::vector<std::string>{};
+    const auto x_var = md->PackVariables(x_var_pack_string);


Not hugely important... but I wonder if it would be best to use the SparsePack machinery here? It might lead to a performance improvement if you built the PackDescriptors as static variables, as that way they are only built once?

Probably yes. I just went with what I'm used to and these parts are likely not performance critical anyway.
Do you feel strongly about moving to SparsePacks?

No I don't feel strongly in this instance---it's not important. I'm mainly just thinking about this from the perspective of eventually moving away from two separate pack infrastructures.

src/outputs/histogram.cpp

…utputs

pgrete · 2023-11-07T11:08:33Z

Also, should we have something in phdf.py for reading histograms? Maybe not as they seem to be pretty self-descriptive in file?

There's an example in the doc on how to read and plot the data, e.g.,

with h5py.File("parthenon.out8.histograms.00040.hdf", "r") as infile:
  # 1D histogram
   x = infile["0/x_edges"][:]
   y = infile["0/data"][:]
   plt.plot(x, y)
   plt.show()

so I'm not sure what else we should provide from the phdf point of view.

Yurlungur · 2023-11-07T15:27:58Z

I'm not sure what else we should provide from the phdf point of view.

I suspected as much. It's simple enough I don't think we need to provide anything else.

…tputs' into pgrete/histogram-outputs

pgrete · 2023-11-07T16:29:59Z

I now pushed all the change I mentioned in the comments.
@BenWibking @Yurlungur feel free to close the conservation that you consider resolved (and potentially re-review).

Yurlungur · 2023-11-07T16:31:28Z

I now pushed all the change I mentioned in the comments. @BenWibking @Yurlungur feel free to close the conservation that you consider resolved (and potentially re-review).

Way ahead of you and already done ;)

BenWibking

My comments have been addressed, so I'm happy with it now.

Yurlungur

I have a few minor concerns remaining, but I consider them non-blocking.

…utputs

pgrete · 2023-11-15T14:07:49Z

I think I've addressed all the leftover discussed changes.
If there are no objections, I'll enable auto-merge tomorrow (and afterwards make a new Parthenon release).
Also feel free to enable auto-merge earlier, if you're happy this the changes.

pgrete and others added 23 commits October 12, 2023 21:39

Add output skeleton for histogram outputs

3f3226e

histogram input processing

67434b3

Add CalcHist function

6cfd0bb

Actually calc histograms

a6d718f

First attempt at dumping to hdf5. existing groups are a problem...

37eaac2

Use separate hdf5 files for histogram outputs

b030779

Fix hdf5 output

eacd5cb

Move upper_bound to utils and add unit test

47065bc

Fix edge bin calc

32a25d5

Add histogram regression test

c80c146

Add missing unit test file

e801cf4

Allow coordinate based binning

2671932

Add sampling based hist

b3d144d

Add volume weighting

4a4d6ae

Fix Scatterview layout

3eb6df9

Separate edge parsing

5961d21

Add support for calculating lin and log bin edges

e515f7c

Merge remote-tracking branch 'origin/develop' into pgrete/histogram-o…

5e57d0f

…utputs

Write string as strings to HDF5

102baa0

Add support for variable weights to histogram

6b12bc4

Use direct indexing for lin and log bins

ead3aec

Fix test case for direct log indexed bins

2ac1c9a

Add doc

bc2684f

pgrete added the enhancement New feature or request label Oct 19, 2023

pgrete requested review from Yurlungur and bprather October 19, 2023 15:02

BenWibking reviewed Oct 19, 2023

View reviewed changes

Yurlungur requested changes Oct 26, 2023

View reviewed changes

Yurlungur and others added 2 commits October 26, 2023 10:26

Merge branch 'develop' into pgrete/histogram-outputs

4dad9bb

Merge remote-tracking branch 'origin/develop' into pgrete/histogram-o…

516974d

…utputs

pgrete added 3 commits November 7, 2023 11:05

Error check for edge case

6472537

Remove duplciated depdency check

e84bf9a

Dedup input parsing code

adff160

pgrete and others added 3 commits November 7, 2023 12:17

Reserve container space

58ee689

Make CalcHist a member function

783df88

Merge branch 'develop' into pgrete/histogram-outputs

028bcfb

pgrete added 3 commits November 7, 2023 17:06

Update doc

2b33f7a

Add capability to accumulate data outside bin ranges

7138c74

Merge remote-tracking branch 'refs/remotes/origin/pgrete/histogram-ou…

61e5df5

…tputs' into pgrete/histogram-outputs

BenWibking approved these changes Nov 7, 2023

View reviewed changes

Yurlungur approved these changes Nov 7, 2023

View reviewed changes

pgrete added 2 commits November 15, 2023 12:20

Merge remote-tracking branch 'origin/develop' into pgrete/histogram-o…

8998a23

…utputs

Use names as historam ids and update doc with examples

a9138c3

Now with png

3dfb4c6

Yurlungur enabled auto-merge November 15, 2023 15:42

Yurlungur merged commit e23cf99 into develop Nov 15, 2023

Yurlungur mentioned this pull request Dec 13, 2023

Support for File Formats #318

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for in-situ histograms/profiles #962

Add support for in-situ histograms/profiles #962

pgrete commented Oct 19, 2023 •

edited

Loading

BenWibking left a comment

BenWibking Oct 19, 2023

pgrete Oct 20, 2023

BenWibking Oct 20, 2023 •

edited

Loading

BenWibking Oct 20, 2023

Yurlungur Oct 20, 2023

pgrete Nov 7, 2023

Yurlungur Nov 7, 2023

pgrete Nov 7, 2023

Yurlungur Nov 7, 2023

pgrete Nov 15, 2023

Yurlungur left a comment

Yurlungur Oct 20, 2023

pgrete Nov 7, 2023

Yurlungur Nov 7, 2023 •

edited

Loading

pgrete Nov 7, 2023 •

edited

Loading

Yurlungur Nov 7, 2023

pgrete Nov 7, 2023

Yurlungur Nov 7, 2023

pgrete Nov 15, 2023

Yurlungur Oct 26, 2023

pgrete Nov 7, 2023

Yurlungur Nov 7, 2023

pgrete commented Nov 7, 2023

Yurlungur commented Nov 7, 2023

pgrete commented Nov 7, 2023

Yurlungur commented Nov 7, 2023 •

edited

Loading

BenWibking left a comment

Yurlungur left a comment

pgrete commented Nov 15, 2023

Add support for in-situ histograms/profiles #962

Add support for in-situ histograms/profiles #962

Conversation

pgrete commented Oct 19, 2023 • edited Loading

PR Summary

PR Checklist

BenWibking left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BenWibking Oct 20, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Yurlungur left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Yurlungur Nov 7, 2023 • edited Loading

Choose a reason for hiding this comment

pgrete Nov 7, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pgrete commented Nov 7, 2023

Yurlungur commented Nov 7, 2023

pgrete commented Nov 7, 2023

Yurlungur commented Nov 7, 2023 • edited Loading

BenWibking left a comment

Choose a reason for hiding this comment

Yurlungur left a comment

Choose a reason for hiding this comment

pgrete commented Nov 15, 2023

pgrete commented Oct 19, 2023 •

edited

Loading

BenWibking Oct 20, 2023 •

edited

Loading

Yurlungur Nov 7, 2023 •

edited

Loading

pgrete Nov 7, 2023 •

edited

Loading

Yurlungur commented Nov 7, 2023 •

edited

Loading