-
Notifications
You must be signed in to change notification settings - Fork 37
Add support for in-situ histograms/profiles #962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be very nice to have!
My major comments are that it would be nice to have:
- less ugly runtime parameters (not having to use the
histX_
prefixes) - having a runtime option to not discard values if they are outside the given range, and
- an option to output in a text-based (csv, json, ...?) format.
In particular, regarding the first point, is it necessary to have more than one histogram per output file?
doc/sphinx/src/outputs.rst
Outdated
num_histograms = 2 # required, specifies how many histograms are defined in this block | ||
|
||
# 1D histogram ("standard", i.e., counting occurance in bin) | ||
hist0_ndim = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hist0_
, hist1_
, etc. prefixes are very ugly. Is it necessary to support more than 1 histogram per output file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not a big fan of this style, but it's the best I came up with so I'm open for suggestions (that do not involve moving to yaml/json/... for parsing inputs, which now seems to come up more often wink @bprather
The main motivation for this approach is to not clobber the (shared) filesystem with many new files.
I can see that one easily calculate 10+ histograms in a simulation at high cadence, which would result in many tiny files (that'd need to be managed again).
To mitigate this, I initially even tried to put all histograms of all times in a single file, but gave up because of HDF5 (deleting/updating data cannot be easily done).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could append to a yaml/json output file to accomplish that... is avoiding an extra library dependency the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be what you tried, but it is supposed to be possible to do streaming writes using HDF5 with the SWMR interface: https://docs.h5py.org/en/stable/swmr.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To mitigate this, I initially even tried to put all histograms of all times in a single file, but gave up because of HDF5 (deleting/updating data cannot be easily done).
This might be what you tried, but it is supposed to be possible to do streaming writes using HDF5 with the SWMR interface: https://docs.h5py.org/en/stable/swmr.html
I have in the past used extendable datasets to output all data at all times in a single file in HDF5. I do not recommend it. The main issue is that this strategy is not robust against failure. If the code crashes during an output, the entire file is corrupted, not just the time you're trying to output. It's also less robust against the kind of thing we often do, where we restart a simulation and steer it differently upon restart, for example. For these reasons I am a strong believer in 1 file per output time.
I'm also not a big fan of this style, but it's the best I came up with so I'm open for suggestions
What about simply assuming one output
block per histogram, but combining them into a single file if the dt
is the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this some more thought and am now sure how we should go ahead.
Originally, I liked @Yurlungur idea to have a default file_index and multiple output blocks as it looked clean.
Though when thinking about the implementation, I realized that this might introduce some arbitrary ordering in the resulting output file depending on the order of the output#
blocks and number of different file_index
given, which may result in additional fine tuning/logic when going from input file to postprocessing.
As ugly as the currently solution is, it's from that point of view unambiguous. All histograms in one output block are in one file and in the same order they are defined.
So I'm currently leaning towards keeping this verbose option but am also open to further suggestions (or going with the solution that @Yurlungur suggested if people prefer that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm obviously biased towards my own solution :P. I also think if one relies on ordering in an hdf5 file, one is asking for trouble.
That said, I don't feel that strongly and am fine to defer to implementers choice :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm obviously biased towards my own solution :P. I also think if one relies on ordering in an hdf5 file, one is asking for trouble.
"Order" here referred to the explicit number used in the input file that is also being used as the hdf5 group name in the output file, so it's unique and does not depend on the literal order written down in the input file.
Having said that, it just occurred to me that we could enforce that each histogram has a unique string name (rather than a number) which is then used as prefix (both in the input file as well as the output file).
How about going from
num_histograms = 3
# 1D histogram of a variable, binned by a variable
hist0_ndim = 1
hist0_x_variable = advected
hist0_x_variable_component = 0
hist0_x_edges_type = list
hist0_x_edges_list = 1e-9, 1e-4,1e-1, 2e-1, 5e-1 ,1.0
hist0_binned_variable = advected
hist0_binned_variable_component = 0
# 2D histogram of a variable, binned by a coordinate and a different variable
hist1_ndim = 2
hist1_x_variable = HIST_COORD_X1
hist1_x_edges_type = lin
hist1_x_edges_num_bins = 4
...
to
hist_names = myname, rhoT
# 1D histogram of a variable, binned by a variable
myname_ndim = 1
myname_x_variable = advected
myname_x_variable_component = 0
myname_x_edges_type = list
myname_x_edges_list = 1e-9, 1e-4,1e-1, 2e-1, 5e-1 ,1.0
myname_binned_variable = advected
myname_binned_variable_component = 0
# 2D histogram of a variable, binned by a coordinate and a different variable
rhoT_ndim = 2
rhoT_x_variable = HIST_COORD_X1
rhoT_x_edges_type = lin
rhoT_x_edges_num_bins = 4
...
This (using names) would, of course, also work in the context of spreading the info over multiple output blocks (each with a unique name for a histogram), but I'm leaning towards keeping the current overall logic (one output block for one output file for a given dt) over adding additional content from different blocks.
What do people think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the new convention with names instead of indices. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New convention is implemented and documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really cool! I'm very excited to take advantage of it. A few comments/questions below.
Also, should we have something in phdf.py
for reading histograms? Maybe not as they seem to be pretty self-descriptive in file?
doc/sphinx/src/outputs.rst
Outdated
output corresponding to the histogram number. | ||
- ``hist#_ndim=INT`` (either ``1`` or ``2``) | ||
Dimensionality of the histogram. | ||
- ``hist#_x_variable=STRING`` (variable name or special coordinate string ``HIST_COORD_X1``, ``HIST_COORD_X2``, ``HIST_COORD_X3`` or ``HIST_COORD_R``) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little bit confused by this interface. Is the following accurate?
- In 2D
hist_x_variable
is the x axis andhist_binned_variable
is the y axis? Withhist_weight
as the weighting quantity? - In 1D is it just
hist_binned_variable
withhist_x_variable
no longer used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not quite. I might need to update the documentation (I'm happy for suggestions to make it more clear).
When I wrote this/chose the names, I thought about it in terms of plots.
So for both 1D and 2D histograms the "x" variable is the one on the bottom axis, i.e., always used.
Then the "binned" variable is the "data" variable to be shown (or put into "bins"), i..e, for a 1D histogram this variable will correspond to the line (i.e., the y values) and for the 2D histogram this will be the "z" dim.
The "y" variable itself is only used for 2D histograms and is the variable on the left/y axis.
I realize this might be/is confusing, though I didn't come up with better naming scheme that works for 1D and 2D histograms while not changing the meaning of an input parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I'm still not sure I understand. Maybe a few examples would help me reason through it. In 1D, I often want to compute something like the top panel of this figure:
Note that each "curve" is basically "mass in a bin" vs bins of Ye. What variables should I pick in your framework to make this happen? My intuition would be hist_x_variable
set to electron_fraction
, hist_binned_variable
set to density, and volume-weighting turned on... is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly right.
And to add a 2D example,
Here, the x variable is radius, the y variable plasma beta, and the "data" is again mass (calculated from binning the density field and apply volume weighting).
[edit] the line is the mass weighted mean plasma beta derived from the histogram data a posteriori [/edit]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok thanks for the explanation. This makes sense to me---although it might be useful to embed some of these examples in the docs to clarify. I think my head was stuck on what matplotlib
requests of you when you build a histogram, which is slightly different, as it conflates some of these variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's okay to add these illustrative "not-pure-math/numerics" examples, I'm happy to add those to the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I don't know for sure---but they're from published works right? It seems like it should be ok in the context of documentation? (Mine is from DOI 10.3847/2041-8213/acba16)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now added two examples.
const auto x_var_pack_string = x_var_type == VarType::Var | ||
? std::vector<std::string>{hist.x_var_name} | ||
: std::vector<std::string>{}; | ||
const auto x_var = md->PackVariables(x_var_pack_string); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not hugely important... but I wonder if it would be best to use the SparsePack
machinery here? It might lead to a performance improvement if you built the PackDescriptor
s as static
variables, as that way they are only built once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably yes. I just went with what I'm used to and these parts are likely not performance critical anyway.
Do you feel strongly about moving to SparsePacks
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I don't feel strongly in this instance---it's not important. I'm mainly just thinking about this from the perspective of eventually moving away from two separate pack infrastructures.
There's an example in the doc on how to read and plot the data, e.g.,
so I'm not sure what else we should provide from the |
I suspected as much. It's simple enough I don't think we need to provide anything else. |
…tputs' into pgrete/histogram-outputs
I now pushed all the change I mentioned in the comments. |
Way ahead of you and already done ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comments have been addressed, so I'm happy with it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few minor concerns remaining, but I consider them non-blocking.
I think I've addressed all the leftover discussed changes. |
PR Summary
Add support for in-situ histograms/profiles
Detailed doc here: https://parthenon-hpc-lab.github.io/parthenon/pgrete/histogram-outputs/src/outputs.html#histograms
PR Checklist