In high energy physics (HEP), experiments require file formats that can accommodate heterogeneity (each collision event can yield a different amount of data) and scale (HEP experiments can generate petabytes of data). Although the current file format of choice is ROOT, a file format developed at CERN, we believe that HDF5, a portable file format widely used outside of HEP, shows promise as a new standard.
The main obstacle is that HDF5 works best with homogeneous data, where each dataset occupies an n-by-m block of memory. This is not necessarily the case for HEP data, but we address this issue using the organizational method outlined in our schema.
We assume that the data we collect is composed of objects (particles in HEP, but equally people or vehicles in the census example below), each carrying a certain number of attributes. Each object is associated with some increasing counter; in HEP, this counter is the event number. Each event can have an arbitrary number of particles of any type, making this data heterogeneous.
To make this data homogeneous, we create an n-by-m block of data for each type of particle, where n is the total number of particles of that type across all events and m is the number of attributes; each row holds all of the attributes of one particle from the original data.
We also create a list for each type of particle whose length is the total number of events. Entry i of this list records how many particles of that type appeared in event i.
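As a concrete illustration of this layout, here is a minimal sketch using only numpy and h5py (the muon dataset and its pt/eta attributes are invented for this example):

import numpy as np
import h5py

# Two events: event 0 has two muons, event 1 has three.
muon_pt = np.array([45.1, 23.7, 12.0, 88.4, 30.2])   # all muons, flattened across events
muon_eta = np.array([0.5, -1.2, 2.1, 0.0, -0.8])
n_muons = np.array([2, 3])                            # counter: muons per event

with h5py.File('example.h5', 'w') as f:
    f.create_dataset('muons/pt', data=muon_pt)
    f.create_dataset('muons/eta', data=muon_eta)
    f.create_dataset('muons/n', data=n_muons)

# To recover event i, turn the counter into cumulative offsets
offsets = np.concatenate(([0], np.cumsum(n_muons)))
event0_pt = muon_pt[offsets[0]:offsets[1]]            # -> array([45.1, 23.7])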
hepfile is useful for datasets where there are n columns with different numbers of rows. In more "pythonic" language, you can imagine a dictionary where each key has a different number of values. For example:
data = {'x': [1],
        'y': [1, 2],
        'z': ['1', '2', '3']}
This cannot simply be converted to most common data structures, like a pandas DataFrame, or written to a simple homogeneous file format, like a CSV file.
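For instance, handing this dictionary directly to pandas fails because the columns have unequal lengths:

import pandas as pd

data = {'x': [1],
        'y': [1, 2],
        'z': ['1', '2', '3']}

pd.DataFrame(data)   # raises ValueError: All arrays must be of the same length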
To illustrate how to use hepfile in a more complex case, imagine a researcher conducting a census on a town. Each household in the town has some variable number of people, some variable number of vehicles, and exactly one residence. The people, vehicles, and residence all have different data associated with them. How would we record these data? To first order, we might decide to record them in multiple spreadsheets or multiple .csv files.
One could also imagine this data stored in a database with each of the CSV files as tables. But the goal is to keep all of this data in one file, so that it is easier for someone to do analysis. For example, someone might want to know the average number of people per bedroom in the homes, or the average number of vehicles as a function of the combined ages of the household residents. With three separate files, this is more difficult to work with. What we want is one file and a way to extract information, collected by household.
To do this, we need some way to count the number of people or vehicles in any household, as well as to keep track of which data fields will always have exactly one entry per household (e.g., data about the residence itself).
One could imagine building a very large pandas dataframe with a lot of join statements and then using a .groupby() approach, or storing this in a database and then using a lot of SQL join statements. But we want to store this in a single file, so instead we take our cue from ROOT and particle physicists, who are used to looping over subsets of their data.
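As a rough sketch of what this household-by-household "loop & pack" approach looks like with hepfile (the group, counter, and field names here are invented for this census example, and the function names follow the hepfile documentation; see hepfile.readthedocs.io for the authoritative API):

import hepfile

data = hepfile.initialize()

# Groups for the variable-length collections, each with its own counter
hepfile.create_group(data, 'people', counter='npeople')
hepfile.create_dataset(data, ['age', 'height'], group='people', dtype=float)
hepfile.create_group(data, 'vehicles', counter='nvehicles')
hepfile.create_dataset(data, ['year'], group='vehicles', dtype=int)

# A singleton field: exactly one entry per household
hepfile.create_dataset(data, ['bedrooms'], dtype=int)

bucket = hepfile.create_single_bucket(data)

# Loop over households, packing one bucket per household
households = [
    {'ages': [34.0, 36.0, 4.0], 'heights': [1.7, 1.8, 0.9],
     'vehicle_years': [2012], 'bedrooms': 3},
    {'ages': [71.0], 'heights': [1.6], 'vehicle_years': [], 'bedrooms': 2},
]
for house in households:
    bucket['people/age'] = house['ages']
    bucket['people/height'] = house['heights']
    bucket['vehicles/year'] = house['vehicle_years']
    bucket['bedrooms'] = house['bedrooms']
    hepfile.pack(data, bucket)   # appends this household and resets the bucket

hepfile.write_to_file('census.h5', data)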
For non-developers, hepfile can be installed using pip. hepfile includes some optional features that require optional dependencies, so there are multiple different options for installation.
The base package gives you the ability to read and write hepfiles using the more standard "loop & pack" method and the dictionary tools. This is the "lightest" of the installation options and depends only on numpy and h5py. To install the base package use:
python -m pip install hepfile
You can also get the awkward_tools, which provide the hepfile integration with the awkward package. This is especially recommended for high energy physicists who are used to working with awkward arrays. The only dependency this adds is awkward. To install this version of the package use:
python -m pip install hepfile[awkward]
You can get the more data-science-focused tools by installing the df_tools and csv_tools. These provide integration with pandas and typical CSV files. This is recommended for those who are used to working with pandas in Python. This adds a pandas dependency to the base installation. To install this distribution use:
python -m pip install hepfile[pandas]
To get both the awkward and pandas integration with hepfile (which adds pandas and awkward to the base installation dependencies) use:
python -m pip install hepfile[all]
Finally, for those running the example notebooks locally, you should install with the learn optional dependency, because some of the tutorials rely on dependencies like matplotlib and astropy:
python -m pip install hepfile[learn]
The documentation for hepfile can be found at hepfile.readthedocs.io.
To build hepfile locally for development and testing, follow these steps:
- Clone this repo
- Navigate to the top-level directory of this project (probably called hepfile)
- We then recommend installing with the developer dependencies. To do this run:
python -m pip install -e '.[dev]'
- Then, run the following commands to set up the pre-commit git hook, which automatically runs our code standard tests before each commit:
pip install pre-commit
pre-commit install
As a side note, to test changes to the code, use the following command in the top-level directory of the project:
pytest