This is a suite of programs and scripts for histogramming and reweighting of MC ntuples in the BlackHat format, such as those generated by GoSam or Sherpa.
For the details of the structure of the ntuples, please refer to:
- arXiv:1310.7439,
- arXiv:1608.01195,
- CERN-THESIS-2021-047 Appendix A.
This readme is a work in progress. I will continue adding to it to make it a comprehensive guide to working with GoSam ntuples.
```
git clone --recursive https://github.com/ivankp/ntuple_analysis.git
```
Don't forget the `--recursive` argument to also clone the required submodules.
- Edit or copy `src/hist.cc`. This is the program that reads the ntuples and fills the defined histograms. It also does reweighting.
- Add histograms' definitions around line 384. Basically, add `h_(name_of_observable)`.
- Fill histograms towards the end of the event loop, around line 553.
- Define histograms' binning in `run/binning.json`.
- Select what sets of ntuples to run over in `run/submit.py`.
- Generate job scripts and submit them to condor by running `./submit.py` inside the `run` directory.
A C++ compiler supporting C++20 features is required. For GCC this means version 10.1.0 or higher.
Get at: https://root.cern/
A recent enough version of ROOT that allows compilation with the `-std=c++20` flag is required.
The programs have been tested with versions 6.18/04 and 6.20/04.
Any newer version should work. Older versions are most likely ok as well, as long as the code compiles.
Get at: https://lhapdf.hepforge.org/
LHAPDF is required to compute scale variations and PDF uncertainties.
Get at: http://fastjet.fr/
FastJet is required to cluster final-state particles into jets.
Check out the repository directly by running the following command:
```
git clone --recursive https://github.com/ivankp/ntuple_analysis.git
```
The `ntuple_analysis` repository contains a submodule, `histograms`, and the `--recursive` argument is necessary to check it out as well. If you forget the `--recursive`, you'll need to run
```
git submodule init
git submodule update
```
inside the repository after it is cloned.
To compile the programs run `make`. To compile in parallel run `make -j`.
If you have a GitHub account and are familiar with git, I recommend forking the repository. This will give you a personal copy of the repository, to which you can commit on your own, while still being able to synchronize your fork with the upstream repository in case of bug fixes or added features. Read more about forking in the GitHub documentation.
An example histogramming analysis is defined in `src/hist.cc`.
The simplest approach is to just make changes to that file.
A slightly less straightforward approach, but a more sensible one in the long run, is to copy `src/hist.cc` into another file under `src/`. You can make as many copies as you like, and have many different analyses defined for different studies.
The only complication is that you'll have to specify the compilation flags and dependencies for the new program. This is pretty simple to do: copy the paragraph in the `Makefile` that starts with
```
C_hist :=
LF_hist :=
L_hist :=
bin/hist:
```
and replace the string `hist` in it with the path to the new source file relative to the `src/` directory, omitting the file extension.
Executables don't have to be manually added to the `all` target, because the `Makefile` contains a script that finds them.
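For example, for a hypothetical new analysis in `src/my_analysis.cc`, the copied paragraph would use `my_analysis` in place of `hist`, with the right-hand sides and prerequisites filled in the same way as for `hist`, adjusted to the new program's flags and dependencies:
```make
# Sketch: "my_analysis" is a placeholder name for the new source file
C_my_analysis :=
LF_my_analysis :=
L_my_analysis :=
bin/my_analysis:
```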
New analysis code can be placed in (possibly nested) directories under `src/`. This directory structure will be replicated inside `bin/` for the compiled executables.
While `hist.cc` contains a lot of code, most of it is there to provide basic infrastructure for defining an analysis. In fact, you probably won't need to understand or edit most of it.
Though a great deal of automation can be achieved by manipulating the histogram bin types, typical usage should only require defining names for the histograms and defining the physics observables to fill them.
Although I used my own histogramming library, which relies on template meta-programming techniques, the program writes the output in the ROOT format as `TH1D`s arranged in a hierarchy of directories inside a `TFile`.
In my opinion, the complexity drawback of a custom histogramming
library is completely outweighed by the benefits of flexibility, code
reduction, automation, and the ability to express complex structures by
composition of conceptually simple building blocks via the type system.
For example, by defining a custom bin type, every histogram can be
automatically filled for all defined weights, all the different initial states
(gg, gq, qq), etc.
The logically most important part of the analysis program is the event loop. As far as histogramming is concerned, the histogram objects have to be defined outside the loop and filled inside it.
Definitions of the histograms should be located just below the comment
`// Histograms of main observables #################################`
The `h_` macro simplifies histogram definition by combining and automating the following 3 steps:
- Histogram object definition;
- Matching the name of the histogram with the axes definition;
- Addition of the histogram pointer to a global array which is later used to iterate over all the defined histograms to write them to the output file.
All histogram objects have the same type, `hist_t`, which is an alias for an instance of the `ivanp::hist::histogram` class template.
For simplicity, I suggest using the following convention. Say you are interested in the distribution of the leading jet transverse momentum.
- Define a histogram `h_j1_pT` by writing `h_(j1_pT)`.
- Inside the event loop, define the variable `double j1_pT` and compute its value.
- Fill the histogram using `h_j1_pT(j1_pT)`.
Rather than defining a `Fill` member function à la ROOT, the `operator()` is overloaded to perform the same task with less typing.
Event weights are handled automatically through the bin type. Do not pass them to the histogram when filling it.
The physics variables should be defined and histograms filled inside the event loop below the comment
`// Define observables and fill histograms #######################`
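To make the convention concrete, here is a minimal sketch of the two pieces of code involved. The jet container name and accessor below are assumptions for illustration; use whatever `hist.cc` actually provides at that point in the event loop.
```cpp
// Outside the event loop, below "// Histograms of main observables":
// defines hist_t h_j1_pT, ties it to the "j1_pT" binning, and registers it for output.
h_(j1_pT)

// Inside the event loop, below "// Define observables and fill histograms":
if (!jets.empty()) {            // "jets" is an assumed container of clustered jets
  double j1_pT = jets[0].pt();  // leading jet transverse momentum
  h_j1_pT(j1_pT);               // weights are applied automatically by the bin type
}
```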
As defined in `hist.cc`, all histograms have the same type. They have a completely dynamic number of dimensions, determined by the axes' definitions.
To allow changes to the histograms' binning to be made without recompiling the analysis code, axes are defined in the runcard either directly or by referencing a dedicated binning file.
An example binning file is included in `run/binning.json`.
Binning is defined in JSON format in terms of nested arrays.
The root array lists all the definitions. Each element is an array of 2 elements: the first element is a regular expression that is matched against the names of the defined histograms; the second is an array of axes.
The regular expression need not be complicated and may simply be the exact string of the histogram name. However, using regular expressions rather than simple string comparison makes it easy to specify the same binning for a whole class of histograms.
The regular expressions are matched in the order of their definition. As in the example, a final `".*"` definition may be used as a catch-all clause. Otherwise, if any histogram is left unmatched, an error is thrown and the program terminates.
The first level in the axes definitions corresponds to the histogram's dimensions. Thus, for a simple 1D histogram it should contain 1 element.
The second level represents sub-binning. This array would have more than 1 element only if the histogram has at least 2 dimensions and requires different binning in the second dimension for different bins in the first dimension.
The third level is an individual axis definition. This array has to have at least 1 element. The elements may either be numbers, specifying bin edges, or arrays, specifying ranges of bins.
All histograms contain underflow and overflow bins in all dimensions.
Examples:
- `[[[ 1,5,10 ]]]`: a 1D histogram with bin edges at 1, 5, and 10.
- `[[[ [0,10,5] ]]]`: a uniformly binned 1D histogram with 5 bins between 0 and 10.
- `[ [[ 0 ]], [[ [0,100,2], 1e3 ]] ]`: a 2D histogram with only underflow and overflow divided at 0, and no other bins in the first dimension, and the following bins in the second dimension: [0,50), [50,100), and [100,1000).
Binning and other run parameters are specified in a runcard in JSON format. A minimal runcard is shown below.
```json
{
  "input": {
    "files": [ "ntuple.root" ]
  },
  "binning": "binning.json",
  "output": "histograms.root"
}
```
Additional parameters can be specified as shown in this more complete example.
```json
{
  "input": {
    "files": [ "ntuple.root" ],
    "entries": 10000,
    "tree": "t3"
  },
  "jets": {
    "algorithm": [ "antikt", 0.4 ],
    "cuts": { "eta": 4.4, "pt": 30 },
    "njets_min": 0
  },
  "photons": {
    "higgs_decay_seed": 0
  },
  "binning": [
    [".*", [[[ [-1e3,1e3,200] ]]] ]
  ],
  "output": "histograms.root",
  "reweighting": [{
    "ren_fac": [ [1,1],[0.5,0.5],[1,0.5],[0.5,1],[2,1],[1,2],[2,2] ],
    "scale": "HT1",
    "pdf": "CT14nlo",
    "pdf_var": true
  }]
}
```
The first argument to the analysis program is the name of the runcard file. If additional arguments are provided they are interpreted as the names of input ntuples, in which case the list of input files in the runcard is ignored.
If the first argument is `-`, the runcard is read from `stdin`.
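For example, assuming the default analysis was compiled to `bin/hist` and the runcard is named `runcard.json` (both names are only for illustration), the program could be invoked in any of the following ways:
```sh
# use the input files listed in the runcard
./bin/hist runcard.json

# override the runcard's input file list with ntuples given on the command line
./bin/hist runcard.json ntuple1.root ntuple2.root

# read the runcard from stdin
cat runcard.json | ./bin/hist -
```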
The script `run/submit.py` automates submission of histogramming jobs to condor.
Some of its parameters can be controlled via command line arguments. To list them, run `./submit.py -h`. All the arguments have default values, and most use cases will probably not require passing any of them.
`submit.py` has to be run in the `run` directory or, in general, the directory where you expect all the output files to be created.
By default, `submit.py` assumes that you are running the `hist` analysis. A command line option (see `./submit.py -h`) lets you specify a different analysis executable.
If using a relative path, the path to the executable has to be specified relative to the directory in which condor will run the jobs. Thus, if the `submit.py` script is located inside `ntuple_analysis/run/`, the path to the executable should be `../../bin/analysis`, with the extra `..` accounting for the job scripts running in `ntuple_analysis/run/condor/`.
A typical histogramming analysis running over 25M events will take on the order of tens of minutes. However, if you are also reweighting, the jobs may take considerably longer, especially for the I-type events.
The MSU Tier 3 HTCondor is configured to kill jobs submitted to the default queue after 3 hours. To avoid this, pass the `-m` option to `submit.py` to submit jobs to the Medium queue. This adds a `+IsMediumJob = True` line to the condor script.
Probably the most important part of `submit.py` is the `selection` variable. It defines the criteria used to select the set of input ntuples.
Each element of the `selection` list is an object defining all possible properties characterizing the ntuples. If multiple values are listed for any property, all possible combinations of the listed properties are searched for in the database. Multiple objects can be listed in `selection` to avoid undesired combinations, as each element of `selection` is considered independently.
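As a purely illustrative sketch of the structure (the property names below are hypothetical placeholders; the actual keys are those already used in the `selection` variable shipped with `submit.py`):
```python
# Hypothetical sketch; the real property names are defined in submit.py.
selection = [
    {   # all combinations of the listed values are looked up in the database
        "njets": [1, 2],
        "part": ["B", "RS", "I", "V"],
    },
    {   # a second, independent element avoids undesired combinations
        "njets": [3],
        "part": ["B", "RS"],
    },
]
```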
After you edit `submit.py`, and before you submit the condor jobs, it is a good idea to make a dry run that generates all the job scripts but doesn't submit them. This way selection mistakes can be caught early by looking at the names of the generated scripts in the `condor_` directory and by looking at the runcards written inside a few of these scripts, to make sure that all parameters appear as desired.
A dry run can be performed by passing the `-x` option to `submit.py`.
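A typical sequence before the real submission might therefore look like this:
```sh
./submit.py -x   # dry run: generate the job scripts without submitting anything
# inspect the generated scripts and their embedded runcards, then
./submit.py      # submit to the default queue
./submit.py -m   # or submit to the Medium queue for longer (e.g. reweighting) jobs
```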
When `submit.py` is run, it creates a `condor_` directory, where all the condor and job scripts are generated. The script makes use of the DAGMan feature of HTCondor. This is done in order to run a merging job automatically after all the ntuple-processing jobs have finished, without having to wait and run the merging manually.
If a `merge.sh` script is present in the run directory, it is executed when all the histogramming jobs complete. Additionally, if a `finish.sh` script is present, it is run afterwards.
The default `merge.sh` script automatically determines, based on the files' names, which ones to merge from the separate runs for the same B, RS, I, or V part, and afterwards, if all parts are present, merges them into the NLO result. This is done by the `merge` program, which also scales the histograms to the differential cross section.
If you did reweighting and would like to combine the scale and PDF variation histograms into envelopes, this can be done using the `envelopes` program.