
Ntuple Analysis

This is a suite of programs and scripts for histogramming and reweighting of MC ntuples following the BlackHat format, such as those generated by GoSam or Sherpa.

For the details of the structure of the ntuples, please refer to:

This readme is a work in progress. I will continue adding to it to make it a comprehensive guide to working with GoSam ntuples.

Quick start

Checking out the code

git clone --recursive https://github.com/ivankp/ntuple_analysis.git

Don't forget the --recursive argument to also clone the required submodules.

How to run

  1. Edit or copy src/hist.cc. This is the program that reads the ntuples and fills the defined histograms. It also does reweighting.
  2. Add histogram definitions around line 384; for each observable, add h_(name_of_observable).
  3. Fill the histograms towards the end of the event loop, around line 553 (see the sketch after this list).
  4. Define histograms' binning in run/binning.json.
  5. Select what sets of ntuples to run over in run/submit.py.
  6. Generate job scripts and submit them to condor by running ./submit.py inside the run directory.
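In miniature, steps 2 and 3 look like this (the names are placeholders; the h_ macro and the surrounding infrastructure are already provided in src/hist.cc):

// step 2: around line 384
h_(name_of_observable)

// step 3: around line 553, inside the event loop
double name_of_observable = 0; // replace with the value of the observable for this event
h_name_of_observable(name_of_observable); // fill; event weights are applied automatically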

Detailed guide

Software requirements

C++ compiler

A C++ compiler supporting C++20 features is required. For GCC this means version 10.1.0 or higher.

ROOT

Get at: https://root.cern/

A recent enough version of ROOT that allows compilation with the -std=c++20 flag is required. The programs have been tested with versions 6.18/04 and 6.20/04. Any newer version should work. Older versions are most likely fine as well, as long as the code compiles.

LHAPDF

Get at: https://lhapdf.hepforge.org/

LHAPDF is required to compute scale variations and PDF uncertainties.

FastJet

Get at: http://fastjet.fr/

FastJet is required to cluster final state particles into jets.

Installation

Check out the repository directly by running the following command:

git clone --recursive https://github.com/ivankp/ntuple_analysis.git

The ntuple_analysis repository contains a histograms submodule, and the --recursive argument is necessary to check it out as well. If you forget --recursive, you'll need to run

git submodule init
git submodule update

inside the repository after it is cloned.

To compile the programs, run make. To compile in parallel, run make -j.

If you have a GitHub account and are familiar with git, I recommend forking the repository. This will give you a personal copy of the repository, to which you can commit on your own, while still being able to synchronize your fork with the upstream repository to pick up bug fixes or new features.


Running an analysis

Defining an analysis

An example histogramming analysis is defined in src/hist.cc. The simplest approach is to just make changes to that file.

A slightly less straightforward approach, but a more sensible one in the long run, is to copy src/hist.cc to another file under src/. You can make as many copies as you like and define different analyses for different studies. The only complication is that you'll have to specify the compilation flags and dependencies for the new program, but this is simple to do. Just copy the paragraph in the Makefile that starts with

C_hist :=
LF_hist :=
L_hist :=
bin/hist:

and replace the string hist in it with the path to the new source file relative to the src/ directory, omitting the file extension. For example, for a hypothetical source file src/studies/vbf.cc, hist becomes studies/vbf, giving C_studies/vbf, LF_studies/vbf, L_studies/vbf, and bin/studies/vbf.

Executables don't have to be manually added to the all target, because the Makefile contains a script that finds them. New analysis code can be placed in (possibly nested) directories under src/. This directory structure will be replicated inside bin/ for the compiled executables.

While hist.cc contains a lot of code, most of it is there to provide a basic infrastructure for defining an analysis. In fact, you probably won't need to understand or edit most of it. Though a great deal of automation can be achieved via manipulation of the histogram bin types, the typical usage should only require defining names for the histograms and defining the physics observables to fill them.

Although I used my own histogramming library, which relies on template meta-programming techniques, the program writes the output in the ROOT format as TH1Ds arranged in a hierarchy of directories inside a TFile. In my opinion, the complexity drawback of a custom histogramming library is completely outweighed by the benefits of flexibility, code reduction, automation, and the ability to express complex structures by composing conceptually simple building blocks via the type system. For example, by defining a custom bin type, every histogram can be automatically filled for all defined weights, all the different initial states (gg, gq, qq), etc.
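To illustrate that last point, here is a minimal conceptual sketch, not the actual ivanp::hist bin type: a bin that keeps one accumulator per event weight lets a single fill update every weight variation at once.

#include <cstddef>
#include <vector>

// Conceptual sketch only: a bin type that accumulates one sum per event weight,
// so that filling a histogram once updates all scale/PDF weight variations together.
struct multiweight_bin {
  std::vector<double> sums; // one running sum per defined weight
  void operator()(const std::vector<double>& weights) {
    if (sums.size() < weights.size()) sums.resize(weights.size());
    for (std::size_t i = 0; i < weights.size(); ++i) sums[i] += weights[i];
  }
};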

Defining histograms

The logically most important part of the analysis program is the event loop. As far as histogramming is concerned, the histogram objects have to be defined outside the loop and filled inside it.

Definitions of the histograms should be located just below the comment

// Histograms of main observables #################################

The h_ macro simplifies histogram definition by combining and automating the following 3 steps:

  • Histogram object definition;
  • Matching the name of the histogram with the axes definition;
  • Addition of the histogram pointer to a global array which is later used to iterate over all the defined histograms to write them to the output file.

All histogram objects have the same type, hist_t, which is an alias for an instance of the ivanp::hist::histogram class template.

For simplicity, I suggest using the following convention. Say you are interested in the distribution of the leading jet transverse momentum.

  • Define a histogram h_j1_pT by writing h_(j1_pT).
  • Inside the event loop define the variable double j1_pT and compute its value.
  • Fill the histogram using h_j1_pT(j1_pT).

Rather than providing a Fill member function à la ROOT, the histogram class overloads operator() to perform the same task with less typing.

Event weights are handled automatically through the bin type. Do not pass them to the histogram when filling it.

The physics variables should be defined and histograms filled inside the event loop below the comment

// Define observables and fill histograms #######################
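Putting the convention together, here is a sketch of the two spots in hist.cc; the jets variable is a stand-in for the pT-sorted jet collection that hist.cc builds, and the actual names there may differ:

// Below "Histograms of main observables": define the histogram
h_(j1_pT)

// Below "Define observables and fill histograms", inside the event loop:
const double j1_pT = jets.front().pt(); // leading-jet pT; `jets` is a stand-in name
h_j1_pT(j1_pT); // operator() fills the histogram; do not pass the event weight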

Defining binning

As defined in hist.cc, all histograms have the same type. They have a fully dynamic number of dimensions, determined by the axes' definitions.

To allow changes to the histograms' binning to be made without recompiling the analysis code, axes are defined in the runcard either directly or by referencing a dedicated binning file.

An example binning file is included in run/binning.json.

Binning is defined in JSON format in terms of nested arrays.

The root array lists all the definitions. Each element is an array of 2 elements: the first is a regular expression that is matched against the names of the defined histograms; the second is an array of axes. The regular expression need not be complicated and may simply be the exact histogram name. However, using regular expressions rather than simple string comparison makes it easy to specify the same binning for a whole class of histograms. The regular expressions are matched in the order of their definition. As in the example, a final ".*" definition may be used as a catch-all clause. Otherwise, if any histogram is left unmatched, an error is thrown and the program terminates.

The first level in the axes definitions corresponds to the histogram's dimensions. Thus, for a simple 1D histogram it should contain 1 element.

The second level represents sub-binning. This array would have more than 1 element only if the histogram has at least 2 dimensions and requires different binning in the second dimension for different bins in the first dimension.

The third level is an individual axis definition. This array has to have at least 1 element. The elements may either be numbers, specifying bin edges, or arrays, specifying ranges of bins.

All histograms contain underflow and overflow bins in all dimensions.

Examples:

  • [[[ 1,5,10 ]]]: a 1D histogram with bin edges at 1, 5, and 10.
  • [[[ [0,10,5] ]]]: a uniformly binned 1D histogram with 5 bins between 0 and 10.
  • [ [[ 0 ]], [[ [0,100,2], 1e3 ]] ]: a 2D histogram with only underflow and overflow divided at 0, and no other bins in the first dimension, and the following bins in the second dimension: [0,50), [50,100), and [100,1000).

Defining runcards

Binning and other run parameters are specified in a runcard in JSON format. A minimal runcard is shown below.

{
  "input": {
    "files": [ "ntuple.root" ]
  },
  "binning": "binning.json",
  "output": "histograms.root"
}

Additional parameters can be specified as shown in this more complete example.

{
  "input": {
    "files": [ "ntuple.root" ],
    "entries": 10000,
    "tree": "t3"
  },
  "jets": {
    "algorithm": [ "antikt", 0.4 ],
    "cuts": { "eta": 4.4, "pt": 30 },
    "njets_min": 0
  },
  "photons": {
    "higgs_decay_seed": 0
  },
  "binning": [
    [".*", [[[ [-1e3,1e3,200] ]]] ]
  ],
  "output": "histograms.root",
  "reweighting": [{
    "ren_fac": [ [1,1],[0.5,0.5],[1,0.5],[0.5,1],[2,1],[1,2],[2,2] ],
    "scale": "HT1",
    "pdf": "CT14nlo",
    "pdf_var": true
  }]
}

Running the histogramming program

The first argument to the analysis program is the name of the runcard file. If additional arguments are provided, they are interpreted as the names of input ntuples, in which case the list of input files in the runcard is ignored; for example, bin/hist runcard.json ntuple1.root ntuple2.root runs over the two listed ntuples regardless of the runcard's input files.

If the first argument is -, the runcard is read from stdin.

Running on Condor

The script run/submit.py automates submission of histogramming jobs to condor.

Some of its parameters can be controlled via command line arguments. To list them, run ./submit.py -h. All the arguments have default values and most use cases will probably not require passing any of them.

submit.py has to be run in the run directory or, more generally, in the directory where you expect all the output files to be created.

By default, submit.py assumes that you are running the hist analysis. Use the corresponding command-line option (listed by ./submit.py -h) if you are running a different analysis. If a relative path is used, the path to the executable has to be specified relative to the directory in which condor will run the jobs. Thus, if the submit.py script is located inside ntuple_analysis/run/, the path to the executable should be ../../bin/analysis, with the extra .. accounting for the job scripts running in ntuple_analysis/run/condor/.

A typical histogramming analysis running over 25M events will take on the order of tens of minutes. However, if you are also reweighting, the jobs may take considerably longer, especially for the I-type events. The MSU Tier 3 HTCondor is configured to kill jobs submitted to the default queue after 3 hours. To avoid this, pass the -m option to submit.py to submit jobs to the Medium queue. This adds a +IsMediumJob = True line to the condor script.

Probably the most important part of submit.py is the selection variable. It defines the criteria used to select the set of input ntuples. Each element of the selection list is an object defining all possible properties characterizing the ntuples. If multiple values are listed for any property, all possible combinations of the listed properties are searched for in the database. Multiple objects can be listed in selection to avoid undesired combinations, as each element of selection is considered independently.

After you edit submit.py, before you submit the condor jobs, it is a good idea to make a dry run that generates all the job scripts but doesn't submit them. This way selection mistakes can be caught early by looking at the names of the generated scripts in the condor_ directory and looking at the runcards written inside a few of these scripts, to make sure that all parameters appear as desired. A dry run can be performed by passing the -x option to submit.py.

When submit.py is run, it creates a condor_ directory, where all the condor and job scripts are generated. The script makes use of the DAGMan feature of HTCondor, so that a merging job runs after all the ntuple-processing jobs have finished, without having to wait and run the merging manually. If a merge.sh script is present in the run directory, it is executed when all the histogramming jobs complete. Additionally, if a finish.sh is present, it is run afterwards. The default merge.sh script automatically determines, based on the files' names, which ones to merge from the separate runs for the same B, RS, I, or V part, and afterwards, if all parts are present, merges them into the NLO result. This is done by the merge program, which also scales the histograms to the differential cross section.

If you did reweighting and would like to combine scale and PDF variation histograms into envelopes, this can be done using the envelopes program.

Implementation details

Histogramming library

Containers library

branch_reader

JSON parser

Advanced topics

Making use of additional branches

Changing bin types and tags

Changing output format

Adding dependencies in Makefile
