Update README.md to reflect analysis split
AlexSchuy authored Jun 1, 2018
1 parent d84a76f commit 1d276fd
Showing 1 changed file with 5 additions and 3 deletions.
README.md
@@ -7,11 +7,13 @@ This project is intended to be run on the TeV cluster at the UW Department of Physics.
 First, clone the repository from git to a convenient location (ssh is recommended). Once you have cloned the repository, source the package_setup.sh script in bash. This will install the required packages and set up your environment. Note that pipenv is used to maintain a virtual python environment, so you must either start the pipenv virtual environment by running `pipenv shell` or run all scripts by prepending `pipenv run`, e.g., `pipenv run python package_test.py`. See the pipenv documentation for more information.
 
 ## Running Analysis
-The main script is analysis.py; if you run `pipenv run python analysis.py -h`, you will see an up-to-date list of available commands. analysis.py supports training and testing several different models, or comparing all of the models, on all or part of the quark/gluon data, either locally or using the full TeV cluster and with or without hyper-parameter optimization. Note that a full analysis comparing all of the models with hyper-parameter optimization on the full data set can take weeks, even using the entire TeV cluster.
+The main scripts are train.py and metrics.py; if you run `pipenv run python train.py -h`, you will see an up-to-date list of available commands. train.py supports training several different models on all or part of the quark/gluon data, either locally or on the full TeV cluster, with or without hyper-parameter optimization. The trained model produced by train.py is stored in a new 'run directory', specified relative to RUNS_PATH as defined in constants.py. You can specify a name for the run directory, or a default one will be created for you based on the parameters passed to train.py.
 
-By default, analysis.py uses the TeV cluster. To do so, you must first run `source dask_ssh.sh` on the scheduler machine (tev01). This will spawn a dask scheduler on tev01 and dask workers on the other tev machines. Once you're done running analysis.py, you can terminate these processes by pressing CTRL + C.
+By default (i.e., without `--local` specified), train.py uses the TeV cluster to speed up computations. To do so, you must first run `source dask_ssh.sh` on the scheduler machine (tev01); note that this script is currently broken. This will spawn a dask scheduler on tev01 and dask workers on the other tev machines. Once you're done running train.py, you can terminate these processes by pressing CTRL + C.
 
-The fastest model to train is the Naive Bayes model, so let's use that as an example. Run `pipenv run python analysis.py -m "NB" --print_report --local --no_hyper` and you should see a classification report for the Naive Bayes model. If you want to use the cluster, try running `pipenv run python analysis.py -m "GBRT" --print_report --max_events 1000000`, and a scikit-learn Gradient-Boosted Regression Tree (more commonly known as a boosted decision tree or BDT) will be fit to the data.
+The fastest model to train is the Naive Bayes model, so let's use that as an example. Run `pipenv run python train.py -m NB --local --no_hyper`. If you want to use the cluster (again, currently broken), try running `pipenv run python train.py -m GBRT --max_events 1000000`, and a scikit-learn Gradient-Boosted Regression Tree (more commonly known as a boosted decision tree, or BDT) will be fit to the data.
 
+Once you've created and trained a model, you can see performance metrics by running metrics.py. Run `pipenv run python metrics.py` and several plots will be created and saved in the run directory. An optional `--run_dir` parameter specifies the name of the run directory your model is saved in; by default, the most recently modified run directory is used.
+
 ## Future Work
 The next step is to create and test a convolutional neural network (CNN) model in keras, based on [this](https://arxiv.org/pdf/1612.01551.pdf) research, which was [presented](https://indico.cern.ch/event/579660/contributions/2582125/attachments/1494989/2325705/ptk_boost_2017.pdf) recently at Boost 2017. We can initially utilize a 2-dimensional CNN, using pseudorapidity (eta) and azimuthal angle (phi) as coordinates while meta variables such as transverse momentum (pt) can serve as image channels. However, in order to study long-lived particles, it will likely be crucial to extend to a 3-dimensional CNN by including depth information, as an important indicator of such particles is their displaced vertex in the third dimension.
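The jet-image representation described above can be sketched in plain numpy: constituents are binned on the (eta, phi) plane, with summed transverse momentum pt serving as the pixel intensity of one image channel. This is an illustrative helper, not part of this repository; the function name, bin count, and image extent are assumptions.

```python
import numpy as np

def jet_image(eta, phi, pt, bins=32, extent=0.4):
    """Bin jet constituents into a (bins, bins) image on the (eta, phi)
    plane, with summed transverse momentum pt as the pixel intensity.

    Coordinates are taken relative to the pt-weighted jet centroid
    (phi wraparound at +/-pi is ignored here for simplicity)."""
    eta, phi, pt = (np.asarray(a, dtype=float) for a in (eta, phi, pt))
    # Center the image on the pt-weighted centroid of the jet.
    eta_c = np.average(eta, weights=pt)
    phi_c = np.average(phi, weights=pt)
    image, _, _ = np.histogram2d(
        eta - eta_c, phi - phi_c,
        bins=bins,
        range=[[-extent, extent], [-extent, extent]],
        weights=pt,
    )
    return image  # shape (bins, bins); a single pt channel
```

A 2-dimensional CNN would then consume a stack of such per-jet channels (pt here; other meta variables analogously), while the 3-dimensional extension would add depth information as a third binning axis.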
