Provided Software
This page describes the usage of the two scripts generate_data.py and evaluate.py provided in this repository. It does not describe any auxiliary code or the details of the implementation. If you encounter any problems, find any bugs, or need help, please contact us via mail at [email protected] or via Slack. For more details see our support page.
To run the code in this repository, a working installation of Python 3.7 or higher and a sufficiently recent version of PyCBC are required. If you need to install Python 3.7, make sure to also install the matching Python development libraries. On Ubuntu the commands are
sudo apt-get install python3.7
sudo apt-get install python3.7-dev
We recommend using a virtual environment for this mock data challenge. To create and activate one you can use virtualenv
virtualenv -p python3.7 <env-name>
source <env-name>/bin/activate
To install an appropriate version of PyCBC, install the requirements after cloning the repository:
pip install --upgrade pip setuptools
pip install -r requirements.txt
To download the code you can simply clone this repository to a suitable location on your machine by executing the command below in the desired directory.
git clone https://github.com/gwastro/ml-mock-data-challenge-1.git
The script generate_data.py contains the code to generate mock data for testing. Its behavior is controlled by several options. An example call specifying the most common options would be
./generate_data.py \
--dataset 1 \
--output-injection-file injections.hdf \
--output-foreground-file foreground.hdf \
--output-background-file background.hdf \
--seed 42 \
--start-offset 0 \
--duration 32000 \
--verbose
- The --dataset option specifies how the noise is generated and which injections are made. For details please refer to this page.
- All options prefixed by output specify where files generated by the code will be stored. The --output-injection-file contains the parameters of the injected signals. The --output-background-file contains the pure detector noise. The --output-foreground-file contains the same noise with signals injected into it. For details on the structure of the foreground and background files please refer to this page.
- The --seed is used to make the noise and signal generation reproducible. Two calls to the script with the same --dataset, --seed, --start-offset, and --duration will yield identical results. If the seed is not specified it defaults to 0 and is not random! To use a random seed on each invocation of the program, pass a negative number as the seed.
- The --start-offset specifies at which time to start generating noise. It must be greater than or equal to zero. All noise generation starts from the same reference time. You may want to alter this value if you want to produce a large amount of noise on multiple machines in parallel. The --start-offset for the second call would in that case be the value given to the --duration of the first call. In other words, this option tells the code how much data to skip at the beginning and where to start generating.
- The --duration specifies how much data is generated (in seconds). Note that in total only 7111579 seconds of data are available and for technical reasons no more than 7024699 should be requested. We recommend staying well below these limits.
- The option --verbose prints status updates to the screen.
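The parallel use of --start-offset and --duration described above can be sketched in a few lines of Python. This is only an illustration, not part of the repository: the helper names chunk_plan and build_command and the per-chunk output file names are hypothetical. Keeping --dataset and --seed fixed across chunks, and applying the 7024699 second limit to the total requested duration, are assumptions.

```python
# Sketch: split one long request into contiguous chunks for parallel generation.
# The helper names and output file names are hypothetical, not from the repository.

MAX_REQUESTABLE = 7024699  # seconds; upper limit stated on this page


def chunk_plan(total_duration, n_chunks):
    """Return contiguous (start_offset, duration) pairs covering total_duration seconds."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    if not 0 < total_duration <= MAX_REQUESTABLE:
        raise ValueError("total_duration must be positive and within the stated limit")
    base, extra = divmod(total_duration, n_chunks)
    plan = []
    offset = 0
    for i in range(n_chunks):
        # Spread the remainder over the first `extra` chunks.
        duration = base + (1 if i < extra else 0)
        if duration == 0:
            continue
        plan.append((offset, duration))
        # The next chunk starts where this one ends.
        offset += duration
    return plan


def build_command(index, start_offset, duration, dataset=1, seed=42):
    """Build the argument list for one generate_data.py call (assumed fixed dataset and seed)."""
    return [
        "./generate_data.py",
        "--dataset", str(dataset),
        "--output-injection-file", f"injections_{index}.hdf",
        "--output-foreground-file", f"foreground_{index}.hdf",
        "--output-background-file", f"background_{index}.hdf",
        "--seed", str(seed),
        "--start-offset", str(start_offset),
        "--duration", str(duration),
    ]


if __name__ == "__main__":
    for i, (offset, duration) in enumerate(chunk_plan(32000, 3)):
        print(" ".join(build_command(i, offset, duration)))
```

Each printed command can then be run on a different machine. The first chunk starts at offset 0 and every later chunk starts at the sum of the preceding durations, which matches the rule stated for the second call.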
Additionally, you may want to generate injections only once and use them for multiple data sets. In this case you can omit the option --output-injection-file and instead set the option --injection-file to the path of the injection file you want to use. The --injection-file is expected to be in the format output by --output-injection-file.
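As a rough sketch of reusing one injection file, the following Python snippet builds one generate_data.py call per data set, each pointing at the same --injection-file. The function name, the output file names, and the data set numbers in the usage line are hypothetical choices for illustration. Only flags mentioned on this page are used.

```python
# Sketch: reuse a single injection file for several data sets.
# Function name and output file names are hypothetical.


def commands_with_shared_injections(injection_file, datasets, duration=32000, seed=42):
    """Return one argument list per data set, all using the same injection file."""
    commands = []
    for dataset in datasets:
        commands.append([
            "./generate_data.py",
            "--dataset", str(dataset),
            # --output-injection-file is omitted; the existing file is passed instead.
            "--injection-file", injection_file,
            "--output-foreground-file", f"foreground_ds{dataset}.hdf",
            "--output-background-file", f"background_ds{dataset}.hdf",
            "--seed", str(seed),
            "--start-offset", "0",
            "--duration", str(duration),
        ])
    return commands


if __name__ == "__main__":
    for cmd in commands_with_shared_injections("injections.hdf", [1, 2]):
        print(" ".join(cmd))
```

The argument lists can be passed to subprocess.run(cmd, check=True) to execute the calls from Python.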
For further options and a description of them please refer to
./generate_data.py -h
Note that the code will download a file called segments.csv. This file contains information on which GPS times to use for data generation. Irrespective of the data set specified by --dataset, data will be generated in these segments. ATTENTION! If you specify --dataset 4, the code will start to download a large (~94 GB) file containing real noise downsampled to 2 kHz. You can interrupt this download at any time and the function will pick up where it left off. However, the code is not able to generate any data for data set 4 before this file is downloaded completely. You can also download the file directly via
python -c "from generate_data import download_data; download_data()"
or from the URL https://www.atlas.aei.uni-hannover.de/work/marlin.schaefer/MDC/real_noise_file.hdf.
For more control over the data generating process the functions from the script can be called directly. We consider this advanced usage and do not document it beyond the comments in the code.
The script evaluate.py contains the functionality to compute the false-alarm rate (FAR) as well as the sensitivity of the search algorithm. As input it requires the file containing the injections, the file containing the foreground input data, and the event files returned by the search algorithm applied to the foreground and background data. It returns an HDF5 file containing many different datasets. The most important of these are labeled far and sensitive-distance. They are of the same length, and each value of sensitive-distance corresponds to the far value at the same index. To plot them, they have to be sorted by the far values. An example call to the script would be
./evaluate.py \
--injection-file injections.hdf \
--foreground-events <path to output of algorithm on foreground data> \
--foreground-files foreground.hdf \
--background-events <path to output of algorithm on background data> \
--output-file eval-output.hdf \
--verbose
The options mean the following:
- The option --injection-file specifies the injections that were used to create the foreground data. It corresponds to the output of generate_data.py --output-injection-file or the path given to generate_data.py --injection-file.
- The option --foreground-events specifies the output of the search algorithm that was obtained using the foreground file returned by generate_data.py --output-foreground-file. For details on the structure of these files please refer to this page. Multiple paths may be provided if the input data was split into multiple parts.
- The option --foreground-files specifies the foreground data that was used as input to the algorithm. This file is only used to determine which injections were actually contained in the foreground data and how much data was analyzed. It has to be the file created by generate_data.py --output-foreground-file. Multiple paths may be provided if the input data was split into multiple parts.
- The option --background-events specifies the output of the search algorithm that was obtained using the background file returned by generate_data.py --output-background-file. For details on the structure of these files please refer to this page. Multiple paths may be provided if the input data was split into multiple parts.
- The option --output-file specifies where the analysis output should be stored.
- The option --verbose tells the script to print status updates.
- An option --force exists to allow the code to overwrite existing files.
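To prepare the far and sensitive-distance datasets for plotting, they need to be sorted together by the far values. A minimal sketch of that step is given below. The function names are hypothetical, and it is an assumption that both datasets sit at the top level of the output file and that h5py is available in the environment.

```python
# Sketch: sort the evaluation output by FAR before plotting.
# Function names are hypothetical; the dataset location is an assumption.


def sort_by_far(far, sensitive_distance):
    """Return both sequences reordered so that the FAR values are ascending."""
    if len(far) != len(sensitive_distance):
        raise ValueError("far and sensitive-distance must have the same length")
    # Indices that sort the FAR values; the same order is applied to the distances.
    order = sorted(range(len(far)), key=lambda i: far[i])
    return [far[i] for i in order], [sensitive_distance[i] for i in order]


def load_sorted(path):
    """Read both datasets from the file written via --output-file and sort them."""
    import h5py  # imported here so sort_by_far works without h5py installed

    with h5py.File(path, "r") as fp:
        # Assumption: the datasets are stored at the top level under these labels.
        far = fp["far"][()]
        distance = fp["sensitive-distance"][()]
    return sort_by_far(list(far), list(distance))
```

Calling load_sorted("eval-output.hdf") on the file from the example call above would then give two lists that can be passed directly to a plotting library, with the FAR on the horizontal axis.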