_______ ____ _/ /__ /_____ a software framework for numerical approximations / __ `/ / /_ >/ ___/ of posterior distributions by Approximate / /_/ / /___/ / /__ Bayesian Computation Sequential Monte Carlo \__,_/_//____/\___/ in parallel
al3c is written for two types of users:
-
Those who have written their own software to address niche scientific questions, and would like to speed things up by using ABC in parallel without too much parallel programming or effort.
-
Those seeking use ABC with coalescent theory models.
To accomodate the latter users, we provide a number of examples using al3c with MaCS, a Markovian coalescent theory simulator, which can easily be modified for the user's specific case.
al3c: High-performance software for parameter inference using Approximate Bayesian Computation in Bioinformatics
For help with using al3c, please email Alexander Stram at [email protected]
We provide an al3c example for parallelized ABC-SMC using the coalescent simulator MaCS, available at https://github.com/gchen98/macs
Here are some quick examples that can get you running al3c without much thought:
Warning! If you have MPI libraries already installed, please skip to the "From source" example. Existing MPI libraries will likely interfere with the Linux and OS-X binaries we provide below.
git clone https://github.com/ahstram/al3c.git
cd al3c/
bin/al3c_linux_x86-64 config/macs_linux-x86-64.xml
git clone https://github.com/ahstram/al3c.git
cd al3c/
bin/al3c_darwin_x86-64 config/macs_darwin-x86-64.xml
To compile the example from source, you will need to ensure that an MPI library such as Intel MPI Library, OpenMPI or MPICH2 is installed, then follow:
git clone https://github.com/ahstram/al3c.git
cd al3c/
## The following command makes the al3c binary
make
## The following command compiles the MaCS binary, which requires C++ Boost libraries
git clone https://github.com/gchen98/macs.git macs-binary && cd macs-binary && make && ln -s ../macs-binary/macs ../macs/macs && cd ..
## This makes the MaCS plugin, which uses the above MaCS binary
cd macs/ && make && cd ../
## This initiates al3c, using the "config/macs.xml" configuration, which loads the MaCS plugin, which in turn calls the MaCS binary
bin/al3c config/macs.xml
Modifications to the make.inc files in the al3c/ and al3c/macs/ directory may be necessary for your environment. ##Tutorial
al3c consists of three components: the al3c binary, a user-defined shared library (.so file), and an XML file, which specifies parameters to be used with the user-defined shared library (.so file).
___________ | | | XML file | ________________ |___________| | | \ _________ | | \| | | | | al3c | -----> | desired output | /|_________| | | ___________ / | | | | |________________| | .so file | |___________|
No modification is necessary to the al3c binary-- it is readily available in the git repository.
We must create a shared library which will specify the:
- parameter space
- prior distribution
- parameter perturbation kernel
- simulation model
- distance function to compare observed & simulated data
and an XML configuration file, which will specify
- the location of the shared library (.so file)
- the observed data
- number of processors to use
- basic ABC-SMC parameters
A rough diagram of the relation between functions in our shared library follows:
void user_t::prior(); | +-----------------------+ | | | | V | | user_t::perturb();<-----+ | | | | | | | | | | V | | +-----> user_t::simulate(); | | | | | | | | V | | user_t::distance();------+ | | | | prior_density(); | (weight) | | perturb_density(); | | | V | user_summary_t::summarize(); | | | | | V | user_t::print(); ---------------+ | | V (output file)
###Compiling the al3c binary
Note that compiling al3c requires MPI libraries be installed. To avoid installing MPI libraries on a Linux 64-bit or OS X system, you may instead use the static binary "bin/al3c_linux_x86-64" (Linux) or "bin/al3c_darwin_x86-64" (OS X) and skip this step.
The following commands will obtain the al3c source code and and compile the al3c binary
git clone https://github.com/ahstram/al3c.git
cd al3c
make
###Coding the shared library Shared libraries for use with al3c should be written in C++, and include the "al3c.hpp" file which is found in the "include" directory.
al3c.hpp defines two functions:
u01, which must be used as the shared library's random number generator, or used to seed any external random number generator, in order to ensure that different processors do not use the same random number seed.
exec_cmd, an optional function call external programs to simulate data
al3c.hpp declares two templates:
user_t, the users' parameters of interest
user_summary_t, a summary statistic for user_t
The user must then define all functions listed in C/C++ Requirements, in order to complete the definition of these two templates.
For a sample shared library's source code, please see this sample. ###Compiling the shared library
We may compile our source file as a shared library using the "-shared" and "-fPIC" compilers flags, like such:
gcc -shared -o macs.so -fPIC macs.so.cpp
###XML configuration
The XML configuration must define all nodes listed in XML Requirements.
A sample XML file is available here.
###Running al3c al3c is run with the XML files to load as its sole argument, for example:
bin/al3c config/macs.xml
##Reference Manual
1 C/C++ Provisions 1.1 Variables 1.1.1 param 1.1.2 param_summary 1.1.3 S 1.1.4 O 1.1.5 N 1.1.6 D 1.1.7 d 1.1.8 w 1.2 Functions 1.2.1 u01 1.2.2 exec_cmd 2 C/C++ Requirements (sample) 2.1 Parameter & Parameter Summary Statistic Data Types 2.1.1 param_t 2.1.2 param_summary_t 2.2 Printing Parameter Data Type & Summarizing Acceptances 2.2.1 user_t::print 2.2.2 user_summary_t::summarize 2.3 Generating the Prior Distribution 2.3.1 user_t::prior 2.3.2 user_t::prior_density 2.4 Perturbing Parameters 2.4.1 user_t::perturb 2.4.2 user_t::perturb_density 2.5 Simulating Data & Measuring Goodness of Fit 2.5.1 user_t::simulate 2.5.2 user_t::distance 3 XML Requirements (sample) 3.1 lib - Shared library 3.2 MPI 3.2.1 NP - Number of Processors 3.3 ABC 3.3.1 G - Generations 3.3.5 R - Rank 3.3.4 E - Epsilon 3.3.3 A - Acceptances 3.3.2 T - Trials 3.4 O - Observed data
###C/C++ Provisions
####u01
float u01();
Uniform random number generator
Example
seed=u01();
seed_rng(seed);
while (1) {
x=rng();
...
}
####exec_cmd
uint exec_cmd(const char *cmd);
Run command via system call
Example
See simulate()
###C/C++: Parameter & Parameter Summary Statistics
####param_t
struct param_t;
A user defined struct with the parameters we are investigating
Example
struct param_t {
float MigrationRate_EurToAfr,
MigrationRate_AsnToAfr,
MigrationRate_AsnToEur,
EffectivePopulationSize_Afr,
GrowthRate_Eur,
GrowthRate_Asn,
PastEvent_EurToAfrMigration;
};
const char *framework_t::print();
Format parameter for printing
Relevant Variables
param_t *param // parameter to format
float *d // the distance of param's simulation
float *w // // the weight of param
bool header // printing header or not
Return Value
const char *d <return value>;
Example
user_t::print(bool header) {
ofstringstream output;
if (header)
output<<"#distance weight MigrationRate_EurToAfr MigrationRate_AsnToAfr MigrationRate_AsnToEur EffectivePopulationSize_Afr GrowthRate_Eur GrowthRate_Asn PastEvent_EurToAfrMigration"<<endl;
else
output<<*d<<*w<<param->MigrationRate_EurToAfr<<param->MigrationRate_AsnToAfr<<param->MigrationRate_AsnToEur<<param->EffectivePopulationSize_Afr<<param->GrowthRate_Eur<<param->GrowthRate_Asn<<param->PastEvent_EurToAfrMigration<<endl;
return output.str().c_str();
}
####param_summary_t
struct param_summary_t;
A user defined struct giving summary statistics necessary for a dynamic perturbation kernel
Example
struct param_summary_t {
float MigrationRate_EurToAfr_Variance,
MigrationRate_AsnToAfr_Variance,
MigrationRate_AsnToEur_Variance,
EffectivePopulationSize_Afr_Variance,
GrowthRate_Eur_Variance,
GrowthRate_Asn_Variance,
PastEvent_EurToAfrMigration_Variance;
};
####summarize
framework_t::summarize();
Write to param_summary_t based on an array of param_t's
Relevant Variables
param_t **params; // parameter to format
uint A; // number of parameters
Return Value
param_summary_t *param_summary; // Desired statistics of accepted parameters
Example
user_summary_t::summarize() {
float m1=0,m2=0;
for (uint a=0;a<A;a++) {
m1+=params[a]->MigrationRate_EurToAfr;
m2+=params[a]->MigrationRate_EurToAfr*params[a]->MigrationRate_EurToAfr;
}
m1/=(float)A;
m2/=(float)A;
param_summary->MigrationRate_EurToAfr_Variance=m2-m1*m1;
/*
...
(repeat for other parameters)
...
*/
m1=0, m2=0;
for (uint a=0;a<A;a++) {
m1+=params[a]->PastEvent_EurToAfrMigration;
m2+=params[a]->PastEvent_EurToAfrMigration*params[a]->MigrationRate_PastEvent_EurToAfrMigration;
}
m1/=(float)A;
m2/=(float)A;
param_summary->PastEvent_EurToAfrMigration_Variance=m2-m1*m1;
}
###C/C++: Generating the Prior Distribution
####prior
void framework_t::prior();
Generate random parameters from the prior distribution
Return Variables
param_t *param; // Parameter to write to according to the desired prior distribution
Example
void user_t::prior() {
param->MigrationRate_EurToAfr=u01()*0.4+0.84; // Unif[0.84,1.24]
param->MigrationRate_AsnToAfr=u01()*0.32+0.16; // Unif[0.16,0.48]
param->MigrationRate_AsnToEur=u01()*0.84+0.72; // Unif[0.72,1.56]
param->EffectivePopulationSize_Afr=u01()*0.2994+1.319; // Unif[1.319, 1.6184]
param->GrowthRate_Eur=u01()*0.31+0.28; // Unif[0.28, 0.59]
param->GrowthRate_Asn=u01()*0.45+0.30; // Unif[0.30, 0.75]
param->PastEvent_EurToAfrMigration=u01()*2.8+4.8; // Unif[4.8,7.6]
}
####prior_density
framework_t::prior_density();
Give the probability density of a parameter generated by prior()
Relevant Variables
param_t *param; // A parameter for which we want the pdf of according to the prior distribution
Return Variables
float <return value>; // The density of param according to the prior distribution
Example
void user_t::prior_density() {
if (0.84<=param->MigrationRate_EurToAfr && param->MigrationRate_EurToAfr<=1.24
&& 0.16<=param->MigrationRate_AsnToAfr && param->MigrationRate_AsnToAfr<=0.48
&& 0.72<=param->MigrationRate_AsnToEur && param->MigrationRate_AsnToEur<=1.56
&& 1.319<=param->EffectivePopulationSize_Afr && param->EffectivePopulationSize_Afr<=1.6184
&& 0.28<=param->GrowthRate_Eur && param->GrowthRate_Eur<=0.59
&& 0.3<=param->GrowthRate_Asn && param->GrowthRate_Asn<=0.75
&& 4.8<=param->PastEvent_EurToAfrMigration && param->PastEvent_EurToAfrMigration<=7.6)
return 1;
else
return 0;
}
###C/C++: Perturbing Variables
####perturb
void framework_t::perturb();
Perturb parameters according to a desired perturbation kernel
Relevant Variables
param_summary_t *param_summary; // A user defined struct giving summary statistics of the last generation's accepted parameters
param_t *param; // The parameter to be perturbed
Return Variables
param_t *param; // The perturbed parameter
Example
void user_t::perturb() {
param->MigrationRate_EurToAfr+=(u01()-0.5f)*sqrt(2*param_summary->MigrationRate_EurToAfr_Variance*12);
/* ... repeat for other parameters ... */
param->PastEvent_EurToAfrMigration+=(u01()-0.5f)*sqrt(2*param_summary->PastEvent_EurToAfrMigration*12);
}
####perturb_density
float framework_t::perturb_density(param_t *old_param);
Give the probability density of a parameter perturbation
Relevant Variables
param_t *old_param; // A parameter from last generation's acceptances
param_t *param; // The perturbed parameter
param_summary_t *summary; // A user defined struct giving summary statistics of the last generation's accepted parameters
Return Variables
float <return value> // The probability density of param being perturbed from old_param, according to the perturbation kernel specified in perturb()
Example
float user_t::perturb_density() {
if ( fabs(param->MigrationRate_EurToAfr - old_param->MigrationRate_EurToAfr) > sqrt(2*param_summary->MigrationRate_EurToAfr_Variance*12)/2.f )
return 0.f;
/* ... repeat for other parameters ... */
if ( fabs(param->PastEvent_EurToAfrMigration - old_param->PastEvent_EurToAfrMigration) > sqrt(2*param_summary->PastEvent_EurToAfrMigration_Variance*12)/2.f )
return 0.f;
return 1.f;
}
###C/C++: Data simulation
####simulate
framework_t::simulate();
Simulate data according to given parameters
Revelant Variables
param_t *param; // The parameter to simulate with
uint N; // The number of rows of simulated/observed data
uint D; // The number of columns (<b>D</b>imensions) of simulated/observed data
Return Variables
float **S; // An N*D array of floating points containing simulated data
Example
user_t::simulate() {
ostringstream cmd;
uint seed=(uint)u01()*UINT_MAX;
cmd<<"macs.sh 718 100000 -s "<<seed<<" -t .001 -I 3 176 170 372 0 -m 2 1 "<<param->MigrationRate_EurToAfr<<" -m 3 1 "<<param->MigrationRate_AsnToAfr<<" -m 3 2 "<<param->MigrationRate_AsnToEur<<" -n 1 "<<param->EffectivePopulationSize_Afr<<" -g 2 "<<param->GrowthRate_Eur<<" -g 3 "<<param->GrowthRate_Asn<<" -eg .0230000 2 0 -eg .0230001 3 0 -ej .0230002 3 2 -em .0230003 2 1 "<<param->PastEvent_EurToAfrMigration<<" -en .0230004 2 0.1861 -ej .051 2 1 -en .148 1 0.731 -r 0.0006"<<endl;
//this is a helper function that will write to **S
exec_cmd(cmd.str().c_str());
}
####distance
float framework_t::distance();
Give the distance between simulated & observed data
Relevant Variables
uint N; // The number of rows of simulated/observed data
uint D; // The number of columns ("D"imensions) of simulated/observed data
float **S; // An N*D matrix of floating points containing simulated data
float **O; // An N*D matrix of floating points containing observed data
Return Variables
float <return value>; // The distance between observed & simulated data
Example
user_t::distance() {
float r=0;
for (uint n=0;n<N;n++)
for (uint d=0;d<D;d++)
r+=pow(S[n][d]-O[n][d],2);
return sqrt(r);
}
###XML: lib
####lib
C++ library to load
Notes
Example
<lib>lib/libmacs.so</lib>
###XML: MPI
####NP Number of processors to use
Notes
This should be the total number of processors al3c will be run with. If there are 8 nodes, each with 8 processors, we specify NP as "64" and let the MPI system do the rest.
Example
<MPI>
<NP>64</NP>
</MPI>
###XML: ABC
####G Minimum number of generations of ABC-SMC to run before al3c quits
Acceptable values
integers in {0,1,...}
Notes
This is the minimum number of generations of ABC-SMC to run.
al3c will not quit until both "G" and "E" are satisfied. If you'd like to quit strictly based on "G", set "E" to be "0".
Example
<ABC>
<G>15</G>
</ABC>
####R Rank of accepted parameter to set next generation's epsilon to
Acceptable values
integers in {1,2,...,A}
integers in {1,2,...,A}G
Notes
After each generation, the distances for each accepted parameter are sorted, least to greatest. If we would like to set the next generation's epsilon ("acceptance threshold") according to the present generation's accepted parameter distances, we do so here. This allows us to adaptively set our epsilon schedule according to quantiles, rather than having to set it explicitly beforehand.
If "A" is set to "1000", and we want to set our epsilon to the simulation with the 250th best rank, we set "250" here. All accepted simulations in the next generation will have a distance less than or equal to this generation's 250th best simulation.
If we would like to change the quantile used for each generation, we can specify R as a tab/space delimited vector.
0 means "ignore". al3c will fall back on an "E" vector to set its epsilon schedule. If "E" is a scalar, acceptances will be made according to "A" and "T" only.
Example
<ABC>
<R>250</R>
</ABC>
<!-- Or, if we'd like to vary the rank over generations, do it like this: -->
<ABC>
<R>750 500 400 300 250 250 250 250 250 250 250 250 250 300 400</R>
</ABC>
####E Maximum distance to observed data all accepted simulations in a generation must have before al3c quits
Acceptable values
reals in [0,∞) strictly decreasing reals in [0,∞)G
Notes
When all accepted parameters in the most recent generation have simulated datasets with a distance from observed data less than or equal to "E", we quit.
If you desire to explicitly set a rejection threshold schedule ("epsilon schedule"), you can do so by letting this value be a vector of G strictly decreasing values, delimited by spaces or tabs. In that case, "R" must be set to "0". Otherwise, use "R" for dynamic rank-based epsilon schedules.
al3c will not quit until both "G" and "E" are satisfied. If you'd like to quit strictly based on "E", set "G" to be "0".
Example
<ABC>
<E>1.4</E>
</ABC>
<!-- Or, if we'd like to set an explicit epsilon schedule, do it like this: -->
<ABC>
<E>100 90 80 70 60 50 40 30 20 10 5 4 3 2 1</E>
</ABC>
####A Number of accepted trials ("acceptances") per generation
Acceptable values
integers in {1,2,...,T}
Notes
This is the number of accepted parameters we desire per generation.
The actual value used may be slightly higher than specified here, so that "NP" divides "A", allowing for even allocation of computation across processors.
Example
<ABC>
<A>1000</A>
</ABC>
####T Minimum number of trials per generation
Acceptable values
integers in {1,2,...}
Notes
This is the minimum number of trials we must have per generation. Each trial is either accepted or rejected. There may be more trials than specified here depending on the values of "E" and "R" are set.
The actual value used may be slightly higher than specified here, so that "NP" divides "T", allowing for even allocation of computation across processors.
Example
<ABC>
<T>10000</T>
</ABC>
###XML: O
####O Observed data to simulate
Notes
This can be a 2-dimensional matrix, rows delimited by newlines, columns delimited by tabs/spaces/columns. The number of rows and columns will be counted, and the values of "N" and "D" will be inferred accordingly.
Example
<O>259 108 103 147
119 53 66 132
100 46 61 128</O>