-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathneutral_simulations.txt
executable file
·223 lines (158 loc) · 10.9 KB
/
neutral_simulations.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
########################################################################################
# #Author: Bárbara Bitarello
#
# #Description:Script for day 2 of popgen course. Some 'ms' commands and R commands
# for reading in results from simulations and making histograms.
#
## Last modified: 06.04.2015
########################################################################################
####################################
####################################
#PART I ############################
#MS: Basic syntax#############
####################################
####################################
#1)In your local machine, open an Rstudio session, so you can do some calculations.
#2) MAC or LINUX: go to the command line and ssh to the server:
ssh <[email protected]
#3) Windows: open PuTTY, put your user name, password and 'lem.ib.usp.br' in the appropriate spaces. Port # is '22'.
#4)In the server, type:
cd /home/BIO5789/<your_user_name>/ #this will take you to your directory
#eg. /home/BIO5789/ouvinte1
#2) now, type
ms 5 1 -t 2
#and
ms 5 1 -t 10
#check the outputs.Each '//' limits the output from a replicate (in this case only 1) and each row of 0s and 1s represents one of the 5 sequences we are
#simulating.
#Question 1: can you guess what 0 and 1 mean in the context? Notice the length of each row of 0/1 is the number of segregating sites.
#Questions 2: In the second case, we see that the number of segregating sites is higher than in the first. Why?
#Answer: the '-t' indicates that what comes after that is the theta parameter (4Nu). The higher theta, the higher the number of mutations
#happening along the genealogy.
#3) now try the same command as in 2, but replacing '1' with 3:
ms 5 3 -t 10 #now we have three replicates.
#PS: 'ms' is a very fast program, so with a model this simple, we can increase the number of replicates A LOT and will still run fast.
################################################################################################
################################################################################################
#Part II #######################################################################################
#RUNNNING SIMULATIONS FOLLOWING A HUMAN DMOGRAPHIC MODEL ######### ######### ####### ####### ###
################################################################################################
################################################################################################
#Note: although the parameters we use from here on are widely used by those who simulate human
#sequences, we are simplifying a lot. In reality, one could simulate subpopulations, population
#structure, migration between subpopulations and recombination. For our purposes, however, we
#will assume there is no substructure between subpopulation (we will simulate only one) and no
#recombination.
#Task: try to simulate a population compatible with what we know from human demographic history,
#and check the estimates for Tajima's D, SS (number os segregating sites) and pi (nucleotide diversity).
#we will do this for
#a) a 'human' population evolving without change in population size
#b) with a ~2-fold increase in size ~148,000 years ago (based on real data)
#b.1-b5) we will see what would happen if this increase in size had occured at other points in time.
#############################################################
#Exercise 1: simulating a human population of constant size.##
#############################################################
########Background info for simulations: #############
#human long-term effective population size=10,000 (this will change in the next exercise)
#human per site neutral mutation rate (mu) = 2.5.10e-08 (this will remain constant in the next exercises, it is the average genomic
#rate for humans.
#we want 10,000 replicates (sounds like a lot, but remember, MS is very fast!)
#theta=4*N*u, where N is the diploid effective pop size (see above) and u is the mutation rate per locus (so, mu*L, where 'L' is
#the length of the sequence we wish to simulate.
#human average gene has something between 10-15 kb, so let's go with 10Kb (10000 bp).
#first, calculate theta.
#theta=4*N*mu*L=10 #chek that you understand this!
#########Begin simulating #############
#Before, we only used 3 replicates of 5 sequences. Now, we will have 1000 replicates of 50 diploid individuals (100 sequences)
#copy and paste into command line:
ms 100 1000 -t 10|less #the pipe takes the output from the previous command and runs a new command using that output as input.
#(use enter to keep moving)
#Now, take the output from those simulations and use the program 'sample_stats' (installed along with 'ms') to calculate some
#summary statistics
#copy and paste:
ms 100 1000 -t 10|sample_stats|less
#notice we have 5 collumns. In reality, we are only interested in the first three for now, so we will save the output
#from sample_stats in a file
#save the output
ms 100 1000 -t 10|sample_stats|cut -f 2,4,6 > pop.const.stats.out
#check out the output again
less pop.const.stats.out
#now let's complicate the model!
#############################################################
#Exercise 2: simulating a human population with increase in N#
#############################################################
#As mentioned earlier, in reality, human demography is well studied, and we can make these simulations a bit more realistic.
#In the African population, there was a ~2-fold increase in the population size around 148,000 years ago. We can model that 'easily'.
########Background info for simulations: #############
#time of the event: 148000 years ago
#assumed generation time for H.sapiens: 20 years #some people use 15, some use 25, depends on your inclination.
#so the time, in generations, is: 148000/20=7400 generations
#MS likes parameters to be modelled in termsof 'units of 4N', and this 'N' is always the current one.
#So, actually, the time of the event should be written as:
7400/(4*N) == 0.1278154
#Current N (the one used for scaling): 14,474
#N before expansion: 7,300
#theta has to be recalculated: 4*14474*(2.5*10^-8)*
#So time= (number of years/generation time)/(4*N) = 0.1278154
#growth_paramter = (Past N/Current N)/(4*N)== 0.5043526/(4*14474)==
#the command line code will look like this
ms nsam nreps -t theta -eN time_of_growth growth_parameter |sample_stats|cut -f 2,4,6 > some.file.name.txt
#which with the actual values is:
ms 100 1000 -t 14.474 -eN 0.1278154 8.711355e-06|less
#Now, as before, save the output from 'sample_stats' in a file:
ms 100 1000 -t 14.474 -eN 0.1278154 8.711355e-06|sample_stats|cut -f 2,4,6 > pop.t0.12.stats.out
#########################################################
#Exercise 3: Explore other times for the expansion event
##########################################################
#In the previous exercise we used actual estimates for the timing of the expansion event in humans.
#Now, relaxing a bit about simulating actual human demographic data, let's explore what happens if we change the
#timing of the expansion event.
#Note: remember these simulations go from present to past. If an expansion event occured 0.12 (*4N) generations ago,
#this means that the population has had around 7,000 generations to evolve.
#If we change this timing to, say, 1 (i.e, 4N generations), then the population has had a longer time to evolve since the
#event. Because the sample we collect from the simulations if of the 'present' time, by changing the time of events we can have
#an idea of how long it takes for a population to recover from such an event (in terms of SS, Tajima's D, etc)
#Changing the timing of expansion is a way of looking at what happens immediately after an expansion (t=0.01), or a long time after (t=4).
#######Let's change the timing of the expansion event and save the output ################
#Run simualtions with the expansion event occuring at times={0.01, 0.05, 0.5, 1, 4} (in the previosu exercise the time was 0.12)
ms 100 1000 -t 14.4 -eN 0.01 8.711355e-06|sample_stats|cut -f 2,4,6 > pop.t0.01.stats.out
ms 100 1000 -t 14.4 -eN 0.05 8.711355e-06|sample_stats|cut -f 2,4,6 > pop.t0.05.stats.out
ms 100 1000 -t 14.4 -eN 0.5 8.711355e-06|sample_stats|cut -f 2,4,6 > pop.t0.5.stats.out
ms 100 1000 -t 14.4 -eN 1 8.711355e-06|sample_stats|cut -f 2,4,6 > pop.t1.stats.out
ms 100 1000 -t 14.4 -eN 4 8.711355e-06|sample_stats|cut -f 2,4,6 > pop.t4.stats.out
###############################################################################################
###############################################################################################
#Part IV: CoPYING FILES INTO LOCAL MACHINE ########### ########### ########### ########### ####
###############################################################################################
###############################################################################################
#Copying files to local machine:
#Open Filezilla and fill in:
#host is lem.ib.usp.br
#user name
#password
#port 22
#find the files you generated in 'lem' and copy to a local folder in your computer.
#Suggestion: create a folder in your home directory called 'bio5789' and put them there. It will help with the R script.
#open R studio
#Now go to the file called 'simulations.r' on this webpage.
#PS: if you really liked the experience of simulating under a coalescent framework, you can try the block below at home.
##############################################################
#############################################################
## THE FOLLOWING BLOCK IS OPTIONAL ##########################
#############################################################
#############################################################
##############################################################
#Exercise 3: a population suffering a bottleneck
##############################################################
#In human demographic models, it is common to model a bottleneck happening at 51,000 years ago (the 'Out-of-Africa' event).
#in reality, to be accurate we should also simulate a population split, where an 'Eurasian' population is generated from this
#bottleneck, but that adds a lot of complexity to the exercise.
#So, being a bit simplistic, let's assume that this 'human' population suffered a bottleneck 51,000 years ago, with N going from 14474 (present) to 1861 (past).
#Hint, theta will change, and the Ne for scaling now will be 1861. Remember this makes no difference in the results, what is used for scaling, as long
#as it is always the same.
#first, recalculate theta using the same parameters, but 1,861 (N). theta=1.871
#calculate the time of the event: divide years by generation time and then by 4N: (51000/20)/(4*1871)=0.3407269
#calculate the growth parameter and run the command (very similar to the previous example): (1871/14474)/(4*1871)= 1.727235e-05
#run
ms 100 1000 -t 1.871 -eN 0.3407269 1.727235e-05|sample_stats|cut -f 2,4,6 > pop.bottleneck.t0.34.txt
################################################################################################