Clarification on input parameters MAF, N, and sdY in coloc for GWAS and eQTL data #178
Comments
for sdY, it looks like your data are standardised, which means sdY=1 for
all genes.
On 19/11/2024 12:09, Alice9503 wrote:
Additionally, I have gene expression data for 12,345 individuals, and
I am calculating `sdY` based on this dataset. I have the following
questions:
1. Should `sdY` be calculated using all 12,345 individuals (since the
   expression data has no missing values), regardless of missing
   genotypes for specific SNPs? Or should `sdY` be calculated
   separately for each SNP, considering only the intersecting samples
   (as some SNPs may have missing genotypes)?
2. If `sdY` should be calculated across all individuals, I computed
   it using the following code:
```
sd_per_gene <- apply(pro_nor_dat[, -1], 2, sd)
```
Here are the results for some genes:
```
> head(sd_per_gene)
     A1BG     AAMDC    AARSD1     ABCA2   ABHD14B      ABL1
0.9997928 1.0000313 0.9998905 1.0001168 0.9999087 1.0000225
```
Most genes have values very close to 1. In this case:
* Is it valid to simplify the analysis by setting `sdY = 1` for all
  genes?
* Or do I need to use the precise `sdY` values calculated for each gene?
Thank you!
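One way to check the "standardised" reading in the reply above is to confirm that every per-gene standard deviation sits within a small tolerance of 1 before fixing `sdY = 1`. The sketch below is illustrative only: `expr`, the gene names, and the tolerance `tol` are assumptions standing in for `pro_nor_dat[, -1]`, not values from the thread.

```r
# Hypothetical stand-in for the expression matrix: 200 individuals x 3 genes,
# scaled so every column has mean 0 and sd 1 (names are illustrative only).
set.seed(1)
expr <- scale(matrix(rnorm(200 * 3), nrow = 200,
                     dimnames = list(NULL, c("geneA", "geneB", "geneC"))))

# Per-gene standard deviation, computed the same way as in the question.
sd_per_gene <- apply(expr, 2, sd)

# Treat the data as standardised if every gene is within a small tolerance of 1.
tol <- 1e-3  # assumed tolerance, not a coloc default
is_standardised <- all(abs(sd_per_gene - 1) < tol)

# One sdY per gene: 1 when standardised, otherwise the empirical value.
sdY_per_gene <- if (is_standardised) {
  rep(1, length(sd_per_gene))
} else {
  unname(sd_per_gene)
}
names(sdY_per_gene) <- names(sd_per_gene)
```

The values reported in the question differ from 1 by roughly 2e-4 at most, so they would pass a tolerance of this size.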
Hi Alice,
Thanks for a very clear question with examples of data - makes answering
much easier! As you have beta and se (or slope and se) and sdY, you
won't need MAF. For sample size, just supply one value - the size of the
sample, not per snp.
hth,
Chris
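As a rough illustration of that advice, the eQTL side could be passed to coloc as a list holding `beta`, `varbeta`, a single `N`, and `sdY`, with no `MAF` entry. This is a minimal sketch, not the maintainer's code: the two rows are copied from the eQTL excerpt quoted below, `sdY = 1` assumes standardised expression, and a real analysis would use every SNP in the region.

```r
# Two rows copied from the eQTL excerpt in the question (illustration only).
eqtl <- data.frame(
  variant_id = c("rs757557694", "rs806731"),
  slope      = c(0.02634526, -0.06758886),
  slope_se   = c(0.03957399, 0.02769577),
  stringsAsFactors = FALSE
)

eqtl_dataset <- list(
  snp     = eqtl$variant_id,
  beta    = eqtl$slope,
  varbeta = eqtl$slope_se^2,  # variance of beta, i.e. the standard error squared
  type    = "quant",          # gene expression is a quantitative trait
  N       = 12345,            # one value: the size of the sample, not per SNP
  sdY     = 1                 # assumes standardised expression
)

# Optional validation if coloc is installed:
# coloc::check_dataset(eqtl_dataset)
```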
On 19/11/2024 10:43, Alice9503 wrote:
Hi,
I am a new user of the `coloc` package and need clarification on how
to correctly set up input parameters, particularly `MAF`, `N`, and
`sdY`. I am working with GWAS and eQTL data, where the GWAS dataset is
much larger than the eQTL dataset.
Example of my data:
GWAS Data:
```
head(gwas)
   CHR_gwas    SNP_gwas POS_gwas A1_gwas A2_gwas N_gwas AF1_gwas   T_gwas SE_T_gwas P_noSPA_gwas  BETA_gwas
      <int>      <char>    <int>  <char>  <char>  <int>    <num>    <num>     <num>        <num>      <num>
1:       18   rs1573362 45455641       A       G 282601 0.501888 -42.2791   43.1588     0.327275 -0.0226980
2:       18  rs11874858 45457818       A       G 282263 0.451565  55.4193   42.9665     0.197111  0.0300194
3:       18   rs4940109 45458078       G       A 282282 0.480057  53.2239   43.1160     0.217040  0.0286306
4:       18   rs4940110 45458519       T       C 282275 0.480053  52.3038   43.1097     0.225026  0.0281438
5:       18  rs57620563 45458763       A       C 282070 0.446226 -48.6179   42.8916     0.257002 -0.0264272
6:       18 rs201752156 45458821     CAT       C 281829 0.485550  51.3425   43.0838     0.233383  0.0276598
     SE_gwas   P_gwas CONVERGE_gwas varbeta_gwas       rs_id
       <num>    <num>         <int>        <num>      <char>
1: 0.0231702 0.327275             1 0.0005368582   rs1573362
2: 0.0232740 0.197111             1 0.0005416791  rs11874858
3: 0.0231933 0.217040             1 0.0005379292   rs4940109
4: 0.0231966 0.225026             1 0.0005380823   rs4940110
5: 0.0233146 0.257002             1 0.0005435706  rs57620563
6: 0.0232106 0.233383             1 0.0005387320 rs201752156
```
eQTL Data:
```
head(eqtl)
   phenotype_id  variant_id start_distance         af ma_samples ma_count pval_nominal       slope   slope_se
         <char>      <char>          <int>      <num>      <int>    <int>        <num>       <num>      <num>
1:         AGRN rs757557694        -991531 0.01089498        612      615   0.50559341  0.02634526 0.03957399
2:         AGRN    rs806731        -989198 0.03244476       1263     1295   0.01467575 -0.06758886 0.02769577
3:         AGRN rs540662756        -972962 0.01857967        973      979   0.69826483  0.01224067 0.03157521
4:         AGRN rs114420996        -961307 0.03668183       1395     1419   0.37630071 -0.02341123 0.02646101
5:         AGRN  rs62637817        -959770 0.02768842       1195     1205   0.30569805 -0.02937410 0.02867709
6:         AGRN  rs62639104        -955190 0.02649279       1150     1158   0.54228284 -0.01783010 0.02925986
     chr
   <int>
1:     1
2:     1
3:     1
4:     1
5:     1
6:     1
```
My Questions and Observations:
1. MAF Calculation
I calculated MAF for the eQTL dataset as
`eqtl$MAF <- pmin(eqtl$af, 1 - eqtl$af)` based on `af` (the ALT allele
frequency). However:
* The GWAS dataset is much larger than the eQTL dataset. Would it be
  more accurate to calculate `MAF` using `AF1_gwas` from the GWAS
  data instead of `af` from the eQTL data?
* Is this the right way to calculate `MAF` from the allele frequency?
2. Sample Size (𝑁)
In my eQTL dataset:
* The intersection of samples between the expression and genotype
  data has 12,345 individuals.
* Expression data has no missing values, but genotype data does,
  meaning each SNP could have a different effective sample size.
* Should I use the overall sample size (12,345) for all SNPs, or
  calculate 𝑁 individually for each SNP (similar to `N_gwas` in the
  GWAS dataset)?
3. Understanding `sdY`
* I understand that `sdY` refers to the standard deviation of the
  trait values (here, the gene expression levels in eQTL data). Right?
* Why can it be estimated using 𝑁 and `MAF`? Isn’t `MAF` a concept
  specific to genotype data, not expression data? Could you explain
  the relationship between these parameters?
I appreciate any clarification and guidance on these issues!
Thank you!
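On the third question, the link between `sdY`, 𝑁 and `MAF` comes from ordinary linear regression rather than from anything specific to expression data. The following is a hedged sketch, not taken from the thread: it assumes an additive genotype coding G in {0, 1, 2}, Hardy-Weinberg equilibrium, and a SNP that explains only a small share of the trait variance. The symbols G and f (the minor allele frequency) are introduced here for illustration.

```latex
\operatorname{Var}(G) = 2f(1-f), \qquad
\operatorname{Var}(\hat\beta) \approx \frac{\operatorname{Var}(Y)}{N\,\operatorname{Var}(G)}
  = \frac{\mathrm{sdY}^2}{2N f(1-f)}
\quad\Longrightarrow\quad
\mathrm{sdY} \approx \sqrt{2N f(1-f)\,\operatorname{Var}(\hat\beta)}
```

Under these assumptions, `MAF` enters only through the variance of the genotype, so `varbeta`, `MAF` and 𝑁 together carry enough information to approximate `sdY`. When `sdY` is known directly, `MAF` is no longer needed for this purpose.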