Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarification on input parameters MAF, N, and sdY in coloc for GWAS and eQTL data #178

Open
Alice9503 opened this issue Nov 19, 2024 · 3 comments

Comments

@Alice9503
Copy link

Hi,

I am a new user of the coloc package and need clarification on how to correctly set up input parameters, particularly MAF, N, and sdY. I am working with GWAS and eQTL data, where the GWAS dataset is much larger than the eQTL dataset.

Example of my data:

GWAS Data:

head(gwas)
   CHR_gwas    SNP_gwas POS_gwas A1_gwas A2_gwas N_gwas AF1_gwas   T_gwas SE_T_gwas P_noSPA_gwas  BETA_gwas
      <int>      <char>    <int>  <char>  <char>  <int>    <num>    <num>     <num>        <num>      <num>
1:       18   rs1573362 45455641       A       G 282601 0.501888 -42.2791   43.1588     0.327275 -0.0226980
2:       18  rs11874858 45457818       A       G 282263 0.451565  55.4193   42.9665     0.197111  0.0300194
3:       18   rs4940109 45458078       G       A 282282 0.480057  53.2239   43.1160     0.217040  0.0286306
4:       18   rs4940110 45458519       T       C 282275 0.480053  52.3038   43.1097     0.225026  0.0281438
5:       18  rs57620563 45458763       A       C 282070 0.446226 -48.6179   42.8916     0.257002 -0.0264272
6:       18 rs201752156 45458821     CAT       C 281829 0.485550  51.3425   43.0838     0.233383  0.0276598
     SE_gwas   P_gwas CONVERGE_gwas varbeta_gwas       rs_id
       <num>    <num>         <int>        <num>      <char>
1: 0.0231702 0.327275             1 0.0005368582   rs1573362
2: 0.0232740 0.197111             1 0.0005416791  rs11874858
3: 0.0231933 0.217040             1 0.0005379292   rs4940109
4: 0.0231966 0.225026             1 0.0005380823   rs4940110
5: 0.0233146 0.257002             1 0.0005435706  rs57620563
6: 0.0232106 0.233383             1 0.0005387320 rs201752156

eQTL Data:

head(eqtl)
   phenotype_id  variant_id start_distance         af ma_samples ma_count pval_nominal       slope   slope_se
         <char>      <char>          <int>      <num>      <int>    <int>        <num>       <num>      <num>
1:         AGRN rs757557694        -991531 0.01089498        612      615   0.50559341  0.02634526 0.03957399
2:         AGRN    rs806731        -989198 0.03244476       1263     1295   0.01467575 -0.06758886 0.02769577
3:         AGRN rs540662756        -972962 0.01857967        973      979   0.69826483  0.01224067 0.03157521
4:         AGRN rs114420996        -961307 0.03668183       1395     1419   0.37630071 -0.02341123 0.02646101
5:         AGRN  rs62637817        -959770 0.02768842       1195     1205   0.30569805 -0.02937410 0.02867709
6:         AGRN  rs62639104        -955190 0.02649279       1150     1158   0.54228284 -0.01783010 0.02925986
     chr
   <int>
1:     1
2:     1
3:     1
4:     1
5:     1
6:     1

My Questions and Observations:

1. MAF Calculation

I calculated MAF for the eQTL dataset as eqtl$MAF <- pmin(eqtl$af, 1 - eqtl$af) based on af (the ALT allele frequency). However:

  • The GWAS dataset is much larger than the eQTL dataset. Would it be more accurate to calculate MAF using AF1_gwas from the GWAS data instead of af from the eQTL data?

  • And the way we calculate MAF with AF, right?

2. Sample Size (𝑁)

In my eQTL dataset:

  • The intersection of samples between the expression and genotype data has 12,345 individuals.

  • Expression data has no missing values, but genotype data does, meaning each SNP could have a different effective sample size.

  • Should I use the overall sample size (12,345) for all SNPs, or calculate 𝑁 individually for each SNP (similar to N_gwas in the GWAS dataset)?

3. Understanding sdY

  • I understand that sdY refers to the standard deviation of the trait values (here, the gene expression levels in eQTL data). Right?

  • Why can it be estimated using 𝑁 and MAF? Isn’t MAF a concept specific to genotype data, not expression data? Could you explain the relationship between these parameters?

I appreciate any clarification and guidance on these issues!

Thank you!

@Alice9503
Copy link
Author

Additionally, I have gene expression data for 12,345 individuals, and I am calculating sdY based on this dataset. I have the following questions:

  1. Should sdY be calculated using all 12,345 individuals (since the expression data has no missing values), regardless of missing genotypes for specific SNPs? Or should sdY be calculated separately for each SNP, considering only the intersecting samples (as some SNPs may have missing genotypes)?

  2. If sdY should be calculated across all individuals, I computed it using the following code:

sd_per_gene <- apply(pro_nor_dat[, -1], 2, sd)

Here are the results for some genes:

> head(sd_per_gene)
     A1BG     AAMDC    AARSD1     ABCA2   ABHD14B      ABL1 
0.9997928 1.0000313 0.9998905 1.0001168 0.9999087 1.0000225

Most genes have values very close to 1. In this case:

  • Is it valid to simplify the analysis by setting sdY = 1 for all genes?

  • Or do I need to use the precise sdY values calculated for each gene?

Thank you!

@chr1swallace
Copy link
Owner

chr1swallace commented Nov 19, 2024 via email

@chr1swallace
Copy link
Owner

chr1swallace commented Nov 19, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants