Skip to content

Commit ddb108b

Browse files
committed
improve phenotype read and simulation exercise
1 parent 7b6c373 commit ddb108b

File tree

8 files changed

+131
-147
lines changed

8 files changed

+131
-147
lines changed

README.md

+20-13
Original file line numberDiff line numberDiff line change
@@ -20,15 +20,15 @@
2020
- **family-PGS analyses**: Compute and analyze polygenic scores (PGS) for a set of individuals along with their siblings and parents, using both observed and imputed parental genotypes. *snipar* can estimate the direct effect (within-family) effect of a polygenic score: see [Simulation Exercise: Polygenic score analyses](https://snipar.readthedocs.io/en/latest/simulation.html#polygenic-score-analyses). It can adjust for the impact of assortative mating on estimates of indirect genetic effects (effects of alleles in parents on offspring mediated through the environment) from family-based PGS analysis: see [Simulation Exercise: Polygenic score analyses](https://snipar.readthedocs.io/en/latest/simulation.html#polygenic-score-analyses).
2121
- **Imputation of missing parental genotypes**: For samples with at least one genotyped sibling and/or parent, but without both parents' genotypes available, *snipar* can impute missing parental genotypes according to Mendelian laws (Mendelian Imputation) and use these to increase power for family-GWAS and PGS analyses. See [Tutorial: imputing-missing-parental-genotypes](https://snipar.readthedocs.io/en/latest/tutorial.html#imputing-missing-parental-genotypes)
2222
- **Identity-by-descent (IBD) segments shared by siblings**: *snipar* implements a hidden markov model (HMM) to accurately infer identity-by-descent segments shared between siblings. The output of this is needed for imputation of missing parental genotypes from siblings. See [Tutorial: inferring IBD between siblings](https://snipar.readthedocs.io/en/latest/tutorial.html#inferring-ibd-between-siblings)
23-
- **Multi-generational forward simulation with indirect genetic effects and assortative mating**: *snipar* includes a simulation module that performs forward simulation of multiple generations undergoing random and/or assortative mating of different strengths. The phenotype on which assortment occurs can include indirect genetic effects from parents. Users can input phased haplotypes for the starting generation or artificial haplotypes can be simulated. Output includes a multigenerational pedigree with phenotype values, direct and indirect genetic component values, and plink formatted genotypes for the final two generations along with imputed parental genotypes. See [Simulation Exercise](https://snipar.readthedocs.io/en/latest/simulation.html)
23+
- **Multi-generational forward simulation with indirect genetic effects and assortative mating**: *snipar* includes a simulation module that performs forward simulation of multiple generations undergoing random and/or assortative mating. The phenotype on which assortment occurs can include indirect genetic effects from parents. Users can input phased haplotypes for the starting generation or artificial haplotypes can be simulated. Output includes a multigenerational pedigree with phenotype values, direct and indirect genetic component values, and plink formatted genotypes for the final two generations along with imputed parental genotypes. See [Simulation Exercise](https://snipar.readthedocs.io/en/latest/simulation.html)
2424
- **Estimate correlations between effects**: Family-GWAS summary statistics include genome-wide estimates of direct genetic effects (DGEs) — the within-family estimate of the effect of the allele — population effects — as estimated by standard GWAS — and non-transmitted coefficients (NTCs), the coefficients on parents' genotypes. The *correlate.py* scipt enables efficient estimation of genome-wide correlations between these different classes of effects accounting for sampling errors. See [Tutorial: correlations between effects](https://snipar.readthedocs.io/en/latest/tutorial.html#correlations-between-effects)
2525

26-
This illustrates an end-to-end workflow for performing family-GWAS in *snipar* although not all steps are necessary for all analyses. For example, family-GWAS (and PGS analyses) can be performed without imputed parental genotypes, requiring only input genotypes in .bed or .bgen format along with pedigree information:
27-
2826
<p align="center">
2927
<img src="docs/snipar_flowchart.png" width="100%" alt="snipar flowchart">
3028
</p>
3129

30+
The above illustrates an end-to-end workflow for performing family-GWAS in *snipar* although not all steps are necessary for all analyses. For example, family-GWAS (and PGS analyses) can be performed without imputed parental genotypes, requiring only input genotypes in .bed or .bgen format along with pedigree information. Also: imputation for parent-offspring pairs can proceed without IBD inference.
31+
3232
# Publications
3333

3434
**Please cite at least one of these publications if you use snipar in your work!**
@@ -72,37 +72,44 @@ And to work through the tutorial: https://snipar.readthedocs.io/en/latest/tutori
7272
*snipar* currently supports Python 3.7-3.9 on Linux, Windows, and Mac OSX (although not currently available for Mac through pip). We recommend using a python distribution such as Anaconda 3 (https://store.continuum.io/cshop/anaconda/).
7373

7474
The easiest way to install is using pip:
75-
76-
pip install snipar
75+
```
76+
pip install snipar
77+
```
7778

7879
Sometimes this may not work because the pip in the system is outdated. You can upgrade your pip using:
79-
80+
```
8081
pip install --upgrade pip
82+
```
83+
84+
Note: installing *snipar* requires the package *bed_reader*, which in turn requires Rust. If an error occurs at "Collecting bed-reader ...", please try downloading Rust following the instruction here: https://rust-lang.github.io/rustup/installation/other.html.
8185

8286
# Virtual Environment
8387

8488
You may encounter problems with the installation due to Python version incompatability or package conflicts with your existing Python environment. To overcome this, you can try installing in a virtual environment. In a bash shell, this could be done either via the *venv* Python package or via conda.
8589

8690
To use venv, use the following commands in your directory of choice:
87-
91+
```
8892
python -m venv path-to-where-you-want-the-virtual-environment-to-be
93+
```
8994

9095
You can activate and use the environment using
91-
96+
```
9297
source path-to-where-you-want-the-virtual-environment-to-be/bin/activate
98+
```
9399

94-
Alternatively, we highly recommend using conda:
95-
100+
Alternatively, we recommend using conda:
101+
```
96102
conda create -n myenv python=3.9
97103
conda activate myenv
104+
```
98105
99106
# Installing From Source
100107
To install from source, clone the git repository, and in the directory
101108
containing the *snipar* source code, at the shell type:
102-
109+
```
103110
pip install .
104-
105-
Note: installing *snipar* requires the package *bed_reader*, which in turn requires installing Rust. If error occurs at "Collecting bed-reader ...", please try downloading Rust following the instruction here: https://rust-lang.github.io/rustup/installation/other.html.
111+
```
112+
Note: installing *snipar* requires the package *bed_reader*, which in turn requires Rust. If an error occurs at "Collecting bed-reader ...", please try downloading Rust following the instruction here: https://rust-lang.github.io/rustup/installation/other.html.
106113
107114
# Python version incompatibility
108115

docs/simulation.rst

+45-4
Original file line numberDiff line numberDiff line change
@@ -32,18 +32,59 @@ Please change your working directory to sim/:
3232

3333
``cd sim``
3434

35-
In this directory, the file phenotype.txt is a :ref:`phenotype file <phenotype>` containing the simulated phenotype.
35+
The genotype data (chr_1.bed) has been simulated so that there are 3000 independent families, each with two siblings genotyped.
3636

37-
The genotype data (chr_1.bed) has been simulated so that there are 3000 independent families, each with two siblings genotyped.
37+
The pedigree file (pedigree.txt) contains the pedigree for all simulated generations. The pedigree has columns:
38+
FID, IID, Father_ID, MOTHER_ID, SEX, PHENO, FATHER_PHENO, MOTHER_PHENO, DIRECT, FATHER_DIRECT, MOTHER_DIRECT.
39+
The FID and IID columns are the family and individual IDs, respectively. The Father_ID and MOTHER_ID columns are the IDs of the parents of the individual.
40+
The FATHER_PHENO and MOTHER_PHENO columns are the phenotype values of the parents, and the DIRECT, FATHER_DIRECT, and MOTHER_DIRECT columns are the direct genetic effect components of the individual, father, and mother, respectively.
41+
The FIDs follow the format ``generation_family``, and the IIDs follow the format ``generation_family_individual``.
42+
So if you look at the end of pedigree.txt (e.g. using ``tail pedigree.txt``), you should see
43+
the FID of the last line as ``20_2999``, and the IID of the last line as ``20_2999_1``.
44+
45+
To enable analysis of the final generation phenotypes alone, we have placed the phenotype values for the final generation in a separate file (phenotype.txt).
46+
47+
Family-based GWAS without imputed parental genotypes
48+
----------------------------------------------------
49+
50+
To perform a family-based GWAS, we use the :ref:`gwas.py <gwas.py>` script.
51+
To perform a family-based GWAS without imputed parental genotypes, use the following command:
52+
53+
``gwas.py phenotype.txt --pedigree pedigree.txt --bed chr_@ --out chr_@_sibdiff ``
54+
55+
The first argument is the phenotype file. As we are not inputting an imputed parental genotype file,
56+
we must specify the pedigree information from the pedigree file using the ``--pedigree pedigree.txt`` argument.
57+
The genotype data in .bed format is specified by ``--bed chr_@`` argument.
58+
The ``@`` symbol is a placeholder for the chromosome number, so the script will read the genotype data from ``chr_1.bed``.
59+
The output file is specified by ``--out chr_@_sibdiff``. The script will output summary statistics to a gzipped text file: ``chr_1_sibdiff.sumstats.gz``.
60+
The ``--cpus`` argument can be used to specify the number of processes to use to parallelize the GWAS.
61+
62+
Since the genotype data of the final generation contains 3000 sibling pairs, the script will perform sib-GWAS
63+
using genetic differences between siblings to estimate direct genetic effects (see `Guan et al. <https://www.nature.com/articles/s41588-025-02118-0>`_).
64+
The summary statistics are output to a gzipped text :ref:`sumstats file <sumstats_text>`: chr_1_sibdiff.sumstats.gz.
65+
66+
We can combine the final two generations' genotype data into one .bed file using this command:
67+
68+
``plink --bfile chr_1 --bmerge chr_1_par --out chr_1_combined``
69+
70+
If we run the GWAS script on the combined genotype data, we can estimate the direct genetic effects using the full-sibling offspring and parental genotypes
71+
in a trio design:
72+
73+
``gwas.py phenotype.txt --pedigree pedigree.txt --bed chr_@_combined --out chr_@_trio``
74+
75+
The summary statistics are output to a gzipped text :ref:`sumstats file <sumstats_text>`: chr_1_trio.sumstats.gz.
76+
If you read the summary statistics file (e.g. into R or using ``zless -S``) you can see that the effective sample size for
77+
direct genetic effects is substantially larger from the trio design than the sib-differences design.
78+
Note that both designs use the same number of phenotype observations in a generalized least-squares regression, but the trio design uses more information from the parents.
79+
In this simulation, the effective sample size from the trio design should be about 45% larger than for the sib-differences design.
3880

3981
Inferring IBD between siblings
4082
------------------------------
4183

42-
The first step is to infer the identity-by-descent (IBD) segments shared between siblings.
84+
The first step in the imputation of missing parental genotypes from siblings is to infer the identity-by-descent (IBD) segments shared between siblings.
4385
However, for the purpose of this simulation exercise (where SNPs are independent, so IBD inference doesn't work)
4486
we have provided the true IBD states in the file chr_1.segments.gz.
4587

46-
4788
Imputing missing parental genotypes
4889
-----------------------------------
4990

snipar/gwas.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -387,7 +387,7 @@ def process_chromosome(chrom_out, y, varcomp_lst,
387387
if not parsum and not sib_diff:
388388
par_status, gt_indices, fam_labels = find_par_gts(y.ids, ped, gts_id_dict)
389389
parcount = np.sum(par_status==0,axis=1)
390-
if np.sum(parcount>0)==0:
390+
if np.sum(parcount>0)==0 and not trios_sibs:
391391
# logger.warning('No individuals with genotyped parents found. Using sum of imputed maternal and paternal genotypes to prevent collinearity.')
392392
print('WARNING: no individuals with genotyped parents found. Using sum of imputed maternal and paternal genotypes to prevent collinearity.')
393393
parsum = True

snipar/read/phenotype.py

+59-27
Original file line numberDiff line numberDiff line change
@@ -1,36 +1,68 @@
11
from pysnptools.snpreader import Pheno
22
import numpy as np
3+
import pandas as pd
34
from snipar.gtarray import gtarray
45
from snipar.utilities import make_id_dict
5-
6-
def read_phenotype(phenofile, missing_char = 'NA', phen_index = 1):
7-
"""Read a phenotype file and remove missing values.
8-
9-
Args:
10-
phenofile : :class:`str`
11-
path to plain text phenotype file with columns FID, IID, phenotype1, phenotype2, ...
12-
missing_char : :class:`str`
13-
The character that denotes a missing phenotype value; 'NA' by default.
14-
phen_index : :class:`int`
15-
The index of the phenotype (counting from 1) if multiple phenotype columns present in phenofile
16-
6+
def read_phenotype(file_path, column=None, column_index=None, na_values='NA'):
7+
"""
8+
Read data from a text file with header structure where either:
9+
- First two columns are 'FID' and 'IID'
10+
- First column is 'IID'
11+
12+
Parameters:
13+
file_path (str): Path to the text file
14+
column (str, optional): Name of column to extract (other than 'FID' or 'IID')
15+
column_index (int, optional): Index of column to extract (counting from 1 after 'IID'/'FID')
16+
Note: This is 1-based indexing
17+
na_values (str or list, optional): String or list of strings to recognize as NA/NaN. Default is 'NA'.
18+
1719
Returns:
18-
y : :class:`~numpy:numpy.array`
19-
vector of non-missing phenotype values from specified column of phenofile
20-
pheno_ids: :class:`~numpy:numpy.array`
21-
corresponding vector of individual IDs (IID)
20+
y : :class:`snipar.gtarray`
21+
vector of non-missing phenotype values from specified column of phenofile along with individual IDs
22+
23+
Note: If neither column nor column_index is provided, defaults to first column after IID/FID
2224
"""
23-
pheno = Pheno(phenofile, missing=missing_char)[:,phen_index-1].read()
24-
y = np.array(pheno.val)
25-
y.reshape((y.shape[0],1))
26-
pheno_ids = np.array(pheno.iid)[:,1]
27-
# Remove y NAs
28-
y_not_nan = np.logical_not(np.isnan(y[:,0]))
29-
if np.sum(y_not_nan) < y.shape[0]:
30-
y = y[y_not_nan,:]
31-
pheno_ids = pheno_ids[y_not_nan]
32-
print('Number of non-missing phenotype observations: ' + str(y.shape[0]))
33-
return gtarray(y,ids=pheno_ids)
25+
# Determine delimiter (tab or whitespace)
26+
with open(file_path, 'r') as file:
27+
first_line = file.readline()
28+
delimiter = '\t' if '\t' in first_line else ' '
29+
header = first_line.split(delimiter)
30+
header[-1] = header[-1].strip() # Remove newline character
31+
# Determine file format based on header
32+
has_fid = (len(header) > 1 and header[0] == 'FID' and header[1] == 'IID')
33+
# Set default column if neither is provided
34+
if column is None and column_index is None:
35+
# Default to first column after IID/FID
36+
column_index = 1
37+
# Determine the usecols parameter for pd.read_csv
38+
if column is not None:
39+
if column in ['FID', 'IID']:
40+
raise ValueError(f"Phenotype cannot be named FID or IID")
41+
# We need to read the IID column and the target column
42+
cols_to_use = ['IID', column]
43+
else: # column_index is provided
44+
# Adjust column_index based on file format
45+
offset = 2 if has_fid else 1
46+
adjusted_index = column_index + offset - 1 # -1 for 0-based indexing
47+
if adjusted_index >= len(header):
48+
raise ValueError(f"Column index {column_index} out of range")
49+
column = header[adjusted_index]
50+
cols_to_use = ['IID', column]
51+
print('Reading phenotype from column:', column)
52+
# Read the data using pandas for efficiency, handling missing values
53+
df = pd.read_csv(file_path,
54+
sep=delimiter,
55+
usecols=cols_to_use,
56+
na_values=na_values)
57+
# Verify target column contains numeric data
58+
try:
59+
df[column] = pd.to_numeric(df[column], errors='coerce')
60+
except ValueError:
61+
raise ValueError(f"Phenotype contains non-numeric values that cannot be converted")
62+
# Remove rows with missing values in either IID or target column
63+
df = df.dropna(subset=['IID', column])
64+
# Return gtarray
65+
return gtarray(np.array(df[column].values).reshape((df.shape[0],1)), ids=np.array(df['IID'].values, dtype=str))
3466

3567
def match_phenotype(G,y,pheno_ids):
3668
"""Match a phenotype to a genotype array by individual IDs.

0 commit comments

Comments
 (0)