AlexTISYoung
diff --git a/README.md b/README.md
+20-13
diff --git a/docs/simulation.rst b/docs/simulation.rst
+45-4
diff --git a/snipar/gwas.py b/snipar/gwas.py
+1-1
diff --git a/snipar/read/phenotype.py b/snipar/read/phenotype.py
+59-27
@@ -20,15 +20,15 @@
 - **family-PGS analyses**: Compute and analyze polygenic scores (PGS) for a set of individuals along with their siblings and parents, using both observed and imputed parental genotypes. *snipar* can estimate the direct (within-family) effect of a polygenic score: see [Simulation Exercise: Polygenic score analyses](https://snipar.readthedocs.io/en/latest/simulation.html#polygenic-score-analyses). It can also adjust for the impact of assortative mating on estimates of indirect genetic effects (effects of alleles in parents on offspring mediated through the environment) from family-based PGS analysis: see [Simulation Exercise: Polygenic score analyses](https://snipar.readthedocs.io/en/latest/simulation.html#polygenic-score-analyses).
 - **Imputation of missing parental genotypes**: For samples with at least one genotyped sibling and/or parent, but without both parents' genotypes available, *snipar* can impute missing parental genotypes according to Mendelian laws (Mendelian Imputation) and use these to increase power for family-GWAS and PGS analyses. See [Tutorial: imputing-missing-parental-genotypes](https://snipar.readthedocs.io/en/latest/tutorial.html#imputing-missing-parental-genotypes)
 - **Identity-by-descent (IBD) segments shared by siblings**: *snipar* implements a hidden Markov model (HMM) to accurately infer identity-by-descent segments shared between siblings. The output of this is needed for imputation of missing parental genotypes from siblings. See [Tutorial: inferring IBD between siblings](https://snipar.readthedocs.io/en/latest/tutorial.html#inferring-ibd-between-siblings)
-- **Multi-generational forward simulation with indirect genetic effects and assortative mating**: *snipar* includes a simulation module that performs forward simulation of multiple generations undergoing random and/or assortative mating of different strengths. The phenotype on which assortment occurs can include indirect genetic effects from parents. Users can input phased haplotypes for the starting generation or artificial haplotypes can be simulated. Output includes a multigenerational pedigree with phenotype values, direct and indirect genetic component values, and plink formatted genotypes for the final two generations along with imputed parental genotypes. See [Simulation Exercise](https://snipar.readthedocs.io/en/latest/simulation.html)
+- **Multi-generational forward simulation with indirect genetic effects and assortative mating**: *snipar* includes a simulation module that performs forward simulation of multiple generations undergoing random and/or assortative mating. The phenotype on which assortment occurs can include indirect genetic effects from parents. Users can input phased haplotypes for the starting generation or artificial haplotypes can be simulated. Output includes a multigenerational pedigree with phenotype values, direct and indirect genetic component values, and plink formatted genotypes for the final two generations along with imputed parental genotypes. See [Simulation Exercise](https://snipar.readthedocs.io/en/latest/simulation.html)
 - **Estimate correlations between effects**: Family-GWAS summary statistics include genome-wide estimates of direct genetic effects (DGEs) — the within-family estimate of the effect of the allele — population effects — as estimated by standard GWAS — and non-transmitted coefficients (NTCs), the coefficients on parents' genotypes. The *correlate.py* script enables efficient estimation of genome-wide correlations between these different classes of effects accounting for sampling errors. See [Tutorial: correlations between effects](https://snipar.readthedocs.io/en/latest/tutorial.html#correlations-between-effects)
 
-This illustrates an end-to-end workflow for performing family-GWAS in *snipar* although not all steps are necessary for all analyses. For example, family-GWAS (and PGS analyses) can be performed without imputed parental genotypes, requiring only input genotypes in .bed or .bgen format along with pedigree information:
-
 <p align="center">
   <img src="docs/snipar_flowchart.png" width="100%" alt="snipar flowchart">
 </p>
 
+The flowchart above illustrates an end-to-end workflow for performing family-GWAS in *snipar*, although not all steps are necessary for all analyses. For example, family-GWAS (and PGS analyses) can be performed without imputed parental genotypes, requiring only input genotypes in .bed or .bgen format along with pedigree information. In addition, imputation for parent-offspring pairs can proceed without IBD inference.
+
 # Publications
 
 **Please cite at least one of these publications if you use snipar in your work!**
@@ -72,37 +72,44 @@ And to work through the tutorial: https://snipar.readthedocs.io/en/latest/tutori
 *snipar* currently supports Python 3.7-3.9 on Linux, Windows, and Mac OSX (although not currently available for Mac through pip). We recommend using a Python distribution such as Anaconda 3 (https://store.continuum.io/cshop/anaconda/). 
 
 The easiest way to install is using pip:
-
-    pip install snipar
+  ```
+  pip install snipar
+  ```
 
 Sometimes this may not work because the pip in the system is outdated. You can upgrade your pip using:
-
+  ```
     pip install --upgrade pip
+  ```
+
+Note: installing *snipar* requires the package *bed_reader*, which in turn requires Rust. If an error occurs at "Collecting bed-reader ...", please try installing Rust by following the instructions here: https://rust-lang.github.io/rustup/installation/other.html.
 
 # Virtual Environment
 
 You may encounter problems with the installation due to Python version incompatibility or package conflicts with your existing Python environment. To overcome this, you can try installing in a virtual environment. In a bash shell, this could be done either via the *venv* Python package or via conda.
 
 To use venv, use the following commands in your directory of choice:
-    
+  ```
     python -m venv path-to-where-you-want-the-virtual-environment-to-be
+  ```
 
 You can activate and use the environment using
-
+  ```
     source path-to-where-you-want-the-virtual-environment-to-be/bin/activate
+  ```
 
-Alternatively, we highly recommend using conda:
-	
+Alternatively, we recommend using conda:
+  ```
   conda create -n myenv python=3.9
 	conda activate myenv
+  ```
 
 # Installing From Source
 To install from source, clone the git repository, and in the directory
 containing the *snipar* source code, at the shell type:
-
+  ```
   pip install .
-
-Note: installing *snipar* requires the package *bed_reader*, which in turn requires installing Rust. If error occurs at "Collecting bed-reader ...", please try downloading Rust following the instruction here: https://rust-lang.github.io/rustup/installation/other.html.
+  ```
+Note: installing *snipar* requires the package *bed_reader*, which in turn requires Rust. If an error occurs at "Collecting bed-reader ...", please try installing Rust by following the instructions here: https://rust-lang.github.io/rustup/installation/other.html.
 
 # Python version incompatibility 
 
 
@@ -32,18 +32,59 @@ Please change your working directory to sim/:
 
     ``cd sim``
 
-In this directory, the file phenotype.txt is a :ref:`phenotype file <phenotype>` containing the simulated phenotype. 
+The genotype data (chr_1.bed) has been simulated so that there are 3000 independent families, each with two siblings genotyped.
 
-The genotype data (chr_1.bed) has been simulated so that there are 3000 independent families, each with two siblings genotyped. 
+The pedigree file (pedigree.txt) contains the pedigree for all simulated generations. The pedigree has columns:
+FID, IID, Father_ID, MOTHER_ID, SEX, PHENO, FATHER_PHENO, MOTHER_PHENO, DIRECT, FATHER_DIRECT, MOTHER_DIRECT. 
+The FID and IID columns are the family and individual IDs, respectively. The Father_ID and MOTHER_ID columns are the IDs of the parents of the individual.
+The FATHER_PHENO and MOTHER_PHENO columns are the phenotype values of the parents, and the DIRECT, FATHER_DIRECT, and MOTHER_DIRECT columns are the direct genetic effect components of the individual, father, and mother, respectively.
+The FIDs follow the format ``generation_family``, and the IIDs follow the format ``generation_family_individual``.
+So if you look at the end of pedigree.txt (e.g. using ``tail pedigree.txt``), you should see
+the FID of the last line as ``20_2999``, and the IID of the last line as ``20_2999_1``.
+
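The ID convention described above can be illustrated with a minimal sketch. The helper names `parse_iid` and `fid_from_iid` are hypothetical and are not part of *snipar*; they only assume that IDs are underscore-separated integers, as stated in the text.

```python
def parse_iid(iid):
    """Split an IID of the form generation_family_individual into integers."""
    generation, family, individual = (int(part) for part in iid.split('_'))
    return generation, family, individual


def fid_from_iid(iid):
    """Recover the FID (generation_family) by dropping the individual index."""
    return iid.rsplit('_', 1)[0]


print(parse_iid('20_2999_1'))     # (20, 2999, 1)
print(fid_from_iid('20_2999_1'))  # 20_2999
```

Applied to the last line of pedigree.txt, this gives generation 20, family 2999, individual 1.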
+To enable analysis of the final generation phenotypes alone, we have placed the phenotype values for the final generation in a separate file (phenotype.txt). 
+
+Family-based GWAS without imputed parental genotypes
+----------------------------------------------------
+
+To perform a family-based GWAS, we use the :ref:`gwas.py <gwas.py>` script. 
+To perform a family-based GWAS without imputed parental genotypes, use the following command:
+
+    ``gwas.py phenotype.txt --pedigree pedigree.txt --bed chr_@ --out chr_@_sibdiff``
+
+The first argument is the phenotype file. As we are not inputting an imputed parental genotype file,
+we must specify the pedigree information from the pedigree file using the ``--pedigree pedigree.txt`` argument. 
+The genotype data in .bed format is specified by the ``--bed chr_@`` argument.
+The ``@`` symbol is a placeholder for the chromosome number, so the script will read the genotype data from ``chr_1.bed``. 
+The output file is specified by ``--out chr_@_sibdiff``. The script will output summary statistics to a gzipped text file: ``chr_1_sibdiff.sumstats.gz``.
+The ``--cpus`` argument can be used to specify the number of processes to use to parallelize the GWAS. 
+
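The ``@`` placeholder can be pictured as a plain string substitution. The sketch below only illustrates that idea and is not how *snipar* itself resolves the argument; the function name `expand_chromosome_pattern`, its parameters, and the default chromosome range of 1 to 22 are all assumptions made for the example.

```python
def expand_chromosome_pattern(pattern, chromosomes=range(1, 23), suffix='.bed'):
    """Map each chromosome number to the path obtained by replacing '@' with it."""
    return {chrom: pattern.replace('@', str(chrom)) + suffix for chrom in chromosomes}


# Hypothetical usage mirroring the tutorial, where only chromosome 1 is present
bed_paths = expand_chromosome_pattern('chr_@', chromosomes=[1])
out_paths = expand_chromosome_pattern('chr_@_sibdiff', chromosomes=[1], suffix='.sumstats.gz')
print(bed_paths)  # {1: 'chr_1.bed'}
print(out_paths)  # {1: 'chr_1_sibdiff.sumstats.gz'}
```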
+Since the genotype data of the final generation contains 3000 sibling pairs, the script will perform sib-GWAS 
+using genetic differences between siblings to estimate direct genetic effects (see `Guan et al. <https://www.nature.com/articles/s41588-025-02118-0>`_).
+The format of the resulting file, chr_1_sibdiff.sumstats.gz, is described in the :ref:`sumstats file <sumstats_text>` documentation.
+
+We can combine the final two generations' genotype data into one .bed file using this command:
+
+    ``plink --bfile chr_1 --bmerge chr_1_par --out chr_1_combined``
+
+If we run the GWAS script on the combined genotype data, we can estimate the direct genetic effects using the full-sibling offspring and parental genotypes 
+in a trio design:
+
+    ``gwas.py phenotype.txt --pedigree pedigree.txt --bed chr_@_combined --out chr_@_trio``
+
+The summary statistics are output to a gzipped text :ref:`sumstats file <sumstats_text>`: chr_1_trio.sumstats.gz.
+If you read the summary statistics file (e.g. into R or using ``zless -S``), you can see that the effective sample size for 
+direct genetic effects is substantially larger for the trio design than for the sib-differences design. 
+Note that both designs use the same number of phenotype observations in a generalized least-squares regression, but the trio design uses more information from the parents.
+In this simulation, the effective sample size from the trio design should be about 45% larger than for the sib-differences design.
 
 Inferring IBD between siblings
 ------------------------------
 
-The first step is to infer the identity-by-descent (IBD) segments shared between siblings. 
+The first step in the imputation of missing parental genotypes from siblings is to infer the identity-by-descent (IBD) segments shared between siblings. 
 However, for the purpose of this simulation exercise (where SNPs are independent, so IBD inference doesn't work)
 we have provided the true IBD states in the file chr_1.segments.gz.
 
-
 Imputing missing parental genotypes
 -----------------------------------
 
 
@@ -387,7 +387,7 @@ def process_chromosome(chrom_out, y, varcomp_lst,
     if not parsum and not sib_diff:
         par_status, gt_indices, fam_labels = find_par_gts(y.ids, ped, gts_id_dict)
         parcount = np.sum(par_status==0,axis=1)
-        if np.sum(parcount>0)==0:
+        if np.sum(parcount>0)==0 and not trios_sibs:
             # logger.warning('No individuals with genotyped parents found. Using sum of imputed maternal and paternal genotypes to prevent collinearity.')
             print('WARNING: no individuals with genotyped parents found. Using sum of imputed maternal and paternal genotypes to prevent collinearity.')
             parsum = True
 
@@ -1,36 +1,68 @@
 from pysnptools.snpreader import Pheno
 import numpy as np
+import pandas as pd
 from snipar.gtarray import gtarray
 from snipar.utilities import make_id_dict
-
-def read_phenotype(phenofile, missing_char = 'NA', phen_index = 1):
-    """Read a phenotype file and remove missing values.
-
-    Args:
-        phenofile : :class:`str`
-            path to plain text phenotype file with columns FID, IID, phenotype1, phenotype2, ...
-        missing_char : :class:`str`
-            The character that denotes a missing phenotype value; 'NA' by default.
-        phen_index : :class:`int`
-           The index of the phenotype (counting from 1) if multiple phenotype columns present in phenofile
-
+def read_phenotype(file_path, column=None, column_index=None, na_values='NA'):
+    """
+    Read data from a text file with header structure where either:
+    - First two columns are 'FID' and 'IID'
+    - First column is 'IID'
+    
+    Parameters:
+    file_path (str): Path to the text file
+    column (str, optional): Name of column to extract (other than 'FID' or 'IID')
+    column_index (int, optional): Index of column to extract (counting from 1 after 'IID'/'FID')
+                                  Note: This is 1-based indexing
+    na_values (str or list, optional): String or list of strings to recognize as NA/NaN. Default is 'NA'.
+    
     Returns:
-        y : :class:`~numpy:numpy.array`
-            vector of non-missing phenotype values from specified column of phenofile
-        pheno_ids: :class:`~numpy:numpy.array`
-            corresponding vector of individual IDs (IID)
+        y : :class:`snipar.gtarray`
+            vector of non-missing phenotype values from specified column of phenofile along with individual IDs
+    
+    Note: If neither column nor column_index is provided, defaults to first column after IID/FID
     """
-    pheno = Pheno(phenofile, missing=missing_char)[:,phen_index-1].read()
-    y = np.array(pheno.val)
-    y.reshape((y.shape[0],1))
-    pheno_ids = np.array(pheno.iid)[:,1]
-    # Remove y NAs
-    y_not_nan = np.logical_not(np.isnan(y[:,0]))
-    if np.sum(y_not_nan) < y.shape[0]:
-        y = y[y_not_nan,:]
-        pheno_ids = pheno_ids[y_not_nan]
-    print('Number of non-missing phenotype observations: ' + str(y.shape[0]))
-    return gtarray(y,ids=pheno_ids)
+    # Determine delimiter (tab or whitespace)
+    with open(file_path, 'r') as file:
+        first_line = file.readline()
+        # Tab-delimited if the header contains a tab; otherwise split on runs of whitespace
+        delimiter = '\t' if '\t' in first_line else r'\s+'
+        header = first_line.strip().split('\t' if delimiter == '\t' else None)
+    # Determine file format based on header
+    has_fid = (len(header) > 1 and header[0] == 'FID' and header[1] == 'IID')
+    # Set default column if neither is provided
+    if column is None and column_index is None:
+        # Default to first column after IID/FID
+        column_index = 1
+    # Determine the usecols parameter for pd.read_csv
+    if column is not None:
+        if column in ['FID', 'IID']:
+            raise ValueError("Phenotype cannot be named FID or IID")
+        # We need to read the IID column and the target column
+        cols_to_use = ['IID', column]
+    else:  # column_index is provided
+        # Adjust column_index based on file format
+        offset = 2 if has_fid else 1
+        adjusted_index = column_index + offset - 1  # -1 for 0-based indexing
+        if adjusted_index >= len(header):
+            raise ValueError(f"Column index {column_index} out of range")
+        column = header[adjusted_index]
+        cols_to_use = ['IID', column]
+    print('Reading phenotype from column:', column)
+    # Read the data using pandas for efficiency, handling missing values
+    df = pd.read_csv(file_path, 
+                     sep=delimiter,
+                     usecols=cols_to_use,
+                     na_values=na_values)
+    # Verify target column contains numeric data
+    try:
+        df[column] = pd.to_numeric(df[column])  # raises on non-numeric entries; values in na_values are already NaN
+    except (ValueError, TypeError) as err:
+        raise ValueError("Phenotype contains non-numeric values that cannot be converted") from err
+    # Remove rows with missing values in either IID or target column
+    df = df.dropna(subset=['IID', column])
+    # Return gtarray
+    return gtarray(np.array(df[column].values).reshape((df.shape[0],1)), ids=np.array(df['IID'].values, dtype=str))
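The column-selection rules in the new `read_phenotype` can be checked in isolation with a small sketch. The function name `resolve_phenotype_column` is hypothetical and not part of *snipar*; it mirrors the header logic in the diff, and the two extra checks (column name present in the header, index at least 1) are additions for illustration that the PR does not contain.

```python
def resolve_phenotype_column(header, column=None, column_index=None):
    """Return the name of the phenotype column selected by name or 1-based index."""
    has_fid = len(header) > 1 and header[0] == 'FID' and header[1] == 'IID'
    if column is not None:
        if column in ('FID', 'IID'):
            raise ValueError('Phenotype cannot be named FID or IID')
        if column not in header:  # extra check, not in the PR
            raise ValueError(f'Column {column} not found in header')
        return column
    if column_index is None:
        column_index = 1  # default: first column after the ID column(s)
    offset = 2 if has_fid else 1  # number of leading ID columns
    adjusted_index = column_index + offset - 1  # convert to a 0-based position
    if column_index < 1 or adjusted_index >= len(header):  # lower bound is an extra check
        raise ValueError(f'Column index {column_index} out of range')
    return header[adjusted_index]


print(resolve_phenotype_column(['FID', 'IID', 'height', 'bmi']))                # height
print(resolve_phenotype_column(['IID', 'height', 'bmi'], column_index=2))       # bmi
print(resolve_phenotype_column(['FID', 'IID', 'height', 'bmi'], column='bmi'))  # bmi
```

The same 1-based index selects the same phenotype whether or not the file has an FID column, which is the behavior the docstring describes.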
 
 def match_phenotype(G,y,pheno_ids):
     """Match a phenotype to a genotype array by individual IDs.