GRAF (Genetic Relationship And Fingerprinting) is a package for analyzing and visualizing genotype data from genome-wide association studies. The latest version, GRAF 2.3, includes two main features: (1) subject relationship inference (GRAF-rel); (2) subject ancestry (or population structure) inference (GRAF-pop). Both relationship and ancestry inferences are based on the genotypes of 10,000 pre-selected fingerprint SNPs extracted from the input dataset. The GRAF package includes a main C++ program, graf, to calculate the relationships and predict subject ancestry, and two auxiliary Perl programs, PlotGraf.pl and PlotPopulations.pl, to visualize the results. Note that PlotGraf.pl and PlotPopulations.pl require that the GD Graphics Library (http://search.cpan.org/~lds/GD-1.38/GD.pm) be installed.
GRAF-rel analyzes the genotypes of all the 10,000 fingerprinting SNPs (distributed as the file FP_SNPs.txt) and calculates the all genotype mismatch rate (AGMR) and the homozygous genotype mismatch rate (HGMR) for each pair of samples (Jin et al., 2017). AGMR is the percentage of SNPs on which the two genotypes are not identical, while HGMR is the genotype mismatch rate when only the SNPs with homozygous calls for both samples are considered.
graf compares the genotypes of all pairs of subjects and reports the closely related pairs, while PlotGraf.pl takes the file generated by graf and plots graphs to show the distributions of HGMR and AGMR values.
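The two mismatch rates defined above can be illustrated with a minimal Python sketch. It is not the code used by graf: the genotype coding (0, 1 or 2 reference alleles, 3 for missing), the function name and the decision to skip SNPs that are missing in either sample are assumptions made for illustration.

```python
MISSING = 3  # assumed code for a missing genotype call


def mismatch_rates(geno_a, geno_b):
    """Return (HGMR, AGMR) in percent for two equal-length genotype lists.

    Genotypes are coded as the number of reference alleles (0, 1 or 2);
    MISSING marks a missing call.
    """
    ag_match = ag_miss = hg_match = hg_miss = 0
    for a, b in zip(geno_a, geno_b):
        if a == MISSING or b == MISSING:
            continue  # assumption: SNPs missing in either sample are skipped
        if a == b:
            ag_match += 1
        else:
            ag_miss += 1
        if a != 1 and b != 1:  # both calls are homozygous (0 or 2)
            if a == b:
                hg_match += 1
            else:
                hg_miss += 1
    ag_total = ag_match + ag_miss
    hg_total = hg_match + hg_miss
    agmr = 100.0 * ag_miss / ag_total if ag_total else float("nan")
    hgmr = 100.0 * hg_miss / hg_total if hg_total else float("nan")
    return hgmr, agmr


# Made-up genotypes for two samples at eight SNPs.
sample_1 = [0, 2, 1, 2, 0, 3, 2, 1]
sample_2 = [0, 0, 1, 2, 1, 2, 2, 2]
hgmr, agmr = mismatch_rates(sample_1, sample_2)
print(f"HGMR = {hgmr:.1f}%, AGMR = {agmr:.1f}%")
```

In this toy input, four SNPs are homozygous in both samples and one of them mismatches, while three of the seven non-missing SNPs mismatch overall.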
In most usages, graf expects as input one or more genotype datasets in PLINK format, i.e., .bed, .bim and .fam files that share a prefix in their names. However, since multiple samples can be collected from one subject and the subject-sample mapping information is not stored in PLINK datasets, graf reads subject-sample mapping (SSM) and pedigree information from the dbGaP SSM file and pedigree file. The IDs (second column, no column header) in the PLINK .fam file are read as sample IDs by graf. If subject IDs are the same as sample IDs, then no SSM file is necessary, and graf will read the pedigree information from the PLINK .fam file.
However, if any of the sample IDs differ from their corresponding subject IDs, then an SSM file should be passed to graf. The SSM file should be a tab-delimited plain text file with a sample column and a subject column (with column headers, see dbGaP submission guide). When an SSM file is provided, a pedigree file (see dbGaP submission guide) should also be provided to pass the pedigree information to graf. The pedigree file should be a tab-delimited plain text file with at least the following 5 columns (with a column header row):
- FamilyID
- SubjectID
- FatherID
- MotherID
- Sex (1 = male; 2 = female; 0 or NULL = unknown)
SubjectID, FatherID and MotherID are IDs of subjects, not samples.
The SSM format is a two-column tab-delimited text file that establishes a mapping from sample IDs to subject IDs. The columns should have the headers Subject_ID and Sample_ID, respectively. An example SSM file is included in the GRAF distribution with the name affy_hapmap_ssm.txt.
If there are identical twins in the datasets, the twin information should be entered in the optional 6th column of the pedigree file, TwinID, where the same twin ID (an integer or a string) indicates that subjects are identical twins. For example, if three subjects A, B, C are identical triplets, a unique twin ID, e.g., the integer 18, can be created for them and entered into the TwinID column for subjects A, B, C.
The sample genotypes can also be stored in datasets in GRAF format. graf uses a single .fpg file to store the sample genotypes. A .fpg file is a plain text file with three columns: the first column is the dataset ID (integer) column; the second one is the sample ID column; and the third column stores sample genotypes in strings of hexadecimal numbers. Each hexadecimal number represents genotypes of two fingerprinting SNPs. The first hexadecimal number stores genotypes of the first two fingerprinting SNPs; the second number keeps genotypes of fingerprinting SNPs #3 and #4, and so on. If the hexadecimal number is converted to a binary number, then the first two bits keep the genotype of the first SNP and the last two bits are for the second SNP, with the following code meanings:
00: 0 reference alleles
01: 1 reference allele
10: 2 reference alleles
11: missing genotype
The .fpg file can be generated using the -geno option of the graf program and reused as input to the program in a subsequent run.
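The genotype encoding described above can be decoded with a few lines of Python. This is a sketch under two assumptions that should be checked against a real file: each "hexadecimal number" is a single hex digit (four bits, two SNPs), and the three columns are separated by whitespace. The function names are hypothetical.

```python
# Two-bit code -> number of reference alleles; None stands for a missing call.
GENOTYPE_CODES = {0b00: 0, 0b01: 1, 0b10: 2, 0b11: None}


def decode_fpg_genotypes(hex_string):
    """Decode a .fpg genotype string into a list with one entry per SNP."""
    genotypes = []
    for digit in hex_string:
        value = int(digit, 16)
        genotypes.append(GENOTYPE_CODES[value >> 2])    # first two bits
        genotypes.append(GENOTYPE_CODES[value & 0b11])  # last two bits
    return genotypes


def parse_fpg_line(line):
    """Split one .fpg line into (dataset ID, sample ID, genotype list)."""
    dataset_id, sample_id, hex_string = line.split()
    return int(dataset_id), sample_id, decode_fpg_genotypes(hex_string)


# Made-up line: dataset 1, sample SAMPLE_A, six SNPs in three hex digits.
print(parse_fpg_line("1 SAMPLE_A 9A3"))
```

For example, the digit 9 is 1001 in binary, so the first SNP has code 10 (2 reference alleles) and the second has code 01 (1 reference allele).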
Included in the distribution are two sample datasets whose file names have the prefixes affy_hapmap and perlegen_hapmap. Both sets of sample files come in byte-encoded PLINK format, meaning that there are three files with suffixes {fam,bim,bed}.
graf is a command-line executable that runs on 64-bit GNU/Linux systems. Brief instructions are given when the program is executed without parameters:
$ graf
Usage: graf [options]
-plink PLINK set root: File root of PLINK .bed, .bim and .fam files
-geno fpg file: Specify GRAF .fpg file
-exfp PLINK set list: Extract fingerprinting genotypes from a list of PLINK sets (file roots) separated by commas
-pop output file: Check subject populations and save results to the output file
-out output file: Output file to save the results
-appd DS No.: Append extracted fingerprinting genotypes to the output file. The integer is dataset No.
of the first PLINK set
-ssrs SS-RS mapping: Specify SS# to RS# mapping file (Two columns: SS# and RS# without column headers)
-ped pedigree file: Specify pedigree file of subject IDs (with column headers)
-ssm SSM file: Specify dbGaP subject-sample mapping file
-maxhm max HGMR value: Specify maximum HGMR values for a pair of subjects to be reported by GRAF
-xpmr type: Specify how expected HGMR and AGMR values are calculated for each type of relationship (default 1)
1: Use input dataset to calculate the expected HGMR and AGMR values
2: Use average HGMR and AGMR values in dbGaP database for the expected values
-type relation_type: Specify relation type. Acceptable values are 1, 2, 3, or 4 (default 3)
1: Find all duplicates and PO pairs
2: Find all duplicates, PO and FS pairs
3: Find all duplicates, PO, FS and second degree relatives
4: Compare all the 10,000 SNPs to find all the related subjects
NOTE:
1. Exactly one of the following two options should be selected: -plink or -geno.
2. When option -exfp is selected, -out must also be selected and output file should have .fpg extension.
3. When multiple PLINK sets are used, each dataset will be assigned an integer dataset ID starting with 1.
4. The above PLINK set starting index can be specified using option -appd.
When -appd is selected, the out file should be an existing GRAF .fpg file.
5. Multiple datasets can be combined into a single geno file using the -exfp and -appd options.
6. When multiple datasets are used, the program does pairwise comparisons to find related samples both within and across datasets
Below are more detailed descriptions (with examples) of these options.
-plink
Allows the user to specify the name of the genotype dataset in PLINK .bed, .bim, .fam format. The parameter should be the file root of the PLINK set. In the example below, graf will try to find the following three files: affy_hapmap.bed, affy_hapmap.bim and affy_hapmap.fam. Example:
$ graf -plink affy_hapmap
-exfp
Extracts fingerprinting genotypes from multiple PLINK sets and saves the results to the file name specified by -out option. The datasets will be given integer dataset IDs starting from 1. The output file name should be new. Example:
$ graf -exfp affy_hapmap,perlegen_hapmap -out comb_hapmap.fpg
-exfp -appd
Extracts fingerprinting genotypes from a PLINK set and appends the results to an existing output file, with the dataset ID specified by the -appd option. Example (two steps):
$ graf -exfp affy_hapmap -out comb_hapmap2.fpg
$ graf -exfp perlegen_hapmap -out comb_hapmap2.fpg -appd 2
-geno
Allows the user to specify the name of the genotype dataset in GRAF format. Example:
$ graf -geno comb_hapmap.fpg
-ssm
Allows the user to specify the name of the subject-sample mapping file in dbGaP format. When sample IDs are different from subject IDs, a subject-sample mapping file is required. The subject-sample mapping file should list all the sample IDs in the PLINK .fam file and their corresponding subject IDs. Example:
$ graf -plink affy_hapmap -ssm affy_hapmap_ssm.txt
-ped
Allows the user to specify the pedigree file in dbGaP format. When a pedigree file is specified with the -ped option, graf will ignore the pedigree information in the PLINK .fam file and read the information from the pedigree file. The IDs in the pedigree file should be subject IDs. This option can take only one dataset at a time. Example:
$ graf -plink affy_hapmap -ssm affy_hapmap_ssm.txt -ped affy_hapmap_fake_pedigree.txt
-out
Allows the user to specify the name of the output file for saving the related pairs of samples detected by graf. If the output file is not specified, the output will be saved to a default file graf_rel_yyyymmdd_hhmm.txt, where yyyymmdd_hhmm is the current local time in this format. Example:
$ graf -plink affy_hapmap -out aff_hapmap_rels.txt
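Scripts that need to locate the default output file can rebuild its name from the pattern described above. The Python sketch below does this with strftime. The function name is hypothetical, and the result matches graf's file only if it is computed in the same minute in which graf wrote its output.

```python
from datetime import datetime


def default_output_name(now=None):
    """Build a name of the form graf_rel_yyyymmdd_hhmm.txt from local time."""
    now = now or datetime.now()
    return now.strftime("graf_rel_%Y%m%d_%H%M.txt")


# A fixed, made-up timestamp keeps the example reproducible.
print(default_output_name(datetime(2019, 8, 8, 14, 5)))
```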
-maxhm
Sets the maximum HGMR value for related pairs output by graf. Subject pairs with HGMR greater than this value will be treated by graf as unrelated and will not be saved to the output file. The default maximum HGMR is 20. Example:
$ graf -plink affy_hapmap -out aff_hapmap_rels_m_15.txt -maxhm 15
-xpmr
Allows the user to specify how the expected HGMR and AGMR values are calculated. For each pair of subjects, GRAF estimates the allele frequency distribution of the fingerprinting SNPs of the population from which the subjects are sampled, and then uses these allele frequencies to calculate the expected HGMR and AGMR values. Assuming all of the subjects in the input file(s) are sampled from the same population, GRAF uses the allele frequencies of all subjects in the input datasets to estimate the allele frequencies in the population. When the sample size in the input datasets is small (fewer than 100 subjects), GRAF uses the allele frequencies of all the subjects in the dbGaP Fingerprint Collection to estimate the population allele frequencies. The user can use the -xpmr option (1 or 2) to choose between these two ways of estimating the population allele frequencies. When -xpmr 1 is used with -geno or -exfp input that combines multiple datasets, the allele frequencies are combined as a weighted average of all the participating datasets and the same weighted average is used for all pairwise comparisons. Example:
$ graf -plink affy_hapmap -xpmr 2
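The weighted average mentioned above can be pictured with a small Python sketch. The text does not say what weights graf uses, so weighting each dataset by its number of subjects is an assumption here, as are the function name and the toy frequencies.

```python
def pooled_allele_frequencies(datasets):
    """Combine per-dataset allele frequencies into one weighted average.

    datasets is a list of (n_subjects, frequencies) pairs, where frequencies
    holds one reference allele frequency per SNP in the same SNP order.
    Assumption: each dataset is weighted by its number of subjects.
    """
    total_subjects = sum(n for n, _ in datasets)
    n_snps = len(datasets[0][1])
    return [
        sum(n * freqs[i] for n, freqs in datasets) / total_subjects
        for i in range(n_snps)
    ]


# Two made-up datasets with 60 and 40 subjects and three SNPs each.
datasets = [
    (60, [0.50, 0.20, 0.90]),
    (40, [0.40, 0.30, 0.80]),
]
pooled = pooled_allele_frequencies(datasets)
print([round(f, 2) for f in pooled])
```

The same pooled list would then be used for every pairwise comparison, whichever datasets the two samples come from.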
-type
Usage of graf involves a tradeoff between running time and prediction accuracy. To obtain high sensitivity, the program needs to check more SNPs, at the expense of a longer running time. The -type option allows the user to specify the relationship type for which graf should try to find all the pairs. The type should be an integer from 1 to 4, with the code meanings shown in the above short description. The greater the type value is, the more SNPs graf will check, and hence the more related samples it will find and the more time it will spend. The default type value is 3. Example:
$ graf -plink affy_hapmap -type 2
-ssrs
When the marker IDs in the PLINK .bim file are SS IDs, the user can use the -ssrs option to specify an SS to RS mapping file so that graf can convert the SS IDs to RS IDs. Example (assuming PLINK set DsWithSs.* exists):
$ graf -plink DsWithSs -ssrs SsToRs.txt
graf requires that an input genotype file, either in PLINK format (with the -plink option) or in GRAF format (with the -geno option), be specified. When the -exfp option is selected, the -out option should also be selected to specify the name of the output file. The output file is the genotype dataset in GRAF format (.fpg file), as described above.
The output file contains the extracted genotypes of the fingerprinting SNPs and can be passed back to graf as an input file in a later run.
When the -exfp option is not selected, graf will use the genotype information in the input genotype dataset, find the related subjects or determine population structures, and save the results to the output file.
If any related subjects are found by GRAF-rel, the results will be saved to the output file, which is a plain text file with the following columns:
Sample1: ID of the first sample in each pair
Sample2: ID of the second sample in each pair
Subject1: subject ID of the first sample in each pair
Subject2: subject ID of the second sample in each pair
Sex11: gender of the first subject in each pair, 1=male; 2=female
Sex12: gender of the second subject in each pair, 1=male; 2=female
HG match: number of SNPs with matched genotypes when only homozygous SNPs are counted
HG miss: number of SNPs with mismatched genotypes when only homozygous SNPs are counted
HGMR: Homozygous Genotype Mismatch Rate (%)
AG match: number of SNPs with matched genotypes when all SNPs are counted
AG miss: number of SNPs with mismatched genotypes when all SNPs are counted
AGMR: All Genotype Mismatch Rate (%)
Geno relation: relationship determined by sample genotypes. See above for code meanings
Ped relation: relationship derived from subject-sample mapping file and pedigree file (See Table 1 for code meanings).
p_value: probability that the genetic relationship is NOT the predicted type
Table 1. Pedigree relationships and the expected genetic relationships
When multiple PLINK sets are checked pairwise, the output file will have two extra columns, DS1 and DS2, showing the dataset IDs for the pair of PLINK sets.
PlotGraf.pl is a Perl script that plots graphs to show the distributions of HGMR and AGMR values of the related pairs of subjects. It shows brief instructions when it is executed without parameters:
$ PlotGraf.pl
Usage: PlotGraf.pl <input related subject file> <output png file> <graph type> [Options]
Note:
Valid graph types are:
1 = HGMR histogram
2 = AGMR histogram
3 = HGMR + AGMR scatter plot
Options:
-gw graph width: Set graph width in pixels
-gh graph height: Set graph height in pixels
-xmax max x value: Set maximum HGMR or AGMR on x-axis of the histogram
-ymax max y value: Set maximum number of pairs on y-axis of the histogram
-dot size: Set dot size in pixels on the scatter plot
-hfd size: Set dot size in pixels for HF (half sibling + full cousin) pairs
It takes three required parameters. The first parameter should be the name of the file that is generated by graf and contains related subject pairs. The second one is the output .png file which shows the graph. The third one is an integer representing the graph type. The options should be entered after the required parameters. Below are some examples showing how to run the script.
$ graf -plink affy_hapmap -maxhm 15 -ssm affy_hapmap_ssm.txt -ped affy_hapmap_fake_pedigree.txt -out affy_hapmap_rels_15.txt
$ PlotGraf.pl affy_hapmap_rels_15.txt affy_hapmap_hgmr.png 1
$ PlotGraf.pl affy_hapmap_rels_15.txt affy_hapmap_agmr.png 2
$ PlotGraf.pl affy_hapmap_rels_15.txt affy_hapmap_scatter.png 3
In the first step the C++ program finds related pairs and saves the results to affy_hapmap_rels_15.txt. Then PlotGraf.pl takes the results and plots histograms to show distributions of HGMR values of the related subjects and AGMR values of the duplicates, and a scatter plot to show the distribution of both values.
In both histograms, the colored bars represent different types of relationships derived from the SSM and pedigree files (see Table 1 for the meanings of the two-letter abbreviations). The cyan lines show the cutoff values suggested by GRAF to separate different types of relationships determined by comparing the genotypes. In the scatter plot, each contour line shows the area that is predicted to contain 95% of the pairs for each relatedness type, assuming all of the 10,000 fingerprinting SNPs are genotyped for all of the subjects in a large, homogeneous, randomly mating population. Note that the HapMap samples were collected from human individuals from very different populations, and GRAF is more accurate when predicting relatedness for subjects from a homogeneous population.
$ graf -plink affy_hapmap -maxhm 15 -ssm affy_hapmap_fake_ssm.txt -ped affy_hapmap_fake_pedigree.txt -out affy_hapmap_fake_rels.txt
$ PlotGraf.pl affy_hapmap_fake_rels.txt affy_hapmap_hgmr_f1.png 1 -gw 1000 -gh 500
$ PlotGraf.pl affy_hapmap_fake_rels.txt affy_hapmap_agmr_f1.png 2 -xmax 60 -ymax 20
$ PlotGraf.pl affy_hapmap_fake_rels.txt affy_hapmap_scatter_f1.png 3 -dot 5
The above examples show that graph size, axis limits and the scatter plot dot size can be adjusted by users. In the first step a fake pedigree and a fake SSM file are used to show how GRAF finds and reports errors in the pedigree and SSM files. The HGMR histogram generated in the second step shows that some of the related pairs reported by the pedigree and SSM files don't match the genetic relatedness determined by GRAF. It also shows that the graph size can be adjusted by using options -gw and -gh. The AGMR histogram also shows the mismatches between the relationship types reported in the input files and those determined by GRAF. The axis limits can be adjusted by using the -xmax and -ymax options. The scatter plot shows that the dot size can be adjusted using the -dot option.
Multiple genotype datasets can be combined into one .fpg file and passed to graf for determining genetic relationships, e.g.,
$ graf -exfp affy_hapmap,perlegen_hapmap -out comb_hapmap.fpg
$ graf -geno comb_hapmap.fpg -out comb_hapmap_rels.txt -maxhm 15 -ped affy_hapmap_fake_pedigree.txt -ssm comb_hapmap_ssm.txt
$ PlotGraf.pl comb_hapmap_rels.txt comb_hapmap_hgmr.png 1
$ PlotGraf.pl comb_hapmap_rels.txt comb_hapmap_agmr.png 2
$ PlotGraf.pl comb_hapmap_rels.txt comb_hapmap_scatter.png 3
When multiple datasets are used, if there are no SSM and pedigree files, it is not required that the sample and subject IDs be unique across datasets. GRAF uses both DS# and subject/sample IDs to identify subjects or samples. In the output table, GRAF shows both the DS# and ID for each subject or sample. However, when there are SSM and pedigree files, it is required that IDs be unique across datasets. GRAF doesn't take multiple SSM or pedigree files. The user needs to combine multiple SSM or pedigree files into one, and each ID in the combined SSM or pedigree file should represent only one sample or subject. Neither the SSM file nor the pedigree file has DS# columns.
The -hfd option of PlotGraf.pl lets the user set the dot size for the half sibling + full cousin pairs (HF, see Table 1) in the scatter plot. The HF relationship is genetically more distant than full sibling but closer than second degree relatives. In the scatter plot, these pairs are predicted to lie between FS and D2 pairs. In the rare cases when there are HF pairs, the user can use the -hfd option to highlight them by setting a different dot size for them.
GRAF-pop calculates genetic distances from each subject to several reference populations and estimates subject ancestry and ancestral proportions based on these distances. Four genetic distance scores, GD1, GD2, GD3 and GD4, are used in ancestry inference in the current version of GRAF. Subjects in the input datasets are clustered using these scores and plotted on scatter plots.
GRAF-pop assumes that each subject is an admixture of three ancestries: European (E), African (F), and Asian (A), and estimates ancestral proportions Pe, Pf, Pa based on GD1 and GD2 scores using barycentric coordinates. It also assigns a population ID (PopID) to each subject using the cutoff values shown in Tables 2 and 3.
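As an illustration of the barycentric-coordinate idea, the Python sketch below expresses a point in the (GD1, GD2) plane as a weighted mix of three triangle vertices. The vertex positions are placeholders, since the reference positions GRAF-pop uses for the European, African and Asian ancestries are not given here. The function name is also hypothetical, and the sketch does not address how points outside the triangle (which produce negative weights) are handled.

```python
def barycentric_proportions(point, vertex_e, vertex_f, vertex_a):
    """Return (Pe, Pf, Pa) as fractions that sum to 1.

    Each argument is an (x, y) pair; the vertices are the assumed positions
    of the pure European, African and Asian ancestries in the (GD1, GD2) plane.
    """
    x, y = point
    (xe, ye), (xf, yf), (xa, ya) = vertex_e, vertex_f, vertex_a
    det = (yf - ya) * (xe - xa) + (xa - xf) * (ye - ya)
    pe = ((yf - ya) * (x - xa) + (xa - xf) * (y - ya)) / det
    pf = ((ya - ye) * (x - xa) + (xe - xa) * (y - ya)) / det
    pa = 1.0 - pe - pf
    return pe, pf, pa


# Placeholder triangle, NOT the reference positions used by GRAF-pop.
E, F, A = (0.0, 1.0), (-1.0, 0.0), (1.0, 0.0)

# A subject halfway between the F and A vertices.
pe, pf, pa = barycentric_proportions((0.0, 0.0), E, F, A)
print(f"Pe = {100 * pe:.0f}%, Pf = {100 * pf:.0f}%, Pa = {100 * pa:.0f}%")
```

A subject located exactly at one vertex gets 100% of that ancestry, and a subject at the centroid gets one third of each.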
Table 2. Grouping subjects based on the ancestry proportions
PopID | Population | Cutoff standard |
---|---|---|
1 | European | Pe ≥ 87% |
2 | African | Pf ≥ 95% |
3 | East Asian | Pa ≥ 95% |
4 | African American | 40% ≤ Pf < 95% and Pa < 13% |
5 | Hispanic1 | Pf < 40% and Pe < 87% and Pa < 13% and Pf ≥ Pa |
6,7,8 | (Three populations) | Pa < 95% and Pe < 87% and Pf < 13% and Pf < Pa |
9 | Other | Pa ≥ 13% and Pf ≥ 13% |
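The cutoffs in Table 2 translate directly into a chain of conditions. The Python sketch below checks the rows in table order and takes proportions in percent. The function name is hypothetical, the refinement of the 6,7,8 group by GD1 and GD4 scores is left out, and the sketch uses only the default cutoffs.

```python
def table2_group(pe, pf, pa):
    """Return the PopID from Table 2, or "6,7,8" for the unresolved group.

    pe, pf, pa are the European, African and Asian proportions in percent.
    None is returned if no row matches.
    """
    if pe >= 87:
        return 1  # European
    if pf >= 95:
        return 2  # African
    if pa >= 95:
        return 3  # East Asian
    if 40 <= pf < 95 and pa < 13:
        return 4  # African American
    if pf < 40 and pe < 87 and pa < 13 and pf >= pa:
        return 5  # Hispanic1
    if pa < 95 and pe < 87 and pf < 13 and pf < pa:
        return "6,7,8"  # split further using GD1 and GD4 scores
    if pa >= 13 and pf >= 13:
        return 9  # Other
    return None


# Made-up ancestry proportions (Pe, Pf, Pa) in percent.
for proportions in [(90, 5, 5), (20, 75, 5), (70, 20, 10), (50, 5, 45), (40, 30, 30)]:
    print(proportions, "->", table2_group(*proportions))
```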
Table 3. Separating Asians and Hispanics using GD1 and GD4 scores
PopID | Population | Cutoff standard |
---|---|---|
7 | Other Asian | GD1 > 30 × (GD4)² + 1.73 |
8 | South Asian | GD4 > 5 × (GD1 - 1.69)² + 0.042 |
6 | Hispanic2 | GD4 < 0 and PopID is not 7 |
Like GRAF-rel, GRAF-pop takes genotype datasets in either PLINK format (.fam, .bim, .bed) or GRAF format (.fpg). In addition, GRAF-pop can read self-reported ancestries from an input file and compare the ancestries inferred from genotypes with the self-reported ones. The input file should be a plain text file with two columns (without column headers), containing the subject ID and the self-reported ancestry, respectively.
The -pop option is used by graf to infer subject ancestry:
$ graf
...
Usage: graf [options]
-pop output file: Check subject populations and save results to the output file
...
The following command determines population structures and saves results to the output file:
$ graf -plink G1000FpGeno -pop G1000_sbj_scores.txt
The results generated by graf can be passed to PlotPopulations.pl for further processing. The following instructions are displayed on the screen when the script is run without parameters:
$ PlotPopulations.pl
Usage: PlotPopulations.pl <input file> <output file> [Options]
Note:
Output file should be either a .png file or a .txt file.
If the output file is a .png file, the script will plot the results to a graph and save the graph to the file.
If the output file is a .txt file, the script will save the calculated subject ancestry components to the file.
Options:
Set window size in pixels
-gw graph width
Set graph axis limits
-xmin min x value
-xmax max x value
-ymin min y value
-ymax max y value
Set a rectangle area to retrieve subjects for graph of GD1 vs. GD2
-xcmin min x value
-xcmax max x value
-ycmin min y value
-ycmax max y value
-isByd 0 or 1
0: retrieve subjects whose values are within the above rectangle (default value)
1: retrieve subjects whose values are beyond the above rectangle
Set population cutoff lines
-ecut proportion: cutoff European proportion dividing Europeans from other population. Default 87%.
-fcut proportion: cutoff African proportion dividing Africans from other population. Default 95%.
Set it to -1 to combine African and African American populations
-acut proportion: cutoff East Asian proportion dividing East Asians from other populations. Default 95%.
Set it to -1 to combine East Asian and Other Asian populations
-ohcut proportion: cutoff African proportion dividing Hispanics from Other population. Default 13%.
-fhcut proportion: cutoff African proportion dividing Hispanics from African Americans. Default 40%.
Select some self-reported populations (by IDs) to be highlighted on the graph
-pops comma separated population IDs, e.g., -pops 1,3,4 -> highlight populations #1, #3 and #4
Select self-reported populations (by IDs) to show areas including 95% dbGaP subjects with genotypes of at least 4000 fingerprint SNPs
-areas comma separated dbGaP self-population IDs, e.g., -areas 1,3
-> show areas that include 95% dbGaP subjects with self-reported populations #1 and #3
1: European/White/Caucasian
2: African (Ghana/Yoruba)
3: East Asian (Chinese/Japanese)
4: African American/Black
5: Puerto Rican/Dominican
6: Mexican/Latino
7: Asian/Pacific Islander
8: Asian Indian/Pakistani
Select which score to show on the y-axis
-gd4 1 or 0. 1: show GD4 on y-axis; 0: show GD2
Set population cutoff lines
-cutoff 1 or 0. 1: show cutoff lines; 0: hide cutoff lines
Rotate the plot with respect to the x-axis by a certain angle
-rotx angle in degrees
Set the size (diameter) of each dot that represents each subject
-dot pixels
The input file with self-reported subject race information
-spf a file with two columns: subject and self-reported population
The script takes two required parameters, which must be the first two arguments and, unlike the optional arguments, are not preceded by flags. The first parameter should be the name of the file that is generated by graf with the -pop option and contains subject genetic distance scores. The second parameter is the output file, expected to be either a .png or a .txt file. If the output file is a .png file, the script plots the scores and saves the graph to the output file. The default graph is GD1 vs. GD2, e.g.,
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops.png
When option -gd4 is set to 1, the script generates a graph of GD1 vs. GD4:
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_gd4.png -gd4 1
If the output file is a .txt file, the script processes the data and saves the results to the output file as a rectangular table.
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_list.txt
In the output file, columns P_e, P_f, P_a show each subject's European, African, and East Asian proportions Pe, Pf, Pa, in percentages. The populations determined by GRAF-pop are included in the last two columns as an identifier and as the full name of the population.
When self-reported ancestries are available, the information can be passed to the script with the -spf option so that the script can color-code the subjects using the self-reported ancestries, e.g.,
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_sp.png -spf G1000SbjSuperPop.txt
The format of the input ancestry file is described above. In the graph generated by the script, the ancestries are numbered and color coded.
The cutoff lines used to partition the subjects are drawn on the graphs when option -cutoff is set to 1, e.g.,
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_cut.png -spf G1000SbjSuperPop.txt -cutoff 1
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_cut_gd4.png -spf G1000SbjSuperPop.txt -gd4 1 -cutoff 1
If multiple subjects appear at the same locations in the x-y plane, the user can use option -pops to bring some ancestries to the front, while sending other ancestries to the back and fading them out in the graph. For example, the following command generates a graph with ancestry No. 5 (AMR, standing for Admixed American) in the back and colored yellow (the assignments of colors to populations are currently hard-coded):
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_1234.png -spf G1000SbjSuperPop.txt -pops 1,3,2,4
The ancestry numbers following -pops should be separated by commas without spaces.
One can also use the -rotx option to rotate the graph of GD2 vs. GD1 around the x-axis by a certain angle specified in degrees (can be any real number). For example, the following command generates a graph showing the subjects rotated by 90°:
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_90.png -spf G1000SbjSuperPop.txt -rotx 90
Options -gw, -xmin, -xmax, -ymin, -ymax, -dot, similar to those in PlotGraf.pl, can be used to adjust the graph size, specify axis limits, and set the dot size, e.g.,
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_gw.png -spf G1000SbjSuperPop.txt -gw 800 -ymin 1.1 -dot 5
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_gw_gd4.png -spf G1000SbjSuperPop.txt -gw 800 -gd4 1 -ymin -0.2
One can use the option -areas to select populations for which to show the expected oval areas that include 95% of dbGaP subjects with genotypes for at least 4000 fingerprint SNPs, e.g.,
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_a.png -spf G1000SbjSuperPop.txt -areas 1,4,7
The integers in the comma-delimited string represent the eight self-reported ancestry groups in dbGaP, with most common ancestry terms in each group shown below:
1: European/White/Caucasian
2: African (Ghana/Yoruba)
3: East Asian (Chinese/Japanese)
4: African American/Black
5: Puerto Rican/Dominican
6: Mexican/Latino
7: Asian/Pacific Islander
8: Asian Indian/Pakistani
GRAF-pop uses the ancestry proportions shown in Tables 2 and 3 as default cutoff values. The user can use options -ecut, -fcut, -acut, -ohcut, -ahcut and -fhcut to set the cutoff values to different numbers, e.g.,
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_ucut.png -spf G1000SbjSuperPop.txt -cutoff 1 -fcut 85 -ahcut 80 -ohcut 15.5
When -fcut or -acut is set to a negative value, the African or East Asian cutoff line is not plotted on the graph, and the script does not distinguish Africans from African Americans, or East Asians from Other Asians, e.g.,
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_pops_nf.png -spf G1000SbjSuperPop.txt -cutoff 1 -fcut -1
As mentioned above, when the second parameter (the output file) is a .txt file, the script saves subjects and the ancestry proportions into a rectangular table. Options -xcmin, -xcmax, -ycmin, -ycmax and -isByd can be used to specify a rectangular area and let the script retrieve subjects whose x (GD1) and y (GD2) scores are either within or beyond this area. For example, the following command saves all subjects with 1.8 < GD1 < ∞ and -∞ < GD2 < 1.2, which are all the EAS (East Asian) subjects:
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_list_cut.txt -spf G1000SbjSuperPop.txt -xcmin 1.8 -ycmax 1.2
When option -isByd is set to 1, the script retrieves subjects whose values are beyond the rectangular area specified by options -xcmin, -xcmax, -ycmin and -ycmax. For example, the following command excludes most of the 1000 Genomes Project subjects with super populations AMR (Admixed American) and SAS (South Asian):
$ PlotPopulations.pl G1000_sbj_scores.txt G1000_sbj_list_cutb.txt -spf G1000SbjSuperPop.txt -xcmin 1.64 -xcmax 1.8 -ycmin 1.24 -ycmax 1.36 -isByd 1
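The within/beyond selection can be sketched in Python as follows. This illustrates the selection logic only and is not the script's code. It assumes that an unspecified bound defaults to plus or minus infinity (as the GD1 > 1.8 example suggests), that the comparisons are strict, and that "beyond" means every subject not inside the rectangle. The function name and the scores are made up.

```python
import math


def select_subjects(scores, xcmin=-math.inf, xcmax=math.inf,
                    ycmin=-math.inf, ycmax=math.inf, is_beyond=False):
    """Return IDs of subjects inside (or, if is_beyond, outside) a rectangle.

    scores is a list of (subject ID, GD1, GD2) tuples.
    """
    selected = []
    for subject, gd1, gd2 in scores:
        inside = xcmin < gd1 < xcmax and ycmin < gd2 < ycmax
        if inside != is_beyond:
            selected.append(subject)
    return selected


# Made-up (subject, GD1, GD2) scores.
scores = [("S1", 1.9, 1.0), ("S2", 1.7, 1.3), ("S3", 1.2, 1.5)]

# Counterpart of: -xcmin 1.8 -ycmax 1.2
print(select_subjects(scores, xcmin=1.8, ycmax=1.2))

# Counterpart of: -xcmin 1.64 -xcmax 1.8 -ycmin 1.24 -ycmax 1.36 -isByd 1
print(select_subjects(scores, xcmin=1.64, xcmax=1.8, ycmin=1.24, ycmax=1.36,
                      is_beyond=True))
```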
Jin Y, Schäffer AA, Sherry ST, and Feolo M (2017). Quickly identifying identical and closely related subjects in large databases using genotype data. PLoS One. 12(6):e0179106.
Jin Y, Schäffer AA, Feolo M, Holmes JB, and Kattman BL (2019). GRAF-pop: A Fast Distance-based Method to Infer Subject Ancestry from Multiple Genotype Datasets without Principal Components Analysis. G3: Genes | Genomes | Genetics. 9(8):2447-2461.