ahmedmoustafa/genetic-ancestry

Genetic Ancestry

Ethnic Groups Image source: Council of State Archivists

A detailed workflow to compute and visualize a principal component analysis (PCA) of single nucleotide polymorphism (SNP) genotypes. The workflow uses a curated set of 10,000 SNPs predefined by GRAF as ancestry markers. The PCA is computed with PLINK, which generates the eigenvectors and eigenvalues.
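As a rough illustration of the PLINK step, the sketch below assembles a PLINK 1.9 command line and parses one row of the resulting `.eigenvec` file. The file names (`kg_fingerprint.vcf.gz`, `ancestry_pca`) and the helper functions are hypothetical and are not taken from this repository. It also assumes PLINK 1.9's default headerless `.eigenvec` layout (FID, IID, then one column per component); the notebooks here may invoke PLINK differently.

```python
import shlex


def build_plink_pca_command(vcf_path, out_prefix, n_components=10):
    """Return the argument list for a PLINK 1.9 PCA run (nothing is executed here)."""
    return [
        "plink",
        "--vcf", vcf_path,            # read genotypes from a VCF file
        "--pca", str(n_components),   # number of principal components to report
        "--out", out_prefix,          # writes <prefix>.eigenvec and <prefix>.eigenval
    ]


def parse_eigenvec_line(line):
    """Split one headerless .eigenvec row into (FID, IID, [PC1, PC2, ...])."""
    fields = line.split()
    return fields[0], fields[1], [float(value) for value in fields[2:]]


# Hypothetical input and output names, for illustration only.
cmd = build_plink_pca_command("kg_fingerprint.vcf.gz", "ancestry_pca")
print(shlex.join(cmd))
```

The command is built as a list so it can be passed directly to `subprocess.run` without shell quoting issues; printing it first makes it easy to check before running.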

Workflow Overview

  1. Fingerprinting SNPs Extraction: Extract GRAF's 10,000 curated SNPs from the dbSNP database.
  2. Data Cleaning: Keep only the extracted SNPs that are biallelic. (This step is included in the notebook for step 1.)
  3. SNPs Retrieval from 1000 Genomes Project: Extract the genotypes at the 10,000 fingerprinting positions from the 1000 Genomes Project's VCF dataset.
  4. PCA Computation: Generate the PCA eigenvectors and eigenvalues using PLINK.
  5. PCA Visualization: Plot the principal components in R to show the relationships between samples.
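The data cleaning step in the list above can be sketched as a small filter over VCF records. This is a minimal sketch, not the repository's implementation: the function names and the example records are made up for illustration, and it assumes "biallelic" here means exactly one single-base REF allele and one single-base ALT allele.

```python
DNA_BASES = {"A", "C", "G", "T"}


def is_biallelic_snp(ref, alt):
    """True when REF and ALT are each a single base (one alternate allele only)."""
    # A multi-allelic ALT such as "A,T" or an indel such as "AT" is not in DNA_BASES.
    return ref in DNA_BASES and alt in DNA_BASES


def filter_biallelic(vcf_lines):
    """Keep header lines and records that describe biallelic SNPs."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # pass VCF header lines through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]  # VCF columns: CHROM POS ID REF ALT ...
        if is_biallelic_snp(ref, alt):
            kept.append(line)
    return kept


# Made-up records for illustration; these are not real dbSNP entries.
example = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t1000\tsnp_a\tA\tG",    # biallelic SNP: kept
    "1\t2000\tsnp_b\tC\tA,T",  # multi-allelic: dropped
    "1\t3000\tsnp_c\tG\tGT",   # insertion: dropped
]
cleaned = filter_biallelic(example)
```

In practice a tool such as bcftools can apply the same kind of filter directly to compressed VCF files; the Python version is only meant to make the rule explicit.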

Populations PCA