This repo describes pangenomes produced by the Human Pangenome Reference Consortium from year 1 data. For information about data reuse and publishing with HPRC data, please see the HPRC's Data Use Protocol.
Note: The pangenomes and resultant files referred to in this repo have not been fully QC'd, are not published, and may have known issues.
A Draft Human Pangenome Reference
Graphs built with three different strategies are available. The strategies are summarized in the table (and relevant sections) below:
Property | Minigraph | Minigraph-Cactus | PGGB |
---|---|---|---|
sequence comparison | reference-based, progressive | reference-based, progressive | symmetric, all-vs-all |
resolution | SV only | base-level (via abPOA) | base-level (via abPOA) |
scope | full assemblies | Non-centromeric | full assemblies |
cyclic paths | no | non-reference | all |
short read mapping | untested | yes (fast) | untested |
long read mapping | yes (fastest) | yes | yes (slowest) |
assembly mapping | yes (direct) | untested | yes (via injection) |
Index files listing file locations for download with the AWS CLI can be found in the indexes folder of this repository. Alternatively, the files are listed in tables in each graph creation strategy's section below. Note that the index files list the file locations as s3:// URIs, whereas the tables use http:// URLs.
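Because the index files use `s3://` URIs while the tables use `http://` URLs, it can be handy to convert one form into the other. The following is a minimal sketch that assumes path-style S3 addressing; the `endpoint` default and the example bucket and key are placeholders, not values taken from the index files.

```python
def s3_to_http(uri: str, endpoint: str = "https://s3.amazonaws.com") -> str:
    """Rewrite s3://bucket/key as a path-style URL under `endpoint`.

    `endpoint` is an assumption: substitute the host used by the links
    in the tables of this README.
    """
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in {uri!r}")
    return f"{endpoint.rstrip('/')}/{bucket}/{key}"


# Hypothetical bucket and key, for illustration only.
print(s3_to_http("s3://example-bucket/pangenomes/example.gfa.gz"))
```

The same function can be mapped over every line of an index file to produce a list that works with generic download tools instead of the AWS CLI.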
Information about the source assemblies can be found in the HPRC Assembly GitHub repository. Of the 47 samples assembled (94 assemblies) in year 1, all but three samples were included in graph constructions (HG002, HG005 and NA19240 were excluded for evaluation purposes). GRCh38 and CHM13 were added to make the total number of haplotypes included 90.
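The haplotype count above follows from simple arithmetic, spelled out here as a quick check. All numbers come from the paragraph above; only the variable names are made up.

```python
samples_assembled = 47
haplotypes_per_sample = 2                      # 47 samples -> 94 assemblies
held_out = {"HG002", "HG005", "NA19240"}       # excluded for evaluation
references = {"GRCh38", "CHM13"}               # added to the graphs

assemblies = samples_assembled * haplotypes_per_sample
graph_haplotypes = (
    (samples_assembled - len(held_out)) * haplotypes_per_sample
    + len(references)
)
print(assemblies, graph_haplotypes)            # 94 90
```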
Minigraph (cite) is a very fast generalization of minimap2 that builds the graph iteratively. It aligns sequences to approximate locations and can be used to call structural variants (>50 nt). Separate graphs were built with GRCh38 and with CHM13+Y (found here) as the reference sequence.
Description | GRCh38 Graph | CHM13 Graph |
---|---|---|
graph | graph | graph |
bed | bed index | bed index |
Minigraph-Cactus (cite) adds base-level alignment to minigraph graphs.
Note: The links below have been updated to point to version 1.1 of the graphs, which contains numerous bug fixes and updated file formats (this includes switching from `.` to `#` as the path name separator in all vg files). The original version 1.0 graph that was described in the HPRC paper has been moved here. The input assemblies are the same for both versions, so unless you are trying to exactly reproduce results from the paper, please consider using the updated version.
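The separator change matters for any script that splits path names into fields. Below is a minimal sketch, assuming path names have three fields (sample, haplotype, contig); this README does not spell out the layout, and the example names are hypothetical.

```python
from typing import NamedTuple


class PathName(NamedTuple):
    sample: str
    haplotype: str
    contig: str


def parse_path_name(name: str, sep: str = "#") -> PathName:
    """Split 'sample<sep>haplotype<sep>contig' into its three fields.

    Only the first two separators are split on, so a contig name that
    itself contains the separator stays intact.
    """
    parts = name.split(sep, 2)
    if len(parts) != 3:
        raise ValueError(f"expected three {sep!r}-separated fields: {name!r}")
    return PathName(*parts)


# Hypothetical path names, for illustration only.
new_style = parse_path_name("SAMPLE1#1#contig000001.1")
old_style = parse_path_name("SAMPLE1.1.contig000001.1", sep=".")
print(new_style == old_style)
```

Limiting the split to two separators is what keeps a `.` inside the contig field from being misread when handling the older `.`-separated names.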
Graphs and associated files are summarized below.
Description | GRCh38 Graph | CHM13 Graph |
---|---|---|
Graph | gfa gbz | gfa gbz |
Full (Unclipped) Graph | gfa gbz odgi | gfa gbz odgi |
Chromosome Graphs | chroms | chroms |
Decomposed VCF | VCF VCF index | VCF VCF index GRCh38-VCF GRCh38-VCF index |
Raw VCF | VCF VCF index | VCF VCF index GRCh38-VCF GRCh38-VCF index |
Multiple Alignment | HAL MAF MAF Index TAF TAF Index | HAL MAF MAF Index TAF TAF Index |
Multiple Alignment (Duplications removed) | MAF MAF Index TAF TAF Index | MAF MAF Index TAF TAF Index |
VG Indexes | gbz hapl dist min snarls | gbz hapl dist min snarls |
AF-Filtered VG Indexes | gbz dist min snarls | gbz dist min snarls |
Excluded Regions | full graph bed clipped graph bed | full graph bed clipped graph bed |
All Files | files | files |
The graphs are available in gfa format alongside other graph and index files. Information about the associated file formats can be found:
- mc output overview
- vg wiki
- gbz format (cite)
- hapl format (new in vg 1.51.0): allows `vg giraffe` to infer a personalized pangenome as an alternative to AF-filtering.
- maf and taf
The Raw VCF files contain a site for each bubble in the graph. Nested bubbles result in overlapping sites. The nesting relationships are denoted with the `PS` (parent snarl), `LV` (level) and `AT` (allele traversal) tags and need to be taken into account when interpreting the VCF. Alternatively, you can use the "Decomposed VCFs", which have been normalized by using vcfbub to "pop" bubbles with alleles larger than 100 kb and vcfwave to realign each alt allele to the reference (script). Note that to reproduce the PanGenie analyses from the papers, you should instead use the PanGenie HPRC Workflow, which has a CHM13 branch for working with that reference.
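One simple way to avoid double counting overlapping sites in a Raw VCF is to keep only the outermost bubbles. The sketch below reads the `LV` tag from the INFO column and assumes that `LV=0` marks a top-level (non-nested) site; the records are hypothetical and much shorter than real ones.

```python
def parse_info(info: str) -> dict:
    """Turn a VCF INFO string such as 'LV=1;PS=>1>5' into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True   # flag-style keys carry no value
    return fields


def is_top_level(vcf_line: str) -> bool:
    """True when the record's LV tag is 0 (assumed to mean 'not nested')."""
    info = parse_info(vcf_line.rstrip("\n").split("\t")[7])
    return int(info.get("LV", 0)) == 0


# Hypothetical records, for illustration only.
records = [
    "chr1\t100\t>1>5\tA\tAT\t60\t.\tLV=0;AT=>1>2>5,>1>3>5",
    "chr1\t101\t>2>4\tC\tG\t60\t.\tLV=1;PS=>1>5;AT=>2>3>4,>2>4",
]
top = [r.split("\t")[2] for r in records if is_top_level(r)]
print(top)
```

Filtering this way discards the finer-grained nested alleles, so it is a quick inspection aid rather than a substitute for the decomposition described above.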
The exact tools and commands used to produce the VCFs are given here.
The "AF-Filtered VG Indexes" above were created by dropping nodes and edges supported by fewer than 10% of haplotypes. They give the best performance for Giraffe and are what has been used in the various papers to date. Note that `giraffe` requires only the `.gbz`, `.dist` and `.min` indexes.
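To make the 10% rule concrete, here is a toy illustration of frequency filtering on nodes only. It is not the implementation used to build the indexes; the function names, the `node_haplotypes` mapping, and the toy data are all invented for this sketch.

```python
def min_support(n_haplotypes: int, percent: int = 10) -> int:
    """Smallest haplotype count that is at least `percent` percent of the total."""
    return (n_haplotypes * percent + 99) // 100   # integer ceiling


def filter_nodes(node_haplotypes: dict, n_haplotypes: int, percent: int = 10) -> set:
    """Keep node ids whose supporting haplotype set meets the threshold."""
    threshold = min_support(n_haplotypes, percent)
    return {node for node, haps in node_haplotypes.items() if len(haps) >= threshold}


# Toy graph: 20 haplotypes, so a node needs support from at least 2 of them.
toy = {
    "n1": set(range(20)),   # on every haplotype
    "n2": {0, 1, 2},        # 15% support, kept
    "n3": {7},              # 5% support, dropped
}
print(sorted(filter_nodes(toy, n_haplotypes=20)))
```

Integer arithmetic is used for the threshold to avoid floating-point rounding when the haplotype count is multiplied by a fraction.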
Some input contigs could not be assigned to a reference chromosome and were dropped. See the "full graph bed" files above for a listing of these. Contig fragments >10 kb that did not map anywhere were likewise excluded (these regions are predominantly centromeric). See the "clipped graph bed" files above for these regions (this file also includes the unassigned contigs). `dna-brnn` was not used to make these graphs.
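To see how much sequence was excluded from each contig, the BED intervals can be summed. This is a minimal sketch that assumes the excluded-region files are plain tab-delimited BED with at least three columns; the example intervals are hypothetical.

```python
from collections import defaultdict


def excluded_bases(bed_lines) -> dict:
    """Sum interval lengths per sequence name from BED lines.

    BED coordinates are 0-based and half-open, so length = end - start.
    """
    totals = defaultdict(int)
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        name, start, end = line.split("\t")[:3]
        totals[name] += int(end) - int(start)
    return dict(totals)


# Hypothetical intervals, for illustration only.
example = [
    "contigA\t0\t15000",
    "contigA\t40000\t52000",
    "contigB\t100\t10200",
]
print(excluded_bases(example))
```

An open file handle can be passed in place of the list, since the function only iterates over lines.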
The Pangenome Graph Builder pipeline (PGGB) (cite) creates an all-vs-all graph with base-level alignments and no clipping of mitochondrial or centromeric regions.
Graphs and associated files are summarized below.
Description | Location |
---|---|
graph | gfa |
untangle | delta paf |
Decomposed VCFs | GRCh38 VCF GRCh38 VCF Index |
Raw VCFs | chm13.1-22+X chm13.M grch38.1-22+X grch38.M grch38.Y |
Graph chromosome files and images can be found here and here.
See above for more information on VCF decomposition (script).
* Dec 03, 2021: updated minigraph-cactus VCFs to fix headers (thanks to Wen-Wei)