Benchmark data

For cell line data, sv-callers already includes the GIAB sample NA12878/HG001, as defined on the IGSR portal.

We want to include two additional samples:

- NA24385/HG002 from GIAB
- The synthetic diploid sample CHM1_CHM13, derived from two complete hydatidiform mole (CHM) cell lines: CHM1 and CHM13

NA24385/HG002

BAM files

[chosen: longer reads] HG002 2x250 bp paired-end reads mapped to the hs37d5 reference sequence. This is the BAM file used in the HiFi (PacBio CCS reads) publication.
- BAM
- BAI
- README
HG002 2x148 bp paired-end reads (README) mapped to the hs37d5 reference sequence. This is the BAM file used in the Cameron2019 benchmark (Methods, section "Cell line evaluation").
- BAM
- BAI
- README

AK: check with the GIAB/NCBI SRA team how to refer to the BAM file directly (if at all possible) or indirectly via the FASTQ files, as was required in our PeerJ paper for the NA12878 sample (ENA:SRX1049768-SRX1049855; BioProject:PRJNA200694).

Truth sets

- HG002_SVs_Tier1_v0.6. This is the truth set used in the Cameron2019 benchmark.
- [chosen] nstd167, on dbVar. This is the most recent truth set, derived from the PacBio CCS reads.

Note: Both truth sets contain INS and DEL. Comparing the two truth sets gives 4051 DELs in common, 98 unique to nstd167, and 119 unique to HG002_SVs_Tier1_v0.6.
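To put the comparison counts into perspective, the per-set totals and overlap fractions can be derived with a few lines of Python. This is only a sketch: it assumes that all three counts refer to DEL calls, and the function name `summarize_overlap` and the dictionary keys are made up for illustration.

```python
def summarize_overlap(shared: int, only_a: int, only_b: int) -> dict:
    """Derive set sizes and overlap fractions from a three-way comparison.

    shared -- calls present in both truth sets
    only_a -- calls unique to the first truth set
    only_b -- calls unique to the second truth set
    """
    total_a = shared + only_a
    total_b = shared + only_b
    union = shared + only_a + only_b
    return {
        "total_a": total_a,
        "total_b": total_b,
        "union": union,
        "fraction_of_a_shared": shared / total_a,
        "fraction_of_b_shared": shared / total_b,
        "jaccard": shared / union,
    }


# Counts reported above; assumed here to be DELs only.
# a = nstd167, b = HG002_SVs_Tier1_v0.6
summary = summarize_overlap(shared=4051, only_a=98, only_b=119)

for key, value in summary.items():
    if isinstance(value, float):
        print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {value}")
```

Under that assumption, nstd167 would contain 4149 DELs and HG002_SVs_Tier1_v0.6 would contain 4170, so the two truth sets agree on the large majority of deletions.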

CHM1_CHM13

BAM files

Two BAM files from the study are available under ENA project PRJEB13208. The GRCh37 reference genome was used for these BAM files; see the section "Calling SNPs and short indels from Illumina data" of the publication. There are two sequencing libraries: CHM1_CHM13_2 (ERR1341796) and CHM1_CHM13_3 (ERR1341793).

[chosen: higher coverage and base quality] CHM1_CHM13_2: BAM and BAI
CHM1_CHM13_3: BAM and BAI

AK: use ENA:experiment_accession instead of run_accession (e.g. https://identifiers.org/ena.embl:ERX1413368).
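The note above can be turned into a small helper that builds the identifiers.org link and rejects run accessions. This is a sketch only: the function name `ena_experiment_url` is hypothetical, and the regular expressions assume the usual INSDC accession layout (`ERX`/`SRX`/`DRX` for experiments, `ERR`/`SRR`/`DRR` for runs, followed by at least six digits). It does not look up which experiment a run belongs to; that mapping has to come from ENA itself.

```python
import re

# URL prefix taken from the example in the note above.
IDENTIFIERS_ORG_PREFIX = "https://identifiers.org/ena.embl:"

# Assumed accession layouts (INSDC convention).
EXPERIMENT_PATTERN = re.compile(r"[EDS]RX\d{6,}")
RUN_PATTERN = re.compile(r"[EDS]RR\d{6,}")


def ena_experiment_url(accession: str) -> str:
    """Return the identifiers.org URL for an ENA experiment accession."""
    accession = accession.strip()
    if RUN_PATTERN.fullmatch(accession):
        raise ValueError(
            f"{accession} looks like a run accession; "
            "use the corresponding experiment accession instead"
        )
    if not EXPERIMENT_PATTERN.fullmatch(accession):
        raise ValueError(f"{accession} is not a recognised experiment accession")
    return IDENTIFIERS_ORG_PREFIX + accession


print(ena_experiment_url("ERX1413368"))
```

Passing a run accession such as ERR1341796 raises a `ValueError`, which makes it harder to cite the wrong accession type by accident.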

Homo_sapiens_assembly19, available from the Broad Institute, is the reference genome used for both BAM files.

Truth sets

[chosen] nstd137, published relative to both GRCh37 and GRCh38. The GRCh38 version of this CHM1_CHM13 truth set is the one used in the Cameron2019 benchmark.
