HiC mode #76

Open
dabitz opened this issue Feb 17, 2021 · 11 comments

dabitz commented Feb 17, 2021

Dear Hifiasm team,

Thank you very much for introducing the HiC mode in hifiasm. Previously I was running SALSA after assembly.
What would be the easiest way to make the Hi-C plot after a phased diploid assembly with this HiC mode?

Best,
André

@shilpagarg

You may also want to check out pstools for haplotype-aware scaffolding: shilpagarg/DipAsm#16. Unfortunately, at this stage we don't provide a Hi-C plot feature, but we will make it available in the next release. Hope this helps!

@dabitz
Author

dabitz commented Feb 17, 2021

Thanks a lot. I will check it out. What is the difference between using pstools and running hifiasm?

@shilpagarg

Pstools can produce a chromosome-scale phased assembly.

@dabitz
Author

dabitz commented Feb 17, 2021

OK, thanks. And what does the version of hifiasm with HiC integration do?

@lh3
Collaborator

lh3 commented Feb 17, 2021

Hifiasm partitions contigs into two groups with Hi-C. You can run SALSA on the partitioned contigs. PS: the major benefit of Hi-C is that you can get much longer phased blocks. Also, pstools is a separate project that uses hifiasm output and a different algorithm. You can try both and see which works better for you.

@chhylp123
Owner

chhylp123 commented Feb 17, 2021

As Heng said, hifiasm generates phased contigs for now. I believe the current hifiasm can probably generate good phased contigs. That is, within each contig, the switch error rate and Hamming error rate were the lowest in my experiments. However, I'm not sure whether it can correctly assign contigs to each haplotype. We're still improving this part. But I guess that if we already have phased contigs, it is probably not too hard to use other information to assign them, or even to adjust them manually by eye, since there are not too many contigs and the contigs are long enough.

As for scaffolding, because the contigs are long enough, getting a reasonably good scaffolded assembly is not hard. However, it may still have problems in difficult regions. I believe that with HiFi in hand, we will eventually have a nearly perfect scaffolded assembly with totally new algorithms. But it takes time to do that.

PS: pstools is not a part of hifiasm. It should use different information and strategies compared with hifiasm. As Heng said, you can try both to see which works better for your project.

@dabitz
Author

dabitz commented Feb 18, 2021

Thank you guys for the nice tips. I will definitely try them out. I just finished the first test run of the HiC mode and ended up with hap1 much smaller than hap2 (303 Mbp vs. 475 Mbp), whereas the genome size is in fact around 390 Mbp. Any idea why?

@chhylp123
Owner

Well, that is the assignment problem (i.e., how to assign contigs to each haplotype). We are working on that and hopefully can fix it in a few days. This problem is similar to purge_dups. By the way, do you have any solution to roughly evaluate the contig Hamming and switch error rates? It would be helpful to get those two numbers.
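
A quick arithmetic check on the sizes reported above is consistent with an assignment issue rather than missing sequence: the two haplotypes together add up to roughly twice the estimated genome size. A minimal sketch, using only the numbers quoted in this thread (the variable names are illustrative):

```python
# Sizes reported in the thread, in Mbp.
hap1_mbp = 303
hap2_mbp = 475
haploid_estimate_mbp = 390  # approximate genome size given by the reporter

# Combined assembly size versus the expected diploid total.
total_mbp = hap1_mbp + hap2_mbp
expected_total_mbp = 2 * haploid_estimate_mbp

# How far each haplotype deviates from the haploid estimate.
hap1_deficit_mbp = haploid_estimate_mbp - hap1_mbp
hap2_excess_mbp = hap2_mbp - haploid_estimate_mbp

print(f"total: {total_mbp} Mbp, expected diploid total: {expected_total_mbp} Mbp")
print(f"hap1 deficit: {hap1_deficit_mbp} Mbp, hap2 excess: {hap2_excess_mbp} Mbp")
```

The deficit in hap1 and the excess in hap2 are nearly equal, which is what one would expect if contigs were placed in the wrong group rather than lost or duplicated.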

@dabitz
Author

dabitz commented Feb 19, 2021

Thanks for the update. I wonder if I could run purge_dups on the bigger hap assembly and then merge the rest with the other, smaller hap...

I am not an expert on that, but I found this approach reported in https://www.nature.com/articles/s41587-020-0719-5

Phasing accuracy estimates
To evaluate phasing accuracy, we determined SNVs in our phased assemblies based on their alignments to GRCh38. This procedure is described in the ‘SV, indel and SNV detection’ section in the Methods. We evaluate phasing accuracy of our assemblies in comparison to trio-based phasing for HG00733 (ref. 19) and NA12878 (ref. 46). In all calculations, we compare only SNV positions that are shared between our SNV calls and those from trio-based phasing. To count the number of switch errors between our phased assemblies and trio-based phasing, we compare all neighboring pairs of SNVs along each haplotype and recode them into a string of 0s and 1s depending on whether the neighboring alleles are the same (0) or not (1). The absolute number of differences in such binary strings is counted between our haplotypes and the trio-based haplotypes (per chromosome). The switch error rate is reported as a fraction of counted differences of the total number of compared SNVs (per haplotype). Similarly, we calculate the Hamming distance as the absolute number of differences between our SNVs and trio-based phasing (per chromosome) and report it as a fraction of the total number of differences to the total number of compared SNVs (per haplotype).

hope it helps
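
The procedure quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: alleles are already encoded as 0/1 at SNV positions shared between the assembly haplotype and the trio-based haplotype, one chromosome at a time, and the function names are hypothetical rather than taken from any tool.

```python
from typing import List, Sequence


def neighbor_code(alleles: Sequence[int]) -> List[int]:
    """Recode neighboring SNV pairs: 0 if the two alleles match, 1 otherwise."""
    return [0 if a == b else 1 for a, b in zip(alleles, alleles[1:])]


def _check(assembly: Sequence[int], truth: Sequence[int]) -> None:
    if len(assembly) != len(truth):
        raise ValueError("haplotypes must cover the same shared SNV positions")
    if len(assembly) == 0:
        raise ValueError("no SNVs to compare")


def switch_error_rate(assembly: Sequence[int], truth: Sequence[int]) -> float:
    """Differences between the neighbor-pair codes, over the number of compared SNVs."""
    _check(assembly, truth)
    diffs = sum(x != y for x, y in zip(neighbor_code(assembly), neighbor_code(truth)))
    return diffs / len(assembly)


def hamming_rate(assembly: Sequence[int], truth: Sequence[int]) -> float:
    """Positions where the alleles differ, over the number of compared SNVs."""
    _check(assembly, truth)
    diffs = sum(a != t for a, t in zip(assembly, truth))
    return diffs / len(assembly)


# Toy example: a single switch halfway along the haplotype.
assembly_hap = [0, 0, 0, 1, 1, 1]
trio_hap = [0, 0, 0, 0, 0, 0]
print(switch_error_rate(assembly_hap, trio_hap))
print(hamming_rate(assembly_hap, trio_hap))
```

In the toy example a single switch produces a low switch error rate but a high Hamming rate, which is why the two numbers are usually reported together.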

@dabitz
Author

dabitz commented Feb 19, 2021

Another way, from a colleague:

We have a trio {parents: pA, pB; child: C} for evaluating phasing using k-mers: for the child we have the phased assembly, while for the parents we need Illumina short reads.

For example, from the apricot work (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02235-5):

After polishing the assemblies respectively with the “Currot”-genotype and “Orange Red”-genotype PacBio reads using apollo [52], we built up two sets of haplotype-specific k-mers from the assemblies, rC and rO. Correspondingly, a set of “Currot”-specific k-mers (with coverage from 10 to 60x), pC, was selected from the parental Illumina WGS that did not exist in “Orange Red” short reads (coverage over 1x) but in “Rojo Pasión” pollen short reads (coverage from 10 to 300x); similarly, a set of “Orange Red”-specific k-mers, pO, was also collected. Then, we intersected rC and rO with pC and pO respectively, leading to four subsets rC ∩ pC, rC ∩ pO, rO ∩ pC, and rO ∩ pO, which were used to calculate average haplotyping accuracy. All k-mer processing (counting, intersecting and difference finding) were performed with KMC [53].
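
The four intersections in the quoted passage map directly onto set operations. The sketch below uses plain Python sets on toy k-mers instead of KMC, and it assumes one plausible definition of accuracy (the fraction of parent-specific k-mers in each haplotype that come from the matching parent, averaged over both haplotypes); the quoted text does not spell out the exact formula, so treat that part and the function name as assumptions.

```python
from typing import Set


def haplotyping_accuracy(rC: Set[str], rO: Set[str], pC: Set[str], pO: Set[str]) -> float:
    """Average per-haplotype accuracy from the four k-mer intersections.

    rC, rO: k-mers specific to each assembled haplotype.
    pC, pO: k-mers specific to each parent.
    The accuracy definition is an assumption, not taken from the paper.
    """
    cc = len(rC & pC)  # haplotype C k-mers supported by the matching parent
    co = len(rC & pO)  # haplotype C k-mers supported by the other parent
    oc = len(rO & pC)
    oo = len(rO & pO)
    if cc + co == 0 or oo + oc == 0:
        raise ValueError("a haplotype shares no k-mers with either parental set")
    acc_c = cc / (cc + co)
    acc_o = oo / (oo + oc)
    return (acc_c + acc_o) / 2


# Toy 3-mers, for illustration only.
rC = {"AAA", "AAC", "AAG", "AAT"}
rO = {"CCC", "CCG", "GGG"}
pC = {"AAA", "AAC", "AAG"}
pO = {"AAT", "CCC", "CCG"}
print(haplotyping_accuracy(rC, rO, pC, pO))
```

For real data the sets would be far too large for in-memory Python sets, which is presumably why the authors used a dedicated k-mer counter.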

@chhylp123
Owner

Thank you so much. It is easy to evaluate hamming/switch error rate with trio data. For purge_dups, I do think current tools are problematic, especially for segdups.
