Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent worse contact accuracy resulted from hhblits3 #205

Closed
kad-ecoli opened this issue Jun 29, 2020 · 15 comments
Closed

Consistent worse contact accuracy resulted from hhblits3 #205

kad-ecoli opened this issue Jun 29, 2020 · 15 comments

Comments

@kad-ecoli
Copy link

kad-ecoli commented Jun 29, 2020

Expected Behavior

hhblits3 (commit d755970) should give better MSA than hhblits2 (2.0.15, patched with https://gist.github.com/milot-mirdita/fd4b193a2423cc8e71868a3a68ef940f to address cs219 overflow issue)

Current Behavior

hhblits2 MSA consistently leads to higher residue-residue contact prediction precision on average than hhblits3 MSA. The same observation can be made using almost any uniclust30 version supporting both hhblits2 and hhblits3 (i.e. up to version uniclust30_2018_08) and most contact predictors based on coevolution or uses coevolution as the main input feature for deep learning. For simplicity, in the following example, we will use uniclust30_2018_08 and CCMpred. The benchmark dataset is the 211 "hard" protein domain structures from the DeepMSA benchmark dataset. Native structure attached.
hard.zip

Steps to Reproduce (for bugs)

CCMpred predictions are generated by

#### hhblits 2.0.15 ####
~/hhsuite2/bin/hhblits -i seq.fasta -d ~/uniclust/uniclust30_2018_08/uniclust30_2018_08 -cpu 1 -oa3m hhblits2.a3m -id 99 -cov 50 -n 3 -diff inf -o hhblits2.hhr
grep -v '^>' hhblits2.a3m|sed 's/[a-z]//g' > hhblits2.aln
ccmpred hhblits2.aln hhblits2.ccmpred

#### hhblits 3 last commit on Jun 28, 2020 ####
~/hhsuite/current/build/bin/hhblits -i seq.fasta -d ~/uniclust/uniclust30_2018_08/uniclust30_2018_08 -cpu 1 -oa3m hhblits3.a3m -id 99 -cov 50 -n 3 -diff inf -o hhblits3.hhr
grep -v '^>' hhblits3.a3m|sed 's/[a-z]//g' > hhblits3.aln
ccmpred hhblits3.aln hhblits3.ccmpred

To assess contact prediction precision, you can use contact_pdb.py

contact_pdb.py native.pdb hhblits2.ccmpred -infmt=gremlin
contact_pdb.py native.pdb hhblits3.ccmpred -infmt=gremlin

HH-suite Output (for bugs)

The top short, medium, long, all (=short+medm+long) range contact precision, averaged over all 211 target proteins are:

MSA short L short L/2 short L/5 medm L medm L/2 medm L/5 long L long L/2 long L/5 all L all L/2 all L/5
hhblits2 0.1180 0.1726 0.2736 0.1417 0.2105 0.3271 0.2325 0.3247 0.4248 0.3091 0.3991 0.4869
hhblits3 0.1171 0.1681 0.2650 0.1401 0.2062 0.3177 0.2262 0.3157 0.4189 0.3005 0.3927 0.4764

Here, L is the length of protein sequence; long L/5 means that we rank all predicted contacts in descending probability for residues pairs with | i - j | >=24, and choose the top L/5 residue pairs to calculate precision. In all the categories, hhblits2, on average, consistently outperforms hhblits3. The full results are attached below as tab-delimited text files.
hhblits3.ccmpred.txt
hhblits2.ccmpred.txt

Context

This issue is related to
soedinglab/uniclust-pipeline#3
and
#188
where we report an extreme case of hhblits3's poor performance (pdb 3e7u, where hhblits3 + ccmpred is worse than hhblits2 + ccmpred by 50% for long range contact prediction). Accordingly to our benchmark study, this is not an isolated case, but is actually a common issue in hhblits3. Unfortunately, uniclust30 decides to drop support for hhblits2, which forces us to implement dirty workarounds to convert recent uniclut30 to be usable by legacy hhblits2.

Your Environment

Include as many relevant details about the environment you experienced the issue in.

  • Version/Git commit used: (mentioned above)
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
cat /proc/cpuinfo |grep flags|head -1
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
 free -g
              total        used        free      shared  buff/cache   available
Mem:            125          14           0           0         110         109
Swap:            31           8          23
  • Operating system and version:
lsb_release -a
LSB Version:	:core-4.1-amd64:core-4.1-noarch
Distributor ID:	CentOS
Description:	CentOS Linux release 7.7.1908 (Core)
Release:	7.7.1908
Codename:	Core
@kad-ecoli
Copy link
Author

kad-ecoli commented Jun 29, 2020

By the way, in the hhsuite3 paper, the accuracy of MSA by hhblits2 versus that of hhblits3 was mainly benchmark in terms of identifying SCOP templates with the same fold (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7/figures/7).

Unfortunately, such benchmark only calibrates alignment statistics (e-value, posterior probability etc), but did not directly check the correctness of alignment (i.e. residue-residue correspondence) for a given query-template pair. A more sensitive and realistic benchmark to show the performance discrepancy should be a contact prediction task as shown here, or a threading task to evaluate first template TM-score. This is because in most structural bioinformatics tasks, such as contact prediction and threading, we do not only care about finding the correct hit, but also care about correctly aligning the given template sequence hit to the query sequence.

In fact, a sequence searching program that can always find good template but fail to correctly align the query to the template is arguably less useful than another program that may miss some good hit but manage to make the best alignment for each query-template pair.

@milot-mirdita
Copy link
Member

Could you upload the a3m/hhr files of hhblits2 and 3 for some of the worst offenders? That would be super useful.

@kad-ecoli
Copy link
Author

I did not keep the hhr file. You can download the a3m files at
http://zhanglab.ccmb.med.umich.edu/DeepMSA/hard.zip

@kad-ecoli
Copy link
Author

I have also finished benchmarking the results on the remaining 403 "easy" targets from the DeepMSA benchmark set. The result is consistent with that on hard targets, and shows that hhblits2 results in better CCMpred contact accuracy than hhblits3.

MSA short L short L/2 short L/5 medm L medm L/2 medm L/5 long L long L/2 long L/5 all L all L/2 all L/5
hhblits2 0.1636 0.2545 0.4230 0.2094 0.3326 0.5288 0.4083 0.5463 0.6626 0.5084 0.6233 0.7080
hhblits3 0.1608 0.2493 0.4151 0.2059 0.3260 0.5176 0.3986 0.5365 0.6588 0.4974 0.6156 0.6999

Due to file size, I am only able to upload the list and native structures at
http://zhanglab.ccmb.med.umich.edu/DeepMSA/easy.zip

The full assessment results are attached below as tab-delimited text files.
hhblits3.ccmpred.txt
hhblits2.ccmpred.txt

@martin-steinegger martin-steinegger changed the title Consistent worse contact accuracy resulted from hhblit3 Consistent worse contact accuracy resulted from hhblits3 Jun 30, 2020
@martin-steinegger
Copy link
Member

martin-steinegger commented Jun 30, 2020

Thank you for benchmarking and writing a summary. This really helps us to understand whats going on.

HHsuite3 has two critical changes that could cause this change: (1) the context specific pseudo counts
changed to [1] (2) the realignment is not performed iteratively anymore. Both changes could cause this issue but it is possible to turn both features off using -nocontxt and -norealign. @milot-mirdita did look at the differences for one of the worst performing sequences d1mabg_. The table below summarizing this finding. It seems that if both flags are used -nocontxt -norealign both methods perform pretty similar. Using only -nocontxt causes the largest difference indicating the realignment changes causes this major effect. Another indicator for a regression in the realignment is that -norealign actually improves the performance in HHsuite3 but in 2.0.16 it decreases the performance.

👉 Update: We isolated the issue, the premerge of the realign step causes the differences. We removed it from HHsuite3 because of performance issues.

Command: 
hhblits -i d1mabg_.fas -d uniclust30_2018_08 -oa3m stdout -id 99 -cov 50 -diff 99999 -cpu 16 -n 1 -v 0 $PARM | wc -l
ver no param -premerge 0 -nocontxt -nocontxt -premerge 0 -norealign -norealign -nocontxt
2.0.16 7296 2112 7248 3440 5468 5084
3 2568 3088 4914 5274

[1] Angermüller et al, Discriminative modelling of context-specific amino acid substitution probabilities
(2012) Bioinformatics

@martin-steinegger
Copy link
Member

@kad-ecoli I have added the premerge back to to hh-suite3. Using this premerge with d1mabg_.fas we could align slightly more 7352 instead of 7296 from hh-suite2.0.16. Would it be possible for your to rerun your benchmark? Thank you so much for pointing out this issue.

@kad-ecoli
Copy link
Author

What parameter should I use? Should I just use the same parameter as before or do I need to specifically call the -premerge flag?

@milot-mirdita
Copy link
Member

It's enabled by default now, you don't need to specify any additional parameters.

@kad-ecoli
Copy link
Author

I have rebuilt hhblits3 from the following commit

commit bec434ee309e4880dd73df9234c0a84465fdc93c
Author: Martin Steinegger <[email protected]>
Date:   Sun Aug 9 21:48:24 2020 +0900

Unfortunately the issue was not solved for d1mabg_, whose msa still contains 1455 sequences only. My command is

hhblits -i seq.fasta -d uniclust30_2018_08 -cpu 1 -oa3m hhblits3.a3m -id 99 -cov 50 -n 3 -diff inf -o hhblits3.hhr

The a3m and hhr files are attached below.

hhblits3.zip

@martin-steinegger
Copy link
Member

@kad-ecoli thank you for checking. Is it overall performance still worse? A single query can still perform worse.

@kad-ecoli
Copy link
Author

kad-ecoli commented Aug 16, 2020

The overall performance of hhblits3 with and without the patch is essentially identical on all 211 hard target proteins. Maybe there are some changes you did not commit.

@milot-mirdita
Copy link
Member

I just tried it out, you are right bec434e broke premerge again. We'll fix it, sorry for the confusion.

@milot-mirdita
Copy link
Member

Could you try out 47a835a? This commit produces the >3k expected sequences again. We didn't test bec434e correctly. I tested the previous commit and had found that it produces the correct number of sequences but it had another remaining issue. Martin fixed that and we just assumed it was fine.

@kad-ecoli
Copy link
Author

Yes, this indeed address the issue for d1mabg_. I am now performing the large scale benchmark to see the overall performance.

@kad-ecoli
Copy link
Author

MSA short L short L/2 short L/5 medm L medm L/2 medm L/5 long L long L/2 long L/5 all L all L/2 all L/5
hhblits2 0.1180 0.1726 0.2736 0.1417 0.2105 0.3271 0.2325 0.3247 0.4248 0.3091 0.3991 0.4869
hhblits3 no premerge 0.1171 0.1681 0.2650 0.1401 0.2062 0.3177 0.2262 0.3157 0.4189 0.3005 0.3927 0.4764
hhblits3 with premerge 0.1184 0.1716 0.2698 0.1426 0.2120 0.3193 0.2319 0.3220 0.4227 0.3065 0.3973 0.4812

After commit 47a835a, the ccmpred contact prediction accuracy with hhblits msa (hhblits3 with premerge in the above table) is indeed better than before (hhblits3 no premerge), and is similar to that of hhblits2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants