Problem with Crosslinking data input #18
The xl array (as well as a contact map or the pair representation) is symmetric, which is why you have both (i,j) and (j,i). The grouping array is an artefact; it's no longer required in the distogram network. Here, we just assign every crosslink to its own group, indicated by an integer. To reproduce the results, you need to disable all sources of non-determinism, for example the MSA masking.
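To make this concrete, here is a minimal sketch of building such arrays; only the (i,j)/(j,i) duplication and the one-group-per-link convention come from the answer above, while the exact field layout of the .pt files is an assumption.

```python
# Sketch only (not AlphaLink's actual loader): every crosslink (i, j) is
# stored twice, once per direction, and each link gets its own group id.
import numpy as np

links = [(12, 87), (45, 210)]  # hypothetical 0-based residue pairs

xl_rows, group_rows = [], []
for group_id, (i, j) in enumerate(links, start=1):
    xl_rows.append((i, j))                   # residueFrom -> residueTo
    xl_rows.append((j, i))                   # symmetric duplicate
    group_rows.extend([group_id, group_id])  # one group per crosslink

xl_array = np.array(xl_rows)           # shape (2 * num_links, 2)
grouping_array = np.array(group_rows)  # shape (2 * num_links,)
```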
Thank you for your response. Based on your explanation, am I correct in understanding that grouping_array doesn't serve any purpose in the model?
I noticed you mentioned the example T1064 in another issue (issues/13). I ran the data as per your instructions, but the resulting pLDDT score doesn't match the 82.371 displayed in the link. Additionally, it differs significantly from the TM-score in the attached model.cif file. Could you please help me identify the issue?
It doesn't serve any purpose, but unfortunately it can still affect the results because it injects randomness.
What do you mean, it didn't utilize the MSA? For this example, there will not be any random subsampling of the MSAs, since the MSA size is below the threshold, but by default there is always MSA masking. This also applies to T1064: you'd need to remove every source of randomness, including MSA masking. We removed all non-determinism to make the results comparable to AlphaFold.
I noticed in predict_with_crosslinks.py that if a PKL file is provided, no MSA search is performed, since the PKL file already contains the MSA information. Is that correct?
So, when you refer to random subsampling of the MSAs, what does that mean? Do I need to input the neff parameter? How do I remove MSA masking? Can you give an example?
By the way, you trained on model_5_ptm. When comparing with AlphaFold, did you use the results from model_5? Which checkpoint did you use, the one from AlphaFold or OpenFold? Thank you very much for your patient responses. Looking forward to your reply.
Yes, no MSA search will be performed if you supply a pickle file. The pickle already contains all the features, including the MSA. This way the MSA stays fixed (at a given Neff), which ensures comparability with AlphaFold, since we used exactly the same input features. The only difference is the crosslinks (+ additional training).
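For illustration, such a pickle can be inspected directly; that the MSA sits under an "msa" key follows the usual AlphaFold/OpenFold feature convention and is an assumption about this particular file.

```python
# Minimal sketch: peeking into a precomputed feature pickle. All features
# (including the MSA) are already stored here, so no MSA search is run.
import pickle

with open("test_set/CDK/features/CDK_neff10.pkl", "rb") as f:
    features = pickle.load(f)

print(sorted(features.keys()))  # feature names, e.g. "msa", "deletion_matrix"
print(features["msa"].shape)    # (num_sequences, seq_len) - fixed Neff input
```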
To limit memory consumption, AlphaFold limits the size of the input MSAs. How many sequences are used is defined in the model configuration, see https://github.com/lhatsk/AlphaLink/blob/main/openfold/config_crosslinks.py#L197. If the MSA is bigger than max_msa_clusters, it is subsampled to max_msa_clusters sequences and the rest is aggregated in the extraMSA stack.
To remove the MSA masking, set the masking fraction to 0.0.
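As a hedged sketch (assuming config_crosslinks.py mirrors OpenFold's standard config.py API and key names, which is not verified here):

```python
# Sketch only: the import and key path are assumptions based on OpenFold's
# usual config layout; adjust to the actual structure of
# openfold/config_crosslinks.py.
from openfold.config_crosslinks import model_config

config = model_config("model_5_ptm")
# Default is 0.15; 0.0 disables MSA masking and removes this source of
# non-determinism from the prediction.
config.data.predict.masked_msa_replace_fraction = 0.0
```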
We used the AlphaFold 2.0 weights for model_5_ptm, both as a starting point for fine-tuning and for the AlphaFold predictions. The predictions were made in OpenFold with the AlphaFold weights, which produces the same (or reasonably close) results as AlphaFold.
Sorry, OpenFold cannot accept a feature file as input, right? So how do you ensure that you are using exactly the same input? When creating feature files, you mentioned using different 'neff' values. How is this variable controlled when comparing with AlphaFold2?
Thank you very much for your prompt reply. After setting this config to 0.0, the TM-score from the AlphaLink inference increased from 0.365 to 0.8675. Could you please explain why this has such a significant impact?
No, not by default, but it's easy to change. I just removed crosslinks from AlphaLink and used the original AlphaFold weights with the same inputs.
By using the same features, which include the MSA with a fixed Neff.
The MSA masking affects the Neff. It randomly removes 15% of the information in the MSA. The effect is obviously much stronger for MSAs that contain little information to begin with (low Neff). Depending on what is masked and how well the network is able to reconstruct it, you may end up with a lower or higher effective Neff than before. For example, the masking could hide parts that help with noise rejection, or remove information that is highly complementary to the crosslinks, resulting in worse results and more variance. Here, the masking was just unlucky; it could also have helped.
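To illustrate the randomness, here is a minimal sketch of such a masking step; the mask token id and the uniform 15% replacement are simplifying assumptions, the real pipeline uses a mixture of replacement strategies.

```python
# Sketch of random MSA masking: each run with a different seed hides a
# different 15% of the alignment, which changes the effective information
# content (and hence the result) from run to run.
import numpy as np

def mask_msa(msa: np.ndarray, replace_fraction: float = 0.15, seed=None) -> np.ndarray:
    rng = np.random.default_rng(seed)
    mask = rng.random(msa.shape) < replace_fraction
    masked = msa.copy()
    masked[mask] = 21  # hypothetical mask token id
    return masked
```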
No, I would keep it on for normal usage.
Yes, you should set it to 0.0 to keep the comparison fair for both methods.
Thank you. I would like to know how the number of effective sequences (Neff) is defined. Did you set neff=10 when running AlphaLink and AlphaFold2 on the dataset? Was this done to reflect the impact of the crosslink data? I ask because when I ran the MSA with neff=10 on the example 6LKI_B (ma-rap-alink-0001), the results differ from using the feature inputs (skipping the MSA search): the TM-scores against the ground truth are 0.8087 and 0.9012, respectively.
The Neff is defined in the "MSA subsampling" section. We subsampled the MSAs to a given Neff to simulate challenging targets and show the impact of crosslinking MS data.
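For reference, a common Neff definition weights each sequence by the inverse of its cluster size at a fixed identity threshold; the 80% threshold here is an assumption, AlphaLink's own implementation lives in openfold/data/msa_subsampling.py (get_eff).

```python
# Hedged sketch of a standard Neff computation: Neff is the sum of
# per-sequence weights 1 / (number of sequences within the identity
# threshold of that sequence).
import numpy as np

def compute_neff(msa: np.ndarray, identity_threshold: float = 0.8) -> float:
    """msa: integer-encoded alignment, shape (num_seqs, seq_len)."""
    weights = np.zeros(msa.shape[0])
    for i in range(msa.shape[0]):
        identity = (msa == msa[i]).mean(axis=1)  # identity of every sequence to sequence i
        weights[i] = 1.0 / (identity >= identity_threshold).sum()
    return float(weights.sum())
```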
6LKI is part of the low-Neff CAMEO targets; they are already challenging with low Neffs (at most 25, for 6LKI it's 15), therefore we didn't do any MSA subsampling. Your subsampling will further reduce the Neff and make the target harder, which likely results in a lower TM-score.
Hello, I noticed in the data_module_xl.py file, specifically at line 24, that you import the MSA subsampling functions with from openfold.data.msa_subsampling import get_eff, subsample_msa, subsample_msa_sequentially, subsample_msa_random. However, looking further into the file, I didn't find any usage of these functions. Could you please explain why they are imported but not used?
data_module_xl.py is not used. It's some legacy stuff that I didn't clean up.
When I was reproducing the results for CDK in the test_set, you provided the crosslink input data in both CSV and PT file formats. I noticed that in the PT file, the xl_array contains duplicated entries for residueTo and residueFrom. Can you explain why these entries are duplicated in reverse order?
Additionally, could you clarify the information represented by the grouping_array?
Furthermore, the results I inferred from these inputs do not match the PDB file located at test_set/CDK/predictions/CDK_neff10_1h01_xl_model_5_ptm.pdb, specifically in terms of RMSD and TM-score.
This is my call script:
python predict_with_crosslinks.py test_set/CDK/fasta/CDK.fasta test_set/CDK/crosslinks/1h01_xl.pt --features test_set/CDK/features/CDK_neff10.pkl --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt --uniref90_database_path /xxx/uniref90.fasta --mgnify_database_path /xxx/mgnify/mgy_clusters_2022_05.fa --pdb70_database_path /xxx/pdb70 --uniclust30_database_path /xxx/uniref30/