hands-on protocol for contacts_to_distograms #9
Comments
Hi Yan, To generate distograms with the contacts_to_distograms.py script you need two inputs.
The script currently only allows one cutoff. If you want to mix multiple cutoffs (which is possible), a quick hack would be to run this script with different cutoffs and concatenate the output files. The output is another CSV with the distograms per contact pair that you can use for prediction. Run like this: Let me know how it's working out with your data, I'm curious! Kolja |
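The "run once per cutoff, then concatenate" hack mentioned above could be sketched as follows. This is only an illustration: the function name and file names are hypothetical, and it assumes the output CSVs have no header row, which the thread does not confirm.

```python
from pathlib import Path


def concatenate_distogram_csvs(input_paths, output_path):
    """Append the non-blank rows of several CSV files into one file, in order.

    Assumption: the per-cutoff output files carry no header row, so the
    rows can simply be stacked.
    """
    rows_written = 0
    with open(output_path, "w", newline="") as out:
        for path in input_paths:
            for line in Path(path).read_text().splitlines():
                if line.strip():  # skip blank lines
                    out.write(line + "\n")
                    rows_written += 1
    return rows_written
```

For example, `concatenate_distogram_csvs(["distograms_10A.csv", "distograms_25A.csv"], "distograms_all.csv")` would merge two runs made with different cutoffs (file names are placeholders).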
Hi Kolja, I just tried the contacts_to_distograms.py script. It showed the following error: 22 def get_uniform(cutoff, fdr): Do you know how to fix it? Thank you. |
Sorry! It's fixed in the new version. |
Hi Kolja, |
Hi Kolja, Thank you so much! |
Specifically, it gave the error: cannot parse restraint type in line Thank you. |
hello,
it should give you the format of restraints.csv (I'll fix the way it's displayed and include an example and clarification in the readme) The file should be made up of lines like (for example)
These lines specify a restraint from residue 125 to residue 51 with a mean distance of 10 Å, a standard deviation of 5 Å, and a normal distribution. |
Yes, I did run preprocessing_distributions.py --help. Based on what it showed, I changed them to ('From', 'To', 'mu', 'sigma', 'type') in your original script, but it still does not work and shows: cannot parse restraint type in line |
could you attach your restraint.csv file here? |
(there should be no header in the file, just the comma-separated list of restraints with the lines like my comment above) |
It works now without headers. Thank you so much! |
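To summarize the exchange above, a header-less restraints.csv could be written with a small helper like the one below. This is a sketch under assumptions: the field order follows the ('From', 'To', 'mu', 'sigma', 'type') tuple quoted in the thread, the literal type string "normal" is a guess based on the description "a normal distribution", and both function names are hypothetical.

```python
def format_restraint(res_from, res_to, mu, sigma, dist_type="normal"):
    """Return one comma-separated restraint line (no header, no spaces).

    Assumption: "normal" is the spelling the parser expects for a
    normal distribution; check the script's --help output to confirm.
    """
    return f"{res_from},{res_to},{mu},{sigma},{dist_type}"


def write_restraints(restraints, path):
    """Write an iterable of (from, to, mu, sigma, type) tuples, one per line."""
    with open(path, "w") as handle:
        for restraint in restraints:
            handle.write(format_restraint(*restraint) + "\n")
```

Calling `format_restraint(125, 51, 10, 5, "normal")` gives the line `125,51,10,5,normal`, matching the restraint described above (residue 125 to residue 51, mean 10 Å, standard deviation 5 Å).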
Hi, I broke the command down in the following way: python predict_with_crosslinks.py I left questions on each line of the command. Thank you so much! |
Hi Yan, --distograms is only a flag, it doesn't take an argument. Kolja |
Thank you so much Kolja. |
Hi, I got an error again, as follows: RuntimeError: Error(s) in loading state_dict for AlphaFold: How can I easily figure it out? |
Hi, but it showed and the prediction gave only one structure model. Does that make sense? How can I figure it out? Cheers, |
In the end, you get two models, one relaxed and one unrelaxed. It fails in your case in the relax stage. I haven't run into this issue before. It's probably due to a lot of clashes. You should have a look at the unrelaxed model to see if it makes sense. |
Thank you so much Kolja. Is there a simple way to run multiple different sequences and restraints data at the same time? Thanks, |
No, not in parallel. Sequentially, the easiest would be to wrap everything in a for-loop in bash. |
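The sequential loop suggested above was meant for bash; the same idea in Python might look like this sketch. The argument order (`--checkpoint_path`, then the FASTA file, then the restraints file, then any further arguments) is an assumption copied from a command a user quotes later in this thread, not a confirmed interface, and `build_command` and `run_all` are hypothetical names. By default the function only builds the commands and does not execute anything.

```python
import subprocess


def build_command(fasta, restraints, checkpoint, extra_args=()):
    """Assemble one predict_with_crosslinks.py call as an argument list.

    Assumption: positional order is FASTA first, restraints second,
    followed by whatever database paths or flags your setup needs.
    """
    return [
        "python", "predict_with_crosslinks.py",
        "--checkpoint_path", checkpoint,
        fasta, restraints, *extra_args,
    ]


def run_all(jobs, checkpoint, extra_args=(), dry_run=True):
    """Run (fasta, restraints) pairs one after another; return the commands."""
    commands = []
    for fasta, restraints in jobs:
        command = build_command(fasta, restraints, checkpoint, extra_args)
        commands.append(command)
        if not dry_run:
            # check=True stops the loop at the first failing prediction
            subprocess.run(command, check=True)
    return commands
```

Inspecting the returned list with `dry_run=True` first makes it easy to confirm each command before committing to long runs.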
Okay. thanks. |
Hello, This thread helped me to understand the scripts. Is there a sample input for testing:
Currently I am working out some issues to build the dockerfile (which is failing on my linux instance, but I should be able to work through). Afterwards, I would like to test with existing input. It appears there is testing data (test_set), but this may be already processed to some degree? Thanks! |
Hello, Just to clarify: AlphaLink is run with a protein sequence (.fasta format) and a set of distance restraints (usually crosslinking MS data). The restraints can be represented in 2 ways: (1) the default representation, which is a space-separated file with ResidueFrom ResidueTo FDR as shown in the README. In this case, predict_with_crosslinks.py is run with no additional flags. (2) The same information can be given as a PyTorch dictionary with NumPy arrays. In addition to these 2 files, you may or may not want to use OpenFold to generate the MSA features. For example, you can run the MSA stage of AlphaFold2 and then point to the resulting MSA directories with the --use_precomputed_alignments flag. In the test directory, you will find the CDK case: there you have PyTorch dictionaries, sequences and precomputed MSA files to run predict_with_crosslinks.py. If you want, we can also add the space-separated restraint file instead (it looks very much like the one in the README). Hope this helps! |
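As a quick sanity check for the default representation described above, a space-separated file with ResidueFrom ResidueTo FDR per line could be validated with a small parser like this one. It is a sketch only: `parse_restraints` is a hypothetical helper, not part of AlphaLink, and it assumes integer residue numbers and a floating-point FDR.

```python
def parse_restraints(text):
    """Parse whitespace-separated 'ResidueFrom ResidueTo FDR' lines.

    Returns a list of (int, int, float) tuples and raises ValueError
    on any line that does not have exactly three fields.
    """
    restraints = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:  # ignore blank lines
            continue
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        restraints.append((int(fields[0]), int(fields[1]), float(fields[2])))
    return restraints
```

Running it over your own file before prediction catches stray headers or comma-separated lines early.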
Thanks it helps. It would be helpful to have the space separated restraint file for the CDK example (this would be the data informed by crosslinking MS data, correct?). And if I understand, then I do not need the preprocessing_distributions and may follow the 'default representation'? Having those data in text form rather than pyTorch dictionary would be helpful as I am not (yet?) very familiar with the representation and our own data would be closer to the text form. |
I uploaded the corresponding CSV files. Note that CDK was a theoretical experiment (proof-of-concept), the links correspond to simulated data. The real data for the membrane set is also in the git (still in PyTorch format though). Yes, for the CDK example you can use the CSV directly as an input to the photo-AA network. For your data, you could do the same, if your data is close to 10A. If you need a different cutoff, you would need to use the distogram network. Here you could use the contacts_to_distograms.py to preprocess your CSV data (same format as the input to the network ("default representation")). |
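For intuition about what turning a contact into a distogram can mean, here is one plausible scheme. It is an assumption inspired by the function name `get_uniform(cutoff, fdr)` that appears in the error message earlier in this thread; the thread does not show the actual implementation, and the bin layout below is invented for illustration. The idea: spread probability mass 1 - FDR uniformly over the distance bins up to the cutoff, and the remaining FDR uniformly over the bins beyond it.

```python
import numpy as np


def uniform_distogram(cutoff, fdr, bin_upper_edges):
    """Build a probability vector over distance bins for one contact.

    Assumption: bins at or below `cutoff` share mass (1 - fdr) equally,
    and bins above it share mass `fdr` equally.
    """
    edges = np.asarray(bin_upper_edges, dtype=float)
    inside = edges <= cutoff
    n_inside = int(inside.sum())
    n_outside = int((~inside).sum())
    if n_inside == 0 or n_outside == 0:
        raise ValueError("cutoff must fall strictly within the bin range")
    distogram = np.empty(len(edges))
    distogram[inside] = (1.0 - fdr) / n_inside
    distogram[~inside] = fdr / n_outside
    return distogram


# Hypothetical layout: 20 bins with upper edges at 1, 2, ..., 20 Angstrom
example = uniform_distogram(cutoff=10.0, fdr=0.05, bin_upper_edges=np.arange(1.0, 21.0))
```

With this layout, the first ten bins of `example` together carry 95% of the probability, and the whole vector sums to 1.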
Hi, Could you share some detailed commands to play with the --neff flag? Should we run it with the predict_with_crosslinks.py script? predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08 --neff Thanks. |
Neff is a number: the number of effective sequences in the MSA, as described in the AlphaLink paper, the original AlphaFold2 paper (Fig. 5) and previous publications. It acts at the MSA level by subsampling the MSA to a given number of effective sequences. The fewer effective sequences, the weaker the MSA evidence will be. Thus, for Neff=10, in your command:
|
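For intuition about what "number of effective sequences" measures, one common definition weights each sequence by the inverse of the number of sequences similar to it and sums the weights. The sketch below follows that definition under stated assumptions: the 0.8 identity threshold is a conventional choice rather than a value taken from this thread, gaps are compared like any other character, and this is not necessarily how AlphaLink computes Neff or subsamples the MSA.

```python
def sequence_identity(a, b):
    """Fraction of aligned positions at which two equal-length rows agree."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("aligned sequences must be non-empty and equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def neff(msa, threshold=0.8):
    """Sum over sequences of 1 / (number of sequences within the threshold).

    Each sequence counts itself as a neighbour, so the divisor is
    never zero. Redundant sequences share one unit of weight.
    """
    total = 0.0
    for seq in msa:
        neighbours = sum(
            sequence_identity(seq, other) >= threshold for other in msa
        )
        total += 1.0 / neighbours
    return total
```

Two identical rows plus one unrelated row give a Neff of 2 rather than 3, which is why a deep but redundant MSA can still provide weak evidence.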
Hi, File "/opt/AlphaLink/predict_with_crosslinks.py", line 569, in Any solution? Thanks, |
I pushed a fix. Thanks for reporting the issue! |
Hi, Thanks, |
AlphaLink was trained on model_5 and doesn't support templates. I updated the README with the db flags since they were missing. |
Hi, I am quite intrigued by the approach you used for processing input data when training on distograms. Specifically, for static structures, the distance for a selected pair is a fixed value. Could you kindly explain how you transform this value into a distribution for training purposes? Or perhaps, I might have misunderstood the model's methodology. I would greatly appreciate your insights on this matter. Thank you for your time! |
Hi Lin, |
Hi! |
"finetuning_model_5_ptm_CACA_10A.pt" doesn't use a distogram as the internal representation of the crosslinking data. Here, the input data was simply a contact map. |
Hi,
Could you please share a hands-on protocol on how to generate distograms from contact information? As a beginner, I find it hard to use the script (contacts_to_distograms.py) to build the distograms.
Thank you so much!
Yan