
hands-on protocol for contacts_to_distograms #9

Open
Yan-Yan-2020 opened this issue May 5, 2023 · 34 comments

@Yan-Yan-2020

Hi,

Could you please share a hands-on protocol for generating a distogram from contact information? As a beginner, I find it hard to use the script (contacts_to_distograms.py) to build the distogram.

Thank you so much!
Yan

@lhatsk
Owner

lhatsk commented May 6, 2023

Hi Yan,

To generate distograms with the contacts_to_distograms.py script you need two inputs.

  1. a CSV with your contacts where each row has the following format: ResiduePosition1 ResiduePosition2 FDR
    ResiduePositions are 1-indexed, i.e., starting from 1.
    FDR corresponds to the uncertainty for this particular contact, essentially the level of noise you expect. It's a number between 0 and 1; e.g., 0.05 would mean there is a 5% chance this contact is noisy/wrong. In principle, you can use an FDR of 0, but this would be more akin to forcing constraints.
    Example:
    10 15 0.05
    30 50 0.2
  2. the cutoff for your contact information in Angstrom (CA-CA).

The script currently only allows one cutoff. If you want to mix multiple cutoffs (which is possible), a quick hack would be to run this script with different cutoffs and concatenate the output files.

The output is another CSV with the distograms per contact pair that you can use for prediction.

Run like this:
python contacts_to_distograms.py --csv my_contact_file --cutoff 10 --output my_distogram_output
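
If you do need to mix cutoffs, a rough sketch of the concatenation hack mentioned above (the file names are placeholders, and this assumes the output CSVs can simply be stacked):

python contacts_to_distograms.py --csv contacts_10A.csv --cutoff 10 --output distograms_10A.csv
python contacts_to_distograms.py --csv contacts_20A.csv --cutoff 20 --output distograms_20A.csv
cat distograms_10A.csv distograms_20A.csv > distograms_combined.csv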

Let me know how it's working out with your data, I'm curious!

Kolja

@Yan-Yan-2020
Author

Hi Kolja,

I just tried the contacts_to_distograms.py script. It showed the following error:

22 def get_uniform(cutoff, fdr):
23 d = np.ones(128)
---> 24 maximum_bin = np.argmax(distogram_bins > cutoff)
25 d[:maximum_bin] /= np.sum(d[:maximum_bin])
26 d[:maximum_bin] *= 1 - fdr
TypeError: '>' not supported between instances of 'numpy.ndarray' and 'str'

Do you know how to fix it?

Thank you.
Yan

@lhatsk
Owner

lhatsk commented May 7, 2023

Sorry! It's fixed in the new version.
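
(For reference, the traceback above means the cutoff reached the comparison as a string rather than a number; the kind of fix, assuming argparse is used to parse the flags, is a sketch like this:)

```python
import argparse

parser = argparse.ArgumentParser()
# Parse the cutoff as a float so that `distogram_bins > cutoff` compares numbers, not a string.
parser.add_argument("--cutoff", type=float, required=True)
```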

@Yan-Yan-2020
Author

Hi Kolja,
The new script works now.
Thank you so much!
Yan

@Yan-Yan-2020
Author

Hi Kolja,
I have a new problem with the preprocessing_distributions.py script. What is the format of the input file "restraints.csv"? Could you please give an example of the CSV file?

Thank you so much!
Yan

@Yan-Yan-2020
Author

Specifically, it gave the error:

cannot parse restraint type in line
('From', 'To', 'mu', 'sigma', 'type')

Thank you.
Yan

@grandrea
Collaborator

grandrea commented May 9, 2023

Hello,
if you run

preprocessing_distributions.py --help

it should give you the format of restraints.csv (I'll fix the way it's displayed and include an example and clarification in the README).

The file should be made up of lines like (for example)

125,51,10.0,5.0,normal
16,10,10.0,5.0,log-normal

These lines specify

- a restraint from residue 125 to residue 51 with a mean distance of 10 Angstrom, a standard deviation of 5 Angstrom, and a normal distribution
- a restraint from residue 16 to residue 10 with a mean distance of 10 Angstrom, a standard deviation of 5 Angstrom, and a log-normal distribution
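
If it helps, a minimal Python sketch that writes a file in exactly this comma-separated format (the values are just the two example restraints above):

```python
import csv

# (residueFrom, residueTo, mean_dist, stdev, distribution_type)
restraints = [
    (125, 51, 10.0, 5.0, "normal"),
    (16, 10, 10.0, 5.0, "log-normal"),
]

with open("restraints.csv", "w", newline="") as fh:
    csv.writer(fh).writerows(restraints)  # plain comma-separated lines
```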

@Yan-Yan-2020
Author

Yes, I did run preprocessing_distributions.py --help. It showed
"residueFrom, residueTo, mean_dist, stdev, distribution_type"

and I changed them to ('From', 'To', 'mu', 'sigma', 'type') as in your original script, but it still does not work and shows: cannot parse restraint type in line
('From', 'To', 'mu', 'sigma', 'type')

@grandrea
Collaborator

Could you attach your restraints.csv file here?

@grandrea
Collaborator

(there should be no header in the file, just the comma-separated list of restraints, with lines like in my comment above)

@Yan-Yan-2020
Author

It works now without headers. Thank you so much!

@Yan-Yan-2020
Author

Hi,
Again, now I have questions about the predict_with_crosslinks.py script. In the README protocol, the command shows
'python predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08'

I dissected it in the following way:

python predict_with_crosslinks.py
--distograms restraints_distribution.csv (# CSV file generated by the preprocessing_distributions.py script)
--checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt
? 7K3N_A.fasta (# I guess it's the input file? Then what script should we run?)
? restraints.csv (# why is there a restraints.csv file here? What .py script should we run?)
--uniref90_database_path uniref90.fasta
--mgnify_database_path mgy_clusters.fa
--pdb70_database_path pdb70/pdb70
? pdb_mmcif/mmcif_files (# what script should we run? Like the OpenFold command format?)
--uniclust30_database_path uniclust30_2018_08/uniclust30_2018_08

I left questions on some of the lines of the command above.

Thank you so much!
Yan

@lhatsk
Owner

lhatsk commented May 11, 2023

Hi Yan,

--distograms is only a flag, it doesn't take an argument.
restraints.csv then corresponds to your restraints_distribution.csv.
7K3N_A.fasta -- you need to have some sort of FASTA input file with your sequence.
pdb_mmcif/mmcif_files -- yes, follow the OpenFold instructions, also for setting up the databases like uniref90, mgnify, and pdb70.
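
Putting that together with your file names, the call would look roughly like this (the database and checkpoint paths are the placeholders from the README):

python predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints_distribution.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08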

Kolja

@Yan-Yan-2020
Author

Thank you so much Kolja.

@Yan-Yan-2020
Author

Hi, again, I got the following error:

"RuntimeError: Error(s) in loading state_dict for AlphaFold:
size mismatch for xl_embedder.linear.weight: copying a param with shape torch.Size([128, 1]) from checkpoint, the shape in current model is torch.Size([128, 128])"

Is there an easy way to fix it?
Thanks
Yan

@Yan-Yan-2020
Author

Hi,
Never mind my last question. I made it work with the distogram checkpoint (distogram.pt).

But then it showed
' File "/opt/AlphaLink/openfold/np/relax/amber_minimize.py", line 521, in run_pipeline
max_attempts=max_attempts,
File "/opt/AlphaLink/openfold/np/relax/amber_minimize.py", line 459, in _run_one_iteration
raise ValueError(f"Minimization failed after {max_attempts} attempts.")
ValueError: Minimization failed after 100 attempts."

And the prediction gave only one structure model. Does that make sense? How can I fix it?

Cheers,
Yan

@lhatsk
Owner

lhatsk commented May 12, 2023

In the end, you get two models, one relaxed and one unrelaxed. It fails in your case in the relax stage. I haven't run into this issue before. It's probably due to a lot of clashes. You should have a look at the unrelaxed model to see if it makes sense.

@Yan-Yan-2020
Author

Thank you so much Kolja.

Is there a simple way to run multiple different sequences and restraints data at the same time?

Thanks,
Yan

@lhatsk
Owner

lhatsk commented May 13, 2023

No, not in parallel. Sequentially, the easiest would be to wrap everything in a for-loop in bash.
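
A rough sketch of such a loop (the target names and per-target file layout are placeholders):

for name in target1 target2 target3; do
    python predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt ${name}.fasta ${name}_restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08
done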

@Yan-Yan-2020
Author

Okay, thanks.

@roivant-matts

Hello, this thread helped me understand the scripts. Is there a sample input for testing?

Currently I am working out some issues building the Dockerfile (which is failing on my Linux instance, but I should be able to work through it). Afterwards, I would like to test with existing input. It appears there is test data (test_set), but this may already be processed to some degree?

Thanks!

@grandrea
Collaborator

Hello,

Just to clarify: AlphaLink is run with a protein sequence (.fasta format) and a set of distance restraints (usually crosslinking MS data). The restraints can be represented in two ways:

- the default representation, a space-separated file with ResidueFrom ResidueTo FDR, as shown in the README. In this case, predict_with_crosslinks.py is run with no additional flags. The same information can also be given as a PyTorch dictionary of NumPy arrays.
- as distance distributions. In this case, you may either provide your own distance distributions or generate them for your restraints using preprocessing_distributions.py. You then use the output of this script as the input for predict_with_crosslinks.py, which you run with the --distograms flag.

In addition to these two files, you may or may not want to use OpenFold to generate the MSA features. For example, you can run the MSA stage of AlphaFold2 and then reuse the resulting MSA directories by pointing to them with the --use_precomputed_alignments flag.
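
For example, roughly, appending the flag to the README command (the alignment directory name is a placeholder and should contain the precomputed alignments for your target):

python predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08 --use_precomputed_alignments my_alignments/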

In the test directory, you will find the CDK case: there you have PyTorch dictionaries, sequences, and precomputed MSA files to run predict_with_crosslinks.py. If you want, we can also add the space-separated restraint file instead (it looks very much like the one in the README).

Hope this helps!

@roivant-matts

Thanks, that helps. It would be helpful to have the space-separated restraint file for the CDK example (this would be the data informed by crosslinking MS data, correct?). And if I understand correctly, I then do not need preprocessing_distributions.py and can follow the 'default representation'? Having those data in text form rather than as a PyTorch dictionary would be helpful, as I am not (yet?) very familiar with that representation and our own data would be closer to the text form.

@lhatsk
Owner

lhatsk commented May 23, 2023

I uploaded the corresponding CSV files. Note that CDK was a theoretical experiment (proof-of-concept), the links correspond to simulated data. The real data for the membrane set is also in the git (still in PyTorch format though).

Yes, for the CDK example you can use the CSV directly as an input to the photo-AA network.

For your data, you could do the same if your cutoff is close to 10 A. If you need a different cutoff, you would need to use the distogram network. Here you could use contacts_to_distograms.py to preprocess your CSV data (same format as the input to the network, i.e., the "default representation").
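
For example, roughly (the file names and the 20 A cutoff are placeholders):

python contacts_to_distograms.py --csv my_contacts.csv --cutoff 20 --output my_distograms.csv

and then pass my_distograms.csv to predict_with_crosslinks.py as the restraints file, together with the --distograms flag and the corresponding distogram checkpoint.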

@Yan-Yan-2020
Author

Hi,

Could you share some detailed commands for playing with the --neff flag? Should we run it with the predict_with_crosslinks.py script?

predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08 --neff

Thanks.
Yan

@grandrea
Collaborator

Neff is a number - the number of effective sequences in the MSA, as described in the AlphaLink paper, the original AlphaFold2 paper (Fig. 5), and previous publications. It acts at the MSA level by subsampling the MSA to a given number of effective sequences.

The fewer effective sequences, the weaker the MSA evidence will be. Thus, for Neff=10, in your command:

predict_with_crosslinks.py --distograms --checkpoint_path resources/AlphaLink_params/finetuning_model_5_ptm_CACA_10A.pt 7K3N_A.fasta restraints.csv uniref90.fasta mgy_clusters.fa pdb70/pdb70 pdb_mmcif/mmcif_files uniclust30_2018_08/uniclust30_2018_08 --neff 10
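
For intuition, a minimal sketch of what "effective sequences" means (the 80% identity threshold is an assumption for illustration, not necessarily what AlphaLink uses internally):

```python
import numpy as np

def neff(msa, threshold=0.8):
    # msa: (N, L) integer array encoding the aligned sequences.
    # Each sequence is down-weighted by the number of sequences similar to it,
    # and Neff is the sum of those weights.
    identity = (msa[:, None, :] == msa[None, :, :]).mean(-1)  # pairwise identity fractions
    cluster_sizes = (identity >= threshold).sum(axis=1)       # sequences similar to each sequence
    return float((1.0 / cluster_sizes).sum())
```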

@Yan-Yan-2020
Author

Hi,
I tried the command and it gave the error:

File "/opt/AlphaLink/predict_with_crosslinks.py", line 569, in
main(args)
File "/opt/AlphaLink/predict_with_crosslinks.py", line 421, in main
feature_dict['deletion_matrix'] = feature_dict['deletion_matrix'][indices]
KeyError: 'deletion_matrix'

Any solution?

Thanks,

@lhatsk
Owner

lhatsk commented May 26, 2023

I pushed a fix. Thanks for reporting the issue!

@Yan-Yan-2020
Author

Hi,
Is there a positional argument 'pdb_mmcif/mmcif_files' in the predict_with_crosslinks.py script? It seems I cannot add template_mmcif_dir.

Thanks,
Yan

@lhatsk
Owner

lhatsk commented Jun 6, 2023

AlphaLink was trained on model_5 and doesn't support templates. I updated the README with the db flags since they were missing.

@linjyshanghaitech

Hi,

I am quite intrigued by the approach you used for processing input data when training on distograms. Specifically, for static structures, the distance for a selected pair is a fixed value. Could you kindly explain how you transform this value into a distribution for training purposes? Or perhaps, I might have misunderstood the model's methodology. I would greatly appreciate your insights on this matter.

Thank you for your time!
Lin

@lhatsk
Owner

lhatsk commented May 15, 2024

Hi Lin,
In different ways. The simplest is just a "contact" distogram, i.e., a uniform distribution. Let's say we want to simulate a true restraint with an upper bound of 10 A. We choose any restraint that is < 10 A in the crystal structure. The distogram then has uniform probability over the bins up to 10 A. We sometimes replaced the uniform distribution with a subsampled "real" distance distribution over X restraints with < 10 A, or a distance distribution based on simulated crosslinking data.
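
For illustration, a sketch of such a "contact" distogram, mirroring the get_uniform snippet quoted earlier in this thread (the bin layout and the handling of the remaining FDR mass are assumptions, not the exact AlphaLink code):

```python
import numpy as np

distogram_bins = np.linspace(2.3125, 42.0, 128)  # assumed bin layout (128 bins)

def get_uniform(cutoff, fdr):
    d = np.ones(128)
    maximum_bin = np.argmax(distogram_bins > cutoff)  # first bin beyond the cutoff
    d[:maximum_bin] /= np.sum(d[:maximum_bin])        # uniform probability below the cutoff...
    d[:maximum_bin] *= 1 - fdr                        # ...scaled to carry 1 - FDR of the mass
    d[maximum_bin:] /= np.sum(d[maximum_bin:])        # the FDR (noise) mass is spread
    d[maximum_bin:] *= fdr                            # over the bins beyond the cutoff
    return d

contact_distogram = get_uniform(cutoff=10.0, fdr=0.05)
```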

@linjyshanghaitech

Hi!
Thank you for your reply. I understand the construction of the "contact" distogram, but I noticed that you have trained two sets of parameters separately. The first set, "finetuning_model_5_ptm_CACA_10A.pt", should be the one you mentioned that was trained with a uniform distribution up to 10 A. I am curious how the "finetuning_model_5_ptm_distogram.pt" set was trained; it seems to involve more complex procedures.
Thank you for your time!
Lin

@lhatsk
Owner

lhatsk commented May 17, 2024

"finetuning_model_5_ptm_CACA_10A.pt" doesn't use a distogram as the internal representation of the crosslinking data. Here, the input data was simply a contact map.
