Create basic eval script for ProteinGym benchmark #4
Comments
What do we want as a 'basic' eval script? Should I try to make a script that downloads and preprocesses the CSVs and then uses ESM-2 as a placeholder, taking its final hidden states and feeding them to one row of logits? If the script runs, we can later use our models instead and have a comparison with ESM already built in :)
Thanks @Muedi! What you suggest is great for the semi-supervised property prediction setting. For the pure zero-shot setting with ESM models, we could use the ESM-1v masked-marginal approach as described here: https://github.com/facebookresearch/esm/blob/main/examples/variant-prediction/predict.py (that's what we've been using when reporting the corresponding model performance in ProteinGym). It works nearly out of the box for the single-sequence ESM models (e.g., ESM1b, ESM1v, ESM2), except for sequences that are longer than the context length (i.e., longer than 1023 AAs, if we count the BOS token).
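As a rough illustration of the masked-marginal idea mentioned above: each substitution is scored as the log-probability of the mutant residue minus that of the wild-type residue at the masked position, and the per-substitution scores are summed. The sketch below is not the linked ESM script. It assumes the per-position log-probabilities have already been computed by some model with each position masked in turn, and it assumes mutants are written like `A1C:D3E` (wild type, 1-based position, mutant, colon-separated). The function name and the toy alphabet are hypothetical.

```python
import math


def masked_marginal_score(mutant, wt_seq, log_probs, alphabet, offset=1):
    """Sum log p(mutant residue) - log p(wild-type residue) over substitutions.

    log_probs[i][k] is assumed to hold the log-probability of alphabet[k]
    at sequence index i, predicted with position i masked.
    """
    score = 0.0
    for sub in mutant.split(":"):
        wt, pos, mt = sub[0], int(sub[1:-1]) - offset, sub[-1]
        if not 0 <= pos < len(wt_seq) or wt_seq[pos] != wt:
            raise ValueError(f"{sub} does not match the reference sequence")
        score += log_probs[pos][alphabet.index(mt)] - log_probs[pos][alphabet.index(wt)]
    return score


# Toy example with a 4-letter alphabet and made-up probabilities.
alphabet = "ACDE"
probs = [[0.5, 0.25, 0.125, 0.125], [0.25] * 4, [0.25] * 4]
log_probs = [[math.log(p) for p in row] for row in probs]
score = masked_marginal_score("A1C", "ACD", log_probs, alphabet)
```

A negative score means the model considers the mutant residue less likely than the wild type at that position. Sequences longer than the model's context length would need separate handling, which this sketch does not attempt.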
Is the code that's adapted to ProteinGym available too? Then I'll gladly take that and put the script together, perhaps with a true/false flag for whether we want zero-shot or fine-tuned results?
Sounds good regarding the binary flag. Our code is not yet open source, but the ESM script should be good enough for testing things out (95% of assay sequences are below the 1023 threshold). Note that this script will only be relevant for masked language models, not the autoregressive (AR) models we intend to train. So we will need another script parameter to choose the model type (ESM vs. AR), and the code will then use the relevant zero-shot scoring function/utils.
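One possible shape for the two script parameters discussed above, sketched with `argparse`. Every name here (`--model-type`, `--fine-tuned`, the scorer functions) is hypothetical and only illustrates dispatching on model type; the scoring functions are placeholders, not implementations.

```python
import argparse


def score_masked_marginal(ref_seq, mutant):
    # Placeholder: masked-marginal scoring for masked language models (ESM).
    raise NotImplementedError


def score_ar_log_likelihood(ref_seq, mutant):
    # Placeholder: a zero-shot scoring function for autoregressive models.
    raise NotImplementedError


# Maps the model type chosen on the command line to its zero-shot scorer.
ZERO_SHOT_SCORERS = {
    "esm": score_masked_marginal,
    "ar": score_ar_log_likelihood,
}


def build_parser():
    parser = argparse.ArgumentParser(description="Basic ProteinGym eval (sketch)")
    parser.add_argument("--model-type", choices=sorted(ZERO_SHOT_SCORERS), default="esm")
    parser.add_argument("--fine-tuned", action="store_true",
                        help="evaluate a fine-tuned model instead of zero-shot")
    return parser


def select_scorer(args):
    if args.fine_tuned:
        raise NotImplementedError("fine-tuned evaluation is not sketched here")
    return ZERO_SHOT_SCORERS[args.model_type]


args = build_parser().parse_args(["--model-type", "ar"])
scorer = select_scorer(args)
```

Keeping the scorers in a dictionary means a new model family only needs one new entry rather than another branch in the main script.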
@pascalnotin Hi, to implement the script you provided, we need the base sequence.
Hi @Muedi, each assay mutates a single reference sequence. We have a reference file with all of these reference sequences in the repo (they can't be inferred from the mutants, as the mutated range is sometimes just a subset of the full protein).
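A minimal sketch of how the reference file could be used: look up the reference sequence for an assay, then apply a mutant string to it. The column names `DMS_id` and `target_seq`, the `A1C:D3E`-style mutant format, and both function names are assumptions for illustration; they should be checked against the actual reference file.

```python
import csv
import io


def load_reference_sequences(handle, id_col="DMS_id", seq_col="target_seq"):
    """Map each assay id to its reference sequence (column names are assumed)."""
    return {row[id_col]: row[seq_col] for row in csv.DictReader(handle)}


def apply_mutant(ref_seq, mutant, offset=1):
    """Return the mutated sequence for a colon-separated mutant string."""
    residues = list(ref_seq)
    for sub in mutant.split(":"):
        wt, pos, mt = sub[0], int(sub[1:-1]) - offset, sub[-1]
        # Check against the reference so a wrong offset fails loudly.
        if not 0 <= pos < len(ref_seq) or ref_seq[pos] != wt:
            raise ValueError(f"{sub} does not match the reference sequence")
        residues[pos] = mt
    return "".join(residues)


# Toy reference file held in memory; a real run would open the CSV from disk.
toy_file = io.StringIO("DMS_id,target_seq\nassay_1,MKTAY\n")
refs = load_reference_sequences(toy_file)
mutated = apply_mutant(refs["assay_1"], "K2R:Y5F")
```

Validating the wild-type residue against the reference catches off-by-one errors early, which matters because positions are numbered against the full protein rather than the mutated range.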
Ok, thanks, I'll revert the respective function then and download this file too instead :)
/take