Create basic eval script for ProteinGym benchmark #4

pascalnotin · 2023-08-02T21:45:05Z

No description provided.

Muedi · 2023-08-03T19:45:48Z

What do we want as a 'basic' eval script? Should I try to make a script that downloads and preprocesses the CSVs and then uses ESM-2 as a placeholder, taking its final hidden states and feeding these to one row of logits?
Would this work as a zero-shot test? Any suggestions in this direction would be welcome; so far I have only used my own models or pretrained ones as is :)

If the script runs, we can later use our models instead and have a comparison with esm already built in :)

pascalnotin · 2023-08-06T02:47:00Z

Thanks @Muedi ! What you suggest is great for the semi-supervised property prediction setting.

For the pure zero-shot setting with ESM models we could use the ESM-1v masked-marginal approach as described here: https://github.com/facebookresearch/esm/blob/main/examples/variant-prediction/predict.py (that's what we've been using when reporting the corresponding model performance in ProteinGym). It works nearly out of the box for the single-sequence-only ESM models (e.g., ESM1b, ESM1v, ESM2), except for sequences that are longer than the context length (i.e., longer than 1023 AAs, if we count the BOS token).
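To make the masked-marginal idea concrete, here is a minimal, model-agnostic sketch: mask each mutated position in the reference sequence and sum log p(mutant) - log p(wild type) over the mutations. The names `masked_marginal_score`, `parse_mutant` and `uniform_model` are hypothetical and not taken from the ESM script, and the `A24G:K30R` mutant format with 1-based positions is an assumption about the input data. A real run would replace `uniform_model` with a call into an ESM model.

```python
import math
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical interface: given a sequence and a 0-based index, return
# log-probabilities over amino acids with that position masked.
MaskedLogProbs = Callable[[str, int], Dict[str, float]]


def parse_mutant(mutant: str) -> List[Tuple[str, int, str]]:
    """Split 'A24G:K30R' into [('A', 24, 'G'), ('K', 30, 'R')].

    Assumes ':'-separated substitutions with 1-based positions.
    """
    return [(m[0], int(m[1:-1]), m[-1]) for m in mutant.split(":")]


def masked_marginal_score(
    ref_seq: str,
    mutant: str,
    masked_log_probs: MaskedLogProbs,
    offset: int = 1,
) -> float:
    """Sum log p(mutant aa) - log p(wild-type aa) over all mutated positions.

    Each position is masked in the reference (wild-type) sequence, which is
    the masked-marginal heuristic. `offset` is the position of the first
    residue of `ref_seq` in the mutant numbering.
    """
    score = 0.0
    for wt, pos, mt in parse_mutant(mutant):
        idx = pos - offset
        if not 0 <= idx < len(ref_seq):
            raise ValueError(f"position {pos} is outside the reference sequence")
        if ref_seq[idx] != wt:
            raise ValueError(f"expected {wt} at position {pos}, found {ref_seq[idx]}")
        log_probs = masked_log_probs(ref_seq, idx)
        score += log_probs[mt] - log_probs[wt]
    return score


def uniform_model(seq: str, idx: int) -> Dict[str, float]:
    """Toy stand-in for a real masked LM: every amino acid is equally likely."""
    log_p = -math.log(len(AMINO_ACIDS))
    return {aa: log_p for aa in AMINO_ACIDS}
```

With the toy model every substitution scores zero, which is a quick sanity check that the bookkeeping is right before plugging in a real model.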

Muedi · 2023-08-06T10:21:55Z

Is the code that adapts it to ProteinGym available too?

Then I'll gladly take that and put the script together, perhaps with a true/false flag for whether we want zero-shot or fine-tuned results?
Otherwise I'll adapt what you linked already :)

pascalnotin · 2023-08-06T13:04:48Z

Sounds good regarding the binary flag. Our code is not yet open source, but the ESM script should be good enough for testing things out (95% of assay sequences are below the 1023 threshold). Note that this script will only be relevant for masked language models, not the AR models we intend to train. So we will need another script parameter to choose the model type (ESM vs AR), and the code will then use the relevant zero-shot scoring function/utils.
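A rough sketch of how the two script parameters discussed here could be wired up with `argparse`. The flag names `--model-type` and `--fine-tuned`, and the functions `score_masked_lm`, `score_autoregressive`, `build_parser` and `select_scorer`, are hypothetical placeholders rather than the interface of the eventual script; the scoring functions are left as stubs.

```python
import argparse
from typing import Callable, Dict


def score_masked_lm(ref_seq: str, mutant: str) -> float:
    """Placeholder for masked-marginal scoring with an ESM-style model."""
    raise NotImplementedError("plug in masked-LM zero-shot scoring here")


def score_autoregressive(ref_seq: str, mutant: str) -> float:
    """Placeholder for zero-shot scoring with an autoregressive model."""
    raise NotImplementedError("plug in AR zero-shot scoring here")


# Maps the value of --model-type to the matching zero-shot scoring function.
SCORERS: Dict[str, Callable[[str, str], float]] = {
    "esm": score_masked_lm,
    "ar": score_autoregressive,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="ProteinGym eval (sketch)")
    parser.add_argument(
        "--model-type",
        choices=sorted(SCORERS),
        default="esm",
        help="which family of zero-shot scoring utilities to use",
    )
    parser.add_argument(
        "--fine-tuned",
        action="store_true",
        help="evaluate in the fine-tuned setting instead of zero-shot",
    )
    return parser


def select_scorer(args: argparse.Namespace) -> Callable[[str, str], float]:
    """Return the zero-shot scorer for the chosen model type."""
    if args.fine_tuned:
        raise NotImplementedError("fine-tuned evaluation is not sketched here")
    return SCORERS[args.model_type]
```

Keeping the dispatch in one dictionary means a new model family only needs one new entry and one new scoring function.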

Muedi · 2023-08-13T05:04:39Z

@pascalnotin Hi, to implement the script you linked we need the base sequence.
Do some experiments have multiple base sequences?
If not, I have already written a function that returns the base sequences for all experiments during preprocessing :)

pascalnotin · 2023-08-17T15:21:26Z

Hi @Muedi, each assay mutates a single reference sequence. We have a reference file with all these reference sequences in the repo (they can't be inferred from the mutants, as the mutated range is sometimes just a subset of the full protein).
Link to reference file: https://github.com/OATML-Markslab/ProteinGym/blob/main/ProteinGym_reference_file_substitutions.csv
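As a sketch of how the reference file could be used, the snippet below reads assay IDs and reference sequences from an already opened CSV and separates the sequences that fit the 1023 threshold quoted above from those that do not. The column names `DMS_id` and `target_seq` are assumptions about the file layout and should be checked against the actual header; `load_reference_sequences` and `split_by_length` are hypothetical helper names. Whether the threshold should count special tokens depends on the model, so `max_len` is a parameter.

```python
import csv
from typing import Dict, IO, Tuple

# Threshold quoted in the thread; adjust if special tokens are counted differently.
MAX_LEN = 1023


def load_reference_sequences(
    handle: IO[str],
    id_col: str = "DMS_id",       # assumed column name
    seq_col: str = "target_seq",  # assumed column name
) -> Dict[str, str]:
    """Map each assay ID to its single reference sequence."""
    reader = csv.DictReader(handle)
    return {row[id_col]: row[seq_col] for row in reader}


def split_by_length(
    ref_seqs: Dict[str, str], max_len: int = MAX_LEN
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (assays that fit the context length, assays that are too long)."""
    fits = {k: v for k, v in ref_seqs.items() if len(v) <= max_len}
    too_long = {k: v for k, v in ref_seqs.items() if len(v) > max_len}
    return fits, too_long
```

Usage would look like `with open(path, newline="") as fh: refs = load_reference_sequences(fh)`, where `path` points at a local copy of the reference file.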

Muedi · 2023-08-17T18:20:33Z

Ok, thanks, I will revert that function then and download this file as well :)

Muedi · 2023-08-21T11:52:12Z

/take

pascalnotin · 2023-10-05T15:35:59Z

Closing this issue as it was addressed by PR #50. Thank you @Muedi!

pascalnotin mentioned this issue Aug 6, 2023

Instantiate base model class in HF #2

Closed

github-actions bot assigned Muedi Aug 21, 2023

Muedi added a commit to Muedi/protein-lm-scaling that referenced this issue Sep 26, 2023

removed old eval script, added changes for Issues (OpenBioML#4) and (O…

7f878c8

…penBioML#53)

pascalnotin closed this as completed Oct 5, 2023
