Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't get protein level confidence. "IndexError: index -3 is out of bounds for axis 0 with size 1" #113

Open
roushrsh opened this issue Nov 1, 2023 · 3 comments

Comments

@roushrsh
Copy link

roushrsh commented Nov 1, 2023

I'm trying to run mokapot on this file (TestAX.pin):
https://drive.google.com/file/d/1T4447c9Y24hx3qHla_cW55HEg4A_XEuW/view?usp=sharing

Which works great to give peptide confidence:

psms = mokapot.read_pin("TestAX.pin")
results, models = mokapot.brew(psms)
print(results)
A mokapot.confidence.LinearConfidence object:
	- PSMs at q<=0.01: 83617
	- Peptides at q<=0.01: 42508

I've then shuffled the peptides myself to make decoys which I would like to provide.

I've tried keeping them either their own row:
https://drive.google.com/file/d/1mP-nsqDAER2JcxoUxHc1H6Ed5qx6SCH7/view?usp=sharing
or grouped into one string so mokapot does the digesting (can I have this not done?)
https://drive.google.com/file/d/1_A-q9rqytLr9-YYrW6UD69jrtf0APaXA/view?usp=sharing

I do :

psms.add_proteins('Protein.csv',
    enzyme="[KR]",
    decoy_prefix="decoy_",
    missed_cleavages=2,
                )

then run mokapot.brew(psms) again.

Regardless, in case 1)

"WARNING:mokapot.picked_protein: 149241 out of 317925 peptides could not be mapped. Please check your digest settings." 

and I get

ValueError: Fewer than 90% of all peptides could be matched to proteins. Please verify that your digest settings are correct."

and in case 2)

WARNING:mokapot.picked_protein:11128 out of 317925 peptides could not be mapped. Please check your digest settings.

and I get the error:

    258 scores[rights == 0] = g[0] - (medians[0] - scores[rights == 0]) * derr # reuse "scores" array to save memory
    260 derl = (g[-1] - g[-2]) / (medians[-1] - medians[-2]) + (medians[-1] - medians[-2]) / 6 * gamma[-3]
IndexError: index 0 is out of bounds for axis 0 with size 0"

Any suggestions?

Thanks

@wfondrie
Copy link
Owner

wfondrie commented Nov 1, 2023

Hi @roushrsh 👋

Can you please provide the version of mokapot that you're using and please provide the full log? You should be able to find the version with:

import mokapot
print(mokapot.__version__)

To enable detailed logging within your python session, you can add:

import logging
logging.basicConfig(level=logging.INFO)

I'll do my best to answer your questions without this information for now though.

For protein-level confidence estimates, mokapot internally digests proteins from a FASTA file; the CSV files that you provided above are not supported. The protein grouping is performed based on all theoretical digested peptides for a protein sequence, so I would generally advise letting mokapot handle it. However, if you really want to provide your own sequences, you can do so by initializing the mokapot.proteins.Proteins object directly:

class Proteins:
"""Store protein sequences.
This class stores the mapping of peptides to proteins and the mapping of
target proteins to their corresponding decoy proteins.
We recommend creating Proteins objects using the
:py:func:`mokapot.read_fasta()` function.
Parameters
----------
decoy_prefix : str
The prefix used to indicate a decoy protein in the description
lines of the FASTA file.
peptide_map : Dict[str, str]
A dictionary mapping peptide sequences to the proteins that
may have generated them.
protein_map : Dict[str, str]
A dictionary mapping decoy proteins to the target proteins from
which they were generated.
shared_peptides : Dict[str]
A dictionary mapping shared peptides to the proteins that may have
generated them.
has_decoys : bool
Did the FASTA file have decoy proteins in it?
Attributes
----------
decoy_prefix : str
The prefix used to indicate a decoy protein in the description
lines of the FASTA file.
peptide_map : Dict[str, str]
A dictionary mapping peptide sequences to the proteins that
may have generated them.
protein_map : Dict[str, str]
A dictionary mapping decoy proteins to the target proteins from
which they were generated.
shared_peptides : Dict[str]
A dictionary mapping shared peptides to the proteins that may have
generated them.
has_decoys : bool
Did the FASTA file have decoy proteins in it?
"""

However, you'll need to provide a mapping of target proteins to decoy proteins, a mapping of peptides to their generating proteins, and set of all shared peptides, in addition to other information. I strongly recommend using a FASTA file.

To my knowledge, the error you're seeing does not originate from the mokapot codebase, so I can't provide much insight without seeing the full log.

@roushrsh
Copy link
Author

roushrsh commented Nov 2, 2023

Edit/Update:
Version is 0.10.0, but I had the same problem on the older version.

I made the protein structure and all the parts:

proteins = Proteins(peptide_map=dictionaryPepts, protein_map=dictionaryProts, shared_peptides= dict(zip([],[])), has_decoys=True, decoy_prefix='decoy_')

but when I use add_proteins the Protein confidence estimates enabled remains False and nothing happens. How can I add them properly?
Even loading from fasta and replacing the parts with mine seems to do nothing.

I got the 'fasta' version to work though, thanks!

@wfondrie
Copy link
Owner

wfondrie commented Nov 9, 2023

The approach you took should work with the Proteins object should work. Can you create a tiny reproducible example to share?

Glad the fasta is working for you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants