Add fastalignment metric #456
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -478,3 +478,200 @@ def _self_alignment_scores(self, seqs: Sequence) -> dict: | |||||||||
| dtype=int, | ||||||||||
| count=len(seqs), | ||||||||||
| ) | ||||||||||
|
|
||||||||||
|
|
||||||||||
| @_doc_params(params=_doc_params_parallel_distance_calculator) | ||||||||||
| class FastAlignmentDistanceCalculator(ParallelDistanceCalculator): | ||||||||||
| """\ | ||||||||||
| Calculates distance between sequences based on pairwise sequence alignment. | ||||||||||
|
|
||||||||||
| This is a variation of the AlignmentDistanceCalculator which pre-filters sequence pairs based on | ||||||||||
| a) differences in sequence length | ||||||||||
| b) the number of different characters, based on an estimate of the mismatch penalty | ||||||||||
|
|
||||||||||
| Depending on the setup, alignment may be performed significantly faster, but be advised that some sequence pairs may be filtered out incorrectly. | ||||||||||
| Default values for BLOSUM and PAM matrices are provided, but finding an adequate estimated_penalty for your given setup is encouraged. | ||||||||||
|
|
||||||||||
| The distance between two sequences is defined as :math:`S_{{1,2}}^{{max}} - S_{{1,2}}`, | ||||||||||
| where :math:`S_{{1,2}}` is the alignment score of sequences 1 and 2 and | ||||||||||
| :math:`S_{{1,2}}^{{max}}` is the max. achievable alignment score of sequences 1 and 2. | ||||||||||
| :math:`S_{{1,2}}^{{max}}` is defined as :math:`\\min(S_{{1,1}}, S_{{2,2}})`. | ||||||||||
|
|
||||||||||
| The use of alignment-based distances is heavily inspired by :cite:`TCRdist`. | ||||||||||
|
|
||||||||||
| High-performance sequence alignments are calculated leveraging | ||||||||||
| the `parasail library <https://github.com/jeffdaily/parasail-python>`_ (:cite:`Daily2016`). | ||||||||||
|
|
||||||||||
| Choosing a cutoff: | ||||||||||
| Alignment distances need to be viewed in the light of the substitution matrix. | ||||||||||
| The alignment distance is the difference between the actual alignment | ||||||||||
| score and the max. achievable alignment score. For instance, a mutation | ||||||||||
| from *Leucine* (`L`) to *Isoleucine* (`I`) results in a BLOSUM62 score of `2`. | ||||||||||
| An `L` aligned with `L` achieves a score of `4`. The distance is, therefore, `2`. | ||||||||||
|
|
||||||||||
| On the other hand, a single *Tryptophan* (`W`) mutating into, e.g. | ||||||||||
| *Proline* (`P`) already results in a distance of `15`. | ||||||||||
|
|
||||||||||
| We still lack empirical data on the distance up to which a CDR3 sequence | ||||||||||
| is likely to recognize the same antigen, but reasonable cutoffs are `<15`. | ||||||||||
|
|
||||||||||
| Choosing an estimated penalty: | ||||||||||
| The choice of an expected penalty is likely influenced by similar considerations as the | ||||||||||
| other parameters. Essentially, this can be thought of as a superficial (dis)similarity | ||||||||||
| measure. A higher value more strongly penalizes mismatching characters and is more in line | ||||||||||
| with looking for closely related sequence pairs, while a lower value is more forgiving | ||||||||||
| and better suited when looking for more distantly related sequence pairs. | ||||||||||
|
|
||||||||||
| Parameters | ||||||||||
| ---------- | ||||||||||
| cutoff | ||||||||||
| Will eliminate distances > cutoff to make efficient | ||||||||||
| use of sparse matrices. The default cutoff is `10`. | ||||||||||
| {params} | ||||||||||
| subst_mat | ||||||||||
| Name of parasail substitution matrix | ||||||||||
| gap_open | ||||||||||
| Gap open penalty | ||||||||||
| gap_extend | ||||||||||
| Gap extend penalty | ||||||||||
| estimated_penalty | ||||||||||
| Estimate of the average mismatch penalty | ||||||||||
| """ | ||||||||||
|
|
||||||||||
| def __init__( | ||||||||||
| self, | ||||||||||
| cutoff: Union[None, int] = None, | ||||||||||
| *, | ||||||||||
| n_jobs: Union[int, None] = None, | ||||||||||
| block_size: int = 50, | ||||||||||
| subst_mat: str = "blosum62", | ||||||||||
| gap_open: int = 11, | ||||||||||
| gap_extend: int = 11, | ||||||||||
| estimated_penalty: Union[None, float] = None, | ||||||||||
| ): | ||||||||||
| if cutoff is None: | ||||||||||
| cutoff = 10 | ||||||||||
| super().__init__(cutoff, n_jobs=n_jobs, block_size=block_size) | ||||||||||
| self.subst_mat = subst_mat | ||||||||||
| self.gap_open = gap_open | ||||||||||
| self.gap_extend = gap_extend | ||||||||||
|
|
||||||||||
| penalty_dict = { | ||||||||||
|
||||||||||
| "blosum30": 4.0, | ||||||||||
| "blosum35": 4.0, | ||||||||||
| "blosum40": 4.0, | ||||||||||
| "blosum45": 4.0, | ||||||||||
| "blosum50": 4.0, | ||||||||||
| "blosum55": 4.0, | ||||||||||
| "blosum60": 4.0, | ||||||||||
| "blosum62": 4.0, | ||||||||||
| "blosum65": 4.0, | ||||||||||
| "blosum70": 4.0, | ||||||||||
| "blosum75": 4.0, | ||||||||||
| "blosum80": 4.0, | ||||||||||
| "blosum85": 4.0, | ||||||||||
| "blosum90": 4.0, | ||||||||||
| "pam10": 8.0, | ||||||||||
| "pam20": 8.0, | ||||||||||
| "pam30": 8.0, | ||||||||||
| "pam40": 8.0, | ||||||||||
| "pam50": 8.0, | ||||||||||
| "pam60": 4.0, | ||||||||||
| "pam70": 4.0, | ||||||||||
| "pam80": 4.0, | ||||||||||
| "pam90": 4.0, | ||||||||||
| "pam100": 4.0, | ||||||||||
| "pam110": 2.0, | ||||||||||
| "pam120": 2.0, | ||||||||||
| "pam130": 2.0, | ||||||||||
| "pam140": 2.0, | ||||||||||
| "pam150": 2.0, | ||||||||||
| "pam160": 2.0, | ||||||||||
| "pam170": 2.0, | ||||||||||
| "pam180": 2.0, | ||||||||||
| "pam190": 2.0, | ||||||||||
| "pam200": 2.0, | ||||||||||
| } | ||||||||||
|
|
||||||||||
| self.estimated_penalty = ( | ||||||||||
| estimated_penalty | ||||||||||
| if estimated_penalty is not None | ||||||||||
| else penalty_dict[subst_mat] | ||||||||||
| if subst_mat in penalty_dict.keys() | ||||||||||
| else 0.0 | ||||||||||
|
||||||||||
| - else penalty_dict[subst_mat] | |
| - if subst_mat in penalty_dict.keys() | |
| - else 0.0 | |
| + else penalty_dict.get(subst_mat, 0.0) |
I would even consider raising an error if the substitution matrix is unknown and no penalty is specified.
Agreed, I think raising an error would be better
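A minimal sketch of what raising an error could look like. The helper name `resolve_estimated_penalty` and the module-level table `_DEFAULT_PENALTIES` are hypothetical and not part of this PR, and the table is abbreviated to three entries copied from the diff:

```python
from typing import Optional

# Hypothetical, abbreviated copy of the defaults from the diff (not the full table).
_DEFAULT_PENALTIES = {"blosum62": 4.0, "pam10": 8.0, "pam120": 2.0}


def resolve_estimated_penalty(
    subst_mat: str, estimated_penalty: Optional[float] = None
) -> float:
    """Pick the penalty to use: explicit value first, then the default, else fail."""
    # An explicitly passed penalty always wins, even for unknown matrices.
    if estimated_penalty is not None:
        return estimated_penalty
    try:
        return _DEFAULT_PENALTIES[subst_mat]
    except KeyError:
        # Fail loudly instead of silently falling back to 0.0.
        raise ValueError(
            f"No default estimated_penalty for substitution matrix {subst_mat!r}; "
            "please pass estimated_penalty explicitly."
        ) from None
```

With this shape, `__init__` would reduce to `self.estimated_penalty = resolve_estimated_penalty(subst_mat, estimated_penalty)`, and an unknown matrix without an explicit penalty would surface as a `ValueError` at construction time rather than as a silent `0.0`.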