|
1 |
| -# midsv-development |
| 1 | +[](https://choosealicense.com/licenses/mit/) |
| 2 | +[](https://github.com/akikuno/midsv/actions) |
| 3 | +[](https://pypi.org/project/midsv/) |
| 4 | +[](https://pypi.org/project/midsv/) |
| 5 | +[](https://anaconda.org/bioconda/midsv) |
| 6 | + |
| 7 | + |
| 8 | +# midsv |
| 9 | + |
| 10 | +`midsv` is a Python module to convert SAM to MIDSV format. |
| 11 | + |
| 12 | +MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format that represents the differences between a reference and a query, with one field per reference nucleotide so that the output has the same length as the reference. |
| 13 | + |
| 14 | +> ⚠️ MIDSV is intended for target amplicon sequences (10-100 kbp). It may run out of memory and crash when whole chromosomes are used as the reference. |
| 15 | +
|
| 16 | +MIDSV provides `MIDSV`, `CSSPLIT`, and `QSCORE`. |
| 17 | + |
| 18 | +- `MIDSV` is a simple representation focusing on mutations |
| 19 | +- `CSSPLIT` keeps original nucleotides |
| 20 | +- `QSCORE` provides the Phred quality score of each nucleotide |
| 21 | + |
| 22 | +Details of MIDSV (formerly named MIDS) are described in [our paper](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001507#sec009). |
| 23 | + |
| 24 | +# Installation |
| 25 | + |
| 26 | +From [PyPI](https://pypi.org/project/midsv/): |
2 | 27 |
|
3 | 28 | ```bash
|
4 |
| -rm -rf midsv |
5 |
| -[ -d midsv ] || git clone https://github.com/akikuno/midsv.git |
| 29 | +pip install midsv |
| 30 | +``` |
| 31 | + |
| 32 | +From [Bioconda](https://anaconda.org/bioconda/midsv): |
| 33 | + |
| 34 | +```bash |
| 35 | +conda install -c bioconda midsv |
| 36 | +``` |
| 37 | + |
| 38 | +# Usage |
| 39 | + |
| 40 | +```python |
| 41 | +midsv.transform( |
| 42 | + sam: list[list], |
| 43 | + midsv: bool = True, |
| 44 | + cssplit: bool = True, |
| 45 | + qscore: bool = True) -> list[dict] |
| 46 | +``` |
| 47 | + |
| 48 | +- `midsv.transform()` returns a list of dictionaries including `QNAME`, `RNAME`, `MIDSV`, `CSSPLIT`, and `QSCORE`. |
| 49 | +- `MIDSV`, `CSSPLIT`, and `QSCORE` are comma-separated strings with as many fields as the reference sequence length. |
| 50 | + |
| 51 | +```python |
| 52 | +import midsv |
| 53 | + |
| 54 | +# Perfect match |
| 55 | + |
| 56 | +sam = [ |
| 57 | + ['@SQ', 'SN:example', 'LN:10'], |
| 58 | + ['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC'] |
| 59 | + ] |
| 60 | + |
| 61 | +midsv.transform(sam) |
| 62 | +# [{ |
| 63 | +# 'QNAME': 'match', |
| 64 | +# 'RNAME': 'example', |
| 65 | +# 'MIDSV': 'M,M,M,M,M,M,M,M,M,M', |
| 66 | +# 'CSSPLIT': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C', |
| 67 | +# 'QSCORE': '15,16,17,18,19,20,21,22,23,24' |
| 68 | +# }] |
| 69 | + |
| 70 | +# Insertion, deletion and substitution |
| 71 | + |
| 72 | +sam = [ |
| 73 | + ['@SQ', 'SN:example', 'LN:10'], |
| 74 | + ['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!567', 'cs:Z:=ACGT*ag+ttt=C-aa=GT'] |
| 75 | + ] |
| 76 | + |
| 77 | +midsv.transform(sam) |
| 78 | +# [{ |
| 79 | +# 'QNAME': 'indel_sub', |
| 80 | +# 'RNAME': 'example', |
| 81 | +# 'MIDSV': 'M,M,M,M,S,3M,D,D,M,M', |
| 82 | +# 'CSSPLIT': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T', |
| 83 | +# 'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22' |
| 84 | +# }] |
| 85 | + |
| 86 | +# Large deletion |
| 87 | + |
| 88 | +sam = [ |
| 89 | + ['@SQ', 'SN:example', 'LN:10'], |
| 90 | + ['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'], |
| 91 | + ['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC'] |
| 92 | + ] |
| 93 | + |
| 94 | +midsv.transform(sam) |
| 95 | +# [ |
| 96 | +# {'QNAME': 'large-deletion', |
| 97 | +# 'RNAME': 'example', |
| 98 | +# 'MIDSV': 'M,M,D,D,D,D,D,D,M,M', |
| 99 | +# 'CSSPLIT': '=A,=C,N,N,N,N,N,N,=A,=C', |
| 100 | +# 'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'} |
| 101 | +# ] |
| 102 | + |
| 103 | +# Inversion |
| 104 | + |
| 105 | +sam = [ |
| 106 | + ['@SQ', 'SN:example', 'LN:10'], |
| 107 | + ['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'], |
| 108 | + ['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'], |
| 109 | + ['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC'] |
| 110 | + ] |
| 111 | + |
| 112 | +midsv.transform(sam) |
| 113 | +# [ |
| 114 | +# {'QNAME': 'inversion', |
| 115 | +# 'RNAME': 'example', |
| 116 | +# 'MIDSV': 'M,M,M,M,M,m,m,m,M,M', |
| 117 | +# 'CSSPLIT': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C', |
| 118 | +# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'} |
| 119 | +# ] |
6 | 120 |
|
7 |
| -conda create -n env-midsv python=3.12 ipykernel pip pytest coverage ruff mypy black cstag -y |
8 | 121 | ```
|
9 | 122 |
|
10 |
| -# Priority |
| 123 | +# Operators |
| 124 | + |
| 125 | +## MIDSV |
| 126 | + |
| 127 | +| Op | Description | |
| 128 | +| ----------- | --------------------------- | |
| 129 | +| M | Identical sequence | |
| 130 | +| [1-9][0-9]* | Insertion to the reference |
| 131 | +| D | Deletion from the reference | |
| 132 | +| S | Substitution | |
| 133 | +| N | Unknown | |
| 134 | +| [mdsn] | Inversion | |
| 135 | + |
| 136 | +`MIDSV` represents an insertion as an integer (the number of inserted nucleotides) and appends the operator of the reference position that follows it. |
| 137 | + |
| 138 | +If a five-nucleotide insertion is followed by three matches, MIDSV returns `5M,M,M` (not `5,M,M,M`), because `5M,M,M` keeps the number of comma-separated fields equal to the reference sequence length. |
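
The insertion rule above can be illustrated with a small parser. This is only a sketch: `parse_midsv_field` is a hypothetical helper, not part of the `midsv` API, and it assumes that every field is an optional insertion count followed by exactly one operator.

```python
import re


def parse_midsv_field(field: str) -> tuple[int, str]:
    """Split one MIDSV field into (insertion length, operator).

    Hypothetical helper for illustration; not provided by midsv.
    """
    match = re.fullmatch(r"([1-9][0-9]*)?([MDSNmdsn])", field)
    if match is None:
        raise ValueError(f"Unexpected MIDSV field: {field!r}")
    insertion_length = int(match.group(1)) if match.group(1) else 0
    return insertion_length, match.group(2)


midsv_string = "M,M,M,M,S,3M,D,D,M,M"
parsed = [parse_midsv_field(field) for field in midsv_string.split(",")]

print(len(parsed))  # 10, one field per reference nucleotide
print(parsed[5])    # (3, 'M'): three inserted nucleotides before a match
```

Because the insertion count is attached to the next operator, `len(parsed)` stays equal to the reference length regardless of how many nucleotides were inserted.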
| 139 | + |
| 140 | +## CSSPLIT |
| 141 | + |
| 142 | +| Op | Regex | Description | |
| 143 | +| --- | -------------- | ---------------------------- | |
| 144 | +| = | [ACGTN] | Identical sequence | |
| 145 | +| + | [ACGTN] | Insertion to the reference | |
| 146 | +| - | [ACGTN] | Deletion from the reference | |
| 147 | +| * | [ACGTN][ACGTN] | Substitution | |
| 148 | +| | [acgtn] | Inversion | |
| 149 | +| \| | | Separator at insertion sites |
| 150 | + |
| 151 | +`CSSPLIT` uses `|` to separate nucleotides in insertion sites. |
| 152 | + |
| 153 | +Therefore, `+A|+C|+G|+T|=A` can be easily split into `[+A, +C, +G, +T, =A]` by `"+A|+C|+G|+T|=A".split("|")` in Python. |
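
As a rough sketch of how the two separators combine, the snippet below first splits a `CSSPLIT` string on `,` (one field per reference position) and then on `|` (inserted nucleotides plus the final operator). The input string is taken from the usage example; the variable names are illustrative only.

```python
cssplit = "=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T"

# One comma-separated field per reference position.
fields = cssplit.split(",")

# Within a field, "|" separates the inserted nucleotides from the last
# operator, which belongs to the reference position itself.
insertion_sites = []
for position, field in enumerate(fields, start=1):
    *insertions, reference_op = field.split("|")
    if insertions:
        inserted = "".join(op[1:] for op in insertions)  # drop the leading "+"
        insertion_sites.append((position, inserted, reference_op))

print(len(fields))       # 10
print(insertion_sites)   # [(6, 'TTT', '=C')]
```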
| 154 | + |
| 155 | +## QSCORE |
| 156 | + |
| 157 | + |
| 158 | +| Op | Description | |
| 159 | +| --- | ---------------------------- | |
| 160 | +| -1 | Unknown | |
| 161 | +| \| | Separator at insertion sites | |
| 162 | + |
| 163 | +`QSCORE` uses `-1` for deleted or unknown nucleotides. |
| 164 | + |
| 165 | +As with `CSSPLIT`, `QSCORE` uses `|` to separate quality scores in insertion sites. |
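
The scores in the usage examples are consistent with the Phred+33 encoding of the SAM `QUAL` field (`0` gives 15, `!` gives 0). The sketch below assumes that encoding and shows one possible way to post-process a `QSCORE` string; the helper name and the choice to keep only the last value of each field are assumptions for illustration, not part of `midsv`.

```python
def phred(char: str) -> int:
    """Decode one Phred+33 quality character (SAM QUAL convention)."""
    return ord(char) - 33


print(",".join(str(phred(c)) for c in "0123456789"))
# 15,16,17,18,19,20,21,22,23,24

qscore = "15,16,17,18,19,0|0|0|20,-1,-1,21,22"

# Keep the score of the reference position itself (the last value in each
# field) and treat -1 as missing.
reference_scores = [int(field.split("|")[-1]) for field in qscore.split(",")]
known_scores = [score for score in reference_scores if score != -1]

print(reference_scores)                      # [15, 16, 17, 18, 19, 20, -1, -1, 21, 22]
print(sum(known_scores) / len(known_scores)) # 18.5
```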
| 166 | + |
| 167 | +# Helper functions |
| 168 | + |
| 169 | +## Read SAM file |
| 170 | + |
| 171 | +```python |
| 172 | +midsv.read_sam(path_of_sam: str | Path) -> list[list] |
| 173 | +``` |
| 174 | + |
| 175 | +`midsv.read_sam` reads a SAM file into a list of lists. |
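
To make the "list of lists" shape concrete, here is a minimal stand-in that splits each SAM line on tabs. It is not the implementation of `midsv.read_sam`; `parse_sam_lines` is a hypothetical function, and the result is assumed to match the hand-written `sam` lists in the usage examples.

```python
import io

sam_text = (
    "@SQ\tSN:example\tLN:10\n"
    "match\t0\texample\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t0123456789\tcs:Z:=ACGTACGTAC\n"
)


def parse_sam_lines(handle) -> list[list[str]]:
    """Return one list of tab-separated fields per non-empty line.

    Illustrative only; midsv.read_sam may behave differently.
    """
    return [line.rstrip("\n").split("\t") for line in handle if line.strip()]


records = parse_sam_lines(io.StringIO(sam_text))
print(records[0])        # ['@SQ', 'SN:example', 'LN:10']
print(len(records[1]))   # 12 fields in the alignment line
```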
| 176 | + |
| 177 | + |
| 178 | +## Read/Write JSON Lines (JSONL) |
| 179 | + |
| 180 | +```python |
| 181 | +midsv.write_jsonl(dict: list[dict], path_of_jsonl: str | Path) |
| 182 | +``` |
| 183 | + |
| 184 | +```python |
| 185 | +midsv.read_jsonl(path_of_jsonl: str | Path) -> list[dict] |
| 186 | +``` |
| 187 | + |
| 188 | +Since `midsv.transform` returns a list of dictionaries, `midsv.write_jsonl` writes it to a file in JSONL format. |
| 189 | + |
| 190 | +Conversely, `midsv.read_jsonl` reads JSONL as a list of dictionaries. |
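
For readers unfamiliar with JSONL, the round trip can be sketched with the standard library alone, assuming the usual convention of one JSON object per line. This is not how `midsv.write_jsonl` and `midsv.read_jsonl` are necessarily implemented; the file name and record are made up for the example.

```python
import json
import tempfile
from pathlib import Path

records = [
    {"QNAME": "match", "RNAME": "example", "MIDSV": "M,M,M,M,M,M,M,M,M,M"},
]

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "example.jsonl"

    # Write: one JSON object per line.
    with path.open("w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read: parse each line back into a dictionary.
    with path.open() as f:
        restored = [json.loads(line) for line in f]

print(restored == records)  # True
```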
11 | 191 |
|
12 |
| -1. Establish `csvtag.call()` |
13 |
| -1. `csvtag.to_sequence()` |
14 |
| -1. `csvtag.to_consensus()` |
15 |
| -2. On the DAJIN2 side, `convert_csvtag_to_csv` |
|