Commit 219665f

committed
Add README
1 parent 9bc2aaa commit 219665f

5 files changed

+185
-494
lines changed

README.md

+185-9
@@ -1,15 +1,191 @@
1-
# midsv-development
1+
[![Licence](https://img.shields.io/badge/License-MIT-9cf.svg?style=flat-square)](https://choosealicense.com/licenses/mit/)
2+
[![Test](https://img.shields.io/github/actions/workflow/status/akikuno/midsv/ci.yml?branch=main&label=Test&color=brightgreen&style=flat-square)](https://github.com/akikuno/midsv/actions)
3+
[![Python](https://img.shields.io/pypi/pyversions/midsv.svg?label=Python&color=blue&style=flat-square)](https://pypi.org/project/midsv/)
4+
[![PyPI](https://img.shields.io/pypi/v/midsv.svg?label=PyPI&color=orange&style=flat-square)](https://pypi.org/project/midsv/)
5+
[![Bioconda](https://img.shields.io/conda/v/bioconda/midsv?label=Bioconda&color=orange&style=flat-square)](https://anaconda.org/bioconda/midsv)
6+
7+
8+
# midsv
9+
10+
`midsv` is a Python module to convert SAM to MIDSV format.
11+
12+
MIDSV (Match, Insertion, Deletion, Substitution, and inVersion) is a comma-separated format that represents the differences between a reference and a query, using one field per reference position so that its length always equals the reference length.
13+
14+
> ⚠️ MIDSV is intended for targeted amplicon sequences (10-100 kbp). It may run out of memory and crash when whole chromosomes are used as the reference.
15+
16+
MIDSV provides `MIDSV`, `CSSPLIT`, and `QSCORE`.
17+
18+
- `MIDSV` is a simple representation focusing on mutations
19+
- `CSSPLIT` keeps original nucleotides
20+
- `QSCORE` provides the Phred quality score of each nucleotide
21+
22+
MIDSV (formerly named MIDS) details are described in [our paper](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001507#sec009).
23+
24+
# Installation
25+
26+
From [PyPI](https://pypi.org/project/midsv/):
227

328
```bash
4-
rm -rf midsv
5-
[ -d midsv ] || git clone https://github.com/akikuno/midsv.git
29+
pip install midsv
30+
```
31+
32+
From [Bioconda](https://anaconda.org/bioconda/midsv):
33+
34+
```bash
35+
conda install -c bioconda midsv
36+
```
37+
38+
# Usage
39+
40+
```python
41+
midsv.transform(
42+
sam: list[list],
43+
midsv: bool = True,
44+
cssplit: bool = True,
45+
qscore: bool = True) -> list[dict]
46+
```
47+
48+
- `midsv.transform()` returns a list of dictionaries including `QNAME`, `RNAME`, `MIDSV`, `CSSPLIT`, and `QSCORE`.
49+
- `MIDSV`, `CSSPLIT`, and `QSCORE` are comma-separated strings, each with as many fields as the reference sequence has nucleotides.
50+
51+
```python
52+
import midsv
53+
54+
# Perfect match
55+
56+
sam = [
57+
['@SQ', 'SN:example', 'LN:10'],
58+
['match', '0', 'example', '1', '60', '10M', '*', '0', '0', 'ACGTACGTAC', '0123456789', 'cs:Z:=ACGTACGTAC']
59+
]
60+
61+
midsv.transform(sam)
62+
# [{
63+
#   'QNAME': 'match',
64+
# 'RNAME': 'example',
65+
# 'MIDSV': 'M,M,M,M,M,M,M,M,M,M',
66+
# 'CSSPLIT': '=A,=C,=G,=T,=A,=C,=G,=T,=A,=C',
67+
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'
68+
# }]
69+
70+
# Insertion, deletion and substitution
71+
72+
sam = [
73+
['@SQ', 'SN:example', 'LN:10'],
74+
['indel_sub', '0', 'example', '1', '60', '5M3I1M2D2M', '*', '0', '0', 'ACGTGTTTCGT', '01234!!!567', 'cs:Z:=ACGT*ag+ttt=C-aa=GT']
75+
]
76+
77+
midsv.transform(sam)
78+
# [{
79+
# 'QNAME': 'indel_sub',
80+
# 'RNAME': 'example',
81+
# 'MIDSV': 'M,M,M,M,S,3M,D,D,M,M',
82+
# 'CSSPLIT': '=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T',
83+
# 'QSCORE': '15,16,17,18,19,0|0|0|20,-1,-1,21,22'
84+
# }]
85+
86+
# Large deletion
87+
88+
sam = [
89+
['@SQ', 'SN:example', 'LN:10'],
90+
['large-deletion', '0', 'example', '1', '60', '2M', '*', '0', '0', 'AC', '01', 'cs:Z:=AC'],
91+
['large-deletion', '0', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
92+
]
93+
94+
midsv.transform(sam)
95+
# [
96+
# {'QNAME': 'large-deletion',
97+
# 'RNAME': 'example',
98+
# 'MIDSV': 'M,M,D,D,D,D,D,D,M,M',
99+
# 'CSSPLIT': '=A,=C,N,N,N,N,N,N,=A,=C',
100+
# 'QSCORE': '15,16,-1,-1,-1,-1,-1,-1,23,24'}
101+
# ]
102+
103+
# Inversion
104+
105+
sam = [
106+
['@SQ', 'SN:example', 'LN:10'],
107+
['inversion', '0', 'example', '1', '60', '5M', '*', '0', '0', 'ACGTA', '01234', 'cs:Z:=ACGTA'],
108+
['inversion', '16', 'example', '6', '60', '3M', '*', '0', '0', 'CGT', '567', 'cs:Z:=CGT'],
109+
['inversion', '2048', 'example', '9', '60', '2M', '*', '0', '0', 'AC', '89', 'cs:Z:=AC']
110+
]
111+
112+
midsv.transform(sam)
113+
# [
114+
# {'QNAME': 'inversion',
115+
# 'RNAME': 'example',
116+
# 'MIDSV': 'M,M,M,M,M,m,m,m,M,M',
117+
# 'CSSPLIT': '=A,=C,=G,=T,=A,=c,=g,=t,=A,=C',
118+
# 'QSCORE': '15,16,17,18,19,20,21,22,23,24'}
119+
# ]
6120

7-
conda create -n env-midsv python=3.12 ipykernel pip pytest coverage ruff mypy black cstag -y
8121
```
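The claim that every field keeps the reference length can be checked without running `midsv` at all. The sketch below copies the output of the insertion/deletion/substitution example and counts the comma-separated fields; `field_lengths` is a hypothetical helper written for this illustration, not part of the `midsv` API.

```python
# Output of the "Insertion, deletion and substitution" example, copied verbatim.
result = {
    "QNAME": "indel_sub",
    "RNAME": "example",
    "MIDSV": "M,M,M,M,S,3M,D,D,M,M",
    "CSSPLIT": "=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T",
    "QSCORE": "15,16,17,18,19,0|0|0|20,-1,-1,21,22",
}
reference_length = 10  # LN:10 in the @SQ header line


def field_lengths(record):
    """Count comma-separated fields in each MIDSV-style column (illustrative helper)."""
    return {key: len(record[key].split(",")) for key in ("MIDSV", "CSSPLIT", "QSCORE")}


lengths = field_lengths(result)
print(lengths)  # {'MIDSV': 10, 'CSSPLIT': 10, 'QSCORE': 10}
print(all(n == reference_length for n in lengths.values()))  # True
```

The three inserted nucleotides do not add fields: they are folded into the field of the next reference position (`3M`, `+T|+T|+T|=C`, `0|0|0|20`).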
9122

10-
# Priorities
123+
# Operators
124+
125+
## MIDSV
126+
127+
| Op | Description |
128+
| ----------- | --------------------------- |
129+
| M | Identical sequence |
130+
| [1-9][0-9]* | Insertion to the reference  |
131+
| D | Deletion from the reference |
132+
| S | Substitution |
133+
| N | Unknown |
134+
| [mdsn] | Inversion |
135+
136+
`MIDSV` represents an insertion as an integer (the number of inserted nucleotides) prepended to the operator of the following reference position.
137+
138+
If five inserted nucleotides are followed by three matches, MIDSV returns `5M,M,M` (not `5,M,M,M`), since `5M,M,M` keeps the number of comma-separated fields equal to the reference sequence length.
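To make the insertion encoding concrete, here is a minimal sketch that splits a `MIDSV` string into `(inserted_length, operator)` pairs. It assumes each field is an optional positive integer followed by exactly one operator letter, as in the table above; `parse_midsv` and `MIDSV_FIELD` are names invented for this example and are not provided by `midsv`.

```python
import re

# Assumed field grammar: optional insertion length, then a single operator letter.
MIDSV_FIELD = re.compile(r"(?P<ins>[1-9][0-9]*)?(?P<op>[MDSNmdsn])")


def parse_midsv(midsv):
    """Return one (inserted_length, operator) tuple per reference position."""
    parsed = []
    for field in midsv.split(","):
        match = MIDSV_FIELD.fullmatch(field)
        if match is None:
            raise ValueError(f"Unexpected MIDSV field: {field!r}")
        inserted = int(match["ins"]) if match["ins"] else 0
        parsed.append((inserted, match["op"]))
    return parsed


print(parse_midsv("5M,M,M"))
# [(5, 'M'), (0, 'M'), (0, 'M')]
```

The result has three tuples for three reference positions, and the five inserted nucleotides are attached to the first of them.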
139+
140+
## CSSPLIT
141+
142+
| Op | Regex | Description |
143+
| --- | -------------- | ---------------------------- |
144+
| = | [ACGTN] | Identical sequence |
145+
| + | [ACGTN] | Insertion to the reference |
146+
| - | [ACGTN] | Deletion from the reference |
147+
| * | [ACGTN][ACGTN] | Substitution |
148+
| | [acgtn] | Inversion |
149+
| \|  |                | Separator of insertion sites |
150+
151+
`CSSPLIT` uses `|` to separate nucleotides in insertion sites.
152+
153+
Therefore, `+A|+C|+G|+T|=A` can easily be split into `["+A", "+C", "+G", "+T", "=A"]` with `"+A|+C|+G|+T|=A".split("|")` in Python.
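Extending that idea to a whole record, the following sketch splits a `CSSPLIT` string first on `,` (reference positions) and then on `|` (tags within an insertion site). The helper name `split_cssplit` is hypothetical; the input string is the one from the insertion/deletion/substitution example.

```python
def split_cssplit(cssplit):
    """Return a list of tag lists, one list per reference position."""
    return [field.split("|") for field in cssplit.split(",")]


positions = split_cssplit("=A,=C,=G,=T,*AG,+T|+T|+T|=C,-A,-A,=G,=T")

# The sixth reference position carries three inserted nucleotides and a match.
print(positions[5])  # ['+T', '+T', '+T', '=C']

# Recover the inserted sequence by keeping only the '+' tags.
inserted = "".join(tag[1:] for tag in positions[5] if tag.startswith("+"))
print(inserted)  # TTT
```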
154+
155+
## QSCORE
156+
157+
158+
| Op | Description |
159+
| --- | ---------------------------- |
160+
| -1 | Unknown |
161+
| \| | Separator at insertion sites |
162+
163+
`QSCORE` uses `-1` for deleted or unknown nucleotides.
164+
165+
As with `CSSPLIT`, `QSCORE` uses `|` to separate quality scores in insertion sites.
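As a rough illustration, a `QSCORE` string can be turned into integers with two nested splits. The sketch also decodes the `QUAL` string of the perfect-match example with the Phred+33 convention of the SAM format, which reproduces the scores shown in that example; whether `midsv` computes them this way internally is an assumption. `parse_qscore` is a hypothetical helper.

```python
def parse_qscore(qscore):
    """Return a list of score lists, one list per reference position."""
    return [[int(score) for score in field.split("|")] for field in qscore.split(",")]


scores = parse_qscore("15,16,17,18,19,0|0|0|20,-1,-1,21,22")
print(scores[5])  # [0, 0, 0, 20]

# The last score in each field belongs to the reference position itself;
# -1 marks a deleted or unknown nucleotide and is skipped here.
aligned = [field[-1] for field in scores if field[-1] != -1]
print(sum(aligned) / len(aligned))  # 18.5

# Phred+33 decoding of the QUAL column from the perfect-match example.
phred = [ord(char) - 33 for char in "0123456789"]
print(phred)  # [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
```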
166+
167+
# Helper functions
168+
169+
## Read SAM file
170+
171+
```python
172+
midsv.read_sam(path_of_sam: str | Path) -> list[list]
173+
```
174+
175+
`midsv.read_sam` reads a SAM file into a list of lists.
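The list-of-lists shape matches the `sam` inputs shown under Usage: one inner list per line, one string per tab-separated column. A minimal stand-in that produces this shape could look like the sketch below; `read_sam_sketch` is a hypothetical function for illustration and may differ from what `midsv.read_sam` actually does (for example, in how it treats header lines).

```python
import tempfile
from pathlib import Path


def read_sam_sketch(path_of_sam):
    """Hypothetical stand-in: split every non-empty line on tabs."""
    with open(path_of_sam, encoding="utf-8") as handle:
        return [line.rstrip("\n").split("\t") for line in handle if line.strip()]


# Write the perfect-match example to a temporary SAM file, then read it back.
records = [
    ["@SQ", "SN:example", "LN:10"],
    ["match", "0", "example", "1", "60", "10M", "*", "0", "0",
     "ACGTACGTAC", "0123456789", "cs:Z:=ACGTACGTAC"],
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.sam"
    path.write_text("".join("\t".join(row) + "\n" for row in records), encoding="utf-8")
    sam = read_sam_sketch(path)

print(sam == records)  # True
```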
176+
177+
178+
## Read/Write JSON Lines (JSONL)
179+
180+
```python
181+
midsv.write_jsonl(dict: list[dict], path_of_jsonl: str | Path)
182+
```
183+
184+
```python
185+
midsv.read_jsonl(path_of_jsonl: str | Path) -> list[dict]
186+
```
187+
188+
Since `midsv.transform` returns a list of dictionaries, `midsv.write_jsonl` writes it to a file in JSONL format.
189+
190+
Conversely, `midsv.read_jsonl` reads JSONL as a list of dictionaries.
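JSONL stores one JSON object per line, so a list of dictionaries maps onto it directly. The round trip can be sketched with the standard library alone; `write_jsonl_sketch` and `read_jsonl_sketch` are hypothetical stand-ins that only mirror the signatures above, not the implementation shipped with `midsv`.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl_sketch(records, path_of_jsonl):
    """Write one JSON object per line (illustrative stand-in)."""
    with open(path_of_jsonl, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl_sketch(path_of_jsonl):
    """Read a JSONL file back into a list of dictionaries."""
    with open(path_of_jsonl, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# A record in the shape returned by the large-deletion example.
records = [{
    "QNAME": "large-deletion",
    "RNAME": "example",
    "MIDSV": "M,M,D,D,D,D,D,D,M,M",
    "CSSPLIT": "=A,=C,N,N,N,N,N,N,=A,=C",
    "QSCORE": "15,16,-1,-1,-1,-1,-1,-1,23,24",
}]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.jsonl"
    write_jsonl_sketch(records, path)
    restored = read_jsonl_sketch(path)

print(restored == records)  # True
```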
11191

12-
1. Establish `csvtag.call()`
13-
1. `csvtag.to_sequence()`
14-
1. `csvtag.to_consensus()`
15-
2. On the DAJIN2 side: `convert_csvtag_to_csv`

midsv

-1
This file was deleted.
