Skip to content

Commit d25d840

Browse files
authored
Merge pull request #240 from HadrienG/dev
New release
2 parents 1687d52 + 0b80d8e commit d25d840

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

44 files changed

+2073
-1626
lines changed

.flake8

+5
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
[flake8]
2+
max-line-length = 120
3+
ignore = E203, W503, C901
4+
exclude = .git,__pycache__,docs/source/conf.py,old,build,dist
5+
max-complexity = 10

.github/workflows/pythonpackage.yml

+6
Original file line numberDiff line numberDiff line change
@@ -21,11 +21,17 @@ jobs:
2121
python -m pip install --upgrade pip
2222
pip install pipenv
2323
pipenv install --dev
24+
- name: Style check
25+
run: |
26+
pipenv run black
27+
pipenv run isort
28+
pipenv run flake8
2429
- name: Test with pytest
2530
run: |
2631
chmod -w data/read_only.fasta
2732
pipenv run tests
2833
- name: Upload to Codecov
34+
if: github.event.repository.fork == false
2935
uses: codecov/[email protected]
3036
with:
3137
token: ${{ secrets.CODECOV_TOKEN }}

.gitignore

+2
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@
33
*.bam
44
*.bam.bai
55
*.bt2
6+
*.npz
67
test_runs/
78

89
# Byte-compiled / optimized / DLL files
@@ -11,6 +12,7 @@ __pycache__/
1112

1213
# Distribution / packaging
1314
*.egg-info/
15+
.eggs/
1416
dist/
1517
build/
1618

.readthedocs.yml

+7
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,11 @@
1+
version: 2
2+
3+
build:
4+
os: ubuntu-22.04
5+
tools:
6+
python: "3"
17
python:
28
install:
39
- requirements: doc/requirements.txt
410
- method: pip
11+
path: .

Pipfile

+8-2
Original file line numberDiff line numberDiff line change
@@ -4,11 +4,9 @@ verify_ssl = true
44
name = "pypi"
55

66
[packages]
7-
future = "*"
87
numpy = "*"
98
scipy = "*"
109
biopython = "*"
11-
joblib = "*"
1210
pysam = "*"
1311
requests = "*"
1412
urllib3 = ">=1.26.5"
@@ -20,7 +18,15 @@ pycodestyle = "*"
2018
pytest = "*"
2119
pytest-cov = "*"
2220
exceptiongroup = "*"
21+
black = "*"
22+
isort = "*"
23+
flake8 = "*"
24+
typing-extensions = "*"
25+
tomli = "*"
2326

2427
[scripts]
2528
iss = "python -m iss"
2629
tests = "pytest --cov=iss ."
30+
black = "black --check ."
31+
isort = "isort --check ."
32+
flake8 = "flake8 ."

Pipfile.lock

+485-390
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

data/readcounts.txt

+5
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
amplicon_A 2
2+
amplicon_T 2
3+
amplicon_GC 4
4+
amplicon_ATCG 1
5+
amplicon_TA 1

doc/index.rst

+1-1
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@ InSilicoSeq supports substitution, insertion, deletion errors, and models gc bia
1616
Details
1717
-------
1818

19-
* **Authors:** Hadrien Gourlé, Juliette Hayer, Oskar E. Karlsson and Erik Bongcam-Rudloff
19+
* **Authors:** Hadrien Gourlé, Juliette Hayer, Oskar E. Karlsson, Erik Bongcam-Rudloff, Stefan Lelieveld, Thijs Maas
2020
2121
* **GitHub:** `HadrienG/InSilicoSeq <https://github.com/HadrienG/InSilicoSeq>`_
2222
* **License:** MIT

doc/iss/generate.rst

+42-5
Original file line numberDiff line numberDiff line change
@@ -3,10 +3,19 @@
33
Generating reads
44
================
55

6+
InSilicoSeq can simulate amplicon reads, or reads from whole metagenome sequencing (the default).
7+
You can specify the type of reads you want to simulate with the ``--sequence_type`` option.
8+
69
InSilicoSeq comes with a set of pre-computed error models to allow the user to easily generate reads from the most popular Illumina instruments:
710

811
- HiSeq
9-
- MiSeq
12+
- MiSeq (optionally, with various quality thresholds):
13+
- MiSeq-20
14+
- MiSeq-24
15+
- MiSeq-28
16+
- MiSeq-32
17+
- MiSeq-36
18+
- NextSeq
1019
- NovaSeq
1120

1221
Per example generate 1 million MiSeq reads from a set of input genomes:
@@ -49,6 +58,19 @@ You can also provide multiple input files:
4958
curl -O -J -L https://osf.io/37kg8/download # download another example file
5059
iss generate --genomes SRS121011.fasta minigut.fasta --n_genomes 5 --model novaseq --output novaseq_reads
5160
61+
Amplicons
62+
---------
63+
64+
To generate amplicon reads, use the ``--sequence_type amplicon`` option:
65+
66+
.. code-block:: bash
67+
68+
# no example data is provided here
69+
iss generate --genomes my_amplicons.fasta ---readcount_file counts.txt -sequence_type amplicon --model nextseq --output reads
70+
71+
where ``counts.txt`` is a tab-delimited file containing the number of reads to generate for each amplicon sequence present in ``my_amplicons.fasta``.
72+
Alternatively, you can use the ``--n_reads`` option to generate a fixed number of reads, together with an abundance distribution.
73+
5274
Draft genomes
5375
-------------
5476

@@ -230,6 +252,11 @@ coverage distribution. Can be uniform, halfnormal, exponential, lognormal or zer
230252

231253
file containing coverage information (default: None).
232254

255+
--readcount_file
256+
^^^^^^^^^^^^^^^^
257+
258+
file containing read_count information (default: None).
259+
233260
--n_reads
234261
^^^^^^^^^
235262

@@ -246,16 +273,21 @@ Can be 'kde' or 'basic'
246273
^^^^^^^^
247274

248275
Error model file. (default: None).
249-
Use HiSeq, NovaSeq or MiSeq for a pre-computed error model provided with the software, or a file generated with iss model.
250-
If you do not wish to use a model, use --mode basic.
251-
The name of the built-in models is case insensitive.
276+
Use HiSeq, NextSeq, NovaSeq, MiSeq or Miseq-[20,24,28,32] for a pre-computed error model provided with the software, or a file generated with iss model.
277+
If you do not wish to use a model, use --mode basic or --mode perfect.
278+
The name of the built-in models are case insensitive.
252279

253280
--gc_bias
254281
^^^^^^^^^
255282

256283
If set, may fail to sequence reads with abnormal GC content.
257284
Does not guarantee --n_reads (default: False)
258285

286+
--sequence_type
287+
^^^^^^^^^^^^^^^
288+
289+
Type of sequencing. Can be metagenomics or amplicon (default: metagenomics).
290+
259291
--cpus
260292
^^^^^^
261293

@@ -284,4 +316,9 @@ Output file path and prefix (Required)
284316
--compress
285317
^^^^^^^^^^
286318

287-
Compress the output in gzip format (default: False).
319+
Compress the output in gzip format (default: False).
320+
321+
--store_mutations
322+
^^^^^^^^^^^^^^^^^
323+
324+
Generates an additional VCF file with the mutations introduced in the reads (bool).

doc/iss/install.rst

-1
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,6 @@ To install InSilicoSeq, type the following in your terminal:
1818
It will install InSilicoSeq as well as the following dependencies:
1919

2020
* biopython
21-
* joblib
2221
* numpy
2322
* pysam
2423
* scipy

doc/iss/model.rst

+3-1
Original file line numberDiff line numberDiff line change
@@ -12,10 +12,12 @@ Available models
1212
+------------+-------------+
1313
| Model name | Read length |
1414
+============+=============+
15-
| MiSeq | 300 bp |
15+
| MiSeq* | 300 bp |
1616
+------------+-------------+
1717
| HiSeq | 125 bp |
1818
+------------+-------------+
19+
| NextSeq | 300 bp |
20+
+------------+-------------+
1921
| NovaSeq | 150 bp |
2022
+------------+-------------+
2123

doc/iss/qualities.png

-347 KB
Loading

iss/__init__.py

-4
Original file line numberDiff line numberDiff line change
@@ -1,4 +0,0 @@
1-
#!/usr/bin/env python
2-
# -*- coding: utf-8 -*-
3-
4-
from iss import *

iss/__main__.py

+1
Original file line numberDiff line numberDiff line change
@@ -2,4 +2,5 @@
22
# -*- coding: utf-8 -*-
33

44
from iss.app import main
5+
56
main()

iss/abundance.py

+51-27
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,40 @@
11
#!/usr/bin/env python
22
# -*- coding: utf-8 -*-
33

4-
from Bio import SeqIO
5-
from scipy import stats
6-
4+
import logging
75
import os
86
import sys
9-
import logging
7+
108
import numpy as np
9+
from Bio import SeqIO
10+
from scipy import stats
11+
12+
13+
def parse_readcount_file(readcount_file):
14+
logger = logging.getLogger(__name__)
15+
16+
readcount_dic = {}
17+
try:
18+
assert os.stat(readcount_file).st_size != 0
19+
f = open(readcount_file, "r")
20+
except (IOError, OSError) as e:
21+
logger.error("Failed to open file:%s" % e)
22+
sys.exit(1)
23+
except AssertionError:
24+
logger.error("File seems empty: %s" % readcount_file)
25+
sys.exit(1)
26+
else:
27+
with f:
28+
for line in f:
29+
try:
30+
genome_id = line.split()[0]
31+
read_count = int(line.split()[1])
32+
except IndexError as e:
33+
logger.error("Failed to read file: %s" % e)
34+
sys.exit(1)
35+
else:
36+
readcount_dic[genome_id] = read_count
37+
return readcount_dic
1138

1239

1340
def parse_abundance_file(abundance_file):
@@ -28,12 +55,12 @@ def parse_abundance_file(abundance_file):
2855
abundance_dic = {}
2956
try:
3057
assert os.stat(abundance_file).st_size != 0
31-
f = open(abundance_file, 'r')
58+
f = open(abundance_file, "r")
3259
except (IOError, OSError) as e:
33-
logger.error('Failed to open file:%s' % e)
60+
logger.error("Failed to open file:%s" % e)
3461
sys.exit(1)
35-
except AssertionError as e:
36-
logger.error('File seems empty: %s' % abundance_file)
62+
except AssertionError:
63+
logger.error("File seems empty: %s" % abundance_file)
3764
sys.exit(1)
3865
else:
3966
with f:
@@ -42,11 +69,11 @@ def parse_abundance_file(abundance_file):
4269
genome_id = line.split()[0]
4370
abundance = float(line.split()[1])
4471
except IndexError as e:
45-
logger.error('Failed to read file: %s' % e)
72+
logger.error("Failed to read file: %s" % e)
4673
sys.exit(1)
4774
else:
4875
abundance_dic[genome_id] = abundance
49-
logger.debug('Loaded abundance/coverage file: %s' % abundance_file)
76+
logger.debug("Loaded abundance/coverage file: %s" % abundance_file)
5077
return abundance_dic
5178

5279

@@ -179,16 +206,16 @@ def coverage_scaling(total_n_reads, abundance_dic, genome_file, read_length):
179206
Returns:
180207
dict: scaled coverage dictionary
181208
"""
209+
logger = logging.getLogger(__name__)
182210
total_reads = 0
183-
f = open(genome_file, 'r') # re-opens the file
211+
f = open(genome_file, "r") # re-opens the file
184212
with f:
185-
fasta_file = SeqIO.parse(f, 'fasta')
213+
fasta_file = SeqIO.parse(f, "fasta")
186214
for record in fasta_file:
187215
try:
188216
species_coverage = abundance_dic[record.id]
189217
except KeyError as e:
190-
logger.error(
191-
'Fasta record not found in abundance file: %s' % e)
218+
logger.error("Fasta record not found in abundance file: %s" % e)
192219
sys.exit(1)
193220

194221
genome_size = len(record.seq)
@@ -210,18 +237,18 @@ def to_file(abundance_dic, output, mode="abundance"):
210237
"""
211238
logger = logging.getLogger(__name__)
212239
if mode == "abundance":
213-
output_abundance = output + '_abundance.txt'
240+
output_abundance = output + "_abundance.txt"
214241
else:
215-
output_abundance = output + '_coverage.txt'
242+
output_abundance = output + "_coverage.txt"
216243
try:
217-
f = open(output_abundance, 'w')
244+
f = open(output_abundance, "w")
218245
except PermissionError as e:
219-
logger.error('Failed to open output file: %s' % e)
246+
logger.error("Failed to open output file: %s" % e)
220247
sys.exit(1)
221248
else:
222249
with f:
223250
for record, abundance in abundance_dic.items():
224-
f.write('%s\t%s\n' % (record, abundance))
251+
f.write("%s\t%s\n" % (record, abundance))
225252

226253

227254
def draft(genomes, draft, distribution, output, mode="abundance"):
@@ -240,13 +267,10 @@ def draft(genomes, draft, distribution, output, mode="abundance"):
240267
# first we get a list of contig names in draft genomes
241268
draft_records = []
242269
for d in draft:
243-
draft_records.extend(
244-
[record.name for record in SeqIO.parse(d, 'fasta')])
270+
draft_records.extend([record.name for record in SeqIO.parse(d, "fasta")])
245271
genomes = list(set(genomes) - set(draft_records))
246272
abundance_dic = distribution(genomes + draft)
247-
complete_genomes_abundance = {k: v for
248-
k, v in abundance_dic.items()
249-
if k not in draft}
273+
complete_genomes_abundance = {k: v for k, v in abundance_dic.items() if k not in draft}
250274
to_file(abundance_dic, output)
251275
draft_dic = expand_draft_abundance(abundance_dic, draft, mode)
252276
full_abundance_dic = {**complete_genomes_abundance, **draft_dic}
@@ -273,12 +297,12 @@ def expand_draft_abundance(abundance_dic, draft, mode="abundance"):
273297
for key, abundance in abundance_dic.items():
274298
if key in draft:
275299
try:
276-
f = open(key, 'r')
300+
f = open(key, "r")
277301
with f:
278-
fasta_file = SeqIO.parse(f, 'fasta')
302+
fasta_file = SeqIO.parse(f, "fasta")
279303
draft_records = [record for record in fasta_file]
280304
total_length = sum(len(record) for record in draft_records)
281-
except Exception as e:
305+
except Exception:
282306
raise
283307
else:
284308
total_length = sum(len(record) for record in draft_records)

0 commit comments

Comments
 (0)