Commit 22236e1

Merge branch 'develop' of github.com:ncsa/NEAT into develop
2 parents f31fe69 + 6442510 commit 22236e1

69 files changed, +17641 -45356 lines changed

.github/workflows/python-app.yml

+64-34
Original file line number | Diff line number | Diff line change
@@ -1,52 +1,82 @@
1-
# This workflow will install Python dependencies, run tests and lint with a single version of Python
2-
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
1+
# This workflow configures the environment and executes NEAT read-simulator tests using relative paths for a series of configuration files individually
2+
# For more information on using Python with GitHub Actions, refer to:
3+
# https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
34

4-
name: NEAT unit tests
5+
name: NEAT Unit Tests
56

67
on:
78
push:
8-
branches: [ master ]
9+
branches: [develop, main]
910
pull_request:
10-
branches: [ master ]
11+
branches: [main]
1112

1213
jobs:
13-
build:
14+
detailed_test_execution:
1415
runs-on: ubuntu-latest
15-
1616
steps:
1717
- uses: actions/checkout@v3
1818
- uses: s-weigand/[email protected]
1919
with:
2020
conda-channels: bioconda, conda-forge
2121
activate-conda: true
2222
repository: NCSA/NEAT
23-
- name: basic test
23+
- name: Environment Setup
2424
run: |
2525
conda env create -f environment.yml -n test_neat
26-
conda activate test_neat
26+
source activate test_neat
2727
poetry install
28-
neat
29-
30-
# - name: lint with flake8
31-
# run: |
32-
# conda activate neat
33-
# # stop the build if there are Python syntax errors or undefined names
34-
# flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
35-
# # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
36-
# flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
37-
# - name: Execute test_gen_reads
38-
# run: |
39-
# conda activate neat
40-
# cd ${{ github.workspace }}
41-
# poetry install
42-
# neat --log-level ERROR --no-log read-simulator -c data/test_config.yml -o test
43-
# - run: echo "This job's status is ${{ job.status }}."
44-
# - name: Execute seq_err_model_test
45-
# run: |
46-
# cd ${{ github.workspace }}
47-
# neat --log-level ERROR --no-log model-seq-err -i data/baby.fastq
48-
# - run: echo "This job's status is ${{ job.status }}."
49-
50-
51-
52-
28+
29+
- name: Run NEAT Simulation for config_test1
30+
run: |
31+
source activate test_neat
32+
python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test1.yml -o ../outputs/test1_read-simulator
33+
34+
- name: Run NEAT Simulation for config_test2
35+
run: |
36+
source activate test_neat
37+
python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test2.yml -o ../outputs/test2_read-simulator
38+
39+
- name: Run NEAT Simulation for config_test3
40+
run: |
41+
source activate test_neat
42+
python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test3.yml -o ../outputs/test3_read-simulator
43+
44+
- name: Run NEAT Simulation for config_test4
45+
run: |
46+
source activate test_neat
47+
python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test4.yml -o ../outputs/test4_read-simulator
48+
49+
- name: Run NEAT Simulation for config_test5
50+
run: |
51+
source activate test_neat
52+
python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test5.yml -o ../outputs/test5_read-simulator
53+
54+
- name: Run NEAT Simulation for config_test6
55+
run: |
56+
source activate test_neat
57+
python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test6.yml -o ../outputs/test6_read-simulator
58+
59+
- name: Run NEAT Simulation for config_test7
60+
run: |
61+
source activate test_neat
62+
python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test7.yml -o ../outputs/test7_read-simulator
63+
64+
- name: Run NEAT Simulation for config_test8
65+
run: |
66+
source activate test_neat
67+
python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test8.yml -o ../outputs/test8_read-simulator
68+
69+
- name: Run NEAT Simulation for config_test9
70+
run: |
71+
source activate test_neat
72+
python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test9.yml -o ../outputs/test9_read-simulator
73+
74+
- name: Run NEAT Simulation for config_test10
75+
run: |
76+
source activate test_neat
77+
python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test10.yml -o ../outputs/test10_read-simulator
78+
79+
- name: Run NEAT Simulation for config_test11
80+
run: |
81+
source activate test_neat
82+
python -m neat --log-level DEBUG read-simulator -c data/test_configs/config_test11.yml -o ../outputs/test11_read-simulator
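
The eleven steps added above differ only in the test number. As a minimal sketch, the same sequence of commands could be generated in a loop, assuming it is run from the repository root inside the activated `test_neat` environment. The helper name `build_commands` is hypothetical and is not part of NEAT or of this workflow; the command-line arguments are copied from the steps above.

```python
from typing import List


def build_commands(num_configs: int = 11) -> List[List[str]]:
    """Build one read-simulator command per test config, mirroring the workflow steps."""
    commands = []
    for i in range(1, num_configs + 1):
        commands.append([
            "python", "-m", "neat",
            "--log-level", "DEBUG",
            "read-simulator",
            "-c", f"data/test_configs/config_test{i}.yml",
            "-o", f"../outputs/test{i}_read-simulator",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; to execute one,
    # pass the list to subprocess.run(cmd, check=True).
    for cmd in build_commands():
        print(" ".join(cmd))
```

Keeping the steps separate, as the workflow does, has the advantage that each config shows up as its own pass/fail entry in the Actions log; a loop trades that visibility for brevity.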

ChangeLog.md

+10
@@ -1,6 +1,16 @@
11
# NEAT has a new home
22
NEAT is now a part of the NCSA github and active development will continue here. Please direct issues, comments, and requests to the NCSA issue tracker. Submit pull requests here instead of the old repo.
33

4+
# NEAT v4.2
5+
- After several bug fixes that constituted release 4.1 and some minor releases, we are ready to release an overhauled version of NEAT 4.0.
6+
- Removed GC bias - it had little to no effect and made implementation nearly impossible
7+
- Removed fasta creation - we had tweaked this a bit but never got any feedback. It may come back if requested.
8+
- Improvements/fixes/full implementations of:
9+
- heterozygosity
10+
- read creation (now with more reads!)
11+
- bam alignment/creation
12+
- bed tool incorporation
13+
414
-Updated "master" branch to "main." - please update your repo accordingly
515
# NEAT v4.0
616
- Rewritten the models. Models generated on old versions of NEAT will have to be redone, due to the restructuring of the codebase. These new models should be smaller and more efficient. We have replicated the previous default models in the new style. There is no straightforward way to convert between these, unfortunately.

README.md

+28-55
@@ -1,13 +1,13 @@
1-
# The NEAT Project v4.0
2-
Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 4.0. This is our first (beta) release of the newest version of NEAT. There is still lots of work to be done. See the [ChangeLog](ChangeLog.md) for notes.
1+
# The NEAT Project v4.2
2+
Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 4.2. This release of NEAT includes several fixes and a little bit of restructuring. There is still lots of work to be done. See the [ChangeLog](ChangeLog.md) for notes. We have discarded the fasta file writing for now and removed that code. We may add that in as a feature in the future, if users call for it. We also removed GC bias for now. It severely complicated implementation, and had very few noticeable effects. After discussing with some people at the Illinois Institute for Genomic Biology, it sounded like GC bias may be a bit of a non-factor with improved chemistries. These will be reintroduced if needed/called for.
33

44
We are also working on redeveloping NEAT in Rust, a memory and thread safe language that will lend itself well to the way NEAT works, check that out here: https://github.com/ncsa/rusty-neat
55

66
Stay tuned over the coming weeks for exciting updates to NEAT, and learn how to [contribute](CONTRIBUTING.md) yourself. If you'd like to use some of our code, no problem! Just review the [license](LICENSE.md), first.
77

88
NEAT's read-simulator is a fine-grained read simulator. It simulates real-looking data using models learned from specific datasets. There are several supporting utilities for generating models used for simulation and for comparing the outputs of alignment and variant calling to the golden BAM and golden VCF produced by NEAT.
99

10-
This is release v4.0 of the software. While it has been tested, it does represent a shift in the software with the introduction of a configuration file. For a stable release using the old command line interface, please see: [NEAT 3.0](https://github.com/ncsa/NEAT/releases/tag/3.3) (or check out older tagged releases)
10+
This is release v4.2 of the software. While it has been tested, it does represent a shift in the software with the introduction of a configuration file. For a stable release using the old command line interface, please see: [NEAT 3.0](https://github.com/ncsa/NEAT/releases/tag/3.3) (or check out older tagged releases)
1111

1212
To cite this work, please use:
1313

@@ -31,7 +31,6 @@ Table of Contents
3131
* [Large single end reads](#large-single-end-reads)
3232
* [Parallelizing simulation](#parallelizing-simulation)
3333
* [Utilities](#utilities)
34-
* [compute_gc_bias](#computegcbias)
3534
* [model_fragment_lengths](#modelfraglen)
3635
* [gen_mut_model](#genmutmodel)
3736
* [model_sequencing_error](#modelseqerror)
@@ -40,8 +39,9 @@ Table of Contents
4039

4140
## Requirements (the most up-to-date requirements are found in the environment.yml file)
4241

42+
* Some version of Anaconda to set up the environment
4343
* Python == 3.10.*
44-
* poetry
44+
* poetry == 1.3.*
4545
* biopython == 1.79
4646
* pkginfo
4747
* matplotlib
@@ -71,13 +71,20 @@ the NEAT repo, after creating the conda environment:
7171
> poetry install
7272
```
7373

74+
Notes: If any packages are struggling to resolve, check the channels and try to manually pip install the package to see if that helps (but note that NEAT is not tested on the pip versions.)
75+
7476
Test your install by running:
7577
```
7678
> neat --help
7779
```
7880

81+
You can also try running it using the python command directly:
82+
```
83+
> python -m neat --help
84+
```
85+
7986
## Usage
80-
NEAT's core functionality is invoked using the read-simulator command. Here's the simplest invocation of read-simulator using default parameters. This command produces a single ended fastq file with reads of length 101, ploidy 2, coverage 10X, using the default sequencing substitution, GC% bias, and mutation rate models.
87+
NEAT's core functionality is invoked using the read-simulator command. Here's the simplest invocation of read-simulator using default parameters. This command produces a single ended fastq file with reads of length 151, ploidy 2, coverage 10X, using the default sequencing substitution and mutation rate models.
8188

8289
Contents of neat_config.yml
8390
```
@@ -98,7 +105,7 @@ template config file to copy and use for your runs.
98105
reference: full path to a fasta file to generate reads from
99106
read_len: The length of the reads for the fastq (if using). Integer value, default 101.
100107
coverage: desired coverage value. Float or int, default = 10
101-
ploidy: Desired value for ploidy (# of copies of each chromosome). Default is 2
108+
ploidy: Desired value for ploidy (# of copies of each chromosome in the organism). Default is 2
102109
paired_ended: If paired-ended reads are desired, set this to True. Setting this to true requires
103110
either entering values for fragment_mean and fragment_st_dev or entering the path to a
104111
valid fragment_model.
@@ -110,7 +117,6 @@ The default is given:
110117

111118
produce_bam: False
112119
produce_vcf: False
113-
produce_fasta: False
114120
produce_fastq: True
115121

116122
error_model: full path to an error model generated by NEAT. Leave empty to use default model
@@ -119,34 +125,25 @@ mutation_model: full path to a mutation model generated by NEAT. Leave empty to
119125
model (default model based on human data sequenced by Illumina)
120126
fragment_model: full path to fragment length model generated by NEAT. Leave empty to use default model
121127
(default model based on human data sequenced by Illumina)
122-
gc_model: Full path to model for correlating GC concentration and coverage, produced by NEAT.
123-
(default model is based on human data, sequenced by Illumina)
124128

125-
126-
partition_mode: by chromosome ("chrom"), or subdivide the chromosomes ("subdivision").
127-
Note: this feature is not yet fully implemented
128129
threads: The number of threads for NEAT to use.
129130
Note: this feature is not yet fully implemented
130131
avg_seq_error: average sequencing error rate for the sequencing machine. Use to increase or
131132
decrease the rate of errors in the reads. Float between 0 and 0.3. Default is set by the error model.
132-
rescale_qualities: rescale the quality scores to reflect the avg_seq_error rate above. Set True to activate.
133-
include_vcf: full path to list of variants in vcf format to include in the simulation.
134-
target_bed: full path to list of regions in bed format to target. All areas outside these regions will have
135-
very low coverage.
136-
off_target_scalar: manually set the off-target-scalar when using a target bed. Default is 0.02
137-
(i.e., off target areas will have only 2% of the average coverage)
138-
discard_offtarget: throws out reads from off-target regions. Regions of overlap may still have reads.
139-
Set True to activate
133+
rescale_qualities: rescale the quality scores to reflect the avg_seq_error rate above. Set True to activate if you
134+
notice issues with the sequencing error rates in your dataset.
135+
include_vcf: full path to list of variants in vcf format to include in the simulation. These will be inserted as they
136+
appear in the input VCF into the final VCF, and the corresponding fastq and bam files, if requested.
137+
target_bed: full path to list of regions in bed format to target.
138+
All areas outside these regions will have coverage of 0.
140139
discard_bed: full path to a list of regions to discard, in BED format.
141-
mutation_rate: Desired rate of mutation for the dataset. Float between 0 and 0.3
140+
mutation_rate: Desired rate of mutation for the dataset. Float between 0.0 and 0.3
142141
(default is determined by the mutation model)
143142
mutation_bed: full path to a list of regions with a column describing the mutation rate of that region,
144-
as a float with values between 0 and 0.3. The mutation rate must be in the third column.
145-
no_coverage_bias: Set to true to produce a dataset free of coverage bias
146-
rng_seed: Manually enter a seed for the random number generator. Used for repeating runs.
147-
min_mutations: Set the minimum number of mutations that NEAT should add, per contig. Default is 1.
148-
fasta_per_ploid: Produce one fasta per ploid. Default behavior is to produce
149-
a single fasta showing all variants. |
143+
as a float with values between 0 and 0.3. The mutation rate must be in the third column as, e.g., mut_rate=0.00.
144+
rng_seed: Manually enter a seed for the random number generator. Used for repeating runs. Must be an integer.
145+
min_mutations: Set the minimum number of mutations that NEAT should add, per contig. Default is 0. We recommend setting
146+
this to at least one for small chromosomes, so NEAT will produce at least one mutation per contig. |
150147

151148
The command line options for NEAT are as follows:
152149

@@ -155,10 +152,9 @@ Universal options can be applied to any subfunction. The commands should come be
155152
|---------------------|--------------------------------------|
156153
| -h, --help | Displays usage information |
157154
| --no-log | Turn off log file creation |
158-
| --log-dir LOG_DIR | Sets the log directory to custom path (default is current working directory |
159-
| --log-name LOG_NAME | Custom name for log file (default is timestamped) |
155+
| --log-name LOG_NAME | Custom name for log file, can be a full path (default is current working directory with a name starting with a timestamp)|
160156
| --log-level VALUE | VALUE must be one of [DEBUG, INFO, WARN, WARNING, ERROR] - sets level of log to display |
161-
| --log-detal VALUE | VALUE must be one of [LOW, MEDIUM, HIGH] - how much info to write for each log record |
157+
| --log-detail VALUE | VALUE must be one of [LOW, MEDIUM, HIGH] - how much info to write for each log record |
162158
| --silent-mode | Writes logs, but suppresses stdout messages |
163159

164160
read-simulator command line options
@@ -183,9 +179,8 @@ Features:
183179
- Can simulate targeted sequencing via BED input specifying regions to sample from
184180
- Can accurately simulate large, single-end reads with high indel error rates (PacBio-like) given a model
185181
- Specify simple fragment length model with mean and standard deviation or an empirically learned fragment distribution using utilities/computeFraglen.py
186-
- Simulates quality scores using either the default model or empirically learned quality scores using utilities/fastq_to_qscoreModel.py
182+
- Simulates quality scores using either the default model or empirically learned quality scores using `neat gen_mut_model`
187183
- Introduces sequencing substitution errors using either the default model or empirically learned from utilities/
188-
- Accounts for GC% coverage bias using model learned from utilities/computeGC.py
189184
- Output a VCF file with the 'golden' set of true positive variants. These can be compared to bioinformatics workflow output (includes coverage and allele balance information)
190185
- Output a BAM file with the 'golden' set of aligned reads. These indicate where each read originated and how it should be aligned with the reference
191186
- Create paired tumour/normal datasets using characteristics learned from real tumour data
@@ -217,7 +212,6 @@ neat read-simulator \
217212
Simulate a targeted region of a genome, e.g. exome, with a targeted bed:
218213

219214
```
220-
<<<<<<< HEAD
221215
[contents of neat_config.yml]
222216
reference: hg19.fa
223217
read_len: 126
@@ -288,27 +282,6 @@ neat read-simulator \
288282
# Utilities
289283
Several scripts are distributed with gen_reads that are used to generate the models used for simulation.
290284

291-
## neat compute_gc_bias
292-
293-
Computes GC% coverage bias distribution from sample (bedrolls genomecov) data.
294-
Takes .genomecov files produced by BEDtools genomeCov (with -d option).
295-
(Not yet implemented in NEAT 4.0)
296-
297-
```
298-
bedtools genomecov
299-
-d \
300-
-ibam normal.bam \
301-
-g reference.fa
302-
```
303-
304-
```
305-
neat compute_gc_bias \
306-
-r reference.fa \
307-
-i genomecovfile \
308-
-w [sliding window length] \
309-
-o /path/to/prefix
310-
```
311-
312285
## neat model-fraglen
313286

314287
Computes empirical fragment length distribution from sample data.
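
The README hunks above document value constraints for several config options: `avg_seq_error` and `mutation_rate` are floats between 0 and 0.3, and `rng_seed` must be an integer. A minimal sketch of a pre-flight check for those three options follows. The function name `check_config` is hypothetical, the bounds are assumed to be inclusive, and a missing or `None` value is assumed to mean "use the default"; NEAT's own validation may behave differently.

```python
from typing import Any, Dict, List


def check_config(cfg: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a NEAT-style config mapping."""
    problems = []
    # Both rates are documented as floats between 0 and 0.3 (inclusive bounds assumed).
    for key in ("avg_seq_error", "mutation_rate"):
        value = cfg.get(key)
        if value is None:
            continue  # unset: assumed to fall back to the model's default
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 0.3:
            problems.append(f"{key} must be a number between 0 and 0.3, got {value!r}")
    # rng_seed is documented as "Must be an integer".
    seed = cfg.get("rng_seed")
    if seed is not None and (isinstance(seed, bool) or not isinstance(seed, int)):
        problems.append(f"rng_seed must be an integer, got {seed!r}")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in a config file at once.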

config_template/simple_template.yml

-6
@@ -8,26 +8,20 @@ fragment_st_dev: .
88

99
produce_bam: .
1010
produce_vcf: .
11-
produce_fasta: .
1211
produce_fastq: .
1312

1413
error_model: .
1514
mutation_model: .
1615
fragment_model: .
17-
gc_model: .
1816

19-
partition_mode: .
20-
threads: .
2117
avg_seq_error: .
2218
rescale_qualities: .
2319
quality_offset: .
2420
include_vcf: .
2521
target_bed: .
26-
off_target_scalar: .
2722
discard_bed: .
2823
mutation_rate: .
2924
mutation_bed: .
30-
no_coverage_bias: .
3125
rng_seed: .
3226
min_mutations: .
3327
overwrite_output: .
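
This hunk deletes six keys from the template. A minimal sketch for spotting those keys in an existing config follows, assuming the config has already been loaded into a Python mapping. The names `REMOVED_KEYS` and `find_removed_keys` are hypothetical; whether NEAT ignores or rejects these keys at run time is not stated in this diff.

```python
from typing import Any, Dict, List

# Keys deleted from config_template/simple_template.yml in this commit.
REMOVED_KEYS = (
    "produce_fasta",
    "gc_model",
    "partition_mode",
    "threads",
    "off_target_scalar",
    "no_coverage_bias",
)


def find_removed_keys(cfg: Dict[str, Any]) -> List[str]:
    """List keys in cfg that the template no longer contains, in template order."""
    return [key for key in REMOVED_KEYS if key in cfg]
```

For example, `find_removed_keys({"reference": "hg19.fa", "gc_model": ".", "threads": 4})` returns `["gc_model", "threads"]`.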
