A comprehensive benchmarking framework for raw nanopore signal analysis
RawBench provides a standardized evaluation framework for comparing nanopore signal analysis tools. This repository contains job scripts, configuration templates, and setup tools for benchmarking:
- Sigmoni - Signal-based read classification using compressed indexing
- SPUMONI - BWT-based r-index implementation for repetitive text matching
- Basecalling tools - Dorado and other ONT basecallers
- Read mapping tools - minimap2 and other alignment tools
- Evaluation framework - Standardized metrics and comparison scripts
Clone the repository and set up the base environment:
git clone https://github.com/Furkan9015/rawbench.git
cd rawbench
chmod +x scripts/setup_env.sh setup.sh
The benchmarking framework requires several external tools. Install them according to your system:
Dorado (ONT Basecaller):
# Download from: https://github.com/nanoporetech/dorado
# Install to /usr/local/bin/dorado or update DORADO_PATH in environment
RawHash2 (Optional):
# Download from: https://github.com/CMU-SAFARI/RawHash2
# Install to /usr/local/bin/rawhash2 or update RAWHASH2_PATH
# Create conda environment for Sigmoni
conda create --name sigmoni python=3.8
conda activate sigmoni
conda install h5py numpy scipy
pip install uncalled4
# Create environment for BAM processing
conda create --name bamtofastq
conda activate bamtofastq
conda install bamtofastq
# Create environment for minimap2
conda create --name mm2
conda activate mm2
conda install minimap2
# Follow instructions in sigmoni_submodule/README.md
# Option 1: Add as git submodule (if public repository available)
# Option 2: Manual installation - download Sigmoni source code
# Required files in sigmoni_submodule/:
# - index.py (index building script)
# - main.py (classification script)
# - example/ directory with sample data
# Follow instructions in spumoni_submodule/README.md
# Option 1: Add as git submodule (if public repository available)
# Option 2: Manual installation and compilation
# Required: compiled spumoni executable in spumoni_submodule/build/spumoni
cd spumoni_submodule
mkdir build && cd build
cmake ..
make -j$(nproc)
Configure paths for your installation:
# Load the environment setup (do this before running any job scripts)
source scripts/setup_env.sh
# Check environment is configured correctly
print_environment
validate_environment
Edit your shell profile (.bashrc
, .zshrc
) or create a local config:
# External tool paths - customize for your installation
export DORADO_PATH="/path/to/dorado"
export RAWHASH2_PATH="/path/to/rawhash2"
# Data directories - customize for your setup
export BASECALLING_MODELS_DIR="/path/to/basecalling_models"
export POD5_DATA_DIR="/path/to/pod5_data"
# Then reload environment
source scripts/setup_env.sh
# Create data directories
mkdir -p ../basecalling_models ../pod5
# Download Dorado models (example)
# wget -P ../basecalling_models/ https://cdn.oxfordnanoportal.com/software/analysis/dorado/[email protected]
# tar -xzf ../basecalling_models/[email protected] -C ../basecalling_models/
# Place your POD5 signal data in ../pod5/
# Expected files: hsapiens.pod5, ecoli.pod5, dmelanogaster.pod5
Place reference genomes in the refs/
directory:
hsapiens.fa
- Human reference (dataset d8)ecoli.fa
- E. coli reference (dataset d9)dmelanogaster.fa
- D. melanogaster reference (dataset d10)
IMPORTANT: Always load the environment before running any scripts:
cd rawbench/
source scripts/setup_env.sh
# Load environment first
source scripts/setup_env.sh
cd job_scripts/basecalling
# Human dataset (d8)
sbatch d8_basecall_sup.sh
# E. coli dataset (d9)
sbatch d9_basecall_sup.sh
# D. melanogaster dataset (d10)
sbatch d10_basecall_sup.sh
# Load environment first
source scripts/setup_env.sh
cd job_scripts/read_classification
# Sigmoni-based classification
sbatch zymo_ont_rindex_ttest.sh
sbatch zymo_uncalled4_rindex_ttest.sh
# Load environment first
source scripts/setup_env.sh
cd job_scripts/read_mapping
# minimap2 mapping
sbatch d8_2chunksmax_mm2.sh
sbatch d9_1chunkmax_mm2.sh
# Activate environment
source scripts/setup_env.sh
conda activate sigmoni
cd sigmoni_submodule/example/zymo
# Build index
python ../../index.py -p ref_fastas/pos_class/*.fasta -n ref_fastas/neg_class/*.fasta \
-b 6 --shred 100000 -o ./ --ref-prefix zymo_ont
# Run classification
python ../../main.py -i fast5/ -r refs/zymo_ont -b 6 -t 64 -o ./ \
--complexity --thresh 1.6666666666333334
# Activate environment
source scripts/setup_env.sh
cd spumoni_submodule
# Build index
./build/spumoni build -r ../refs/reference.fa -M -P -m -o ./index_dir
# Run classification
./build/spumoni run -r ./index_dir -p ../basecalled_reads/reads.fa -P -c
rawbench/
├── README.md # This file
├── DEPENDENCIES.md # Detailed dependency information
├── setup.sh # Legacy setup script
├── scripts/
│ └── setup_env.sh # Environment configuration script
├── sigmoni_submodule/ # Sigmoni tool installation directory
│ └── README.md # Installation instructions
├── spumoni_submodule/ # SPUMONI tool installation directory
│ └── README.md # Installation instructions
├── job_scripts/ # SLURM job scripts for benchmarking
│ ├── basecalling/ # Basecalling benchmark scripts
│ ├── read_mapping/ # Read mapping evaluation scripts
│ └── read_classification/ # Classification benchmark scripts
├── refs/ # Reference genomes (FASTA format)
├── outputs/ # Generated results and logs
├── basecalled_reads/ # Generated basecalled FASTQ files
├── kmer_models/ # K-mer models for nanopore chemistries
├── fast5/ # Fast5 signal data download scripts
├── pod5/ # POD5 signal data download scripts
└── .gitignore # Excludes build artifacts and data files
parent_directory/
├── basecalling_models/ # Dorado neural network models
├── pod5/ # POD5 signal files by organism
├── bin/ # External tool binaries (optional)
│ ├── dorado
│ └── rawhash2
└── rawbench/ # This repository
├── sigmoni_submodule/ # Sigmoni source code
├── spumoni_submodule/ # SPUMONI source code & build/
└── [rest of repository]
Check if all required tools are installed:
source scripts/setup_env.sh
validate_environment
- "Dorado not found" - Update
DORADO_PATH
environment variable - "SPUMONI executable not found" - Compile SPUMONI in
spumoni_submodule/build/
- "Sigmoni not found" - Install Sigmoni source files according to
sigmoni_submodule/README.md
- Missing conda environments - Create required environments (sigmoni, bamtofastq, mm2)
- Path issues - Always run
source scripts/setup_env.sh
before job scripts
Job scripts include validation and will report missing dependencies. Check SLURM output files in outputs/
for detailed error messages.
*.out
/*.err
files: SLURM job outputs and errors*.report
files: Classification results for each read*.paf
files: Alignment results in PAF format*_timing.log
files: Performance timing measurements*.throughput
files: Performance and accuracy metrics
When modifying job scripts:
- Maintain environment variable usage instead of hardcoded paths
- Include validation checks for required files/tools
- Update documentation if changing expected directory structure
- Test with
validate_environment
before committing
If you use RawBench in your research, please cite:
@software{rawbench2025,
title = {RawBench: A comprehensive benchmarking framework for raw nanopore signal analysis},
author = {Eris, Furkan and McConnell, Ulysse and Firtina, Can and Mutlu, Onur},
year = {2025},
url = {https://github.com/Furkan9015/rawbench}
}
This project is licensed under the MIT License - see the repository for details.