Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added wrapper for Paraphase #3071

Open
wants to merge 42 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 38 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
1abdb90
feat: added wrapper for Paraphrase by PacBio, which takes HiFi aligne…
MagdalenaZZ Jul 15, 2024
4cda233
bug: Update bio/paraphase/meta.yaml
MagdalenaZZ Jul 22, 2024
32e74dd
bug: Update bio/paraphase/meta.yaml
MagdalenaZZ Jul 22, 2024
7b8a51c
bug: Update bio/paraphase/meta.yaml
MagdalenaZZ Jul 22, 2024
70ebcaa
Merge branch 'master' into my-new-snakemake-wrapper-paraphrase
MagdalenaZZ Jul 22, 2024
8e09198
Merge branch 'master' into my-new-snakemake-wrapper-paraphrase
MagdalenaZZ Jul 29, 2024
a2f85b3
Update bio/paraphase/test/Snakefile
MagdalenaZZ Aug 5, 2024
0e0295b
Update bio/paraphase/test/Snakefile
MagdalenaZZ Aug 5, 2024
9112bd1
Update bio/paraphase/wrapper.py
MagdalenaZZ Aug 5, 2024
68e6be6
feat: improved environment
MagdalenaZZ Aug 5, 2024
73454a1
Update meta.yaml
MagdalenaZZ Aug 5, 2024
59010d1
Update bio/paraphase/test/Snakefile
MagdalenaZZ Aug 6, 2024
ac24eaa
Update wrapper.py
MagdalenaZZ Aug 6, 2024
ac87abe
bug: small fix
MagdalenaZZ Aug 7, 2024
876aad7
bug: fixes
MagdalenaZZ Aug 7, 2024
bd65d78
bug: fixes
MagdalenaZZ Aug 7, 2024
3d62ef3
bug: fixes
MagdalenaZZ Aug 7, 2024
f9bb343
bug: fixing flies
MagdalenaZZ Aug 7, 2024
d8f96dc
Update Snakefile
MagdalenaZZ Aug 7, 2024
b64175c
Update Snakefile
MagdalenaZZ Aug 7, 2024
18da384
Update wrapper.py
MagdalenaZZ Aug 7, 2024
a4856c3
bug: fixing files
MagdalenaZZ Aug 7, 2024
4ce43b2
bug: fix
MagdalenaZZ Aug 7, 2024
7ad0aee
bug: fixed files
MagdalenaZZ Aug 7, 2024
178e901
Update bio/paraphase/meta.yaml
MagdalenaZZ Aug 13, 2024
108c9ca
Update bio/paraphase/test/Snakefile
MagdalenaZZ Aug 13, 2024
b63cf7f
Delete bio/paraphase/used_wrappers.yaml
MagdalenaZZ Aug 13, 2024
c5ed609
feat: linting
MagdalenaZZ Aug 19, 2024
2f918ef
Merge branch 'master' into my-new-snakemake-wrapper-paraphrase
MagdalenaZZ Aug 20, 2024
1baf9b8
Remove file handle close
fgvieira Aug 20, 2024
e12dcc5
Remove trailing spaces
fgvieira Aug 20, 2024
e41cb61
Remove unused imports
fgvieira Aug 20, 2024
c9dc011
Update wrapper.py
MagdalenaZZ Aug 21, 2024
e568525
Update meta.yaml
MagdalenaZZ Aug 21, 2024
4cc9676
Update bio/paraphase/wrapper.py
MagdalenaZZ Aug 21, 2024
5b81e0d
bug: linting wrapper.py
MagdalenaZZ Aug 26, 2024
6745c6b
bug: linting
MagdalenaZZ Aug 26, 2024
ef1ba6e
Merge branch 'master' into my-new-snakemake-wrapper-paraphrase
MagdalenaZZ Aug 26, 2024
34bf6b8
Remove import
fgvieira Aug 26, 2024
02c8953
Merge branch 'master' into my-new-snakemake-wrapper-paraphrase
MagdalenaZZ Aug 27, 2024
8631711
bug: black linting
MagdalenaZZ Aug 27, 2024
781e912
Merge branch 'master' into my-new-snakemake-wrapper-paraphrase
MagdalenaZZ Aug 30, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions bio/paraphase/environment.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- minimap2=2.28
- paraphase=3.1.1
- htslib =1.20
- samtools =1.20
- bcftools =1.20
15 changes: 15 additions & 0 deletions bio/paraphase/meta.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
name: poraphrase
url: https://github.com/PacificBiosciences/paraphase
description: paraphase resolves SNVs in gene families for PacBio longreads from whole genomes
authors: Magdalena
input:
bam: Path to bam file with aligned genomic reads
fasta: genome fasta file for target genome
faidx: fasta .fai index for the genome fasta file
output:
vcf: vcf file with resulting phased variants
bam: bam file with realigned reads
bai: bai index file for the realigned reads
merged_vcf: VCF file containing all newly discovered variants, phased by gene copy and haplotype
params:
extra: additional parameters that will be forwarded to paraphase
24 changes: 24 additions & 0 deletions bio/paraphase/test/Snakefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
# Set the custom wrapper prefix for the entire workflow


rule paraphase:
input:
bam="test_data/PDPK1.bam",
fasta="test_data/chr16s.fa",
faidx="test_data/homo_sapiens.fa.fai",
output:
bam="paraphase/PDPK1_realigned.paraphase.bam",
bai="paraphase/PDPK1_realigned.paraphase.bam.bai",
merged_vcf="paraphase/PDK1_paraphase.vcf.gz",
vcf_header="paraphase/vcf_chromosome_header.vcf",
params:
extra="-g PDPK1 --genome 38",
log:
"paraphase/PDPK1_realigned.vcf.log",
benchmark:
"paraphase/PDPK1_paraphase.vcf.benchmark.tsv"
threads: 1
resources:
mem_mb=1024,
wrapper:
"main/bio/paraphase"
MagdalenaZZ marked this conversation as resolved.
Show resolved Hide resolved
Binary file added bio/paraphase/test/test_data/PDPK1.bam
Binary file not shown.
Binary file added bio/paraphase/test/test_data/PDPK1.bam.bai
Binary file not shown.
134 changes: 134 additions & 0 deletions bio/paraphase/test/test_data/chr16s.fa

Large diffs are not rendered by default.

25 changes: 25 additions & 0 deletions bio/paraphase/test/test_data/homo_sapiens.fa.fai
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
chr1 2012 6 2012 2013
chr2 2012 2025 2012 2013
chr3 2012 4044 2012 2013
chr4 2012 6063 2012 2013
chr5 71099999 8082 71099999 71100000
chr6 2012 71108088 2012 2013
chr7 2012 71110107 2012 2013
chr8 2012 71112126 2012 2013
chr9 2012 71114145 2012 2013
chr10 2012 71116165 2012 2013
chr11 2012 71118185 2012 2013
chr12 2012 71120205 2012 2013
chr13 2012 71122225 2012 2013
chr14 2012 71124245 2012 2013
chr15 2012 71126265 2012 2013
chr16 2012 71128285 2012 2013
chr17 2012 71130305 2012 2013
chr18 2012 71132325 2012 2013
chr19 2012 71134345 2012 2013
chr20 2012 71136365 2012 2013
chr21 2012 71138385 2012 2013
chr22 2012 71140405 2012 2013
chrX 2012 71142424 2012 2013
chrY 2012 71144443 2012 2013
chrM 2012 71146462 2012 2013
144 changes: 144 additions & 0 deletions bio/paraphase/wrapper.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
import glob
import sys
fgvieira marked this conversation as resolved.
Show resolved Hide resolved
from snakemake.shell import shell
from tempfile import TemporaryDirectory

log = snakemake.log_fmt_shell(stdout=True, stderr=True)
extra = snakemake.params.get("extra", "")


with TemporaryDirectory() as tmpdirname:

### Run paraphase
shell(
f"""
(paraphase --bam {snakemake.input.bam} \
--reference {snakemake.input.fasta} \
--out {tmpdirname} \
{snakemake.params.genome} \
{extra}) {log}
""",
tmpdirname=tmpdirname, # Pass tmpdirname to the shell environment
)
shell(
"""
touch {snakemake.output}
"""
)

### Create a new VCF header

# Get the paths from Snakemake input and output objects
input_faidx = snakemake.input.faidx
output_vcf_header = snakemake.output.vcf_header

# Open the .fai index file and read lines
with open(input_faidx, "r") as fai_file:
lines = fai_file.readlines()

# Open the output file and write formatted header lines
with open(output_vcf_header, "w") as output:
for line in lines:
contig_id, length = line.split()[:2] # Assuming the first two elements are ID and length
output.write(f"##contig=<ID={contig_id},length={length}>\n")

### Concatenating, reheadering, and sorting the zipped and indexed VCF files, and copy the remapped reads
vcf_res = glob.glob(f"{tmpdirname}/*_vcfs/*vcf")
if vcf_res:
for vcf in vcf_res:
bgzip_cmd = f"bgzip -c {vcf} > {vcf}.gz"
shell(bgzip_cmd)
index_cmd = f"bcftools index {vcf}.gz"
shell(index_cmd)
print(f"Compressed and indexed: {vcf}.gz")

params_variant_files = " ".join([f"{vcf}.gz" for vcf in vcf_res])
shell(
f"bcftools concat -a -Oz {params_variant_files} | "
f"bcftools annotate --header-lines {snakemake.output.vcf_header} | "
f"bcftools sort -Oz -o {snakemake.output.merged_vcf}"
)
print(f"Merged, reheadered, and sorted VCF file created: {snakemake.output.merged_vcf}")

# Copy out bam and bai files
bam_res = glob.glob(f"{tmpdirname}/*.bam")
bai_res = glob.glob(f"{tmpdirname}/*.bai")
# print("BAM RES: ", bam_res, bai_res)
shell(
f"""
cp -pr {' '.join(bam_res)} {snakemake.output.bam};
cp -pr {' '.join(bai_res)} {snakemake.output.bai}
"""
)
else:
print("No output VCF or BAM files were produced by paraphase, I hope this is what you were expecting, human?")
shell(f"touch {snakemake.output.merged_vcf}")


with TemporaryDirectory() as tmpdirname:

### Run paraphase
shell(
f"""
(paraphase --bam {snakemake.input.bam} \
--reference {snakemake.input.fasta} \
--out {tmpdirname} \
{snakemake.params.genome} \
{extra}) {log}
""",
tmpdirname=tmpdirname, # Pass tmpdirname to the shell environment
)
fgvieira marked this conversation as resolved.
Show resolved Hide resolved
shell(
"""
touch {snakemake.output}
"""
)

### Create a new VCF header

# Get the paths from Snakemake input and output objects
input_faidx = snakemake.input.faidx
output_vcf_header = snakemake.output.vcf_header

# Open the .fai index file and read lines
with open(input_faidx, "r") as fai_file:
lines = fai_file.readlines()

# Open the output file and write formatted header lines
with open(output_vcf_header, "w") as output:
for line in lines:
contig_id, length = line.split()[:2] # Assuming the first two elements are ID and length
output.write(f"##contig=<ID={contig_id},length={length}>\n")
output.close()
fgvieira marked this conversation as resolved.
Show resolved Hide resolved

### Concatenating, reheadering, and sorting the zipped and indexed VCF files, and copy the remapped reads
vcf_res = glob.glob(f"{tmpdirname}/*_vcfs/*vcf")
if vcf_res:
for vcf in vcf_res:
bgzip_cmd = f"bgzip -c {vcf} > {vcf}.gz"
shell(bgzip_cmd)
index_cmd = f"bcftools index {vcf}.gz"
shell(index_cmd)
print(f"Compressed and indexed: {vcf}.gz")

params_variant_files = " ".join([f"{vcf}.gz" for vcf in vcf_res])
shell(
f"bcftools concat -a -Oz {params_variant_files} | "
f"bcftools annotate --header-lines {snakemake.output.vcf_header} | "
f"bcftools sort -Oz -o {snakemake.output.merged_vcf}"
)
print(f"Merged, reheadered, and sorted VCF file created: {snakemake.output.merged_vcf}")

# Copy out bam and bai files
bam_res = glob.glob(f"{tmpdirname}/*.bam")
bai_res = glob.glob(f"{tmpdirname}/*.bai")
# print("BAM RES: ", bam_res, bai_res)
shell(
f"""
cp -pr {' '.join(bam_res)} {snakemake.output.bam};
cp -pr {' '.join(bai_res)} {snakemake.output.bai}
"""
)
else:
print("No output VCF or BAM files were produced by paraphase, I hope this is what you were expecting, human?")
shell(f"touch {snakemake.output.merged_vcf}")
fgvieira marked this conversation as resolved.
Show resolved Hide resolved
Loading