Add option to pass in existing genomeDB to sarek #1539

cmatKhan · 2024-05-22T16:20:33Z

Description of feature

Currently, sarek creates a new genomeDB in each run. However, there are times when I want to either update an existing genomeDB as new batches of samples are sequenced or, more interesting to me at the moment, include a set of parent samples with every new batch of sequencing.

This would require adding a parameter to the pipeline's nextflow.config that allows passing in an existing genomeDB. The change would go in this line:

            // meta is now a list of [meta1, meta2] but they are all the same. So take the first element.
            [ meta_list[0], gvcf, tbi, intervals, [], [] ]

https://github.com/nf-core/sarek/blob/b5b766d3b4ac89864f2fa07441cdc8844e70a79e/subworkflows/local/bam_joint_calling_germline_gatk/main.nf#L44C13-L44C59

The final element of that list would be the params.genomedb path. If that parameter is set, the third input of GATK4_GENOMICSDBIMPORT() should be true rather than false:

sarek/subworkflows/local/bam_joint_calling_germline_gatk/main.nf

Line 48 in b5b766d

GATK4_GENOMICSDBIMPORT(gendb_input, false, false, false)
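
At the command level, the switch described above might look roughly like the following. This is only a sketch in Python rather than Nextflow: the helper `genomicsdb_import_args` and its `existing_db` argument are hypothetical stand-ins for the proposed parameter, and nothing here is taken from sarek or the nf-core module. The GATK options `-V`, `--genomicsdb-workspace-path` and `--genomicsdb-update-workspace-path` are the only real names used.

```python
from typing import List, Optional


def genomicsdb_import_args(gvcfs: List[str],
                           new_db: str = "genomicsdb",
                           existing_db: Optional[str] = None) -> List[str]:
    """Build GenomicsDBImport arguments (hypothetical helper, not sarek code).

    Intervals (-L) are left out to keep the sketch short.
    """
    args = ["gatk", "GenomicsDBImport"]
    for gvcf in gvcfs:
        args += ["-V", gvcf]
    if existing_db is not None:
        # Update mode: add the new samples to a datastore that already exists.
        args += ["--genomicsdb-update-workspace-path", existing_db]
    else:
        # Default mode: create a fresh datastore, as the pipeline does today.
        args += ["--genomicsdb-workspace-path", new_db]
    return args
```

Calling it with `existing_db="parents_db"` would select the update flag; leaving the argument out keeps the current create-from-scratch behaviour.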

cmatKhan · 2024-05-22T17:54:32Z

This will currently conflict with the intervals. Per the GATK documentation:

"The user cannot specify intervals when incrementally adding new samples - in this case, the tool will use the intervals specified when the datastore was initially created"

One solution would be to allow passing in an existing genomedb only when there are no intervals, and to let the user suffer the consequences if the datastore becomes very large.
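
A guard for that rule could be sketched as below. Both argument names are assumptions made for illustration: `genomedb` stands for the proposed parameter and `no_intervals` for whatever flag tells the pipeline that no intervals are in use. The function is hypothetical and not part of sarek.

```python
from typing import Optional


def check_genomedb_request(genomedb: Optional[str], no_intervals: bool) -> bool:
    """Return True when the run should update an existing datastore.

    Both argument names are hypothetical stand-ins for pipeline parameters.
    """
    if genomedb is None:
        # Nothing was passed in: create a new datastore as today.
        return False
    if not no_intervals:
        # GATK reuses the intervals from the initial import, so reject the mix.
        raise ValueError(
            "An existing genomedb can only be used when no intervals are given."
        )
    return True
```

Failing early like this keeps the conflict out of the GATK step itself, at the cost of ruling out interval-parallelised updates.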

cmatKhan · 2024-05-22T18:26:00Z

Another issue is whether to copy or symlink the genomedb into the work directory. GATK advises:

"We recommend that users backup existing datastores before try incremental addition. This is because if the tool happens to fail when incrementally adding new samples, it may leave the datastore in a corrupt/invalid state."

It is possibly best to copy rather than symlink when staging the genomedb, but it could be large. Alternatively, the documentation could make it clear that the path passed into the pipeline should not be the sole copy of the genomedb.
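
The copy-before-update idea can be illustrated with a small Python sketch. It assumes the datastore is a directory on a local filesystem; `stage_datastore_copy` is a hypothetical helper and says nothing about how Nextflow itself would stage the input.

```python
import shutil
from pathlib import Path


def stage_datastore_copy(source: str, work_dir: str) -> Path:
    """Copy a datastore directory into work_dir and return the copy's path."""
    src = Path(source)
    if not src.is_dir():
        raise FileNotFoundError(f"datastore directory not found: {src}")
    dest = Path(work_dir) / src.name
    # copytree refuses to overwrite an existing destination, which avoids
    # silently clobbering an earlier staged copy.
    shutil.copytree(src, dest)
    return dest
```

The update would then run against the returned path, so a failed import leaves the original untouched. The trade-off is the extra disk space and copy time for a large datastore.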

FriederikeHanssen · 2024-05-23T06:16:21Z

I believe this is related to, or a duplicate of, #755. Can you confirm and merge these issues?

cmatKhan · 2024-05-23T11:18:02Z

Yes, and you've already had a conversation about this. Closing this one.

cmatKhan added this to Hackathon: May 2024 May 22, 2024

cmatKhan self-assigned this May 22, 2024

cmatKhan converted this from a draft issue May 22, 2024

cmatKhan added the enhancement (New feature or request) and good first issue (Good for newcomers) labels May 22, 2024

cmatKhan removed this from Hackathon: May 2024 May 22, 2024

cmatKhan closed this as completed May 23, 2024
