Add option to pass in existant genomeDB to sarek #1539

Closed
cmatKhan opened this issue May 22, 2024 · 4 comments

@cmatKhan
Contributor

Description of feature

Currently, sarek creates a new genomeDB on every run. But there are times when I want to either update an existing genomeDB as new batches of samples are sequenced or, more interesting to me at the moment, include a set of parent samples with every new batch of sequencing.

This would require adding a parameter to the pipeline's nextflow.config that allows passing in an existing genomeDB. The relevant line is:

            // meta is now a list of [meta1, meta2] but they are all the same. So take the first element.
            [ meta_list[0], gvcf, tbi, intervals, [], [] ]

https://github.com/nf-core/sarek/blob/b5b766d3b4ac89864f2fa07441cdc8844e70a79e/subworkflows/local/bam_joint_calling_germline_gatk/main.nf#L44C13-L44C59

The final element of that tuple would then be the params.genomedb path. If that parameter is set, the third input of GATK4_GENOMICSDBIMPORT() should be true rather than false:

GATK4_GENOMICSDBIMPORT(gendb_input, false, false, false)
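Put together, the change might look roughly like the fragment below. This is only a sketch of the proposal, not tested against sarek: `params.genomedb` is the hypothetical new parameter, `grouped_gvcfs` is a placeholder name for whatever channel feeds the existing `.map`, and it assumes the last tuple element is the workspace path and the third input is the update flag, as described above.

```groovy
// Hypothetical parameter; assumed to default to null in nextflow.config.
// Resolve it to a path, or fall back to the empty list that is passed today.
def genomedb = params.genomedb ? file(params.genomedb, checkIfExists: true) : []

// grouped_gvcfs is a placeholder for the upstream channel in the subworkflow.
gendb_input = grouped_gvcfs.map { meta_list, gvcf, tbi, intervals ->
    // meta is now a list of [meta1, meta2] but they are all the same. So take the first element.
    [ meta_list[0], gvcf, tbi, intervals, [], genomedb ]
}

// Third input true = update the supplied datastore instead of creating a new one.
GATK4_GENOMICSDBIMPORT(gendb_input, false, params.genomedb ? true : false, false)
```

With the parameter unset, the tuple and the call are identical to the current behaviour, so existing runs would not change.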

cmatKhan self-assigned this May 22, 2024
cmatKhan converted this from a draft issue May 22, 2024
cmatKhan added the labels "enhancement" (New feature or request) and "good first issue" (Good for newcomers) May 22, 2024
@cmatKhan
Contributor Author

This will currently conflict with the intervals. Per the GATK documentation:

"The user cannot specify intervals when incrementally adding new samples - in this case, the tool will use the intervals specified when the datastore was initially created"

One solution would be to allow passing in an existing genomedb only when no intervals are used, and to let the user suffer the consequences if the datastore becomes very large.
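A guard along these lines could enforce that restriction early in the workflow. It is a sketch under assumptions: `params.genomedb` is the hypothetical new parameter, and `params.no_intervals` is assumed to be the existing sarek flag that disables intervals.

```groovy
// Fail fast if an existing datastore is supplied while intervals are still enabled.
// params.genomedb is hypothetical; params.no_intervals is assumed to be sarek's existing flag.
if (params.genomedb && !params.no_intervals) {
    error("--genomedb requires --no_intervals: GATK reuses the intervals stored in the existing datastore")
}
```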

@cmatKhan
Contributor Author

Another issue is whether to copy or symlink the genomedb into the work directory. GATK advises:

"We recommend that users backup existing datastores before try incremental addition. This is because if the tool happens to fail when incrementally adding new samples, it may leave the datastore in a corrupt/invalid state."

It is possibly best to copy rather than symlink when staging the genomedb, but it could be large. Alternatively, the documentation could make it clear that the path passed into the pipeline should not be the sole copy of the genomedb.
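If copying is the preferred route, one option is Nextflow's `stageInMode` directive, set from the config for just this process. The selector name below assumes the process is matched as `GATK4_GENOMICSDBIMPORT`; note that the directive applies to every input of the process, so the gVCFs would be copied as well.

```groovy
// Config fragment: stage inputs by copying instead of symlinking,
// so a failed incremental import cannot corrupt the user's original datastore.
process {
    withName: 'GATK4_GENOMICSDBIMPORT' {
        stageInMode = 'copy'
    }
}
```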

@FriederikeHanssen
Contributor

I believe this is related to, or a duplicate of, #755. Can you confirm and merge these issues?

@cmatKhan
Contributor Author

yes -- and you've already had a conversation about this. Closing this one.
