Currently, sarek creates a new genomeDB on every run. However, there are times when I want either to update an existing genomeDB as new batches of samples are sequenced or, more interesting to me at the moment, to include a fixed set of parent samples with every new batch of sequencing.
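This would require adding a parameter to the pipeline's nextflow.config that allows passing in an existing genomeDB. In this line (https://github.com/nf-core/sarek/blob/b5b766d3b4ac89864f2fa07441cdc8844e70a79e/subworkflows/local/bam_joint_calling_germline_gatk/main.nf#L44C13-L44C59):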
// meta is now a list of [meta1, meta2] but they are all the same. So take the first element.
[ meta_list[0], gvcf, tbi, intervals, [], [] ]
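The final element of that list would then be the params.genomedb path. If that parameter is set, the third input of GVCFS_GENOMICSDBIMPORT() should be true rather than false (see line 48 of subworkflows/local/bam_joint_calling_germline_gatk/main.nf at commit b5b766d).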
This will currently conflict with the intervals. Per the GATK documentation:
"The user cannot specify intervals when incrementally adding new samples - in this case, the tool will use the intervals specified when the datastore was initially created"
One solution would be to allow passing in an existing genomeDB only when no intervals are used, and to let the user accept the consequences if the datastore becomes very large.
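The rule proposed above can be sketched as a small decision function. This is a minimal Python illustration and not sarek code: `plan_genomicsdb_import` and its arguments are hypothetical names. The only parts taken from the GATK `GenomicsDBImport` command line are the options `-V`, `-L`, `--genomicsdb-workspace-path` and `--genomicsdb-update-workspace-path`.

```python
from typing import List, Optional


def plan_genomicsdb_import(
    gvcfs: List[str],
    new_workspace: str,
    existing_workspace: Optional[str] = None,
    intervals: Optional[str] = None,
) -> List[str]:
    """Build a GenomicsDBImport argument list (hypothetical helper).

    If an existing workspace is given, the datastore is updated in place and
    an explicit interval file is rejected, because GATK reuses the intervals
    chosen when the datastore was first created.
    """
    if not gvcfs:
        raise ValueError("at least one gVCF is required")

    args = ["gatk", "GenomicsDBImport"]
    for gvcf in gvcfs:
        args += ["-V", gvcf]

    if existing_workspace is not None:
        # Update mode: intervals must not be supplied by the user.
        if intervals is not None:
            raise ValueError(
                "intervals cannot be combined with an existing genomeDB"
            )
        args += ["--genomicsdb-update-workspace-path", existing_workspace]
    else:
        # Create mode: write a fresh workspace, optionally per interval file.
        args += ["--genomicsdb-workspace-path", new_workspace]
        if intervals is not None:
            args += ["-L", intervals]
    return args
```

In the pipeline itself, the same check would probably belong in parameter validation, so that a run with both an existing genomeDB and intervals fails before any process starts.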
Another issue is whether to copy or symlink the genomeDB into the work directory. GATK advises:
"We recommend that users backup existing datastores before try incremental addition. This is because if the tool happens to fail when incrementally adding new samples, it may leave the datastore in a corrupt/invalid state."
It is possibly best to copy rather than symlink when staging the genomeDB, but the datastore could be large. Alternatively, the documentation could make clear that the path passed into the pipeline should not be the sole copy of the genomeDB.
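As a rough illustration of the copy-based option, the sketch below stages a private copy of the datastore before any update touches it. It is written in Python for brevity and assumes the genomeDB is a directory on a local filesystem. `stage_genomedb_copy` is a hypothetical helper, not part of sarek or GATK.

```python
import shutil
from pathlib import Path


def stage_genomedb_copy(source: str, work_dir: str) -> Path:
    """Copy an existing genomeDB directory into work_dir and return the copy.

    The original is never modified, so a failed incremental import can only
    corrupt the staged copy.
    """
    src = Path(source)
    if not src.is_dir():
        raise FileNotFoundError(f"genomeDB workspace not found: {src}")

    dest = Path(work_dir) / src.name
    if dest.exists():
        # Refuse to overwrite: an earlier, possibly half-updated copy may be here.
        raise FileExistsError(f"staged copy already exists: {dest}")

    # copytree creates dest (and any missing parents) and copies recursively.
    shutil.copytree(src, dest)
    return dest
```

The trade-off is the one noted above: copying protects the original but doubles the disk usage for a large datastore, whereas a symlink is cheap but exposes the only copy to corruption.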