Use of named outputs in processes #39
Comments
Hi @abhi18av,

Recently we discussed this here in the lab, and we agreed that we should not rely on sections like:
antismash_output = antismash.out[0]
Although it works, it is not clear enough which input/output is being used in the modules, and I believe these changes would make the code easier to debug and work on.

However, in the lab we discussed saving and using the module outputs in Groovy maps instead of directly through variables or named outputs. We were thinking of initializing a Groovy map at the beginning of the workflow to capture the results and save them under names that better describe them, so anyone can look at the code and easily understand what is being used. For example, we'd initialize a map:
def OUTPUTS = [:]
And then save the results (once produced) into this map with descriptive names, to be reused as input to other modules in the pipeline, for example:
OUTPUTS['prokkaGff'] = prokka.out[1]
Perhaps we could have both, to keep the code as clear as possible and to make it easier to reuse the results? For example:
OUTPUTS['prokkaGff'] = prokka.out[1]
OUTPUTS['prokkaRenamedGenome'] = prokka.out[3]
...
annotations_files_ch = (OUTPUTS['prokkaRenamedGenome']).join(OUTPUTS['prokkaGff'] )
.join(OUTPUTS['mlstAnalysis'])
.join(OUTPUTS['barnappGff'])
What do you think? Would it be too much? Or do the named channels alone already solve the problem? |
Hmm, this is an interesting idea! Putting channel data in a hash-map: I'll do some experiments tomorrow regarding this. My gut feeling is that this might not work, since we'd essentially be mixing in a data structure that is not thread-safe. Beyond this, I think that even with the usage of |
Re: Hash-map
Let's see what happens; if it does not work, that is OK. It was just an idea we had in the lab :)

Re: number of identifiers
The number of identifiers, and the sections where I create them, exist because of one single problem I had: the pipeline has optional modules. For instance, a user may or may not execute the However, whenever I skipped a module, when I arrived at the reporting and jbrowse modules where I grab all the annotation results for each sample with So to solve that, I relied upon the identifiers, loading into them the results when the module was executed and an empty channel when it was skipped.

I don't like having a bunch of identifiers either. For me, the best would be to use only the named outputs. However, I am not an expert in Groovy, I learnt everything from Nextflow's manual 😅 and I didn't know how to solve this problem of skipped modules more elegantly. But if you know how to reduce the number of identifiers by relying only on the named outputs while avoiding this error when skipping a module, I would be super happy to use it the way you first described when you opened the issue 😁 |
In the small sample below
nextflow.enable.dsl = 2

process SAY_HELLO {
    tag "${x}"

    input:
    val(x)

    output:
    path("*txt")

    script:
    """
    sleep \$[ ( \$RANDOM % 10 ) + 1 ]s
    echo $x > ${x}.txt
    """
}

process SAY_BYE {
    tag "${x}"

    input:
    path(x)

    output:
    path("*txt")

    script:
    """
    sleep \$[ ( \$RANDOM % 10 ) + 1 ]s
    cat $x > temp.txt
    """
}

workflow {
    cheers_ch = Channel.of('Hello', 'Hola', 'Ola', 'Nihao', 'Namaste')
    SAY_HELLO(cheers_ch)
    // SAY_BYE(SAY_HELLO.out)
    def output_map = [:]
    output_map[0] = SAY_HELLO.out
}
Both Though, I do think that before the wider scale consumption, it might be worth discussing on the |
Regarding this, okay let's dig deeper. I see that you've followed the general practice of conditional value allocation in
if (params.skip_antismash == false) {
    antismash(prokka.out[2])
    antismash_output = antismash.out[0]
} else {
    antismash_output = Channel.empty()
}
And then this is used later
// Grab inputs needed for JBrowse step
jbrowse_input = merge_annotations.out[0].join(annotations_files, remainder: true)
    ...
    .join(antismash_output, remainder: true)
But when we see the For example, We might simplify the input directive as shown below
Using the
No worries about this @fmalmeida, I'm pleasantly surprised how much you've absorbed from the manual itself. I myself ended up printing the entire thing to learn - and after all, we're always learning 😊 |
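The conditional pattern quoted above can also be written against named outputs. The following is only a minimal sketch under stated assumptions: the process names `BASE_STEP` and `EXTRA_STEP`, the flag `params.skip_extra` (standing in for `params.skip_antismash`), the emitted name `result`, and the file names are all hypothetical and not taken from the pipeline.

```nextflow
nextflow.enable.dsl = 2

// Hypothetical flag, playing the role of params.skip_antismash
params.skip_extra = false

// Hypothetical stand-in for a module that always runs
process BASE_STEP {
    input:
    val(sample_id)

    output:
    tuple val(sample_id), path("${sample_id}.base.txt"), emit: result

    script:
    """
    echo base > ${sample_id}.base.txt
    """
}

// Hypothetical stand-in for an optional module
process EXTRA_STEP {
    input:
    val(sample_id)

    output:
    tuple val(sample_id), path("${sample_id}.extra.txt"), emit: result

    script:
    """
    echo extra > ${sample_id}.extra.txt
    """
}

workflow {
    samples_ch = Channel.of('sampleA', 'sampleB')
    BASE_STEP(samples_ch)

    // Default to an empty channel; overwrite only when the module runs
    def extra_ch = Channel.empty()
    if (!params.skip_extra) {
        EXTRA_STEP(samples_ch)
        extra_ch = EXTRA_STEP.out.result
    }

    // remainder: true keeps samples that have no match in extra_ch
    BASE_STEP.out.result
        .join(extra_ch, remainder: true)
        .view()
}
```

One identifier per optional module is still needed in this sketch, since the `.out` of a process that was never called cannot be referenced; the named output only removes the positional index.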
Hi @abhi18av,

Re: about the hash maps
I completely agree with you. It is better to first check whether this strategy has drawbacks in parallelization, to avoid slowing down the pipeline.

Re: about the optional modules
That is really a concern I had. Ideally, the input channels for each "visualization/reporting" module would have been created on demand, without using the
Here, I totally agree with you! In the best scenario I would like to name only the files that I would really use in the script directive, because it is really painful to name them all while ensuring the proper order. But I fell into the same pit: I did not know how to do it properly.
This is amazing. I did not know about it. How does the anonymous staging actually work? Are the anonymous files indexed as tuple val(prefix), file(gff), path("*"), file(mlst), ... Also, I believe that your repo fork is pointing to previous 2.x versions as
Yes, we are 😄 thanks, by the way |
Hi! I have made a small comment in issue #38 that may have some relevance to this one. |
No, this way, Nextflow would not be able to identify how exactly to limit the use of
process MY_PROCESS {
    input:
    tuple val(sampleName), path(bai), path(bam)
    path(ref_fasta)
    path("*")
    ...
}
can be used as shown below
MY_PROCESS(OTHER_PROCESS.out,
           params.ref_fasta,
           [params.ref_fasta_fai, params.ref_fasta_dict])
|
Thanks for clarifying it. I believe it could work and make things easier to read inside the module. However, since I have made the modules "accept" and understand the same input channel in a "standard" manner for all, this would require great effort inside the
I understood how it would change the Perhaps it would be best to create a branch specific to this issue, based on the current master |
Agreed, this isn't something that's going to change or impact the usage of this pipeline and we can work on this in parallel. But it's best, if the |
Hey Felipe,
Building upon the stylistic change suggested in #38
Perhaps instead of using the prokka.out[1] etc, we can rely upon named outputs

This way you don't really need to rely upon sections like
... since you can directly call it antismash.out.result. This can lead to a significant reduction in the number of identifiers used in the main.nf script.

The use of camelCase for emitted channel names is a personal preference of mine, to distinguish between the identifiers coming from the user (snake_case) and the identifiers I've used within the pipeline.
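To illustrate the suggestion, here is a minimal sketch of named outputs using the DSL2 `emit:` option. It is not the pipeline's actual code: the process name `ANNOTATE`, the emitted names `annotationGff` and `renamedGenome`, and the file names are hypothetical, chosen only to show how a positional access such as `out[1]` can be replaced by a name.

```nextflow
nextflow.enable.dsl = 2

// Hypothetical process: the process name, emitted channel names and
// file names are illustrative, not taken from the pipeline.
process ANNOTATE {
    input:
    val(sample_id)

    output:
    tuple val(sample_id), path("${sample_id}.gff"), emit: annotationGff
    tuple val(sample_id), path("${sample_id}.fna"), emit: renamedGenome

    script:
    """
    echo "gff for ${sample_id}" > ${sample_id}.gff
    echo "genome for ${sample_id}" > ${sample_id}.fna
    """
}

workflow {
    ANNOTATE(Channel.of('sampleA', 'sampleB'))

    // Named access instead of ANNOTATE.out[0] and ANNOTATE.out[1];
    // join matches on the first tuple element (the sample id) by default
    ANNOTATE.out.annotationGff
        .join(ANNOTATE.out.renamedGenome)
        .view()
}
```

With this layout, reordering the `output:` declarations does not break downstream code, since consumers refer to the emitted name rather than the position.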