Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'versions' directive in process #4386

Closed
adamrtalbot opened this issue Oct 9, 2023 · 15 comments
Closed

'versions' directive in process #4386

adamrtalbot opened this issue Oct 9, 2023 · 15 comments

Comments

@adamrtalbot
Copy link
Collaborator

New feature

Many pipelines attempt to catch the software versions running within the container. The most common method is to catch a file with the software version written to it, typically using some fun string parsing:

https://github.com/nf-core/modules/blob/72f633b541815ae9565599e5b3cf66d863736ae1/modules/nf-core/fastqc/main.nf#L38-L41

https://github.com/epi2me-labs/wf-template/blob/3edea44cdf9449093a2adc2f08bb16ec143373bf/main.nf#L20-L30

This populates the output directive with an additional channel and prevents people using the pipe operator amongst other things.

Why not get this into a separate, native directive? It could automatically populate some property of the process which is available to use as a string (or map of strings).

Usage scenario

Anyone who wants to record their software at runtime.

Suggest implementation

process FASTQC {
    // process goes here
    versions:
    fastqc = 'fastqc --version | sed -e "s/FastQC v//g"'
    python = 'python --version'
}

Somewhere within the .command.run the .task.versions file is created:

fastqc: 0.12.1
python: 3.11.2

Back in the pipeline code:

FASTQC.versions == [ name: ${task.process}, versions: [ fastqc: "0.12.1", python: "3.11.2" ]  ]

// It's syntactic sugar around the FASTQC.out.versions channel, so can be used like so:
ch_versions = ch_versions.mix(FASTQC.versions)
@ewels
Copy link
Member

ewels commented Oct 9, 2023

I think that this is basically a duplicate of an issue that I wrote with @pditommaso back in 2018 😬 #879

But same motivation, similar syntax even (this is slightly better though, agree that the ability to have more than 1 version per process is necessary 👍🏻 )

@adamrtalbot
Copy link
Collaborator Author

Ah bother. I did search for versions but I see that feature is slightly more generic. You could probably use my this feature to implement generic info as well as versions.

The biggest flaw I can see right now is stealing a secret. From the example above:

versions:
sneaky-secret-steal = "echo $SECRET"

@pditommaso
Copy link
Member

I like the general design, still find it horrible having to run and parse all these CLI option. would not be better to rely on the version declared by conda and spack packages?

@adamrtalbot
Copy link
Collaborator Author

In my personal experience, conda only contains about 25% of tools used in bioinformatics (although they are the big and important ones).

@ewels
Copy link
Member

ewels commented Oct 9, 2023

Yup, we've gone back and forth on this a lot over the years. At the end of the day, the only thing you can really trust is to ask the tool itself. Also many pipelines (eg. nf-core) specify both conda and docker. Both can be overwritten by the end user. So if running with some random Docker image there's really no way to know what the version is.

@pditommaso
Copy link
Member

Fair enough. Other points address 1) cardinality: should this be collected once per process or once per task execution; 2) how do you expect this information to be consumed?

@adamrtalbot
Copy link
Collaborator Author

adamrtalbot commented Oct 10, 2023

Yes, once per task. It's easy enough to reduce the data to per-process later. Likely is to consume it for an output file (collectFile), or wrap into an output tool like nf-prov.

@ewels
Copy link
Member

ewels commented Oct 11, 2023

My hope is to use the new MultiQC software versions YAML files for reporting. One per task is fine, MultiQC will handle collapsing the duplicates automatically.

And yeah, would also be nice to go into nf-prov / other tooling 👍🏻

@jordeu
Copy link
Collaborator

jordeu commented Oct 18, 2023

I've opened a PR #4425 with an alternative and more generic way of approaching this problem. Give it a try or add your comments about that alternative.

@jordeu
Copy link
Collaborator

jordeu commented Oct 26, 2023

Explaining a bit what I don't like about adding a "versions" directive and why I proposed this "custom traces" #4425 approach.

To add a directive for collecting the versions of the tools at runtime seems a bit like an odd feature. It suggests that something might be failing in the software management system; otherwise, all the versions should be fixed and known beforehand without any doubt.

However, I understand that reality can be different. Therefore, what we need is a directive to collect runtime metadata of the executed processes. This is why I decided to give the "customTraces" directive approach a try. To me, it also makes no sense to inject this metadata into a channel because it would mean mixing logics. This metadata is not necessary for the execution of the actual data flows. In other words, it should be possible to completely remove this metadata collection, and the pipeline should still function as intended (similar to how trace metrics are optionally collected).

@muffato
Copy link

muffato commented Oct 26, 2023

To me, it also makes no sense to inject this metadata into a channel because it would mean mixing logics.

Another -huge- advantage of tracking versions without channels is that it won't clutter the DAG view, finally making it readable on nf-core-style pipelines

@bentsherman
Copy link
Member

@pditommaso on Jordi's PR:

I'm not convinced we should proceed down this path. It would lead to even more do-it-yourself metadata, which the nextflow runtime would be totally unaware of. Think we should have a more controlled approach

Are you okay with a versions directive then?

I don't really care whether the tool versions are captured through a specialized directive or a more generic directive. The ability to specify custom trace fields will be useful in the future for other reasons like fine-grained resource prediction, but the advantage of a specialized versions directive is that it provides a standard way for Nextflow to understand "tool versions", which is useful not only for multiqc but also provenance reports like RO-crate.

@pditommaso
Copy link
Member

pditommaso commented Nov 1, 2023

Taking into consideration this and the discussion for "custom traces" (#4425), I think the problem could be decouple into two main goals:

  1. Provide a mechanism to carry out semantic-oriented command in which output needs to be captured
  2. Provide the ability to collect all outputs with the same semantics across multiple processes.

Command output

Regarding the first point, nextflow allows the execution of an arbitrary command or user script. Therefore, the main need is to remove the boilerplate that is currently needed to isolate and capture the output of commands that have special semantics e.g. version.

My proposal is to add a new output type that, instead of capturing a file (path) or an environment variable (env), allows the execution of a user-provided command, the output of which is emitted as a channel value.

A hypothetical syntax could be:

process something {
  output: 
  command('bedtools --version')

  '''
  bedtools --this -that
  '''
} 

Alternatively, it could be implemented as an optional parameter of the existing stdout output type, e.g.

process something {
  output:
  stdout()                     // capture the standard output of the main script   
  stdout('bedtools --version') // capture the standard output of the command specified as argument 

  '''
  bedtools --this -that
  '''
} 

Collecting output across the process

Regarding the second point, instead of collecting the outputs for all processes, as it's done via the mix pattern, a much better approach would be to reverse the logic, with the processes emitting the output over the same "sematic" channel.

The topic channels provide an excellent fit for this use case, bringing two benefits:

a. remove the mix pattern boilerplate code
b. improve modularization, because any

The resulting syntax would be:

process foo {
  output: 
  command('bedtools --version'), topic: versions
  script: 
  ''' 
  bedtools --this --that
  '''
} 

process bar {
  output: 
  command('bowtie --version'), topic: versions
  script: 
  ''' 
  bowtie --this --that
  '''
}

workflow {
   foo()
   bar() 
   multiqc ( channel.topic('version'), etc )
}

@phue
Copy link
Contributor

phue commented Nov 8, 2023

I like this alot, it's simple but effective.

Just some thoughts on it:

command() in the examples above could also be bash(), because that's what it is effectively

What I'm trying to say is that overloading stdout() seems to be a more powerful choice as it could generalise to other interpreters, e.g using python:

import tensorflow as tf; print(tf.__version__)

@bentsherman
Copy link
Member

Solved by #4459 and #4493

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants