Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load any nextflow.config that is at the same level as a .nf #4205

Closed
maxulysse opened this issue Aug 21, 2023 · 24 comments
Closed

load any nextflow.config that is at the same level as a .nf #4205

maxulysse opened this issue Aug 21, 2023 · 24 comments

Comments

@maxulysse
Copy link
Contributor

maxulysse commented Aug 21, 2023

New feature

load any nextflow.config that is at the same level as a .nf

Usage scenario

It would greatly help the re-usability and share-ability of nf-core modules/subworkflows.

Suggest implementation

Basically similarly to what we're already doing for script templates and bin scripts

EDIT : added a suggest implementation

@bentsherman
Copy link
Member

Nextflow currently does the following:

  1. load configuration
  2. create session w/ config
  3. execute pipeline script

So the configuration is already resolved when the script is executed, but modules are loaded during script execution...

I guess the idea here is that module config should only apply to the module script. When loading a module, maybe we could create a temporary "augmented config" which includes the config from (1) plus the module config.

However, most config scopes apply to the entire workflow, so I think a module config would only make sense to use the params and process scopes. All other scopes would be ignored because they can't be applied to specific processes / modules.

@maxulysse
Copy link
Contributor Author

the idea would be to be able to ship altogether config, script and test for a given module, subworkflow or even pipeline in the long term along the lines...

@bentsherman
Copy link
Member

Can you give an example of what such a config would look like, so that we can see which scopes you are using?

@maxulysse
Copy link
Contributor Author

I'm thinking just to deal with the ext directives and publishdir, so something like: https://github.com/nf-core/sarek/blob/master/conf/modules/aligner.config

@adamrtalbot
Copy link
Collaborator

adamrtalbot commented Aug 21, 2023

In @maxulysse's example, I would separate each of those withLabel selectors into an individual nextflow.config file:

.
└── module/
    └── bwa/
        └── mem/
            ├── main.nf
            ├── nextflow.config
            └── tests/
                ├── main.nf.test
                └── main.nf.test.snap          

If you wanted to re-use a process, you could use the withName/withLabel selector. So the contents might be (pretending bwa mem and bwa mem2 use the same Nextflow process):

        withName: ".*:BWAMEM_MEM" {
            ext.when         = { params.aligner == "bwa-mem" }
        }

        withName: ".*:BWAMEM2_MEM" {
            ext.when         = { params.aligner == "bwa-mem2" }
        }

p.s I know the example is rubbish.

Scoping might be tricky.

@drpatelh
Copy link
Contributor

drpatelh commented Aug 22, 2023

We are trying to standardise the way we pass configuration around nf-core pipelines for modules / subworkflows / workflows. At the moment, we have a modules.config like here which contains all of this configuration which is messy and makes it difficult to share components across pipelines. Ideally, it would be nice to ship the configuration along with modules / subworkflows / workflows to make them standalone, easier to tweak and easier to share.

An example of a proposed structure for modules / subworkflows / workflows if we are able to ship a nextflow.config with each component:

.
├── main.nf
├── nextflow.config
├── modules
│   ├── local
│   │   ├── my_module
│   │   │   ├── main.nf
│   │   │   ├── nextflow.config
├── subworkflows
│   └── local
│       └── my_subworkflow
│           ├── main.nf
│           ├── nextflow.config
└── workflows
    ├── my_workflow
    │   ├── main.nf
    │   ├── nextflow.config

The modules / subworkflows / workflows nextflow.config would have lower precedence than the pipeline nextflow.config and so on as it currently is. This way we could ship a default config with these components that still can be overridden if required.

@adamrtalbot
Copy link
Collaborator

Here me out...how about a config context?

process MYPROCESS {
    ...
}

workflow MYWORKFLOW {
    ...
}

config {
    ...
}

// later, in a file far, far away...

include { MYPROCESS } from workflows/myworkflow.nf
// auto imports the config in the same file.

@bentsherman
Copy link
Member

The example config given by Maxime only has process config, so why can't these settings just be process directives in the module?

@drpatelh
Copy link
Contributor

You mean include process selectors in the module along with the actual process definition? Sorry, didn't quite understand.

@ewels
Copy link
Member

ewels commented Sep 1, 2023

I think he means inline publishDir and everything into the process itself in the module main.nf:

process BWAMEM2_MEM {
    publishDir = [
        mode: params.publish_dir_mode,
        path: { "${params.outdir}/preprocessing/" },
        pattern: "*bam",
        // Only save if (save_mapped OR (no_markduplicates AND save_output_as_bam)) AND only a single BAM file per sample
        saveAs: { (params.save_output_as_bam && (params.save_mapped || params.skip_tools && params.skip_tools.split(',').contains('markduplicates'))) && (meta.size * meta.num_lanes == 1) ? "mapped/${meta.id}/${it}" : null }
    ]

    // container etc..

    input:
    // ...
}

But my problem with this, and also with the idea of shipping a nextflow.config file alongside the module in nf-core, is this: isn't this configuration at pipeline level? So how useful is it to ship it with the module? Surely each pipeline that imports / uses the module will need different configuration settings in this file?

I feel like I'm missing a key point here..

@drpatelh
Copy link
Contributor

drpatelh commented Sep 1, 2023

Don't think we should keep the publishing logic in the process definition as this will need to be updated in instances where the same module is being used in multiple subworkflows.

This would just be a "vanilla" configuration that is shipped with the module / subworkflow / workflow that would have sensible publishing and additional logic to determine how to run that particular component. We could then even have an additional command that patches the config much like nf-core modules patch to see exactly where the configuration has been changed - this would be incredibly valuable for developers to see exactly what has changed.

Additionally, this means we won't need a massive modules.config that contains the configuration for ALL of the components in the pipeline.

Config is instead shipped with each component, changed if required and most importantly packaged with the NF scripts to make it easier to share.

But to add to that, the config shipped with NF script should have lower precedence than the pipeline-level config just in case developers want to override it.

@ewels
Copy link
Member

ewels commented Sep 1, 2023

Ok cool, yeah that makes sense to me - have a default config to ship with the module and then the pipeline dev edits it as required via patching functionality. I imagine it'll largely be addition of lines rather than editing existing ones, so shouldn't be too difficult.

So @bentsherman I think it's a question of config parsing priority. Anything defined directly within process can't be overridden by a config file, right?

@drpatelh
Copy link
Contributor

drpatelh commented Sep 1, 2023

Yeah, the precedence will need some thought in terms of hierarchy. The vanilla config from a module should be able to be overwritten by the vanilla config from a subworkflow should be able to be overwritten by the vanilla config from a workflow should be able to be overwritten by the pipeline config.

@drpatelh drpatelh closed this as completed Sep 1, 2023
@drpatelh drpatelh reopened this Sep 1, 2023
@ewels
Copy link
Member

ewels commented Sep 1, 2023

..which should be able to be overwritten by an institutional config, which should be able to be overwritten by the user config, which should be overwritten by the user command line.

math-hangover-gif

@bentsherman
Copy link
Member

The vanilla config from a module should be able to be overwritten by the vanilla config from a subworkflow should be able to be overwritten by the vanilla config from a workflow should be able to be overwritten by the pipeline config.

This seems backwards from the usual hierarchy, usually the more specific configs override the less specific configs.

What if we extend the nf-core modules command to maintain a generated config file that is always included? So when you install a module, the module config would be appended to some global config in the same way that the module entry is added to modules.json.

@ewels
Copy link
Member

ewels commented Sep 2, 2023

Yeah I was wondering this as well - whether we could have a whole load of includeConfig statements (one for each module / subworkflow) which we automatically insert into the nextflow.config.

@drpatelh
Copy link
Contributor

drpatelh commented Sep 2, 2023

Not sure it would be that straightforward. This might warrant a proper brainstorming session! I'll set something up.

Also, that would only work with those using nf-core tooling and not the rest of the NF community.

@bentsherman
Copy link
Member

See #4112 , we are thinking about adding a nextflow modules command to do basically the same thing as nf-core modules, so we could make it more generally available

@ewels
Copy link
Member

ewels commented Sep 4, 2023

How about we just try to kill off module-level config as much as possible, instead of coming up with a new way to ship a default config? So for example, taking a typical block and dissecting it line by line:

ext.args   = '--keep-exon-attrs -F -T'
publishDir = [
    path: { "${params.outdir}/genome" },
    mode: params.publish_dir_mode,
    saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
    enabled: params.save_reference
]
  • ext.args = '--keep-exon-attrs -F -T'
    • Pipeline level. Can't be shipped with the module.
  • path: { "${params.outdir}/genome" }
    • Partially pipeline level. Many could be removed if we had a way to set sensible defaults.
    • To set sensible defaults, we'd ideally have an option to set a base directory for all publishDir statements, and/or have a new process-level directive to set a base path for publishDir.
  • mode: params.publish_dir_mode,
    • Could be removed if we had an option to set this for all processes. Nearly universally the same for every process.
  • saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
    • In most cases, could be removed if the process output had a way to describe "ignore this in publishDir". Majority of cases in nf-core it's used to tell Nextflow to not publish a file.
    • Second usage is to organise files into subdirectories. Again could be done with an output descriptor?
  • enabled: params.save_reference
    • Could be removed if we had a way to set this at process level (passed to the module as a param, standardised naming in nf-core)

So a pseudo-code final state could look something like this:

nextflow.config (pipeline)

process {
    publishDirMode = "copy"
    publishBaseDir = params.outdir
}
withName: 'UNTAR_.*' {
    ext.args  = '--keep-exon-attrs -F -T'
    ext.publish = params.save_reference
}

main.nf (module)

process GFFREAD {
    publish task.ext.publish
    publishPath "genome"
    // ...

    output:
    path "*.gtf"        , emit: gtf, publish: false
    path "versions.yml" , emit: versions
    // ...
}

@maxulysse
Copy link
Contributor Author

not sure I agree with you on the first point, but I like where all of this is going

@ewels
Copy link
Member

ewels commented Sep 4, 2023

First point being "kill off module-level config"? Yes that was maybe a little strong. "Remove duplicated / boilerplate code" perhaps..

@pditommaso
Copy link
Member

I think both are needed. I mean it makes sense a rethinking the modules.config in order to simplify it and remove redundant settings.

At the same time it's desirable to have a nextflow.config at module level, to define the config for processes defined in the module. To avoid introducing yet another scope, I think the setting in the module config should be interpreted as if they were applied in the process code ie. main.nf

@maxulysse
Copy link
Contributor Author

I'm trying to split the modules.config files in separate config files so we independently test and validate separate subworkflows, we already have something working in sarek, but I'd like to have a more simple way to do so for all pipelines.

Which is actually why I was requesting this feature, and why we want it in a reverse order.
The pipeline configuration for a process should overwrite configuration for any subsequent inclusion.

@bentsherman
Copy link
Member

Closing in favor of #4422

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants