diff --git a/.claude/actions_plans.md b/.claude/actions_plans.md new file mode 100644 index 0000000..201600a --- /dev/null +++ b/.claude/actions_plans.md @@ -0,0 +1,172 @@ +# quantmsdiann — Action Plans & Roadmap + +> Standalone DIA-NN (DIA) pipeline refactored from quantms. +> Last updated: 2026-03-18 + +--- + +## Phase 1 — Cleanup & Correctness (COMPLETED) + +**Goal**: Remove legacy quantms artifacts that don't belong in a DIA-only pipeline. + +### 1.1 Remove non-DIA test profiles from CI + +- **Files**: `.github/workflows/ci.yml`, `.github/workflows/extended_ci.yml` +- **Action**: Removed `test_lfq`, `test_tmt`, `test_localize`, `test_dda_id_*`, `test_tmt_corr` from CI matrices. Updated repository reference from `bigbio/quantms` to `bigbio/quantmsdiann`. +- **Status**: [x] DONE + +### 1.2 Remove MSstats analysis step + +- **Files**: `workflows/dia.nf`, `workflows/quantmsdiann.nf`, `modules/local/msstats/`, `bin/msstats_plfq.R`, `nextflow.config`, `conf/modules/shared.config` +- **Action**: Removed MSSTATS_LFQ module, R script, and all MSstats analysis parameters. Kept DIANN_MSSTATS conversion step (generates MSstats-compatible CSV without running MSstats analysis). Removed MSstats general and LFQ options from nextflow.config. +- **Status**: [x] DONE + +### 1.3 Clean parameter schema and config + +- **Files**: `nextflow.config`, `nextflow_schema.json` +- **Action**: Removed `id_only` parameter and `statistical_post_processing` schema section. Updated manifest name to `bigbio/quantmsdiann`. Removed dead publishDir rules from shared.config. Created missing nf-core lint files (logos, docs/README.md). +- **Status**: [x] DONE + +### 1.4 Update documentation + +- **Files**: `README.md`, `docs/usage.md`, `docs/output.md`, `AGENTS.md` +- **Action**: Rewrote README with DIA-NN workflow SVG diagram. Rewrote AGENTS.md for DIA-only. Cleaned docs. Removed 10 legacy quantms images from docs/images/, created `quantmsdiann_workflow.svg`. 
+- **Status**: [x] DONE + +### 1.5 Remove unused modules and files + +- **Action**: Removed `modules/local/msstats/`, `bin/msstats_plfq.R`, `.devcontainer/`, `diann_private.yml`. Cleaned all legacy quantms images. +- **Status**: [x] DONE + +--- + +## Phase 1b — Version-Aware Testing Strategy (COMPLETED) + +**Goal**: Map each DIA-NN feature to the minimum version that supports it, with conditional CI. + +### DIA-NN Version → Feature Matrix + +| Version | Key Features | Container | +| ------- | ------------ | --------- | +| **1.8.1** (default) | Core DIA-NN workflow, library-free, .quant caching | `biocontainers/diann:v1.8.1_cv1` (public) | +| **1.9.2** | QuantUMS quantification, Parquet libraries, redesigned NN | `ghcr.io/bigbio/diann:1.9.2` (private, needs build) | +| **2.0** | Parquet output, proteoform confidence, decoy reporting | `ghcr.io/bigbio/diann:2.1.0` (private) | +| **2.1.0** | Native .raw on Linux, latest improvements | `ghcr.io/bigbio/diann:2.1.0` (private) | +| **2.2.0** | Latest release | `ghcr.io/bigbio/diann:latest` (private, needs build) | + +### Test Profiles Created + +| Profile | Feature Tested | Min DIA-NN | Container | +| ------- | -------------- | ---------- | --------- | +| `test_dia` | Core workflow | 1.8.1 | biocontainers (public) | +| `test_dia_dotd` | Bruker .d format | 1.8.1 | biocontainers (public) | +| `test_dia_quantums` | QuantUMS quantification | 1.9.2 | ghcr.io/bigbio/diann:2.1.0 | +| `test_dia_parquet` | Parquet output + decoys | 2.0 | ghcr.io/bigbio/diann:2.1.0 | +| `test_latest_dia` | Latest version validation | latest | ghcr.io/bigbio/diann:2.1.0 | +| `test_full_dia` | Full-size dataset | 1.8.1 | biocontainers (public) | + +### CI/CD Structure + +**ci.yml** (every PR, fast): +- `test_dia`, `test_dia_dotd` — public containers, no auth needed + +**extended_ci.yml** (5 jobs): +1. **test-default** — always runs `test_dia`, `test_dia_dotd` (Docker, 2 NXF versions) +2. 
**detect-changes** — uses `dorny/paths-filter` to detect which feature files changed +3. **test-features** — always runs on push to dev/master, releases, manual dispatch: `test_latest_dia`, `test_dia_quantums`, `test_dia_parquet` +4. **test-features-pr** — runs on PRs only when relevant files change (conditional per-feature) +5. **test-singularity** — default tests only, after Docker passes + +### Container Build Needed + +- **DIA-NN 1.9.2**: Dockerfile created at `quantms-containers/diann-1.9.2/Dockerfile`. Needs to be built and pushed to `ghcr.io/bigbio/diann:1.9.2`. +- **DIA-NN 2.2.0**: Need Dockerfile when ready to test latest. + +### Status: [x] DONE + +--- + +## Phase 2 — DIA-NN 2.x Full Support + +**Goal**: Make DIA-NN 2.1.0 a first-class citizen, leverage Parquet-native output. + +### 2.1 Promote DIA-NN 2.1.0 to default + +- **Files**: All `modules/local/diann/*/main.nf`, `nextflow.config` +- **Action**: Update default container from `diann:v1.8.1_cv1` to `diann:2.1.0`. Keep 1.8.1 available as a fallback profile. +- **Dependencies**: Verify all 7 DIA-NN modules work with 2.1.0 CLI changes. +- **Effort**: Medium +- **Status**: [ ] TODO + +### 2.2 Parquet-native pipeline path + +- **Files**: `modules/local/diann/final_quantification/main.nf`, `modules/local/diann/diann_msstats/main.nf` +- **Action**: Ensure DIANN_MSSTATS handles Parquet input end-to-end. Validated by `test_dia_parquet` CI profile. +- **Effort**: Medium +- **Status**: [ ] TODO + +### 2.3 DIA-NN version parameter + +- **Files**: `nextflow.config`, `nextflow_schema.json` +- **Action**: Add a `diann_version` parameter that switches container images without needing separate profiles. +- **Effort**: Medium +- **Status**: [ ] TODO + +--- + +## Phase 3 — Performance & Scalability + +**Goal**: Optimize resource usage and execution for large-scale DIA studies. 
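The resource tuning and caching items in this phase can be sketched in Nextflow config terms. This is a sketch only: the label name, multipliers, process name, and cache path below are illustrative assumptions, not current settings in `conf/base.config`.

```nextflow
// Sketch only — label, multipliers, process name, and cache path are assumptions.
process {
    // Retry-scaled resources: grow allocations on OOM/timeout instead of failing outright.
    withLabel: 'process_high' {
        cpus   = { 8 * task.attempt }
        memory = { 32.GB * task.attempt }
        time   = { 16.h * task.attempt }
        errorStrategy = { task.exitStatus in [137, 140] ? 'retry' : 'finish' }
        maxRetries    = 2
    }
    // storeDir caching: reuse expensive outputs (e.g. library generation) across runs.
    withName: 'DIANN_PRELIMINARY_ANALYSIS' {
        storeDir = "${params.outdir}/cache/preliminary_analysis"
    }
}
```

With `storeDir`, a re-run whose inputs are unchanged picks up the cached outputs instead of re-executing the process, which is the behaviour item 3.4 proposes to evaluate.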
+ +### 3.1 Smarter pre-analysis file selection + +- **Files**: `workflows/dia.nf` +- **Action**: Implement stratified selection (by condition/batch from SDRF) instead of purely random. +- **Effort**: Medium +- **Status**: [ ] TODO + +### 3.2 Resource profiling and tuning + +- **Files**: `conf/base.config` +- **Action**: Profile resource usage across dataset sizes. Adjust labels. Consider dynamic allocation. +- **Effort**: Medium +- **Status**: [ ] TODO + +### 3.3 GPU support profile + +- **Files**: `conf/gpu.config` (new), `modules/local/diann/*/main.nf` +- **Action**: Create `gpu` profile with NVIDIA runtime, `accelerator` directives, GPU container. +- **Effort**: Medium-Large +- **Status**: [ ] TODO + +### 3.4 Improved caching strategy + +- **Files**: DIA-NN modules +- **Action**: Evaluate `storeDir` for expensive steps (library generation, preliminary analysis). +- **Effort**: Small +- **Status**: [ ] TODO + +--- + +## Priority Summary + +| Priority | Phase | Items | Timeline | +| --------------- | -------- | --------------------------- | -------------------- | +| **Done** | Phase 1 | 1.1-1.5 (cleanup) | Completed 2026-03-17 | +| **Done** | Phase 1b | Version-aware testing | Completed 2026-03-18 | +| **Short-term** | Phase 2 | 2.1-2.3 (DIA-NN 2.x) | 1-2 weeks | +| **Medium-term** | Phase 3 | 3.1-3.4 (performance) | 2-4 weeks | + +--- + +## Decision Log + +| Date | Decision | Rationale | +| ---------- | ------------------------------------------------ | ---------------------------------------------------------------------------- | +| 2026-03-17 | Created roadmap from quantms dev comparison | Align refactoring priorities | +| 2026-03-17 | Completed Phase 1 cleanup | Remove all non-DIA artifacts | +| 2026-03-17 | Keep DIANN_MSSTATS, remove MSSTATS_LFQ | Generate MSstats-compatible CSV but don't run MSstats analysis in-pipeline | +| 2026-03-17 | Removed Phases 4-6 (quantification, QC, interop) | pmultiqc already covers QC; downstream analysis/interop out of scope 
for now | +| 2026-03-18 | Version-aware testing with conditional CI | Each feature maps to min DIA-NN version; PRs only run affected feature tests | +| 2026-03-18 | DIA-NN containers are private (license) | Academic-only license; GHCR_USERNAME + GHCR_TOKEN secrets required | +| 2026-03-18 | Created DIA-NN 1.9.2 Dockerfile | Needed for QuantUMS feature testing at minimum supported version | diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..7a2dabc --- /dev/null +++ b/.gitattributes @@ -0,0 +1,4 @@ +*.config linguist-language=nextflow +*.nf.test linguist-language=nextflow +modules/nf-core/** linguist-generated +subworkflows/nf-core/** linguist-generated diff --git a/.github/.dockstore.yml b/.github/.dockstore.yml new file mode 100644 index 0000000..ca4e607 --- /dev/null +++ b/.github/.dockstore.yml @@ -0,0 +1,6 @@ +# Dockstore config version, not pipeline version +version: 1.2 +workflows: + - subclass: NFL + primaryDescriptorPath: /nextflow.config + publish: True diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md new file mode 100644 index 0000000..2246991 --- /dev/null +++ b/.github/CONTRIBUTING.md @@ -0,0 +1,125 @@ +# `bigbio/quantmsdiann`: Contributing Guidelines + +Hi there! +Many thanks for taking an interest in improving bigbio/quantmsdiann. + +We try to manage the required tasks for bigbio/quantmsdiann using GitHub issues, you probably came to this page when creating one. +Please use the pre-filled template to save time. + +However, don't be put off by this template - other more general issues and suggestions are welcome! +Contributions to the code are even more welcome ;) + +> [!NOTE] +> If you need help using or modifying bigbio/quantmsdiann or bigbio/quantms then the best place to ask is on the nf-core Slack [#quantms](https://nfcore.slack.com/channels/quantms) channel ([join our Slack here](https://nf-co.re/join/slack)). 
+ +## Contribution workflow + +If you'd like to write some code for bigbio/quantmsdiann, the standard workflow is as follows: + +1. Check that there isn't already an issue about your idea in the [bigbio/quantmsdiann issues](https://github.com/bigbio/quantmsdiann/issues) to avoid duplicating work. If there isn't one already, please create one so that others know you're working on this +2. [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the [bigbio/quantmsdiann repository](https://github.com/bigbio/quantmsdiann) to your GitHub account +3. Make the necessary changes / additions within your forked repository following [Pipeline conventions](#pipeline-contribution-conventions) +4. Use `nf-core pipelines schema build` and add any new parameters to the pipeline JSON schema (requires [nf-core tools](https://github.com/nf-core/tools) >= 1.10). +5. Submit a Pull Request against the `dev` branch and wait for the code to be reviewed and merged + +If you're not used to this workflow with git, you can start with some [docs from GitHub](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests) or even their [excellent `git` resources](https://try.github.io/). + +## Tests + +You have the option to test your changes locally by running the pipeline. For receiving warnings about process selectors and other `debug` information, it is recommended to use the debug profile. Execute all the tests with the following command: + +```bash +nf-test test --profile debug,test,docker --verbose +``` + +When you create a pull request with changes, [GitHub Actions](https://github.com/features/actions) will run automatic tests. +Typically, pull-requests are only fully reviewed when these tests are passing, though of course we can help out before then. + +There are typically two types of tests that run: + +### Lint tests + +`nf-core` has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to. 
+To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core pipelines lint ` command. + +If any failures or warnings are encountered, please follow the listed URL for more documentation. + +### Pipeline tests + +Each `nf-core` pipeline should be set up with a minimal set of test-data. +`GitHub Actions` then runs the pipeline on this data to ensure that it exits successfully. +If there are any failures then the automated tests fail. +These tests are run both with the latest available version of `Nextflow` and also the minimum required version that is stated in the pipeline code. + +## Patch + +:warning: Only in the unlikely and regretful event of a release happening with a bug. + +- On your own fork, make a new branch `patch` based on `upstream/main`. +- Fix the bug, and bump version (X.Y.Z+1). +- Open a pull-request from `patch` to `main` with the changes. + +## Getting help + +For further information/help, please consult the [bigbio/quantmsdiann documentation](https://docs.quantms.org/en/latest/usage.html) and don't hesitate to get in touch on the nf-core Slack [#quantms](https://nfcore.slack.com/channels/quantms) channel ([join our Slack here](https://nf-co.re/join/slack)). + +## Pipeline contribution conventions + +To make the `bigbio/quantmsdiann` code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written. + +### Adding a new step + +If you wish to contribute a new step, please use the following coding standards: + +1. Define the corresponding input channel into your new process from the expected previous process channel. +2. Write the process block (see below). +3. Define the output channel if needed (see below). +4. 
Add any new parameters to `nextflow.config` with a default (see below). +5. Add any new parameters to `nextflow_schema.json` with help text (via the `nf-core pipelines schema build` tool). +6. Add sanity checks and validation for all relevant parameters. +7. Perform local tests to validate that the new code works as expected. +8. If applicable, add a new test in the `tests` directory. +9. Update MultiQC config `assets/multiqc_config.yml` so relevant suffixes, file name clean up and module plots are in the appropriate order. If applicable, add a [MultiQC](https://multiqc.info/) module. +10. Add a description of the output files and if relevant any appropriate images from the MultiQC report to `docs/output.md`. + +### Default values + +Parameters should be initialised / defined with default values within the `params` scope in `nextflow.config`. + +Once there, use `nf-core pipelines schema build` to add to `nextflow_schema.json`. + +### Default processes resource requirements + +Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generic with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/main/nf_core/pipeline-template/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels. + +The process resources can be passed on to the tool dynamically within the process with the `${task.cpus}` and `${task.memory}` variables in the `script:` block. + +### Naming schemes + +Please use the following naming schemes, to make it easy to understand what is going where. 
+ +- initial process channel: `ch_output_from_` +- intermediate and terminal channels: `ch__for_` + +### Nextflow version bumping + +If you are using a new feature from core Nextflow, you may bump the minimum required version of nextflow in the pipeline with: `nf-core pipelines bump-version --nextflow . [min-nf-version]` + +### Images and figures + +For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines). + +## GitHub Codespaces + +This repo includes a devcontainer configuration which will create a GitHub Codespaces for Nextflow development! This is an online developer environment that runs in your browser, complete with VSCode and a terminal. + +To get started: + +- Open the repo in [Codespaces](https://github.com/bigbio/quantmsdiann/codespaces) +- Tools installed + - nf-core + - Nextflow + +Devcontainer specs: + +- [DevContainer config](.devcontainer/devcontainer.json) diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml new file mode 100644 index 0000000..00db16e --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -0,0 +1,42 @@ +name: Bug report +description: Report something that is broken or incorrect +labels: bug +body: + - type: textarea + id: description + attributes: + label: Description of the bug + description: A clear and concise description of what the bug is. + validations: + required: true + + - type: textarea + id: command_used + attributes: + label: Command used and terminal output + description: Steps to reproduce the behaviour. Please paste the command you used to launch the pipeline and the output from your terminal. + render: console + placeholder: | + $ nextflow run ... + + Some output where something broke + + - type: textarea + id: files + attributes: + label: Relevant files + description: | + Please drag and drop the relevant files here. Create a `.zip` archive if the extension is not allowed. 
+ Your verbose log file `.nextflow.log` is often useful _(this is a hidden file in the directory where you launched the pipeline)_ as well as custom Nextflow configuration files. + + - type: textarea + id: system + attributes: + label: System information + description: | + * Nextflow version _(eg. 23.04.0)_ + * Hardware _(eg. HPC, Desktop, Cloud)_ + * Executor _(eg. slurm, local, awsbatch)_ + * Container engine: _(e.g. Docker, Singularity, Conda, Podman, Shifter, Charliecloud, or Apptainer)_ + * OS _(eg. CentOS Linux, macOS, Linux Mint)_ + * Version of bigbio/quantmsdiann _(eg. 1.1, 1.5, 1.8.2)_ diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml new file mode 100644 index 0000000..3849862 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -0,0 +1,7 @@ +contact_links: + - name: Join nf-core + url: https://nf-co.re/join + about: Please join the nf-core community here + - name: "Slack #quantms channel" + url: https://nfcore.slack.com/channels/quantms + about: Discussion about the bigbio/quantmsdiann pipeline diff --git a/.github/ISSUE_TEMPLATE/feature_request.yml b/.github/ISSUE_TEMPLATE/feature_request.yml new file mode 100644 index 0000000..5ac76f0 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.yml @@ -0,0 +1,11 @@ +name: Feature request +description: Suggest an idea for the bigbio/quantmsdiann pipeline +labels: enhancement +body: + - type: textarea + id: description + attributes: + label: Description of feature + description: Please describe your suggestion for a new feature. It might help to describe a problem or use case, plus any alternatives that you have considered. + validations: + required: true diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md new file mode 100644 index 0000000..7299652 --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -0,0 +1,26 @@ + + +## PR checklist + +- [ ] This comment contains a description of changes (with reason). 
+- [ ] If you've fixed a bug or added code that should be tested, add tests! +- [ ] If you've added a new tool - have you followed the pipeline conventions in the [contribution docs](https://github.com/bigbio/quantmsdiann/tree/master/.github/CONTRIBUTING.md) +- [ ] If necessary, also make a PR on the bigbio/quantmsdiann _branch_ on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository. +- [ ] Make sure your code lints (`nf-core pipelines lint`). +- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir `). +- [ ] Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir `). +- [ ] Usage Documentation in `docs/usage.md` is updated. +- [ ] Output Documentation in `docs/output.md` is updated. +- [ ] `CHANGELOG.md` is updated. +- [ ] `README.md` is updated (including new tool citations and authors/contributors). diff --git a/.github/actions/get-shards/action.yml b/.github/actions/get-shards/action.yml new file mode 100644 index 0000000..ad52fda --- /dev/null +++ b/.github/actions/get-shards/action.yml @@ -0,0 +1,67 @@ +name: "Get number of shards" +description: "Get the number of nf-test shards for the current CI job" +inputs: + max_shards: + description: "Maximum number of shards allowed" + required: true + tags: + description: "Tags to pass as argument for nf-test --tag parameter" + required: false +outputs: + shard: + description: "Array of shard numbers" + value: ${{ steps.shards.outputs.shard }} + total_shards: + description: "Total number of shards" + value: ${{ steps.shards.outputs.total_shards }} +runs: + using: "composite" + steps: + - name: Install nf-test + uses: nf-core/setup-nf-test@v1 + with: + version: ${{ env.NFT_VER }} + - name: Get number of shards + id: shards + shell: bash + run: | + # Run nf-test with dynamic parameter + nftest_output=$(nf-test test \ + --profile +docker \ + $(if [ -n "${{ inputs.tags }}" ]; then echo "--tag ${{ inputs.tags }}"; fi) \ + 
--dry-run \ + --ci \ + --changed-since HEAD^) || nftest_exit_code=$? + if [ "${nftest_exit_code:-0}" -ne 0 ]; then + echo "nf-test command failed with exit code $nftest_exit_code" + echo "Full output: $nftest_output" + exit 1 + fi + echo "nf-test dry-run output: $nftest_output" + + # Default values for shard and total_shards + shard="[]" + total_shards=0 + + # Check if there are related tests + if echo "$nftest_output" | grep -q 'No tests to execute'; then + echo "No related tests found." + else + # Extract the number of related tests + number_of_shards=$(echo "$nftest_output" | sed -n 's|.*Executed \([0-9]*\) tests.*|\1|p') + if [[ -n "$number_of_shards" && "$number_of_shards" -gt 0 ]]; then + shards_to_run=$(( $number_of_shards < ${{ inputs.max_shards }} ? $number_of_shards : ${{ inputs.max_shards }} )) + shard=$(seq 1 "$shards_to_run" | jq -R . | jq -c -s .) + total_shards="$shards_to_run" + else + echo "Unexpected output format. Falling back to default values." + fi + fi + + # Write to GitHub Actions outputs + echo "shard=$shard" >> $GITHUB_OUTPUT + echo "total_shards=$total_shards" >> $GITHUB_OUTPUT + + # Debugging output + echo "Final shard array: $shard" + echo "Total number of shards: $total_shards" diff --git a/.github/actions/nf-test/action.yml b/.github/actions/nf-test/action.yml new file mode 100644 index 0000000..9cb9cb0 --- /dev/null +++ b/.github/actions/nf-test/action.yml @@ -0,0 +1,109 @@ +name: "nf-test Action" +description: "Runs nf-test with common setup steps" +inputs: + profile: + description: "Profile to use" + required: true + shard: + description: "Shard number for this CI job" + required: true + total_shards: + description: "Total number of test shards(NOT the total number of matrix jobs)" + required: true + tags: + description: "Tags to pass as argument for nf-test --tag parameter" + required: false +runs: + using: "composite" + steps: + - name: Setup Nextflow + uses: nf-core/setup-nextflow@v2 + with: + version: "${{ env.NXF_VERSION }}" + 
+ - name: Set up Python + uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6 + with: + python-version: "3.12" + + - name: Install nf-test + uses: nf-core/setup-nf-test@v1 + with: + version: "${{ env.NFT_VER }}" + install-pdiff: true + + - name: Setup apptainer + if: contains(inputs.profile, 'singularity') + uses: eWaterCycle/setup-apptainer@main + + - name: Set up Singularity + if: contains(inputs.profile, 'singularity') + shell: bash + run: | + mkdir -p $NXF_SINGULARITY_CACHEDIR + mkdir -p $NXF_SINGULARITY_LIBRARYDIR + + - name: Conda setup + if: contains(inputs.profile, 'conda') + uses: conda-incubator/setup-miniconda@505e6394dae86d6a5c7fbb6e3fb8938e3e863830 # v3 + with: + auto-update-conda: true + conda-solver: libmamba + channels: conda-forge + channel-priority: strict + conda-remove-defaults: true + + - name: Run nf-test + id: nf-test + shell: bash + env: + NFT_WORKDIR: ${{ env.NFT_WORKDIR }} + run: | + nf-test test \ + --profile=+${{ inputs.profile }} \ + $(if [ -n "${{ inputs.tags }}" ]; then echo "--tag ${{ inputs.tags }}"; fi) \ + --ci \ + --changed-since HEAD^ \ + --verbose \ + --tap=test.tap \ + --shard ${{ inputs.shard }}/${{ inputs.total_shards }} + + # Save the absolute path of the test.tap file to the output + echo "tap_file_path=$(realpath test.tap)" >> $GITHUB_OUTPUT + + - name: Generate test summary + if: always() + shell: bash + run: | + # Add header if it doesn't exist (using a token file to track this) + if [ ! 
-f ".summary_header" ]; then + echo "# 🚀 nf-test results" >> $GITHUB_STEP_SUMMARY + echo "" >> $GITHUB_STEP_SUMMARY + echo "| Status | Test Name | Profile | Shard |" >> $GITHUB_STEP_SUMMARY + echo "|:------:|-----------|---------|-------|" >> $GITHUB_STEP_SUMMARY + touch .summary_header + fi + + if [ -f test.tap ]; then + while IFS= read -r line; do + if [[ $line =~ ^ok ]]; then + test_name="${line#ok }" + # Remove the test number from the beginning + test_name="${test_name#* }" + echo "| ✅ | ${test_name} | ${{ inputs.profile }} | ${{ inputs.shard }}/${{ inputs.total_shards }} |" >> $GITHUB_STEP_SUMMARY + elif [[ $line =~ ^not\ ok ]]; then + test_name="${line#not ok }" + # Remove the test number from the beginning + test_name="${test_name#* }" + echo "| ❌ | ${test_name} | ${{ inputs.profile }} | ${{ inputs.shard }}/${{ inputs.total_shards }} |" >> $GITHUB_STEP_SUMMARY + fi + done < test.tap + else + echo "| ⚠️ | No test results found | ${{ inputs.profile }} | ${{ inputs.shard }}/${{ inputs.total_shards }} |" >> $GITHUB_STEP_SUMMARY + fi + + - name: Clean up + if: always() + shell: bash + run: | + sudo rm -rf /home/ubuntu/tests/ diff --git a/.github/skills/sdrf/SKILL.md b/.github/skills/sdrf/SKILL.md new file mode 100644 index 0000000..5329895 --- /dev/null +++ b/.github/skills/sdrf/SKILL.md @@ -0,0 +1,586 @@ +--- +name: Create SDRF +description: Create a sample-to-data-relationship format (SDRF) file (usually from another type of samplesheet) +--- + +Create a sample-to-data-relationship format file based on the specification document. By default +generate a tsv. Quote where necessary. If required columns are missing, let the user know and provide help +in further filling by asking questions. 
In the end, self-validate the file by doing: + +```bash +curl -LsSf https://astral.sh/uv/install.sh | sh +uvx --from sdrf-pipelines parse_sdrf validate-sdrf -s $YOUR_GENERATED_SDRF +``` + +Additional online resources: + +- https://github.com/bigbio/proteomics-sample-metadata (specification repo) +- https://github.com/bigbio/proteomics-metadata-standard/tree/master/annotated-projects (example annotated projects) +- https://github.com/bigbio/sdrf-pipelines (python validator) + +Specification document: + +[[status]] +== Status of this document + +This document provides information to the proteomics community about a proposed standard for sample metadata annotations in public repositories called Sample and Data Relationship File (SDRF)-Proteomics format. Distribution is unlimited. + +**Version 1.0.1** - 2023-05-24 + +[[abstract]] +== Abstract + +The Human Proteome Organisation (HUPO) Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange, and verification. This document presents a specification for a sample metadata annotation of proteomics experiments. + +Further detailed information, including any updates to this document, implementations, and examples is available at https://github.com/bigbio/proteomics-metadata-standard. The official PSI web page for the document is the following: http://psidev.info/sdrf. + +[[introduction]] +== Introduction + +Many resources have emerged that provide raw or integrated proteomics data in the public domain. If these are valuable individually, their integration through re-analysis represents a huge asset for the community [1]. Unfortunately, proteomics experimental design and sample related information are often missing in public repositories or stored in very diverse ways and formats. 
For example, the CPTAC consortium (https://cptac-data-portal.georgetown.edu/) provides for every dataset a set of Excel files with information on each sample (e.g. https://cptac-data-portal.georgetown.edu/study-summary/S048), including tumor size and origin, but also how every sample relates to a specific raw file (e.g. instrument configuration parameters). As a resource that routinely re-analyses public datasets, ProteomicsDB captures for each sample in the database a minimum number of properties describing the sample and the related experimental protocol, such as tissue, digestion method and instrument (e.g. https://www.proteomicsdb.org/#projects/4267/6228). Such heterogeneity often prevents data interpretation, reproducibility, and integration of data from different resources. This is why we propose a homogeneous standard for proteomics metadata annotation. For every proteomics dataset we propose to capture at least three levels of metadata: (i) the dataset description; (ii) the sample- and data file-related information; and (iii) the technical, proteomics-specific information in standard data file formats (e.g. the PSI formats mzIdentML, mzML, or mzTab, among others). + +The general description includes the minimum information needed to describe the study overall: title, description, date of publication, and type of experiment (e.g. http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD016060.0-1&outputMode=XML). The standard data files contain mostly the technical metadata associated with the dataset, including search engine settings, scores, workflows, and configuration files, but do not include information about the sample metadata and/or the experimental design. Currently, all ProteomeXchange partners mandate this information for each dataset. However, the information regarding the sample and its relation to the data files (**Figure 1**) is mostly missing [1]. 
+ +These three levels of metadata are combined in the well-established data formats ISA-TAB [2] (https://www.isacommons.org/) or MAGE-TAB [3], which are used in other omics fields such as metabolomics and transcriptomics. In both data formats, a tab-delimited file is used to annotate the sample metadata and link it to the corresponding data file(s) (sample and data relationship file format, SDRF). Both data formats encode the properties and sample attributes as columns, and each row represents a sample in the study. However, more important than the file format itself are general guidelines about what information should be encoded to enable reproducibility of the proteomics results. The lack of guidelines for annotating information such as disease stage, cell line code, or organism part, or the analytical information about labelling channels (e.g. TMT, SILAC), makes the data representation incomplete. The consequence is that it is not possible to understand the original experiment and/or to re-analyse the dataset with all the information necessary for reproducibility. If the information about fractions, labelling channels, or enrichment methods is not annotated, reuse and reproduction of the original results will be challenging, if at all possible. + +image::https://github.com/bigbio/proteomics-metadata-standard/raw/master/sdrf-proteomics/images/sample-metadata.png[] + +**Figure 1**: The SDRF-Proteomics file format stores the information about each sample and its relation to the data files in the dataset. The format includes not only information about the sample but also about how the data was acquired and processed. + +[[requirements]] +=== Requirements + +The SDRF-Proteomics format describes the sample characteristics and the relationships between samples and data files included in a dataset. The information in SDRF files is organised so that it follows the natural flow of a proteomics experiment. 
The main requirements to be fulfilled by the SDRF-Proteomics format are: + +- The SDRF file is a tab-delimited format where each ROW corresponds to a relationship between a Sample and a Data file (and an MS signal corresponding to the labelling in the context of multiplexed experiments). +- Each column MUST correspond to an attribute/property of the Sample or the Data file. +- Each value in each cell MUST be the value of the property for a given Sample or Data file. +- The file MUST begin with columns describing the samples of origin and continue with the data files generated from their MS analyses. +- The format MUST support handling unknown values/characteristics. + +[[issues-addressed]] +=== Issues to be addressed + +The main issues to be addressed by the SDRF are: + +- It MUST be able to represent the sample metadata and the data files generated by the instruments or the analyses. +- It MUST be able to represent the experimental design, including the way samples and data have been collected. + +[[notation-conventions]] +== Notational Conventions + +The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMEND/RECOMMENDED”, “MAY”, “COULD BE”, and “OPTIONAL” are to be interpreted as described in RFC 2119. + +[[document-structure]] +== Documentation + +The official website for the SDRF-Proteomics project is https://github.com/bigbio/proteomics-metadata-standard. New use cases, changes to the specification and examples can be added by using pull requests or issues in GitHub (see introduction to GitHub - https://lab.github.com/githubtraining/introduction-to-github). 
+ +A set of examples and annotated projects from ProteomeXchange can be found here: https://github.com/bigbio/proteomics-metadata-standard/tree/master/annotated-projects + +Multiple tools have been implemented to validate SDRF-Proteomics files for users familiar with Python and Java: + +- sdrf-pipelines (Python - https://github.com/bigbio/sdrf-pipelines): This tool validates SDRF-Proteomics files. In addition, it converts SDRF files into configuration files for other popular pipelines and software, such as MaxQuant or OpenMS. + +- jsdrf (Java - https://github.com/bigbio/jsdrf): This Java library and tool validates SDRF-Proteomics files. It also includes a generic data model that can be used by Java applications. + +[[relationship-specifications]] +== Relationship to other specifications + +SDRF-Proteomics is fully compatible with the SDRF file format that is part of https://www.ebi.ac.uk/arrayexpress/help/magetab_spec.html[MAGE-TAB]. MAGE-TAB is the file format used to store metadata and sample information for transcriptomics experiments. When a ProteomeXchange project file is converted to an IDF file (the project description in MAGE-TAB) and combined with the SDRF-Proteomics file, a valid MAGE-TAB document is obtained. + +SDRF-Proteomics sample information can be embedded into mzTab metadata files. The sample metadata in mzTab encodes the same properties as the columns of the SDRF-Proteomics file, with the values taken from the corresponding sample cells. + +The SDRF-Proteomics format aims to capture the sample metadata and its relationship with the data files (e.g. raw files from mass spectrometers). It does not aim to capture the downstream analysis part of the experimental design, such as which samples should be compared, how they can be combined, or parameters for the downstream analysis (e.g. FDR or p-value thresholds). The HUPO-PSI community will work in the future to capture this information in other file formats such as mzTab, or in a new type of file format. 
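For readers who want a sense of what the core structural checks amount to, the sketch below re-implements a few of the rules from this specification in plain Python. `basic_sdrf_checks` is a hypothetical illustration, not part of sdrf-pipelines or jsdrf:

```python
import csv

def basic_sdrf_checks(path):
    """Collect basic structural problems in an SDRF-Proteomics file.

    A minimal sketch of the structural rules in this specification;
    real validators such as sdrf-pipelines implement many more checks
    (ontology terms, templates, etc.).
    """
    problems = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, None)
        if not header:
            return ["file is empty"]
        lowered = [column.strip().lower() for column in header]
        # The file MUST begin with the sample accession column.
        if lowered[0] != "source name":
            problems.append("first column must be 'source name'")
        # Every sample-to-file relationship needs an assay name.
        if "assay name" not in lowered:
            problems.append("missing 'assay name' column")
        # Each row is one sample-to-data-file relationship and must
        # provide a value for every declared column.
        for line_number, row in enumerate(reader, start=2):
            if len(row) != len(header):
                problems.append(
                    f"line {line_number}: expected {len(header)} values, got {len(row)}"
                )
    return problems
```

This only covers the tabular structure; for ontology- and template-aware validation, use the tools listed above.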
+ +[[ontologies-supported]] +== Ontologies/Controlled Vocabularies Supported + +The list of ontologies/controlled vocabularies (CVs) supported is: + +- PSI Mass Spectrometry CV (PSI-MS) +- Experimental Factor Ontology (EFO) +- Unimod protein modification database for mass spectrometry +- PSI-MOD CV (PSI-MOD) +- Cell line ontology +- Drosophila anatomy ontology +- Cell ontology +- Plant ontology +- Uber-anatomy ontology +- Zebrafish anatomy and development ontology +- Zebrafish developmental stages ontology +- Plant Environment Ontology +- FlyBase Developmental Ontology +- Rat Strain Ontology +- Chemical Entities of Biological Interest Ontology +- NCBI organismal classification +- PATO - the Phenotype and Trait Ontology +- PRIDE Controlled Vocabulary (CV) +- Mondo Disease Ontology (MONDO): a unified disease ontology integrating multiple disease resources + +[[sdrf-file-format]] +== SDRF-Proteomics file format + +The SDRF-Proteomics file format describes the sample characteristics and the relationships between samples and data files. The file format is a tab-delimited one where each ROW corresponds to a relationship between a Sample and a Data file (and an MS signal corresponding to the labelling in the context of multiplexed experiments), each column corresponds to an attribute/property of the Sample, and the value in each cell is the specific value of the property for a given Sample (**Figure 2**). + +[#img-sunset] +image::https://github.com/bigbio/proteomics-metadata-standard/raw/master/sdrf-proteomics/images/sdrf-nutshell.png[] + +**Figure 2**: SDRF-Proteomics in a nutshell. The file format is a tab-delimited one where the columns are properties of the sample, the data file or the variables under study. The rows are the samples of origin and the cells are the values for one property in a specific sample. 
+ +[[sdrf-file-rules]] +=== SDRF-Proteomics format rules + +There are general scenarios/use cases that are addressed by the following rules: + +- **Unknown values**: In some cases, the column is mandatory in the format, but for some samples the corresponding value is unknown. In those cases, users SHOULD use ‘not available’. +- **Not applicable values**: In some cases, the column is mandatory, but for some samples the corresponding value is not applicable. In those cases, users SHOULD use ‘not applicable’. +- **Case sensitivity**: By specification the SDRF is case-insensitive, but we RECOMMEND using lowercase characters throughout all the text (column names and values). +- **Spaces**: By specification the SDRF is sensitive to spaces (sourcename != source name). +- **Column order**: The SDRF MUST start with the source name column (accession/name of the sample of origin), followed by all the sample characteristics and then the assay name corresponding to the MS run. Finally, all the comments (properties of the generated data files) come after the assay name. +- **Extension**: The extension of the SDRF file SHOULD be .tsv or .txt. + +[[sdrf-file-standarization]] +=== SDRF-Proteomics values + +The value of each property (e.g. characteristics, comment) for each sample can be represented in multiple ways. + +- Free text (human readable): In the free text representation, the value is provided as plain text (e.g. colon) without ontology support such as accession numbers. This is only RECOMMENDED when the text inserted in the table is the exact name of an ontology/CV term in EFO. If the term is not in EFO, other ontologies can be used. + +|=== +| source name | characteristics[organism] + +| sample 1 |homo sapiens +| sample 2 |homo sapiens +|=== + +- Ontology URL (computer readable): Users can provide the corresponding URI (Uniform Resource Identifier) of the ontology/CV term as a value. 
This is recommended for enriched files where the user does not want to use intermediate tools to map from free text to ontology/CV terms. + +|=== +| source name | characteristics[organism] + +| sample 1 |http://purl.obolibrary.org/obo/NCBITaxon_9606 +| sample 2 |http://purl.obolibrary.org/obo/NCBITaxon_9606 +|=== + +- Key=value representation (human and computer readable): This representation provides a mechanism to represent the complete information of the ontology/CV term, including the accession, name and other additional properties. In the key=value pair representation, the value of the property is represented as an object with multiple properties, where each key is one of the properties of the object and the value is the corresponding value for that key. An example of key=value pairs is a post-translational modification <> + + NT=Glu->pyro-Glu;MT=fixed;PP=Anywhere;AC=Unimod:27;TA=E + +[[from-sample-metadata]] +== SDRF-Proteomics: Sample metadata + +The sample metadata has different categories/headings to organize all the attributes/column headers of a given sample. Each Sample contains a _source name_ (accession) and a set of _characteristics_. Any proteomics sample MUST contain the following characteristics: + +- _source name_: Unique sample name (it can be present multiple times if the same sample is used several times in the same dataset). +- _characteristics[organism]_: The organism of the Sample of origin. +- _characteristics[disease]_: The disease under study in the Sample. +- _characteristics[organism part]_: The part of the organism's anatomy, or substance arising from an organism, from which the biomaterial was derived (e.g., liver). +- _characteristics[cell type]_: A cell type is a distinct morphological or functional form of cell, e.g. epithelial or glial. 
+ +Example: + +|=== +| source name | characteristics[organism] | characteristics[organism part] | characteristics[disease] | characteristics[cell type] + +|sample_treat | homo sapiens | liver | liver cancer | not available +|sample_control | homo sapiens | liver | liver cancer | not available +|=== + +NOTE: Additional characteristics can be added depending on the type of the experiment and sample. The https://github.com/bigbio/proteomics-metadata-standard/tree/master/templates[SDRF-Proteomics templates] define a set of templates and checklists of properties that should be provided depending on the proteomics experiment. + +Some important notes: + +- Each characteristic name in the column header SHOULD be a CV term from the EFO ontology. For example, the header _characteristics[organism]_ corresponds to the ontology term Organism. However, the values can come from EFO or other ontologies. For example, we RECOMMEND using MONDO for diseases because it has better coverage than EFO. + +- Multiple values (columns) for the same characteristics term are allowed in SDRF-Proteomics. However, it is RECOMMENDED to avoid repeating the same column in the same file. If you have multiple phenotypes, you can specify what each one refers to or use another, more specific term, e.g., "immunophenotype". + +[[from-sample-data]] +== SDRF-Proteomics: Data files metadata + +The connection between the Samples and the Data files is made using a series of properties and attributes (comments; for backward compatibility with SDRF in transcriptomics, comment MUST be used). All the properties referring to the MS run (file) itself are annotated under the category **comment**. The use of comment is mainly aimed at differentiating sample properties from data properties, and it matches a given sample to the corresponding file(s). The word comment is used for backwards compatibility with gene expression experiments (RNA-Seq and microarray experiments). 
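The characteristics/comment split described above can be applied mechanically to a header row. A small Python sketch (`classify_columns` is a hypothetical name, not part of any official tool):

```python
import re

# characteristics[...] describe the sample; comment[...] describe the
# data file / MS run, per the category convention of this specification.
HEADER_PATTERN = re.compile(r"^(characteristics|comment)\[(.+)\]$")

def classify_columns(header):
    """Sort SDRF header columns into sample, data and other columns.

    Columns without a category prefix (e.g. 'source name', 'assay name')
    end up in 'other'.
    """
    groups = {"sample": [], "data": [], "other": []}
    for column in header:
        match = HEADER_PATTERN.match(column.strip().lower())
        if match is None:
            groups["other"].append(column)
        elif match.group(1) == "characteristics":
            groups["sample"].append(column)
        else:
            groups["data"].append(column)
    return groups
```

For example, `classify_columns(["source name", "characteristics[organism]", "assay name", "comment[data file]"])` puts the organism column under `sample` and the data file column under `data`.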
+ +The order of the columns is important: _assay name_ SHOULD always be located before the comments. It is RECOMMENDED to make _comment[data file]_ the last column. The following properties MUST be provided for each data file (MS run): + +- **assay name**: For SDRF back-compatibility, MSRun cannot be used. Instead, _assay name_ is used. Examples of assay names are: “run 1”, “run_fraction_1_2”. +- **comment[fraction identifier]**: The fraction identifier allows recording the number of a given fraction. The fraction identifier corresponds to this ontology term. It MUST start from 1, and if the experiment is not fractionated, 1 MUST be used for each MSRun (assay). +- **comment[label]**: label describes the label applied to each Sample (if any). In the case of multiplexed experiments such as TMT, SILAC, and/or iTRAQ, the corresponding label SHOULD be added. For label-free experiments the label-free sample term MUST be used <>. +- **comment[data file]**: The data file provides the name of the raw file generated by the instrument. The data files can be instrument raw files, but also converted peak lists such as mzML or MGF, or result files like mzIdentML. +- **comment[instrument]**: Instrument model used to capture the sample <>. + +Example: + +|=== +| | assay name | comment[label] | comment[fraction identifier] | comment[instrument]| comment[data file] +|sample 1| run 1 | label free sample | 1 | NT=LTQ Orbitrap XL | 000261_C05_P0001563_A00_B00K_R1.RAW +|sample 1| run 2 | label free sample | 2 | NT=LTQ Orbitrap XL | 000261_C05_P0001563_A00_B00K_R2.RAW +|=== + +TIP: All the possible _label_ values can be seen in the PRIDE CV under the https://www.ebi.ac.uk/ols/ontologies/pride/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPRIDE_0000514&viewMode=All&siblings=false[Label] node. + +[[label-data]] +=== Label annotations + +In order to annotate quantitative datasets, the SDRF file format uses tags for each channel associated with the sample in _comment[label]_. 
The label values are organized under the ontology term Label. Some of the most popular labels are: + +- For label-free experiments the value SHOULD be: label free sample +- For TMT experiments, the SDRF uses the PRIDE ontology terms under sample label. Here are some examples of TMT channels: + + TMT126, TMT127, TMT127C, TMT127N, TMT128, TMT128C, TMT128N, TMT129, TMT129C, TMT129N, TMT130, TMT130C, TMT130N, TMT131 + +In order to achieve a clear relationship between the label and the sample characteristics, each channel of each sample (in multiplexed experiments) SHOULD be defined in a separate row: one row per channel used, annotated with the corresponding _comment[label]_, per file. + +Examples: + +• https://github.com/bigbio/proteomics-sample-metadata/blob/master/annotated-projects/PXD000612/PXD000612.sdrf.tsv[Label free] +• https://github.com/bigbio/proteomics-sample-metadata/blob/master/annotated-projects/PXD011799/PXD011799.sdrf.tsv[TMT] +• https://github.com/bigbio/proteomics-sample-metadata/blob/master/annotated-projects/PXD017710/PXD017710-silac.sdrf.tsv[SILAC] + +[[instrument]] +=== Type and Model of Mass Spectrometer + +The model of the mass spectrometer SHOULD be specified as _comment[instrument]_. Possible values are listed under the https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMS_1000031&viewMode=All&siblings=false[instrument model term]. + +Additionally, it is strongly RECOMMENDED to include _comment[MS2 analyzer type]_. This is important, e.g., for Orbitrap models where MS2 scans can be acquired either in the Orbitrap or in the ion trap. Setting this value allows differentiating high-resolution MS/MS data. Possible values of _comment[MS2 analyzer type]_ are mass analyzer types. + +[[technology-type]] +=== Technology type + +Technology type is used in SDRF and MAGE-TAB formats to specify the technology applied in the study to capture the data. 
For transcriptomics, common values include technologies such as microarray, RNA-seq, and ChIP-seq (as seen in the https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-13567[ArrayExpress Example]). In SDRF-Proteomics, the technology type field is REQUIRED to describe the experimental approach used to generate the data. We RECOMMEND including the technology type column immediately after the `assay name` column in the SDRF file, clearly indicating which technology was used to produce the data files. + +|=== +| | assay name | technology type +|sample 1| run 1 | proteomic profiling by mass spectrometry +|=== + +NOTE: While we RECOMMEND positioning the technology type column after the assay name, in some original templates this column was placed before the assay name. We will allow the technology type column to appear either directly before or after the assay name column, but RECOMMEND placing it after the assay name for consistency. + +For proteomics experiments, the possible values for technology types can be obtained from the PRIDE Ontology term https://www.ebi.ac.uk/ols4/ontologies/pride/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPRIDE_0000663[technology type]. + +Here is the list of valid values: + +- proteomic profiling by mass spectrometry + +[[additional-data-files]] +=== Additional data file technical properties + +It is RECOMMENDED to encode some of the technical parameters of the MS experiment as comments, including the following parameters: + +- Protein modifications +- Precursor and fragment ion mass tolerances +- Digestion enzymes + +[[ptms]] +==== Protein Modifications + +Sample modifications (including both chemical modifications and post-translational modifications, PTMs) originate from multiple sources: artifact modifications, isotope labeling, adducts that are encoded as PTMs (e.g. sodium), or the most biologically relevant PTMs. 
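Cells that use the key=value convention introduced earlier (e.g. `NT=Glu->pyro-Glu;MT=fixed;PP=Anywhere;AC=Unimod:27;TA=E`) can be split into a dictionary with a few lines of Python. `parse_key_value` below is an illustrative sketch, not an official parser:

```python
def parse_key_value(cell):
    """Split an SDRF key=value cell into a dict, e.g.

    'NT=Oxidation;MT=Variable;TA=M' -> {'NT': 'Oxidation', ...}

    Whitespace around pairs is tolerated, since annotated files vary.
    """
    pairs = {}
    for part in cell.split(";"):
        part = part.strip()
        if not part:
            continue
        # partition (rather than split) keeps any '=' inside the value.
        key, _, value = part.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs
```

Usage: `parse_key_value("NT=Glu->pyro-Glu;MT=fixed;PP=Anywhere;AC=Unimod:27;TA=E")["AC"]` yields `"Unimod:27"`.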
+ +It is RECOMMENDED to provide the modifications expected in the sample, including the amino acid affected, whether the modification is variable or fixed (custom and annotated modifications are also supported), and other properties such as the mass shift/delta mass and the position (e.g. anywhere in the sequence). + +The RECOMMENDED name of the column for sample modification parameters is: comment[modification parameters]. + +The column name modification parameters corresponds to the ontology term MS:1001055. + +For each modification, different properties are captured using a key=value pair structure, including the name, position, etc. All the possible (optional) features available for modification parameters are: + +|=== +|Property |Key |Example | Mandatory(:white_check_mark:)/Optional(:zero:) |comment + +|Name of the Modification| NT | NT=Acetylation | :white_check_mark: | Name of the term, in this particular case the modification; for custom modifications it can be a name defined by the user. +|Modification Accession | AC |AC=UNIMOD:1 | :zero: | Accession in an external database; UNIMOD or PSI-MOD are supported. +|Chemical Formula | CF | CF=H(2)C(2)O | :zero: | This is the chemical formula of the added or removed atoms. For the formula composition please follow the guidelines from http://www.unimod.org/names.html[UNIMOD] +|Modification Type | MT | MT=Fixed | :zero: | This specifies which modification group the modification should be included with. Choose from the following options: [Fixed, Variable, Annotated]. _Annotated_ is used to search for all the occurrences of the modification in an annotated protein database file like UniProt XML or PEFF. +|Position of the modification in the Polypeptide | PP | PP=Any N-term | :zero: | Choose from the following options: [Anywhere, Protein N-term, Protein C-term, Any N-term, Any C-term]. Default is _Anywhere_. +|Target Amino acid | TA | TA=S,T,Y | :white_check_mark: | The target amino acid letter. 
If the modification targets multiple sites, they can be separated by `,`. +|Monoisotopic Mass | MM | MM=42.010565 | :zero: | The exact atomic mass shift produced by the modification. Please use at least 5 decimal places of accuracy. This should only be used if the chemical formula of the modification is not known. If the chemical formula is specified, the monoisotopic mass will be overwritten by the calculated monoisotopic mass. +|Target Site | TS | TS=N[^P][ST] | :zero: | For some software, it is important to capture complex rules for modification sites; these use cases SHOULD be specified as regular expressions. +|=== + +To indicate the modification name, we RECOMMEND using the UNIMOD interim name or the PSI-MOD name. For custom modifications, we RECOMMEND using an intuitive name. If the PTM is unknown (custom), the Chemical Formula or Monoisotopic Mass MUST be annotated. + +An example of an SDRF-Proteomics file with sample modifications annotated, where each modification needs an extra column: + +|=== +| |comment[modification parameters] | comment[modification parameters] + +|sample 1| NT=Glu->pyro-Glu; MT=fixed; PP=Anywhere;AC=Unimod:27; TA=E | NT=Oxidation; MT=Variable; TA=M +|=== + +[[cleavage-agents]] +==== Cleavage agents + +The REQUIRED _comment[cleavage agent details]_ property is used to capture the enzyme information. Similar to protein modifications, a key=value pair representation is used to encode the following properties for each enzyme: + +|=== +|Property |Key |Example | Mandatory(:white_check_mark:)/Optional(:zero:) | comment +|Name of the Enzyme | NT | NT=Trypsin | :white_check_mark: | Name of the term, in this particular case the name of the enzyme. +|Enzyme Accession | AC |AC=MS:1001251 | :zero: | Accession in an external PSI-MS Ontology definition under the following category https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMS_1001045[Cleavage agent name]. 
+|Cleavage site regular expression | CS | CS=(?<=[KR])(?!P) | :zero: | The cleavage site defined as a regular expression. +|=== + +An example of an SDRF-Proteomics file with an annotated endopeptidase: + +|=== +| source name |...|comment[cleavage agent details] + +|sample 1| ....|NT=Trypsin;AC=MS:1001251 +|=== + +NOTE: If no endopeptidase is used, for example in the case of top-down/intact protein experiments, the value SHOULD be ‘not applicable’. + +[[mass-tolerances]] +==== Precursor and Fragment mass tolerances + +For proteomics experiments, it is important to encode the different mass tolerances (for precursor and fragment ions). + +|=== +| |comment[fragment mass tolerance] | comment[precursor mass tolerance] + +|sample 1| 0.6 Da | 20 ppm +|=== + +Units for the mass tolerances (either Da or ppm) MUST be provided. + +[[study-variables]] +== SDRF-Proteomics study variables + +The variable/property under study SHOULD be highlighted using the factor value category. For example, _factor value[tissue]_ is used when the user wants to compare expression across different tissues. You can add multiple variables under study by providing multiple factor value columns. + +|=== +|factor value | :zero: | 0..\* | “factor value” columns SHOULD indicate which experimental factor/variable is used as the hypothesis to perform the data analysis. The “factor value” columns SHOULD occur after all characteristics and the attributes of the samples. | factor value[phenotype] +|=== + +[[conventions]] +== SDRF-Proteomics conventions + +Conventions define how to encode some particular information in the file format in specific use cases. Conventions define a set of new columns that are needed to represent a particular use case or experiment type (e.g. a phosphorylation dataset). In addition, conventions define how some specific free-text columns (values that are not defined as ontology terms) should be written. 
Conventions are compiled from the proteomics community using https://github.com/bigbio/proteomics-metadata-standard/issues or pull requests, and will be added to updated versions of this specification document in the future. + +In the conventions section <>, the columns are described and defined, while in the use cases and templates section <> the columns needed to describe a use case are specified. + +[[age-encoding]] +=== How to encode age + +One of the characteristics of a patient sample can be the age of the individual. It is RECOMMENDED to provide the age in the following format: {X}Y{X}M{X}D. Some valid examples are: + +- 40Y (forty years) +- 40Y5M (forty years and 5 months) +- 40Y5M2D (forty years, 5 months, and 2 days) + +When needed, weeks can also be used: 8W (eight weeks) + +Age interval: + +Sometimes the sample does not have an exact age but an age range. To annotate an age range the following standard is RECOMMENDED: + + 40Y-85Y + +This means that the subject (sample) is between 40 and 85 years old. Other temporal information can be encoded similarly. + +[[phos-pho]] +=== Phosphoproteomics and other post-translational modifications enriched studies + +In PTM-enriched experiments, the _characteristics[enrichment process]_ SHOULD be provided. The different values already included in EFO are: + +- enrichment of phosphorylated Protein +- enrichment of glycosylated Protein + +This characteristic can be used as a _factor value[enrichment process]_ to differentiate the expression between proteins in the phospho-enriched sample compared with the control. + +[[pooled-samples]] +=== Pooled samples + +When multiple samples are pooled into one, the general approach is to annotate them separately, abiding by the general rule: one row stands for one sample-to-file relationship. In this case, multiple rows are created for the corresponding data file, much like in <>. 
+ +One possible exception is made for the case when one channel e.g., in a TMT/iTRAQ multiplexed experiment is used for a sample pooled from all other channels, typically for normalization purposes. In this case, it is not necessary to repeat all sample annotations. Instead, a special characteristic can be used: + +|=== +|source name |characteristics[pooled sample] | assay name | comment[label] | comment[data file] + +| sample 1 | not pooled | run 1 | TMT131 | file01.raw +| sample 2 | not pooled | run 1 | TMT131C | file01.raw +| sample 10 | SN=sample 1,sample 2, ... sample 9| run 1 | TMT128 | file01.raw +|=== + +`SN` stands for source names and lists `source name` fields of samples that are annotated in the same file and _used in the same experiment and same MS run_. + +Another possible value for _characteristics[pooled sample]_ is a string `pooled` for cases when it is known that a sample is pooled but the individual samples cannot be annotated. + +[[derived-samples]] +=== Derived samples (such as patient-derived xenografts) + +In cancer research, patient-derived xenografts (PDX) are commonly used. In those, the patient’s tumor is transplanted into another organism, usually a mouse. In these cases, the metadata, such as age and sex, MUST refer to the original patient and not the mouse. + +PDX samples SHOULD be annotated by using the column name _characteristics[xenograft]_. The value should then describe the growth condition, such as ‘pancreatic cancer cells grown in nude mice’. + +For experiments where both the PDX and the original tumor are measured, the PDX entry SHOULD reference the respective tumor sample’s source name in the _characteristics[source name]_ column. Non-PDX samples SHOULD contain the “not applicable” value in the _characteristics[xenograft]_ and the characteristics[source name] column. Both tumor and PDX samples SHOULD reference the patient using the characteristics[individual] column. This column SHOULD contain some sort of patient identifier. 
+ +[[spiked-in]] +=== Spiked-in samples + +There are multiple scenarios when a sample is spiked with additional analytes. Peptides, proteins, or mixtures can be added to the sample in controlled amounts to provide a standard or ground truth for quantification, or for retention time alignment, etc. + +To include information about the spiked compounds, use _characteristics[spiked compound]_. The information is provided in key-value pairs. Here are the keys and values that SHOULD be provided: + +|=== +|Key | Meaning | Examples | Peptide | Protein | Mixture | Other + +|SP | Species | Escherichia coli K-12 | :zero: | :zero: | :zero: | :zero: +|CT | Compound type | protein, peptide, mixture, other | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: +|QY | Quantity (molar or mass) | 10 mg, 20 nmol | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: +|PS | Peptide sequence | PEPTIDESEQ |:white_check_mark: | | | +|AC | Uniprot Accession | A9WZ33 | | :white_check_mark: | | +|CN | Compound name | `iRT mixture`, `substance name` | | :zero: | :zero: | :zero: +|CV | Compound vendor | `in-house` or vendor name | :zero: | :zero: | :white_check_mark: | :zero: +|CS | Compound specification URI | `http://vendor.web.site/specs/commercial-kit.xlsx` | :zero: | :zero: | :zero: | :zero: +|CF | Compound formula | `C2H2O` | | | | :zero: +|=== + +In addition to specifying the component and its quantity, the injected mass of the main sample SHOULD be specified as _characteristics[mass]_. + +An example of an SDRF-Proteomics file for a sample spiked with a peptide would be: + +|=== +|characteristics[mass] | characteristics[spiked compound] +|1 ug | CT=peptide;PS=PEPTIDESEQ;QY=10 fmol +|=== + +For multiple spiked components, the column _characteristics[spiked compound]_ may be repeated. + +If the spiked component is another biological sample (e.g. **E. 
coli** lysate spiked into a human sample), then the spiked component MUST be annotated in its own row. Both components of the sample SHOULD have `characteristics[mass]` specified. Inclusion of _characteristics[spiked compound]_ is optional in this case; if provided, it SHOULD be the string `spiked` for the spiked sample. + +[[synthetic-peptide]] +=== Synthetic peptide libraries + +Synthetic peptide libraries are commonly used in proteomics, and MS use cases include: + +• Benchmarking of analytical and bioinformatics methods and algorithms. +• Improvement of peptide identification/quantification using spectral libraries. + +When describing synthetic peptide libraries, most of the sample metadata can be declared as “not applicable”. However, some authors may still annotate the organism, for example because they know the library has been designed from the peptides of a specific species; see the example Synthetic Peptide experiment (https://github.com/bigbio/proteomics-metadata-standard/blob/master/annotated-projects/PXD000759/sdrf.tsv). + +It is important to annotate that the sample is a synthetic peptide library; this can be done by adding the _characteristics[synthetic peptide]_ column. The possible values are “synthetic” or “not synthetic”. + +[[normal-healthy]] +=== Normal and healthy samples + +Samples from healthy patients or individuals normally appear in manuscripts and annotations as healthy or normal. We RECOMMEND using the word “normal”, mapped to the PATO term PATO_0000461, which is available in EFO. 
Example: + +|=== +| source name | characteristics[organism] | characteristics[organism part] | characteristics[phenotype] | characteristics[compound] | factor value[phenotype] + +|sample_treat | homo sapiens | Whole Organism | necrotic tissue | drug A | necrotic tissue +|sample_control | homo sapiens | Whole Organism | normal | none | normal +|=== + +[[sample-technical-biological-replicates]] +=== Encoding sample technical and biological replicates + +Different measurements of the same biological sample are often categorized as (i) Technical or (ii) Biological replicates, based on whether they are (i) matched on all variables, e.g. same sample and same protocol; or (ii) different samples matched on explanatory variable(s), e.g. different patients receiving a placebo, in a placebo vs. drug trial. Technical and biological replicates have different levels of independence, which must be taken into account during data interpretation. + +For a given experiment, there are different levels to which samples can be matched - e.g., same sample, sample protocol, covariates - the definition of technical replicate can therefore vary based on the number of variables included. In addition, an experiment might be used in multiple models with different explanatory variable(s), and biological replicates in one model would not be replicates in another. Therefore, Technical vs. Biological considerations, while sometimes relevant to analytical and statistical interpretation, fall beyond the scope of the SDRF-Proteomics format. However, data providers are encouraged to provide any identifier - e.g. Biological_replicate_1, Technical_replicate_2 - that would help link the samples to their analytical and statistical analysis as comments. 
A good starting point for the SDRF-Proteomics specification is the following: + +**Technical replicate**: It is defined as repeated measurements of the same sample that represent independent measures of the random noise associated with protocols or equipment [4]. + +In MS-based proteomics, a technical replicate can be, for example, performing the full sample preparation from extraction to MS multiple times to control for variability in the instrument and sample preparation. Another valid example would be to replicate only one part of the analytical method, for example, running the sample twice on the LC-MS/MS. Technical replicates indicate whether measurements are scientifically robust or noisy, and how large the measured effect must be to stand out above that noise. + +In the following example, only if the technical replicate column is provided can one distinguish quantitative values of the same fraction but different technical replicates. + +|=== +| source name | assay name | comment[label] | comment[fraction identifier] | comment[technical replicate] | comment[data file] +| Sample 1 | run 1 | label free sample | 1 | 1 | F1_TR1.RAW +| Sample 1 | run 2 | label free sample | 2 | 1 | F2_TR1.RAW +| Sample 1 | run 3 | label free sample | 1 | 2 | F1_TR2.RAW +| Sample 1 | run 4 | label free sample | 2 | 2 | F2_TR2.RAW +|=== + +The _comment[technical replicate]_ column is MANDATORY. Please fill it with 1 if technical replicates were not performed in a study. + +**Biological replicate**: parallel measurements of biologically distinct samples that capture biological variation, which may itself be a subject of study or a source of noise. Biological replicates address if and how widely the results of an experiment can be generalized. For example, repeating a particular assay with independently generated samples, individuals or samples derived from various cell types, tissue types, or organisms, to see if similar results can be observed. 
Context is critical, and appropriate biological replicates will indicate whether an experimental effect holds under a different set of biological variables or is itself an anomaly. + +In SDRF-Proteomics, biological replicates can be annotated using the _characteristics[biological replicate]_ column, which is MANDATORY. Please fill it with 1 if biological replicates were not performed in a study. + +An example with explicit annotation of the biological replicates can be found here: + +- https://github.com/bigbio/proteomics-metadata-standard/blob/c3a56b076ef381280dfcb0140d2520126ace53ff/annotated-projects/PXD006401/sdrf.tsv + +[[sample-prep]] +=== Sample preparation properties + +In order to encode sample preparation details, we strongly RECOMMEND specifying the following parameters. + +- **comment [depletion]**: The removal of specific components of a complex mixture of proteins or peptides based on some specific property of those components. The values of the column are `no depletion` or `depletion`. In the case of depletion, `depleted fraction` or `bound fraction` can be specified. + +- **comment [reduction reagent]**: The chemical reagent that is used to break disulfide bonds in proteins. The values of the column are under the term https://www.ebi.ac.uk/ols/ontologies/pride/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPRIDE_0000607&viewMode=All&siblings=false[reduction reagent]. For example, DTT. + +- **comment [alkylation reagent]**: The alkylation reagent that is used to covalently modify cysteine SH-groups after reduction, preventing them from forming unwanted novel disulfide bonds. The values of the column are under the term https://www.ebi.ac.uk/ols/ontologies/pride/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPRIDE_0000598&viewMode=All&siblings=false[alkylation reagent]. For example, IAA. + +- **comment [fractionation method]**: The fractionation method used to separate the sample.
The values of this term can be read under PRIDE ontology term https://www.ebi.ac.uk/ols/ontologies/pride/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPRIDE_0000550[Fractionation method]. For example, Off-gel electrophoresis. + +[[fragment-proper]] +=== MS/MS properties + +- **comment[collision energy]**: Collision energy can be provided as a non-normalized (10000 eV) or a normalized (1000 NCE) value. + +- **comment[dissociation method]**: This property provides information about the fragmentation method, such as HCD or CID. The values of the column are under the term https://www.ebi.ac.uk/ols/ontologies/ms/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMS_1000044&viewMode=All&siblings=false[dissociation method]. + +[[raw-file-uri]] +=== RAW file URI + +We RECOMMEND including the public URI of the file if available. For example, for ProteomeXchange datasets, the URI from the FTP can be provided: + +|=== +| |... |comment[file uri] + +|sample 1| ... |https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/09/PXD005946/000261_C05_P0001563_A00_B00K_R1.RAW +|=== + +[[multiple-projects]] +=== Multiple projects into one annotation file + +Curators can decide to annotate multiple ProteomeXchange datasets into one large SDRF-Proteomics file for reanalysis purposes. If that is the case, it is RECOMMENDED to use the _comment[proteomexchange accession number]_ column to differentiate between datasets. + +[[data-acquisition-method]] +=== Data acquisition method: DDA and DIA and others + +Proteomics data acquisition can be performed in two main ways: Data Dependent Acquisition (DDA) or Data Independent Acquisition (DIA). The SDRF-Proteomics file format captures the acquisition method in the _comment[proteomics data acquisition method]_ column.
The following values are RECOMMENDED for DDA and DIA: + +- https://www.ebi.ac.uk/ols4/ontologies/pride/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPRIDE_0000627[data-dependent acquisition] +- https://www.ebi.ac.uk/ols4/ontologies/pride/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPRIDE_0000450[data-independent acquisition] + - https://www.ebi.ac.uk/ols4/ontologies/pride/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPRIDE_0000650?lang=en[diaPASEF] + - https://www.ebi.ac.uk/ols4/ontologies/pride/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPRIDE_0000447[SWATH MS] +- https://www.ebi.ac.uk/ols4/ontologies/pride/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPRIDE_0000629[parallel reaction monitoring] +- https://www.ebi.ac.uk/ols4/ontologies/pride/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPRIDE_0000630[selected reaction monitoring] + +TIP: If the SDRF does not specify the data acquisition method in the _comment[proteomics data acquisition method]_ column, it is assumed that DDA was used, as it is the most common acquisition method in proteomics. + +You can find an example of a DIA experiment at the following link: https://github.com/bigbio/proteomics-sample-metadata/blob/master/annotated-projects/PXD018830/PXD018830-DIA.sdrf.tsv[DIA example] + +[[dia-ms1-scan]] +==== Data Independent Acquisition - Scan window limits + +In addition to the general _comment[proteomics data acquisition method]_ column, the SDRF-Proteomics file format can capture other DIA-specific properties. The following properties are RECOMMENDED for DIA: + +- _comment[MS1 scan range]_: The MS1 scan range is the m/z range used for the DIA acquisition. The values are expressed in m/z units.
+ +Example: + +|=== +|assay name | comment[MS1 scan range] | comment[data file] +|run 1 | 400m/z - 1200m/z | FILE_R1.RAW +|run 2 | 400m/z - 1200m/z | FILE_R2.RAW +|=== + +TIP: While the specification recommends writing the MS1 scan range as an interval (e.g. 400m/z - 1200m/z), it is also possible to provide each limit as a single value (e.g. 400m/z) using two dedicated columns. In those cases, write the lower limit in _comment[scan window lower limit]_ and the upper limit in _comment[scan window upper limit]_. + +[[use-cases]] +== SDRF-Proteomics use-cases representation (templates) + +Please visit the following document to read about SDRF-Proteomics use cases, templates, and https://github.com/bigbio/proteomics-metadata-standard/blob/master/templates/README.adoc[checklists]. + +[[example-annotated-datasets]] +== Examples of annotated datasets + +|=== +|Dataset Type | ProteomeXchange / Pubmed Accession | SDRF URL +|Label-free | PXD008934 | https://github.com/bigbio/proteomics-metadata-standard/tree/master/annotated-projects/PXD008934 +|TMT | PXD017710 | https://github.com/bigbio/proteomics-metadata-standard/tree/master/annotated-projects/PXD017710 diff --git a/.github/workflows/branch.yml b/.github/workflows/branch.yml new file mode 100644 index 0000000..9d8deda --- /dev/null +++ b/.github/workflows/branch.yml @@ -0,0 +1,46 @@ +name: nf-core branch protection +# This workflow is triggered on PRs to `main`/`master` branch on the repository +# It fails when someone tries to make a PR against the nf-core `main`/`master` branch instead of `dev` +on: + pull_request_target: + branches: + - main + - master + +jobs: + test: + runs-on: ubuntu-latest + steps: + # PRs to the nf-core repo main/master branch are only ok if coming from the nf-core repo `dev` or any `patch` branches + - name: Check PRs + if: github.repository == 'bigbio/quantmsdiann' + run: | + { [[ ${{github.event.pull_request.head.repo.full_name }} == bigbio/quantmsdiann ]]
&& [[ $GITHUB_HEAD_REF == "dev" ]]; } || [[ $GITHUB_HEAD_REF == "patch" ]] + + # If the above check failed, post a comment on the PR explaining the failure + # NOTE - this doesn't currently work if the PR is coming from a fork, due to limitations in GitHub actions secrets + - name: Post PR comment + if: failure() + uses: mshick/add-pr-comment@b8f338c590a895d50bcbfa6c5859251edc8952fc # v2 + with: + message: | + ## This PR is against the `${{github.event.pull_request.base.ref}}` branch :x: + + * Do not close this PR + * Click _Edit_ and change the `base` to `dev` + * This CI test will remain failed until you push a new commit + + --- + + Hi @${{ github.event.pull_request.user.login }}, + + It looks like this pull-request has been made against the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) ${{github.event.pull_request.base.ref}} branch. + The ${{github.event.pull_request.base.ref}} branch on nf-core repositories should always contain code from the latest release. + Because of this, PRs to ${{github.event.pull_request.base.ref}} are only allowed if they come from the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `dev` branch. + + You do not need to close this PR, you can change the target branch to `dev` by clicking the _"Edit"_ button at the top of this page. + Note that even after this, the test will continue to show as failing until you push a new commit. + + Thanks again for your contribution! 
+ repo-token: ${{ secrets.GITHUB_TOKEN }} + allow-repeats: false diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 0000000..39597f0 --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,110 @@ +name: nf-core CI +# This workflow runs the pipeline with the minimal test dataset to check that it completes without any syntax errors +on: + push: + branches: + - dev + - master + pull_request: + release: + types: [published] + workflow_dispatch: + +env: + NXF_ANSI_LOG: false + NXF_SINGULARITY_CACHEDIR: ${{ github.workspace }}/.singularity + NXF_SINGULARITY_LIBRARYDIR: ${{ github.workspace }}/.singularity + +concurrency: + group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}" + cancel-in-progress: true + +jobs: + test: + env: + NXF_ANSI_LOG: false + CAPSULE_LOG: none + TEST_PROFILE: ${{ matrix.test_profile }} + EXEC_PROFILE: ${{ matrix.exec_profile }} + + name: "CI [${{ matrix.test_profile }}] DIA-NN=1.8.1 NXF=${{ matrix.NXF_VER }}" + if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'bigbio/quantmsdiann') }} + runs-on: ubuntu-latest + strategy: + fail-fast: false + matrix: + NXF_VER: + - "25.04.0" + test_profile: ["test_dia", "test_dia_dotd"] + exec_profile: ["docker"] # extended ci tests singularity. 
+ + steps: + - name: Check out pipeline code + uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4 + with: + fetch-depth: 0 + + - uses: actions/setup-java@v4 + with: + distribution: "temurin" + java-version: "17" + + - name: Set up Nextflow + uses: nf-core/setup-nextflow@v2 + with: + version: "${{ matrix.NXF_VER }}" + + - name: Set up Singularity + if: matrix.exec_profile == 'singularity' + run: | + mkdir -p $NXF_SINGULARITY_CACHEDIR + mkdir -p $NXF_SINGULARITY_LIBRARYDIR + + - name: Disk space cleanup + uses: jlumbroso/free-disk-space@v1.3.1 + + - name: Run pipeline with test data in docker/singularity profile + if: github.event.pull_request.base.ref != 'master' + run: | + nextflow run ${GITHUB_WORKSPACE} -profile $TEST_PROFILE,$EXEC_PROFILE,dev --outdir ${TEST_PROFILE}_${EXEC_PROFILE}_results + + - name: Gather failed logs + if: failure() || cancelled() + run: | + mkdir failed_logs + failed=$(grep "FAILED" ${TEST_PROFILE}_${EXEC_PROFILE}_results/pipeline_info/execution_trace.txt | cut -f 2) + while read -r line ; do cp $(ls work/${line}*/*.log) failed_logs/ || true ; done <<< "$failed" + + - name: Set timestamp + run: | + echo "ARTIFACT_TIMESTAMP=$(date +%s)" >> $GITHUB_ENV + + - uses: actions/upload-artifact@v4 + if: failure() || cancelled() + name: Upload failed logs + with: + name: failed_logs_${{ matrix.test_profile }}_${{ matrix.NXF_VER }}_${{ github.run_id }}_${{ github.run_attempt }}_${{ env.ARTIFACT_TIMESTAMP }} + include-hidden-files: true + path: failed_logs + overwrite: false + if-no-files-found: warn + + - uses: actions/upload-artifact@v4 + if: success() + name: Upload results + with: + name: ${{ matrix.test_profile }}_${{ matrix.NXF_VER }}_${{ github.run_id }}_${{ github.run_attempt }}_${{ env.ARTIFACT_TIMESTAMP }}_results + include-hidden-files: true + path: ${{ matrix.test_profile }}_${{ matrix.exec_profile }}_results + overwrite: false + if-no-files-found: warn + + - uses: actions/upload-artifact@v4 + if: always() + name: Upload 
log + with: + name: nextflow_${{ matrix.test_profile }}_${{ matrix.NXF_VER }}_${{ github.run_id }}_${{ github.run_attempt }}_${{ env.ARTIFACT_TIMESTAMP }}.log + include-hidden-files: true + path: .nextflow.log + overwrite: false + if-no-files-found: warn diff --git a/.github/workflows/clean-up.yml b/.github/workflows/clean-up.yml new file mode 100644 index 0000000..6adb0ff --- /dev/null +++ b/.github/workflows/clean-up.yml @@ -0,0 +1,24 @@ +name: "Close user-tagged issues and PRs" +on: + schedule: + - cron: "0 0 * * 0" # Once a week + +jobs: + clean-up: + runs-on: ubuntu-latest + permissions: + issues: write + pull-requests: write + steps: + - uses: actions/stale@5f858e3efba33a5ca4407a664cc011ad407f2008 # v10 + with: + stale-issue-message: "This issue has been tagged as awaiting-changes or awaiting-feedback by an nf-core contributor. Remove stale label or add a comment otherwise this issue will be closed in 20 days." + stale-pr-message: "This PR has been tagged as awaiting-changes or awaiting-feedback by an nf-core contributor. Remove stale label or add a comment if it is still useful." + close-issue-message: "This issue was closed because it has been tagged as awaiting-changes or awaiting-feedback by an nf-core contributor and then staled for 20 days with no activity." + days-before-stale: 30 + days-before-close: 20 + days-before-pr-close: -1 + any-of-labels: "awaiting-changes,awaiting-feedback" + exempt-issue-labels: "WIP" + exempt-pr-labels: "WIP" + repo-token: "${{ secrets.GITHUB_TOKEN }}" diff --git a/.github/workflows/extended_ci.yml b/.github/workflows/extended_ci.yml new file mode 100644 index 0000000..42a9f48 --- /dev/null +++ b/.github/workflows/extended_ci.yml @@ -0,0 +1,279 @@ +name: nf-core extended CI +# Job dependency chain: +# test-default (1.8.1, fast, public) → test-latest (2.2.0, all features) +# → test-singularity (1.8.1, Singularity) +# If test-default fails, downstream jobs are skipped to save resources. 
+on: + push: + branches: + - dev + - master + pull_request: + release: + types: [published] + workflow_dispatch: + +env: + NXF_ANSI_LOG: false + NXF_SINGULARITY_CACHEDIR: ${{ github.workspace }}/.singularity + NXF_SINGULARITY_LIBRARYDIR: ${{ github.workspace }}/.singularity + +concurrency: + group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}" + cancel-in-progress: true + +jobs: + # ────────────────────────────────────────────────────────────────────────── + # Stage 1: Default tests — DIA-NN 1.8.1 (public, fast, no auth) + # Must pass before any other job runs. + # ────────────────────────────────────────────────────────────────────────── + test-default: + name: "Default [${{ matrix.test_profile }}] DIA-NN=1.8.1 NXF=${{ matrix.NXF_VER }}" + if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'bigbio/quantmsdiann') }} + runs-on: ubuntu-latest + strategy: + fail-fast: true + matrix: + NXF_VER: ["25.04.0", "latest-everything"] + test_profile: ["test_dia", "test_dia_dotd"] + env: + NXF_ANSI_LOG: false + CAPSULE_LOG: none + TEST_PROFILE: ${{ matrix.test_profile }} + EXEC_PROFILE: docker + + steps: + - name: Checkout repository + uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4 + + - name: Set up Java 17 + uses: actions/setup-java@v4 + with: + distribution: "temurin" + java-version: "17" + + - name: Set up Nextflow + uses: nf-core/setup-nextflow@v2 + with: + version: "${{ matrix.NXF_VER }}" + + - name: Disk space cleanup + uses: jlumbroso/free-disk-space@v1.3.1 + + - name: Run pipeline with test data + if: github.ref != 'refs/heads/master' && github.event.pull_request.base.ref != 'master' + run: | + nextflow run ${GITHUB_WORKSPACE} -profile $TEST_PROFILE,$EXEC_PROFILE,dev --outdir ${TEST_PROFILE}_${EXEC_PROFILE}_results + + - name: Run pipeline with test data (master branch) + if: github.ref == 'refs/heads/master' || github.event.pull_request.base.ref == 'master' + run: | + 
nextflow run ${GITHUB_WORKSPACE} -profile $TEST_PROFILE,$EXEC_PROFILE --outdir ${TEST_PROFILE}_${EXEC_PROFILE}_results + + - name: Gather failed logs + if: failure() || cancelled() + run: | + mkdir -p failed_logs + if [ -f "${TEST_PROFILE}_${EXEC_PROFILE}_results/pipeline_info/execution_trace.txt" ]; then + failed=$(grep "FAILED" ${TEST_PROFILE}_${EXEC_PROFILE}_results/pipeline_info/execution_trace.txt | cut -f 2) + while read -r line ; do cp $(ls work/${line}*/*.log) failed_logs/ || true ; done <<< "$failed" + fi + + - name: Set timestamp + if: always() + run: echo "TS=$(date +%s)" >> $GITHUB_ENV + + - uses: actions/upload-artifact@v4 + if: failure() || cancelled() + name: Upload failed logs + with: + name: failed_logs_${{ matrix.test_profile }}_${{ matrix.NXF_VER }}_${{ env.TS }} + include-hidden-files: true + path: failed_logs + overwrite: false + if-no-files-found: warn + + - uses: actions/upload-artifact@v4 + if: always() + name: Upload log + with: + name: nextflow_${{ matrix.test_profile }}_${{ matrix.NXF_VER }}_${{ env.TS }}.log + include-hidden-files: true + path: .nextflow.log + overwrite: false + if-no-files-found: warn + + # ────────────────────────────────────────────────────────────────────────── + # Stage 2a: Latest DIA-NN (2.2.0) — all features + # Only runs after test-default passes. 
+ # ────────────────────────────────────────────────────────────────────────── + test-latest: + name: "Latest [${{ matrix.test_profile }}] DIA-NN=2.2.0" + needs: test-default + runs-on: ubuntu-latest + strategy: + fail-fast: false + matrix: + test_profile: ["test_latest_dia", "test_dia_quantums", "test_dia_parquet"] + env: + NXF_ANSI_LOG: false + CAPSULE_LOG: none + TEST_PROFILE: ${{ matrix.test_profile }} + EXEC_PROFILE: docker + + steps: + - name: Checkout repository + uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4 + + - name: Set up Java 17 + uses: actions/setup-java@v4 + with: + distribution: "temurin" + java-version: "17" + + - name: Set up Nextflow + uses: nf-core/setup-nextflow@v2 + with: + version: "25.04.0" + + - name: Log in to GitHub Container Registry + env: + GHCR_TOKEN: ${{ secrets.GHCR_TOKEN }} + if: env.GHCR_TOKEN != '' + run: | + echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u ${{ secrets.GHCR_USERNAME }} --password-stdin + + - name: Disk space cleanup + uses: jlumbroso/free-disk-space@v1.3.1 + + - name: Run pipeline with test data + if: github.ref != 'refs/heads/master' && github.event.pull_request.base.ref != 'master' + run: | + nextflow run ${GITHUB_WORKSPACE} -profile $TEST_PROFILE,$EXEC_PROFILE,dev --outdir ${TEST_PROFILE}_${EXEC_PROFILE}_results + + - name: Run pipeline with test data (master branch) + if: github.ref == 'refs/heads/master' || github.event.pull_request.base.ref == 'master' + run: | + nextflow run ${GITHUB_WORKSPACE} -profile $TEST_PROFILE,$EXEC_PROFILE --outdir ${TEST_PROFILE}_${EXEC_PROFILE}_results + + - name: Gather failed logs + if: failure() || cancelled() + run: | + mkdir -p failed_logs + if [ -f "${TEST_PROFILE}_${EXEC_PROFILE}_results/pipeline_info/execution_trace.txt" ]; then + failed=$(grep "FAILED" ${TEST_PROFILE}_${EXEC_PROFILE}_results/pipeline_info/execution_trace.txt | cut -f 2) + while read -r line ; do cp $(ls work/${line}*/*.log) failed_logs/ || true ; done <<< "$failed" + fi 
+ + - name: Set timestamp + if: always() + run: echo "TS=$(date +%s)" >> $GITHUB_ENV + + - uses: actions/upload-artifact@v4 + if: failure() || cancelled() + name: Upload failed logs + with: + name: failed_logs_latest_${{ matrix.test_profile }}_${{ env.TS }} + include-hidden-files: true + path: failed_logs + overwrite: false + if-no-files-found: warn + + - uses: actions/upload-artifact@v4 + if: always() + name: Upload log + with: + name: nextflow_latest_${{ matrix.test_profile }}_${{ env.TS }}.log + include-hidden-files: true + path: .nextflow.log + overwrite: false + if-no-files-found: warn + + # ────────────────────────────────────────────────────────────────────────── + # Stage 2b: Singularity — default tests only (public containers) + # Only runs after test-default passes. + # ────────────────────────────────────────────────────────────────────────── + test-singularity: + name: "Singularity [${{ matrix.test_profile }}] DIA-NN=1.8.1 NXF=${{ matrix.NXF_VER }}" + needs: test-default + if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'bigbio/quantmsdiann') }} + runs-on: ubuntu-latest + strategy: + fail-fast: false + matrix: + NXF_VER: ["25.04.0", "latest-everything"] + test_profile: ["test_dia", "test_dia_dotd"] + env: + NXF_ANSI_LOG: false + CAPSULE_LOG: none + TEST_PROFILE: ${{ matrix.test_profile }} + EXEC_PROFILE: singularity + steps: + - name: Check out pipeline code + uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4 + with: + fetch-depth: 0 + + - uses: actions/setup-java@v4 + with: + distribution: "temurin" + java-version: "17" + + - name: Set up Nextflow + uses: nf-core/setup-nextflow@v2 + with: + version: "${{ matrix.NXF_VER }}" + + - name: Set up Singularity + run: | + mkdir -p $NXF_SINGULARITY_CACHEDIR + mkdir -p $NXF_SINGULARITY_LIBRARYDIR + + - name: Install Singularity with defaults + uses: singularityhub/install-singularity@main + + - name: Disk space cleanup + uses: 
jlumbroso/free-disk-space@v1.3.1 + + - name: Run pipeline with test data + if: github.ref != 'refs/heads/master' && github.event.pull_request.base.ref != 'master' + run: | + nextflow run ${GITHUB_WORKSPACE} -profile $TEST_PROFILE,$EXEC_PROFILE,dev --outdir ${TEST_PROFILE}_${EXEC_PROFILE}_results + + - name: Run pipeline with test data (master branch) + if: github.ref == 'refs/heads/master' || github.event.pull_request.base.ref == 'master' + run: | + nextflow run ${GITHUB_WORKSPACE} -profile $TEST_PROFILE,$EXEC_PROFILE --outdir ${TEST_PROFILE}_${EXEC_PROFILE}_results + + - name: Gather failed logs + if: failure() || cancelled() + run: | + mkdir -p failed_logs + if [ -f "${TEST_PROFILE}_${EXEC_PROFILE}_results/pipeline_info/execution_trace.txt" ]; then + failed=$(grep "FAILED" ${TEST_PROFILE}_${EXEC_PROFILE}_results/pipeline_info/execution_trace.txt | cut -f 2) + while read -r line ; do cp $(ls work/${line}*/*.log) failed_logs/ || true ; done <<< "$failed" + fi + + - name: Set timestamp + if: always() + run: echo "TS=$(date +%s)" >> $GITHUB_ENV + + - uses: actions/upload-artifact@v4 + if: failure() || cancelled() + name: Upload failed logs + with: + name: failed_logs_sing_${{ matrix.test_profile }}_${{ matrix.NXF_VER }}_${{ env.TS }} + include-hidden-files: true + path: failed_logs + overwrite: false + if-no-files-found: warn + + - uses: actions/upload-artifact@v4 + if: always() + name: Upload log + with: + name: nextflow_sing_${{ matrix.test_profile }}_${{ matrix.NXF_VER }}_${{ env.TS }}.log + include-hidden-files: true + path: .nextflow.log + overwrite: false + if-no-files-found: warn diff --git a/.github/workflows/fix_linting.yml b/.github/workflows/fix_linting.yml new file mode 100644 index 0000000..7990cbe --- /dev/null +++ b/.github/workflows/fix_linting.yml @@ -0,0 +1,94 @@ +name: Fix linting from a comment +on: + issue_comment: + types: [created] + +jobs: + fix-linting: + # Only run if comment is on a PR with the main repo, and if it contains the magic 
keywords + if: > + contains(github.event.comment.html_url, '/pull/') && + contains(github.event.comment.body, '@nf-core-bot fix linting') && + github.repository == 'bigbio/quantmsdiann' + runs-on: ubuntu-latest + permissions: + contents: write + pull-requests: write + issues: write + steps: + # Use the default GITHUB_TOKEN to check out so we can push later + - uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5 + with: + token: ${{ secrets.GITHUB_TOKEN }} + + # indication that the linting is being fixed + - name: React on comment + uses: peter-evans/create-or-update-comment@e8674b075228eee787fea43ef493e45ece1004c9 # v5 + with: + comment-id: ${{ github.event.comment.id }} + reactions: eyes + + # Action runs on the issue comment, so we don't get the PR by default + # Use the gh cli to check out the PR + - name: Checkout Pull Request + run: gh pr checkout ${{ github.event.issue.number }} + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + + # Install and run pre-commit + - uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6 + with: + python-version: "3.12" + + - name: Install pre-commit + run: pip install pre-commit + + - name: Run pre-commit + id: pre-commit + run: pre-commit run --all-files + continue-on-error: true + + # indication that the linting has finished + - name: react if linting finished successfully + if: steps.pre-commit.outcome == 'success' + uses: peter-evans/create-or-update-comment@e8674b075228eee787fea43ef493e45ece1004c9 # v5 + with: + comment-id: ${{ github.event.comment.id }} + reactions: "+1" + + - name: Commit & push changes + id: commit-and-push + if: steps.pre-commit.outcome == 'failure' + continue-on-error: true + run: | + git config user.email "core@nf-co.re" + git config user.name "nf-core-bot" + git config push.default upstream + git add .
+ git status + git diff --staged --quiet || git commit -m "[automated] Fix code linting" + git push + + - name: react if linting errors were fixed + id: react-if-fixed + if: steps.commit-and-push.outcome == 'success' + uses: peter-evans/create-or-update-comment@e8674b075228eee787fea43ef493e45ece1004c9 # v5 + with: + comment-id: ${{ github.event.comment.id }} + reactions: hooray + + - name: react if linting errors were not fixed + if: steps.commit-and-push.outcome == 'failure' + uses: peter-evans/create-or-update-comment@e8674b075228eee787fea43ef493e45ece1004c9 # v5 + with: + comment-id: ${{ github.event.comment.id }} + reactions: confused + + - name: react if linting errors were not fixed + if: steps.commit-and-push.outcome == 'failure' + uses: peter-evans/create-or-update-comment@e8674b075228eee787fea43ef493e45ece1004c9 # v5 + with: + issue-number: ${{ github.event.issue.number }} + body: | + @${{ github.actor }} I tried to fix the linting errors, but it didn't work. Please fix them manually. + See [CI log](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}) for more details. diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml new file mode 100644 index 0000000..7a527a3 --- /dev/null +++ b/.github/workflows/linting.yml @@ -0,0 +1,80 @@ +name: nf-core linting +# This workflow is triggered on pushes and PRs to the repository. +# It runs the `nf-core pipelines lint` and markdown lint tests to ensure +# that the code meets the nf-core guidelines. 
+on: + pull_request: + release: + types: [published] + +jobs: + pre-commit: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5 + + - name: Set up Python 3.14 + uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6 + with: + python-version: "3.14" + + - name: Install pre-commit + run: pip install pre-commit + + - name: Run pre-commit + run: pre-commit run --all-files + + nf-core: + runs-on: ubuntu-latest + steps: + - name: Check out pipeline code + uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5 + + - name: Install Nextflow + uses: nf-core/setup-nextflow@v2 + + - uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6 + with: + python-version: "3.14" + architecture: "x64" + + - name: read .nf-core.yml + uses: pietrobolcato/action-read-yaml@9f13718d61111b69f30ab4ac683e67a56d254e1d # 1.1.0 + id: read_yml + with: + config: ${{ github.workspace }}/.nf-core.yml + + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install nf-core==${{ steps.read_yml.outputs['nf_core_version'] }} + + - name: Run nf-core pipelines lint + if: ${{ github.base_ref != 'master' }} + env: + GITHUB_COMMENTS_URL: ${{ github.event.pull_request.comments_url }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + GITHUB_PR_COMMIT: ${{ github.event.pull_request.head.sha }} + run: nf-core -l lint_log.txt pipelines lint --dir ${GITHUB_WORKSPACE} --markdown lint_results.md + + - name: Run nf-core pipelines lint --release + if: ${{ github.base_ref == 'master' }} + env: + GITHUB_COMMENTS_URL: ${{ github.event.pull_request.comments_url }} + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + GITHUB_PR_COMMIT: ${{ github.event.pull_request.head.sha }} + run: nf-core -l lint_log.txt pipelines lint --release --dir ${GITHUB_WORKSPACE} --markdown lint_results.md + + - name: Save PR number + if: ${{ always() }} + run: echo ${{ github.event.pull_request.number }} > PR_number.txt + + - 
name: Upload linting log file artifact + if: ${{ always() }} + uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5 + with: + name: linting-logs + path: | + lint_log.txt + lint_results.md + PR_number.txt diff --git a/.github/workflows/linting_comment.yml b/.github/workflows/linting_comment.yml new file mode 100644 index 0000000..dd8c035 --- /dev/null +++ b/.github/workflows/linting_comment.yml @@ -0,0 +1,39 @@ +name: nf-core linting comment +# This workflow is triggered after the linting action is complete +# It posts an automated comment to the PR, even if the PR is coming from a fork + +on: + workflow_run: + workflows: ["nf-core linting"] + types: [completed] + +jobs: + test: + runs-on: ubuntu-latest + steps: + - name: Download lint results + uses: dawidd6/action-download-artifact@ac66b43f0e6a346234dd65d4d0c8fbb31cb316e5 # v11 + with: + workflow: linting.yml + workflow_conclusion: completed + + - name: Get PR number + id: pr_number + run: | + if [ ! -f linting-logs/PR_number.txt ]; then + echo "PR number file not found" + exit 1 + fi + PR_NUM=$(cat linting-logs/PR_number.txt) + if ! [[ "$PR_NUM" =~ ^[0-9]+$ ]]; then + echo "Invalid PR number: $PR_NUM" + exit 1 + fi + echo "pr_number=$PR_NUM" >> $GITHUB_OUTPUT + + - name: Post PR comment + uses: marocchino/sticky-pull-request-comment@773744901bac0e8cbb5a0dc842800d45e9b2b405 # v2 + with: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + number: ${{ steps.pr_number.outputs.pr_number }} + path: linting-logs/lint_results.md diff --git a/.github/workflows/merge_ci.yml b/.github/workflows/merge_ci.yml new file mode 100644 index 0000000..448888e --- /dev/null +++ b/.github/workflows/merge_ci.yml @@ -0,0 +1,276 @@ +name: nf-core merge CI +# Full version × feature matrix — runs when merging dev → master. 
+# Job dependency chain: +# test-core-public (1.8.1, fast) → test-core-private (2.1.0, 2.2.0) +# → test-features-matrix (QuantUMS, Parquet × versions) +# If the fast 1.8.1 tests fail, all downstream jobs are skipped. +on: + pull_request: + branches: + - master + - main + release: + types: [published] + workflow_dispatch: + +env: + NXF_ANSI_LOG: false + +concurrency: + group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}" + cancel-in-progress: true + +jobs: + # ────────────────────────────────────────────────────────────────────────── + # Stage 1: Core tests with 1.8.1 (public, fast, no auth) + # Must pass before private container tests run. + # ────────────────────────────────────────────────────────────────────────── + test-core-public: + name: "Core [${{ matrix.test_profile }}] DIA-NN=1.8.1" + runs-on: ubuntu-latest + strategy: + fail-fast: true + matrix: + test_profile: ["test_dia", "test_dia_dotd"] + env: + NXF_ANSI_LOG: false + CAPSULE_LOG: none + TEST_PROFILE: ${{ matrix.test_profile }} + EXEC_PROFILE: docker + + steps: + - name: Checkout repository + uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4 + + - name: Set up Java 17 + uses: actions/setup-java@v4 + with: + distribution: "temurin" + java-version: "17" + + - name: Set up Nextflow + uses: nf-core/setup-nextflow@v2 + with: + version: "25.04.0" + + - name: Disk space cleanup + uses: jlumbroso/free-disk-space@v1.3.1 + + - name: Run pipeline with test data + run: | + nextflow run ${GITHUB_WORKSPACE} \ + -profile $TEST_PROFILE,$EXEC_PROFILE \ + --outdir ${TEST_PROFILE}_results + + - name: Gather failed logs + if: failure() || cancelled() + run: | + mkdir -p failed_logs + if [ -f "${TEST_PROFILE}_results/pipeline_info/execution_trace.txt" ]; then + failed=$(grep "FAILED" ${TEST_PROFILE}_results/pipeline_info/execution_trace.txt | cut -f 2) + while read -r line ; do cp $(ls work/${line}*/*.log) failed_logs/ || true ; done <<< "$failed" + fi + + - name: Set 
timestamp + if: always() + run: echo "TS=$(date +%s)" >> $GITHUB_ENV + + - uses: actions/upload-artifact@v4 + if: failure() || cancelled() + name: Upload failed logs + with: + name: failed_logs_merge_${{ matrix.test_profile }}_v1_8_1_${{ env.TS }} + include-hidden-files: true + path: failed_logs + overwrite: false + if-no-files-found: warn + + - uses: actions/upload-artifact@v4 + if: always() + name: Upload log + with: + name: nextflow_merge_${{ matrix.test_profile }}_v1_8_1_${{ env.TS }}.log + include-hidden-files: true + path: .nextflow.log + overwrite: false + if-no-files-found: warn + + # ────────────────────────────────────────────────────────────────────────── + # Stage 2a: Core tests with private containers (2.1.0, 2.2.0) + # Only runs after Stage 1 passes. + # ────────────────────────────────────────────────────────────────────────── + test-core-private: + name: "Core [${{ matrix.test_profile }}] DIA-NN=${{ matrix.diann_version }}" + needs: test-core-public + runs-on: ubuntu-latest + strategy: + fail-fast: false + matrix: + test_profile: ["test_dia", "test_dia_dotd"] + diann_version: ["diann_v2_1_0", "diann_v2_2_0"] + env: + NXF_ANSI_LOG: false + CAPSULE_LOG: none + TEST_PROFILE: ${{ matrix.test_profile }} + DIANN_VERSION: ${{ matrix.diann_version }} + EXEC_PROFILE: docker + + steps: + - name: Checkout repository + uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4 + + - name: Set up Java 17 + uses: actions/setup-java@v4 + with: + distribution: "temurin" + java-version: "17" + + - name: Set up Nextflow + uses: nf-core/setup-nextflow@v2 + with: + version: "25.04.0" + + - name: Log in to GitHub Container Registry + env: + GHCR_TOKEN: ${{ secrets.GHCR_TOKEN }} + run: | + if [ -n "$GHCR_TOKEN" ]; then + echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u ${{ secrets.GHCR_USERNAME }} --password-stdin + fi + + - name: Disk space cleanup + uses: jlumbroso/free-disk-space@v1.3.1 + + - name: Run pipeline with test data + run: | + nextflow 
run ${GITHUB_WORKSPACE} \ + -profile $TEST_PROFILE,$DIANN_VERSION,$EXEC_PROFILE \ + --outdir ${TEST_PROFILE}_${DIANN_VERSION}_results + + - name: Gather failed logs + if: failure() || cancelled() + run: | + mkdir -p failed_logs + if [ -f "${TEST_PROFILE}_${DIANN_VERSION}_results/pipeline_info/execution_trace.txt" ]; then + failed=$(grep "FAILED" ${TEST_PROFILE}_${DIANN_VERSION}_results/pipeline_info/execution_trace.txt | cut -f 2) + while read -r line ; do cp $(ls work/${line}*/*.log) failed_logs/ || true ; done <<< "$failed" + fi + + - name: Set timestamp + if: always() + run: echo "TS=$(date +%s)" >> $GITHUB_ENV + + - uses: actions/upload-artifact@v4 + if: failure() || cancelled() + name: Upload failed logs + with: + name: failed_logs_merge_${{ matrix.test_profile }}_${{ matrix.diann_version }}_${{ env.TS }} + include-hidden-files: true + path: failed_logs + overwrite: false + if-no-files-found: warn + + - uses: actions/upload-artifact@v4 + if: always() + name: Upload log + with: + name: nextflow_merge_${{ matrix.test_profile }}_${{ matrix.diann_version }}_${{ env.TS }}.log + include-hidden-files: true + path: .nextflow.log + overwrite: false + if-no-files-found: warn + + # ────────────────────────────────────────────────────────────────────────── + # Stage 2b: Feature tests across supported DIA-NN versions + # QuantUMS and Parquet against 2.1.0 and 2.2.0 + # Only runs after Stage 1 passes. 
+ # ────────────────────────────────────────────────────────────────────────── + test-features-matrix: + name: "Feature [${{ matrix.test_profile }}] DIA-NN=${{ matrix.diann_version }}" + needs: test-core-public + runs-on: ubuntu-latest + strategy: + fail-fast: false + matrix: + include: + # QuantUMS × {2.1.0, 2.2.0} + - test_profile: test_dia_quantums + diann_version: diann_v2_1_0 + - test_profile: test_dia_quantums + diann_version: diann_v2_2_0 + # Parquet × {2.1.0, 2.2.0} + - test_profile: test_dia_parquet + diann_version: diann_v2_1_0 + - test_profile: test_dia_parquet + diann_version: diann_v2_2_0 + env: + NXF_ANSI_LOG: false + CAPSULE_LOG: none + TEST_PROFILE: ${{ matrix.test_profile }} + DIANN_VERSION: ${{ matrix.diann_version }} + EXEC_PROFILE: docker + + steps: + - name: Checkout repository + uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4 + + - name: Set up Java 17 + uses: actions/setup-java@v4 + with: + distribution: "temurin" + java-version: "17" + + - name: Set up Nextflow + uses: nf-core/setup-nextflow@v2 + with: + version: "25.04.0" + + - name: Log in to GitHub Container Registry + env: + GHCR_TOKEN: ${{ secrets.GHCR_TOKEN }} + run: | + if [ -n "$GHCR_TOKEN" ]; then + echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u ${{ secrets.GHCR_USERNAME }} --password-stdin + fi + + - name: Disk space cleanup + uses: jlumbroso/free-disk-space@v1.3.1 + + - name: Run pipeline with test data + run: | + nextflow run ${GITHUB_WORKSPACE} \ + -profile $TEST_PROFILE,$DIANN_VERSION,$EXEC_PROFILE \ + --outdir ${TEST_PROFILE}_${DIANN_VERSION}_results + + - name: Gather failed logs + if: failure() || cancelled() + run: | + mkdir -p failed_logs + if [ -f "${TEST_PROFILE}_${DIANN_VERSION}_results/pipeline_info/execution_trace.txt" ]; then + failed=$(grep "FAILED" ${TEST_PROFILE}_${DIANN_VERSION}_results/pipeline_info/execution_trace.txt | cut -f 2) + while read -r line ; do cp $(ls work/${line}*/*.log) failed_logs/ || true ; done <<< "$failed" + 
fi + + - name: Set timestamp + if: always() + run: echo "TS=$(date +%s)" >> $GITHUB_ENV + + - uses: actions/upload-artifact@v4 + if: failure() || cancelled() + name: Upload failed logs + with: + name: failed_logs_merge_${{ matrix.test_profile }}_${{ matrix.diann_version }}_${{ env.TS }} + include-hidden-files: true + path: failed_logs + overwrite: false + if-no-files-found: warn + + - uses: actions/upload-artifact@v4 + if: always() + name: Upload log + with: + name: nextflow_merge_${{ matrix.test_profile }}_${{ matrix.diann_version }}_${{ env.TS }}.log + include-hidden-files: true + path: .nextflow.log + overwrite: false + if-no-files-found: warn diff --git a/.github/workflows/template-version-comment.yml b/.github/workflows/template-version-comment.yml new file mode 100644 index 0000000..e8560fc --- /dev/null +++ b/.github/workflows/template-version-comment.yml @@ -0,0 +1,46 @@ +name: nf-core template version comment +# This workflow is triggered on PRs to check if the pipeline template version matches the latest nf-core version. +# It posts a comment to the PR, even if it comes from a fork. 
+ +on: pull_request_target + +jobs: + template_version: + runs-on: ubuntu-latest + steps: + - name: Check out pipeline code + uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5 + with: + ref: ${{ github.event.pull_request.head.sha }} + + - name: Read template version from .nf-core.yml + uses: nichmor/minimal-read-yaml@1f7205277e25e156e1f63815781db80a6d490b8f # v0.0.2 + id: read_yml + with: + config: ${{ github.workspace }}/.nf-core.yml + + - name: Install nf-core + run: | + python -m pip install --upgrade pip + pip install nf-core==${{ steps.read_yml.outputs['nf_core_version'] }} + + - name: Check nf-core outdated + id: nf_core_outdated + run: echo "OUTPUT=$(pip list --outdated | grep nf-core)" >> ${GITHUB_ENV} + + - name: Post nf-core template version comment + uses: mshick/add-pr-comment@b8f338c590a895d50bcbfa6c5859251edc8952fc # v2 + if: | + contains(env.OUTPUT, 'nf-core') + with: + repo-token: ${{ secrets.NF_CORE_BOT_AUTH_TOKEN }} + allow-repeats: false + message: | + > [!WARNING] + > Newer version of the nf-core template is available. + > + > Your pipeline is using an old version of the nf-core template: ${{ steps.read_yml.outputs['nf_core_version'] }}. + > Please update your pipeline to the latest version. + > + > For more documentation on how to update your pipeline, please see the [nf-core documentation](https://github.com/nf-core/tools?tab=readme-ov-file#sync-a-pipeline-with-the-template) and [Synchronisation documentation](https://nf-co.re/docs/contributing/sync). 
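The outdated check in the workflow above boils down to grepping `pip list --outdated` for the `nf-core` package and exporting the match as an environment variable. A minimal sketch of that logic, factored into a function so it can be exercised offline — the sample `pip list --outdated` output and version numbers are invented for illustration:

```shell
# check_outdated reads `pip list --outdated` output on stdin and prints
# the nf-core line if present, or nothing when the package is up to date.
# `|| true` mirrors the workflow's behaviour: no match is not an error.
check_outdated() {
    grep nf-core || true
}

# Simulated `pip list --outdated` output (versions are made up):
printf 'Package Version Latest Type\nnf-core 3.5.2 3.6.0 wheel\n' | check_outdated
```

In the real workflow the result is written to `${GITHUB_ENV}` as `OUTPUT`, and the comment step fires only when `contains(env.OUTPUT, 'nf-core')` is true.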
+ # diff --git a/.gitignore b/.gitignore index b7faf40..52812e6 100644 --- a/.gitignore +++ b/.gitignore @@ -1,207 +1,23 @@ -# Byte-compiled / optimized / DLL files -__pycache__/ -*.py[codz] -*$py.class - -# C extensions -*.so - -# Distribution / packaging -.Python -build/ -develop-eggs/ -dist/ -downloads/ -eggs/ -.eggs/ -lib/ -lib64/ -parts/ -sdist/ -var/ -wheels/ -share/python-wheels/ -*.egg-info/ -.installed.cfg -*.egg -MANIFEST - -# PyInstaller -# Usually these files are written by a python script from a template -# before PyInstaller builds the exe, so as to inject date/other infos into it. -*.manifest -*.spec - -# Installer logs -pip-log.txt -pip-delete-this-directory.txt - -# Unit test / coverage reports -htmlcov/ -.tox/ -.nox/ -.coverage -.coverage.* -.cache -nosetests.xml -coverage.xml -*.cover -*.py.cover -.hypothesis/ -.pytest_cache/ -cover/ - -# Translations -*.mo -*.pot - -# Django stuff: -*.log -local_settings.py -db.sqlite3 -db.sqlite3-journal - -# Flask stuff: -instance/ -.webassets-cache - -# Scrapy stuff: -.scrapy - -# Sphinx documentation -docs/_build/ - -# PyBuilder -.pybuilder/ -target/ - -# Jupyter Notebook -.ipynb_checkpoints - -# IPython -profile_default/ -ipython_config.py - -# pyenv -# For a library or package, you might want to ignore these files since the code is -# intended to run in multiple environments; otherwise, check them in: -# .python-version - -# pipenv -# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. -# However, in case of collaboration, if having platform-specific dependencies or dependencies -# having no cross-platform support, pipenv may install dependencies that don't work, or not -# install all needed dependencies. -#Pipfile.lock - -# UV -# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control. -# This is especially recommended for binary packages to ensure reproducibility, and is more -# commonly ignored for libraries. 
-#uv.lock - -# poetry -# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. -# This is especially recommended for binary packages to ensure reproducibility, and is more -# commonly ignored for libraries. -# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control -#poetry.lock -#poetry.toml - -# pdm -# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. -# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python. -# https://pdm-project.org/en/latest/usage/project/#working-with-version-control -#pdm.lock -#pdm.toml -.pdm-python -.pdm-build/ - -# pixi -# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control. -#pixi.lock -# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one -# in the .venv directory. It is recommended not to include this directory in version control. -.pixi - -# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm -__pypackages__/ - -# Celery stuff -celerybeat-schedule -celerybeat.pid - -# SageMath parsed files -*.sage.py - -# Environments +.nextflow* +work/ +data/ +results/ +.DS_Store .env -.envrc -.venv -env/ -venv/ -ENV/ -env.bak/ -venv.bak/ - -# Spyder project settings -.spyderproject -.spyproject - -# Rope project settings -.ropeproject - -# mkdocs documentation -/site - -# mypy -.mypy_cache/ -.dmypy.json -dmypy.json - -# Pyre type checker -.pyre/ - -# pytype static type analyzer -.pytype/ - -# Cython debug symbols -cython_debug/ - -# PyCharm -# JetBrains specific template is maintained in a separate JetBrains.gitignore that can -# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore -# and can be added to the global gitignore or merged into this file. 
For a more nuclear -# option (not recommended) you can uncomment the following to ignore the entire idea folder. -#.idea/ - -# Abstra -# Abstra is an AI-powered process automation framework. -# Ignore directories containing user credentials, local state, and settings. -# Learn more at https://abstra.io/docs -.abstra/ - -# Visual Studio Code -# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore -# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore -# and can be added to the global gitignore or merged into this file. However, if you prefer, -# you could uncomment the following to ignore the entire vscode folder -# .vscode/ - -# Ruff stuff: -.ruff_cache/ - -# PyPI configuration file -.pypirc - -# Cursor -# Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to -# exclude from AI features like autocomplete and code analysis. Recommended for sensitive data -# refer to https://docs.cursor.com/context/ignore-files -.cursorignore -.cursorindexingignore - -# Marimo -marimo/_static/ -marimo/_lsp/ -__marimo__/ +.env.* +!.env.example +testing* +*.pyc +__pycache__/ +null/ +.idea/ +.vscode/ +/lint_log.txt +/lint_results.md + +# AI assistant configuration +.claude/ +.cursor/rules/codacy.mdc +.codacy/ +.github/instructions/codacy.instructions.md +docs/superpowers/ diff --git a/.nf-core.yml b/.nf-core.yml new file mode 100644 index 0000000..e62b86d --- /dev/null +++ b/.nf-core.yml @@ -0,0 +1,30 @@ +lint: + files_exist: + - .github/workflows/ci.yml + - .github/workflows/nf-test.yml + - .gitignore + - conf/modules.config + - conf/test.config + - conf/test_full.config + files_unchanged: + - .github/PULL_REQUEST_TEMPLATE.md + - .github/CONTRIBUTING.md + - .github/workflows/branch.yml + - .github/workflows/linting_comment.yml + - .github/.dockstore.yml + - docs/README.md + - .gitignore + modules_config: false + multiqc_config: false + nextflow_config: false 
+nf_core_version: 3.5.2 +repository_type: pipeline +template: + author: Yasset Perez-Riverol + description: Quantitative Mass Spectrometry nf-core workflow + force: false + is_nfcore: false + name: quantmsdiann + org: bigbio + outdir: . + version: 1.0.0 diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 0000000..d06777a --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,27 @@ +repos: + - repo: https://github.com/pre-commit/mirrors-prettier + rev: "v3.1.0" + hooks: + - id: prettier + additional_dependencies: + - prettier@3.6.2 + - repo: https://github.com/pre-commit/pre-commit-hooks + rev: v6.0.0 + hooks: + - id: trailing-whitespace + args: [--markdown-linebreak-ext=md] + exclude: | + (?x)^( + .*ro-crate-metadata.json$| + modules/nf-core/.*| + subworkflows/nf-core/.*| + .*\.snap$ + )$ + - id: end-of-file-fixer + exclude: | + (?x)^( + .*ro-crate-metadata.json$| + modules/nf-core/.*| + subworkflows/nf-core/.*| + .*\.snap$ + )$ diff --git a/.prettierignore b/.prettierignore new file mode 100644 index 0000000..7b5485f --- /dev/null +++ b/.prettierignore @@ -0,0 +1,21 @@ +modules.json +email_template.html +adaptivecard.json +slackreport.json +.nextflow* +work/ +data/ +results/ +.DS_Store +testing/ +testing* +*.pyc +bin/ +.nf-test/ +ro-crate-metadata.json +modules/nf-core/ +subworkflows/nf-core/ + +# Ignore potentially non-human-readable markdown files +.github/**/SKILL.md +AGENT.md diff --git a/.prettierrc.yml b/.prettierrc.yml new file mode 100644 index 0000000..07dbd8b --- /dev/null +++ b/.prettierrc.yml @@ -0,0 +1,6 @@ +printWidth: 120 +tabWidth: 4 +overrides: + - files: "*.{md,yml,yaml,html,css,scss,js,cff}" + options: + tabWidth: 2 diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..2c32d31 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,435 @@ +# AI Agent Guidelines for quantmsdiann Development + +This document provides comprehensive guidance for AI agents working with the **quantmsdiann** bioinformatics 
pipeline. These guidelines ensure code quality, maintainability, and compliance with project standards. + +## Critical: Mandatory Validation Before ANY Commit + +**ALWAYS run pre-commit hooks before committing ANY changes:** + +```bash +pre-commit run --all-files +``` + +This is **non-negotiable**. All code must pass formatting and style checks before being committed. + +--- + +## Project Overview + +**quantmsdiann** is a [bigbio](https://github.com/bigbio) bioinformatics pipeline, built following [nf-core](https://nf-co.re/) guidelines, for **DIA-NN-based quantitative mass spectrometry**. It is a standalone pipeline focused exclusively on **Data-Independent Acquisition (DIA)** workflows using the DIA-NN search engine. + +**This pipeline does NOT support DDA, TMT, iTRAQ, LFQ-DDA, or any non-DIA workflows.** Those are handled by the parent `quantms` pipeline. + +**Key Features:** + +- Built with Nextflow DSL2 +- DIA-NN for peptide/protein identification and quantification +- Supports DIA-NN v1.8.1, v2.1.0, and v2.2.0 (latest) +- QuantUMS quantification method (DIA-NN >= 1.9.2) +- Parquet-native output with decoy reporting (DIA-NN >= 2.0) +- MSstats-compatible output generation (via quantms-utils conversion, no MSstats analysis) +- Quality control with pmultiqc +- Complies with nf-core standards + +**Repository:** https://github.com/bigbio/quantmsdiann + +--- + +## Technology Stack + +### Core Technologies + +- **Nextflow**: >=25.04.0 (DSL2 syntax) +- **nf-schema plugin**: 2.5.1 (parameter validation) +- **nf-test**: Testing framework (config: `nf-test.config`) +- **nf-core tools**: Pipeline standards and linting +- **Containers**: Docker/Singularity/Apptainer/Podman (Conda deprecated) +- **DIA-NN**: Primary search engine (versions 1.8.1 through 2.2.0) + +### Key Configuration Files + +- `nextflow.config` - Main pipeline configuration +- `nextflow_schema.json` - Parameter schema (auto-generated) +- `nf-test.config` - Testing configuration +- `.nf-core.yml` - nf-core 
compliance settings +- `modules.json` - Module dependencies +- `.pre-commit-config.yaml` - Pre-commit hooks + +### Project Structure + +``` +quantmsdiann/ +├── main.nf # Pipeline entry point +├── workflows/ +│ ├── quantmsdiann.nf # Main workflow orchestrator +│ └── dia.nf # DIA-NN analysis workflow +├── subworkflows/local/ # Reusable subworkflows +│ ├── input_check/ # SDRF validation +│ ├── file_preparation/ # Format conversion +│ └── create_input_channel/ # SDRF metadata parsing +├── modules/local/ +│ ├── diann/ # DIA-NN modules (7 steps) +│ │ ├── generate_cfg/ +│ │ ├── insilico_library_generation/ +│ │ ├── preliminary_analysis/ +│ │ ├── assemble_empirical_library/ +│ │ ├── individual_analysis/ +│ │ ├── final_quantification/ +│ │ └── diann_msstats/ +│ ├── openms/ # mzML indexing, peak picking +│ ├── pmultiqc/ # QC reporting +│ ├── sdrf_parsing/ # SDRF parsing +│ ├── samplesheet_check/ # Input validation +│ └── utils/ # tdf2mzml, decompress, mzml stats +├── conf/ +│ ├── base.config # Resource definitions +│ ├── modules/ # Module-specific configs +│ ├── tests/ # Test profile configs (DIA only) +│ └── diann_versions/ # DIA-NN version-override configs for merge matrix +├── tests/ # nf-test test cases +└── assets/ # Pipeline assets and schemas +``` + +--- + +## DIA-NN Workflow + +The pipeline executes the following steps: + +1. **SDRF Validation & Parsing** - Validates input SDRF and extracts metadata +2. **File Preparation** - Converts RAW/mzML/.d/.dia files (ThermoRawFileParser, tdf2mzml) +3. **Generate Config** - Creates DIA-NN config from enzyme/modifications (`quantmsutilsc dianncfg`) +4. **In-Silico Library Generation** - Predicts spectral library from FASTA (or uses provided library) +5. **Preliminary Analysis** - Per-file calibration and mass accuracy determination +6. **Assemble Empirical Library** - Builds consensus library from preliminary results using .quant files +7. 
**Individual Analysis** - Per-file search with empirical library (optional, for large datasets) +8. **Final Quantification** - Summary quantification with protein/peptide/gene matrices +9. **MSstats Format Conversion** - Converts DIA-NN report to MSstats-compatible CSV (`quantmsutilsc diann2msstats`) +10. **pmultiqc** - Quality control reporting + +### DIA-NN Version-Specific Features + +| Feature | Min Version | Parameter | +| ------------------------------------------- | ----------- | ---------------------------- | +| Core workflow, library-free, .quant caching | 1.8.1 | (default) | +| QuantUMS quantification | 1.9.2 | `--quantums true` | +| Parquet output format | 2.0 | (automatic in 2.0+) | +| Decoy reporting | 2.0 | `--diann_report_decoys true` | +| Native .raw on Linux | 2.1.0 | (automatic) | + +--- + +## Validation Workflow + +### 1. Pre-commit Hooks (MANDATORY) + +**Installation:** + +```bash +pip install pre-commit +pre-commit install # Install git hooks (one-time setup) +``` + +**Run before EVERY commit:** + +```bash +pre-commit run --all-files +``` + +**Configured Hooks** (`.pre-commit-config.yaml`): + +1. **Prettier** - Formats code consistently across multiple file types +2. **trailing-whitespace** - Removes trailing whitespace (preserves markdown linebreaks) +3. **end-of-file-fixer** - Ensures files end with a single newline + +**Auto-fix in CI:** +If you forget to run pre-commit locally, comment on your PR: + +``` +@nf-core-bot fix linting +``` + +### 2. Pipeline Linting (RECOMMENDED) + +```bash +nf-core pipelines lint +# For master branch PRs: +nf-core pipelines lint --release +``` + +### 3. Schema Validation (REQUIRED for parameter changes) + +```bash +nf-core pipelines schema build +``` + +--- + +## Testing Strategy + +### 3-Tier CI/CD Strategy + +1. **Every PR / push to dev**: Test all features against **latest** DIA-NN (2.2.0) + test **1.8.1** for features it supports. +2. 
**Merge dev → master**: Run the **full version × feature matrix** — every DIA-NN version against every feature it introduced. +3. **ci.yml** (fast gate): Only 1.8.1 public container tests, no auth needed. + +### Test Profiles (DIA only) + +| Profile | Feature Tested | Default Container | Min DIA-NN | +| ------------------- | ----------------------- | ---------------------------- | ---------- | +| `test_dia` | Core workflow | biocontainers 1.8.1 (public) | 1.8.1 | +| `test_dia_dotd` | Bruker .d format | biocontainers 1.8.1 (public) | 1.8.1 | +| `test_dia_quantums` | QuantUMS quantification | ghcr.io/bigbio/diann:2.2.0 | 1.9.2 | +| `test_dia_parquet` | Parquet output + decoys | ghcr.io/bigbio/diann:2.2.0 | 2.0 | +| `test_latest_dia` | Core on latest DIA-NN | ghcr.io/bigbio/diann:2.2.0 | latest | +| `test_dia_2_2_0` | DIA-NN 2.2.0 compat | ghcr.io/bigbio/diann:2.2.0 | 2.2.0 | +| `test_full_dia` | Full-size dataset | biocontainers 1.8.1 (public) | 1.8.1 | + +### Version Override Profiles (for merge matrix) + +These apply on top of test profiles to override the DIA-NN container version: + +| Profile | Container | Auth | +| -------------- | -------------------------------- | ---- | +| `diann_v1_8_1` | `biocontainers/diann:v1.8.1_cv1` | none | +| `diann_v2_1_0` | `ghcr.io/bigbio/diann:2.1.0` | GHCR | +| `diann_v2_2_0` | `ghcr.io/bigbio/diann:2.2.0` | GHCR | + +### CI Workflows + +| Workflow | Trigger | What it runs | +| ------------------- | ----------------------- | ---------------------------------------------------- | +| **ci.yml** | Every PR (fast gate) | `test_dia`, `test_dia_dotd` (1.8.1, Docker) | +| **extended_ci.yml** | Every PR / push to dev | 1.8.1 defaults + all features on 2.2.0 + Singularity | +| **merge_ci.yml** | PR to master / releases | Full version × feature matrix (10 combinations) | +| **linting.yml** | All PRs, releases | Pre-commit hooks + `nf-core pipelines lint` | +| **branch.yml** | PRs to master | Only allows PRs from `dev` branch | + +### When 
to Run Tests Locally + +**No testing required:** + +- README, CHANGELOG, docs/ updates +- Minor config tweaks (labels, descriptions) +- Comment additions + +**Targeted testing required:** + +| Change Area | Test Profile | Command | +| ------------------------------- | -------------------- | ----------------------------------------------------------------------- | +| Core DIA-NN modules | `test_dia` | `nextflow run . -profile test_dia,docker --outdir results` | +| Bruker .d support | `test_dia_dotd` | `nextflow run . -profile test_dia_dotd,docker --outdir results` | +| QuantUMS / final_quantification | `test_dia_quantums` | `nextflow run . -profile test_dia_quantums,docker --outdir results` | +| Parquet output / diann_msstats | `test_dia_parquet` | `nextflow run . -profile test_dia_parquet,docker --outdir results` | +| Cross-version compat | Use version override | `nextflow run . -profile test_dia,diann_v2_2_0,docker --outdir results` | + +**Comprehensive testing (before PR):** + +```bash +nf-test test --profile debug,test,docker --verbose +``` + +### Container Authentication + +Tests using `ghcr.io/bigbio/diann:*` containers require GHCR authentication (DIA-NN has an academic-only license): + +```bash +echo "$GHCR_TOKEN" | docker login ghcr.io -u $GHCR_USERNAME --password-stdin +``` + +In CI, the `GHCR_USERNAME` and `GHCR_TOKEN` secrets are configured in the repository. 
+ +--- + +## Development Conventions + +### Branch Strategy + +- **Target branch**: `dev` (NOT master) +- **Master branch**: Release-ready code only +- **PR process**: Fork -> feature branch -> PR to `dev` + +### Naming Conventions + +#### Channel Names + +```groovy +ch_output_from_<process>    // Initial output from a process +ch_<process>_for_<process>  // Intermediate channels +``` + +#### Process/Module Names + +- Use lowercase with underscores: `final_quantification`, `preliminary_analysis` +- Follow nf-core conventions for consistency + +### Resource Labels + +Defined in `conf/base.config`: + +| Label | CPU | Memory | Time | Use Case | | ------------------ | --- | ------ | ---- | --------------------- | | `process_single` | 1 | 6 GB | 4h | Single-threaded tools | | `process_tiny` | 1 | 1 GB | 1h | Minimal processing | | `process_very_low` | 2 | 12 GB | 4h | Light parallelism | | `process_low` | 4 | 36 GB | 8h | Moderate workload | | `process_medium` | 8 | 72 GB | 16h | Standard processing | | `process_high` | 12 | 108 GB | 20h | Heavy computation | + +### DIA-NN Module Labels + +All DIA-NN process modules use the `diann` label for container selection: + +```groovy +process DIANN_FINAL_QUANTIFICATION { + label 'process_high' + label 'diann' + // ... +} +``` + +Version-override profiles target the `diann` label to switch containers. + +### Adding a New DIA-NN Feature + +1. **Identify minimum DIA-NN version** that supports the feature +2. **Modify the relevant module** in `modules/local/diann/` +3. **Add parameter** to `nextflow.config` with sensible default +4. **Update schema**: `nf-core pipelines schema build` +5. **Create or update test profile** in `conf/tests/` with the feature enabled +6. **Add to CI matrix** in `extended_ci.yml` (latest) and `merge_ci.yml` (version matrix) +7.
**Update documentation**: `docs/usage.md`, `docs/output.md` + +### Code Style + +- **Indentation**: 4 spaces (enforced by Prettier) +- **Line length**: Aim for <120 characters +- **Comments**: Use `//` for single-line, `/* */` for multi-line +- **Strings**: Use single quotes `'text'` unless interpolation needed `"$var"` +- **Groovy closures**: Follow Nextflow DSL2 patterns + +--- + +## Troubleshooting + +### Pre-commit Issues + +**Problem**: Pre-commit hook fails with formatting issues + +**Solution**: The files were auto-fixed. Stage and commit again: + +```bash +git add . +git commit -m "your message" +``` + +### Testing Issues + +**Problem**: Test fails with "Process exceeded memory limit" + +**Solution**: Ensure you're using a test profile with resource limits: + +```bash +nextflow run . -profile test_dia,docker --outdir results +``` + +**Problem**: GHCR container pull fails + +**Solution**: Feature test profiles require GHCR authentication: + +```bash +echo "$GHCR_TOKEN" | docker login ghcr.io -u $GHCR_USERNAME --password-stdin +``` + +For local testing without GHCR access, use `test_dia` or `test_dia_dotd` (public containers). + +**Problem**: Snapshot test fails after intentional output changes + +**Solution**: Update snapshots: + +```bash +nf-test test --profile debug,test,docker --update-snapshot +``` + +### Nextflow Issues + +**Problem**: "Nextflow version is too old" + +**Solution**: + +```bash +nextflow self-update +# Or install specific version +export NXF_VER=25.04.0 +``` + +**Problem**: "Process terminated with exit code 137" + +**Solution**: Out of memory. Either: + +1. Use test profile: `-profile test_dia,docker` +2. Increase Docker memory limit in Docker Desktop settings + +**Problem**: A process fails with an unclear error + +**Solution**: + +1. Check `.nextflow.log` for details +2. Check the process work directory: `cat work/<hash>/.command.err` +3. Rerun with debug: `nextflow run .
-profile debug,test_dia,docker --outdir results` + +--- + +## Quick Reference + +### Essential Commands + +```bash +# Pre-commit (MANDATORY before commit) +pre-commit run --all-files + +# Lint pipeline +nf-core pipelines lint + +# Update schema after parameter changes +nf-core pipelines schema build + +# Run core DIA test (public container, no auth) +nextflow run . -profile test_dia,docker --outdir results + +# Run QuantUMS test (requires GHCR auth) +nextflow run . -profile test_dia_quantums,docker --outdir results + +# Run with specific DIA-NN version override +nextflow run . -profile test_dia,diann_v2_2_0,docker --outdir results + +# Run nf-test suite +nf-test test --profile debug,test,docker --verbose + +# Resume pipeline +nextflow run . -profile test_dia,docker --outdir results -resume + +# Clean work directory +nextflow clean -f +``` + +### File Locations + +- **Main config**: `nextflow.config` +- **Schema**: `nextflow_schema.json` +- **Pre-commit config**: `.pre-commit-config.yaml` +- **nf-test config**: `nf-test.config` +- **Test configs**: `conf/tests/*.config` +- **Version overrides**: `conf/diann_versions/*.config` +- **Module configs**: `conf/modules/modules.config` +- **Base resources**: `conf/base.config` +- **Main workflow**: `workflows/quantmsdiann.nf` +- **DIA workflow**: `workflows/dia.nf` +- **DIA-NN modules**: `modules/local/diann/*/main.nf` +- **Entry point**: `main.nf` + +--- + +**Last Updated**: April 2, 2026 +**Pipeline Version**: 1.0.0 +**Minimum Nextflow**: 25.04.0 diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..fff751a --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,34 @@ +# bigbio/quantmsdiann: Changelog + +The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) +and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 
+ +## [1.0.0] bigbio/quantmsdiann - 2026-04-03 + +Initial release of the standalone DIA-NN quantitative proteomics pipeline, refactored from [bigbio/quantms](https://github.com/bigbio/quantms). + +### `Added` + +- Complete DIA-NN-based proteomics analysis pipeline built following nf-core guidelines +- Multi-format input support: Thermo RAW, mzML, Bruker .d, and .dia files +- DIA-NN version management with support for versions 1.8.1, 2.1.0, and 2.2.0 +- In-silico spectral library generation with configurable parameters +- Preliminary analysis with automatic mass accuracy calibration +- Empirical library assembly from DIA-NN .quant files +- Individual file analysis with per-file DIA-NN settings from SDRF +- Final quantification with protein-group, precursor, and gene-group matrices +- MSstats-compatible output generation (format conversion, no MSstats analysis) +- Quality control reporting via pmultiqc with interactive dashboards +- SDRF-driven experimental design with automatic parameter extraction +- Comprehensive CI/CD with test profiles for multiple DIA-NN versions + +### `Dependencies` + +| Dependency | Version | +| --------------------- | --------- | +| `nextflow` | >=25.04.0 | +| `dia-nn` | 1.8.1 | +| `thermorawfileparser` | 2.0.0.dev | +| `sdrf-pipelines` | 0.1.2 | +| `pmultiqc` | 0.0.43 | +| `quantms-utils` | 0.0.28 | diff --git a/CITATIONS.md b/CITATIONS.md new file mode 100644 index 0000000..d74e0f9 --- /dev/null +++ b/CITATIONS.md @@ -0,0 +1,61 @@ +# bigbio/quantmsdiann: Citations + +## [Pipeline](https://www.researchsquare.com/article/rs-3002027/v1) + +> Dai C, Pfeuffer J, Wang H, Zheng P, Käll L, Sachsenberg T, Demichev V, Bai M, Kohlbacher O, Perez-Riverol Y. quantms: a cloud-based pipeline for quantitative proteomics enables the reanalysis of public proteomics data. Nat Methods. 2024 Jul 4. doi: 10.1038/s41592-024-02343-1. Epub ahead of print. PMID: 38965444. 
+ +## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/) + +> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031. + +## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/) + +> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311. + +## Pipeline tools + +- [DIA-NN](https://pubmed.ncbi.nlm.nih.gov/31768060/) + + > Demichev V, Messner CB, Vernardis SI, Lilley KS, Ralser M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat Methods. 2020 Jan;17(1):41-44. doi: 10.1038/s41592-019-0638-x. Epub 2019 Nov 25. PMID: 31768060; PMCID: PMC6949130. + +- [thermorawfileparser](https://pubmed.ncbi.nlm.nih.gov/31755270/) + + > Hulstaert N, Shofstahl J, Sachsenberg T, Walzer M, Barsnes H, Martens L, Perez-Riverol Y. ThermoRawFileParser: Modular, Scalable, and Cross-Platform RAW File Conversion. J Proteome Res. 2020 Jan 3;19(1):537-542. doi: 10.1021/acs.jproteome.9b00328. Epub 2019 Dec 6. PMID: 31755270; PMCID: PMC7116465. + +- [sdrf-pipelines](https://pubmed.ncbi.nlm.nih.gov/34615866/) + + > Dai C, Füllgrabe A, Pfeuffer J, Solovyeva EM, Deng J, Moreno P, Kamatchinathan S, Kundu DJ, George N, Fexova S, Grüning B, Föll MC, Griss J, Vaudel M, Audain E, Locard-Paulet M, Turewicz M, Eisenacher M, Uszkoreit J, Van Den Bossche T, Schwämmle V, Webel H, Schulze S, Bouyssié D, Jayaram S, Duggineni VK, Samaras P, Wilhelm M, Choi M, Wang M, Kohlbacher O, Brazma A, Papatheodorou I, Bandeira N, Deutsch EW, Vizcaíno JA, Bai M, Sachsenberg T, Levitsky LI, Perez-Riverol Y. A proteomics sample metadata representation for multiomics integration and big data analysis. Nat Commun. 
2021 Oct 6;12(1):5854. doi: 10.1038/s41467-021-26111-3. PMID: 34615866; PMCID: PMC8494749. + +- [OpenMS](https://pubmed.ncbi.nlm.nih.gov/27575624/) + + > Röst HL., Sachsenberg T., Aiche S., Bielow C., Weisser H., Aicheler F., Andreotti S., Ehrlich HC., Gutenbrunner P., Kenar E., Liang X., Nahnsen S., Nilse L., Pfeuffer J., Rosenberger G., Rurik M., Schmitt U., Veit J., Walzer M., Wojnar D., Wolski WE., Schilling O., Choudhary JS, Malmström L., Aebersold R., Reinert K., Kohlbacher O. (2016). OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nature methods, 13(9), 741–748. doi: 10.1038/nmeth.3959. PubMed PMID: 27575624; PubMed Central PMCID: PMC5617107. + +- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/) + + > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924. + +- [pmultiqc](https://github.com/bigbio/pmultiqc) + + > Dai C, Pfeuffer J, Wang H, Sachsenberg T, Bai M, Kohlbacher O, Perez-Riverol Y. pmultiqc: a MultiQC plugin for proteomics quality control reporting. 2024. GitHub: https://github.com/bigbio/pmultiqc + +## Software packaging/containerisation tools + +- [Anaconda](https://anaconda.com) + + > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web. + +- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/) + + > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506. 
+ +- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/) + + > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671. + +- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241) + + > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241. + +- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/) + + > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675. diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 0000000..c089ec7 --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,182 @@ +# Code of Conduct at nf-core (v1.4) + +## Our Pledge + +In the interest of fostering an open, collaborative, and welcoming environment, we as contributors and maintainers of nf-core pledge to making participation in our projects and community a harassment-free experience for everyone, regardless of: + +- Age +- Ability +- Body size +- Caste +- Familial status +- Gender identity and expression +- Geographical location +- Level of experience +- Nationality and national origins +- Native language +- Neurodiversity +- Race or ethnicity +- Religion +- Sexual identity and orientation +- Socioeconomic status + +Please note that the list above is alphabetised and is therefore not ranked in any order of preference or importance. 
+ +## Preamble + +:::note +This Code of Conduct (CoC) has been drafted by Renuka Kudva, Cris Tuñí, and Michael Heuer, with input from the nf-core Core Team and Susanna Marquez from the nf-core community. "We", in this document, refers to the Safety Officers and members of the nf-core Core Team, both of whom are deemed to be members of the nf-core community and are therefore required to abide by this Code of Conduct. This document will be amended periodically to keep it up-to-date. In case of any dispute, the most current version will apply. +::: + +An up-to-date list of members of the nf-core core team can be found [here](https://nf-co.re/about). + +Our Safety Officers are Saba Nafees, Cris Tuñí, and Michael Heuer. + +nf-core is a young and growing community that welcomes contributions from anyone with a shared vision for [Open Science Policies](https://www.fosteropenscience.eu/taxonomy/term/8). Open science policies encompass inclusive behaviours and we strive to build and maintain a safe and inclusive environment for all individuals. + +We have therefore adopted this CoC, which we require all members of our community and attendees of nf-core events to adhere to in all our workspaces at all times. Workspaces include, but are not limited to, Slack, meetings on Zoom, gather.town, YouTube live etc. + +Our CoC will be strictly enforced and the nf-core team reserves the right to exclude participants who do not comply with our guidelines from our workspaces and future nf-core activities. + +We ask all members of our community to help maintain supportive and productive workspaces and to avoid behaviours that can make individuals feel unsafe or unwelcome. Please help us maintain and uphold this CoC. + +Questions, concerns, or ideas on what we can include? Contact members of the Safety Team on Slack or email safety [at] nf-co [dot] re. 
+ +## Our Responsibilities + +Members of the Safety Team (the Safety Officers) are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behaviour. + +The Safety Team, in consultation with the nf-core core team, have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this CoC, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. + +Members of the core team or the Safety Team who violate the CoC will be required to recuse themselves pending investigation. They will not have access to any reports of the violations and will be subject to the same actions as others in violation of the CoC. + +## When and where does this Code of Conduct apply? + +Participation in the nf-core community is contingent on following these guidelines in all our workspaces and events, such as hackathons, workshops, bytesize, and collaborative workspaces on gather.town. These guidelines include, but are not limited to, the following (listed alphabetically and therefore in no order of preference): + +- Communicating with an official project email address. +- Communicating with community members within the nf-core Slack channel. +- Participating in hackathons organised by nf-core (both online and in-person events). +- Participating in collaborative work on GitHub, Google Suite, community calls, mentorship meetings, email correspondence, and on the nf-core gather.town workspace. +- Participating in workshops, training, and seminar series organised by nf-core (both online and in-person events). This applies to events hosted on web-based platforms such as Zoom, gather.town, Jitsi, YouTube live etc. +- Representing nf-core on social media. This includes both official and personal accounts. 
+ +## nf-core cares 😊 + +nf-core's CoC and expectations of respectful behaviours for all participants (including organisers and the nf-core team) include, but are not limited to, the following (listed in alphabetical order): + +- Ask for consent before sharing another community member’s personal information (including photographs) on social media. +- Be respectful of differing viewpoints and experiences. We are all here to learn from one another and a difference in opinion can present a good learning opportunity. +- Celebrate your accomplishments! (Get creative with your use of emojis 🎉 🥳 💯 🙌 !) +- Demonstrate empathy towards other community members. (We don’t all have the same amount of time to dedicate to nf-core. If tasks are pending, don’t hesitate to gently remind members of your team. If you are leading a task, ask for help if you feel overwhelmed.) +- Engage with and enquire after others. (This is especially important given the geographically remote nature of the nf-core community, so let’s do this the best we can) +- Focus on what is best for the team and the community. (When in doubt, ask) +- Accept feedback, yet be unafraid to question, deliberate, and learn. +- Introduce yourself to members of the community. (We’ve all been outsiders and we know that talking to strangers can be hard for some, but remember we’re interested in getting to know you and your visions for open science!) +- Show appreciation and **provide clear feedback**. (This is especially important because we don’t see each other in person and it can be harder to interpret subtleties. Also remember that not everyone understands a certain language to the same extent as you do, so **be clear in your communication to be kind.**) +- Take breaks when you feel like you need them. +- Use welcoming and inclusive language. 
(Participants are encouraged to display their chosen pronouns on Zoom or in communication on Slack) + +## nf-core frowns on 😕 + +The following behaviours from any participants within the nf-core community (including the organisers) will be considered unacceptable under this CoC. Engaging or advocating for any of the following could result in expulsion from nf-core workspaces: + +- Deliberate intimidation, stalking or following and sustained disruption of communication among participants of the community. This includes hijacking shared screens through actions such as using the annotate tool in conferencing software such as Zoom. +- “Doxing” i.e. posting (or threatening to post) another person’s personal identifying information online. +- Spamming or trolling of individuals on social media. +- Use of sexual or discriminatory imagery, comments, jokes, or unwelcome sexual attention. +- Verbal and text comments that reinforce social structures of domination related to gender, gender identity and expression, sexual orientation, ability, physical appearance, body size, race, age, religion, or work experience. + +### Online Trolling + +The majority of nf-core interactions and events are held online. Unfortunately, holding events online comes with the risk of online trolling. This is unacceptable — reports of such behaviour will be taken very seriously and perpetrators will be excluded from activities immediately. + +All community members are **required** to ask members of the group they are working with for explicit consent prior to taking screenshots of individuals during video calls. + +## Procedures for reporting CoC violations + +If someone makes you feel uncomfortable through their behaviours or actions, report it as soon as possible. + +You can reach out to members of the Safety Team (Saba Nafees, Cris Tuñí, and Michael Heuer) on Slack. 
Alternatively, contact a member of the nf-core core team [nf-core core team](https://nf-co.re/about), and they will forward your concerns to the Safety Team. + +Issues directly concerning members of the Core Team or the Safety Team will be dealt with by other members of the core team and the safety manager — possible conflicts of interest will be taken into account. nf-core is also in discussions about having an ombudsperson and details will be shared in due course. + +All reports will be handled with the utmost discretion and confidentiality. + +You can also report any CoC violations to safety [at] nf-co [dot] re. In your email report, please do your best to include: + +- Your contact information. +- Identifying information (e.g. names, nicknames, pseudonyms) of the participant who has violated the Code of Conduct. +- The behaviour that was in violation and the circumstances surrounding the incident. +- The approximate time of the behaviour (if different than the time the report was made). +- Other people involved in the incident, if applicable. +- If you believe the incident is ongoing. +- If there is a publicly available record (e.g. mailing list record, a screenshot). +- Any additional information. + +After you file a report, one or more members of our Safety Team will contact you to follow up on your report. + +## Who will read and handle reports + +All reports will be read and handled by the members of the Safety Team at nf-core. + +If members of the Safety Team are deemed to have a conflict of interest with a report, they will be required to recuse themselves as per our Code of Conduct and will not have access to any follow-ups. + +To keep this first report confidential from any of the Safety Team members, please submit your first report by direct messaging on Slack/direct email to any of the nf-core members you are comfortable disclosing the information to, and be explicit about which member(s) you do not consent to sharing the information with. 
+ +## Reviewing reports + +After receiving the report, members of the Safety Team will review the incident report to determine whether immediate action is required, for example, whether there is immediate threat to participants’ safety. + +The Safety Team, in consultation with members of the nf-core core team, will assess the information to determine whether the report constitutes a Code of Conduct violation, for them to decide on a course of action. + +In the case of insufficient information, one or more members of the Safety Team may contact the reporter, the reportee, or any other attendees to obtain more information. + +Once additional information is gathered, the Safety Team will collectively review and decide on the best course of action to take, if any. The Safety Team reserves the right to not act on a report. + +## Confidentiality + +All reports, and any additional information included, are only shared with the team of safety officers (and possibly members of the core team, in case the safety officer is in violation of the CoC). We will respect confidentiality requests for the purpose of protecting victims of abuse. + +We will not name harassment victims, beyond discussions between the safety officer and members of the nf-core team, without the explicit consent of the individuals involved. + +## Enforcement + +Actions taken by the nf-core’s Safety Team may include, but are not limited to: + +- Asking anyone to stop a behaviour. +- Asking anyone to leave the event and online spaces either temporarily, for the remainder of the event, or permanently. +- Removing access to the gather.town and Slack, either temporarily or permanently. +- Communicating to all participants to reinforce our expectations for conduct and remind what is unacceptable behaviour; this may be public for practical reasons. 
+- Communicating to all participants that an incident has taken place and how we will act or have acted — this may be for the purpose of letting event participants know we are aware of and dealing with the incident. +- Banning anyone from participating in nf-core-managed spaces, future events, and activities, either temporarily or permanently. +- No action. + +## Attribution and Acknowledgements + +- The [Contributor Covenant, version 1.4](http://contributor-covenant.org/version/1/4) +- The [OpenCon 2017 Code of Conduct](http://www.opencon2017.org/code_of_conduct) (CC BY 4.0 OpenCon organisers, SPARC and Right to Research Coalition) +- The [eLife innovation sprint 2020 Code of Conduct](https://sprint.elifesciences.org/code-of-conduct/) +- The [Mozilla Community Participation Guidelines v3.1](https://www.mozilla.org/en-US/about/governance/policies/participation/) (version 3.1, CC BY-SA 3.0 Mozilla) + +## Changelog + +### v1.4 - February 8th, 2022 + +- Included a new member of the Safety Team. Corrected a typographical error in the text. + +### v1.3 - December 10th, 2021 + +- Added a statement that the CoC applies to nf-core gather.town workspaces. Corrected typographical errors in the text. + +### v1.2 - November 12th, 2021 + +- Removed information specific to reporting CoC violations at the Hackathon in October 2021. + +### v1.1 - October 14th, 2021 + +- Updated with names of new Safety Officers and specific information for the hackathon in October 2021. + +### v1.0 - March 15th, 2021 + +- Complete rewrite from original [Contributor Covenant](http://contributor-covenant.org/) CoC. 
diff --git a/LICENSE b/LICENSE index 67c3ab0..ef97dfd 100644 --- a/LICENSE +++ b/LICENSE @@ -1,6 +1,6 @@ MIT License -Copyright (c) 2026 BigBio Stack +Copyright (c) The bigbio/quantmsdiann team Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/README.md b/README.md index fc39cfa..67c2de6 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,112 @@ # quantmsdiann -quantms workflow for DIANN tool including DIA and DDA analysis + +[![GitHub Actions CI Status](https://github.com/bigbio/quantmsdiann/actions/workflows/ci.yml/badge.svg)](https://github.com/bigbio/quantmsdiann/actions/workflows/ci.yml) +[![GitHub Actions Linting Status](https://github.com/bigbio/quantmsdiann/actions/workflows/linting.yml/badge.svg)](https://github.com/bigbio/quantmsdiann/actions/workflows/linting.yml) +[![Cite with Zenodo](https://zenodo.org/badge/DOI/10.5281/zenodo.15573386.svg)](https://doi.org/10.5281/zenodo.15573386) +[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com) + +[![Nextflow](https://img.shields.io/badge/version-%E2%89%A525.04.0-green?style=flat&logo=nextflow&logoColor=white&color=%230DC09D&link=https%3A%2F%2Fnextflow.io)](https://www.nextflow.io/) +[![nf-core template version](https://img.shields.io/badge/nf--core_template-3.5.2-green?style=flat&logo=nfcore&logoColor=white&color=%2324B064&link=https%3A%2F%2Fnf-co.re)](https://github.com/nf-core/tools/releases/tag/3.5.2) +[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/) +[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/) + +## Introduction + +**quantmsdiann** is a [bigbio](https://github.com/bigbio) bioinformatics pipeline, built following [nf-core](https://nf-co.re/) guidelines, for quantitative mass 
spectrometry analysis using [DIA-NN](https://github.com/vdemichev/DiaNN). It supports **Data-Independent Acquisition (DIA)** workflows including label-free, plexDIA (mTRAQ, SILAC, Dimethyl), phosphoproteomics with site localization, and Bruker timsTOF/PASEF data. + +The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a portable manner. It uses Docker/Singularity containers, making results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process, making it easy to maintain and update software dependencies. + +## Pipeline summary + +<p align="center"><img src="docs/images/quantmsdiann_workflow.svg" alt="quantmsdiann workflow" /></p>
+ +The pipeline takes [SDRF](https://github.com/bigbio/proteomics-metadata-standard) metadata and mass spectrometry data files (`.raw`, `.mzML`, `.d`, `.dia`) as input and performs: + +1. **Input validation** — SDRF parsing and validation via [sdrf-pipelines](https://github.com/bigbio/sdrf-pipelines) +2. **File preparation** — RAW to mzML conversion ([ThermoRawFileParser](https://github.com/compomics/ThermoRawFileParser)), indexing, Bruker `.d` handling ([tdf2mzml](https://github.com/bigbio/tdf2mzml)) +3. **In-silico spectral library generation** — deep learning-based prediction, or use a user-provided library (`--diann_speclib`) +4. **Preliminary analysis** — per-file calibration and mass accuracy estimation (parallelized) +5. **Empirical library assembly** — consensus library from preliminary results with RT profiling +6. **Individual analysis** — per-file search with the empirical library (parallelized) +7. **Final quantification** — protein/peptide/gene group matrices with cross-run normalization +8. **MSstats conversion** — DIA-NN report to [MSstats](https://msstats.org/)-compatible format +9. **Quality control** — interactive QC report via [pmultiqc](https://github.com/bigbio/pmultiqc) + +## Supported DIA-NN Versions + +| Version | Profile | Container | Key features | +| --------------- | -------------- | ------------------------------------------ | ---------------------------------------------- | +| 1.8.1 (default) | `diann_v1_8_1` | `docker.io/biocontainers/diann:v1.8.1_cv1` | Core DIA analysis, TSV output | +| 2.1.0 | `diann_v2_1_0` | `ghcr.io/bigbio/diann:2.1.0` | Native .raw support, Parquet output | +| 2.2.0 | `diann_v2_2_0` | `ghcr.io/bigbio/diann:2.2.0` | Speed optimizations (up to 1.6x on HPC) | +| 2.3.2 | `diann_v2_3_2` | `ghcr.io/bigbio/diann:2.3.2` | DDA support (beta), InfinDIA, up to 9 var mods | + +Switch versions with e.g. `-profile diann_v2_2_0,docker`. 
See the [DIA-NN Version Selection](docs/usage.md#dia-nn-version-selection) guide and [full parameter reference](docs/parameters.md) for details. + +## Quick start + +> [!NOTE] +> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set up Nextflow. + +**Run with test data:** + +```bash +nextflow run bigbio/quantmsdiann -profile test_dia,docker --outdir results +``` + +**Run with your own data:** + +```bash +nextflow run bigbio/quantmsdiann \ + --input 'experiment.sdrf.tsv' \ + --database 'proteins.fasta' \ + --outdir './results' \ + -profile docker +``` + +**Run with a specific DIA-NN version:** + +```bash +nextflow run bigbio/quantmsdiann \ + --input 'experiment.sdrf.tsv' \ + --database 'proteins.fasta' \ + --outdir './results' \ + -profile docker,diann_v2_2_0 +``` + +> [!WARNING] +> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), not for defining parameters. 
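The SDRF file passed via `--input` is a tab-separated metadata table. As a minimal illustration only (a real SDRF carries many more columns, such as instrument, label, and digestion metadata; the sample names and file names below are hypothetical), the columns this pipeline requires for each raw file look like:

```tsv
source name	characteristics[organism]	assay name	comment[data file]
sample_1	Homo sapiens	run_1	sample_1.raw
sample_2	Homo sapiens	run_2	sample_2.raw
```

See the [proteomics-metadata-standard](https://github.com/bigbio/proteomics-metadata-standard) repository for the full SDRF specification.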
+ +## Documentation + +- [Usage](docs/usage.md) — How to run the pipeline, input formats, optional outputs, and custom configuration +- [Parameters](docs/parameters.md) — Complete reference of all pipeline parameters organised by category +- [Output](docs/output.md) — Description of all output files produced by the pipeline + +## Credits + +quantmsdiann is developed and maintained by: + +- [Yasset Perez-Riverol](https://github.com/ypriverol) (EMBL-EBI) +- [Dai Chengxin](https://github.com/daichengxin) (Beijing Proteome Research Center) +- [Julianus Pfeuffer](https://github.com/jpfeuffer) (Freie Universität Berlin) +- [Vadim Demichev](https://github.com/vdemichev) (Charité Universitätsmedizin Berlin) +- [Qi-Xuan Yue](https://github.com/yueqixuan) (Chongqing University of Posts and Telecommunications) + +## Contributions and Support + +If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md). + +## Citation + +If you use quantmsdiann in your research, please cite: + +> Dai C, Pfeuffer J, Wang H, et al. quantms: a cloud-based pipeline for quantitative proteomics enables the reanalysis of public proteomics data. Nat Methods (2024). doi: [10.1038/s41592-024-02343-1](https://doi.org/10.1038/s41592-024-02343-1) + +An extensive list of references for the tools used by the pipeline can be found in the [CITATIONS.md](CITATIONS.md) file.
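The final quantification step described in the pipeline summary writes wide matrices (protein group × run) derived from DIA-NN's long-format report. A self-contained sketch of that relationship, using DIA-NN-style column names that are assumptions here and toy intensity values:

```python
import pandas as pd

# Toy long-format report with one row per (protein group, run) pair,
# mimicking a few DIA-NN report columns (assumed names, toy values)
report = pd.DataFrame({
    "Protein.Group": ["P1", "P1", "P2", "P2"],
    "Run": ["run_1", "run_2", "run_1", "run_2"],
    "PG.MaxLFQ": [1.5e6, 1.7e6, 3.2e5, 2.9e5],
})

# Pivot to the wide protein-group x run layout of a pg_matrix-style TSV
pg_matrix = report.pivot(index="Protein.Group", columns="Run", values="PG.MaxLFQ")
print(pg_matrix)
```

The real matrices produced by the pipeline are plain TSV files and can be loaded the same way with `pd.read_csv(..., sep="\t")`.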
+
+## License
+
+[MIT](LICENSE)
diff --git a/assets/adaptivecard.json b/assets/adaptivecard.json
new file mode 100644
index 0000000..266b980
--- /dev/null
+++ b/assets/adaptivecard.json
@@ -0,0 +1,68 @@
+{
+  "type": "message",
+  "attachments": [
+    {
+      "contentType": "application/vnd.microsoft.card.adaptive",
+      "contentUrl": null,
+      "content": {
+        "\$schema": "https://adaptivecards.io/schemas/adaptive-card.json",
+        "msteams": {
+          "width": "Full"
+        },
+        "type": "AdaptiveCard",
+        "version": "1.2",
+        "body": [
+          {
+            "type": "TextBlock",
+            "size": "Large",
+            "weight": "Bolder",
+            "color": "<% if (success) { %>Good<% } else { %>Attention<%} %>",
+            "text": "bigbio/quantmsdiann v${version} - ${runName}",
+            "wrap": true
+          },
+          {
+            "type": "TextBlock",
+            "spacing": "None",
+            "text": "Completed at ${dateComplete} (duration: ${duration})",
+            "isSubtle": true,
+            "wrap": true
+          },
+          {
+            "type": "TextBlock",
+            "text": "<% if (success) { %>Pipeline completed successfully!<% } else { %>Pipeline completed with errors. The full error message was: ${errorReport}.<% } %>",
+            "wrap": true
+          },
+          {
+            "type": "TextBlock",
+            "text": "The command used to launch the workflow was as follows:",
+            "wrap": true
+          },
+          {
+            "type": "TextBlock",
+            "text": "${commandLine}",
+            "isSubtle": true,
+            "wrap": true
+          }
+        ],
+        "actions": [
+          {
+            "type": "Action.ShowCard",
+            "title": "Pipeline Configuration",
+            "card": {
+              "type": "AdaptiveCard",
+              "\$schema": "https://adaptivecards.io/schemas/adaptive-card.json",
+              "body": [
+                {
+                  "type": "FactSet",
+                  "facts": [<% out << summary.collect{ k,v -> "{\"title\": \"$k\", \"value\" : \"$v\"}"
+                  }.join(",\n") %>
+                  ]
+                }
+              ]
+            }
+          }
+        ]
+      }
+    }
+  ]
+}
diff --git a/assets/email_template.html b/assets/email_template.html
new file mode 100644
index 0000000..5f5324c
--- /dev/null
+++ b/assets/email_template.html
@@ -0,0 +1,53 @@
+<html>
+  <head>
+    <meta charset="utf-8" />
+    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
+    <meta name="viewport" content="width=device-width, initial-scale=1" />
+    <meta name="description" content="bigbio/quantmsdiann Pipeline Report" />
+    <title>bigbio/quantmsdiann Pipeline Report</title>
+  </head>
+  <body>
+    <div style="font-family: Helvetica, Arial, sans-serif; padding: 30px; max-width: 800px; margin: 0 auto;">
+      <h1>bigbio/quantmsdiann ${version}</h1>
+      <h2>Run Name: $runName</h2>
+
+      <% if (!success){
+          out << """
+      <div style="color: #a94442; background-color: #f2dede; border: 1px solid #ebccd1; padding: 15px; margin-bottom: 20px; border-radius: 4px;">
+        <h4 style="margin-top: 0; color: inherit;">bigbio/quantmsdiann execution completed unsuccessfully!</h4>
+        <p>The exit status of the task that caused the workflow execution to fail was: <code>$exitStatus</code>.</p>
+        <p>The full error message was:</p>
+        <pre style="white-space: pre-wrap; overflow: visible; margin-bottom: 0;">${errorReport}</pre>
+      </div>
+      """
+      } else {
+          out << """
+      <div style="color: #3c763d; background-color: #dff0d8; border: 1px solid #d6e9c6; padding: 15px; margin-bottom: 20px; border-radius: 4px;">
+        bigbio/quantmsdiann execution completed successfully!
+      </div>
+      """
+      }
+      %>
+
+      <p>The workflow was completed at <strong>$dateComplete</strong> (duration: <strong>$duration</strong>)</p>
+
+      <p>The command used to launch the workflow was as follows:</p>
+      <pre style="white-space: pre-wrap; overflow: visible; background-color: #ededed; padding: 15px; border-radius: 4px;">$commandLine</pre>
+
+      <h3>Pipeline Configuration:</h3>
+      <table style="width: 100%; max-width: 100%; border-spacing: 0; border-collapse: collapse; border: 0; margin-bottom: 30px;">
+        <% out << summary.collect{ k,v -> "<tr><th style='text-align: left; padding: 8px 0;'>$k</th><td style='text-align: left; padding: 8px 0;'><pre style='white-space: pre-wrap; overflow: visible;'>$v</pre></td></tr>" }.join("\n") %>
+      </table>
+
+      <p>bigbio/quantmsdiann</p>
+      <p><a href="https://github.com/bigbio/quantmsdiann">https://github.com/bigbio/quantmsdiann</a></p>
+    </div>
+  </body>
+</html>
diff --git a/assets/email_template.txt b/assets/email_template.txt
new file mode 100644
index 0000000..4018b17
--- /dev/null
+++ b/assets/email_template.txt
@@ -0,0 +1,31 @@
+Run Name: $runName
+
+<% if (success){
+    out << "## bigbio/quantmsdiann execution completed successfully! ##"
+} else {
+    out << """####################################################
+## bigbio/quantmsdiann execution completed unsuccessfully! ##
+####################################################
+The exit status of the task that caused the workflow execution to fail was: $exitStatus.
+The full error message was:
+
+${errorReport}
+"""
+} %>
+
+
+The workflow was completed at $dateComplete (duration: $duration)
+
+The command used to launch the workflow was as follows:
+
+  $commandLine
+
+
+
+Pipeline Configuration:
+-----------------------
+<% out << summary.collect{ k,v -> " - $k: $v" }.join("\n") %>
+
+--
+bigbio/quantmsdiann
+https://github.com/bigbio/quantmsdiann
diff --git a/assets/methods_description_template.yml b/assets/methods_description_template.yml
new file mode 100644
index 0000000..607b49d
--- /dev/null
+++ b/assets/methods_description_template.yml
@@ -0,0 +1,28 @@
+id: "bigbio-quantmsdiann-methods-description"
+description: "Suggested text and references to use when describing pipeline usage within the methods section of a publication."
+section_name: "bigbio/quantmsdiann Methods Description"
+section_href: "https://github.com/bigbio/quantmsdiann"
+plot_type: "html"
+## You can inject any metadata from the Nextflow '${workflow}' object
+data: |

Methods

+

Data was processed using bigbio/quantmsdiann v${workflow.manifest.version} ${doi_text} a bigbio pipeline built following nf-core guidelines (Ewels et al., 2020), utilising reproducible software environments from the Bioconda (Grüning et al., 2018) and Biocontainers (da Veiga Leprevost et al., 2017) projects.

+

The pipeline was executed with Nextflow v${workflow.nextflow.version} (Di Tommaso et al., 2017) with the following command:

+
${workflow.commandLine}
+

${tool_citations}

+

References

  • Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316–319. doi: 10.1038/nbt.3820
  • Ewels, P. A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M. U., Di Tommaso, P., & Nahnsen, S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nature Biotechnology, 38(3), 276–278. doi: 10.1038/s41587-020-0439-x
  • Grüning, B., Dale, R., Sjödin, A., Chapman, B. A., Rowe, J., Tomkins-Tinch, C. H., Valieris, R., Köster, J., & Bioconda Team. (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods, 15(7), 475–476. doi: 10.1038/s41592-018-0046-7
  • da Veiga Leprevost, F., Grüning, B. A., Alves Aflitos, S., Röst, H. L., Uszkoreit, J., Barsnes, H., Vaudel, M., Moreno, P., Gatto, L., Weber, J., Bai, M., Jimenez, R. C., Sachsenberg, T., Pfeuffer, J., Vera Alvarez, R., Griss, J., Nesvizhskii, A. I., & Perez-Riverol, Y. (2017). BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics (Oxford, England), 33(16), 2580–2582. doi: 10.1093/bioinformatics/btx192
  • ${tool_bibliography}

Notes:

  • ${nodoi_text}
  • The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!
  • You should also cite all software used within this run. Check the "Software Versions" section of this report to get version information.
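The `${...}` tokens in the template above are Groovy-style placeholders that are filled in at runtime, when the methods section of the MultiQC report is rendered. The substitution mechanism can be illustrated with a minimal Python sketch — the `render` helper and the example version value are illustrative assumptions, not part of the pipeline, which uses Groovy GString interpolation:

```python
# Illustrative sketch of placeholder substitution for the methods template.
# The real pipeline relies on Groovy GString interpolation; this helper and
# the example context below are hypothetical, shown only to clarify the idea.
def render(template: str, context: dict) -> str:
    """Replace each ${key} token in the template with its value from context."""
    out = template
    for key, value in context.items():
        out = out.replace("${" + key + "}", str(value))
    return out

template = "Data was processed using bigbio/quantmsdiann v${workflow.manifest.version}."
context = {"workflow.manifest.version": "1.4.0"}
print(render(template, context))
# → Data was processed using bigbio/quantmsdiann v1.4.0.
```

Placeholders with no entry in the context (e.g. an empty `${doi_text}`) are simply left untouched by this sketch, whereas the real template engine substitutes an empty string.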
diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml new file mode 100644 index 0000000..80e33d3 --- /dev/null +++ b/assets/multiqc_config.yml @@ -0,0 +1,22 @@ +report_comment: > + This report has been generated by the bigbio/quantmsdiann analysis pipeline. For information about how to interpret these results, please see the documentation. + +thousandsSep_format: "," +export_plots: false + +sp: + pmultiqc/exp_design: + fn: "*_design.tsv" + pmultiqc/sdrf: + fn: "*.sdrf.tsv" + pmultiqc/msstats: + fn: "*msstats_in.csv" + num_lines: 0 + pmultiqc/diann_report_tsv: + fn: "*report.tsv" + num_lines: 0 + pmultiqc/diann_report_parquet: + fn: "diann_report.parquet" + num_lines: 0 + +disable_version_detection: true diff --git a/assets/nf-core-quantmsdiann_logo_light.png b/assets/nf-core-quantmsdiann_logo_light.png new file mode 100644 index 0000000..eca7021 Binary files /dev/null and b/assets/nf-core-quantmsdiann_logo_light.png differ diff --git a/assets/schema_input.json b/assets/schema_input.json new file mode 100644 index 0000000..7b15010 --- /dev/null +++ b/assets/schema_input.json @@ -0,0 +1,30 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "$id": "https://raw.githubusercontent.com/bigbio/quantmsdiann/main/assets/schema_input.json", + "title": "bigbio/quantmsdiann pipeline - params.input schema", + "description": "Schema for the file provided with params.input", + "type": "array", + "items": { + "type": "object", + "required": ["source name", "comment[data file]", "assay name"], + "properties": { + "source name": { + "type": "string", + "pattern": "^\\S+$", + "errorMessage": "Sample name must be provided and cannot contain spaces" + }, + "comment[data file]": { + "type": "string", + "format": "file-path", + "errorMessage": "Spectrum files must be provided", + "meta": ["id"] + }, + "assay name": { + "type": "string", + "pattern": "^\\S+$", + "errorMessage": "Assay name must be provided and cannot contain spaces", + "meta": ["assay"] + } 
+ } + } +} diff --git a/assets/sendmail_template.txt b/assets/sendmail_template.txt new file mode 100644 index 0000000..c762554 --- /dev/null +++ b/assets/sendmail_template.txt @@ -0,0 +1,53 @@ +To: $email +Subject: $subject +Mime-Version: 1.0 +Content-Type: multipart/related;boundary="nfcoremimeboundary" + +--nfcoremimeboundary +Content-Type: text/html; charset=utf-8 + +$email_html + +--nfcoremimeboundary +Content-Type: image/png;name="bigbio-quantmsdiann_logo.png" +Content-Transfer-Encoding: base64 +Content-ID: +Content-Disposition: inline; filename="bigbio-quantmsdiann_logo_light.png" + +<% out << new File("$projectDir/assets/bigbio-quantmsdiann_logo_light.png"). + bytes. + encodeBase64(). + toString(). + tokenize( '\n' )*. + toList()*. + collate( 76 )*. + collect { it.join() }. + flatten(). + join( '\n' ) %> + +<% +if (mqcFile){ +def mqcFileObj = new File("$mqcFile") +if (mqcFileObj.length() < mqcMaxSize){ +out << """ +--nfcoremimeboundary +Content-Type: text/html; name=\"multiqc_report\" +Content-Transfer-Encoding: base64 +Content-ID: +Content-Disposition: attachment; filename=\"${mqcFileObj.getName()}\" + +${mqcFileObj. + bytes. + encodeBase64(). + toString(). + tokenize( '\n' )*. + toList()*. + collate( 76 )*. + collect { it.join() }. + flatten(). 
+ join( '\n' )} +""" +}} +%> + +--nfcoremimeboundary-- diff --git a/assets/slackreport.json b/assets/slackreport.json new file mode 100644 index 0000000..1c153d0 --- /dev/null +++ b/assets/slackreport.json @@ -0,0 +1,34 @@ +{ + "attachments": [ + { + "fallback": "Plain-text summary of the attachment.", + "color": "<% if (success) { %>good<% } else { %>danger<%} %>", + "author_name": "bigbio/quantmsdiann ${version} - ${runName}", + "author_icon": "https://www.nextflow.io/docs/latest/_static/favicon.ico", + "text": "<% if (success) { %>Pipeline completed successfully!<% } else { %>Pipeline completed with errors<% } %>", + "fields": [ + { + "title": "Command used to launch the workflow", + "value": "```${commandLine}```", + "short": false + } + <% + if (!success) { %> + , + { + "title": "Full error message", + "value": "```${errorReport}```", + "short": false + }, + { + "title": "Pipeline configuration", + "value": "<% out << summary.collect{ k,v -> k == "hook_url" ? "_${k}_: (_hidden_)" : ( ( v.class.toString().contains('Path') || ( v.class.toString().contains('String') && v.contains('/') ) ) ? "_${k}_: `${v}`" : (v.class.toString().contains('DateTime') ? 
("_${k}_: " + v.format(java.time.format.DateTimeFormatter.ofLocalizedDateTime(java.time.format.FormatStyle.MEDIUM))) : "_${k}_: ${v}") ) }.join(",\n") %>", + "short": false + } + <% } + %> + ], + "footer": "Completed at <% out << dateComplete.format(java.time.format.DateTimeFormatter.ofLocalizedDateTime(java.time.format.FormatStyle.MEDIUM)) %> (duration: ${duration})" + } + ] +} diff --git a/conf/base.config b/conf/base.config new file mode 100644 index 0000000..ca22fec --- /dev/null +++ b/conf/base.config @@ -0,0 +1,74 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + bigbio/quantmsdiann Nextflow base config file +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + A 'blank slate' config file, appropriate for general use on most high performance + compute environments. Assumes that all software is installed and available on + the PATH. Runs in `local` mode - all jobs will be run on the logged in environment. +---------------------------------------------------------------------------------------- +*/ + +process { + + cpus = { 1 * task.attempt } + memory = { 6.GB * task.attempt } + time = { 4.h * task.attempt } + + errorStrategy = { task.exitStatus in ((130..145) + 104 + 175) ? 'retry' : 'finish' } + maxRetries = 1 + maxErrors = '-1' + + // Process-specific resource requirements + // NOTE - Please try and reuse the labels below as much as possible. + // These labels are used and recognised by default in DSL2 files hosted on nf-core/modules. + // If possible, it would be nice to keep the same label naming convention when + // adding in your local modules too. 
+ + withLabel:process_tiny { + cpus = { 1 } + memory = { 1.GB * task.attempt } + time = { 1.h * task.attempt } + } + withLabel:process_single { + cpus = { 1 } + memory = { 6.GB * task.attempt } + time = { 4.h * task.attempt } + } + withLabel:process_low { + cpus = { 4 * task.attempt } + memory = { 12.GB * task.attempt } + time = { 6.h * task.attempt } + } + withLabel:process_very_low { + cpus = { 2 * task.attempt} + memory = { 4.GB * task.attempt} + time = { 3.h * task.attempt} + } + withLabel:process_medium { + cpus = { 8 * task.attempt } + memory = { 36.GB * task.attempt } + time = { 8.h * task.attempt } + } + withLabel:process_high { + cpus = { 12 * task.attempt } + memory = { 72.GB * task.attempt } + time = { 16.h * task.attempt } + } + withLabel:process_long { + time = { 20.h * task.attempt } + } + withLabel:process_high_memory { + memory = { 200.GB * task.attempt } + } + withLabel:error_ignore { + errorStrategy = 'ignore' + } + withLabel:error_retry { + errorStrategy = 'retry' + maxRetries = 2 + } + withLabel: process_gpu { + ext.use_gpu = { workflow.profile.contains('gpu') } + accelerator = { workflow.profile.contains('gpu') ? 1 : null } + } +} diff --git a/conf/dev.config b/conf/dev.config new file mode 100644 index 0000000..4df221a --- /dev/null +++ b/conf/dev.config @@ -0,0 +1,26 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running with nightly dev. containers (mainly for OpenMS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Only overwrites the used containers. E.g., uses the OpenMS nightly + executable and thirdparty containers. 
+ + Use as follows: + nextflow run bigbio/quantmsdiann -profile dev, [--outdir ] + +------------------------------------------------------------------------------------------- +*/ + +params { + config_profile_name = 'Development profile' + config_profile_description = 'To use nightly development containers' +} + +process { + withLabel: openms { + // Conda is no longer supported + container = {"${ ( workflow.containerEngine == 'singularity' || workflow.containerEngine == 'apptainer' ) && !task.ext.singularity_pull_docker_container ? 'oras://ghcr.io/openms/openms-tools-thirdparty-sif:latest' : 'ghcr.io/openms/openms-tools-thirdparty:latest' }"} + } + + +} diff --git a/conf/diann_versions/v1_8_1.config b/conf/diann_versions/v1_8_1.config new file mode 100644 index 0000000..5bfb7ef --- /dev/null +++ b/conf/diann_versions/v1_8_1.config @@ -0,0 +1,11 @@ +/* + * DIA-NN 1.8.1 container override (public biocontainers) + * Used by merge_ci.yml for version × feature matrix testing. + */ +params.diann_version = '1.8.1' + +process { + withLabel: diann { + container = 'docker.io/biocontainers/diann:v1.8.1_cv1' + } +} diff --git a/conf/diann_versions/v2_1_0.config b/conf/diann_versions/v2_1_0.config new file mode 100644 index 0000000..9915726 --- /dev/null +++ b/conf/diann_versions/v2_1_0.config @@ -0,0 +1,14 @@ +/* + * DIA-NN 2.1.0 container override (private ghcr.io) + * Used by merge_ci.yml for version × feature matrix testing. + */ +params.diann_version = '2.1.0' + +process { + withLabel: diann { + container = 'ghcr.io/bigbio/diann:2.1.0' + } +} + +singularity.enabled = false +docker.enabled = true diff --git a/conf/diann_versions/v2_2_0.config b/conf/diann_versions/v2_2_0.config new file mode 100644 index 0000000..93ea4ee --- /dev/null +++ b/conf/diann_versions/v2_2_0.config @@ -0,0 +1,14 @@ +/* + * DIA-NN 2.2.0 container override (private ghcr.io) + * Used by merge_ci.yml for version × feature matrix testing. 
+ */ +params.diann_version = '2.2.0' + +process { + withLabel: diann { + container = 'ghcr.io/bigbio/diann:2.2.0' + } +} + +singularity.enabled = false +docker.enabled = true diff --git a/conf/modules/dia.config b/conf/modules/dia.config new file mode 100644 index 0000000..fff6e00 --- /dev/null +++ b/conf/modules/dia.config @@ -0,0 +1,32 @@ +/* +======================================================================================== + DIA-NN module options — all DIA pipeline steps +======================================================================================== + Passes params.diann_extra_args to each DIA-NN step via ext.args. + Blocked-flag validation is performed in each module's script block, + so it applies regardless of whether ext.args is overridden by the user. +---------------------------------------------------------------------------------------- +*/ + +process { + + withName: ".*:DIA:INSILICO_LIBRARY_GENERATION" { + ext.args = { params.diann_extra_args ?: '' } + } + + withName: ".*:DIA:PRELIMINARY_ANALYSIS" { + ext.args = { params.diann_extra_args ?: '' } + } + + withName: ".*:DIA:ASSEMBLE_EMPIRICAL_LIBRARY" { + ext.args = { params.diann_extra_args ?: '' } + } + + withName: ".*:DIA:INDIVIDUAL_ANALYSIS" { + ext.args = { params.diann_extra_args ?: '' } + } + + withName: ".*:DIA:FINAL_QUANTIFICATION" { + ext.args = { params.diann_extra_args ?: '' } + } +} diff --git a/conf/modules/shared.config b/conf/modules/shared.config new file mode 100644 index 0000000..8387b9c --- /dev/null +++ b/conf/modules/shared.config @@ -0,0 +1,67 @@ +/* +======================================================================================== + Shared module options — publishDir for outputs used across all workflows +======================================================================================== +*/ + +process { + + // publishDir for pmultiqc reports + withName: 'BIGBIO_QUANTMSDIANN:QUANTMSDIANN:SUMMARY_PIPELINE' { + publishDir = [ + path: { 
"${params.outdir}/pmultiqc" }, + mode: 'copy', + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + // publishDir for SDRF files + withName: 'BIGBIO_QUANTMSDIANN:QUANTMSDIANN:INPUT_CHECK:SAMPLESHEET_CHECK' { + publishDir = [ + path: { "${params.outdir}/sdrf" }, + mode: 'copy', + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + // publishDir for configurations files from SDRF parsing + withName: '.*:SDRF_PARSING' { + publishDir = [ + path: { "${params.outdir}/sdrf" }, + mode: 'copy', + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + // Result tables from DIA-NN quantification and MSstats format conversion + withName: '.*:(FINAL_QUANTIFICATION|DIANN_MSSTATS)' { + publishDir = [ + path: { "${params.outdir}/quant_tables" }, + mode: 'copy', + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + // Optional: publish TSV spectral library from in-silico generation. + // Enable via ext.publish_speclib_tsv in a custom config or via --save_speclib_tsv. + withName: '.*:INSILICO_LIBRARY_GENERATION' { + publishDir = [ + path: { "${params.outdir}/library_generation" }, + mode: 'copy', + saveAs: { filename -> + if (filename.equals('versions.yml')) return null + if (filename.endsWith('.tsv') && (task.ext.publish_speclib_tsv || params.save_speclib_tsv)) return filename + return null + } + ] + } + + // publishDir for all features tables + withName: '.*:MZML_STATISTICS' { + publishDir = [ + path: { "${params.outdir}/spectra/mzml_statistics" }, + mode: 'copy', + saveAs: { filename -> filename.equals('versions.yml') ? 
null : filename } + ] + } +} diff --git a/conf/modules/verbose_modules.config b/conf/modules/verbose_modules.config new file mode 100644 index 0000000..e72666c --- /dev/null +++ b/conf/modules/verbose_modules.config @@ -0,0 +1,72 @@ + +// verbose_modules.config +process { + + // Override default publish behavior to include all intermediate outputs, + // here we use parameter publish_dir_mode. + publishDir = [ + path: { "${params.outdir}/${task.process.tokenize(':')[-1].toLowerCase()}" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + + // Spectra conversion and preprocessing + withName: '.*:THERMORAWFILEPARSER' { + publishDir = [ + path: { "${params.outdir}/spectra/thermorawfileparser" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + withName: '.*:(MZML_INDEXING|SPECTRUM_FEATURES|MZML_STATISTICS)' { + publishDir = [ + path: { "${params.outdir}/spectra/${task.process.tokenize(':')[-1].toLowerCase()}" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + // DIA-NN preprocessing steps + withName: '.*:PRELIMINARY_ANALYSIS' { + publishDir = [ + path: { "${params.outdir}/diann_preprocessing/preliminary_analysis" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + withName: '.*:INDIVIDUAL_ANALYSIS' { + publishDir = [ + path: { "${params.outdir}/diann_preprocessing/individual_analysis" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + // DIA-NN library generation + withName: '.*:ASSEMBLE_EMPIRICAL_LIBRARY' { + publishDir = [ + path: { "${params.outdir}/library_generation/assemble_empirical_library" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? 
null : filename } + ] + } + + withName: '.*:INSILICO_LIBRARY_GENERATION' { + publishDir = [ + path: { "${params.outdir}/library_generation/insilico_library_generation" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } + + // DIA-NN configuration + withName: '.*:GENERATE_CFG' { + publishDir = [ + path: { "${params.outdir}/sdrf" }, + mode: params.publish_dir_mode, + saveAs: { filename -> filename.equals('versions.yml') ? null : filename } + ] + } +} diff --git a/conf/pride_codon_slurm.config b/conf/pride_codon_slurm.config new file mode 100644 index 0000000..0848be1 --- /dev/null +++ b/conf/pride_codon_slurm.config @@ -0,0 +1,62 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Nextflow config file for EMBL-EBI Codon Cluster for the SLURM login nodes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Author: Yasset Perez-Riverol +Mail: yperez@ebi.ac.uk +URL: https://www.ebi.ac.uk/ +Based on: https://github.com/nf-core/configs/blob/master/conf/ebi_codon_slurm.config +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +params { + config_profile_contact = "Yasset Perez-Riverol" + config_profile_description = "The European Bioinformatics Institute HPC cluster (codon) profile for the SLURM login nodes" + config_profile_url = "https://www.ebi.ac.uk/" +} + +singularity { + enabled = true + // the default is 20 minutes and fails with large images + pullTimeout = "3 hours" + autoMounts = false + runOptions = '-B /hps/nobackup/juan/pride/reanalysis:/hps/nobackup/juan/pride/reanalysis' + cacheDir = "/hps/nobackup/juan/pride/reanalysis/singularity/" +} + +process { + // this is to avoid errors for missing files due to shared filesystem latency + maxRetries = 30 + errorStrategy = { task.exitStatus in ((130..145).toList() + [104, 1, 9, 134, 97]) ? 
"retry" : "terminate" } + cache = "lenient" + afterScript = "sleep 60" + + resourceLimits = [ + memory: 1900.GB, + cpus: 48, + time: 30.d + ] + + withName:ASSEMBLE_EMPIRICAL_LIBRARY{ + memory = {def val = (ms_files as List).size() < 200 ? 72 * task.attempt : 250 * task.attempt; Math.min(val, 1900).GB} + cpus = {Math.min((ms_files as List).size() < 200 ? 12 * task.attempt : 24 * task.attempt, 48)} + } + + withLabel: diann { + container = '/hps/nobackup/juan/pride/reanalysis/singularity/ghcr.io-bigbio-diann-1.9.2.sif' + } +} + +executor { + name = "slurm" + queueSize = 2000 + submitRateLimit = "10/1sec" + exitReadTimeout = "30 min" + queueGlobalStatus = true + jobName = { + task.name + .replace("[", "(") + .replace("]", ")") + .replace(" ", "_") + } +} diff --git a/conf/tests/test_dia.config b/conf/tests/test_dia.config new file mode 100644 index 0000000..2d321bc --- /dev/null +++ b/conf/tests/test_dia.config @@ -0,0 +1,47 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running minimal tests (DIA) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines input files and everything required to run a fast and simple test. + + Use as follows: + nextflow run bigbio/quantmsdiann -profile test_dia, [--outdir ] + +------------------------------------------------------------------------------------------------ +*/ + +process { + resourceLimits = [ + cpus: 4, + memory: '6.GB', + time: '48.h' + ] +} +params { + config_profile_name = 'Test profile for DIA' + config_profile_description = 'Minimal test dataset to check pipeline function for the data-independent acquisition pipeline branch.' 
+ + outdir = './results_dia' + + // Input data + input = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci/PXD026600.sdrf.tsv' + database = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci/REF_EColi_K12_UPS1_combined.fasta' + min_pr_mz = 350 + max_pr_mz = 950 + min_fr_mz = 500 + max_fr_mz = 1500 + min_peptide_length = 15 + max_peptide_length = 30 + max_precursor_charge = 3 + allowed_missed_cleavages = 1 + diann_normalize = false + publish_dir_mode = 'symlink' + max_mods = 2 +} + +process { + // thermorawfileparser + withName: 'BIGBIO_QUANTMSDIANN:QUANTMSDIANN:FILE_PREPARATION:THERMORAWFILEPARSER' { + publishDir = [path: { "${params.outdir}/${task.process.tokenize(':')[-1].toLowerCase()}" }, pattern: "*.log" ] + } +} diff --git a/conf/tests/test_dia_2_2_0.config b/conf/tests/test_dia_2_2_0.config new file mode 100644 index 0000000..3772ec6 --- /dev/null +++ b/conf/tests/test_dia_2_2_0.config @@ -0,0 +1,51 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for testing DIA-NN 2.2.0 (latest release) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Tests the pipeline with the latest DIA-NN 2.2.0 container to verify forward + compatibility. Uses the same test data as test_dia.config. + + Use as follows: + nextflow run bigbio/quantmsdiann -profile test_dia_2_2_0, [--outdir ] + +------------------------------------------------------------------------------------------------ +*/ + +process { + resourceLimits = [ + cpus: 4, + memory: '12.GB', + time: '48.h' + ] +} + +params { + config_profile_name = 'Test profile for DIA-NN 2.2.0' + config_profile_description = 'Test dataset to verify pipeline compatibility with DIA-NN 2.2.0 (latest release).' 
+ + outdir = './results_dia_2_2_0' + + // Input data + input = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci/PXD026600.sdrf.tsv' + database = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci/REF_EColi_K12_UPS1_combined.fasta' + min_pr_mz = 350 + max_pr_mz = 950 + min_fr_mz = 500 + max_fr_mz = 1500 + min_peptide_length = 15 + max_peptide_length = 30 + max_precursor_charge = 3 + allowed_missed_cleavages = 1 + diann_normalize = false + publish_dir_mode = 'symlink' + max_mods = 2 +} + +process { + withLabel: diann { + container = 'ghcr.io/bigbio/diann:2.2.0' + } +} + +singularity.enabled = false +docker.enabled = true diff --git a/conf/tests/test_dia_dotd.config b/conf/tests/test_dia_dotd.config new file mode 100644 index 0000000..cd151ca --- /dev/null +++ b/conf/tests/test_dia_dotd.config @@ -0,0 +1,46 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running minimal tests (DIA with Bruker .d files) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines input files and everything required to run a fast and simple test + for the DIA-PASEF workflow with timsTOF .d files. + + Use as follows: + nextflow run bigbio/quantmsdiann -profile test_dia_dotd, [--outdir ] + +------------------------------------------------------------------------------------------------ +*/ + +process { + resourceLimits = [ + cpus: 4, + memory: '6.GB', + time: '48.h' + ] +} + +params { + config_profile_name = 'Test profile for DIA with Bruker .d files' + config_profile_description = 'Test dataset for DIA-PASEF workflow with timsTOF .d files (PXD065380).' 
+ + outdir = './results_dia_dotd' + + // Input data + input = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci_dotd/PXD065380.sdrf.tsv' + database = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/databases/PXD065380.fasta' + + min_pr_mz = 350 + max_pr_mz = 950 + min_fr_mz = 500 + max_fr_mz = 1500 + min_peptide_length = 7 + max_peptide_length = 30 + max_precursor_charge = 4 + allowed_missed_cleavages = 2 + diann_normalize = false + publish_dir_mode = 'symlink' + max_mods = 2 + mass_acc_automatic = false + mass_acc_ms1 = 10 + mass_acc_ms2 = 10 +} diff --git a/conf/tests/test_dia_local.config b/conf/tests/test_dia_local.config new file mode 100644 index 0000000..8d523c5 --- /dev/null +++ b/conf/tests/test_dia_local.config @@ -0,0 +1,18 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Local container overrides for testing with dev builds of sdrf-pipelines and quantms-utils. + Uses docker.io/ prefix to prevent quay.io registry from being prepended. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +process { + withName: 'SDRF_PARSING' { + container = 'docker.io/local/sdrf-pipelines:dev' + } + withName: 'SAMPLESHEET_CHECK' { + container = 'docker.io/local/quantms-utils:dev' + } + withName: 'DIANN_MSSTATS' { + container = 'docker.io/local/quantms-utils:dev' + } +} diff --git a/conf/tests/test_dia_parquet.config b/conf/tests/test_dia_parquet.config new file mode 100644 index 0000000..dfea81d --- /dev/null +++ b/conf/tests/test_dia_parquet.config @@ -0,0 +1,54 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for testing Parquet output (requires DIA-NN >= 2.0) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Tests the Parquet-native output format introduced in DIA-NN 2.0. 
+ Uses ghcr.io/bigbio/diann:2.2.0. + + Use as follows: + nextflow run bigbio/quantmsdiann -profile test_dia_parquet, [--outdir ] + +------------------------------------------------------------------------------------------------ +*/ + +process { + resourceLimits = [ + cpus: 4, + memory: '12.GB', + time: '48.h' + ] +} + +params { + config_profile_name = 'Test profile for DIA Parquet output' + config_profile_description = 'Test dataset to check Parquet output format (DIA-NN >= 2.0).' + + outdir = './results_dia_parquet' + + // Input data + input = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci/PXD026600.sdrf.tsv' + database = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci/REF_EColi_K12_UPS1_combined.fasta' + min_pr_mz = 350 + max_pr_mz = 950 + min_fr_mz = 500 + max_fr_mz = 1500 + min_peptide_length = 15 + max_peptide_length = 30 + max_precursor_charge = 3 + allowed_missed_cleavages = 1 + diann_normalize = false + publish_dir_mode = 'symlink' + max_mods = 2 + + // Parquet-specific: enable decoy reporting (Parquet-only feature in DIA-NN 2.0+) + diann_report_decoys = true +} + +process { + withLabel: diann { + container = 'ghcr.io/bigbio/diann:2.2.0' + } +} + +singularity.enabled = false +docker.enabled = true diff --git a/conf/tests/test_dia_quantums.config b/conf/tests/test_dia_quantums.config new file mode 100644 index 0000000..9f531b6 --- /dev/null +++ b/conf/tests/test_dia_quantums.config @@ -0,0 +1,51 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for testing QuantUMS quantification (requires DIA-NN >= 1.9.2) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Tests the QuantUMS protein quantification method introduced in DIA-NN 1.9. + Uses ghcr.io/bigbio/diann:2.2.0 (minimum supported: 1.9.2). 
+ + Use as follows: + nextflow run bigbio/quantmsdiann -profile test_dia_quantums, [--outdir ] + +------------------------------------------------------------------------------------------------ +*/ + +process { + resourceLimits = [ + cpus: 4, + memory: '6.GB', + time: '48.h' + ] +} + +params { + config_profile_name = 'Test profile for DIA QuantUMS' + config_profile_description = 'Test dataset to check QuantUMS quantification method (DIA-NN >= 1.9.2).' + + outdir = './results_dia_quantums' + + // Input data + input = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci/PXD026600.sdrf.tsv' + database = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci/REF_EColi_K12_UPS1_combined.fasta' + min_pr_mz = 350 + max_pr_mz = 950 + min_fr_mz = 500 + max_fr_mz = 1500 + min_peptide_length = 15 + max_peptide_length = 30 + max_precursor_charge = 3 + allowed_missed_cleavages = 1 + diann_normalize = false + publish_dir_mode = 'symlink' + max_mods = 2 + + // QuantUMS-specific: enable QuantUMS instead of default --direct-quant + quantums = true +} + +process { + withLabel: diann { + container = 'ghcr.io/bigbio/diann:2.2.0' + } +} diff --git a/conf/tests/test_full_dia.config b/conf/tests/test_full_dia.config new file mode 100644 index 0000000..67b05a2 --- /dev/null +++ b/conf/tests/test_full_dia.config @@ -0,0 +1,38 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running real full dia tests (DIA) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines input files and everything required to run a real and full-size test. 
+ + Use as follows: + nextflow run bigbio/quantmsdiann -profile test_full_dia, [--outdir ] + +------------------------------------------------------------------------------------------------ +*/ + +process { + resourceLimits = [ + cpus: 4, + memory: '6.GB', + time: '48.h' + ] +} + +params { + config_profile_name = 'Real full-size test profile for DIA' + config_profile_description = 'Real full-size test dataset to check pipeline function for the data-independent acquisition pipeline branch.' + + outdir = './results_dia_full' + + // Input data + input = 'https://raw.githubusercontent.com/nf-core/test-datasets/quantms/testdata-aws/dia_full/PXD004684.sdrf.tsv' + database = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci/REF_EColi_K12_UPS1_combined.fasta' + min_pr_mz = 450 + max_pr_mz = 1080 + min_fr_mz = 500 + max_fr_mz = 1500 + max_precursor_charge = 3 + allowed_missed_cleavages = 1 + diann_normalize = false + max_mods = 1 +} diff --git a/conf/tests/test_latest_dia.config b/conf/tests/test_latest_dia.config new file mode 100644 index 0000000..10bd418 --- /dev/null +++ b/conf/tests/test_latest_dia.config @@ -0,0 +1,54 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running minimal tests (DIA) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Defines input files and everything required to run a fast and simple test. + + Use as follows: + nextflow run bigbio/quantmsdiann -profile test_latest_dia, [--outdir ] + +------------------------------------------------------------------------------------------------ +*/ + +params { + config_profile_name = 'Test profile for latest DIA (2.2.0)' + config_profile_description = 'Minimal test dataset to check pipeline function with the latest DIA-NN 2.2.0.' 
+ + outdir = './results_latest_dia' + + // Input data + input = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci/PXD026600.sdrf.tsv' + database = 'https://raw.githubusercontent.com/bigbio/quantms-test-datasets/quantms/testdata/dia_ci/REF_EColi_K12_UPS1_combined.fasta' + min_pr_mz = 350 + max_pr_mz = 950 + min_fr_mz = 500 + max_fr_mz = 1500 + min_peptide_length = 15 + max_peptide_length = 30 + max_precursor_charge = 3 + allowed_missed_cleavages = 1 + diann_normalize = false + publish_dir_mode = 'symlink' + max_mods = 2 +} + +process { + // thermorawfileparser + withName: 'BIGBIO_QUANTMSDIANN:QUANTMSDIANN:FILE_PREPARATION:THERMORAWFILEPARSER' { + publishDir = [path: { "${params.outdir}/${task.process.tokenize(':')[-1].toLowerCase()}" }, pattern: "*.log" ] + } + + withLabel: diann { + container = 'ghcr.io/bigbio/diann:2.2.0' // This docker container is private for quantmsdiann + } + + resourceLimits = [ + cpus: 4, + memory: '12.GB', + time: '48.h' + ] + +} + +singularity.enabled = false // Force to use docker +docker.enabled = true diff --git a/conf/wave.config b/conf/wave.config new file mode 100644 index 0000000..bab40de --- /dev/null +++ b/conf/wave.config @@ -0,0 +1,22 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Nextflow config file for running with containers from wave sequera service containers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Only overwrites the used containers providers instead of using quay.io containers or + biocontainers singularity repositories. 
+
+ Use as follows:
+ nextflow run bigbio/quantmsdiann -profile wave,<docker/singularity> [--outdir <OUTDIR>]
+
+-------------------------------------------------------------------------------------------
+*/
+params {
+ config_profile_name = 'Wave profile'
+ config_profile_description = 'To use containers from the Seqera Wave service'
+}
+
+process {
+ withName: 'MS2RESCORE' {
+ container = {"${ ( workflow.containerEngine == 'singularity' || workflow.containerEngine == 'apptainer' ) && !task.ext.singularity_pull_docker_container ? 'oras://community.wave.seqera.io/library/quantms-rescoring:0.0.7--c57266e6c9f27985' : 'community.wave.seqera.io/library/quantms-rescoring:0.0.7--2ab4d0ac2f872759' }"}
+ }
+} diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..7d34cd4 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,7 @@ +# bigbio/quantmsdiann: Documentation
+
+Documentation for the bigbio/quantmsdiann pipeline.
+
+- [Usage](usage.md) - How to run the pipeline
+- [Parameters](parameters.md) - Complete reference of all pipeline parameters
+- [Output](output.md) - Description of pipeline output files diff --git a/docs/images/nf-core-quantmsdiann_logo_dark.png b/docs/images/nf-core-quantmsdiann_logo_dark.png new file mode 100644 index 0000000..eca7021 Binary files /dev/null and b/docs/images/nf-core-quantmsdiann_logo_dark.png differ diff --git a/docs/images/nf-core-quantmsdiann_logo_light.png b/docs/images/nf-core-quantmsdiann_logo_light.png new file mode 100644 index 0000000..eca7021 Binary files /dev/null and b/docs/images/nf-core-quantmsdiann_logo_light.png differ diff --git a/docs/images/quantmsdiann_workflow.svg b/docs/images/quantmsdiann_workflow.svg new file mode 100644 index 0000000..93c114d --- /dev/null +++ b/docs/images/quantmsdiann_workflow.svg @@ -0,0 +1,217 @@
+[SVG markup not recoverable from this extract; only the diagram's text labels survive. Legend: optional/skippable, input handling, preprocessing, DIA-NN analysis, statistics & export, quality control; per-file (parallel) vs collective (all files) processes. Flow: SDRF + raw files (.raw/.mzML/.d/.dia) and FASTA database → INPUT_CHECK (SDRF parsing & validation) → FILE_PREPARATION (RAW → mzML conversion, indexing; per-file parallel) → GENERATE_CFG (enzyme + mods → diann.cfg) → INSILICO_LIBRARY_GENERATION (in-silico prediction from FASTA; skip if --diann_speclib) → PRELIMINARY_ANALYSIS (per-file calibration & mass accuracy; per-file parallel) → ASSEMBLE_EMPIRICAL_LIBRARY (consensus from .quant files; skip if --skip_preliminary_analysis) → INDIVIDUAL_ANALYSIS (per-file search with empirical library; per-file parallel) → FINAL_QUANTIFICATION (summary report + matrices) → DIANN_MSSTATS (DIA-NN report → MSstats CSV) and PMULTIQC (quality control report). Outputs: quant tables (pg/pr/gg matrices), msstats_in.csv, interactive HTML QC report.] diff --git a/docs/output.md b/docs/output.md new file mode 100644 index 0000000..dab8fcd --- /dev/null +++ b/docs/output.md @@ -0,0 +1,130 @@ +# bigbio/quantmsdiann: Output
+
+## Introduction
+
+This document describes the output produced by the pipeline. Most plots are taken from the pmultiqc report, which summarises results at the end of the pipeline.
+
+The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
+ +## Pipeline overview + +The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes DIA data using the following steps: + +1. RAW data is converted to mzML using ThermoRawFileParser (or .d/.dia files are handled natively) +2. DIA-NN is used for identification and quantification of peptides and proteins +3. DIA-NN report is converted to MSstats-compatible format +4. Generation of QC reports using pmultiqc + +## Output structure + +Output will be saved to the folder defined by the parameter `--outdir`. + +### Default Output Structure + +``` +results/ +├── pipeline_info/ # Nextflow pipeline information +├── sdrf/ # SDRF files and configs +├── quant_tables/ # Quantification tables and results +│ ├── diann_report.{tsv,parquet} # Main DIA-NN report +│ ├── diann_report.pg_matrix.tsv # Protein group matrix +│ ├── diann_report.pr_matrix.tsv # Precursor matrix +│ ├── diann_report.gg_matrix.tsv # Gene group matrix +│ └── out_msstats_in.csv # MSstats-compatible output +└── pmultiqc/ # pmultiqc reports + ├── multiqc_plots/ + │ ├── png/ + │ ├── svg/ + │ └── pdf/ + └── multiqc_data/ +``` + +### Verbose Output Structure + +For more detailed output with all intermediate files, use the verbose output configuration by providing `-profile verbose_modules`. 
This is useful for debugging or detailed analysis: + +``` +results/ +├── pipeline_info/ +├── sdrf/ +├── spectra/ +│ ├── thermorawfileparser/ # Converted raw files +│ └── mzml_statistics/ # mzML file statistics +├── database_generation/ +│ ├── insilico_library_generation/ # In silico library +│ └── assemble_empirical_library/ # Empirical library +├── diann_preprocessing/ +│ ├── preliminary_analysis/ # Preliminary analysis results +│ └── individual_analysis/ # Individual analysis results +├── quant_tables/ +└── pmultiqc/ +``` + +### Key Output Files + +- **DIA-NN quantification results:** + - `quant_tables/diann_report.{tsv,parquet}` - Main DIA-NN report with peptide and protein quantification + - `quant_tables/diann_report.pr_matrix.tsv` - Precursor quantification matrix + - `quant_tables/diann_report.pg_matrix.tsv` - Protein group quantification matrix + - `quant_tables/diann_report.gg_matrix.tsv` - Gene group quantification matrix + - `quant_tables/diann_report.unique_genes_matrix.tsv` - Unique gene quantification matrix + - `quant_tables/out_msstats_in.csv` - MSstats-compatible quantification table + +### Parquet vs TSV Output + +Starting with DIA-NN 2.0, the main report is produced in **Apache Parquet** format (`diann_report.parquet`) instead of the legacy TSV (`diann_report.tsv`). Parquet files are columnar, compressed, and significantly faster to load in downstream tools such as Python (pandas/pyarrow) or R (arrow). + +| DIA-NN Version | Main report format | Matrix format | +| -------------- | ---------------------- | ------------- | +| 1.8.1 | `diann_report.tsv` | `.tsv` | +| 2.1.0+ | `diann_report.parquet` | `.tsv` | + +The pipeline detects the DIA-NN version and handles the output format automatically. Downstream steps (MSstats conversion, pmultiqc) accept both formats. 
+ +To read Parquet files: + +```python +# Python +import pandas as pd +df = pd.read_parquet("diann_report.parquet") +``` + +```r +# R +library(arrow) +df <- read_parquet("diann_report.parquet") +``` + +### MSstats-Compatible Output + +The pipeline produces `quant_tables/out_msstats_in.csv`, an MSstats-compatible quantification table generated by `quantms-utils`. This file contains long-format precursor-level intensities with the columns required by the [MSstats](https://msstats.org/) R package for downstream statistical analysis (e.g. differential expression, sample-size estimation). + +Key columns include: `ProteinName`, `PeptideSequence`, `PrecursorCharge`, `FragmentIon`, `ProductCharge`, `IsotopeLabelType`, `Condition`, `BioReplicate`, `Run`, `Intensity`. + +The condition and biological replicate assignments are derived from the SDRF factor columns. + +### Optional Output Files + +These files are not published by default. Enable them with `save_*` parameters or `ext.*` config properties (see [Usage: Optional outputs](usage.md#optional-outputs)). + +- `library_generation/*.tsv` - TSV spectral library from in-silico library generation (`--save_speclib_tsv`) + +### Nextflow pipeline info + +[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. + +`pipeline_info/`: + +- `execution_report.html` - Resource usage report +- `execution_timeline.html` - Timeline visualization +- `execution_trace.txt` - Detailed execution trace +- `pipeline_dag.html` - DAG visualization +- `software_versions.yml` - Software versions used + +### pmultiqc + +All QC results are generated by [pmultiqc](https://github.com/bigbio/pmultiqc), a proteomics plugin for [MultiQC](http://multiqc.info). 
The interactive HTML report provides: + +- Identification and quantification metrics +- Sample-level quality statistics +- Pipeline software versions diff --git a/docs/parameters.md b/docs/parameters.md new file mode 100644 index 0000000..0f25c4d --- /dev/null +++ b/docs/parameters.md @@ -0,0 +1,136 @@ +# bigbio/quantmsdiann: Parameters + +This document lists every pipeline parameter organised by category. Default values come from `nextflow.config`; types and constraints come from `nextflow_schema.json`. + +## 1. Input/Output Options + +| Parameter | Type | Default | Description | +| -------------------- | ----------------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `--input` | string (file-path) | _required_ | URI/path to an SDRF file with `.sdrf`, `.tsv`, or `.csv` extension. Parameters such as enzyme, fixed modifications, and acquisition method are read from the SDRF. | +| `--database` | string (file-path) | _required_ | Path to a FASTA protein database. Must not contain decoys for DIA data. | +| `--outdir` | string (directory-path) | `./results` | The output directory where results will be saved. | +| `--publish_dir_mode` | string | `copy` | Method used to save pipeline results. One of: `symlink`, `rellink`, `link`, `copy`, `copyNoFollow`, `move`. | +| `--root_folder` | string | `null` | Root folder in which spectrum files specified in the SDRF are searched. Used when you have a local copy of the experiment. | +| `--local_input_type` | string | `mzML` | Overwrite the file type/extension of filenames in the SDRF when using `--root_folder`. One of: `mzML`, `raw`, `d`, `dia`. Compressed variants (`.gz`, `.tar`, `.tar.gz`, `.zip`) are supported for `mzML`, `raw`, and `d`. | +| `--email` | string | `null` | Email address for completion summary. | + +## 2. 
SDRF Validation + +| Parameter | Type | Default | Description | +| ---------------------- | ------- | ------- | ------------------------------------------------------------------------------------------------------------------------- | +| `--use_ols_cache_only` | boolean | `true` | Use only the cached Ontology Lookup Service (OLS) for ontology term validation. Set to `false` to allow network requests. | + +## 3. File Preparation (Spectrum Preprocessing) + +| Parameter | Type | Default | Description | +| ------------------- | ------- | ------- | ----------------------------------------------------------------------------------------------------------------------------- | +| `--reindex_mzml` | boolean | `true` | Force re-indexing of input mzML files at the start of the pipeline for safety. | +| `--mzml_statistics` | boolean | `false` | Compute MS1/MS2 statistics from mzML files. Generates `*_ms_info.parquet` files for QC. Bruker `.d` files are always skipped. | +| `--mzml_features` | boolean | `false` | Compute MS1-level features during the mzML statistics step. Only available for mzML files. | + +## 4. Search Parameters + +| Parameter | Type | Default | Description | +| --------------------------------- | ------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `--met_excision` | boolean | `true` | Account for N-terminal methionine excision during database search. | +| `--allowed_missed_cleavages` | integer | `2` | Maximum number of allowed missed enzyme cleavages per peptide. | +| `--precursor_mass_tolerance` | integer | `5` | Precursor mass tolerance for database search (see `--precursor_mass_tolerance_unit`). Can be overridden; falls back to SDRF value. | +| `--precursor_mass_tolerance_unit` | string | `ppm` | Precursor mass tolerance unit. One of: `ppm`, `Da`. 
| +| `--fragment_mass_tolerance` | number | `0.03` | Fragment mass tolerance for database search (see `--fragment_mass_tolerance_unit`). | +| `--fragment_mass_tolerance_unit` | string | `Da` | Fragment mass tolerance unit. One of: `ppm`, `Da`. | +| `--variable_mods` | string | `Oxidation (M)` | Comma-separated variable modifications in Unimod format (e.g. `Oxidation (M),Carbamidomethyl (C)`). Can be overridden; falls back to SDRF. | +| `--min_precursor_charge` | integer | `2` | Minimum precursor ion charge. | +| `--max_precursor_charge` | integer | `4` | Maximum precursor ion charge. | +| `--min_peptide_length` | integer | `6` | Minimum peptide length to consider. | +| `--max_peptide_length` | integer | `40` | Maximum peptide length to consider. | +| `--max_mods` | integer | `3` | Maximum number of variable modifications per peptide. | +| `--min_pr_mz` | number | `400` | Minimum precursor m/z for in-silico library generation or library-free search. | +| `--max_pr_mz` | number | `2400` | Maximum precursor m/z for in-silico library generation or library-free search. | +| `--min_fr_mz` | number | `100` | Minimum fragment m/z for in-silico library generation or library-free search. | +| `--max_fr_mz` | number | `1800` | Maximum fragment m/z for in-silico library generation or library-free search. | + +## 5. DIA-NN General + +| Parameter | Type | Default | Description | +| -------------------- | ------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `--diann_version` | string | `1.8.1` | DIA-NN version used by the workflow. Controls version-dependent flags (e.g. `--monitor-mod` for 1.8.x). See [DIA-NN Version Selection](usage.md#dia-nn-version-selection). | +| `--diann_debug` | integer | `3` | DIA-NN debug/verbosity level (0-4). Higher values produce more verbose logs. 
| +| `--diann_speclib` | string | `null` | Path to an external spectral library. If provided, the in-silico library generation step is skipped. | +| `--diann_extra_args` | string | `null` | Extra arguments appended to all DIA-NN steps. Flags incompatible with a step are automatically stripped with a warning. See [Passing Extra Arguments to DIA-NN](usage.md#passing-extra-arguments-to-dia-nn). | + +## 6. Mass Accuracy & Calibration + +| Parameter | Type | Default | Description | +| ------------------------- | ------- | ------- | ----------------------------------------------------------------------------------------------------------------------------- | +| `--mass_acc_automatic` | boolean | `true` | Automatically determine MS2 mass accuracy. When `true`, `--mass_acc_ms2` is ignored during preliminary analysis. | +| `--mass_acc_ms1` | number | `15` | MS1 mass accuracy in ppm. Overrides automatic calibration when `--mass_acc_automatic false`. Maps to DIA-NN `--mass-acc-ms1`. | +| `--mass_acc_ms2` | number | `15` | MS2 mass accuracy in ppm. Overrides automatic calibration when `--mass_acc_automatic false`. Maps to DIA-NN `--mass-acc`. | +| `--scan_window` | integer | `8` | Scan window radius. Should approximate the average number of data points per peak. | +| `--scan_window_automatic` | boolean | `true` | Automatically determine the scan window. When `true`, `--scan_window` is ignored. | +| `--quick_mass_acc` | boolean | `true` | Use a fast heuristic algorithm for mass accuracy calibration instead of ID-number optimisation. | +| `--performance_mode` | boolean | `true` | Enable low-RAM, high-speed mode. Adds `--min-corr 2 --corr-diff 1 --time-corr-only` to DIA-NN. | + +## 7. 
Bruker/timsTOF + +| Parameter | Type | Default | Description | +| ------------------- | ------- | ------- | ---------------------------------------------------------------------------------------------------- | +| `--diann_tims_sum` | boolean | `false` | Enable `--quant-tims-sum` for slice/scanning timsTOF methods (highly recommended for Synchro-PASEF). | +| `--diann_im_window` | number | `null` | Set `--im-window` to ensure the IM extraction window is not smaller than the specified value. | + +## 8. PTM Localization + +| Parameter | Type | Default | Description | +| --------------------------- | ------- | ------------------------------------- | -------------------------------------------------------------------------------------------------------------- | +| `--enable_mod_localization` | boolean | `false` | Enable modification localization scoring in DIA-NN via `--monitor-mod` (DIA-NN 1.8.x only; automatic in 2.0+). | +| `--mod_localization` | string | `Phospho (S),Phospho (T),Phospho (Y)` | Comma-separated modification names or UniMod accessions for PTM localization (e.g. `UniMod:21`). | + +## 9. Library Generation + +| Parameter | Type | Default | Description | +| -------------------- | ------- | ------- | -------------------------------------------------------------------------------------------- | +| `--save_speclib_tsv` | boolean | `false` | Publish the TSV spectral library from in-silico library generation to `library_generation/`. | + +## 10. Preliminary Analysis + +| Parameter | Type | Default | Description | +| ----------------------------- | ------- | ------- | ------------------------------------------------------------------------------------------------------------------- | +| `--skip_preliminary_analysis` | boolean | `false` | Skip preliminary analysis. Use the provided spectral library as-is instead of generating a local consensus library. 
| +| `--random_preanalysis` | boolean | `false` | Enable random selection of spectrum files for empirical library generation. | +| `--random_preanalysis_seed` | integer | `42` | Random seed for file selection when `--random_preanalysis` is enabled. | +| `--empirical_assembly_ms_n` | integer | `200` | Number of randomly selected spectrum files when `--random_preanalysis` is enabled. | + +## 11. Quantification & Output + +| Parameter | Type | Default | Description | +| ------------------------- | ------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `--pg_level` | integer | `2` | Protein inference mode. `0` = isoforms, `1` = protein names (from FASTA), `2` = genes (default). | +| `--species_genes` | boolean | `false` | Add the organism identifier to gene names in DIA-NN output. | +| `--diann_normalize` | boolean | `true` | Enable cross-run normalisation in DIA-NN. Set to `false` to add `--no-norm`. | +| `--diann_report_decoys` | boolean | `false` | Include decoy PSMs in the main `.parquet` report (DIA-NN 2.0+ only). | +| `--diann_export_xic` | boolean | `false` | Extract MS1/fragment chromatograms for identified precursors (equivalent to the XICs option in the DIA-NN GUI). | +| `--diann_no_peptidoforms` | boolean | `false` | Disable automatic peptidoform scoring when variable modifications are declared (not recommended by DIA-NN). | +| `--diann_use_quant` | boolean | `true` | Reuse existing `.quant` files if available (`--use-quant`). | +| `--quantums` | boolean | `false` | Enable QuantUMS quantification (requires DIA-NN >= 1.9.2). When `false`, the pipeline passes `--direct-quant` to use legacy quantification (only for DIA-NN >= 1.9.2; silently skipped for 1.8.x where direct quant is the only mode). 
| +| `--quantums_train_runs` | string | `null` | Run index range for QuantUMS training (e.g. `0:5`). Maps to `--quant-train-runs`. Requires DIA-NN >= 1.9.2. | +| `--quantums_sel_runs` | integer | `null` | Number of automatically selected runs for QuantUMS training. Must be >= 6. Maps to `--quant-sel-runs`. Requires DIA-NN >= 1.9.2. | +| `--quantums_params` | string | `null` | Pre-calculated QuantUMS parameters. Maps to `--quant-params`. Requires DIA-NN >= 1.9.2. | + +## 12. Quality Control + +| Parameter | Type | Default | Description | +| ---------------------------- | ------- | ------- | ------------------------------------------------------------------------------------ | +| `--enable_pmultiqc` | boolean | `true` | Generate the pmultiqc QC report. | +| `--pmultiqc_idxml_skip` | boolean | `true` | Skip idXML files (do not generate search engine score plots) in the pmultiqc report. | +| `--contaminant_string` | string | `CONT` | Contaminant affix string for pmultiqc. Maps to `--contaminant_affix` in pmultiqc. | +| `--protein_level_fdr_cutoff` | number | `0.01` | Experiment-wide protein (group)-level FDR cutoff. | + +## 13. MultiQC & Reporting + +| Parameter | Type | Default | Description | +| ------------------------------- | ------------------ | ------- | --------------------------------------------------------------------------------- | +| `--multiqc_config` | string (file-path) | `null` | Custom config file to supply to MultiQC. | +| `--multiqc_title` | string | `null` | MultiQC report title. Used as page header and filename. | +| `--multiqc_logo` | string | `null` | Custom logo file for MultiQC. Must also be referenced in the MultiQC config file. | +| `--skip_table_plots` | boolean | `false` | Skip protein/peptide table plots in pmultiqc for large datasets. | +| `--max_multiqc_email_size` | string | `25.MB` | File size limit when attaching MultiQC reports to summary emails. 
| +| `--multiqc_methods_description` | string | `null` | Custom MultiQC YAML file containing an HTML methods description. | diff --git a/docs/usage.md b/docs/usage.md new file mode 100644 index 0000000..b04cd7d --- /dev/null +++ b/docs/usage.md @@ -0,0 +1,465 @@ +# bigbio/quantmsdiann: Usage + +## Introduction + +quantmsdiann is a Nextflow pipeline for DIA-NN-based quantitative mass spectrometry analysis. + +## Running the pipeline + +The typical command for running the pipeline is as follows: + +```bash +nextflow run bigbio/quantmsdiann \ + --input 'experiment.sdrf.tsv' \ + --database 'proteins.fasta' \ + --outdir './results' \ + -profile docker +``` + +The input file must be in [Sample-to-data-relationship format (SDRF)](https://pubs.acs.org/doi/abs/10.1021/acs.jproteome.0c00376) and can have `.sdrf`, `.tsv`, or `.csv` file extensions. + +### Supported file formats + +The pipeline supports the following mass spectrometry data file formats: + +- **`.raw`** - Thermo RAW files (automatically converted to mzML) +- **`.mzML`** - Open standard mzML files +- **`.d`** - Bruker timsTOF files (processed natively by DIA-NN) +- **`.dia`** - DIA-NN native binary format (passed through without conversion) + +Compressed variants are supported for `.raw`, `.mzML`, and `.d` formats: `.gz`, `.tar`, `.tar.gz`, `.zip`. + +### Preprocessing Options + +The pipeline includes several preprocessing steps that can be controlled via parameters: + +- **`--reindex_mzml`** (default: `true`) -- Force re-indexing of input mzML files at the start of the pipeline. This fixes common issues with slightly incomplete or outdated mzML files and is enabled by default for safety. Set to `false` only if you are certain your mzML files are well-formed. + +- **`--mzml_statistics`** (default: `false`) -- Compute MS1/MS2 statistics from mzML files. When enabled, `*_ms_info.parquet` files are generated for each mzML file and used in QC reporting. Bruker `.d` files are always skipped by this step. 
+
+- **`--mzml_features`** (default: `false`) -- Compute MS1-level features during the mzML statistics step. Only available for mzML files.
+
+### Bruker/timsTOF Data
+
+For Bruker timsTOF datasets, DIA-NN recommends manually fixing MS1 and MS2 mass accuracy (typically 10-15 ppm) rather than using automatic calibration. There are two ways to set this:
+
+**Option 1 — SDRF columns (per-file control, recommended):**
+
+Set `PrecursorMassTolerance`, `PrecursorMassToleranceUnit`, `FragmentMassTolerance`, and `FragmentMassToleranceUnit` columns in your SDRF file. The pipeline reads these per-file and passes them to DIA-NN when `--mass_acc_automatic false` is set. This allows different tolerances for different files in the same experiment.
+
+**Option 2 — Pipeline parameters (global override):**
+
+```bash
+nextflow run bigbio/quantmsdiann \
+    --input sdrf.tsv \
+    --database proteins.fasta \
+    --mass_acc_automatic false \
+    --mass_acc_ms1 <ppm> \
+    --mass_acc_ms2 <ppm> \
+    -profile docker
+```
+
+For Synchro-PASEF data, enable `--diann_tims_sum` (which adds `--quant-tims-sum` to DIA-NN).
+
+> [!NOTE]
+> The pipeline will emit a warning during PRELIMINARY_ANALYSIS if it detects `.d` files with automatic mass accuracy calibration enabled, recommending that tolerances be set via SDRF or pipeline parameters.
+
+### Pipeline settings via params file
+
+Pipeline settings can be provided in a `yaml` or `json` file via `-params-file <file>`:
+
+```bash
+nextflow run bigbio/quantmsdiann -profile docker -params-file params.yaml
+```
+
+```yaml
+input: "./experiment.sdrf.tsv"
+database: "./proteins.fasta"
+outdir: "./results"
+```
+
+> [!WARNING]
+> Do not use `-c <file>` to specify parameters. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources) or module arguments.
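The params-file mechanism also works for the Bruker/timsTOF mass-accuracy overrides described above. A sketch of such a file (the tolerance values are illustrative placeholders, not recommendations — choose values appropriate for your instrument):

```yaml
input: "./experiment.sdrf.tsv"
database: "./proteins.fasta"
outdir: "./results"
mass_acc_automatic: false
mass_acc_ms1: 15 # ppm, illustrative
mass_acc_ms2: 15 # ppm, illustrative
diann_tims_sum: true # Synchro-PASEF data only
```

All keys above are existing pipeline parameters; keeping them in a versioned params file makes the run easier to reproduce than long command lines.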
+
+### Reproducibility
+
+Specify the pipeline version when running on your data:
+
+```bash
+nextflow run bigbio/quantmsdiann -r 1.0.0 -profile docker --input sdrf.tsv --database db.fasta --outdir results
+```
+
+## Core Nextflow arguments
+
+### `-profile`
+
+Use this parameter to choose a configuration profile:
+
+- `docker` - Run with Docker containers
+- `singularity` - Run with Singularity containers
+- `podman` - Run with Podman containers
+- `apptainer` - Run with Apptainer containers
+
+Multiple profiles can be loaded: `-profile test_dia,docker`
+
+### `-resume`
+
+Resume from cached results:
+
+```bash
+nextflow run bigbio/quantmsdiann -profile test_dia,docker --outdir results -resume
+```
+
+## Test profiles
+
+```bash
+# Quick DIA test
+nextflow run . -profile test_dia,docker --outdir results
+
+# DIA with Bruker .d files
+nextflow run . -profile test_dia_dotd,docker --outdir results
+
+# Latest DIA-NN version (2.2.0)
+nextflow run . -profile test_latest_dia,docker --outdir results
+```
+
+## DIA-NN parameters
+
+The pipeline passes parameters to DIA-NN at different steps. Some parameters come from the SDRF metadata (per-file), some from `nextflow.config` defaults, and some from the command line. The table below documents each parameter, its source, and which pipeline steps use it.
+
+### Parameter sources
+
+Parameters are resolved in this priority order:
+
+1. **SDRF metadata** (per-file, from `convert-diann` design file) — highest priority
+2. **Pipeline parameters** (`--param_name` on command line or params file)
+3.
**Nextflow defaults** (`nextflow.config`) — lowest priority + +### Pipeline steps + +| Step | Description | +| ------------------------------- | ------------------------------------------------------------------- | +| **INSILICO_LIBRARY_GENERATION** | Predicts a spectral library from FASTA using DIA-NN's deep learning | +| **PRELIMINARY_ANALYSIS** | Per-file calibration and mass accuracy estimation (first pass) | +| **ASSEMBLE_EMPIRICAL_LIBRARY** | Builds consensus empirical library from preliminary results | +| **INDIVIDUAL_ANALYSIS** | Per-file quantification with the empirical library (second pass) | +| **FINAL_QUANTIFICATION** | Aggregates all files into protein/peptide matrices | + +### Per-file parameters from SDRF + +These parameters are extracted per-file from the SDRF via `convert-diann` and stored in `diann_design.tsv`: + +| DIA-NN flag | SDRF column | Design column | Steps | Notes | +| ---------------- | -------------------------------------------------- | ------------------------ | ----------------------- | ----------------------------------------------- | +| `--mass-acc-ms1` | `comment[precursor mass tolerance]` | `PrecursorMassTolerance` | PRELIMINARY, INDIVIDUAL | Falls back to auto-detect if missing or not ppm | +| `--mass-acc` | `comment[fragment mass tolerance]` | `FragmentMassTolerance` | PRELIMINARY, INDIVIDUAL | Falls back to auto-detect if missing or not ppm | +| `--min-pr-mz` | `comment[ms1 scan range]` or `comment[ms min mz]` | `MS1MinMz` | PRELIMINARY, INDIVIDUAL | Per-file for GPF; global broadest for INSILICO | +| `--max-pr-mz` | `comment[ms1 scan range]` or `comment[ms max mz]` | `MS1MaxMz` | PRELIMINARY, INDIVIDUAL | Per-file for GPF; global broadest for INSILICO | +| `--min-fr-mz` | `comment[ms2 scan range]` or `comment[ms2 min mz]` | `MS2MinMz` | PRELIMINARY, INDIVIDUAL | Per-file for GPF; global broadest for INSILICO | +| `--max-fr-mz` | `comment[ms2 scan range]` or `comment[ms2 max mz]` | `MS2MaxMz` | PRELIMINARY, INDIVIDUAL | 
Per-file for GPF; global broadest for INSILICO | + +### Global parameters from config + +These parameters apply globally across all files. They are set in `diann_config.cfg` (from SDRF) or as pipeline parameters: + +| DIA-NN flag | Pipeline parameter | Default | Steps | Notes | +| --------------------------------------------- | -------------------------------------------------- | ----------------------------------------------- | ---------------------------------------- | --------------------------------------------------------------- | +| `--cut` | (from SDRF enzyme) | — | ALL | Enzyme cut rule, derived from `comment[cleavage agent details]` | +| `--fixed-mod` | (from SDRF) | — | ALL | Fixed modifications from `comment[modification parameters]` | +| `--var-mod` | (from SDRF) | — | ALL | Variable modifications from `comment[modification parameters]` | +| `--monitor-mod` | `--enable_mod_localization` + `--mod_localization` | `false` / `Phospho (S),Phospho (T),Phospho (Y)` | PRELIMINARY, ASSEMBLE, INDIVIDUAL, FINAL | PTM site localization scoring (DIA-NN 1.8.x only) | +| `--window` | `--scan_window` | `8` | PRELIMINARY, ASSEMBLE, INDIVIDUAL | Scan window; auto-detected when `--scan_window_automatic=true` | +| `--quick-mass-acc` | `--quick_mass_acc` | `true` | PRELIMINARY | Fast mass accuracy calibration | +| `--min-corr 2 --corr-diff 1 --time-corr-only` | `--performance_mode` | `true` | PRELIMINARY | High-speed, low-RAM mode | +| `--pg-level` | `--pg_level` | `2` | INDIVIDUAL, FINAL | Protein grouping level | +| `--species-genes` | `--species_genes` | `false` | FINAL | Use species-specific gene names | +| `--no-norm` | `--diann_normalize` | `true` | FINAL | Disable normalization when `false` | + +### PTM site localization (`--monitor-mod`) + +DIA-NN supports PTM site localization scoring via `--monitor-mod`. When enabled, DIA-NN reports `PTM.Site.Confidence` and `PTM.Q.Value` columns for the specified modifications. 
+ +**Important**: `--monitor-mod` is applied to all DIA-NN steps **except INSILICO_LIBRARY_GENERATION** (where it has no effect). It is particularly important for: + +- **PRELIMINARY_ANALYSIS**: Affects PTM-aware scoring during calibration. +- **ASSEMBLE_EMPIRICAL_LIBRARY**: Strongly affects empirical library generation for PTM peptides. +- **INDIVIDUAL_ANALYSIS** and **FINAL_QUANTIFICATION**: Enables PTM site confidence scoring. + +Note: For DIA-NN 2.0+, `--monitor-mod` is no longer needed — PTM localization is handled automatically by `--var-mod`. The flag is only used for DIA-NN 1.8.x. + +To enable PTM site localization: + +```bash +nextflow run bigbio/quantmsdiann \ + --enable_mod_localization \ + --mod_localization 'Phospho (S),Phospho (T),Phospho (Y)' \ + ... +``` + +The parameter accepts two formats: + +- **Modification names** (quantms-compatible): `Phospho (S),Phospho (T),Phospho (Y)` — site info in parentheses is stripped, the base name is mapped to UniMod +- **UniMod accessions** (direct): `UniMod:21,UniMod:1` + +Supported modification name mappings: + +| Name | UniMod ID | Example | +| ----------- | ------------ | ------------------------------------- | +| Phospho | `UniMod:21` | `Phospho (S),Phospho (T),Phospho (Y)` | +| GlyGly | `UniMod:121` | `GlyGly (K)` | +| Acetyl | `UniMod:1` | `Acetyl (Protein N-term)` | +| Oxidation | `UniMod:35` | `Oxidation (M)` | +| Deamidated | `UniMod:7` | `Deamidated (N),Deamidated (Q)` | +| Methylation | `UniMod:34` | `Methylation (K),Methylation (R)` | + +## Passing Extra Arguments to DIA-NN + +The `--diann_extra_args` parameter appends additional DIA-NN command-line flags to **all** DIA-NN steps (INSILICO_LIBRARY_GENERATION, PRELIMINARY_ANALYSIS, ASSEMBLE_EMPIRICAL_LIBRARY, INDIVIDUAL_ANALYSIS, FINAL_QUANTIFICATION). + +```bash +nextflow run bigbio/quantmsdiann \ + --diann_extra_args '--smart-profiling --peak-center' \ + ... +``` + +Flags that conflict with a specific step are **automatically stripped** with a warning. 
Each module maintains its own block list of managed flags. The table below summarises the key blocked flags per step: + +| Step | Key blocked flags (managed by pipeline) | +| --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| INSILICO_LIBRARY_GENERATION | `--fasta`, `--fasta-search`, `--gen-spec-lib`, `--predictor`, `--lib`, `--missed-cleavages`, `--min-pep-len`, `--max-pep-len`, `--min-pr-charge`, `--max-pr-charge`, `--var-mods`, `--min-pr-mz`, `--max-pr-mz`, `--min-fr-mz`, `--max-fr-mz`, `--met-excision`, `--monitor-mod` | +| PRELIMINARY_ANALYSIS | `--mass-acc`, `--mass-acc-ms1`, `--window`, `--quick-mass-acc`, `--min-corr`, `--corr-diff`, `--time-corr-only`, `--min-pr-mz`, `--max-pr-mz`, `--min-fr-mz`, `--max-fr-mz`, `--monitor-mod`, `--var-mod`, `--fixed-mod` | +| ASSEMBLE_EMPIRICAL_LIBRARY | `--mass-acc`, `--mass-acc-ms1`, `--window`, `--individual-mass-acc`, `--individual-windows`, `--out-lib`, `--gen-spec-lib`, `--rt-profiling`, `--monitor-mod`, `--var-mod`, `--fixed-mod` | +| INDIVIDUAL_ANALYSIS | `--mass-acc`, `--mass-acc-ms1`, `--window`, `--pg-level`, `--relaxed-prot-inf`, `--no-ifs-removal`, `--min-pr-mz`, `--max-pr-mz`, `--min-fr-mz`, `--max-fr-mz`, `--monitor-mod`, `--var-mod`, `--fixed-mod` | +| FINAL_QUANTIFICATION | `--pg-level`, `--species-genes`, `--no-norm`, `--report-decoys`, `--xic`, `--qvalue`, `--window`, `--individual-windows`, `--monitor-mod`, `--var-mod`, `--fixed-mod` | + +All steps also block shared infrastructure flags: `--out`, `--temp`, `--threads`, `--verbose`, `--lib`, `--f`, `--fasta`, `--use-quant`, `--matrices`, `--no-main-report`. 
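Each module implements the stripping as a regex pass over the extra args, matching longer flags first so that, for example, `--mass-acc-ms1` is not clipped by `--mass-acc`, and consuming a flag's values along with the flag itself. A Python sketch of that matching behaviour (illustrative only; the pipeline's actual implementation is the Groovy code in each DIA-NN module, and `strip_managed_flags` is a hypothetical name):

```python
import re

def strip_managed_flags(args, blocked):
    """Sketch of the per-module flag stripping: each blocked flag is removed
    together with any values that follow it (tokens not starting with '-')."""
    # Longer flags first, so '--mass-acc-ms1' is not clipped by '--mass-acc'
    for flag in sorted(blocked, key=len, reverse=True):
        pattern = r'(?<!\S)' + re.escape(flag) + r'(?=\s|$)(\s+(?!-{1,2}[a-zA-Z])\S+)*'
        if re.search(pattern, args):
            print(f"warning: '{flag}' is managed by the pipeline and will be stripped")
            args = re.sub(pattern, '', args)
    return re.sub(r'\s+', ' ', args).strip()

# '--mass-acc 15' (flag plus its value) is removed; unmanaged flags pass through
print(strip_managed_flags('--smart-profiling --mass-acc 15 --peak-center',
                          ['--mass-acc', '--mass-acc-ms1']))
# → --smart-profiling --peak-center
```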
+ +For step-specific overrides that bypass this mechanism, use custom Nextflow config files with `ext.args`: + +```groovy +// custom.config -- add a flag only to FINAL_QUANTIFICATION +process { + withName: '.*:FINAL_QUANTIFICATION' { + ext.args = '--my-special-flag' + } +} +``` + +## DIA-NN Version Selection + +The pipeline supports multiple DIA-NN versions via built-in Nextflow profiles. Each profile sets `params.diann_version` and overrides the container image for all `diann`-labelled processes. + +| Profile | DIA-NN Version | Container | Key features | +| -------------- | -------------- | ------------------------------------------ | -------------------------------------------------------------- | +| `diann_v1_8_1` | 1.8.1 | `docker.io/biocontainers/diann:v1.8.1_cv1` | Default. Public BioContainers image. TSV output. | +| `diann_v2_1_0` | 2.1.0 | `ghcr.io/bigbio/diann:2.1.0` | Parquet output. Native .raw on Linux. QuantUMS (`--quantums`). | +| `diann_v2_2_0` | 2.2.0 | `ghcr.io/bigbio/diann:2.2.0` | Speed optimizations (up to 1.6x on HPC). Parquet output. | +| `diann_v2_3_2` | 2.3.2 | `ghcr.io/bigbio/diann:2.3.2` | DDA support (`--diann_dda`), InfinDIA, up to 9 variable mods. | + +**Version-dependent features:** Some parameters are only available with newer DIA-NN versions. The pipeline handles version compatibility automatically: + +- **QuantUMS** (`--quantums`): Requires >= 1.9.2. The `--direct-quant` flag is automatically skipped for DIA-NN 1.8.x where direct quantification is the only mode. +- **DDA mode** (`--diann_dda`): Requires >= 2.3.2. The pipeline will error if enabled with an older version. +- **InfinDIA** (`--enable_infin_dia`): Requires >= 2.3.0. 
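Minimum-version gates like these are easiest to reason about as numeric tuple comparisons, since a plain string comparison would mis-order e.g. `1.10` against `1.9`. A minimal sketch (function names are illustrative, not pipeline API; assumes purely numeric version strings):

```python
def version_tuple(v):
    """'2.3.2' -> (2, 3, 2); tuples compare numerically per component."""
    return tuple(int(part) for part in v.split('.'))

def meets_minimum(diann_version, minimum):
    """True if the selected DIA-NN version satisfies a feature's minimum version."""
    return version_tuple(diann_version) >= version_tuple(minimum)

# Gates from the list above
print(meets_minimum('2.1.0', '1.9.2'))  # QuantUMS available → True
print(meets_minimum('1.8.1', '2.3.2'))  # --diann_dda unavailable → False
```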
+
+Usage:
+
+```bash
+# Run with DIA-NN 2.2.0
+nextflow run bigbio/quantmsdiann \
+    -profile diann_v2_2_0,docker \
+    --input sdrf.tsv --database db.fasta --outdir results
+
+# Run with DIA-NN 2.3.2 (latest, enables DDA and InfinDIA)
+nextflow run bigbio/quantmsdiann \
+    -profile diann_v2_3_2,docker \
+    --input sdrf.tsv --database db.fasta --outdir results
+```
+
+> [!NOTE]
+> DIA-NN 2.x images are hosted in a private registry (`ghcr.io/bigbio`) and may require registry authentication to pull. The `diann_v2_1_0` and `diann_v2_2_0` profiles force Docker mode by default; for Singularity, override the container settings with your own config.
+
+## Verbose Module Output
+
+By default, only final result files are published. For debugging or detailed inspection, the `verbose_modules` profile publishes all intermediate files from every DIA-NN step:
+
+```bash
+nextflow run bigbio/quantmsdiann -profile verbose_modules,docker ...
+```
+
+This publishes intermediate outputs to descriptive subdirectories (e.g. `spectra/thermorawfileparser/`, `diann_preprocessing/preliminary_analysis/`, `library_generation/`). See [Output: Verbose Output Structure](output.md#verbose-output-structure) for the full directory layout.
+
+## Container Version Override Guide
+
+You can override the container image for any process without modifying pipeline code. This is useful for testing custom or newer DIA-NN builds.
+
+**Docker:**
+
+```groovy
+// custom_container.config
+process {
+    withLabel: diann {
+        container = 'my-registry.io/diann:custom-build'
+    }
+}
+```
+
+```bash
+nextflow run bigbio/quantmsdiann -c custom_container.config -profile docker ...
+```
+
+**Singularity with caching:**
+
+```groovy
+// custom_singularity.config
+singularity.cacheDir = '/path/to/singularity/cache'
+
+process {
+    withLabel: diann {
+        container = '/path/to/diann_custom.sif'
+    }
+}
+```
+
+```bash
+nextflow run bigbio/quantmsdiann -c custom_singularity.config -profile singularity
+``` + +## SLURM Example + +For running on HPC clusters with SLURM, the pipeline includes a reference configuration at `conf/pride_codon_slurm.config`. Use it via the `pride_slurm` profile: + +```bash +nextflow run bigbio/quantmsdiann \ + -profile pride_slurm \ + --input sdrf.tsv --database db.fasta --outdir results +``` + +This profile enables Singularity, sets SLURM as the executor, and provides resource scaling for large experiments. Adapt it as a template for your own cluster by creating a custom config file. + +## Optional outputs + +By default, only final result files are published. Intermediate files can be exported using `save_*` parameters or via `ext.*` properties in a custom Nextflow config. + +| Parameter | Default | Description | +| -------------------- | ------- | ------------------------------------------------------------------------------------------- | +| `--save_speclib_tsv` | `false` | Publish the TSV spectral library from in-silico library generation to `library_generation/` | + +**Using a parameter:** + +```bash +nextflow run bigbio/quantmsdiann \ + --input 'experiment.sdrf.tsv' \ + --database 'proteins.fasta' \ + --save_speclib_tsv \ + --outdir './results' \ + -profile docker +``` + +**Using a custom Nextflow config (ext properties):** + +```groovy +// custom.config +process { + withName: '.*:INSILICO_LIBRARY_GENERATION' { + ext.publish_speclib_tsv = true + } +} +``` + +```bash +nextflow run bigbio/quantmsdiann -c custom.config ... +``` + +For full verbose output of all intermediate files (useful for debugging), use the `verbose_modules` profile: + +```bash +nextflow run bigbio/quantmsdiann -profile verbose_modules,docker ... +``` + +## Custom configuration + +### Resource requests + +Each step in the pipeline has default resource requirements. If a job exits with error code `137` or `143` (exceeded resources), it will automatically resubmit with higher requests (2x, then 3x original). 
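The retry behaviour follows the standard Nextflow pattern of scaling resources with `task.attempt`. A minimal sketch of an equivalent override (memory value and exit-code handling are illustrative, not the pipeline's exact defaults):

```groovy
// custom_resources.config -- retry on resource-related exit codes,
// scaling memory with each attempt (1x, 2x, 3x the base request)
process {
    withName: '.*:FINAL_QUANTIFICATION' {
        errorStrategy = { task.exitStatus in [137, 143] ? 'retry' : 'finish' }
        maxRetries    = 2
        memory        = { 50.GB * task.attempt }
    }
}
```

Pass it with `-c custom_resources.config`; with `maxRetries = 2` the job is attempted at 50 GB, then 100 GB, then 150 GB before failing.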
+ +To customize resources for a specific process: + +```nextflow +process { + withName: 'BIGBIO_QUANTMSDIANN:QUANTMSDIANN:DIA:FINAL_QUANTIFICATION' { + memory = 100.GB + } +} +``` + +Save this to a file and pass via `-c custom.config`. + +## Running in the background + +Use `screen`, `tmux`, or the Nextflow `-bg` flag to run the pipeline in the background: + +```bash +nextflow run bigbio/quantmsdiann -profile docker --input sdrf.tsv --database db.fasta --outdir results -bg +``` + +## Developer testing with local containers + +When developing changes to `sdrf-pipelines` or `quantms-utils`, you can build local Docker containers and test them with the pipeline without publishing to a registry. + +### 1. Build local dev containers + +```bash +# From sdrf-pipelines repo +cd /path/to/sdrf-pipelines +docker build -f Dockerfile.dev -t local/sdrf-pipelines:dev . + +# From quantms-utils repo +cd /path/to/quantms-utils +docker build -f Dockerfile.dev -t local/quantms-utils:dev . +``` + +### 2. Run the pipeline with local containers + +Use the `test_dia_local.config` to override container references: + +```bash +nextflow run main.nf \ + -profile test_dia,docker \ + -c conf/tests/test_dia_local.config \ + --outdir results +``` + +This config (`conf/tests/test_dia_local.config`) overrides: + +- `SDRF_PARSING` → `local/sdrf-pipelines:dev` +- `SAMPLESHEET_CHECK` → `local/quantms-utils:dev` +- `DIANN_MSSTATS` → `local/quantms-utils:dev` + +### 3. 
Using pre-converted mzML files + +To skip ThermoRawFileParser (useful on macOS/ARM where Mono crashes): + +```bash +# Convert raw files with ThermoRawFileParser v2.0+ +docker run --rm --platform=linux/amd64 \ + -v /path/to/raw:/data -v /path/to/mzml:/out \ + quay.io/biocontainers/thermorawfileparser:2.0.0.dev--h9ee0642_0 \ + ThermoRawFileParser -d /data -o /out -f 2 + +# Run pipeline with pre-converted files +nextflow run main.nf \ + -profile test_dia,docker \ + -c conf/tests/test_dia_local.config \ + --root_folder /path/to/mzml \ + --local_input_type mzML \ + --outdir results +``` + +## Nextflow memory requirements + +Add the following to your environment to limit Java memory: + +```bash +NXF_OPTS='-Xms1g -Xmx4g' +``` diff --git a/main.nf b/main.nf new file mode 100644 index 0000000..048454d --- /dev/null +++ b/main.nf @@ -0,0 +1,81 @@ +#!/usr/bin/env nextflow +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + bigbio/quantmsdiann +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Github : https://github.com/bigbio/quantmsdiann + Website: https://nf-co.re/quantmsdiann + Slack : https://nfcore.slack.com/channels/quantms +---------------------------------------------------------------------------------------- +*/ + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + IMPORT FUNCTIONS / MODULES / SUBWORKFLOWS / WORKFLOWS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +include { QUANTMSDIANN } from './workflows/quantmsdiann' +include { PIPELINE_COMPLETION } from './subworkflows/local/utils_nfcore_quantms_pipeline' +include { UTILS_NEXTFLOW_PIPELINE } from './subworkflows/nf-core/utils_nextflow_pipeline' + + +// +// WORKFLOW: Run main bigbio/quantmsdiann analysis pipeline +// +workflow BIGBIO_QUANTMSDIANN { + + main: + + QUANTMSDIANN () + + emit: + multiqc_report = 
QUANTMSDIANN.out.multiqc_report // channel: /path/to/multiqc_report.html +} + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + RUN ALL WORKFLOWS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// +// WORKFLOW: Execute a single named workflow for the pipeline +// See: https://github.com/nf-core/rnaseq/issues/619 +// +workflow { + + main: + + // Dump parameters to JSON file for documenting the pipeline settings + + UTILS_NEXTFLOW_PIPELINE ( + false, + true, + params.outdir, + false + ) + + // could take UTILS_NEXTFLOW_PIPELINE.out.samplesheet channel as parsed input + BIGBIO_QUANTMSDIANN () + + // + // SUBWORKFLOW: Run completion tasks + // + PIPELINE_COMPLETION ( + params.email, + params.email_on_fail, + params.plaintext_email, + params.outdir, + params.monochrome_logs, + params.hook_url, + BIGBIO_QUANTMSDIANN.out.multiqc_report + ) +} + + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + THE END +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ diff --git a/modules.json b/modules.json new file mode 100644 index 0000000..c0e263c --- /dev/null +++ b/modules.json @@ -0,0 +1,49 @@ +{ + "name": "bigbio/quantmsdiann", + "homePage": "https://github.com/bigbio/quantmsdiann", + "repos": { + "https://github.com/bigbio/nf-modules.git": { + "modules": { + "bigbio": { + "thermorawfileparser": { + "branch": "main", + "git_sha": "daddf26d8c36e28dffe673949dc340ed6387de9a", + "installed_by": [ + "modules" + ] + } + } + } + }, + "https://github.com/nf-core/modules.git": { + "modules": { + "nf-core": {} + }, + "subworkflows": { + "nf-core": { + "utils_nextflow_pipeline": { + "branch": "master", + "git_sha": "05954dab2ff481bcb999f24455da29a5828af08d", + "installed_by": [ + "subworkflows" + ] + }, + "utils_nfcore_pipeline": { + "branch": "master", + "git_sha": "271e7fc14eb1320364416d996fb077421f3faed2", 
+ "installed_by": [ + "subworkflows" + ] + }, + "utils_nfschema_plugin": { + "branch": "master", + "git_sha": "e753770db613ce014b3c4bc94f6cba443427b726", + "installed_by": [ + "subworkflows" + ] + } + } + } + } + } +} diff --git a/modules/bigbio/thermorawfileparser/environment.yml b/modules/bigbio/thermorawfileparser/environment.yml new file mode 100644 index 0000000..99405da --- /dev/null +++ b/modules/bigbio/thermorawfileparser/environment.yml @@ -0,0 +1,6 @@ +channels: + - conda-forge + - bioconda + - defaults +dependencies: + - bioconda::thermorawfileparser=2.0.0.dev diff --git a/modules/bigbio/thermorawfileparser/main.nf b/modules/bigbio/thermorawfileparser/main.nf new file mode 100644 index 0000000..283cbee --- /dev/null +++ b/modules/bigbio/thermorawfileparser/main.nf @@ -0,0 +1,72 @@ +process THERMORAWFILEPARSER { + tag "${meta.id}" + label 'process_low' + label 'process_single' + label 'error_retry' + + conda "${moduleDir}/environment.yml" + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/thermorawfileparser:2.0.0.dev--h9ee0642_0' : + 'biocontainers/thermorawfileparser:2.0.0.dev--h9ee0642_0' }" + + input: + tuple val(meta), path(raw) + + output: + tuple val(meta), path("*.{mzML,mzML.gz,mgf,mgf.gz,parquet,parquet.gz}"), emit: spectra + tuple val("${task.process}"), val('thermorawfileparser'), eval("thermorawfileparser --version"), emit: versions_thermorawfileparser, topic: versions + path "*.log", emit: log + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + // Detect existing format options in any supported syntax: -f=2, -f 2, --format=2, + // or --format 2. + def hasFormatArg = (args =~ /(^|\s)(-f(=|\s)\d+|--format(=|\s)\d+)/).find() + // Default to indexed mzML format (-f=2) if not specified in args + def formatArg = hasFormatArg ? 
'' : '-f=2' + def prefix = task.ext.prefix ?: "${meta.id}" + def suffix = args.contains("--format 0") || args.contains("-f 0") + ? "mgf" + : args.contains("--format 1") || args.contains("-f 1") + ? "mzML" + : args.contains("--format 2") || args.contains("-f 2") + ? "mzML" + : args.contains("--format 3") || args.contains("-f 3") + ? "parquet" + : "mzML" + suffix = args.contains("--gzip") ? "${suffix}.gz" : "${suffix}" + + """ + thermorawfileparser \\ + -i='${raw}' \\ + ${formatArg} ${args} \\ + -o=./ 2>&1 | tee '${prefix}_conversion.log' + """ + + stub: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + def suffix = args.contains("--format 0") || args.contains("-f 0") + ? "mgf" + : args.contains("--format 1") || args.contains("-f 1") + ? "mzML" + : args.contains("--format 2") || args.contains("-f 2") + ? "mzML" + : args.contains("--format 3") || args.contains("-f 3") + ? "parquet" + : "mzML" + suffix = args.contains("--gzip") ? "${suffix}.gz" : "${suffix}" + + """ + touch '${prefix}.${suffix}' + touch '${prefix}_conversion.log' + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + ThermoRawFileParser: \$(thermorawfileparser --version) + END_VERSIONS + """ +} diff --git a/modules/bigbio/thermorawfileparser/meta.yml b/modules/bigbio/thermorawfileparser/meta.yml new file mode 100644 index 0000000..48b7d34 --- /dev/null +++ b/modules/bigbio/thermorawfileparser/meta.yml @@ -0,0 +1,81 @@ +name: thermorawfileparser +description: Convert RAW file to mzML or MGF files format +keywords: + - raw + - mzml + - mgf + - parquet + - parser + - proteomics +tools: + - thermorawfileparser: + description: | + ThermoRawFileParser converts Thermo RAW files to open standard formats like mzML, producing indexed output files. 
+ Use `task.ext.args` to pass additional arguments, e.g.: + - `-f=0` for MGF output, `-f=1` for mzML, `-f=2` for indexed mzML (default), `-f=3` for Parquet, `-f=4` for None + - `-L` or `--msLevel=VALUE` to select MS levels (e.g., `-L=1,2` or `--msLevel=1-3`) + homepage: https://github.com/compomics/ThermoRawFileParser + documentation: https://github.com/compomics/ThermoRawFileParser + tool_dev_url: https://github.com/compomics/ThermoRawFileParser + doi: "10.1021/acs.jproteome.9b00328" + licence: + - "Apache Software" + identifier: biotools:ThermoRawFileParser +input: + - - meta: + type: map + description: | + Groovy Map containing sample information + e.g. `[ id:'sample1', single_end:false ]` + - raw: + type: file + description: Thermo RAW file + pattern: "*.{raw,RAW}" + ontologies: [] +output: + spectra: + - - meta: + type: map + description: | + Groovy Map containing sample information + e.g. `[ id:'sample1', single_end:false ]` + - "*.{mzML,mzML.gz,mgf,mgf.gz,parquet,parquet.gz}": + type: file + description: Mass spectra in open format + pattern: "*.{mzML,mzML.gz,mgf,mgf.gz,parquet,parquet.gz}" + ontologies: [] + versions_thermorawfileparser: + - - ${task.process}: + type: string + description: The process the versions were collected from + - thermorawfileparser: + type: string + description: The name of the tool + - ThermoRawFileParser.sh --version: + type: eval + description: The expression to obtain the version of the tool + log: + - "*.log": + type: file + description: Log file from the conversion process + pattern: "*.log" + ontologies: [] +topics: + versions: + - - ${task.process}: + type: string + description: The process the versions were collected from + - thermorawfileparser: + type: string + description: The name of the tool + - ThermoRawFileParser.sh --version: + type: eval + description: The expression to obtain the version of the tool +authors: + - "@jonasscheid" + - "@daichengxin" + - "@ypriverol" +maintainers: + - "@jonasscheid" + - 
"@daichengxin" + - "@ypriverol" diff --git a/modules/bigbio/thermorawfileparser/tests/main.nf.test b/modules/bigbio/thermorawfileparser/tests/main.nf.test new file mode 100644 index 0000000..c1f8ce9 --- /dev/null +++ b/modules/bigbio/thermorawfileparser/tests/main.nf.test @@ -0,0 +1,54 @@ +nextflow_process { + + name "Test Process THERMORAWFILEPARSER" + script "../main.nf" + process "THERMORAWFILEPARSER" + tag "modules" + tag "modules_bigbio" + tag "thermorawfileparser" + + test("Should convert RAW to mzML") { + + when { + process { + """ + input[0] = [ + [ id: 'test', mzml_id: 'UPS1_50amol_R3' ], + file(params.test_data['proteomics']['msspectra']['ups1_50amol_r3'], checkIfExists: false) + ] + """ + } + } + + then { + assert process.success + assert snapshot(process.out.versions_thermorawfileparser).match("versions") + assert new File(process.out.spectra[0][1]).name == 'TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML' + assert process.out.log.size() == 1 + } + } + + test("Should run stub mode") { + + options "-stub" + + when { + process { + """ + input[0] = [ + [ id: 'test_sample', mzml_id: 'test_sample' ], + file(params.test_data['proteomics']['msspectra']['ups1_50amol_r3'], checkIfExists: false) + ] + """ + } + } + + then { + assert process.success + assert snapshot(process.out.versions_thermorawfileparser).match("versions_stub") + assert new File(process.out.spectra[0][1]).name == 'test_sample.mzML' + assert snapshot(process.out).match() + assert process.out.log.size() == 1 + } + } +} diff --git a/modules/bigbio/thermorawfileparser/tests/main.nf.test.snap b/modules/bigbio/thermorawfileparser/tests/main.nf.test.snap new file mode 100644 index 0000000..f5d956b --- /dev/null +++ b/modules/bigbio/thermorawfileparser/tests/main.nf.test.snap @@ -0,0 +1,83 @@ +{ + "versions_stub": { + "content": [ + [ + [ + "THERMORAWFILEPARSER", + "thermorawfileparser", + "2.0.0.0" + ] + ] + ], + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.04.6" + }, + "timestamp": 
"2026-03-23T22:31:25.789483" + }, + "versions": { + "content": [ + [ + [ + "THERMORAWFILEPARSER", + "thermorawfileparser", + "2.0.0.0" + ] + ] + ], + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.04.6" + }, + "timestamp": "2026-03-23T22:29:04.754961" + }, + "Should run stub mode": { + "content": [ + { + "0": [ + [ + { + "id": "test_sample", + "mzml_id": "test_sample" + }, + "test_sample.mzML:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "1": [ + [ + "THERMORAWFILEPARSER", + "thermorawfileparser", + "2.0.0.0" + ] + ], + "2": [ + "test_sample_conversion.log:md5,d41d8cd98f00b204e9800998ecf8427e" + ], + "log": [ + "test_sample_conversion.log:md5,d41d8cd98f00b204e9800998ecf8427e" + ], + "spectra": [ + [ + { + "id": "test_sample", + "mzml_id": "test_sample" + }, + "test_sample.mzML:md5,d41d8cd98f00b204e9800998ecf8427e" + ] + ], + "versions_thermorawfileparser": [ + [ + "THERMORAWFILEPARSER", + "thermorawfileparser", + "2.0.0.0" + ] + ] + } + ], + "meta": { + "nf-test": "0.9.3", + "nextflow": "25.04.6" + }, + "timestamp": "2026-03-23T22:31:25.88765" + } +} \ No newline at end of file diff --git a/modules/bigbio/thermorawfileparser/tests/nextflow.config b/modules/bigbio/thermorawfileparser/tests/nextflow.config new file mode 100644 index 0000000..0293c16 --- /dev/null +++ b/modules/bigbio/thermorawfileparser/tests/nextflow.config @@ -0,0 +1,3 @@ +process { + publishDir = { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" } +} diff --git a/modules/local/diann/assemble_empirical_library/main.nf b/modules/local/diann/assemble_empirical_library/main.nf new file mode 100644 index 0000000..2bc0525 --- /dev/null +++ b/modules/local/diann/assemble_empirical_library/main.nf @@ -0,0 +1,92 @@ +process ASSEMBLE_EMPIRICAL_LIBRARY { + tag "$meta.experiment_id" + label 'process_low' + label 'diann' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://containers.biocontainers.pro/s3/SingImgsRepo/diann/v1.8.1_cv1/diann_v1.8.1_cv1.img' : + 'docker.io/biocontainers/diann:v1.8.1_cv1' }" + + input: + // In this step the real files are passed, and not the names + path(ms_files) + val(meta) + path("quant/*") + path(lib) + path(diann_config) + + output: + path "empirical_library.*", emit: empirical_library + path "assemble_empirical_library.log", emit: log + path "versions.yml", emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + // Strip flags that are managed by the pipeline to prevent silent conflicts + def blocked = ['--no-main-report', '--no-ifs-removal', '--matrices', '--out', + '--temp', '--threads', '--verbose', '--lib', '--f', '--fasta', + '--mass-acc', '--mass-acc-ms1', '--window', + '--individual-mass-acc', '--individual-windows', + '--out-lib', '--use-quant', '--gen-spec-lib', '--rt-profiling', + '--monitor-mod', '--var-mod', '--fixed-mod', + '--channels', '--lib-fixed-mod', '--original-mods'] + // Sort by length descending so longer flags (e.g. --mass-acc-ms1) are matched before shorter prefixes (--mass-acc) + blocked.sort { a -> -a.length() }.each { flag -> + def flagPattern = '(?<=^|\\s)' + java.util.regex.Pattern.quote(flag) + '(?=\\s|\$)(\\s+(?!-{1,2}[a-zA-Z])\\S+)*' + if (args =~ flagPattern) { + log.warn "DIA-NN: '${flag}' is managed by the pipeline for ASSEMBLE_EMPIRICAL_LIBRARY and will be stripped." + args = args.replaceAll(flagPattern, '').trim() + } + } + + if (params.mass_acc_automatic) { + mass_acc = '--individual-mass-acc' + } else if (meta['precursormasstoleranceunit']?.toLowerCase()?.endsWith('ppm') && meta['fragmentmasstoleranceunit']?.toLowerCase()?.endsWith('ppm')){ + mass_acc = "--mass-acc ${meta['fragmentmasstolerance']} --mass-acc-ms1 ${meta['precursormasstolerance']}" + } else { + mass_acc = '--individual-mass-acc' + } + scan_window = params.scan_window_automatic ? 
'--individual-windows' : "--window $params.scan_window" + diann_no_peptidoforms = params.diann_no_peptidoforms ? "--no-peptidoforms" : "" + diann_tims_sum = params.diann_tims_sum ? "--quant-tims-sum" : "" + diann_im_window = params.diann_im_window ? "--im-window $params.diann_im_window" : "" + + """ + # Precursor Tolerance value was: ${meta['precursormasstolerance']} + # Fragment Tolerance value was: ${meta['fragmentmasstolerance']} + # Precursor Tolerance unit was: ${meta['precursormasstoleranceunit']} + # Fragment Tolerance unit was: ${meta['fragmentmasstoleranceunit']} + + ls -lcth + + # Extract --var-mod, --fixed-mod, and --monitor-mod flags from diann_config.cfg + mod_flags=\$(grep -oP '(--var-mod\\s+\\S+|--fixed-mod\\s+\\S+|--monitor-mod\\s+\\S+|--lib-fixed-mod\\s+\\S+|--original-mods|--channels\\s+.+)' ${diann_config} | tr '\\n' ' ') + + diann --f ${(ms_files as List).join(' --f ')} \\ + --lib ${lib} \\ + --threads ${task.cpus} \\ + --out-lib empirical_library \\ + --verbose $params.diann_debug \\ + --rt-profiling \\ + --temp ./quant/ \\ + --use-quant \\ + ${mass_acc} \\ + ${scan_window} \\ + --gen-spec-lib \\ + ${diann_no_peptidoforms} \\ + ${diann_tims_sum} \\ + ${diann_im_window} \\ + \${mod_flags} \\ + $args + + cp report.log.txt assemble_empirical_library.log + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + DIA-NN: \$(diann 2>&1 | grep "DIA-NN" | grep -oP "\\d+\\.\\d+(\\.\\w+)*(\\.[\\d]+)?") + END_VERSIONS + """ +} diff --git a/modules/local/diann/assemble_empirical_library/meta.yml b/modules/local/diann/assemble_empirical_library/meta.yml new file mode 100644 index 0000000..7fa6585 --- /dev/null +++ b/modules/local/diann/assemble_empirical_library/meta.yml @@ -0,0 +1,39 @@ +name: assemble_empirical_library +description: A module for assembling an empirical library based on a preliminary analysis of the in-silico library with DIA-NN. 
+keywords: + - DIA-NN + - DIA +tools: + - DIA-NN: + description: | + DIA-NN - a universal software for data-independent acquisition (DIA) proteomics data processing by Demichev. + homepage: https://github.com/vdemichev/DiaNN + documentation: https://github.com/vdemichev/DiaNN +input: + - mzMLs: + type: file + description: Spectra files in mzML format + pattern: "*.mzML" + - quant: + type: file + description: The .quant files from DIA-NN preliminary analysis, containing IDs and quantification information. + pattern: "*.quant" + - lib: + type: file + description: Spectra library file + pattern: "*.tsv" +output: + - empirical_library: + type: file + description: An empirical spectral library from the .quant files. + pattern: "empirical_library.tsv" + - log: + type: file + description: DIA-NN log file + pattern: "assemble_empirical_library.log" + - versions: + type: file + description: File containing software version + pattern: "versions.yml" +authors: + - "@daichengxin" diff --git a/modules/local/diann/diann_msstats/main.nf b/modules/local/diann/diann_msstats/main.nf new file mode 100644 index 0000000..4374b58 --- /dev/null +++ b/modules/local/diann/diann_msstats/main.nf @@ -0,0 +1,33 @@ +process DIANN_MSSTATS { + tag "diann_msstats" + label 'process_medium' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/quantms-utils:0.0.28--pyh106432d_0' : + 'biocontainers/quantms-utils:0.0.28--pyh106432d_0' }" + + input: + path(report) + path(exp_design) + + output: + path "*msstats_in.csv", emit: out_msstats + path "*.log", emit: log + path "versions.yml", emit: versions + + script: + def args = task.ext.args ?: '' + """ + quantmsutilsc diann2msstats \\ + --report ${report} \\ + --exp_design ${exp_design} \\ + --qvalue_threshold $params.protein_level_fdr_cutoff \\ + $args \\ + 2>&1 | tee convert_report.log + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + quantms-utils: \$(pip show quantms-utils | grep "Version" | awk -F ': ' '{print \$2}') + END_VERSIONS + """ +} diff --git a/modules/local/diann/diann_msstats/meta.yml b/modules/local/diann/diann_msstats/meta.yml new file mode 100644 index 0000000..ac1f147 --- /dev/null +++ b/modules/local/diann/diann_msstats/meta.yml @@ -0,0 +1,48 @@ +name: diann_msstats +description: Convert DIA-NN report to MSstats format +keywords: + - DIA-NN + - conversion + - MSstats +tools: + - custom: + description: | + A custom module for DIA-NN report to MSstats conversion. + homepage: https://github.com/bigbio/quantmsdiann + documentation: https://github.com/bigbio/quantmsdiann +input: + - report: + type: file + description: DIA-NN main report file + pattern: "*.tsv" + - exp_design: + type: file + description: An experimental design file including Sample and replicates column et al. + pattern: "*.tsv" + - report_pr: + type: file + description: A text table containing normalized quantities for precursors. They are filtered at 1% FDR, using both global and run-specific q-values for precursors + pattern: "*pr_matrix.tsv" + - report_pg: + type: file + description: A text table containing normalized quantities for protein groups. 
They are filtered at 1% FDR, using global q-values for protein groups + pattern: "*pg_matrix.tsv" + - meta: + type: map + description: Groovy Map containing sample information + - fasta: + type: file + description: Protein sequence database in Fasta format. + pattern: "*.{fasta,fa}" +output: + - out_msstats: + type: file + description: MSstats input file + pattern: "*msstats_in.csv" + - versions: + type: file + description: File containing software version + pattern: "versions.yml" +authors: + - "@daichengxin" + - "@wanghong" diff --git a/modules/local/diann/final_quantification/main.nf b/modules/local/diann/final_quantification/main.nf new file mode 100644 index 0000000..cb7d481 --- /dev/null +++ b/modules/local/diann/final_quantification/main.nf @@ -0,0 +1,112 @@ +process FINAL_QUANTIFICATION { + tag "$meta.experiment_id" + label 'process_high' + label 'diann' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://containers.biocontainers.pro/s3/SingImgsRepo/diann/v1.8.1_cv1/diann_v1.8.1_cv1.img' : + 'docker.io/biocontainers/diann:v1.8.1_cv1' }" + + input: + // Note that the files are passed as names and not paths, this prevents them from being staged + // in the directory + val(ms_files) + val(meta) + path(empirical_library) + // The quant path is passed, and diann will use the files in the quant directory instead + // of the ones passed in ms_files. 
+ path("quant/") + path(fasta) + path(diann_config) + + output: + // DIA-NN 2.0 don't return report in tsv format + path "diann_report.{tsv,parquet}", emit: main_report, optional: true + path "diann_report.manifest.txt", emit: report_manifest, optional: true + path "diann_report.protein_description.tsv", emit: protein_description, optional: true + path "diann_report.stats.tsv", emit: report_stats + path "diann_report.pr_matrix.tsv", emit: pr_matrix + path "diann_report.pg_matrix.tsv", emit: pg_matrix + path "diann_report.gg_matrix.tsv", emit: gg_matrix + path "diann_report.unique_genes_matrix.tsv", emit: unique_gene_matrix + path "diannsummary.log", emit: log + + // Different library files format are exported due to different DIA-NN versions + path "empirical_library.tsv", emit: final_speclib, optional: true + path "empirical_library.tsv.skyline.speclib", emit: skyline_speclib, optional: true + path "versions.yml", emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + // Strip flags that are managed by the pipeline to prevent silent conflicts + def blocked = ['--no-main-report', '--gen-spec-lib', '--out-lib', '--no-ifs-removal', + '--temp', '--threads', '--verbose', '--lib', '--f', '--fasta', + '--use-quant', '--matrices', '--out', '--relaxed-prot-inf', '--pg-level', + '--qvalue', '--window', '--individual-windows', + '--species-genes', '--report-decoys', '--xic', '--no-norm', + '--monitor-mod', '--var-mod', '--fixed-mod', + '--channels', '--lib-fixed-mod', '--original-mods'] + // Sort by length descending so longer flags (e.g. --individual-windows) are matched before shorter prefixes (--window) + blocked.sort { a -> -a.length() }.each { flag -> + def flagPattern = '(?<=^|\\s)' + java.util.regex.Pattern.quote(flag) + '(?=\\s|\$)(\\s+(?!-{1,2}[a-zA-Z])\\S+)*' + if (args =~ flagPattern) { + log.warn "DIA-NN: '${flag}' is managed by the pipeline for FINAL_QUANTIFICATION and will be stripped." 
+ args = args.replaceAll(flagPattern, '').trim() + } + } + + scan_window = params.scan_window_automatic ? "--individual-windows" : "--window $params.scan_window" + species_genes = params.species_genes ? "--species-genes": "" + no_norm = params.diann_normalize ? "" : "--no-norm" + report_decoys = params.diann_report_decoys ? "--report-decoys": "" + diann_export_xic = params.diann_export_xic ? "--xic": "" + // --direct-quant only exists in DIA-NN >= 1.9.2 (QuantUMS counterpart); skip for older versions + quantums = params.quantums ? "" : (params.diann_version >= '1.9' ? "--direct-quant" : "") + quantums_train_runs = params.quantums_train_runs ? "--quant-train-runs $params.quantums_train_runs": "" + quantums_sel_runs = params.quantums_sel_runs ? "--quant-sel-runs $params.quantums_sel_runs": "" + quantums_params = params.quantums_params ? "--quant-params $params.quantums_params": "" + diann_no_peptidoforms = params.diann_no_peptidoforms ? "--no-peptidoforms" : "" + diann_use_quant = params.diann_use_quant ? "--use-quant" : "" + + """ + # Notes: if .quant files are passed, mzml/.d files are not accessed, so the name needs to be passed but files + # do not need to be present. 
+ + # Extract --var-mod, --fixed-mod, and --monitor-mod flags from diann_config.cfg + mod_flags=\$(grep -oP '(--var-mod\\s+\\S+|--fixed-mod\\s+\\S+|--monitor-mod\\s+\\S+|--lib-fixed-mod\\s+\\S+|--original-mods|--channels\\s+.+)' ${diann_config} | tr '\\n' ' ') + + diann --lib ${empirical_library} \\ + --fasta ${fasta} \\ + --f ${(ms_files as List).join(' --f ')} \\ + --threads ${task.cpus} \\ + --verbose $params.diann_debug \\ + --temp ./quant/ \\ + --relaxed-prot-inf \\ + --pg-level $params.pg_level \\ + ${species_genes} \\ + ${no_norm} \\ + --matrices \\ + --out diann_report.tsv \\ + --qvalue $params.protein_level_fdr_cutoff \\ + ${report_decoys} \\ + ${diann_export_xic} \\ + ${quantums} \\ + ${quantums_train_runs} \\ + ${quantums_sel_runs} \\ + ${quantums_params} \\ + ${diann_no_peptidoforms} \\ + ${diann_use_quant} \\ + \${mod_flags} \\ + $args + + cp diann_report.log.txt diannsummary.log + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + DIA-NN: \$(diann 2>&1 | grep "DIA-NN" | grep -oP "\\d+\\.\\d+(\\.\\w+)*(\\.[\\d]+)?") + END_VERSIONS + """ +} diff --git a/modules/local/diann/final_quantification/meta.yml b/modules/local/diann/final_quantification/meta.yml new file mode 100644 index 0000000..23f64c7 --- /dev/null +++ b/modules/local/diann/final_quantification/meta.yml @@ -0,0 +1,55 @@ +name: final_quantification +description: A module for summarization of results from DIA-NN analysis. +keywords: + - DIA-NN + - DIA +tools: + - DIA-NN: + description: | + DIA-NN - a universal software for data-independent acquisition (DIA) proteomics data processing by Demichev. + homepage: https://github.com/vdemichev/DiaNN + documentation: https://github.com/vdemichev/DiaNN +input: + - empirical_library: + type: file + description: Empirical spectral library generated by DIA-NN + pattern: "*.tsv" + - mzMLs: + type: file + description: Spectra files in mzML format. + pattern: "*.mzML" + - fasta: + type: file + description: Protein sequence database in Fasta format. 
+ pattern: "*.{fasta,fa}" + - quant: + type: file + description: Identification and Quantification file from DIA-NN. + pattern: "*.quant" +output: + - main_report: + type: file + description: A text table containing precursor and protein IDs, as well as plenty of associated information. + pattern: "*.tsv" + - pr_matrix: + type: file + description: A text table containing normalized quantities for precursors. They are filtered at 1% FDR, using both global and run-specific q-values for precursors + pattern: "*.tsv" + - pg_matrix: + type: file + description: A text table containing normalized quantities for protein groups. They are filtered at 1% FDR, using global q-values for protein groups + pattern: "*.tsv" + - gg_matrix: + type: file + description: A text table containing normalized quantities for gene groups. + pattern: "*.tsv" + - log: + type: file + description: DIA-NN log file + pattern: "diannsummary.log" + - versions: + type: file + description: File containing software version + pattern: "versions.yml" +authors: + - "@daichengxin" diff --git a/modules/local/diann/generate_cfg/main.nf b/modules/local/diann/generate_cfg/main.nf new file mode 100644 index 0000000..8377030 --- /dev/null +++ b/modules/local/diann/generate_cfg/main.nf @@ -0,0 +1,33 @@ +process GENERATE_CFG { + tag "$meta.experiment_id" + label 'process_tiny' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/quantms-utils:0.0.28--pyh106432d_0' : + 'biocontainers/quantms-utils:0.0.28--pyh106432d_0' }" + + input: + val(meta) + + output: + path 'diann_config.cfg', emit: diann_cfg + path 'versions.yml', emit: versions + path '*.log' + + script: + def args = task.ext.args ?: '' + + """ + quantmsutilsc dianncfg \\ + --enzyme "${meta.enzyme}" \\ + --fix_mod "${meta.fixedmodifications}" \\ + --var_mod "${meta.variablemodifications}" \\ + $args \\ + 2>&1 | tee GENERATE_DIANN_CFG.log + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + quantms-utils: \$(pip show quantms-utils | grep "Version" | awk -F ': ' '{print \$2}') + END_VERSIONS + """ +} diff --git a/modules/local/diann/generate_cfg/meta.yml b/modules/local/diann/generate_cfg/meta.yml new file mode 100644 index 0000000..7e70807 --- /dev/null +++ b/modules/local/diann/generate_cfg/meta.yml @@ -0,0 +1,30 @@ +name: generate_cfg +description: A module to generate DIA-NN configuration files, based on input files and params. +keywords: + - configure + - DIA-NN +tools: + - custom: + description: | + A custom module to generate DIA-NN configuration files from input files and params. 
+ homepage: https://github.com/bigbio/quantmsdiann + documentation: https://github.com/bigbio/quantmsdiann +input: + - meta: + type: map + description: Groovy Map containing sample information +output: + - diann_cfg: + type: file + description: DIA-NN configure file for search and quantification + pattern: "diann_config.cfg" + - version: + type: file + description: File containing software version + pattern: "versions.yml" + - log: + type: file + description: log file + pattern: "*.log" +authors: + - "@daichengxin" diff --git a/modules/local/diann/individual_analysis/main.nf b/modules/local/diann/individual_analysis/main.nf new file mode 100644 index 0000000..3bdccae --- /dev/null +++ b/modules/local/diann/individual_analysis/main.nf @@ -0,0 +1,126 @@ +process INDIVIDUAL_ANALYSIS { + tag "$ms_file.baseName" + label 'process_high' + label 'diann' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://containers.biocontainers.pro/s3/SingImgsRepo/diann/v1.8.1_cv1/diann_v1.8.1_cv1.img' : + 'docker.io/biocontainers/diann:v1.8.1_cv1' }" + + input: + tuple val(meta), path(ms_file), path(fasta), path(library) + path(diann_config) + + output: + path "*.quant", emit: diann_quant + path "*_final_diann.log", emit: log + path "versions.yml", emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + // Strip flags that are managed by the pipeline to prevent silent conflicts + def blocked = ['--use-quant', '--gen-spec-lib', '--out-lib', '--matrices', '--out', '--rt-profiling', + '--temp', '--threads', '--verbose', '--lib', '--f', '--fasta', + '--mass-acc', '--mass-acc-ms1', '--window', + '--no-ifs-removal', '--no-main-report', '--relaxed-prot-inf', '--pg-level', + '--min-pr-mz', '--max-pr-mz', '--min-fr-mz', '--max-fr-mz', + '--monitor-mod', '--var-mod', '--fixed-mod', + '--channels', '--lib-fixed-mod', '--original-mods'] + // Sort by length descending so longer 
flags (e.g. --mass-acc-ms1) are matched before shorter prefixes (--mass-acc) + blocked.sort { a -> -a.length() }.each { flag -> + def flagPattern = '(?<=^|\\s)' + java.util.regex.Pattern.quote(flag) + '(?=\\s|\$)(\\s+(?!-{1,2}[a-zA-Z])\\S+)*' + if (args =~ flagPattern) { + log.warn "DIA-NN: '${flag}' is managed by the pipeline for INDIVIDUAL_ANALYSIS and will be stripped." + args = args.replaceAll(flagPattern, '').trim() + } + } + + // Warn about flags that override pipeline-computed calibration values (not blocked, but may change behaviour) + ['--individual-windows', '--individual-mass-acc'].each { flag -> + if (args.contains(flag)) { + log.warn "DIA-NN: '${flag}' overrides the mass accuracy / scan window values computed by the PRELIMINARY_ANALYSIS step. This may change pipeline behaviour." + } + } + + if (params.mass_acc_automatic || params.scan_window_automatic) { + if (meta.mass_acc_ms2 != "0" && meta.mass_acc_ms2 != null) { + mass_acc_ms2 = meta.mass_acc_ms2 + mass_acc_ms1 = meta.mass_acc_ms1 + scan_window = meta.scan_window + } + else if (meta['precursormasstoleranceunit']?.toLowerCase()?.endsWith('ppm') && meta['fragmentmasstoleranceunit']?.toLowerCase()?.endsWith('ppm')) { + mass_acc_ms2 = meta['fragmentmasstolerance'] + mass_acc_ms1 = meta['precursormasstolerance'] + scan_window = params.scan_window + } + else { + mass_acc_ms2 = params.mass_acc_ms2 + mass_acc_ms1 = params.mass_acc_ms1 + scan_window = params.scan_window + } + } else { + if (meta['precursormasstoleranceunit']?.toLowerCase()?.endsWith('ppm') && meta['fragmentmasstoleranceunit']?.toLowerCase()?.endsWith('ppm')) { + mass_acc_ms1 = meta["precursormasstolerance"] + mass_acc_ms2 = meta["fragmentmasstolerance"] + scan_window = params.scan_window + } + else if (meta.mass_acc_ms2 != "0" && meta.mass_acc_ms2 != null) { + mass_acc_ms2 = meta.mass_acc_ms2 + mass_acc_ms1 = meta.mass_acc_ms1 + scan_window = meta.scan_window + } + else { + mass_acc_ms2 = params.mass_acc_ms2 + mass_acc_ms1 = 
params.mass_acc_ms1 + scan_window = params.scan_window + } + } + + diann_no_peptidoforms = params.diann_no_peptidoforms ? "--no-peptidoforms" : "" + diann_tims_sum = params.diann_tims_sum ? "--quant-tims-sum" : "" + diann_im_window = params.diann_im_window ? "--im-window $params.diann_im_window" : "" + + // Per-file scan ranges from SDRF (empty = no flag, DIA-NN auto-detects) + min_pr_mz = meta['ms1minmz'] ? "--min-pr-mz ${meta['ms1minmz']}" : "" + max_pr_mz = meta['ms1maxmz'] ? "--max-pr-mz ${meta['ms1maxmz']}" : "" + min_fr_mz = meta['ms2minmz'] ? "--min-fr-mz ${meta['ms2minmz']}" : "" + max_fr_mz = meta['ms2maxmz'] ? "--max-fr-mz ${meta['ms2maxmz']}" : "" + + """ + # Extract --var-mod, --fixed-mod, and --monitor-mod flags from diann_config.cfg + mod_flags=\$(grep -oP '(--var-mod\\s+\\S+|--fixed-mod\\s+\\S+|--monitor-mod\\s+\\S+|--lib-fixed-mod\\s+\\S+|--original-mods|--channels\\s+.+)' ${diann_config} | tr '\\n' ' ') + + diann --lib ${library} \\ + --f ${ms_file} \\ + --fasta ${fasta} \\ + --threads ${task.cpus} \\ + --verbose $params.diann_debug \\ + --temp ./ \\ + --mass-acc ${mass_acc_ms2} \\ + --mass-acc-ms1 ${mass_acc_ms1} \\ + --window ${scan_window} \\ + --no-ifs-removal \\ + --no-main-report \\ + --relaxed-prot-inf \\ + --pg-level $params.pg_level \\ + ${min_pr_mz} \\ + ${max_pr_mz} \\ + ${min_fr_mz} \\ + ${max_fr_mz} \\ + ${diann_no_peptidoforms} \\ + ${diann_tims_sum} \\ + ${diann_im_window} \\ + \${mod_flags} \\ + $args + + cp report.log.txt ${ms_file.baseName}_final_diann.log + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + DIA-NN: \$(diann 2>&1 | grep "DIA-NN" | grep -oP "\\d+\\.\\d+(\\.\\w+)*(\\.[\\d]+)?") + END_VERSIONS + """ +} diff --git a/modules/local/diann/individual_analysis/meta.yml b/modules/local/diann/individual_analysis/meta.yml new file mode 100644 index 0000000..8c0c677 --- /dev/null +++ b/modules/local/diann/individual_analysis/meta.yml @@ -0,0 +1,43 @@ +name: individual_analysis +description: A module for final analysis 
of individual raw files based on DIA-NN. +keywords: + - DIA-NN + - DIA +tools: + - DIA-NN: + description: | + DIA-NN - a universal software for data-independent acquisition (DIA) proteomics data processing by Demichev. + homepage: https://github.com/vdemichev/DiaNN + documentation: https://github.com/vdemichev/DiaNN +input: + - empirical_library: + type: file + description: An empirical spectral library from the .quant files. + pattern: "empirical_library.tsv" + - mzML: + type: file + description: Spectra file in mzML format + pattern: "*.mzML" + - fasta: + type: file + description: Protein sequence database in fasta format + pattern: "*.{fasta,fa}" +output: + - diann_quant: + type: file + description: Quantification file from DIA-NN + pattern: "*.quant" + - log: + type: file + description: DIA-NN log file + pattern: "*_final_diann.log" + - versions: + type: file + description: File containing software version + pattern: "versions.yml" +authors: + - "@daichengxin" diff --git a/modules/local/diann/insilico_library_generation/main.nf b/modules/local/diann/insilico_library_generation/main.nf new file mode 100644 index 0000000..b347483 --- /dev/null +++ b/modules/local/diann/insilico_library_generation/main.nf @@ -0,0 +1,78 @@ +process INSILICO_LIBRARY_GENERATION { + tag "$fasta.name" + label 'process_medium' + label 'diann' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+ 'https://containers.biocontainers.pro/s3/SingImgsRepo/diann/v1.8.1_cv1/diann_v1.8.1_cv1.img' : + 'docker.io/biocontainers/diann:v1.8.1_cv1' }" + + input: + path(fasta) + path(diann_config) + + output: + path "versions.yml", emit: versions + path "*.predicted.speclib", emit: predict_speclib + path "*.tsv", emit: speclib_tsv, optional: true + path "silicolibrarygeneration.log", emit: log + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + // Strip flags that are managed by the pipeline to prevent silent conflicts + def blocked = ['--use-quant', '--no-main-report', '--matrices', '--out', + '--temp', '--threads', '--verbose', '--lib', '--f', '--fasta', + '--fasta-search', '--predictor', '--gen-spec-lib', + '--missed-cleavages', '--min-pep-len', '--max-pep-len', + '--min-pr-charge', '--max-pr-charge', '--var-mods', + '--min-pr-mz', '--max-pr-mz', '--min-fr-mz', '--max-fr-mz', + '--met-excision', '--monitor-mod'] + // Sort by length descending so longer flags (e.g. --fasta-search) are matched before shorter prefixes (--fasta, --f) + blocked.sort { a -> -a.length() }.each { flag -> + def flagPattern = '(?<=^|\\s)' + java.util.regex.Pattern.quote(flag) + '(?=\\s|\$)(\\s+(?!-{1,2}[a-zA-Z])\\S+)*' + if (args =~ flagPattern) { + log.warn "DIA-NN: '${flag}' is managed by the pipeline for INSILICO_LIBRARY_GENERATION and will be stripped." + args = args.replaceAll(flagPattern, '').trim() + } + } + + min_pr_mz = params.min_pr_mz ? "--min-pr-mz $params.min_pr_mz":"" + max_pr_mz = params.max_pr_mz ? "--max-pr-mz $params.max_pr_mz":"" + min_fr_mz = params.min_fr_mz ? "--min-fr-mz $params.min_fr_mz":"" + max_fr_mz = params.max_fr_mz ? "--max-fr-mz $params.max_fr_mz":"" + met_excision = params.met_excision ? "--met-excision" : "" + diann_no_peptidoforms = params.diann_no_peptidoforms ? 
"--no-peptidoforms" : "" + + """ + diann `cat ${diann_config}` \\ + --fasta ${fasta} \\ + --fasta-search \\ + ${min_pr_mz} \\ + ${max_pr_mz} \\ + ${min_fr_mz} \\ + ${max_fr_mz} \\ + --missed-cleavages $params.allowed_missed_cleavages \\ + --min-pep-len $params.min_peptide_length \\ + --max-pep-len $params.max_peptide_length \\ + --min-pr-charge $params.min_precursor_charge \\ + --max-pr-charge $params.max_precursor_charge \\ + --var-mods $params.max_mods \\ + --threads ${task.cpus} \\ + --predictor \\ + --verbose $params.diann_debug \\ + --gen-spec-lib \\ + ${diann_no_peptidoforms} \\ + ${met_excision} \\ + ${args} + + cp *lib.log.txt silicolibrarygeneration.log + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + DIA-NN: \$(diann 2>&1 | grep "DIA-NN" | grep -oP "\\d+\\.\\d+(\\.\\w+)*(\\.[\\d]+)?") + END_VERSIONS + """ +} diff --git a/modules/local/diann/insilico_library_generation/meta.yml b/modules/local/diann/insilico_library_generation/meta.yml new file mode 100644 index 0000000..5f9d68b --- /dev/null +++ b/modules/local/diann/insilico_library_generation/meta.yml @@ -0,0 +1,36 @@ +name: insilico_library_generation +description: A module for in silico predicted library generation based on DIA-NN. +keywords: + - DIA-NN + - library free + - DIA +tools: + - DIA-NN: + description: | + DIA-NN - a universal software for data-independent acquisition (DIA) proteomics data processing by Demichev. + homepage: https://github.com/vdemichev/DiaNN + documentation: https://github.com/vdemichev/DiaNN +input: + - fasta: + type: file + description: FASTA sequence databases + pattern: "*.{fasta,fa}" + - cfg: + type: file + description: specifies a configuration file to load options/commands from. 
+ pattern: "*.cfg" +output: + - predict_speclib: + type: file + description: In silico-predicted spectral library by deep learning predictor in DIA-NN + pattern: "*.predicted.speclib" + - log: + type: file + description: DIA-NN log file + pattern: "silicolibrarygeneration.log" + - versions: + type: file + description: File containing software version + pattern: "versions.yml" +authors: + - "@daichengxin" diff --git a/modules/local/diann/preliminary_analysis/main.nf b/modules/local/diann/preliminary_analysis/main.nf new file mode 100644 index 0000000..8bb818b --- /dev/null +++ b/modules/local/diann/preliminary_analysis/main.nf @@ -0,0 +1,115 @@ +process PRELIMINARY_ANALYSIS { + tag "$ms_file.baseName" + label 'process_high' + label 'diann' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://containers.biocontainers.pro/s3/SingImgsRepo/diann/v1.8.1_cv1/diann_v1.8.1_cv1.img' : + 'docker.io/biocontainers/diann:v1.8.1_cv1' }" + + input: + tuple val(meta), path(ms_file), path(predict_library) + path(diann_config) + + output: + path "*.quant", emit: diann_quant + tuple val(meta), path("*_diann.log"), emit: log + path "versions.yml", emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + // Strip flags that are managed by the pipeline to prevent silent conflicts + def blocked = ['--use-quant', '--gen-spec-lib', '--out-lib', '--matrices', '--out', + '--temp', '--threads', '--verbose', '--lib', '--f', '--fasta', + '--mass-acc', '--mass-acc-ms1', '--window', + '--quick-mass-acc', '--min-corr', '--corr-diff', '--time-corr-only', + '--min-pr-mz', '--max-pr-mz', '--min-fr-mz', '--max-fr-mz', + '--monitor-mod', '--var-mod', '--fixed-mod', '--no-prot-inf', + '--channels', '--lib-fixed-mod', '--original-mods'] + // Sort by length descending so longer flags (e.g. 
--mass-acc-ms1) are matched before shorter prefixes (--mass-acc) + blocked.sort { a -> -a.length() }.each { flag -> + def flagPattern = '(?<=^|\\s)' + java.util.regex.Pattern.quote(flag) + '(?=\\s|\$)(\\s+(?!-{1,2}[a-zA-Z])\\S+)*' + if (args =~ flagPattern) { + log.warn "DIA-NN: '${flag}' is managed by the pipeline for PRELIMINARY_ANALYSIS and will be stripped." + args = args.replaceAll(flagPattern, '').trim() + } + } + + // Performance flags for preliminary analysis calibration step + quick_mass_acc = params.quick_mass_acc ? "--quick-mass-acc" : "" + performance_flags = params.performance_mode ? "--min-corr 2 --corr-diff 1 --time-corr-only" : "" + diann_no_peptidoforms = params.diann_no_peptidoforms ? "--no-peptidoforms" : "" + + // I am using the ["key"] syntax here, since the preprocessed meta + // was evaluating to null when using the dot notation. + + if (params.mass_acc_automatic) { + mass_acc = "" + } else if (meta['precursormasstoleranceunit']?.toLowerCase()?.endsWith('ppm') && meta['fragmentmasstoleranceunit']?.toLowerCase()?.endsWith('ppm')) { + mass_acc = "--mass-acc ${meta['fragmentmasstolerance']} --mass-acc-ms1 ${meta['precursormasstolerance']}" + } else { + log.warn "DIA-NN only supports ppm tolerance units for MS1 and MS2. Falling back to `mass_acc_automatic`=`true` so that DIA-NN determines the tolerances automatically!" + mass_acc = "" + } + + // Warn about auto-calibration with Bruker/timsTOF data + if (params.mass_acc_automatic && ms_file.name.toString().toLowerCase().endsWith('.d')) { + log.warn "Bruker/timsTOF .d file detected (${ms_file.name}) with automatic mass accuracy calibration. " + + "DIA-NN recommends manually fixing MS1 and MS2 mass accuracy for timsTOF datasets (typically 10-15 ppm). " + + "Set tolerances via SDRF columns (PrecursorMassTolerance, FragmentMassTolerance) for per-file control, " + + "or use --mass_acc_automatic false with --mass_acc_ms1 and --mass_acc_ms2 pipeline parameters for a global override."
+ } + + // Notes: Use double quotes for params, so that it is escaped in the shell. + scan_window = params.scan_window_automatic ? '' : "--window $params.scan_window" + diann_tims_sum = params.diann_tims_sum ? "--quant-tims-sum" : "" + diann_im_window = params.diann_im_window ? "--im-window $params.diann_im_window" : "" + + // Per-file scan ranges from SDRF (empty = no flag, DIA-NN auto-detects) + min_pr_mz = meta['ms1minmz'] ? "--min-pr-mz ${meta['ms1minmz']}" : "" + max_pr_mz = meta['ms1maxmz'] ? "--max-pr-mz ${meta['ms1maxmz']}" : "" + min_fr_mz = meta['ms2minmz'] ? "--min-fr-mz ${meta['ms2minmz']}" : "" + max_fr_mz = meta['ms2maxmz'] ? "--max-fr-mz ${meta['ms2maxmz']}" : "" + + """ + # Precursor Tolerance value was: ${meta['precursormasstolerance']} + # Fragment Tolerance value was: ${meta['fragmentmasstolerance']} + # Precursor Tolerance unit was: ${meta['precursormasstoleranceunit']} + # Fragment Tolerance unit was: ${meta['fragmentmasstoleranceunit']} + + # Final mass accuracy is '${mass_acc}' + + # Extract --var-mod, --fixed-mod, and --monitor-mod flags from diann_config.cfg + mod_flags=\$(grep -oP '(--var-mod\\s+\\S+|--fixed-mod\\s+\\S+|--monitor-mod\\s+\\S+|--lib-fixed-mod\\s+\\S+|--original-mods|--channels\\s+.+)' ${diann_config} | tr '\\n' ' ') + + diann --lib ${predict_library} \\ + --f ${ms_file} \\ + --threads ${task.cpus} \\ + --verbose $params.diann_debug \\ + ${scan_window} \\ + --temp ./ \\ + ${mass_acc} \\ + ${quick_mass_acc} \\ + ${performance_flags} \\ + ${min_pr_mz} \\ + ${max_pr_mz} \\ + ${min_fr_mz} \\ + ${max_fr_mz} \\ + ${diann_no_peptidoforms} \\ + ${diann_tims_sum} \\ + ${diann_im_window} \\ + --no-prot-inf \\ + \${mod_flags} \\ + $args + + cp report.log.txt ${ms_file.baseName}_diann.log + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + DIA-NN: \$(diann 2>&1 | grep "DIA-NN" | grep -oP "\\d+\\.\\d+(\\.\\w+)*(\\.[\\d]+)?") + END_VERSIONS + """ +} diff --git a/modules/local/diann/preliminary_analysis/meta.yml 
b/modules/local/diann/preliminary_analysis/meta.yml new file mode 100644 index 0000000..5eb752a --- /dev/null +++ b/modules/local/diann/preliminary_analysis/meta.yml @@ -0,0 +1,38 @@ +name: preliminary_analysis +description: A module for preliminary analysis of individual raw files with DIA-NN using the in-silico generated library (also from DIA-NN). +keywords: + - DIA-NN + - DIA +tools: + - DIA-NN: + description: | + DIA-NN - a universal software for data-independent acquisition (DIA) proteomics data processing by Demichev. + homepage: https://github.com/vdemichev/DiaNN + documentation: https://github.com/vdemichev/DiaNN +input: + - meta: + type: map + description: Groovy Map containing sample information + - predict_library: + type: file + description: In silico-predicted spectral library by deep learning predictor in DIA-NN + pattern: "*.predicted.speclib" + - mzML: + type: file + description: Spectra file in mzML format + pattern: "*.mzML" +output: + - diann_quant: + type: file + description: Quantification file from DIA-NN + pattern: "*.quant" + - log: + type: file + description: DIA-NN log file + pattern: "*_diann.log" + - versions: + type: file + description: File containing software version + pattern: "versions.yml" +authors: + - "@daichengxin" diff --git a/modules/local/openms/mzml_indexing/main.nf b/modules/local/openms/mzml_indexing/main.nf new file mode 100644 index 0000000..3798a64 --- /dev/null +++ b/modules/local/openms/mzml_indexing/main.nf @@ -0,0 +1,33 @@ +process MZML_INDEXING { + tag "$meta.id" + label 'process_low' + label 'openms' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+ 'oras://ghcr.io/bigbio/openms-tools-thirdparty-sif:2025.04.14' : + 'ghcr.io/bigbio/openms-tools-thirdparty:2025.04.14' }" + + input: + tuple val(meta), path(mzmlfile) + + output: + tuple val(meta), path("out/*.mzML"), emit: mzmls_indexed + path "versions.yml", emit: versions + path "*.log", emit: log + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + + """ + mkdir -p out + FileConverter -in ${mzmlfile} -out out/${mzmlfile.baseName}.mzML \\ + ${args} \\ + 2>&1 | tee ${mzmlfile.baseName}_mzmlindexing.log + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + FileConverter: \$(FileConverter 2>&1 | grep -E '^Version(.*)' | sed 's/Version: //g' | cut -d ' ' -f 1) + END_VERSIONS + """ +} diff --git a/modules/local/openms/mzml_indexing/meta.yml b/modules/local/openms/mzml_indexing/meta.yml new file mode 100644 index 0000000..e528b59 --- /dev/null +++ b/modules/local/openms/mzml_indexing/meta.yml @@ -0,0 +1,41 @@ +name: mzml_indexing +description: Converts mzML to indexed mzML +keywords: + - raw + - mzML + - OpenMS +tools: + - FileConverter: + description: | + Converts between different MS file formats + homepage: http://www.openms.de/doxygen/nightly/html/TOPP_FileConverter.html + documentation: http://www.openms.de/doxygen/nightly/html/TOPP_FileConverter.html +input: + - meta: + type: map + description: | + Groovy Map containing sample information + - mzmlfile: + type: file + description: | + Input file to convert. 
+ pattern: "*.mzML" +output: + - meta: + type: map + description: | + Groovy Map containing sample information + - mzmls_indexed: + type: file + description: indexed mzML file + pattern: "*.mzML" + - log: + type: file + description: log file + pattern: "*.log" + - versions: + type: file + description: File containing software version + pattern: "versions.yml" +authors: + - "@daichengxin" diff --git a/modules/local/pmultiqc/main.nf b/modules/local/pmultiqc/main.nf new file mode 100644 index 0000000..facf4e8 --- /dev/null +++ b/modules/local/pmultiqc/main.nf @@ -0,0 +1,49 @@ +process PMULTIQC { + label 'process_high' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/pmultiqc:0.0.43--pyhdfd78af_0' : + 'biocontainers/pmultiqc:0.0.43--pyhdfd78af_0' }" + + input: + path multiqc_inputs, stageAs: 'results/*' + + output: + path "*.html", emit: ch_pmultiqc_report + path "*.db", optional: true, emit: ch_pmultiqc_db + path "versions.yml", emit: versions + path "*_data", emit: data + + script: + def args = task.ext.args ?: '' + def quantms_plugin = (params.enable_pmultiqc) ? "--quantms-plugin" : "" + def disable_table_plots = (params.enable_pmultiqc) && (params.skip_table_plots) ? "--disable-table" : "" + def disable_idxml_index = (params.enable_pmultiqc) && (params.pmultiqc_idxml_skip) ? "--ignored-idxml" : "" + def contaminant_affix = params.contaminant_string ? "--contaminant-affix ${params.contaminant_string}" : "" + + """ + set -x + set -e + + # left here to ease debugging + ls -lcth * + + cat results/*openms_design.tsv 2>/dev/null || true + + multiqc \\ + -f \\ + ${quantms_plugin} \\ + --config ./results/multiqc_config.yml \\ + ${args} \\ + ${disable_table_plots} \\ + ${disable_idxml_index} \\ + ${contaminant_affix} \\ + ./results \\ + -o .
+ + cat <<-END_VERSIONS > versions.yml + "${task.process}": + pmultiqc: \$(multiqc --pmultiqc_version | sed -e "s/pmultiqc, version //g") + END_VERSIONS + """ +} diff --git a/modules/local/pmultiqc/meta.yml b/modules/local/pmultiqc/meta.yml new file mode 100644 index 0000000..adf63f2 --- /dev/null +++ b/modules/local/pmultiqc/meta.yml @@ -0,0 +1,37 @@ +name: pmultiqc +description: A library for proteomics QC report based on MultiQC framework. +keywords: + - MultiQC + - QC + - Proteomics +tools: + - pmultiqc: + description: | + A library for proteomics QC report based on MultiQC framework. + homepage: https://github.com/bigbio/pmultiqc/ + documentation: https://github.com/bigbio/pmultiqc/ +input: + - multiqc_inputs: + type: file + description: | + Collection of files staged under results/ for MultiQC + (e.g. experimental design, pipeline outputs, version files, config) +output: + - report: + type: file + description: MultiQC report file + pattern: "*.html" + - quantmsdb: + type: file + description: Sqlite3 database file stored protein psm and quantification information + pattern: "*.db" + - data: + type: dir + description: MultiQC data dir + pattern: "*_data" + - versions: + type: file + description: File containing software version + pattern: "versions.yml" +authors: + - "@daichengxin" diff --git a/modules/local/samplesheet_check/main.nf b/modules/local/samplesheet_check/main.nf new file mode 100644 index 0000000..f2b7112 --- /dev/null +++ b/modules/local/samplesheet_check/main.nf @@ -0,0 +1,50 @@ +process SAMPLESHEET_CHECK { + + tag "$input_file" + label 'process_tiny' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
+ 'https://depot.galaxyproject.org/singularity/quantms-utils:0.0.28--pyh106432d_0' : + 'biocontainers/quantms-utils:0.0.28--pyh106432d_0' }" + + input: + path input_file + + output: + path "*.log", emit: log + path "*.sdrf.tsv", includeInputs: true, emit: checked_file + path "versions.yml", emit: versions + + when: + task.ext.when == null || task.ext.when + + script: + def args = task.ext.args ?: '' + def string_use_ols_cache_only = params.use_ols_cache_only == true ? "--use_ols_cache_only" : "" + + """ + # Get basename and create output filename + BASENAME=\$(basename "${input_file}") + # Remove .sdrf.tsv, .sdrf.csv, or .sdrf extension (in that order to match longest first) + BASENAME=\$(echo "\$BASENAME" | sed -E 's/\\.sdrf\\.(tsv|csv)\$//' | sed -E 's/\\.sdrf\$//') + OUTPUT_FILE="\${BASENAME}.sdrf.tsv" + + # Convert CSV to TSV if needed using pandas + if [[ "${input_file}" == *.csv ]]; then + python -c "import pandas as pd; df = pd.read_csv('${input_file}'); df.to_csv('\$OUTPUT_FILE', sep='\\t', index=False)" + elif [[ "${input_file}" != "\$OUTPUT_FILE" ]]; then + cp "${input_file}" "\$OUTPUT_FILE" + fi + + quantmsutilsc checksamplesheet --exp_design "\$OUTPUT_FILE" \\ + --minimal \\ + ${string_use_ols_cache_only} \\ + $args \\ + 2>&1 | tee input_check.log + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + quantms-utils: \$(pip show quantms-utils | grep "Version" | awk -F ': ' '{print \$2}') + END_VERSIONS + """ +} diff --git a/modules/local/samplesheet_check/meta.yml b/modules/local/samplesheet_check/meta.yml new file mode 100644 index 0000000..e21b447 --- /dev/null +++ b/modules/local/samplesheet_check/meta.yml @@ -0,0 +1,30 @@ +name: samplesheet_check +description: Validate and check samplesheet/experimental design files +keywords: + - validation + - samplesheet + - experimental design + - sdrf +tools: + - samplesheet_check +input: + - input_file: + type: file + description: Input samplesheet or experimental design file + pattern:
"*.{tsv,csv,sdrf}" +output: + - log: + type: file + description: Log file from validation process + pattern: "*.log" + - checked_file: + type: file + description: Validated input file + pattern: "*.sdrf.tsv" + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" +authors: + - "@nf-core" + - "@bigbio" diff --git a/modules/local/sdrf_parsing/main.nf b/modules/local/sdrf_parsing/main.nf new file mode 100644 index 0000000..fabc1a8 --- /dev/null +++ b/modules/local/sdrf_parsing/main.nf @@ -0,0 +1,37 @@ +process SDRF_PARSING { + tag "$sdrf.name" + label 'process_tiny' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/sdrf-pipelines:0.1.2--pyhdfd78af_0' : + 'biocontainers/sdrf-pipelines:0.1.2--pyhdfd78af_0' }" + + input: + path sdrf + + output: + path "diann_design.tsv" , emit: ch_expdesign + path "diann_config.cfg" , emit: ch_diann_cfg + path "*.log" , emit: log + path "versions.yml" , emit: versions + + script: + def args = task.ext.args ?: '' + def mod_loc_flag = (params.enable_mod_localization && params.mod_localization) ? + "--mod_localization '${params.mod_localization}'" : '' + def diann_version_flag = params.diann_version ?
"--diann_version '${params.diann_version}'" : ''
+
+    """
+    parse_sdrf convert-diann \\
+        -s ${sdrf} \\
+        ${mod_loc_flag} \\
+        ${diann_version_flag} \\
+        $args \\
+        2>&1 | tee ${sdrf.baseName}_parsing.log
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        sdrf-pipelines: \$(parse_sdrf --version 2>/dev/null | awk -F ' ' '{print \$2}')
+    END_VERSIONS
+    """
+}
diff --git a/modules/local/sdrf_parsing/meta.yml b/modules/local/sdrf_parsing/meta.yml
new file mode 100644
index 0000000..7c311f4
--- /dev/null
+++ b/modules/local/sdrf_parsing/meta.yml
@@ -0,0 +1,36 @@
+name: sdrf_parsing
+description: Convert SDRF proteomics files into pipelines config files
+keywords:
+  - SDRF
+  - bioinformatics tools
+  - OpenMS
+tools:
+  - sdrf-pipelines:
+      description: |
+        Convert SDRF proteomics files into pipelines config files.
+      homepage: https://github.com/bigbio/sdrf-pipelines
+      documentation: https://github.com/bigbio/sdrf-pipelines
+input:
+  - sdrf_files:
+      type: file
+      description: |
+        A valid sdrf file
+output:
+  - ch_expdesign:
+      type: file
+      description: experimental design file in OpenMS format
+      pattern: "*_design.tsv"
+  - ch_diann_cfg:
+      type: file
+      description: DIA-NN configuration file derived from the SDRF
+      pattern: "diann_config.cfg"
+  - log:
+      type: file
+      description: log file
+      pattern: "*.log"
+  - versions:
+      type: file
+      description: File containing software versions
+      pattern: "versions.yml"
+authors:
+  - "@daichengxin"
diff --git a/modules/local/utils/decompress_dotd/main.nf b/modules/local/utils/decompress_dotd/main.nf
new file mode 100644
index 0000000..d9ccde8
--- /dev/null
+++ b/modules/local/utils/decompress_dotd/main.nf
@@ -0,0 +1,112 @@
+
+process DECOMPRESS {
+    tag "$meta.id"
+    label 'process_low'
+    label 'error_retry'
+
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+ 'https://depot.galaxyproject.org/singularity/mulled-v2-796b0610595ad1995b121d0b85375902097b78d4:a3a3220eb9ee55710d743438b2ab9092867c98c6-0' : + 'quay.io/biocontainers/mulled-v2-796b0610595ad1995b121d0b85375902097b78d4:a3a3220eb9ee55710d743438b2ab9092867c98c6-0' }" + + stageInMode { + if (task.attempt == 1) { + if (task.executor == 'awsbatch') { + 'symlink' + } else { + 'link' + } + } else if (task.attempt == 2) { + if (task.executor == 'awsbatch') { + 'copy' + } else { + 'symlink' + } + } else { + 'copy' + } + } + + input: + tuple val(meta), path(compressed_file) + + output: + tuple val(meta), path('*.d'), emit: decompressed_files + path 'versions.yml', emit: versions + path '*.log', emit: log + + script: + String prefix = task.ext.prefix ?: "${meta.id}" + + """ + function verify_tar { + exit_code=0 + error=\$(tar df "\$1" 2>&1) || exit_code=\$? + if [ \$exit_code -eq 2 ]; then + echo "\${error}" + exit 2 + fi + + case \${error} in + *'No such file'* ) + echo "\${error}" | grep "No such file" + exit 1 + ;; + *'Size differs'* ) + echo "\${error}" | grep "Size differs" + exit 1 + ;; + esac + } + + + + function extract { + if [ -z "\$1" ]; then + echo "Usage: extract ." + exit 1 + else + if [ -f "\$1" ]; then + case "\$1" in + *.tar.gz) tar xvzf "\$1" && verify_tar "\$1" ;; + *.gz) gunzip "\$1" ;; + *.tar) tar xvf "\$1" && verify_tar "\$1" ;; + *.zip) unzip "\$1" ;; + *) echo "extract: '\$1' - unknown archive method"; exit 1 ;; + esac + else + echo "\$1 - file does not exist"; exit 1 + fi + fi + } + + tar --help 2>&1 | tee -a ${prefix}_decompression.log + gunzip --help 2>&1 | tee -a ${prefix}_decompression.log + (unzip --help 2>&1 || zip --help 2>&1) | tee -a ${prefix}_decompression.log + echo "Unpacking..." | tee -a ${prefix}_decompression.log + + extract ${compressed_file} 2>&1 | tee -a ${prefix}_decompression.log + + # Fix read-only permissions from Bruker/Windows zip archives (dirs extracted as dr-xr-xr-x) + chmod -R u+w . 
2>/dev/null || true

+    expected_dir="${file(compressed_file.baseName).baseName}.d"
+    if [ -d "\${expected_dir}" ]; then
+        echo "Found \${expected_dir}"
+    else
+        # Handle archives where internal directory name differs (e.g. spaces vs underscores)
+        extracted_dir=\$(find . -maxdepth 1 -name "*.d" -type d | head -1 | sed 's|^\\./||')
+        if [ -n "\${extracted_dir}" ]; then
+            mv "\${extracted_dir}" "\${expected_dir}"
+        fi
+    fi
+
+    ls -l | tee -a ${prefix}_decompression.log
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        gunzip: \$(gunzip --help 2>&1 | head -1 | grep -oE "[0-9]+\\.[0-9]+(\\.[0-9]+)?")
+        tar: \$(tar --help 2>&1 | head -1 | grep -oE "[0-9]+\\.[0-9]+(\\.[0-9]+)?")
+        unzip: \$( (unzip --help 2>&1 || zip --help 2>&1) | head -2 | tail -1 | grep -oE "[0-9]+\\.[0-9]+")
+    END_VERSIONS
+    """
+}
diff --git a/modules/local/utils/decompress_dotd/meta.yml b/modules/local/utils/decompress_dotd/meta.yml
new file mode 100644
index 0000000..bbc7c58
--- /dev/null
+++ b/modules/local/utils/decompress_dotd/meta.yml
@@ -0,0 +1,45 @@
+name: decompress_dotd
+description: Decompress .tar/.gz files that contain a .d file/directory
+keywords:
+  - raw
+  - bruker
+  - .d
+tools:
+  - tar:
+      description: |
+        Generates and extracts archives.
+      homepage: https://www.gnu.org/software/tar/
+  - gunzip:
+      description: |
+        Decompresses using zlib.
+      homepage: https://www.gnu.org/software/gzip/
+input:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+  - rawfile:
+      type: file
+      description: |
+        Bruker Raw file archived using tar
+      pattern: "*.{d.tar,tar,gz,d.tar.gz}"
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. 
[ id:'MD5', enzyme:trypsin ] + - dotd: + type: path + description: Raw Bruker .d file + pattern: "*.d" + - log: + type: file + description: log file + pattern: "*.log" + - version: + type: file + description: File containing software version + pattern: "versions.yml" +authors: + - "@jspaezp" diff --git a/modules/local/utils/mzml_statistics/main.nf b/modules/local/utils/mzml_statistics/main.nf new file mode 100644 index 0000000..d6eb3e0 --- /dev/null +++ b/modules/local/utils/mzml_statistics/main.nf @@ -0,0 +1,38 @@ +process MZML_STATISTICS { + tag "$meta.id" + label 'process_very_low' + label 'process_single' + + container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? + 'https://depot.galaxyproject.org/singularity/quantms-utils:0.0.28--pyh106432d_0' : + 'biocontainers/quantms-utils:0.0.28--pyh106432d_0' }" + + input: + tuple val(meta), path(ms_file) + + output: + path "*_ms_info.parquet", emit: ms_statistics + tuple val(meta), path("*_ms2_info.parquet"), emit: ms2_statistics, optional: true + path "*_feature_info.parquet", emit: feature_statistics, optional: true + path "versions.yml", emit: versions + path "*.log", emit: log + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + def string_ms2_file = params.mzml_features == true ? "--ms2_file" : "" + def string_features_file = params.mzml_features == true ? 
"--feature_detection" : "" + + """ + quantmsutilsc mzmlstats --ms_path "${ms_file}" \\ + ${string_ms2_file} \\ + ${string_features_file} \\ + $args \\ + 2>&1 | tee ${ms_file.baseName}_mzml_statistics.log + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + quantms-utils: \$(pip show quantms-utils | grep "Version" | awk -F ': ' '{print \$2}') + END_VERSIONS + """ +} diff --git a/modules/local/utils/mzml_statistics/meta.yml b/modules/local/utils/mzml_statistics/meta.yml new file mode 100644 index 0000000..ae4ffd8 --- /dev/null +++ b/modules/local/utils/mzml_statistics/meta.yml @@ -0,0 +1,41 @@ +name: mzml_statistics +description: A module for mzML file statistics +keywords: + - mzML + - statistics +tools: + - custom: + description: | + A custom module for mzML file statistics. + homepage: https://github.com/bigbio/quantmsdiann + documentation: https://github.com/bigbio/quantmsdiann +input: + - mzml: + type: file + description: Spectra file in mzML format + pattern: "*.mzML" +output: + - ms_statistics: + type: file + description: MS-level statistics parquet file + pattern: "*_ms_info.parquet" + - ms2_statistics: + type: file + description: MS2-level statistics parquet file (optional) + pattern: "*_ms2_info.parquet" + optional: true + - feature_statistics: + type: file + description: Feature detection parquet file (optional) + pattern: "*_feature_info.parquet" + optional: true + - versions: + type: file + description: File containing software versions + pattern: "versions.yml" + - log: + type: file + description: Log file from mzML statistics computation + pattern: "*.log" +authors: + - "@wanghong" diff --git a/modules/local/utils/tdf2mzml/main.nf b/modules/local/utils/tdf2mzml/main.nf new file mode 100644 index 0000000..a242935 --- /dev/null +++ b/modules/local/utils/tdf2mzml/main.nf @@ -0,0 +1,38 @@ +process TDF2MZML { + tag "$meta.id" + label 'process_single' + label 'error_retry' + + container 'quay.io/bigbio/tdf2mzml:latest' // TODO: pin to a specific 
version tag for reproducibility + + input: + tuple val(meta), path(rawfile) + + output: + tuple val(meta), path("*.mzML"), emit: mzmls_converted + path "versions.yml", emit: versions + path "*.log", emit: log + + script: + def args = task.ext.args ?: '' + def prefix = task.ext.prefix ?: "${meta.id}" + + """ + echo "Converting..." | tee --append ${rawfile.baseName}_conversion.log + tdf2mzml.py -i *.d $args 2>&1 | tee --append ${rawfile.baseName}_conversion.log + + # Rename .mzml to .mzML via temp file to handle case-insensitive filesystems (e.g. macOS) + mv *.mzml __tmp_converted.mzML && mv __tmp_converted.mzML ${file(rawfile.baseName).baseName}.mzML + + # Rename .d directory only if the name differs (avoid 'same file' error) + target_d="${file(rawfile.baseName).baseName}.d" + if [ ! -d "\${target_d}" ]; then + mv *.d "\${target_d}" + fi + + cat <<-END_VERSIONS > versions.yml + "${task.process}": + tdf2mzml.py: \$(tdf2mzml.py --version) + END_VERSIONS + """ +} diff --git a/modules/local/utils/tdf2mzml/meta.yml b/modules/local/utils/tdf2mzml/meta.yml new file mode 100644 index 0000000..ebb90b8 --- /dev/null +++ b/modules/local/utils/tdf2mzml/meta.yml @@ -0,0 +1,42 @@ +name: tdf2mzml +description: convert raw bruker files to mzml files +keywords: + - raw + - mzML + - .d +tools: + - tdf2mzml: + description: | + It takes a bruker .d raw file as input and outputs indexed mzML + homepage: https://github.com/mafreitas/tdf2mzml + documentation: https://github.com/mafreitas/tdf2mzml +input: + - meta: + type: map + description: | + Groovy Map containing sample information + - rawfile: + type: file + description: | + Bruker .d raw directory + pattern: "*.d" +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. 
[ id:'MD5', enzyme:trypsin ] + - mzml: + type: file + description: indexed mzML + pattern: "*.mzML" + - log: + type: file + description: log file + pattern: "*.log" + - version: + type: file + description: File containing software version + pattern: "versions.yml" +authors: + - "@jspaezp" diff --git a/nextflow.config b/nextflow.config new file mode 100644 index 0000000..33c5c95 --- /dev/null +++ b/nextflow.config @@ -0,0 +1,381 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + bigbio/quantmsdiann Nextflow config file +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Default config options for all compute environments +---------------------------------------------------------------------------------------- +*/ + +// Global default params, used in configs +params { + + // Workflow flags + root_folder = null + local_input_type = 'mzML' + database = null + + // Input options and validation of sdrf files + input = null + use_ols_cache_only = true // Use only the OLS cache for ontology validation (no network requests) + + // Tools flags + protein_level_fdr_cutoff = 0.01 + + // Debug level + pp_debug = 0 + + //// Conversion and mzml statistics flags + reindex_mzml = true + mzml_statistics = false + mzml_features = false + + // shared search engine parameters + met_excision = true // Met-excision is enabled by default + allowed_missed_cleavages = 2 + precursor_mass_tolerance = 5 + precursor_mass_tolerance_unit = 'ppm' + variable_mods = 'Oxidation (M)' + fragment_mass_tolerance = 0.03 + fragment_mass_tolerance_unit = 'Da' + min_precursor_charge = 2 + max_precursor_charge = 4 + min_peptide_length = 6 + max_peptide_length = 40 + max_mods = 3 + min_pr_mz = 400 + max_pr_mz = 2400 + min_fr_mz = 100 + max_fr_mz = 1800 + + // DIA-NN: General + diann_version = '1.8.1' // Used to control version-dependent flags (e.g. 
--monitor-mod for 1.8.x) + diann_debug = 3 + diann_speclib = null + diann_extra_args = null + + // Optional outputs — control which intermediate files are published + save_speclib_tsv = false // Save the TSV spectral library from in-silico generation + + // DIA-NN: PTM site localization (--monitor-mod) + enable_mod_localization = false + // Comma-separated modification names, e.g. 'Phospho (S),Phospho (T),Phospho (Y)' + // or UniMod accessions, e.g. 'UniMod:21,UniMod:1' + mod_localization = 'Phospho (S),Phospho (T),Phospho (Y)' + + // DIA-NN: PRELIMINARY_ANALYSIS — calibration & mass accuracy + scan_window = 8 + scan_window_automatic = true + mass_acc_automatic = true + performance_mode = true // add '--min-corr 2 --corr-diff 1 --time-corr-only' + quick_mass_acc = true + diann_tims_sum = false // add '--quant-tims-sum' + diann_im_window = null // add '--im-window' + + // DIA-NN: ASSEMBLE_EMPIRICAL_LIBRARY — library assembly + skip_preliminary_analysis = false + random_preanalysis = false + random_preanalysis_seed = 42 + empirical_assembly_ms_n = 200 + + // DIA-NN: INDIVIDUAL_ANALYSIS + mass_acc_ms2 = 15 + mass_acc_ms1 = 15 + + // DIA-NN: FINAL_QUANTIFICATION — summarization & output + pg_level = 2 + species_genes = false + diann_normalize = true + diann_report_decoys = false + diann_export_xic = false + quantums = false + quantums_train_runs = null + quantums_sel_runs = null + quantums_params = null + diann_no_peptidoforms = false // add '--no-peptidoforms' + diann_use_quant = true // add '--use-quant' to FINAL_QUANTIFICATION + + // pmultiqc options + enable_pmultiqc = true + pmultiqc_idxml_skip = true + contaminant_string = 'CONT' + + // MultiQC options + multiqc_config = null + multiqc_title = null + multiqc_logo = null + skip_table_plots = false + max_multiqc_email_size = '25.MB' + multiqc_methods_description = null + + // Boilerplate options + outdir = './results' + publish_dir_mode = 'copy' + email = null + email_on_fail = null + plaintext_email = false + 
monochrome_logs = false + hook_url = System.getenv('HOOK_URL') + help = false + help_full = false + show_hidden = false + version = false + pipelines_testdata_base_path = 'https://raw.githubusercontent.com/nf-core/test-datasets/' + trace_report_suffix = new java.util.Date().format( 'yyyy-MM-dd_HH-mm-ss') + + // Config options + config_profile_name = null + custom_config_version = 'master' + custom_config_base = "https://raw.githubusercontent.com/nf-core/configs/${params.custom_config_version}" + config_profile_description = null + config_profile_contact = null + config_profile_url = null + + // Schema validation default options + validate_params = true +} + +// Load base.config by default for all pipelines +includeConfig 'conf/base.config' + +profiles { + debug { + dumpHashes = true + process.beforeScript = 'echo $HOSTNAME' + cleanup = false + nextflow.enable.configProcessNamesValidation = true + } + // Conda profiles removed - Conda is no longer supported + docker { + docker.enabled = true + conda.enabled = false + singularity.enabled = false + podman.enabled = false + shifter.enabled = false + charliecloud.enabled = false + apptainer.enabled = false + docker.runOptions = '-u $(id -u):$(id -g)' + } + arm64 { + process.arch = 'arm64' + // see discussion: https://github.com/nf-core/modules/issues/6694 + // For now if you're using arm64 you have to use wave for the sake of the maintainers + // wave profile + apptainer.ociAutoPull = true + singularity.ociAutoPull = true + wave.enabled = true + wave.freeze = true + wave.strategy = 'conda,container' + } + emulate_amd64 { + docker.runOptions = '-u $(id -u):$(id -g) --platform=linux/amd64' + } + singularity { + singularity.enabled = true + singularity.autoMounts = true + singularity.pullTimeout = '1 h' + conda.enabled = false + docker.enabled = false + podman.enabled = false + shifter.enabled = false + charliecloud.enabled = false + apptainer.enabled = false + } + podman { + podman.enabled = true + conda.enabled = false + 
docker.enabled = false + singularity.enabled = false + shifter.enabled = false + charliecloud.enabled = false + apptainer.enabled = false + } + shifter { + shifter.enabled = true + conda.enabled = false + docker.enabled = false + singularity.enabled = false + podman.enabled = false + charliecloud.enabled = false + apptainer.enabled = false + } + charliecloud { + charliecloud.enabled = true + conda.enabled = false + docker.enabled = false + singularity.enabled = false + podman.enabled = false + shifter.enabled = false + apptainer.enabled = false + } + apptainer { + apptainer.enabled = true + apptainer.autoMounts = true + conda.enabled = false + docker.enabled = false + singularity.enabled = false + podman.enabled = false + shifter.enabled = false + charliecloud.enabled = false + } + wave { + apptainer.ociAutoPull = true + singularity.ociAutoPull = true + wave.enabled = true + wave.freeze = true + wave.strategy = 'conda,container' + } + gpu { + docker.runOptions = '-u $(id -u):$(id -g) --gpus all' + apptainer.runOptions = '--nv' + singularity.runOptions = '--nv' + } + // Micromamba profile removed - Conda is no longer supported + test_dia { includeConfig 'conf/tests/test_dia.config' } + test_dia_dotd { includeConfig 'conf/tests/test_dia_dotd.config' } + test_dia_quantums { includeConfig 'conf/tests/test_dia_quantums.config' } + test_dia_parquet { includeConfig 'conf/tests/test_dia_parquet.config' } + test_dia_2_2_0 { includeConfig 'conf/tests/test_dia_2_2_0.config' } + test_latest_dia { includeConfig 'conf/tests/test_latest_dia.config' } + test_full_dia { includeConfig 'conf/tests/test_full_dia.config' } + // DIA-NN version overrides (used by merge_ci.yml matrix) + diann_v1_8_1 { includeConfig 'conf/diann_versions/v1_8_1.config' } + diann_v2_1_0 { includeConfig 'conf/diann_versions/v2_1_0.config' } + diann_v2_2_0 { includeConfig 'conf/diann_versions/v2_2_0.config' } + dev { includeConfig 'conf/dev.config' } + pride_slurm { includeConfig 
'conf/pride_codon_slurm.config' } + manual_wave { includeConfig 'conf/wave.config' } + verbose_modules { includeConfig 'conf/modules/verbose_modules.config' } + // mambaci { includeConfig 'conf/mambaci.config' } +} + +// Load nf-core custom profiles from different institutions + +// If params.custom_config_base is set AND either the NXF_OFFLINE environment variable is not set or params.custom_config_base is a local path, the nfcore_custom.config file from the specified base path is included. +// Load bigbio/quantmsdiann custom profiles from different institutions. +includeConfig params.custom_config_base && (!System.getenv('NXF_OFFLINE') || !params.custom_config_base.startsWith('http')) ? "${params.custom_config_base}/nfcore_custom.config" : "/dev/null" + +// Load pipeline-specific custom profiles (reuse quantms institutional configs) +includeConfig params.custom_config_base && (!System.getenv('NXF_OFFLINE') || !params.custom_config_base.startsWith('http')) ? "${params.custom_config_base}/pipeline/quantms.config" : "/dev/null" + +// Set default registry for Apptainer, Docker, Podman, Charliecloud and Singularity independent of -profile +// Will not be used unless Apptainer / Docker / Podman / Charliecloud / Singularity are enabled +// Set to your registry if you have a mirror of containers +apptainer.registry = 'quay.io' +docker.registry = 'quay.io' +podman.registry = 'quay.io' +singularity.registry = 'quay.io' +charliecloud.registry = 'quay.io' + +// Export these variables to prevent local Python/R libraries from conflicting with those in the container +// The JULIA depot path has been adjusted to a fixed path `/usr/local/share/julia` that needs to be used for packages in the container. +// See https://apeltzer.github.io/post/03-julia-lang-nextflow/ for details on that. Once we have a common agreement on where to keep Julia packages, this is adjustable. 
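The `process.shell` list defined just below configures bash strict mode for every task. As an illustrative sketch (plain bash, not part of this diff) of what those flags do:

```shell
#!/usr/bin/env bash
# Sketch of the strict-mode flags quantmsdiann sets via process.shell:
# -C : refuse to overwrite existing files with '>'
# -e : exit as soon as a command fails
# -u : treat expansion of unset variables as an error
# -o pipefail : a pipeline fails if ANY stage fails, not just the last one
demo() (
    set -Ceuo pipefail
    false | true    # status 1 under pipefail; default bash would report 0
)
if demo; then
    status=ok
else
    status=fail
fi
echo "pipeline status: $status"    # prints: pipeline status: fail
```

Without `pipefail`, a failing tool piped into `tee` (a pattern used throughout these modules) would silently succeed; the combination above is what keeps those `2>&1 | tee ...` invocations safe.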
+ +env { + PYTHONNOUSERSITE = 1 + R_PROFILE_USER = "/.Rprofile" + R_ENVIRON_USER = "/.Renviron" + JULIA_DEPOT_PATH = "/usr/local/share/julia" +} + +// Set bash options +process.shell = [ + "bash", + "-C", // No clobber - prevent output redirection from overwriting files. + "-e", // Exit if a tool returns a non-zero status/exit code + "-u", // Treat unset variables and parameters as an error + "-o", // Returns the status of the last command to exit.. + "pipefail" // ..with a non-zero status or zero if all successfully execute +] + +// Disable process selector warnings by default. Use debug profile to enable warnings. +nextflow.enable.configProcessNamesValidation = false + +timeline { + enabled = true + file = "${params.outdir}/pipeline_info/execution_timeline_${params.trace_report_suffix}.html" +} +report { + enabled = true + file = "${params.outdir}/pipeline_info/execution_report_${params.trace_report_suffix}.html" +} +trace { + enabled = true + file = "${params.outdir}/pipeline_info/execution_trace_${params.trace_report_suffix}.txt" +} +dag { + enabled = true + file = "${params.outdir}/pipeline_info/pipeline_dag_${params.trace_report_suffix}.html" +} + +manifest { + name = 'bigbio/quantmsdiann' + homePage = 'https://github.com/bigbio/quantmsdiann' + contributors = [ + [ + name: 'Yasset Perez-Riverol', + affiliation: 'European Bioinformatics Institute (EMBL-EBI), Cambridge, UK', + email: 'ypriverol@gmail.com', + github: 'ypriverol', + contribution: ['maintainer', 'author'], // List of contribution types ('author', 'maintainer' or 'contributor') + orcid: '0000-0001-6579-6941' + ], + [ + name: 'Dai Chengxin', + affiliation: 'State Key Laboratory of Medical Proteomics, Beijing Proteome Research Center, Beijing, China', + email: 'daichengxin999@gmail.com', + github: 'daichengxin', + contribution: ['author', 'maintainer'], // List of contribution types ('author', 'maintainer' or 'contributor') + orcid: '0000-0001-6943-5211' + ], + [ + name: 'Julianus Pfeuffer', + 
affiliation: 'Algorithmic Bioinformatics, Freie Universität Berlin, Berlin, Germany', + email: 'jule.pf@gmail.com', + github: 'jpfeuffer', + contribution: ['author', 'maintainer'], // List of contribution types ('author', 'maintainer' or 'contributor') + orcid: '0000-0001-8948-9209' + ], + [ + name: 'Dongze He', + affiliation: 'Altos Labs, Inc.', + email: 'dongzehe.zaza@gmail.com', + github: 'DongzeHe', + contribution: ['contributor'], // List of contribution types ('author', 'maintainer' or 'contributor') + orcid: '0000-0001-8259-7434' + ], + [ + name: 'Henry Webel', + affiliation: 'DTU biosustain, Technical University of Denmark, Lyngby, Denmark', + email: 'heweb@dtu.dk', + github: 'enryh', + contribution: ['contributor'], // List of contribution types ('author', 'maintainer' or 'contributor') + orcid: '0000-0001-8833-7617' + ], + [ + name: 'Fabian Egli', + github: 'fabianegli', + contribution: ['contributor'], // List of contribution types ('author', 'maintainer' or 'contributor') + orcid: '0000-0001-5294-401X' + ] + ] + description = """DIA-NN quantitative mass spectrometry workflow built following nf-core guidelines""" + mainScript = 'main.nf' + defaultBranch = 'main' + nextflowVersion = '!>=25.04.0' + version = '1.0.0' + doi = '10.5281/zenodo.15573386' +} + +// Nextflow plugins +plugins { + id 'nf-schema@2.5.1' // Validation of pipeline parameters and creation of an input channel from a sample sheet +} + +validation { + defaultIgnoreParams = ["genomes"] + monochromeLogs = params.monochrome_logs +} + +// Load per-workflow module configs for DSL2 module specific options +includeConfig 'conf/modules/shared.config' +includeConfig 'conf/modules/dia.config' diff --git a/nextflow_schema.json b/nextflow_schema.json new file mode 100644 index 0000000..8708cf7 --- /dev/null +++ b/nextflow_schema.json @@ -0,0 +1,693 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "$id": 
"https://raw.githubusercontent.com/bigbio/quantmsdiann/main/nextflow_schema.json", + "title": "bigbio/quantmsdiann pipeline parameters", + "description": "DIA-NN quantitative mass spectrometry workflow built following nf-core guidelines", + "type": "object", + "$defs": { + "input_output_options": { + "title": "Input/output options", + "type": "object", + "fa_icon": "fas fa-terminal", + "description": "Define where the pipeline should find input data and save output data.", + "required": ["input", "outdir"], + "properties": { + "input": { + "type": "string", + "format": "file-path", + "exists": true, + "mimetype": "text/csv", + "pattern": "^\\S+\\.(?:csv|tsv|sdrf)$", + "description": "URI/path to an SDRF file in SDRF format with .sdrf, .tsv, or .csv extension. For more info see help text or docs.", + "help_text": "Input is specified by using a path or URI to a PRIDE Sample to Data Relation Format file (SDRF), e.g. as part of a submitted and annotated PRIDE experiment (see [here](https://github.com/bigbio/proteomics-metadata-standard/tree/master/annotated-projects) for examples). Input files will be downloaded and cached from the URIs specified in the SDRF file.\n\nThe SDRF file can have .sdrf, .tsv, or .csv extensions. 
An OpenMS-style experimental design will be generated based on the factor columns of the SDRF.\n\nThe following parameters are read **exclusively** from the SDRF file (required columns):\n\n * `acquisition_method` (from 'Proteomics Data Acquisition Method' column),\n * `labelling_type` (from 'Label' column),\n * `enzyme` (from 'Enzyme' column),\n * `fixed_mods` (from 'FixedModifications' column)\n\nThe following parameters are read from SDRF but can be **overridden via command line**:\n\n * `precursor_mass_tolerance` (from 'PrecursorMassTolerance' column),\n * `precursor_mass_tolerance_unit` (from 'PrecursorMassToleranceUnit' column),\n * `fragment_mass_tolerance` (from 'FragmentMassTolerance' column),\n * `fragment_mass_tolerance_unit` (from 'FragmentMassToleranceUnit' column),\n * `variable_mods` (from 'VariableModifications' column)", + "fa_icon": "fas fa-file-csv" + }, + "outdir": { + "type": "string", + "description": "The output directory where the results will be saved.", + "default": "./results", + "fa_icon": "fas fa-folder-open", + "format": "directory-path" + }, + "email": { + "type": "string", + "description": "Email address for completion summary.", + "fa_icon": "fas fa-envelope", + "help_text": "Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (`~/.nextflow/config`) then you don't need to specify this on the command line for every run.", + "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$" + }, + "multiqc_title": { + "type": "string", + "description": "MultiQC report title. 
Printed as page header, used for filename if not otherwise specified.", + "fa_icon": "fas fa-file-signature" + }, + "root_folder": { + "type": "string", + "description": "Root folder in which the spectrum files specified in the SDRF are searched", + "fa_icon": "fas fa-folder", + "help_text": "This optional parameter can be used to specify a root folder in which the spectrum files specified in the SDRF are searched.\nIt is usually used if you have a local version of the experiment already. Note that this option does not support recursive\nsearching yet." + }, + "local_input_type": { + "type": "string", + "description": "Overwrite the file type/extension of the filename as specified in the SDRF", + "fa_icon": "fas fa-file-invoice", + "default": "mzML", + "help_text": "If the above [`--root_folder`](#root_folder) was given to load local input files, this overwrites the file type/extension of\nthe filename as specified in the SDRF. Usually used in case you have an mzML-converted version of the files already. Needs to be\none of 'mzML', 'raw', 'd', or 'dia' (the letter cases should match your files exactly). Compressed variants (.gz, .tar, .tar.gz, .zip) are supported for 'mzML', 'raw', and 'd' formats." + } + } + }, + "sdrf_validation": { + "title": "SDRF validation", + "type": "object", + "description": "Settings for validating the input SDRF file.", + "default": "", + "properties": { + "use_ols_cache_only": { + "type": "boolean", + "description": "Use cached version of the Ontology Lookup Service (OLS).", + "fa_icon": "far fa-check-square", + "help_text": "Use only the cached version of the Ontology Lookup Service (OLS) for ontology term validation. This is useful if you don't want the pipeline to query internet services.", + "default": true + } + } + }, + "protein_database": { + "title": "Protein database", + "type": "object", + "description": "Settings that relate to the mandatory protein database and the optional generation of decoy entries. 
Note: Decoys for DIA will be created internally.",
+      "default": "",
+      "properties": {
+        "database": {
+          "type": "string",
+          "format": "file-path",
+          "exists": true,
+          "mimetype": "text/fasta",
+          "pattern": "^\\S+\\.(?:fasta|fa)$",
+          "description": "The `fasta` protein database used during database search. *Note:* For DIA data, it must not contain decoys.",
+          "fa_icon": "fas fa-file",
+          "help_text": "Since the database is not included in an SDRF, this parameter always needs to be given when you run the pipeline. Remember to include contaminants, but do not add decoys: DIA-NN generates decoys internally.\n\n```bash\n--database '[path to fasta protein database]'\n```"
+        }
+      },
+      "fa_icon": "fas fa-database",
+      "required": ["database"]
+    },
+    "spectrum_preprocessing": {
+      "title": "Spectrum preprocessing",
+      "type": "object",
+      "description": "Configure file conversion and preprocessing options.",
+      "default": "",
+      "properties": {
+        "reindex_mzml": {
+          "type": "boolean",
+          "default": true,
+          "description": "Force initial re-indexing of input mzML files. Also fixes some common mistakes in slightly incomplete/outdated mzMLs. (Default: true for safety)",
+          "fa_icon": "far fa-check-square",
+          "help_text": "Force re-indexing in the beginning of the pipeline to make sure that indices are up-to-date."
+        },
+        "mzml_statistics": {
+          "type": "boolean",
+          "description": "Compute MS1/MS2 statistics from mzML files; Bruker .d files are always skipped.",
+          "help_text": "Enable to generate *_ms_info.parquet statistics for QC reporting. 
Only available for mzML files; Bruker .d files are always excluded.",
+          "fa_icon": "far fa-check-square"
+        },
+        "mzml_features": {
+          "type": "boolean",
+          "description": "Detect MS1-level features during the mzML statistics step and write them to a Parquet file. Only available for mzML files.",
+          "help_text": "Detect MS1-level features during the mzML statistics step and write them to a *_feature_info.parquet file. Only available for mzML files.",
+          "fa_icon": "far fa-check-square"
+        }
+      },
+      "fa_icon": "far fa-chart-bar"
+    },
+    "database_search": {
+      "title": "Database search",
+      "type": "object",
+      "description": "",
+      "default": "",
+      "properties": {
+        "met_excision": {
+          "type": "boolean",
+          "description": "Account for N-terminal methionine excision during database search, a common co-translational modification where the initial methionine is enzymatically removed from proteins.",
+          "default": true,
+          "fa_icon": "far fa-check-square",
+          "help_text": "Database searches account for N-terminal methionine excision."
+        },
+        "allowed_missed_cleavages": {
+          "type": "integer",
+          "description": "Specify the maximum number of allowed missed enzyme cleavages in a peptide. The parameter is not applied if `unspecific cleavage` is specified as the enzyme.",
+          "default": 2,
+          "fa_icon": "fas fa-sliders-h"
+        },
+        "precursor_mass_tolerance": {
+          "type": "integer",
+          "description": "Precursor mass tolerance used for database search. For high-resolution instruments a precursor mass tolerance value of 5 ppm is recommended (i.e. 5). See also [`--precursor_mass_tolerance_unit`](#precursor_mass_tolerance_unit).",
+          "default": 5,
+          "fa_icon": "fas fa-sliders-h",
+          "help_text": "This value can be overridden via command line. If not specified, the value from the SDRF file will be used."
+        },
+        "precursor_mass_tolerance_unit": {
+          "type": "string",
+          "description": "Precursor mass tolerance unit used for database search. 
Possible values are 'ppm' (default) and 'Da'.", + "default": "ppm", + "fa_icon": "fas fa-sliders-h", + "enum": ["Da", "ppm"], + "help_text": "This value can be overridden via command line. If not specified, the value from the SDRF file will be used." + }, + "fragment_mass_tolerance": { + "type": "number", + "description": "Fragment mass tolerance used for database search. The default of 0.03 Da is for high-resolution instruments.", + "default": 0.03, + "fa_icon": "fas fa-sliders-h", + "help_text": "This value can be overridden via command line. If not specified, the value from the SDRF file will be used." + }, + "fragment_mass_tolerance_unit": { + "type": "string", + "description": "Fragment mass tolerance unit used for database search. Possible values are 'ppm' and 'Da' (default).", + "default": "Da", + "fa_icon": "fas fa-list-ol", + "help_text": "This value can be overridden via command line. If not specified, the value from the SDRF file will be used.", + "enum": ["Da", "ppm"] + }, + "variable_mods": { + "type": "string", + "description": "A comma-separated list of variable modifications (by Unimod name) to search for during the database search.", + "default": "Oxidation (M)", + "fa_icon": "fas fa-tasks", + "help_text": "Specify which variable modifications should be applied to the database search (e.g. 'Oxidation (M)'). Use Unimod names in the style '{unimod name} ({optional term specificity} {optional origin})'. Multiple variable modifications can be specified comma separated (e.g. 'Carbamidomethyl (C),Oxidation (M)'). This value can be overridden via command line. If not specified, the value from the SDRF file will be used." + }, + "min_precursor_charge": { + "type": "integer", + "description": "Minimum precursor ion charge. Omit the '+'.", + "default": 2, + "fa_icon": "fas fa-sliders-h" + }, + "max_precursor_charge": { + "type": "integer", + "description": "Maximum precursor ion charge. 
Omit the '+'.", + "default": 4, + "fa_icon": "fas fa-sliders-h" + }, + "min_peptide_length": { + "type": "integer", + "description": "Minimum peptide length to consider", + "default": 6, + "fa_icon": "fas fa-sliders-h" + }, + "max_peptide_length": { + "type": "integer", + "description": "Maximum peptide length to consider", + "default": 40, + "fa_icon": "fas fa-sliders-h" + }, + "max_mods": { + "type": "integer", + "description": "Maximum number of modifications per peptide. If this value is large, the search may take a very long time.", + "default": 3, + "fa_icon": "fas fa-sliders-h" + }, + "min_pr_mz": { + "type": "number", + "description": "The minimum precursor m/z for the in silico library generation or library-free search", + "fa_icon": "fas fa-filter", + "default": 400 + }, + "max_pr_mz": { + "type": "number", + "description": "The maximum precursor m/z for the in silico library generation or library-free search", + "fa_icon": "fas fa-filter", + "default": 2400 + }, + "min_fr_mz": { + "type": "number", + "description": "The minimum fragment m/z for the in silico library generation or library-free search", + "fa_icon": "fas fa-filter", + "default": 100 + }, + "max_fr_mz": { + "type": "number", + "description": "The maximum fragment m/z for the in silico library generation or library-free search", + "fa_icon": "fas fa-filter", + "default": 1800 + } + }, + "fa_icon": "fas fa-search" + }, + "psm_re_scoring_general": { + "title": "PSM re-scoring (general)", + "type": "object", + "description": "Choose between different rescoring/posterior probability calculation methods and set them up.", + "default": "", + "properties": { + "pp_debug": { + "type": "integer", + "description": "Debug level when running the re-scoring. 
Logs become more verbose and at '>5' temporary files are kept.", + "fa_icon": "fas fa-bug", + "default": 0, + "hidden": true + } + }, + "fa_icon": "fas fa-star-half-alt" + }, + "protein_inference": { + "title": "Protein inference", + "type": "object", + "description": "Group proteins, calculate scores at the protein (group) level, and potentially modify peptide-to-protein associations.", + "default": "", + "properties": { + "protein_level_fdr_cutoff": { + "type": "number", + "description": "The experiment-wide protein (group)-level FDR cutoff. Default: 0.01", + "default": 0.01, + "fa_icon": "fas fa-filter", + "help_text": "This can be protein level if 'strictly_unique_peptides' are used for protein quantification. See [`--protein_quant`](#protein_quant)" + } + }, + "fa_icon": "fab fa-hubspot" + }, + "DIA-NN": { + "title": "DIA-NN", + "type": "object", + "description": "Settings for DIA-NN, universal software for data-independent acquisition (DIA) proteomics data processing.", + "default": "", + "properties": { + "diann_version": { + "type": "string", + "description": "Specify the DIA-NN version to be used in the workflow.", + "fa_icon": "fas fa-tag" + }, + "enable_mod_localization": { + "type": "boolean", + "description": "Enable or disable modification localization scoring in DIA-NN.", + "fa_icon": "fas fa-map-marker-alt" + }, + "mod_localization": { + "type": "string", + "description": "Specify the modification localization parameters for DIA-NN.", + "fa_icon": "fas fa-cogs" + }, + "mass_acc_automatic": { + "type": "boolean", + "default": true, + "description": "Choose the MS2 mass accuracy setting automatically.", + "fa_icon": "fas fa-toggle-on" + }, + "scan_window_automatic": { + "type": "boolean", + "description": "Choose the scan window setting automatically.", + "default": true, + "fa_icon": "fas fa-toggle-on" + }, + "scan_window": { + "type": "integer", + "description": "Set the scan window radius to a specific value", + "fa_icon": "fas 
fa-filter", + "help_text": "Ideally, this should be approximately equal to the average number of data points per peak.", + "default": 8 + }, + "mass_acc_ms2": { + "type": "number", + "description": "Set the MS2 mass accuracy (tolerance) to a specific value in ppm.", + "fa_icon": "fas fa-bullseye", + "help_text": "If specified, this overrides the automatic calibration. Corresponds to the --mass-acc parameter in DIA-NN.", + "default": 15 + }, + "mass_acc_ms1": { + "type": "number", + "description": "Set the MS1 mass accuracy (tolerance) to a specific value in ppm.", + "fa_icon": "fas fa-bullseye", + "help_text": "If specified, this overrides the automatic calibration. Corresponds to the --mass-acc-ms1 parameter in DIA-NN.", + "default": 15 + }, + "performance_mode": { + "type": "boolean", + "description": "Enable Low RAM & High Speed mode for DIA-NN; sets the min-corr, corr-diff, and time-corr-only parameters.", + "fa_icon": "far fa-check-square", + "default": true + }, + "quick_mass_acc": { + "type": "boolean", + "description": "When choosing the MS2 mass accuracy setting automatically, DIA-NN will use a fast heuristic algorithm instead of optimising the number of IDs.", + "fa_icon": "far fa-check-square", + "default": true + }, + "pg_level": { + "type": "integer", + "description": "Controls the protein inference mode", + "fa_icon": "fas fa-list-ol", + "enum": [0, 1, 2], + "default": 2 + }, + "species_genes": { + "type": "boolean", + "description": "Instructs DIA-NN to add the organism identifier to the gene names", + "fa_icon": "far fa-check-square" + }, + "diann_speclib": { + "type": "string", + "description": "The spectral library to use for DIA-NN", + "fa_icon": "fas fa-file", + "help_text": "If passed, will use that spectral library to carry out the DIA-NN search, instead of predicting one from the fasta file.", + "hidden": false + }, + "diann_report_decoys": { + "type": "boolean", + "description": "Save decoy PSMs to the main .parquet report (DIA-NN 2.0 and later).", + 
"fa_icon": "far fa-check-square", + "hidden": false + }, + "diann_export_xic": { + "type": "boolean", + "description": "Instructs DIA-NN to extract MS1/fragment chromatograms for identified precursors within X seconds of the elution apex, with X set to 10 s if not provided. Equivalent to the 'XICs' option in the GUI.", + "fa_icon": "far fa-check-square", + "hidden": false + }, + "quantums": { + "type": "boolean", + "description": "Enable QuantUMS quantification (DIA-NN 1.9 and later). If false, legacy DIA-NN quantification algorithms are used instead (DIA-NN parameter --direct-quant).", + "fa_icon": "far fa-check-square", + "hidden": false + }, + "quantums_train_runs": { + "type": "string", + "description": "Run index range for QuantUMS training (e.g., '0:5').", + "fa_icon": "fas fa-sliders-h", + "help_text": "Sets the DIA-NN parameter '--quant-train-runs'. Format: '[N1]:[N2]'. Instructs QuantUMS to train its parameters on runs with indices in the range N1 to N2 (inclusive).", + "hidden": false + }, + "quantums_sel_runs": { + "type": "integer", + "description": "Number of automatically selected runs for QuantUMS training.", + "fa_icon": "fas fa-sliders-h", + "help_text": "Sets the DIA-NN parameter --quant-sel-runs. Format: [N]. QuantUMS will train its parameters on N automatically selected runs to speed up training in large experiments. N must be 6 or greater.", + "hidden": false + }, + "quantums_params": { + "type": "string", + "description": "Pre-calculated QuantUMS parameters.", + "fa_icon": "fas fa-sliders-h", + "help_text": "Sets the DIA-NN parameter --quant-params. 
Provide previously obtained QuantUMS parameters to use instead of training new ones.", + "hidden": false + }, + "diann_no_peptidoforms": { + "type": "boolean", + "description": "Enable no-peptidoforms mode: disables automatic peptidoform scoring when variable modifications are declared (not recommended by DIA-NN).", + "fa_icon": "far fa-check-square", + "default": false + }, + "diann_tims_sum": { + "type": "boolean", + "description": "Enable '--quant-tims-sum' for slice/scanning timsTOF methods (highly recommended for Synchro-PASEF).", + "fa_icon": "far fa-check-square", + "default": false + }, + "diann_im_window": { + "type": "number", + "description": "Set '--im-window' to ensure the IM extraction window is not smaller than the specified value.", + "fa_icon": "fas fa-filter" + }, + "diann_use_quant": { + "type": "boolean", + "description": "Set '--use-quant' to reuse existing .quant files if available.", + "fa_icon": "far fa-check-square", + "default": true + }, + "skip_preliminary_analysis": { + "type": "boolean", + "description": "Skip the preliminary analysis step and use the provided spectral library as-is instead of generating a local consensus library.", + "fa_icon": "fas fa-forward", + "hidden": false + }, + "diann_debug": { + "type": "integer", + "description": "Debug level", + "default": 3, + "fa_icon": "fas fa-bug", + "enum": [0, 1, 2, 3, 4], + "hidden": true + }, + "diann_normalize": { + "type": "boolean", + "description": "Enable DIA-NN's cross-run normalization.", + "default": true, + "fa_icon": "far fa-check-square" + }, + "random_preanalysis": { + "type": "boolean", + "description": "Enable random selection of spectrum files to generate the empirical library.", + "fa_icon": "far fa-check-square" + }, + "random_preanalysis_seed": { + "type": "integer", + "description": "Set the random seed for the random selection of spectrum files to generate the empirical library.", + "default": 42, + "fa_icon": "fas fa-filter" + }, + 
"empirical_assembly_ms_n": { + "type": "integer", + "description": "The number of randomly selected spectrum files.", + "default": 200, + "fa_icon": "fas fa-filter", + "hidden": true + }, + "diann_extra_args": { + "type": "string", + "description": "Extra arguments appended to all DIA-NN steps. Flags incompatible with specific steps are automatically stripped with a warning.", + "fa_icon": "fas fa-terminal", + "hidden": false, + "help_text": "Pass additional DIA-NN command-line arguments that will be appended to all DIA-NN steps (INSILICO_LIBRARY_GENERATION, PRELIMINARY_ANALYSIS, ASSEMBLE_EMPIRICAL_LIBRARY, INDIVIDUAL_ANALYSIS, FINAL_QUANTIFICATION). Flags that conflict with a specific step are automatically stripped with a warning. For step-specific overrides, use custom Nextflow config files with ext.args." + }, + "save_speclib_tsv": { + "type": "boolean", + "default": false, + "description": "Save the TSV spectral library from the in-silico library generation step.", + "fa_icon": "fas fa-save", + "help_text": "When enabled, the human-readable TSV version of the spectral library produced by DIA-NN during the INSILICO_LIBRARY_GENERATION step is published to the output directory under `library_generation/`. By default this file is discarded as an intermediate." + } + }, + "fa_icon": "fas fa-braille" + }, + "quality_control": { + "title": "Quality control", + "type": "object", + "description": "", + "default": "", + "properties": { + "enable_pmultiqc": { + "type": "boolean", + "description": "Enable generation of the pmultiqc report. Default: true", + "fa_icon": "fas fa-toggle-on", + "default": true + }, + "pmultiqc_idxml_skip": { + "type": "boolean", + "description": "Skip idXML files (do not generate search engine scores) in the pmultiqc report. Default: true", + "fa_icon": "fas fa-toggle-on", + "default": true + }, + "contaminant_string": { + "type": "string", + "description": "Contaminant affix string for the pmultiqc report. 
This parameter maps to --contaminant_affix in pmultiqc. default: 'CONT'", + "default": "CONT" + } + }, + "fa_icon": "fas fa-file-medical-alt" + }, + "institutional_config_options": { + "title": "Institutional config options", + "type": "object", + "fa_icon": "fas fa-university", + "description": "Parameters used to describe centralised config profiles. These should not be edited.", + "help_text": "The centralised nf-core configuration profiles use a handful of pipeline parameters to describe themselves. This information is then printed to the Nextflow log when you run a pipeline. You should not need to change these values when you run a pipeline.", + "properties": { + "custom_config_version": { + "type": "string", + "description": "Git commit id for Institutional configs.", + "default": "master", + "hidden": true, + "fa_icon": "fas fa-users-cog" + }, + "custom_config_base": { + "type": "string", + "description": "Base directory for Institutional configs.", + "default": "https://raw.githubusercontent.com/nf-core/configs/master", + "hidden": true, + "help_text": "If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. 
If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.", + "fa_icon": "fas fa-users-cog" + }, + "config_profile_name": { + "type": "string", + "description": "Institutional config name.", + "hidden": true, + "fa_icon": "fas fa-users-cog" + }, + "config_profile_description": { + "type": "string", + "description": "Institutional config description.", + "hidden": true, + "fa_icon": "fas fa-users-cog" + }, + "config_profile_contact": { + "type": "string", + "description": "Institutional config contact information.", + "hidden": true, + "fa_icon": "fas fa-users-cog" + }, + "config_profile_url": { + "type": "string", + "description": "Institutional config URL link.", + "hidden": true, + "fa_icon": "fas fa-users-cog" + } + } + }, + "generic_options": { + "title": "Generic options", + "type": "object", + "fa_icon": "fas fa-file-import", + "description": "Less common options for the pipeline, typically set in a config file.", + "help_text": "These options are common to all nf-core pipelines and allow you to customise some of the core preferences for how the pipeline runs.\n\nTypically these options would be set in a Nextflow config file loaded for all pipeline runs, such as `~/.nextflow/config`.", + "properties": { + "version": { + "type": "boolean", + "description": "Display version and exit.", + "fa_icon": "fas fa-question-circle", + "hidden": true + }, + "publish_dir_mode": { + "type": "string", + "default": "copy", + "description": "Method used to save pipeline results to output directory.", + "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. 
See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.", + "fa_icon": "fas fa-copy", + "enum": ["symlink", "rellink", "link", "copy", "copyNoFollow", "move"], + "hidden": true + }, + "email_on_fail": { + "type": "string", + "description": "Email address for completion summary, only when pipeline fails.", + "fa_icon": "fas fa-exclamation-triangle", + "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$", + "help_text": "An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully.", + "hidden": true + }, + "plaintext_email": { + "type": "boolean", + "description": "Send plain-text email instead of HTML.", + "fa_icon": "fas fa-remove-format", + "hidden": true + }, + "max_multiqc_email_size": { + "type": "string", + "description": "File size limit when attaching MultiQC reports to summary emails.", + "pattern": "^\\d+(\\.\\d+)?\\.?\\s*(K|M|G|T)?B$", + "default": "25.MB", + "fa_icon": "fas fa-file-upload", + "hidden": true + }, + "monochrome_logs": { + "type": "boolean", + "description": "Do not use coloured log outputs.", + "fa_icon": "fas fa-palette", + "hidden": true + }, + "hook_url": { + "type": "string", + "description": "Incoming hook URL for messaging service", + "fa_icon": "fas fa-people-group", + "help_text": "Incoming hook URL for messaging service. Currently, MS Teams and Slack are supported.", + "hidden": true + }, + "multiqc_config": { + "type": "string", + "format": "file-path", + "description": "Custom config file to supply to MultiQC.", + "fa_icon": "fas fa-cog", + "hidden": true + }, + "skip_table_plots": { + "type": "boolean", + "description": "Skip protein/peptide table plots with pmultiqc for large dataset.", + "fa_icon": "fas fa-toggle-on" + }, + "multiqc_logo": { + "type": "string", + "description": "Custom logo file to supply to MultiQC. 
File name must also be set in the MultiQC config file", + "fa_icon": "fas fa-image", + "hidden": true + }, + "multiqc_methods_description": { + "type": "string", + "description": "Custom MultiQC yaml file containing HTML including a methods description.", + "fa_icon": "fas fa-cog" + }, + "validate_params": { + "type": "boolean", + "description": "Boolean whether to validate parameters against the schema at runtime", + "default": true, + "fa_icon": "fas fa-check-square", + "hidden": true + }, + "pipelines_testdata_base_path": { + "type": "string", + "fa_icon": "far fa-check-circle", + "description": "Base URL or local path to location of pipeline test dataset files", + "default": "https://raw.githubusercontent.com/nf-core/test-datasets/", + "hidden": true + }, + "trace_report_suffix": { + "type": "string", + "fa_icon": "far fa-calendar", + "description": "Suffix to add to the trace report filename. Default is the date and time in the format yyyy-MM-dd_HH-mm-ss.", + "hidden": true + }, + "help": { + "type": ["boolean", "string"], + "description": "Display the help message." + }, + "help_full": { + "type": "boolean", + "description": "Display the full detailed help message." + }, + "show_hidden": { + "type": "boolean", + "description": "Display hidden parameters in the help message (only works when --help or --help_full are provided)." 
+ } + } + } + }, + "allOf": [ + { + "$ref": "#/$defs/input_output_options" + }, + { + "$ref": "#/$defs/protein_database" + }, + { + "$ref": "#/$defs/sdrf_validation" + }, + { + "$ref": "#/$defs/spectrum_preprocessing" + }, + { + "$ref": "#/$defs/database_search" + }, + { + "$ref": "#/$defs/psm_re_scoring_general" + }, + { + "$ref": "#/$defs/protein_inference" + }, + { + "$ref": "#/$defs/DIA-NN" + }, + { + "$ref": "#/$defs/quality_control" + }, + { + "$ref": "#/$defs/institutional_config_options" + }, + { + "$ref": "#/$defs/generic_options" + } + ] +} diff --git a/nf-test.config b/nf-test.config new file mode 100644 index 0000000..3a1fff5 --- /dev/null +++ b/nf-test.config @@ -0,0 +1,24 @@ +config { + // location for all nf-test tests + testsDir "." + + // nf-test directory including temporary files for each test + workDir System.getenv("NFT_WORKDIR") ?: ".nf-test" + + // location of an optional nextflow.config file specific for executing tests + configFile "tests/nextflow.config" + + // ignore tests coming from the nf-core/modules repo + ignore 'modules/nf-core/**/tests/*', 'subworkflows/nf-core/**/tests/*' + + // run all tests with the defined profile(s) from the main nextflow.config + profile "test" + + // list of filenames or patterns that should trigger a full test run + triggers 'nextflow.config', 'nf-test.config', 'conf/test.config', 'tests/nextflow.config', 'tests/.nftignore' + + // load the necessary plugins + plugins { + load "nft-utils@0.0.3" + } +} diff --git a/ro-crate-metadata.json b/ro-crate-metadata.json new file mode 100644 index 0000000..0aa48af --- /dev/null +++ b/ro-crate-metadata.json @@ -0,0 +1,341 @@ +{ + "@context": [ + "https://w3id.org/ro/crate/1.1/context", + { + "GithubService": "https://w3id.org/ro/terms/test#GithubService", + "JenkinsService": "https://w3id.org/ro/terms/test#JenkinsService", + "PlanemoEngine": "https://w3id.org/ro/terms/test#PlanemoEngine", + "TestDefinition": "https://w3id.org/ro/terms/test#TestDefinition", + 
"TestInstance": "https://w3id.org/ro/terms/test#TestInstance", + "TestService": "https://w3id.org/ro/terms/test#TestService", + "TestSuite": "https://w3id.org/ro/terms/test#TestSuite", + "TravisService": "https://w3id.org/ro/terms/test#TravisService", + "definition": "https://w3id.org/ro/terms/test#definition", + "engineVersion": "https://w3id.org/ro/terms/test#engineVersion", + "instance": "https://w3id.org/ro/terms/test#instance", + "resource": "https://w3id.org/ro/terms/test#resource", + "runsOn": "https://w3id.org/ro/terms/test#runsOn" + } + ], + "@graph": [ + { + "@id": "./", + "@type": "Dataset", + "creativeWorkStatus": "InProgress", + "datePublished": "2026-02-20T15:36:51+00:00", + "description": "# quantmsdiann\n\n[![GitHub Actions CI Status](https://github.com/bigbio/quantmsdiann/actions/workflows/ci.yml/badge.svg)](https://github.com/bigbio/quantmsdiann/actions/workflows/ci.yml)\n[![GitHub Actions Linting Status](https://github.com/bigbio/quantmsdiann/actions/workflows/linting.yml/badge.svg)](https://github.com/bigbio/quantmsdiann/actions/workflows/linting.yml)\n[![Cite with Zenodo](https://zenodo.org/badge/DOI/10.5281/zenodo.15573386.svg)](https://doi.org/10.5281/zenodo.15573386)\n[![nf-test](https://img.shields.io/badge/unit_tests-nf--test-337ab7.svg)](https://www.nf-test.com)\n\n[![Nextflow](https://img.shields.io/badge/version-%E2%89%A525.04.0-green?style=flat&logo=nextflow&logoColor=white&color=%230DC09D&link=https%3A%2F%2Fnextflow.io)](https://www.nextflow.io/)\n[![nf-core template version](https://img.shields.io/badge/nf--core_template-3.5.2-green?style=flat&logo=nfcore&logoColor=white&color=%2324B064&link=https%3A%2F%2Fnf-co.re)](https://github.com/nf-core/tools/releases/tag/3.5.2)\n[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)\n[![run with 
singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)\n\n## Introduction\n\n**quantmsdiann** is a [bigbio](https://github.com/bigbio) bioinformatics pipeline, built following [nf-core](https://nf-co.re/) guidelines, for **Data-Independent Acquisition (DIA)** quantitative mass spectrometry analysis using [DIA-NN](https://github.com/vdemichev/DiaNN).\n\nThe pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a portable manner. It uses Docker/Singularity containers making results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process, making it easy to maintain and update software dependencies.\n\n## Pipeline summary\n\n

\n \"quantmsdiann\n

\n\nThe pipeline takes [SDRF](https://github.com/bigbio/proteomics-metadata-standard) metadata and mass spectrometry data files (`.raw`, `.mzML`, `.d`, `.dia`) as input and performs:\n\n1. **Input validation** \u2014 SDRF parsing and validation\n2. **File preparation** \u2014 RAW to mzML conversion (ThermoRawFileParser), indexing, Bruker `.d` handling\n3. **In-silico spectral library generation** \u2014 or use a user-provided library (`--diann_speclib`)\n4. **Preliminary analysis** \u2014 per-file calibration and mass accuracy estimation\n5. **Empirical library assembly** \u2014 consensus library from preliminary results\n6. **Individual analysis** \u2014 per-file search with the empirical library\n7. **Final quantification** \u2014 protein/peptide/gene group matrices\n8. **MSstats conversion** \u2014 DIA-NN report to MSstats-compatible format\n9. **Quality control** \u2014 interactive QC report via [pmultiqc](https://github.com/bigbio/pmultiqc)\n\n## Supported DIA-NN Versions\n\n| Version | Profile | Container | Output format |\n| --------------- | -------------- | ------------------------------------------ | ------------- |\n| 1.8.1 (default) | `diann_v1_8_1` | `docker.io/biocontainers/diann:v1.8.1_cv1` | TSV |\n| 2.1.0 | `diann_v2_1_0` | `ghcr.io/bigbio/diann:2.1.0` | Parquet |\n| 2.2.0 | `diann_v2_2_0` | `ghcr.io/bigbio/diann:2.2.0` | Parquet |\n\nSwitch versions with `-profile diann_v2_1_0,docker`. See the [DIA-NN Version Selection](docs/usage.md#dia-nn-version-selection) section for details.\n\n## Quick start\n\n> [!NOTE]\n> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set up Nextflow.\n\n```bash\nnextflow run bigbio/quantmsdiann \\\n --input 'experiment.sdrf.tsv' \\\n --database 'proteins.fasta' \\\n --outdir './results' \\\n -profile docker\n```\n\n> [!WARNING]\n> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. 
Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), not for defining parameters.\n\n## Documentation\n\n- [Usage](docs/usage.md) \u2014 How to run the pipeline, input formats, optional outputs, and custom configuration\n- [Parameters](docs/parameters.md) \u2014 Complete reference of all pipeline parameters organised by category\n- [Output](docs/output.md) \u2014 Description of all output files produced by the pipeline\n\n## Credits\n\nquantmsdiann is developed and maintained by:\n\n- [Yasset Perez-Riverol](https://github.com/ypriverol) (EMBL-EBI)\n- [Dai Chengxin](https://github.com/daichengxin) (Beijing Proteome Research Center)\n- [Julianus Pfeuffer](https://github.com/jpfeuffer) (Freie Universitat Berlin)\n- [Vadim Demichev](https://github.com/vdemichev) (Charite Universitaetsmedizin Berlin)\n- [Qi-Xuan Yue](https://github.com/yueqixuan) (Chongqing University of Posts and Telecommunications)\n\n## Contributions and Support\n\nIf you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).\n\n## Citation\n\nIf you use quantmsdiann in your research, please cite:\n\n> Dai et al. \"quantms: a cloud-based pipeline for quantitative proteomics\" (2024). 
DOI: [10.5281/zenodo.15573386](https://doi.org/10.5281/zenodo.15573386)\n\nAn extensive list of references for the tools used by the pipeline can be found in the [CITATIONS.md](CITATIONS.md) file.\n\n## License\n\n[MIT](LICENSE)\n", + "hasPart": [ + { + "@id": "main.nf" + }, + { + "@id": "assets/" + }, + { + "@id": "bin/" + }, + { + "@id": "conf/" + }, + { + "@id": "docs/" + }, + { + "@id": "docs/images/" + }, + { + "@id": "modules/" + }, + { + "@id": "modules/nf-core/" + }, + { + "@id": "workflows/" + }, + { + "@id": "subworkflows/" + }, + { + "@id": "nextflow.config" + }, + { + "@id": "README.md" + }, + { + "@id": "nextflow_schema.json" + }, + { + "@id": "CHANGELOG.md" + }, + { + "@id": "LICENSE" + }, + { + "@id": "CODE_OF_CONDUCT.md" + }, + { + "@id": "CITATIONS.md" + }, + { + "@id": "modules.json" + }, + { + "@id": "docs/usage.md" + }, + { + "@id": "docs/output.md" + }, + { + "@id": ".nf-core.yml" + }, + { + "@id": ".pre-commit-config.yaml" + }, + { + "@id": ".prettierignore" + } + ], + "isBasedOn": "https://github.com/bigbio/quantmsdiann", + "license": "MIT", + "mainEntity": { + "@id": "main.nf" + }, + "mentions": [ + { + "@id": "#e08e8618-13c7-4107-9b03-f44529aa5fad" + } + ], + "name": "bigbio/quantmsdiann" + }, + { + "@id": "ro-crate-metadata.json", + "@type": "CreativeWork", + "about": { + "@id": "./" + }, + "conformsTo": [ + { + "@id": "https://w3id.org/ro/crate/1.1" + }, + { + "@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0" + } + ] + }, + { + "@id": "main.nf", + "@type": [ + "File", + "SoftwareSourceCode", + "ComputationalWorkflow" + ], + "creator": [ + { + "@id": "https://orcid.org/0000-0001-6579-6941" + } + ], + "dateCreated": "", + "dateModified": "2026-02-20T15:36:51Z", + "dct:conformsTo": "https://bioschemas.org/profiles/ComputationalWorkflow/1.0-RELEASE/", + "keywords": [ + "nf-core", + "nextflow", + "dia", + "dia-nn", + "mass-spec", + "mass-spectrometry", + "proteomics", + "quantitative-proteomics" + ], + "license": [ + "MIT" + ], + 
"maintainer": [ + { + "@id": "https://orcid.org/0000-0001-6579-6941" + } + ], + "name": [ + "bigbio/quantmsdiann" + ], + "programmingLanguage": { + "@id": "https://w3id.org/workflowhub/workflow-ro-crate#nextflow" + }, + "sdPublisher": { + "@id": "https://nf-co.re/" + }, + "url": [ + "https://github.com/bigbio/quantmsdiann", + "https://nf-co.re/bigbio/quantmsdiann/dev/" + ], + "version": [ + "1.0.0" + ] + }, + { + "@id": "https://w3id.org/workflowhub/workflow-ro-crate#nextflow", + "@type": "ComputerLanguage", + "identifier": { + "@id": "https://www.nextflow.io/" + }, + "name": "Nextflow", + "url": { + "@id": "https://www.nextflow.io/" + }, + "version": "!>=25.04.0" + }, + { + "@id": "#e08e8618-13c7-4107-9b03-f44529aa5fad", + "@type": "TestSuite", + "instance": [ + { + "@id": "#3117cd02-e7cf-46fa-b9ff-69f935d897c9" + } + ], + "mainEntity": { + "@id": "main.nf" + }, + "name": "Test suite for bigbio/quantmsdiann" + }, + { + "@id": "#3117cd02-e7cf-46fa-b9ff-69f935d897c9", + "@type": "TestInstance", + "name": "GitHub Actions workflow for testing bigbio/quantmsdiann", + "resource": "repos/bigbio/quantmsdiann/actions/workflows/nf-test.yml", + "runsOn": { + "@id": "https://w3id.org/ro/terms/test#GithubService" + }, + "url": "https://api.github.com" + }, + { + "@id": "https://w3id.org/ro/terms/test#GithubService", + "@type": "TestService", + "name": "Github Actions", + "url": { + "@id": "https://github.com" + } + }, + { + "@id": "assets/", + "@type": "Dataset", + "description": "Additional files" + }, + { + "@id": "bin/", + "@type": "Dataset", + "description": "Pipeline utility scripts" + }, + { + "@id": "conf/", + "@type": "Dataset", + "description": "Configuration files" + }, + { + "@id": "docs/", + "@type": "Dataset", + "description": "Markdown files for documenting the pipeline" + }, + { + "@id": "docs/images/", + "@type": "Dataset", + "description": "Images for the documentation files" + }, + { + "@id": "modules/", + "@type": "Dataset", + "description": "Modules used by 
the pipeline" + }, + { + "@id": "modules/nf-core/", + "@type": "Dataset", + "description": "nf-core modules" + }, + { + "@id": "workflows/", + "@type": "Dataset", + "description": "Main pipeline workflows to be executed in main.nf" + }, + { + "@id": "subworkflows/", + "@type": "Dataset", + "description": "Smaller subworkflows" + }, + { + "@id": "nextflow.config", + "@type": "File", + "description": "Main Nextflow configuration file" + }, + { + "@id": "README.md", + "@type": "File", + "description": "Basic pipeline usage information" + }, + { + "@id": "nextflow_schema.json", + "@type": "File", + "description": "JSON schema for pipeline parameter specification" + }, + { + "@id": "CHANGELOG.md", + "@type": "File", + "description": "Information on changes made to the pipeline" + }, + { + "@id": "LICENSE", + "@type": "File", + "description": "The license - should be MIT" + }, + { + "@id": "CODE_OF_CONDUCT.md", + "@type": "File", + "description": "The nf-core code of conduct" + }, + { + "@id": "CITATIONS.md", + "@type": "File", + "description": "Citations needed when using the pipeline" + }, + { + "@id": "modules.json", + "@type": "File", + "description": "Version information for modules from nf-core/modules" + }, + { + "@id": "docs/usage.md", + "@type": "File", + "description": "Usage documentation" + }, + { + "@id": "docs/output.md", + "@type": "File", + "description": "Output documentation" + }, + { + "@id": ".nf-core.yml", + "@type": "File", + "description": "nf-core configuration file, configuring template features and linting rules" + }, + { + "@id": ".pre-commit-config.yaml", + "@type": "File", + "description": "Configuration file for pre-commit hooks" + }, + { + "@id": ".prettierignore", + "@type": "File", + "description": "Ignore file for prettier" + }, + { + "@id": "https://nf-co.re/", + "@type": "Organization", + "name": "nf-core", + "url": "https://nf-co.re/" + }, + { + "@id": "https://orcid.org/0000-0001-6579-6941", + "@type": "Person", + "email": 
"ypriverol@gmail.com", + "name": "Yasset Perez-Riverol" + } + ] +} \ No newline at end of file diff --git a/subworkflows/local/create_input_channel/main.nf b/subworkflows/local/create_input_channel/main.nf new file mode 100644 index 0000000..588e144 --- /dev/null +++ b/subworkflows/local/create_input_channel/main.nf @@ -0,0 +1,199 @@ +// +// Create channel for input file (DIA-NN only pipeline) +// +include { SDRF_PARSING } from '../../../modules/local/sdrf_parsing/main' + + + +workflow CREATE_INPUT_CHANNEL { + take: + ch_sdrf + + main: + ch_versions = channel.empty() + + // Always parse as SDRF using DIA-NN converter + SDRF_PARSING(ch_sdrf) + ch_versions = ch_versions.mix(SDRF_PARSING.out.versions) + ch_expdesign = SDRF_PARSING.out.ch_expdesign + ch_diann_cfg = SDRF_PARSING.out.ch_diann_cfg + + def enzymes = new HashSet() + def files = new HashSet() + + // Extract experiment_id from the SDRF filename via the value channel + ch_sdrf_val = ch_sdrf.first() + ch_experiment_id = ch_sdrf_val.map { sdrf_file -> file(sdrf_file).baseName } + + ch_experiment_id + .combine(ch_expdesign) + .splitCsv(header: true, sep: '\t') + .map { experiment_id, row -> + def wrapper = [acquisition_method: "", experiment_id: experiment_id] + create_meta_channel(row, enzymes, files, wrapper) + } + .set { ch_meta_config_dia } + + emit: + ch_meta_config_dia // [meta, spectra_file] + ch_expdesign + ch_diann_cfg + versions = ch_versions +} + +// Function to get list of [meta, [ spectra_files ]] +def create_meta_channel(LinkedHashMap row, enzymes, files, wrapper) { + def meta = [:] + def filestr + + // Always use SDRF format + if (!params.root_folder) { + filestr = row.URI?.toString()?.trim() ? row.URI.toString() : row.Filename.toString() + } + else { + filestr = row.Filename.toString() + } + + def fileName = file(filestr).name + def dotIndex = fileName.lastIndexOf('.') + meta.id = dotIndex > 0 ? 
fileName.take(dotIndex) : fileName + meta.experiment_id = wrapper.experiment_id + + // apply transformations given by specified root_folder and type + if (params.root_folder) { + filestr = params.root_folder + File.separator + filestr + filestr = (params.local_input_type + ? filestr.take(filestr.lastIndexOf('.')) + '.' + params.local_input_type + : filestr) + } + + // existence check + if (!file(filestr).exists()) { + exit(1, "ERROR: Please check input file -> File Uri does not exist!\n${filestr}") + } + + // Validate acquisition method is DIA + // AcquisitionMethod is already extracted by convert-diann (e.g. "Data-Independent Acquisition") + def acqMethod = row.AcquisitionMethod?.toString()?.trim() ?: "" + if (acqMethod.toLowerCase().contains("data-independent acquisition") || acqMethod.toLowerCase().contains("dia")) { + meta.acquisition_method = "dia" + } + else if (acqMethod.isEmpty()) { + // If no acquisition method column in SDRF, assume DIA (this is a DIA-only pipeline) + meta.acquisition_method = "dia" + } + else { + log.error("This pipeline only supports Data-Independent Acquisition (DIA). Found: '${acqMethod}'. 
Use the quantms pipeline for DDA workflows.") + exit(1) + } + + // DissociationMethod is already normalized by convert-diann (HCD, CID, ETD, ECD) + meta.dissociationmethod = row.DissociationMethod?.toString()?.trim() ?: "" + + wrapper.acquisition_method = meta.acquisition_method + + // Validate required SDRF columns - these parameters are exclusively read from SDRF (no command-line override) + def requiredColumns = [ + 'Label': row.Label, + 'Enzyme': row.Enzyme, + 'FixedModifications': row.FixedModifications + ] + + def missingColumns = [] + requiredColumns.each { colName, colValue -> + if (colValue == null || colValue.toString().trim().isEmpty()) { + missingColumns.add(colName) + } + } + + if (missingColumns.size() > 0) { + log.error("ERROR: Missing or empty required SDRF columns for file '${filestr}': ${missingColumns.join(', ')}") + log.error("These parameters must be specified in the SDRF file. Please check your SDRF annotation.") + exit(1) + } + + // Set values from SDRF (required columns) + meta.labelling_type = row.Label + meta.fixedmodifications = row.FixedModifications + meta.enzyme = row.Enzyme + + // Set tolerance values: use SDRF if available, otherwise fall back to params + def validUnits = ['ppm', 'da', 'Da', 'PPM'] + + // Precursor mass tolerance + if (row.PrecursorMassTolerance != null && !row.PrecursorMassTolerance.toString().trim().isEmpty()) { + try { + meta.precursormasstolerance = Double.parseDouble(row.PrecursorMassTolerance) + } catch (NumberFormatException e) { + log.error("ERROR: Invalid PrecursorMassTolerance value '${row.PrecursorMassTolerance}' for file '${filestr}'. Must be a valid number.") + exit(1) + } + } else { + log.warn("No precursor mass tolerance in SDRF for '${filestr}'. 
Using default: ${params.precursor_mass_tolerance} ${params.precursor_mass_tolerance_unit}") + meta.precursormasstolerance = params.precursor_mass_tolerance + } + + // Precursor mass tolerance unit + if (row.PrecursorMassToleranceUnit != null && !row.PrecursorMassToleranceUnit.toString().trim().isEmpty()) { + if (!validUnits.any { row.PrecursorMassToleranceUnit.toString().equalsIgnoreCase(it) }) { + log.error("ERROR: Invalid PrecursorMassToleranceUnit '${row.PrecursorMassToleranceUnit}' for file '${filestr}'. Must be 'ppm' or 'Da'.") + exit(1) + } + meta.precursormasstoleranceunit = row.PrecursorMassToleranceUnit + } else { + meta.precursormasstoleranceunit = params.precursor_mass_tolerance_unit + } + + // Fragment mass tolerance + if (row.FragmentMassTolerance != null && !row.FragmentMassTolerance.toString().trim().isEmpty()) { + try { + meta.fragmentmasstolerance = Double.parseDouble(row.FragmentMassTolerance) + } catch (NumberFormatException e) { + log.error("ERROR: Invalid FragmentMassTolerance value '${row.FragmentMassTolerance}' for file '${filestr}'. Must be a valid number.") + exit(1) + } + } else { + log.warn("No fragment mass tolerance in SDRF for '${filestr}'. Using default: ${params.fragment_mass_tolerance} ${params.fragment_mass_tolerance_unit}") + meta.fragmentmasstolerance = params.fragment_mass_tolerance + } + + // Fragment mass tolerance unit + if (row.FragmentMassToleranceUnit != null && !row.FragmentMassToleranceUnit.toString().trim().isEmpty()) { + if (!validUnits.any { row.FragmentMassToleranceUnit.toString().equalsIgnoreCase(it) }) { + log.error("ERROR: Invalid FragmentMassToleranceUnit '${row.FragmentMassToleranceUnit}' for file '${filestr}'. 
Must be 'ppm' or 'Da'.") + exit(1) + } + meta.fragmentmasstoleranceunit = row.FragmentMassToleranceUnit + } else { + meta.fragmentmasstoleranceunit = params.fragment_mass_tolerance_unit + } + + // Variable modifications: use SDRF if available, otherwise fall back to params + if (row.VariableModifications != null && !row.VariableModifications.toString().trim().isEmpty()) { + meta.variablemodifications = row.VariableModifications + } else { + meta.variablemodifications = params.variable_mods + } + + // Per-file scan ranges (empty string = no flags passed, DIA-NN auto-detects) + meta.ms1minmz = row.MS1MinMz?.toString()?.trim() ?: "" + meta.ms1maxmz = row.MS1MaxMz?.toString()?.trim() ?: "" + meta.ms2minmz = row.MS2MinMz?.toString()?.trim() ?: "" + meta.ms2maxmz = row.MS2MaxMz?.toString()?.trim() ?: "" + + enzymes += row.Enzyme + if (enzymes.size() > 1) { + log.error("Currently only one enzyme is supported for the whole experiment. Specified was '${enzymes}'. Check or split your SDRF.") + log.error(filestr) + exit(1) + } + + // Check for duplicate files + if (filestr in files) { + log.error("Currently only one DIA-NN setting per file is supported for the whole experiment. ${filestr} has multiple entries in your SDRF. 
Consider splitting your design into multiple experiments.") + exit(1) + } + files += filestr + + return [meta, filestr] +} diff --git a/subworkflows/local/create_input_channel/meta.yml b/subworkflows/local/create_input_channel/meta.yml new file mode 100644 index 0000000..de6acae --- /dev/null +++ b/subworkflows/local/create_input_channel/meta.yml @@ -0,0 +1,32 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/yaml-schema.json +name: "create_input_channel" +description: Subworkflow to create input channels from validated SDRF files (DIA only) +keywords: + - input + - channel + - validation + - dia +components: + - sdrf/parsing +input: + - ch_sdrf: + type: file + description: | + Channel containing the SDRF file path +output: + - ch_meta_config_dia: + type: file + description: | + Channel containing DIA data configuration [meta, spectra_file] + - ch_expdesign: + type: file + description: | + Channel containing experimental design information + - versions: + type: file + description: | + Software versions used in this subworkflow +authors: + - "@bigbio" +maintainers: + - "@bigbio" diff --git a/subworkflows/local/file_preparation/main.nf b/subworkflows/local/file_preparation/main.nf new file mode 100644 index 0000000..6327cbb --- /dev/null +++ b/subworkflows/local/file_preparation/main.nf @@ -0,0 +1,117 @@ +// +// Raw file conversion and mzml indexing +// + +include { THERMORAWFILEPARSER } from '../../../modules/bigbio/thermorawfileparser/main' +include { DECOMPRESS } from '../../../modules/local/utils/decompress_dotd/main' +include { MZML_INDEXING } from '../../../modules/local/openms/mzml_indexing/main' +include { MZML_STATISTICS } from '../../../modules/local/utils/mzml_statistics/main' + +workflow FILE_PREPARATION { + take: + ch_rawfiles // channel: [ val(meta), raw/mzml/d.tar ] + + main: + ch_versions = channel.empty() + ch_results = channel.empty() + ch_statistics = channel.empty() + ch_ms2_statistics = 
channel.empty() +    ch_feature_statistics = channel.empty() + + +    // Divide the compressed files +    ch_rawfiles +        .branch { item -> +            dottar: hasExtension(item[1], '.tar') +            dotzip: hasExtension(item[1], '.zip') +            gz: hasExtension(item[1], '.gz') +            uncompressed: true +        }.set { ch_branched_input } + +    compressed_files = ch_branched_input.dottar.mix(ch_branched_input.dotzip, ch_branched_input.gz) +    DECOMPRESS(compressed_files) +    ch_versions = ch_versions.mix(DECOMPRESS.out.versions) +    ch_rawfiles = ch_branched_input.uncompressed.mix(DECOMPRESS.out.decompressed_files) + +    // +    // Divide mzml files +    ch_rawfiles +        .branch { item -> +            raw: hasExtension(item[1], '.raw') +            mzML: hasExtension(item[1], '.mzML') +            dotd: hasExtension(item[1], '.d') +            dia: hasExtension(item[1], '.dia') +            unsupported: true +        }.set { ch_branched_input } + +    // Warn about unsupported file formats +    // collect(flat: false) keeps the [meta, file] tuples intact so the +    // two-argument closure below can destructure them (collect flattens by default) +    ch_branched_input.unsupported +        .collect(flat: false) +        .subscribe { files -> +            if (files.size() > 0) { +                log.warn "=" * 80 +                log.warn "WARNING: ${files.size()} file(s) with unsupported format(s) detected and will be SKIPPED from processing:" +                files.each { _meta, file -> +                    log.warn "  - ${file}" +                } +                log.warn "\nSupported formats: .raw, .mzML, .d (Bruker), .dia" +                log.warn "Compressed variants (.gz, .tar, .tar.gz, .zip) are also supported." +                log.warn "=" * 80 +            } +        } + +    // Note: we used to always index mzMLs if not already indexed but due to +    // either a bug or limitation in nextflow +    // peeking into a remote file consumes a lot of RAM +    // See https://github.com/bigbio/quantms/issues/61 +    // This is now done in the search engines themselves if they need it. +    // This means users should pre-index to save time and space, especially +    // when re-running.
+ + if (params.reindex_mzml) { + MZML_INDEXING( ch_branched_input.mzML ) + ch_versions = ch_versions.mix(MZML_INDEXING.out.versions) + ch_results = ch_results.mix(MZML_INDEXING.out.mzmls_indexed) + } else { + ch_results = ch_results.mix(ch_branched_input.mzML) + } + + THERMORAWFILEPARSER( ch_branched_input.raw ) + // Output: spectra (tuple val(meta), path(mzML/mgf/parquet)), log, versions via topic channel + ch_results = ch_results.mix(THERMORAWFILEPARSER.out.spectra) + + ch_results.map{ it -> [it[0], it[1]] }.set{ indexed_mzml_bundle } + + // Pass through .d files without conversion + // DIA-NN handles .d files natively; they bypass mzML statistics + ch_results = indexed_mzml_bundle.mix(ch_branched_input.dotd) + + // Pass through .dia files without conversion (DIA-NN handles them natively) + ch_results = ch_results.mix(ch_branched_input.dia) + + if (params.mzml_statistics) { + // Only run on mzML files — exclude .d, .dia, .mgf, .parquet, etc. + ch_mzml_for_stats = ch_results.filter { _meta, file -> + hasExtension(file, '.mzML') + } + MZML_STATISTICS(ch_mzml_for_stats) + ch_statistics = ch_statistics.mix(MZML_STATISTICS.out.ms_statistics.collect()) + ch_ms2_statistics = ch_ms2_statistics.mix(MZML_STATISTICS.out.ms2_statistics) + ch_feature_statistics = ch_feature_statistics.mix(MZML_STATISTICS.out.feature_statistics.collect()) + ch_versions = ch_versions.mix(MZML_STATISTICS.out.versions) + } + + emit: + results = ch_results // channel: [val(mzml_id), indexedmzml|.d.tar] + statistics = ch_statistics // channel: [ *_ms_info.parquet ] + ms2_statistics = ch_ms2_statistics // channel: [ *_ms2_info.parquet ] + feature_statistics = ch_feature_statistics // channel: [ *_feature_info.parquet ] + versions = ch_versions // channel: [ *.versions.yml ] +} + +// +// check file extension +// +def hasExtension(file, extension) { + return file.toString().toLowerCase().endsWith(extension.toLowerCase()) +} diff --git a/subworkflows/local/file_preparation/meta.yml 
b/subworkflows/local/file_preparation/meta.yml new file mode 100644 index 0000000..54211c7 --- /dev/null +++ b/subworkflows/local/file_preparation/meta.yml @@ -0,0 +1,45 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/yaml-schema.json +name: "file_preparation" +description: Subworkflow for preparing and validating input files for proteomics analysis +keywords: + - file + - preparation + - validation + - proteomics +components: + - thermorawfileparser + - tdf2mzml + - decompress + - mzml/indexing + - mzml/statistics + - openms/peak/picker +input: + - ch_input: + type: file + description: | + Channel containing input files to be prepared +output: + - results: + type: file + description: | + Channel containing prepared results + - statistics: + type: file + description: | + Channel containing file statistics + - ms2_statistics: + type: file + description: | + Channel containing MS2 spectrum statistics + - feature_statistics: + type: file + description: | + Channel containing feature statistics + - versions: + type: file + description: | + Software versions used in this subworkflow +authors: + - "@bigbio" +maintainers: + - "@bigbio" diff --git a/subworkflows/local/input_check/main.nf b/subworkflows/local/input_check/main.nf new file mode 100644 index 0000000..4a821e2 --- /dev/null +++ b/subworkflows/local/input_check/main.nf @@ -0,0 +1,21 @@ +// +// Check input SDRF and get read channels +// + +include { SAMPLESHEET_CHECK } from '../../../modules/local/samplesheet_check' + +workflow INPUT_CHECK { + take: + input_file + + main: + + ch_software_versions = channel.empty() + + SAMPLESHEET_CHECK ( input_file ) + ch_software_versions = ch_software_versions.mix(SAMPLESHEET_CHECK.out.versions) + + emit: + ch_input_file = SAMPLESHEET_CHECK.out.checked_file + versions = ch_software_versions +} diff --git a/subworkflows/local/input_check/meta.yml b/subworkflows/local/input_check/meta.yml new file mode 100644 index 
0000000..abe2c7f --- /dev/null +++ b/subworkflows/local/input_check/meta.yml @@ -0,0 +1,28 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/yaml-schema.json +name: "input_check" +description: Subworkflow for validating and checking input files and parameters +keywords: + - input + - validation + - check + - parameters +components: + - samplesheet/check +input: + - input_file: + type: file + description: | + Input file to be validated +output: + - ch_input_file: + type: file + description: | + Channel containing validated input files + - versions: + type: file + description: | + Software versions used in this subworkflow +authors: + - "@bigbio" +maintainers: + - "@bigbio" diff --git a/subworkflows/local/utils_nfcore_quantms_pipeline/main.nf b/subworkflows/local/utils_nfcore_quantms_pipeline/main.nf new file mode 100644 index 0000000..bce1eee --- /dev/null +++ b/subworkflows/local/utils_nfcore_quantms_pipeline/main.nf @@ -0,0 +1,184 @@ +// +// Subworkflow with functionality specific to the bigbio/quantmsdiann pipeline +// + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + IMPORT FUNCTIONS / MODULES / SUBWORKFLOWS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// Plugin function from nf-schema@2.5.1 (version specified in nextflow.config) +include { paramsSummaryMap } from 'plugin/nf-schema' +include { completionEmail } from '../../nf-core/utils_nfcore_pipeline' +include { completionSummary } from '../../nf-core/utils_nfcore_pipeline' +include { imNotification } from '../../nf-core/utils_nfcore_pipeline' + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + SUBWORKFLOW TO INITIALISE PIPELINE +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + + +/* 
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + SUBWORKFLOW FOR PIPELINE COMPLETION +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +workflow PIPELINE_COMPLETION { + + take: + email // string: email address + email_on_fail // string: email address sent on pipeline failure + plaintext_email // boolean: Send plain-text email instead of HTML + outdir // path: Path to output directory where results will be published + monochrome_logs // boolean: Disable ANSI colour codes in log output + hook_url // string: hook URL for notifications + multiqc_report // string: Path to MultiQC report + + main: + summary_params = paramsSummaryMap(workflow, parameters_schema: "nextflow_schema.json") + def multiqc_reports = multiqc_report.toList() + // + // Completion email and summary + // + workflow.onComplete { + if (email || email_on_fail) { + completionEmail( + summary_params, + email, + email_on_fail, + plaintext_email, + outdir, + monochrome_logs, + multiqc_reports.getVal(), + ) + } + + completionSummary(monochrome_logs) + if (hook_url) { + imNotification(summary_params, hook_url) + } + } + + workflow.onError { + log.error "Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting" + } +} + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + FUNCTIONS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ +// +// Check and validate pipeline parameters +// +def validateInputParameters() { + genomeExistsError() +} + +// +// Validate channels from input samplesheet +// +def validateInputSamplesheet(input) { + def (metas, fastqs) = input[1..2] + + // Check that multiple runs of the same sample are of the same datatype i.e. 
single-end / paired-end + def endedness_ok = metas.collect{ meta -> meta.single_end }.unique().size == 1 + if (!endedness_ok) { + error("Please check input samplesheet -> Multiple runs of a sample must be of the same datatype i.e. single-end or paired-end: ${metas[0].id}") + } + + return [ metas[0], fastqs ] +} +// +// Get attribute from genome config file e.g. fasta +// +def getGenomeAttribute(attribute) { + if (params.genomes && params.genome && params.genomes.containsKey(params.genome)) { + if (params.genomes[ params.genome ].containsKey(attribute)) { + return params.genomes[ params.genome ][ attribute ] + } + } + return null +} + +// +// Exit pipeline if incorrect --genome key provided +// +def genomeExistsError() { + if (params.genomes && params.genome && !params.genomes.containsKey(params.genome)) { + def error_string = "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n" + + " Genome '${params.genome}' not found in any config files provided to the pipeline.\n" + + " Currently, the available genome keys are:\n" + + " ${params.genomes.keySet().join(", ")}\n" + + "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + error(error_string) + } +} +// +// Generate methods description for MultiQC +// +def toolCitationText() { + def citation_text = [ + "Tools used in the workflow included:", + "DIA-NN (Demichev et al. 2020),", + "OpenMS (Röst et al. 2016),", + "ThermoRawFileParser (Hulstaert et al. 2020),", + "pmultiqc (Perez-Riverol et al. 2024)", + "." + ].join(' ').trim() + + return citation_text +} + +def toolBibliographyText() { + def reference_text = [ + "
<li>Demichev V, Messner CB, Vernardis SI, Lilley KS, Ralser M. (2020). DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nature Methods, 17(1), 41-44. doi: 10.1038/s41592-019-0638-x</li>", + "
<li>Röst HL, Sachsenberg T, Aiche S, Bielow C, Weisser H, Aicheler F, Andreotti S, Ehrlich HC, Gutenbrunner P, Kenar E, Liang X, Nahnsen S, Nilse L, Pfeuffer J, Rosenberger G, Rurik M, Schmitt U, Veit J, Walzer M, Wojnar D, Wolski WE, Schilling O, Choudhary JS, Malmström L, Aebersold R, Reinert K, Kohlbacher O. (2016). OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nature Methods, 13(9), 741–748. doi: 10.1038/nmeth.3959</li>", + "
<li>Hulstaert N, Shofstahl J, Sachsenberg T, Walzer M, Barsnes H, Martens L, Perez-Riverol Y. (2020). ThermoRawFileParser: Modular, Scalable, and Cross-Platform RAW File Conversion. Journal of Proteome Research, 19(1), 537-542. doi: 10.1021/acs.jproteome.9b00328</li>", + "
<li>Perez-Riverol Y, Moreno P, da Veiga Leprevost F, Csordas A, Bai J, Carver J, Hewapathirana S, Kundu DJ, Inuganti A, Griss J, Mayer G, Eisenacher M, Pérez E, Uszkoreit J, Pfeuffer J, Sachsenberg T, Yilmaz S, Tiwary S, Cox J, Audain E, Walzer M, Jarnuczak AF, Ternent T, Brazma A, Vizcaíno JA. (2024). pmultiqc: a comprehensive tool for quality control of proteomics data. Nature Methods, 21(1), 1-2. doi: 10.1038/s41592-023-02125-1</li>" +    ].join(' ').trim() + +    return reference_text +} + +def methodsDescriptionText(mqc_methods_yaml) { +    // Convert to a named map so can be used as with familiar NXF ${workflow} variable syntax in the MultiQC YML file +    def meta = [:] +    meta.workflow = workflow.toMap() +    meta["manifest_map"] = workflow.manifest.toMap() + +    // Pipeline DOI +    if (meta.manifest_map.doi) { +        // Using a loop to handle multiple DOIs +        // Removing `https://doi.org/` to handle pipelines using DOIs vs DOI resolvers +        // Removing ` ` since the manifest.doi is a string and not a proper list +        def temp_doi_ref = "" +        def manifest_doi = meta.manifest_map.doi.tokenize(",") +        manifest_doi.each { doi_ref -> +            temp_doi_ref += "(doi: ${doi_ref.replace("https://doi.org/", "").replace(" ", "")}), " +        } +        meta["doi_text"] = temp_doi_ref.substring(0, temp_doi_ref.length() - 2) +    } else meta["doi_text"] = "" +    meta["nodoi_text"] = meta.manifest_map.doi ? "" : "
<li>If available, make sure to update the text to include the Zenodo DOI of version of the pipeline used.</li>" + +    // Tool references +    meta["tool_citations"] = "" +    meta["tool_bibliography"] = "" + +    // Tool citations and bibliography +    meta["tool_citations"] = toolCitationText().replaceAll(", \\.", ".").replaceAll("\\. \\.", ".").replaceAll(", \\.", ".") +    meta["tool_bibliography"] = toolBibliographyText() + + +    def methods_text = mqc_methods_yaml.text + +    def engine = new groovy.text.SimpleTemplateEngine() +    def description_html = engine.createTemplate(methods_text).make(meta) + +    return description_html.toString() +} diff --git a/subworkflows/local/utils_nfcore_quantms_pipeline/meta.yml b/subworkflows/local/utils_nfcore_quantms_pipeline/meta.yml new file mode 100644 index 0000000..06365ae --- /dev/null +++ b/subworkflows/local/utils_nfcore_quantms_pipeline/meta.yml @@ -0,0 +1,32 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/yaml-schema.json +name: "utils_nfcore_quantms_pipeline" +description: Pipeline completion utilities for the nf-core quantmsdiann pipeline +keywords: +  - utils +  - nf-core +  - quantms +components: +  - completionemail +  - completionsummary +  - imnotification +  - utils_nextflow_pipeline +  - utils_nfcore_pipeline +  - utils_nfschema_plugin +input: +  - ch_input: +      type: file +      description: | +        Channel containing pipeline execution summary or status information +output: +  - ch_completion_status: +      type: file +      description: | +        Channel containing pipeline completion status and summary information +  - versions: +      type: file +      description: | +        Software versions used in this subworkflow +authors: +  - "@bigbio" +maintainers: +  - "@bigbio" diff --git a/subworkflows/nf-core/utils_nextflow_pipeline/main.nf b/subworkflows/nf-core/utils_nextflow_pipeline/main.nf new file mode 100644 index 0000000..d6e593e --- /dev/null +++ b/subworkflows/nf-core/utils_nextflow_pipeline/main.nf @@ -0,0 +1,126 @@ +// +// Subworkflow with functionality that may be useful for any Nextflow pipeline +// + +/*
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + SUBWORKFLOW DEFINITION +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +workflow UTILS_NEXTFLOW_PIPELINE { + take: + print_version // boolean: print version + dump_parameters // boolean: dump parameters + outdir // path: base directory used to publish pipeline results + check_conda_channels // boolean: check conda channels + + main: + + // + // Print workflow version and exit on --version + // + if (print_version) { + log.info("${workflow.manifest.name} ${getWorkflowVersion()}") + System.exit(0) + } + + // + // Dump pipeline parameters to a JSON file + // + if (dump_parameters && outdir) { + dumpParametersToJSON(outdir) + } + + // + // When running with Conda, warn if channels have not been set-up appropriately + // + if (check_conda_channels) { + checkCondaChannels() + } + + emit: + dummy_emit = true +} + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + FUNCTIONS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// +// Generate version string +// +def getWorkflowVersion() { + def version_string = "" as String + if (workflow.manifest.version) { + def prefix_v = workflow.manifest.version[0] != 'v' ? 
'v' : '' +        version_string += "${prefix_v}${workflow.manifest.version}" +    } + +    if (workflow.commitId) { +        def git_shortsha = workflow.commitId.substring(0, 7) +        version_string += "-g${git_shortsha}" +    } + +    return version_string +} + +// +// Dump pipeline parameters to a JSON file +// +def dumpParametersToJSON(outdir) { +    def timestamp = new java.util.Date().format('yyyy-MM-dd_HH-mm-ss') +    def filename = "params_${timestamp}.json" +    // Write to a hidden temp file in the launch dir, then copy into the output dir +    def temp_pf = new File(workflow.launchDir.toString(), ".${filename}") +    def jsonStr = groovy.json.JsonOutput.toJson(params) +    temp_pf.text = groovy.json.JsonOutput.prettyPrint(jsonStr) + +    nextflow.extension.FilesEx.copyTo(temp_pf.toPath(), "${outdir}/pipeline_info/params_${timestamp}.json") +    temp_pf.delete() +} + +// +// When running with -profile conda, warn if channels have not been set-up appropriately +// +def checkCondaChannels() { +    def parser = new org.yaml.snakeyaml.Yaml() +    def channels = [] +    try { +        def config = parser.load("conda config --show channels".execute().text) +        channels = config.channels +    } +    catch (NullPointerException e) { +        log.debug(e) +        log.warn("Could not verify conda channel configuration.") +        return null +    } +    catch (IOException e) { +        log.debug(e) +        log.warn("Could not verify conda channel configuration.") +        return null +    } + +    // Check that all channels are present +    // This channel list is ordered by required channel priority. +    def required_channels_in_order = ['conda-forge', 'bioconda'] +    def channels_missing = ((required_channels_in_order as Set) - (channels as Set)) as Boolean + +    // Check that they are in the right order +    def channel_priority_violation = required_channels_in_order != channels.findAll { ch -> ch in required_channels_in_order } + +    if (channels_missing | channel_priority_violation) { +        log.warn """\ +            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +            There is a problem with your Conda configuration!
+            You will need to set-up the conda-forge and bioconda channels correctly. +            Please refer to https://bioconda.github.io/ +            The observed channel order is +            ${channels} +            but the following channel order is required: +            ${required_channels_in_order} +            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +            """.stripIndent(true) +    } +} diff --git a/subworkflows/nf-core/utils_nextflow_pipeline/meta.yml b/subworkflows/nf-core/utils_nextflow_pipeline/meta.yml new file mode 100644 index 0000000..e5c3a0a --- /dev/null +++ b/subworkflows/nf-core/utils_nextflow_pipeline/meta.yml @@ -0,0 +1,38 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/yaml-schema.json +name: "UTILS_NEXTFLOW_PIPELINE" +description: Subworkflow with functionality that may be useful for any Nextflow pipeline +keywords: +  - utility +  - pipeline +  - initialise +  - version +components: [] +input: +  - print_version: +      type: boolean +      description: | +        Print the version of the pipeline and exit +  - dump_parameters: +      type: boolean +      description: | +        Dump the parameters of the pipeline to a JSON file +  - output_directory: +      type: directory +      description: Path to output dir to write JSON file to. +      pattern: "results/" +  - check_conda_channel: +      type: boolean +      description: | +        Check if the conda channel priority is correct.
+output: + - dummy_emit: + type: boolean + description: | + Dummy emit to make nf-core subworkflows lint happy +authors: + - "@adamrtalbot" + - "@drpatelh" +maintainers: + - "@adamrtalbot" + - "@drpatelh" + - "@maxulysse" diff --git a/subworkflows/nf-core/utils_nextflow_pipeline/tests/main.function.nf.test b/subworkflows/nf-core/utils_nextflow_pipeline/tests/main.function.nf.test new file mode 100644 index 0000000..68718e4 --- /dev/null +++ b/subworkflows/nf-core/utils_nextflow_pipeline/tests/main.function.nf.test @@ -0,0 +1,54 @@ + +nextflow_function { + + name "Test Functions" + script "subworkflows/nf-core/utils_nextflow_pipeline/main.nf" + config "subworkflows/nf-core/utils_nextflow_pipeline/tests/nextflow.config" + tag 'subworkflows' + tag 'utils_nextflow_pipeline' + tag 'subworkflows/utils_nextflow_pipeline' + + test("Test Function getWorkflowVersion") { + + function "getWorkflowVersion" + + then { + assertAll( + { assert function.success }, + { assert snapshot(function.result).match() } + ) + } + } + + test("Test Function dumpParametersToJSON") { + + function "dumpParametersToJSON" + + when { + function { + """ + // define inputs of the function here. 
Example: + input[0] = "$outputDir" + """.stripIndent() + } + } + + then { + assertAll( + { assert function.success } + ) + } + } + + test("Test Function checkCondaChannels") { + + function "checkCondaChannels" + + then { + assertAll( + { assert function.success }, + { assert snapshot(function.result).match() } + ) + } + } +} diff --git a/subworkflows/nf-core/utils_nextflow_pipeline/tests/main.function.nf.test.snap b/subworkflows/nf-core/utils_nextflow_pipeline/tests/main.function.nf.test.snap new file mode 100644 index 0000000..e3f0baf --- /dev/null +++ b/subworkflows/nf-core/utils_nextflow_pipeline/tests/main.function.nf.test.snap @@ -0,0 +1,20 @@ +{ + "Test Function getWorkflowVersion": { + "content": [ + "v9.9.9" + ], + "meta": { + "nf-test": "0.8.4", + "nextflow": "23.10.1" + }, + "timestamp": "2024-02-28T12:02:05.308243" + }, + "Test Function checkCondaChannels": { + "content": null, + "meta": { + "nf-test": "0.8.4", + "nextflow": "23.10.1" + }, + "timestamp": "2024-02-28T12:02:12.425833" + } +} \ No newline at end of file diff --git a/subworkflows/nf-core/utils_nextflow_pipeline/tests/main.workflow.nf.test b/subworkflows/nf-core/utils_nextflow_pipeline/tests/main.workflow.nf.test new file mode 100644 index 0000000..02dbf09 --- /dev/null +++ b/subworkflows/nf-core/utils_nextflow_pipeline/tests/main.workflow.nf.test @@ -0,0 +1,113 @@ +nextflow_workflow { + + name "Test Workflow UTILS_NEXTFLOW_PIPELINE" + script "../main.nf" + config "subworkflows/nf-core/utils_nextflow_pipeline/tests/nextflow.config" + workflow "UTILS_NEXTFLOW_PIPELINE" + tag 'subworkflows' + tag 'utils_nextflow_pipeline' + tag 'subworkflows/utils_nextflow_pipeline' + + test("Should run no inputs") { + + when { + workflow { + """ + print_version = false + dump_parameters = false + outdir = null + check_conda_channels = false + + input[0] = print_version + input[1] = dump_parameters + input[2] = outdir + input[3] = check_conda_channels + """ + } + } + + then { + assertAll( + { assert 
workflow.success } + ) + } + } + + test("Should print version") { + + when { + workflow { + """ + print_version = true + dump_parameters = false + outdir = null + check_conda_channels = false + + input[0] = print_version + input[1] = dump_parameters + input[2] = outdir + input[3] = check_conda_channels + """ + } + } + + then { + expect { + with(workflow) { + assert success + assert "nextflow_workflow v9.9.9" in stdout + } + } + } + } + + test("Should dump params") { + + when { + workflow { + """ + print_version = false + dump_parameters = true + outdir = 'results' + check_conda_channels = false + + input[0] = false + input[1] = true + input[2] = outdir + input[3] = false + """ + } + } + + then { + assertAll( + { assert workflow.success } + ) + } + } + + test("Should not create params JSON if no output directory") { + + when { + workflow { + """ + print_version = false + dump_parameters = true + outdir = null + check_conda_channels = false + + input[0] = false + input[1] = true + input[2] = outdir + input[3] = false + """ + } + } + + then { + assertAll( + { assert workflow.success } + ) + } + } +} diff --git a/subworkflows/nf-core/utils_nextflow_pipeline/tests/nextflow.config b/subworkflows/nf-core/utils_nextflow_pipeline/tests/nextflow.config new file mode 100644 index 0000000..a09572e --- /dev/null +++ b/subworkflows/nf-core/utils_nextflow_pipeline/tests/nextflow.config @@ -0,0 +1,9 @@ +manifest { + name = 'nextflow_workflow' + author = """nf-core""" + homePage = 'https://127.0.0.1' + description = """Dummy pipeline""" + nextflowVersion = '!>=23.04.0' + version = '9.9.9' + doi = 'https://doi.org/10.5281/zenodo.5070524' +} diff --git a/subworkflows/nf-core/utils_nfcore_pipeline/main.nf b/subworkflows/nf-core/utils_nfcore_pipeline/main.nf new file mode 100644 index 0000000..2f30e9a --- /dev/null +++ b/subworkflows/nf-core/utils_nfcore_pipeline/main.nf @@ -0,0 +1,419 @@ +// +// Subworkflow with utility functions specific to the nf-core pipeline template +// + +/* 
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + SUBWORKFLOW DEFINITION +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +workflow UTILS_NFCORE_PIPELINE { + take: + nextflow_cli_args + + main: + valid_config = checkConfigProvided() + checkProfileProvided(nextflow_cli_args) + + emit: + valid_config +} + +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + FUNCTIONS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +// +// Warn if a -profile or Nextflow config has not been provided to run the pipeline +// +def checkConfigProvided() { + def valid_config = true as Boolean + if (workflow.profile == 'standard' && workflow.configFiles.size() <= 1) { + log.warn( + "[${workflow.manifest.name}] You are attempting to run the pipeline without any custom configuration!\n\n" + "This will be dependent on your local compute environment but can be achieved via one or more of the following:\n" + " (1) Using an existing pipeline profile e.g. `-profile docker` or `-profile singularity`\n" + " (2) Using an existing nf-core/configs for your Institution e.g. `-profile crick` or `-profile uppmax`\n" + " (3) Using your own local custom config e.g. `-c /path/to/your/custom.config`\n\n" + "Please refer to the quick start section and usage docs for the pipeline.\n " + ) + valid_config = false + } + return valid_config +} + +// +// Exit pipeline if --profile contains spaces +// +def checkProfileProvided(nextflow_cli_args) { + if (workflow.profile.endsWith(',')) { + error( + "The `-profile` option cannot end with a trailing comma, please remove it and re-run the pipeline!\n" + "HINT: A common mistake is to provide multiple values separated by spaces e.g. `-profile test, docker`.\n" + ) + } + if (nextflow_cli_args[0]) { + log.warn( + "nf-core pipelines do not accept positional arguments. 
The positional argument `${nextflow_cli_args[0]}` has been detected.\n" + "HINT: A common mistake is to provide multiple values separated by spaces e.g. `-profile test, docker`.\n" + ) + } +} + +// +// Generate workflow version string +// +def getWorkflowVersion() { + def version_string = "" as String + if (workflow.manifest.version) { + def prefix_v = workflow.manifest.version[0] != 'v' ? 'v' : '' + version_string += "${prefix_v}${workflow.manifest.version}" + } + + if (workflow.commitId) { + def git_shortsha = workflow.commitId.substring(0, 7) + version_string += "-g${git_shortsha}" + } + + return version_string +} + +// +// Get software versions for pipeline +// +def processVersionsFromYAML(yaml_file) { + def yaml = new org.yaml.snakeyaml.Yaml() + def versions = yaml.load(yaml_file).collectEntries { k, v -> [k.tokenize(':')[-1], v] } + return yaml.dumpAsMap(versions).trim() +} + +// +// Get workflow version for pipeline +// +def workflowVersionToYAML() { + return """ + Workflow: + ${workflow.manifest.name}: ${getWorkflowVersion()} + Nextflow: ${workflow.nextflow.version} + """.stripIndent().trim() +} + +// +// Get channel of software versions used in pipeline in YAML format +// +def softwareVersionsToYAML(ch_versions) { + return ch_versions.unique().map { version -> processVersionsFromYAML(version) }.unique().mix(channel.of(workflowVersionToYAML())) +} + +// +// Get workflow summary for MultiQC +// +def paramsSummaryMultiqc(summary_params) { + def summary_section = '' + summary_params + .keySet() + .each { group -> + def group_params = summary_params.get(group) + // This gets the parameters of that particular group + if (group_params) { + summary_section += "
    <p style=\"font-size:110%\"><b>${group}</b></p>\n" + summary_section += "    <dl class=\"dl-horizontal\">\n" + group_params + .keySet() + .sort() + .each { param -> + summary_section += "        <dt>${param}</dt><dd><samp>${group_params.get(param) ?: '<span style=\"color:#999999;\">N/A</span>'}</samp></dd>\n" + } + summary_section += "    </dl>\n" + } + } + + def yaml_file_text = "id: '${workflow.manifest.name.replace('/', '-')}-summary'\n" as String + yaml_file_text += "description: ' - this information is collected when the pipeline is started.'\n" + yaml_file_text += "section_name: '${workflow.manifest.name} Workflow Summary'\n" + yaml_file_text += "section_href: 'https://github.com/${workflow.manifest.name}'\n" + yaml_file_text += "plot_type: 'html'\n" + yaml_file_text += "data: |\n" + yaml_file_text += "${summary_section}" + + return yaml_file_text +} + +// +// ANSII colours used for terminal logging +// +def logColours(monochrome_logs=true) { + def colorcodes = [:] as Map + + // Reset / Meta + colorcodes['reset'] = monochrome_logs ? '' : "\033[0m" + colorcodes['bold'] = monochrome_logs ? '' : "\033[1m" + colorcodes['dim'] = monochrome_logs ? '' : "\033[2m" + colorcodes['underlined'] = monochrome_logs ? '' : "\033[4m" + colorcodes['blink'] = monochrome_logs ? '' : "\033[5m" + colorcodes['reverse'] = monochrome_logs ? '' : "\033[7m" + colorcodes['hidden'] = monochrome_logs ? '' : "\033[8m" + + // Regular Colors + colorcodes['black'] = monochrome_logs ? '' : "\033[0;30m" + colorcodes['red'] = monochrome_logs ? '' : "\033[0;31m" + colorcodes['green'] = monochrome_logs ? '' : "\033[0;32m" + colorcodes['yellow'] = monochrome_logs ? '' : "\033[0;33m" + colorcodes['blue'] = monochrome_logs ? '' : "\033[0;34m" + colorcodes['purple'] = monochrome_logs ? '' : "\033[0;35m" + colorcodes['cyan'] = monochrome_logs ? '' : "\033[0;36m" + colorcodes['white'] = monochrome_logs ? '' : "\033[0;37m" + + // Bold + colorcodes['bblack'] = monochrome_logs ? '' : "\033[1;30m" + colorcodes['bred'] = monochrome_logs ? '' : "\033[1;31m" + colorcodes['bgreen'] = monochrome_logs ? '' : "\033[1;32m" + colorcodes['byellow'] = monochrome_logs ? '' : "\033[1;33m" + colorcodes['bblue'] = monochrome_logs ? '' : "\033[1;34m" + colorcodes['bpurple'] = monochrome_logs ? '' : "\033[1;35m" + colorcodes['bcyan'] = monochrome_logs ?
'' : "\033[1;36m" + colorcodes['bwhite'] = monochrome_logs ? '' : "\033[1;37m" + + // Underline + colorcodes['ublack'] = monochrome_logs ? '' : "\033[4;30m" + colorcodes['ured'] = monochrome_logs ? '' : "\033[4;31m" + colorcodes['ugreen'] = monochrome_logs ? '' : "\033[4;32m" + colorcodes['uyellow'] = monochrome_logs ? '' : "\033[4;33m" + colorcodes['ublue'] = monochrome_logs ? '' : "\033[4;34m" + colorcodes['upurple'] = monochrome_logs ? '' : "\033[4;35m" + colorcodes['ucyan'] = monochrome_logs ? '' : "\033[4;36m" + colorcodes['uwhite'] = monochrome_logs ? '' : "\033[4;37m" + + // High Intensity + colorcodes['iblack'] = monochrome_logs ? '' : "\033[0;90m" + colorcodes['ired'] = monochrome_logs ? '' : "\033[0;91m" + colorcodes['igreen'] = monochrome_logs ? '' : "\033[0;92m" + colorcodes['iyellow'] = monochrome_logs ? '' : "\033[0;93m" + colorcodes['iblue'] = monochrome_logs ? '' : "\033[0;94m" + colorcodes['ipurple'] = monochrome_logs ? '' : "\033[0;95m" + colorcodes['icyan'] = monochrome_logs ? '' : "\033[0;96m" + colorcodes['iwhite'] = monochrome_logs ? '' : "\033[0;97m" + + // Bold High Intensity + colorcodes['biblack'] = monochrome_logs ? '' : "\033[1;90m" + colorcodes['bired'] = monochrome_logs ? '' : "\033[1;91m" + colorcodes['bigreen'] = monochrome_logs ? '' : "\033[1;92m" + colorcodes['biyellow'] = monochrome_logs ? '' : "\033[1;93m" + colorcodes['biblue'] = monochrome_logs ? '' : "\033[1;94m" + colorcodes['bipurple'] = monochrome_logs ? '' : "\033[1;95m" + colorcodes['bicyan'] = monochrome_logs ? '' : "\033[1;96m" + colorcodes['biwhite'] = monochrome_logs ? 
'' : "\033[1;97m" + + return colorcodes +} + +// Return a single report from an object that may be a Path or List +// +def getSingleReport(multiqc_reports) { + if (multiqc_reports instanceof Path) { + return multiqc_reports + } else if (multiqc_reports instanceof List) { + if (multiqc_reports.size() == 0) { + log.warn("[${workflow.manifest.name}] No reports found from process 'MULTIQC'") + return null + } else if (multiqc_reports.size() == 1) { + return multiqc_reports.first() + } else { + log.warn("[${workflow.manifest.name}] Found multiple reports from process 'MULTIQC', will use only one") + return multiqc_reports.first() + } + } else { + return null + } +} + +// +// Construct and send completion email +// +def completionEmail(summary_params, email, email_on_fail, plaintext_email, outdir, monochrome_logs=true, multiqc_report=null) { + + // Set up the e-mail variables + def subject = "[${workflow.manifest.name}] Successful: ${workflow.runName}" + if (!workflow.success) { + subject = "[${workflow.manifest.name}] FAILED: ${workflow.runName}" + } + + def summary = [:] + summary_params + .keySet() + .sort() + .each { group -> + summary << summary_params[group] + } + + def misc_fields = [:] + misc_fields['Date Started'] = workflow.start + misc_fields['Date Completed'] = workflow.complete + misc_fields['Pipeline script file path'] = workflow.scriptFile + misc_fields['Pipeline script hash ID'] = workflow.scriptId + if (workflow.repository) { + misc_fields['Pipeline repository Git URL'] = workflow.repository + } + if (workflow.commitId) { + misc_fields['Pipeline repository Git Commit'] = workflow.commitId + } + if (workflow.revision) { + misc_fields['Pipeline Git branch/tag'] = workflow.revision + } + misc_fields['Nextflow Version'] = workflow.nextflow.version + misc_fields['Nextflow Build'] = workflow.nextflow.build + misc_fields['Nextflow Compile Timestamp'] = workflow.nextflow.timestamp + + def email_fields = [:] + email_fields['version'] = getWorkflowVersion() + 
email_fields['runName'] = workflow.runName + email_fields['success'] = workflow.success + email_fields['dateComplete'] = workflow.complete + email_fields['duration'] = workflow.duration + email_fields['exitStatus'] = workflow.exitStatus + email_fields['errorMessage'] = (workflow.errorMessage ?: 'None') + email_fields['errorReport'] = (workflow.errorReport ?: 'None') + email_fields['commandLine'] = workflow.commandLine + email_fields['projectDir'] = workflow.projectDir + email_fields['summary'] = summary << misc_fields + + // On success try attach the multiqc report + def mqc_report = getSingleReport(multiqc_report) + + // Check if we are only sending emails on failure + def email_address = email + if (!email && email_on_fail && !workflow.success) { + email_address = email_on_fail + } + + // Render the TXT template + def engine = new groovy.text.GStringTemplateEngine() + def tf = new File("${workflow.projectDir}/assets/email_template.txt") + def txt_template = engine.createTemplate(tf).make(email_fields) + def email_txt = txt_template.toString() + + // Render the HTML template + def hf = new File("${workflow.projectDir}/assets/email_template.html") + def html_template = engine.createTemplate(hf).make(email_fields) + def email_html = html_template.toString() + + // Render the sendmail template + def max_multiqc_email_size = (params.containsKey('max_multiqc_email_size') ? 
params.max_multiqc_email_size : 0) as MemoryUnit + def smail_fields = [email: email_address, subject: subject, email_txt: email_txt, email_html: email_html, projectDir: "${workflow.projectDir}", mqcFile: mqc_report, mqcMaxSize: max_multiqc_email_size.toBytes()] + def sf = new File("${workflow.projectDir}/assets/sendmail_template.txt") + def sendmail_template = engine.createTemplate(sf).make(smail_fields) + def sendmail_html = sendmail_template.toString() + + // Send the HTML e-mail + def colors = logColours(monochrome_logs) as Map + if (email_address) { + try { + if (plaintext_email) { + new org.codehaus.groovy.GroovyException('Send plaintext e-mail, not HTML') + } + // Try to send HTML e-mail using sendmail + def sendmail_tf = new File(workflow.launchDir.toString(), ".sendmail_tmp.html") + sendmail_tf.withWriter { w -> w << sendmail_html } + ['sendmail', '-t'].execute() << sendmail_html + log.info("-${colors.purple}[${workflow.manifest.name}]${colors.green} Sent summary e-mail to ${email_address} (sendmail)-") + } + catch (Exception msg) { + log.debug(msg.toString()) + log.debug("Trying with mail instead of sendmail") + // Catch failures and try with plaintext + def mail_cmd = ['mail', '-s', subject, '--content-type=text/html', email_address] + mail_cmd.execute() << email_html + log.info("-${colors.purple}[${workflow.manifest.name}]${colors.green} Sent summary e-mail to ${email_address} (mail)-") + } + } + + // Write summary e-mail HTML to a file + def output_hf = new File(workflow.launchDir.toString(), ".pipeline_report.html") + output_hf.withWriter { w -> w << email_html } + nextflow.extension.FilesEx.copyTo(output_hf.toPath(), "${outdir}/pipeline_info/pipeline_report.html") + output_hf.delete() + + // Write summary e-mail TXT to a file + def output_tf = new File(workflow.launchDir.toString(), ".pipeline_report.txt") + output_tf.withWriter { w -> w << email_txt } + nextflow.extension.FilesEx.copyTo(output_tf.toPath(), 
"${outdir}/pipeline_info/pipeline_report.txt") + output_tf.delete() +} + +// +// Print pipeline summary on completion +// +def completionSummary(monochrome_logs=true) { + def colors = logColours(monochrome_logs) as Map + if (workflow.success) { + if (workflow.stats.ignoredCount == 0) { + log.info("-${colors.purple}[${workflow.manifest.name}]${colors.green} Pipeline completed successfully${colors.reset}-") + } + else { + log.info("-${colors.purple}[${workflow.manifest.name}]${colors.yellow} Pipeline completed successfully, but with errored process(es) ${colors.reset}-") + } + } + else { + log.info("-${colors.purple}[${workflow.manifest.name}]${colors.red} Pipeline completed with errors${colors.reset}-") + } +} + +// +// Construct and send a notification to a web server as JSON e.g. Microsoft Teams and Slack +// +def imNotification(summary_params, hook_url) { + def summary = [:] + summary_params + .keySet() + .sort() + .each { group -> + summary << summary_params[group] + } + + def misc_fields = [:] + misc_fields['start'] = workflow.start + misc_fields['complete'] = workflow.complete + misc_fields['scriptfile'] = workflow.scriptFile + misc_fields['scriptid'] = workflow.scriptId + if (workflow.repository) { + misc_fields['repository'] = workflow.repository + } + if (workflow.commitId) { + misc_fields['commitid'] = workflow.commitId + } + if (workflow.revision) { + misc_fields['revision'] = workflow.revision + } + misc_fields['nxf_version'] = workflow.nextflow.version + misc_fields['nxf_build'] = workflow.nextflow.build + misc_fields['nxf_timestamp'] = workflow.nextflow.timestamp + + def msg_fields = [:] + msg_fields['version'] = getWorkflowVersion() + msg_fields['runName'] = workflow.runName + msg_fields['success'] = workflow.success + msg_fields['dateComplete'] = workflow.complete + msg_fields['duration'] = workflow.duration + msg_fields['exitStatus'] = workflow.exitStatus + msg_fields['errorMessage'] = (workflow.errorMessage ?: 'None') + msg_fields['errorReport'] = 
(workflow.errorReport ?: 'None') + msg_fields['commandLine'] = workflow.commandLine.replaceFirst(/ +--hook_url +[^ ]+/, "") + msg_fields['projectDir'] = workflow.projectDir + msg_fields['summary'] = summary << misc_fields + + // Render the JSON template + def engine = new groovy.text.GStringTemplateEngine() + // Different JSON depending on the service provider + // Defaults to "Adaptive Cards" (https://adaptivecards.io), except Slack which has its own format + def json_path = hook_url.contains("hooks.slack.com") ? "slackreport.json" : "adaptivecard.json" + def hf = new File("${workflow.projectDir}/assets/${json_path}") + def json_template = engine.createTemplate(hf).make(msg_fields) + def json_message = json_template.toString() + + // POST + def post = new URL(hook_url).openConnection() + post.setRequestMethod("POST") + post.setDoOutput(true) + post.setRequestProperty("Content-Type", "application/json") + post.getOutputStream().write(json_message.getBytes("UTF-8")) + def postRC = post.getResponseCode() + if (!postRC.equals(200)) { + log.warn(post.getErrorStream().getText()) + } +} diff --git a/subworkflows/nf-core/utils_nfcore_pipeline/meta.yml b/subworkflows/nf-core/utils_nfcore_pipeline/meta.yml new file mode 100644 index 0000000..d08d243 --- /dev/null +++ b/subworkflows/nf-core/utils_nfcore_pipeline/meta.yml @@ -0,0 +1,24 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/yaml-schema.json +name: "UTILS_NFCORE_PIPELINE" +description: Subworkflow with utility functions specific to the nf-core pipeline template +keywords: + - utility + - pipeline + - initialise + - version +components: [] +input: + - nextflow_cli_args: + type: list + description: | + Nextflow CLI positional arguments +output: + - success: + type: boolean + description: | + Dummy output to indicate success +authors: + - "@adamrtalbot" +maintainers: + - "@adamrtalbot" + - "@maxulysse" diff --git 
a/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.function.nf.test b/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.function.nf.test new file mode 100644 index 0000000..f117040 --- /dev/null +++ b/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.function.nf.test @@ -0,0 +1,126 @@ + +nextflow_function { + + name "Test Functions" + script "../main.nf" + config "subworkflows/nf-core/utils_nfcore_pipeline/tests/nextflow.config" + tag "subworkflows" + tag "subworkflows_nfcore" + tag "utils_nfcore_pipeline" + tag "subworkflows/utils_nfcore_pipeline" + + test("Test Function checkConfigProvided") { + + function "checkConfigProvided" + + then { + assertAll( + { assert function.success }, + { assert snapshot(function.result).match() } + ) + } + } + + test("Test Function checkProfileProvided") { + + function "checkProfileProvided" + + when { + function { + """ + input[0] = [] + """ + } + } + + then { + assertAll( + { assert function.success }, + { assert snapshot(function.result).match() } + ) + } + } + + test("Test Function without logColours") { + + function "logColours" + + when { + function { + """ + input[0] = true + """ + } + } + + then { + assertAll( + { assert function.success }, + { assert snapshot(function.result).match() } + ) + } + } + + test("Test Function with logColours") { + function "logColours" + + when { + function { + """ + input[0] = false + """ + } + } + + then { + assertAll( + { assert function.success }, + { assert snapshot(function.result).match() } + ) + } + } + + test("Test Function getSingleReport with a single file") { + function "getSingleReport" + + when { + function { + """ + input[0] = file(params.modules_testdata_base_path + '/generic/tsv/test.tsv', checkIfExists: true) + """ + } + } + + then { + assertAll( + { assert function.success }, + { assert function.result.contains("test.tsv") } + ) + } + } + + test("Test Function getSingleReport with multiple files") { + function "getSingleReport" + + when { + function { + """ + 
input[0] = [ + file(params.modules_testdata_base_path + '/generic/tsv/test.tsv', checkIfExists: true), + file(params.modules_testdata_base_path + '/generic/tsv/network.tsv', checkIfExists: true), + file(params.modules_testdata_base_path + '/generic/tsv/expression.tsv', checkIfExists: true) + ] + """ + } + } + + then { + assertAll( + { assert function.success }, + { assert function.result.contains("test.tsv") }, + { assert !function.result.contains("network.tsv") }, + { assert !function.result.contains("expression.tsv") } + ) + } + } +} diff --git a/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.function.nf.test.snap b/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.function.nf.test.snap new file mode 100644 index 0000000..02c6701 --- /dev/null +++ b/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.function.nf.test.snap @@ -0,0 +1,136 @@ +{ + "Test Function checkProfileProvided": { + "content": null, + "meta": { + "nf-test": "0.8.4", + "nextflow": "23.10.1" + }, + "timestamp": "2024-02-28T12:03:03.360873" + }, + "Test Function checkConfigProvided": { + "content": [ + true + ], + "meta": { + "nf-test": "0.8.4", + "nextflow": "23.10.1" + }, + "timestamp": "2024-02-28T12:02:59.729647" + }, + "Test Function without logColours": { + "content": [ + { + "reset": "", + "bold": "", + "dim": "", + "underlined": "", + "blink": "", + "reverse": "", + "hidden": "", + "black": "", + "red": "", + "green": "", + "yellow": "", + "blue": "", + "purple": "", + "cyan": "", + "white": "", + "bblack": "", + "bred": "", + "bgreen": "", + "byellow": "", + "bblue": "", + "bpurple": "", + "bcyan": "", + "bwhite": "", + "ublack": "", + "ured": "", + "ugreen": "", + "uyellow": "", + "ublue": "", + "upurple": "", + "ucyan": "", + "uwhite": "", + "iblack": "", + "ired": "", + "igreen": "", + "iyellow": "", + "iblue": "", + "ipurple": "", + "icyan": "", + "iwhite": "", + "biblack": "", + "bired": "", + "bigreen": "", + "biyellow": "", + "biblue": "", + "bipurple": "", + 
"bicyan": "", + "biwhite": "" + } + ], + "meta": { + "nf-test": "0.8.4", + "nextflow": "23.10.1" + }, + "timestamp": "2024-02-28T12:03:17.969323" + }, + "Test Function with logColours": { + "content": [ + { + "reset": "\u001b[0m", + "bold": "\u001b[1m", + "dim": "\u001b[2m", + "underlined": "\u001b[4m", + "blink": "\u001b[5m", + "reverse": "\u001b[7m", + "hidden": "\u001b[8m", + "black": "\u001b[0;30m", + "red": "\u001b[0;31m", + "green": "\u001b[0;32m", + "yellow": "\u001b[0;33m", + "blue": "\u001b[0;34m", + "purple": "\u001b[0;35m", + "cyan": "\u001b[0;36m", + "white": "\u001b[0;37m", + "bblack": "\u001b[1;30m", + "bred": "\u001b[1;31m", + "bgreen": "\u001b[1;32m", + "byellow": "\u001b[1;33m", + "bblue": "\u001b[1;34m", + "bpurple": "\u001b[1;35m", + "bcyan": "\u001b[1;36m", + "bwhite": "\u001b[1;37m", + "ublack": "\u001b[4;30m", + "ured": "\u001b[4;31m", + "ugreen": "\u001b[4;32m", + "uyellow": "\u001b[4;33m", + "ublue": "\u001b[4;34m", + "upurple": "\u001b[4;35m", + "ucyan": "\u001b[4;36m", + "uwhite": "\u001b[4;37m", + "iblack": "\u001b[0;90m", + "ired": "\u001b[0;91m", + "igreen": "\u001b[0;92m", + "iyellow": "\u001b[0;93m", + "iblue": "\u001b[0;94m", + "ipurple": "\u001b[0;95m", + "icyan": "\u001b[0;96m", + "iwhite": "\u001b[0;97m", + "biblack": "\u001b[1;90m", + "bired": "\u001b[1;91m", + "bigreen": "\u001b[1;92m", + "biyellow": "\u001b[1;93m", + "biblue": "\u001b[1;94m", + "bipurple": "\u001b[1;95m", + "bicyan": "\u001b[1;96m", + "biwhite": "\u001b[1;97m" + } + ], + "meta": { + "nf-test": "0.8.4", + "nextflow": "23.10.1" + }, + "timestamp": "2024-02-28T12:03:21.714424" + } +} \ No newline at end of file diff --git a/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.workflow.nf.test b/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.workflow.nf.test new file mode 100644 index 0000000..8940d32 --- /dev/null +++ b/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.workflow.nf.test @@ -0,0 +1,29 @@ +nextflow_workflow { + + name "Test Workflow 
UTILS_NFCORE_PIPELINE" + script "../main.nf" + config "subworkflows/nf-core/utils_nfcore_pipeline/tests/nextflow.config" + workflow "UTILS_NFCORE_PIPELINE" + tag "subworkflows" + tag "subworkflows_nfcore" + tag "utils_nfcore_pipeline" + tag "subworkflows/utils_nfcore_pipeline" + + test("Should run without failures") { + + when { + workflow { + """ + input[0] = [] + """ + } + } + + then { + assertAll( + { assert workflow.success }, + { assert snapshot(workflow.out).match() } + ) + } + } +} diff --git a/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.workflow.nf.test.snap b/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.workflow.nf.test.snap new file mode 100644 index 0000000..859d103 --- /dev/null +++ b/subworkflows/nf-core/utils_nfcore_pipeline/tests/main.workflow.nf.test.snap @@ -0,0 +1,19 @@ +{ + "Should run without failures": { + "content": [ + { + "0": [ + true + ], + "valid_config": [ + true + ] + } + ], + "meta": { + "nf-test": "0.8.4", + "nextflow": "23.10.1" + }, + "timestamp": "2024-02-28T12:03:25.726491" + } +} \ No newline at end of file diff --git a/subworkflows/nf-core/utils_nfcore_pipeline/tests/nextflow.config b/subworkflows/nf-core/utils_nfcore_pipeline/tests/nextflow.config new file mode 100644 index 0000000..d0a926b --- /dev/null +++ b/subworkflows/nf-core/utils_nfcore_pipeline/tests/nextflow.config @@ -0,0 +1,9 @@ +manifest { + name = 'nextflow_workflow' + author = """nf-core""" + homePage = 'https://127.0.0.1' + description = """Dummy pipeline""" + nextflowVersion = '!>=23.04.0' + version = '9.9.9' + doi = 'https://doi.org/10.5281/zenodo.5070524' +} diff --git a/subworkflows/nf-core/utils_nfschema_plugin/main.nf b/subworkflows/nf-core/utils_nfschema_plugin/main.nf new file mode 100644 index 0000000..acb3972 --- /dev/null +++ b/subworkflows/nf-core/utils_nfschema_plugin/main.nf @@ -0,0 +1,73 @@ +// +// Subworkflow that uses the nf-schema plugin to validate parameters and render the parameter summary +// + +include { paramsSummaryLog 
} from 'plugin/nf-schema' +include { validateParameters } from 'plugin/nf-schema' +include { paramsHelp } from 'plugin/nf-schema' + +workflow UTILS_NFSCHEMA_PLUGIN { + + take: + input_workflow // workflow: the workflow object used by nf-schema to get metadata from the workflow + validate_params // boolean: validate the parameters + parameters_schema // string: path to the parameters JSON schema. + // this has to be the same as the schema given to `validation.parametersSchema` + // when this input is empty it will automatically use the configured schema or + // "${projectDir}/nextflow_schema.json" as default. This input should not be empty + // for meta pipelines + help // boolean: show help message + help_full // boolean: show full help message + show_hidden // boolean: show hidden parameters in help message + before_text // string: text to show before the help message and parameters summary + after_text // string: text to show after the help message and parameters summary + command // string: an example command of the pipeline + + main: + + if(help || help_full) { + help_options = [ + beforeText: before_text, + afterText: after_text, + command: command, + showHidden: show_hidden, + fullHelp: help_full, + ] + if(parameters_schema) { + help_options << [parametersSchema: parameters_schema] + } + log.info paramsHelp( + help_options, + params.help instanceof String ? params.help : "", + ) + exit 0 + } + + // + // Print parameter summary to stdout. 
This will display the parameters + // that differ from the default given in the JSON schema + // + + summary_options = [:] + if(parameters_schema) { + summary_options << [parametersSchema: parameters_schema] + } + log.info before_text + log.info paramsSummaryLog(summary_options, input_workflow) + log.info after_text + + // + // Validate the parameters using nextflow_schema.json or the schema + // given via the validation.parametersSchema configuration option + // + if(validate_params) { + validateOptions = [:] + if(parameters_schema) { + validateOptions << [parametersSchema: parameters_schema] + } + validateParameters(validateOptions) + } + + emit: + dummy_emit = true +} diff --git a/subworkflows/nf-core/utils_nfschema_plugin/meta.yml b/subworkflows/nf-core/utils_nfschema_plugin/meta.yml new file mode 100644 index 0000000..f7d9f02 --- /dev/null +++ b/subworkflows/nf-core/utils_nfschema_plugin/meta.yml @@ -0,0 +1,35 @@ +# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/yaml-schema.json +name: "utils_nfschema_plugin" +description: Run nf-schema to validate parameters and create a summary of changed parameters +keywords: + - validation + - JSON schema + - plugin + - parameters + - summary +components: [] +input: + - input_workflow: + type: object + description: | + The workflow object of the used pipeline. + This object contains meta data used to create the params summary log + - validate_params: + type: boolean + description: Validate the parameters and error if invalid. + - parameters_schema: + type: string + description: | + Path to the parameters JSON schema. + This has to be the same as the schema given to the `validation.parametersSchema` config + option. When this input is empty it will automatically use the configured schema or + "${projectDir}/nextflow_schema.json" as default. The schema should not be given in this way + for meta pipelines. 
+output: + - dummy_emit: + type: boolean + description: Dummy emit to make nf-core subworkflows lint happy +authors: + - "@nvnieuwk" +maintainers: + - "@nvnieuwk" diff --git a/subworkflows/nf-core/utils_nfschema_plugin/tests/main.nf.test b/subworkflows/nf-core/utils_nfschema_plugin/tests/main.nf.test new file mode 100644 index 0000000..c977917 --- /dev/null +++ b/subworkflows/nf-core/utils_nfschema_plugin/tests/main.nf.test @@ -0,0 +1,173 @@ +nextflow_workflow { + + name "Test Subworkflow UTILS_NFSCHEMA_PLUGIN" + script "../main.nf" + workflow "UTILS_NFSCHEMA_PLUGIN" + + tag "subworkflows" + tag "subworkflows_nfcore" + tag "subworkflows/utils_nfschema_plugin" + tag "plugin/nf-schema" + + config "./nextflow.config" + + test("Should run nothing") { + + when { + + params { + test_data = '' + } + + workflow { + """ + validate_params = false + input[0] = workflow + input[1] = validate_params + input[2] = "" + input[3] = false + input[4] = false + input[5] = false + input[6] = "" + input[7] = "" + input[8] = "" + """ + } + } + + then { + assertAll( + { assert workflow.success } + ) + } + } + + test("Should validate params") { + + when { + + params { + test_data = '' + outdir = null + } + + workflow { + """ + validate_params = true + input[0] = workflow + input[1] = validate_params + input[2] = "" + input[3] = false + input[4] = false + input[5] = false + input[6] = "" + input[7] = "" + input[8] = "" + """ + } + } + + then { + assertAll( + { assert workflow.failed }, + { assert workflow.stdout.any { it.contains('ERROR ~ Validation of pipeline parameters failed!') } } + ) + } + } + + test("Should run nothing - custom schema") { + + when { + + params { + test_data = '' + } + + workflow { + """ + validate_params = false + input[0] = workflow + input[1] = validate_params + input[2] = "${projectDir}/subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow_schema.json" + input[3] = false + input[4] = false + input[5] = false + input[6] = "" + input[7] = "" + input[8] = "" + 
""" + } + } + + then { + assertAll( + { assert workflow.success } + ) + } + } + + test("Should validate params - custom schema") { + + when { + + params { + test_data = '' + outdir = null + } + + workflow { + """ + validate_params = true + input[0] = workflow + input[1] = validate_params + input[2] = "${projectDir}/subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow_schema.json" + input[3] = false + input[4] = false + input[5] = false + input[6] = "" + input[7] = "" + input[8] = "" + """ + } + } + + then { + assertAll( + { assert workflow.failed }, + { assert workflow.stdout.any { it.contains('ERROR ~ Validation of pipeline parameters failed!') } } + ) + } + } + + test("Should create a help message") { + + when { + + params { + test_data = '' + outdir = null + } + + workflow { + """ + validate_params = true + input[0] = workflow + input[1] = validate_params + input[2] = "${projectDir}/subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow_schema.json" + input[3] = true + input[4] = false + input[5] = false + input[6] = "Before" + input[7] = "After" + input[8] = "nextflow run test/test" + """ + } + } + + then { + assertAll( + { assert workflow.success } + ) + } + } +} diff --git a/subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow.config b/subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow.config new file mode 100644 index 0000000..8d8c737 --- /dev/null +++ b/subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow.config @@ -0,0 +1,8 @@ +plugins { + id "nf-schema@2.5.1" +} + +validation { + parametersSchema = "${projectDir}/subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow_schema.json" + monochromeLogs = true +} diff --git a/subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow_schema.json b/subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow_schema.json new file mode 100644 index 0000000..331e0d2 --- /dev/null +++ b/subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow_schema.json @@ -0,0 +1,96 @@ +{ + "$schema": 
"https://json-schema.org/draft/2020-12/schema", + "$id": "https://raw.githubusercontent.com/./master/nextflow_schema.json", + "title": ". pipeline parameters", + "description": "", + "type": "object", + "$defs": { + "input_output_options": { + "title": "Input/output options", + "type": "object", + "fa_icon": "fas fa-terminal", + "description": "Define where the pipeline should find input data and save output data.", + "required": ["outdir"], + "properties": { + "validate_params": { + "type": "boolean", + "description": "Validate parameters?", + "default": true, + "hidden": true + }, + "outdir": { + "type": "string", + "format": "directory-path", + "description": "The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.", + "fa_icon": "fas fa-folder-open" + }, + "test_data_base": { + "type": "string", + "default": "https://raw.githubusercontent.com/nf-core/test-datasets/modules", + "description": "Base for test data directory", + "hidden": true + }, + "test_data": { + "type": "string", + "description": "Fake test data param", + "hidden": true + } + } + }, + "generic_options": { + "title": "Generic options", + "type": "object", + "fa_icon": "fas fa-file-import", + "description": "Less common options for the pipeline, typically set in a config file.", + "help_text": "These options are common to all nf-core pipelines and allow you to customise some of the core preferences for how the pipeline runs.\n\nTypically these options would be set in a Nextflow config file loaded for all pipeline runs, such as `~/.nextflow/config`.", + "properties": { + "help": { + "type": "boolean", + "description": "Display help text.", + "fa_icon": "fas fa-question-circle", + "hidden": true + }, + "version": { + "type": "boolean", + "description": "Display version and exit.", + "fa_icon": "fas fa-question-circle", + "hidden": true + }, + "logo": { + "type": "boolean", + "default": true, + "description": "Display nf-core logo in 
console output.", + "fa_icon": "fas fa-image", + "hidden": true + }, + "singularity_pull_docker_container": { + "type": "boolean", + "description": "Pull Singularity container from Docker?", + "hidden": true + }, + "publish_dir_mode": { + "type": "string", + "default": "copy", + "description": "Method used to save pipeline results to output directory.", + "help_text": "The Nextflow `publishDir` option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See [Nextflow docs](https://www.nextflow.io/docs/latest/process.html#publishdir) for details.", + "fa_icon": "fas fa-copy", + "enum": ["symlink", "rellink", "link", "copy", "copyNoFollow", "move"], + "hidden": true + }, + "monochrome_logs": { + "type": "boolean", + "description": "Use monochrome_logs", + "hidden": true + } + } + } + }, + "allOf": [ + { + "$ref": "#/$defs/input_output_options" + }, + { + "$ref": "#/$defs/generic_options" + } + ] +} diff --git a/tests/.nftignore b/tests/.nftignore new file mode 100644 index 0000000..e128a12 --- /dev/null +++ b/tests/.nftignore @@ -0,0 +1,12 @@ +.DS_Store +multiqc/multiqc_data/fastqc_top_overrepresented_sequences_table.txt +multiqc/multiqc_data/multiqc.parquet +multiqc/multiqc_data/multiqc.log +multiqc/multiqc_data/multiqc_data.json +multiqc/multiqc_data/multiqc_sources.txt +multiqc/multiqc_data/multiqc_software_versions.txt +multiqc/multiqc_data/llms-full.txt +multiqc/multiqc_plots/{svg,pdf,png}/*.{svg,pdf,png} +multiqc/multiqc_report.html +fastqc/*_fastqc.{html,zip} +pipeline_info/*.{html,json,txt,yml} diff --git a/tests/default.nf.test b/tests/default.nf.test new file mode 100644 index 0000000..ee3179e --- /dev/null +++ b/tests/default.nf.test @@ -0,0 +1,33 @@ +nextflow_pipeline { + + name "Test pipeline" + script "../main.nf" + tag "pipeline" + + test("-profile test") { + + when { + params { + outdir = "$outputDir" + } + } + + then { + // stable_name: All files + 
folders in ${params.outdir}/ with a stable name + def stable_name = getAllFilesFromDir(params.outdir, relative: true, includeDir: true, ignore: ['pipeline_info/*.{html,json,txt}']) + // stable_path: All files in ${params.outdir}/ with stable content + def stable_path = getAllFilesFromDir(params.outdir, ignoreFile: 'tests/.nftignore') + assertAll( + { assert workflow.success}, + { assert snapshot( + // pipeline versions.yml file for multiqc from which the Nextflow version is removed, because we test pipelines on multiple Nextflow versions + removeNextflowVersion("$outputDir/pipeline_info/nf_core_quantmsdiann_software_mqc_versions.yml"), + // All stable path names, with relative paths + stable_name, + // All files with stable contents + stable_path + ).match() } + ) + } + } +} diff --git a/tests/nextflow.config b/tests/nextflow.config new file mode 100644 index 0000000..9fb43e8 --- /dev/null +++ b/tests/nextflow.config @@ -0,0 +1,13 @@ +/* +======================================================================================== + Nextflow config file for running nf-test tests +======================================================================================== +*/ + +// Pipeline-specific test parameters +params { + modules_testdata_base_path = 'https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/' + pipelines_testdata_base_path = 'https://raw.githubusercontent.com/nf-core/test-datasets/refs/heads/quantms' +} + +aws.client.anonymous = true // fixes S3 access issues on self-hosted runners diff --git a/workflows/dia.nf b/workflows/dia.nf new file mode 100644 index 0000000..8a1b13b --- /dev/null +++ b/workflows/dia.nf @@ -0,0 +1,219 @@ +/* +======================================================================================== + IMPORT LOCAL MODULES/SUBWORKFLOWS +======================================================================================== +*/ + +// +// MODULES: Local to the pipeline +// +include { DIANN_MSSTATS } from 
'../modules/local/diann/diann_msstats/main' +include { PRELIMINARY_ANALYSIS } from '../modules/local/diann/preliminary_analysis/main' +include { ASSEMBLE_EMPIRICAL_LIBRARY } from '../modules/local/diann/assemble_empirical_library/main' +include { INSILICO_LIBRARY_GENERATION } from '../modules/local/diann/insilico_library_generation/main' +include { INDIVIDUAL_ANALYSIS } from '../modules/local/diann/individual_analysis/main' +include { FINAL_QUANTIFICATION } from '../modules/local/diann/final_quantification/main' + +// +// SUBWORKFLOWS: Consisting of a mix of local and nf-core/modules +// + +/* +======================================================================================== + RUN MAIN WORKFLOW +======================================================================================== +*/ + + +workflow DIA { + take: + ch_file_preparation_results + ch_expdesign + ch_diann_cfg + + main: + + ch_software_versions = channel.empty() + ch_searchdb = channel.fromPath(params.database, checkIfExists: true).first() + + ch_file_preparation_results.multiMap { + result -> + meta: preprocessed_meta(result[0]) + ms_file: result[1] + }.set { ch_result } + + ch_experiment_meta = ch_result.meta.unique { m -> m.experiment_id }.first() + + // diann_config.cfg comes directly from SDRF_PARSING (convert-diann) + // Use as value channel so it can be consumed by all per-file processes + ch_diann_cfg_val = ch_diann_cfg + + // + // MODULE: INSILICO_LIBRARY_GENERATION + // + if (params.diann_speclib != null && params.diann_speclib.toString() != "") { + speclib = channel.from(file(params.diann_speclib, checkIfExists: true)) + } else { + INSILICO_LIBRARY_GENERATION(ch_searchdb, ch_diann_cfg_val) + speclib = INSILICO_LIBRARY_GENERATION.out.predict_speclib + } + + if (params.skip_preliminary_analysis) { + // Users who skip preliminary analysis provide mass accuracy and scan window directly + ch_parsed_vals = channel.value("${params.mass_acc_ms2},${params.mass_acc_ms1},${params.scan_window}") + 
indiv_fin_analysis_in = ch_file_preparation_results + .combine(ch_searchdb) + .combine(speclib) + .combine(ch_parsed_vals) + .map { meta_map, ms_file, fasta, library, param_string -> + def values = param_string.trim().split(',') + def new_meta = meta_map + [ + mass_acc_ms2 : values[0], + mass_acc_ms1 : values[1], + scan_window : values[2] + ] + return [ new_meta, ms_file, fasta, library ] + } + empirical_lib = speclib + } else { + // + // MODULE: PRELIMINARY_ANALYSIS + // + if (params.random_preanalysis) { + preanalysis_subset = ch_file_preparation_results + .toSortedList{ a, b -> file(a[1]).getName() <=> file(b[1]).getName() } + .flatMap() + .randomSample(params.empirical_assembly_ms_n, params.random_preanalysis_seed) + empirical_lib_files = preanalysis_subset + .map { result -> result[1] } + .collect( sort: { a, b -> file(a).getName() <=> file(b).getName() } ) + PRELIMINARY_ANALYSIS(preanalysis_subset.combine(speclib), ch_diann_cfg_val) + } else { + empirical_lib_files = ch_file_preparation_results + .map { result -> result[1] } + .collect( sort: { a, b -> file(a).getName() <=> file(b).getName() } ) + PRELIMINARY_ANALYSIS(ch_file_preparation_results.combine(speclib), ch_diann_cfg_val) + } + ch_software_versions = ch_software_versions + .mix(PRELIMINARY_ANALYSIS.out.versions) + + // + // MODULE: ASSEMBLE_EMPIRICAL_LIBRARY + // + // Order matters in DIA-NN; inputs are sorted for reproducible results. 
+ ASSEMBLE_EMPIRICAL_LIBRARY( + empirical_lib_files, + ch_experiment_meta, + PRELIMINARY_ANALYSIS.out.diann_quant.collect(), + speclib, + ch_diann_cfg_val + ) + ch_software_versions = ch_software_versions + .mix(ASSEMBLE_EMPIRICAL_LIBRARY.out.versions) + // Parse calibrated params from the assembly log on the head node + ch_parsed_vals = ASSEMBLE_EMPIRICAL_LIBRARY.out.log + .map { log_file -> + def match = log_file.text.readLines().find { it.contains("Averaged recommended settings") } + if (match) { + def parts = match.trim().split(/\s+/) + def ms2 = parts.size() > 10 ? parts[10].replaceAll(/[^0-9.]/, '') : "${params.mass_acc_ms2}" + def ms1 = parts.size() > 14 ? parts[14].replaceAll(/[^0-9.]/, '') : "${params.mass_acc_ms1}" + def sw = parts.size() > 18 ? parts[18].replaceAll(/[^0-9.]/, '') : "${params.scan_window}" + return "${ms2},${ms1},${sw}" + } + return "${params.mass_acc_ms2},${params.mass_acc_ms1},${params.scan_window}" + } + indiv_fin_analysis_in = ch_file_preparation_results + .combine(ch_searchdb) + .combine(ASSEMBLE_EMPIRICAL_LIBRARY.out.empirical_library) + .combine(ch_parsed_vals) + .map { meta_map, ms_file, fasta, library, param_string -> + def values = param_string.trim().split(',') + def new_meta = meta_map + [ + mass_acc_ms2 : values[0], + mass_acc_ms1 : values[1], + scan_window : values[2] + ] + return [ new_meta, ms_file, fasta, library ] + } + empirical_lib = ASSEMBLE_EMPIRICAL_LIBRARY.out.empirical_library + } + + // + // MODULE: INDIVIDUAL_ANALYSIS + // + INDIVIDUAL_ANALYSIS(indiv_fin_analysis_in, ch_diann_cfg_val) + ch_software_versions = ch_software_versions + .mix(INDIVIDUAL_ANALYSIS.out.versions) + + // + // MODULE: FINAL_QUANTIFICATION + // + // Order matters in DIA-NN; inputs are sorted for reproducible results. + // NOTE: ch_result.ms_file contains the name of the ms file, not the path. 
+ // The next step only needs the name (since it uses the cached .quant). + // Converting each element to a file object and taking its name is necessary because, + // locally, every element in ch_result.ms_file is a string, whilst on the cloud it is a path. + ch_result + .ms_file.map { msfile -> file(msfile).getName() } + .collect(sort: true) + .set { ms_file_names } + + FINAL_QUANTIFICATION( + ms_file_names, + ch_experiment_meta, + empirical_lib, + INDIVIDUAL_ANALYSIS.out.diann_quant.collect(), + ch_searchdb, + ch_diann_cfg_val) + + ch_software_versions = ch_software_versions.mix( + FINAL_QUANTIFICATION.out.versions + ) + + diann_main_report = FINAL_QUANTIFICATION.out.main_report + + // + // MODULE: DIANN_MSSTATS — Convert DIA-NN report to MSstats-compatible format + // + DIANN_MSSTATS( + diann_main_report, + ch_expdesign + ) + ch_software_versions = ch_software_versions + .mix(DIANN_MSSTATS.out.versions) + + emit: + versions = ch_software_versions + diann_report = diann_main_report + diann_log = FINAL_QUANTIFICATION.out.log + msstats_in = DIANN_MSSTATS.out.out_msstats +} + +// Strip meta.id so that identical inputs produce an identical cache hash +def preprocessed_meta(LinkedHashMap meta) { + def parameters = [:] + parameters['experiment_id'] = meta.experiment_id + parameters['acquisition_method'] = meta.acquisition_method + parameters['dissociationmethod'] = meta.dissociationmethod + parameters['labelling_type'] = meta.labelling_type + parameters['fixedmodifications'] = meta.fixedmodifications + parameters['variablemodifications'] = meta.variablemodifications + parameters['precursormasstolerance'] = meta.precursormasstolerance + parameters['precursormasstoleranceunit'] = meta.precursormasstoleranceunit + parameters['fragmentmasstolerance'] = meta.fragmentmasstolerance + parameters['fragmentmasstoleranceunit'] = meta.fragmentmasstoleranceunit + parameters['enzyme'] = meta.enzyme + parameters['ms1minmz'] = meta.ms1minmz + parameters['ms1maxmz'] = meta.ms1maxmz + 
parameters['ms2minmz'] = meta.ms2minmz + parameters['ms2maxmz'] = meta.ms2maxmz + + return parameters +} + +/* +======================================================================================== + THE END +======================================================================================== +*/ diff --git a/workflows/quantmsdiann.nf b/workflows/quantmsdiann.nf new file mode 100644 index 0000000..079858a --- /dev/null +++ b/workflows/quantmsdiann.nf @@ -0,0 +1,125 @@ +/* +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + PRINT PARAMS SUMMARY +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +*/ + +include { paramsSummaryMap } from 'plugin/nf-schema' +include { paramsSummaryMultiqc } from '../subworkflows/nf-core/utils_nfcore_pipeline' +include { softwareVersionsToYAML } from '../subworkflows/nf-core/utils_nfcore_pipeline' +include { methodsDescriptionText } from '../subworkflows/local/utils_nfcore_quantms_pipeline' + +// Main subworkflows imported from the pipeline DIA +include { DIA } from './dia' + +// SUBWORKFLOW: Consisting of a mix of local and nf-core/modules +include { INPUT_CHECK } from '../subworkflows/local/input_check/main' +include { FILE_PREPARATION } from '../subworkflows/local/file_preparation/main' +include { CREATE_INPUT_CHANNEL } from '../subworkflows/local/create_input_channel/main' + +// Modules import from the pipeline +include { PMULTIQC as SUMMARY_PIPELINE } from '../modules/local/pmultiqc/main' + +/* +======================================================================================== + RUN MAIN WORKFLOW +======================================================================================== +*/ + + +workflow QUANTMSDIANN { + + main: + + ch_versions = channel.empty() + + // + // SUBWORKFLOW: Read in samplesheet, validate and stage input files + // + INPUT_CHECK( + file(params.input) + ) + ch_versions = ch_versions.mix(INPUT_CHECK.out.versions) + + 
// + // SUBWORKFLOW: Create input channel + // + CREATE_INPUT_CHANNEL( + INPUT_CHECK.out.ch_input_file + ) + ch_versions = ch_versions.mix(CREATE_INPUT_CHANNEL.out.versions) + + // + // SUBWORKFLOW: File preparation + // + FILE_PREPARATION( + CREATE_INPUT_CHANNEL.out.ch_meta_config_dia + ) + + ch_versions = ch_versions.mix(FILE_PREPARATION.out.versions) + + FILE_PREPARATION.out.results + .branch { item -> + dia: item[0].acquisition_method.toLowerCase().contains("dia") + } + .set { ch_fileprep_result } + // + // WORKFLOW: Run the main bigbio/quantmsdiann DIA analysis pipeline + // + ch_pipeline_results = channel.empty() + ch_ids_pmultiqc = channel.empty() + ch_msstats_in = channel.empty() + ch_consensus_pmultiqc = channel.empty() + + DIA( + ch_fileprep_result.dia, + CREATE_INPUT_CHANNEL.out.ch_expdesign, + CREATE_INPUT_CHANNEL.out.ch_diann_cfg, + ) + ch_pipeline_results = ch_pipeline_results.mix(DIA.out.diann_report) + ch_msstats_in = ch_msstats_in.mix(DIA.out.msstats_in) + ch_versions = ch_versions.mix(DIA.out.versions) + + // Subworkflows may emit null version entries when invoked from another subworkflow (cause unknown); filter them out. + ch_versions = ch_versions.filter { v -> v != null } + + softwareVersionsToYAML(ch_versions) + .collectFile( + storeDir: "${params.outdir}/pipeline_info", + name: 'nf_core_' + 'quantmsdiann_software_' + 'mqc_' + 'versions.yml', + sort: true, + newLine: true, + ) + .set { ch_collated_versions } + + ch_multiqc_config = channel.fromPath("${projectDir}/assets/multiqc_config.yml", checkIfExists: true) + summary_params = paramsSummaryMap(workflow, parameters_schema: "nextflow_schema.json") + ch_workflow_summary = channel.value(paramsSummaryMultiqc(summary_params)) + ch_multiqc_custom_methods_description = params.multiqc_methods_description + ? 
file(params.multiqc_methods_description, checkIfExists: true) + : file("${projectDir}/assets/methods_description_template.yml", checkIfExists: true) + ch_methods_description = channel.value(methodsDescriptionText(ch_multiqc_custom_methods_description)) + // concatenate multiqc input files + ch_multiqc_files = channel.empty() + ch_multiqc_files = ch_multiqc_files.mix(ch_multiqc_config) + ch_multiqc_files = ch_multiqc_files.mix(ch_workflow_summary.collectFile(name: 'workflow_summary_mqc.yaml')) + ch_multiqc_files = ch_multiqc_files.mix(FILE_PREPARATION.out.statistics) + ch_multiqc_files = ch_multiqc_files.mix(DIA.out.diann_log) + ch_multiqc_files = ch_multiqc_files.mix(ch_collated_versions) + ch_multiqc_files = ch_multiqc_files.mix(ch_methods_description.collectFile(name: 'methods_description_mqc.yaml', sort: false)) + + // Gather all pMultiQC inputs into a single list + multiqc_inputs = CREATE_INPUT_CHANNEL.out.ch_expdesign + .mix(ch_pipeline_results.ifEmpty([])) + .mix(ch_multiqc_files.collect()) + .mix(ch_ids_pmultiqc.collect().ifEmpty([])) + .mix(ch_consensus_pmultiqc.collect().ifEmpty([])) + .mix(ch_msstats_in.ifEmpty([])) + .collect() + + SUMMARY_PIPELINE(multiqc_inputs) + + emit: + multiqc_report = SUMMARY_PIPELINE.out.ch_pmultiqc_report.toList() + versions = ch_versions +}
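
The calibrated-parameter extraction in `workflows/dia.nf` above (the `ch_parsed_vals` closure) can be sketched outside Nextflow. The following is a minimal Python sketch, not the pipeline's exact logic: the sample log line and the `parse_recommended_settings` helper are hypothetical, DIA-NN's real wording varies between versions, and the pipeline itself relies on fixed whitespace-token positions (`parts[10]`, `parts[14]`, `parts[18]`) rather than the regex used here.

```python
import re

def parse_recommended_settings(log_text, defaults=("20", "10", "8")):
    """Return 'ms2,ms1,scan_window' parsed from a DIA-NN assembly log.

    Falls back to the supplied defaults when the settings line is absent,
    mirroring the fallback to params.* in workflows/dia.nf.
    """
    line = next((l for l in log_text.splitlines()
                 if "Averaged recommended settings" in l), None)
    if line is None:
        return ",".join(defaults)
    # Capture the numeric value after each '=' sign (the pipeline instead
    # picks fixed token positions and strips non-numeric characters).
    nums = re.findall(r"=\s*(\d+(?:\.\d+)?)", line)
    if len(nums) < 3:
        return ",".join(defaults)
    ms2, ms1, scan_window = nums[:3]
    return f"{ms2},{ms1},{scan_window}"

# Hypothetical log line -- illustrative only, not DIA-NN's exact output.
log = ("Averaged recommended settings for this experiment: "
       "Mass accuracy = 12.5 ppm, MS1 accuracy = 9.0 ppm, Scan window = 7")
print(parse_recommended_settings(log))            # 12.5,9.0,7
print(parse_recommended_settings("no such line")) # 20,10,8
```

The comma-separated string matches what the workflow later splits with `param_string.trim().split(',')` when building `new_meta` for `INDIVIDUAL_ANALYSIS` and `FINAL_QUANTIFICATION`.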