Various MultiQC issues: FastQC sections for raw and trimmed reads // umi-tools dedup and extraction plots, custom content styling. #1308
Some progress:
Thanks @MatthiasZepper !!
I read through your write-up but was a little unclear as to what is still missing here? |
To copy in @MatthiasZepper's note on this from Slack: I am somewhat stuck with #1308, both because of a lack of time recently and a lack of ideas. I believed that I had fixed 3 of the 4 issues, with the 4th, the inconsistent naming of the TrimGalore! output, being fairly negligible. However, it turns out that I have not fixed the main issue yet. The reports generated by MultiQC when run inside the pipeline and when run manually on the outdir of the pipeline differ. The manual runs look exactly how I want them, so I thought it should be good, but the pipeline version does not behave the same way. In the pipeline version, the path_filters in the MultiQC config (workflows/rnaseq/assets/multiqc/multiqc_config.yml) are not applied:
I think that is because the file paths in ch_multiqc_files still point to the work dir and do not yet correspond to the final folder structure specified by the publishDir directives when I mix the output into the channel…
… but since I can’t do a proper introspection into the channel (a .view() or .collectFile() completely crashes the pipeline), I don’t know for sure.
OK, I know the fix @MatthiasZepper, I sorted this in riboseq. The issue is that the file structure is flat by the time it gets to MultiQC. We need to do like:
... and then:
So we're using the |
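The flat-staging problem described above can be illustrated with a minimal sketch. It uses Python's fnmatch rather than MultiQC's own matching code, and the two patterns and the file paths are hypothetical examples, not copied from multiqc_config.yml. The point is only that a glob relying on directory components matches the published path but not a flat, staged file name, whereas a glob on the file name matches both.

```python
from fnmatch import fnmatch

# Hypothetical path filters: one keyed on the published directory layout,
# one keyed only on the file name.
DIR_FILTER = "*/fastqc/raw/*_fastqc.zip"
NAME_FILTER = "*_raw_fastqc.zip"

# Hypothetical locations of the same report.
published = "results/fastqc/raw/WT_REP1_raw_fastqc.zip"  # after publishDir
staged = "WT_REP1_raw_fastqc.zip"  # flat, as staged into the MultiQC task dir


def matches(path, pattern):
    """Return True if the glob pattern matches the whole path."""
    return fnmatch(path, pattern)


# Directory-based filter: hits the published path only.
dir_hits = [matches(p, DIR_FILTER) for p in (published, staged)]
# File-name-based filter: hits both locations.
name_hits = [matches(p, NAME_FILTER) for p in (published, staged)]
```

Under these assumptions, a manual MultiQC run on the outdir and a run inside the pipeline would only agree for the file-name-based filter, which fits the difference between the two reports noted earlier in the thread.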
Thank you so much! That would be fantastic! You should be able to push to the branch since you are a maintainer, but just in case, I have also invited you as a collaborator to my fork!
@MatthiasZepper OK, committed! Had a quick check and I think this works, though I note that the trimgalore subworkflow doesn't do a post-trim FASTQ, which we might want to address at some point.... Anyway, I'll let you take it home from here :-) |
Thank you so much! I will try my best to finish this quickly now!
Oh, it does. It is just confusing, because TrimGalore! in itself is a wrapper script around cutadapt and FastQC. So FastQC is not run as a Nextflow process but by the TrimGalore Perl script. |
Ahh right, thought I was forgetting something ;-). So there is probably a missing bit to get those outputs prefixed correctly, but you know what to do. |
@MatthiasZepper in case it's impacting your work, we've noticed that the latest MultiQC has caused some issues in the workflow. We're looking into it.
I think/hope/wish I am done with this PR. It now fixes 3 of the 4 issues that were spotted, with only the TrimGalore! renaming left. However, I consider this a minor issue that could be tackled later if needed.
Great, thanks @MatthiasZepper ! Just to be clear, you don't need an updated MultiQC? |
It did need changes to MultiQC, since the previous version was not working. However, the critical bug was fixed with 1.22.2 and my updates to the umi-tools module were already contained within 1.22.3. Therefore, with this PR, we should now see (re)introduced:
…in the General Stats table of MultiQC.
Hope you don't mind @MatthiasZepper - just illustrating in those last couple of commits what I meant. So use the module in its updated form, but also have a patch to help with updates. I also removed something I added to the patch earlier and which shouldn't have been there, and bumped the module (think it was just Maxime mucking about with stubs) |
No, I don't mind at all. On the contrary, I highly appreciate your help here! Please also push your changes to the draft PR in the modules repo right away so they don't get lost in translation! Fixing the dupradar module was not even in the original scope of this PR. I think the first changes were introduced by rebasing my draft PR onto the dev branch, and then I packed some more changes in there because a colleague suggested them and I felt they were too minor for a PR in their own right. Either way, I would like to see the MultiQC fixes merged and am happy to take everything else out if it complicates the review and decision.
For me this is good to go. I've given the MultiQC report a check, and it's looking good to me when I check for the recent issues.
Module state is per @MatthiasZepper's module PR, temporarily achieved via a patch pending the merge of that PR. We can merge as-is, or just merge that module PR (since everything seems to be working) and update here, removing the patch.
@MatthiasZepper I think the failure here is because the nf-test didn't run on the module PR (maybe touching the template file isn't enough), and we need to update the tests to reflect the changes. I'll take a look. |
Thanks. But don't overthink it, since I probably just screwed up undoing the local changes.
Thanks @MatthiasZepper! Module update done, lights are green. Merge away when you're ready. |
I did not have time to test the latest iteration of this PR until just now, but to me it seems the MultiQC issues are not fixed (or new ones have emerged; I can't tell whether I overlooked something before, because I have already deleted the results from the previous test runs).
1.) The FastQC section is missing samples. With the test profile, there are only 2 samples in the FastQC section, but 5 samples in the umi-tools module. Also, the sample names are oddly mixed up. I never paid much attention to the contents of the test data, but it seems that some tools only process parts of the data or perform some weird renaming of the samples?
2.) The Dupradar plot is there, but no lines are shown and the sample values are all 0,0. That might be due to the small test data being poorly suited to testing the tool, but of course it could also be due to an invalid config or incorrect data processing.
Can you, as a first step, please let me know whether you see the same?
This is actually because there are only two e.g. _raw_fastqc.zip files getting to the multiqc process, so probably a workflow issue. I'll figure it out |
@MatthiasZepper think I fixed the FASTQC thing at least (see last commit) - could you check again? |
Also, could you give me the UMI params you're testing with, and which are not set in the test profile by default? |
Pipeline run is queued and about to start as we speak (as I type).
Of course. Mind, however, that this is a complete nonsense pattern. I am defining three fixed bases at the start, but allowing for two random substitutions. I just wanted to make the UMI-tools extract plot a little more informative (and put that module to a true test), because with a fixed pattern there are no failures when extracting.
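To illustrate what "three fixed bases, allowing two substitutions" means in practice, here is a standard-library-only sketch. It is not umi-tools code, and the anchor ATG, the UMI length of 6 and the example reads are made-up values, since the actual parameters are not shown above.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def extract_umi(read, anchor="ATG", umi_len=6, max_subs=2):
    """Return (umi, rest) if the read starts with the anchor within
    max_subs substitutions and is long enough, otherwise None."""
    if len(read) < len(anchor) + umi_len:
        return None  # too short to hold anchor plus UMI
    if hamming(read[:len(anchor)], anchor) > max_subs:
        return None  # anchor differs in more positions than allowed
    start = len(anchor)
    return read[start:start + umi_len], read[start + umi_len:]


# Hypothetical reads: exact anchor, two substitutions, three substitutions,
# and a read that is too short.
reads = ["ATGAAACCCGGGTTT", "ACCAAACCCGGGTTT", "CCCAAACCCGGGTTT", "ATGAA"]
results = [extract_umi(r) for r in reads]
```

With such a tolerant anchor, only reads that mismatch at all three anchor positions (or are too short) fail extraction, so some failures show up in the plot without every read being rejected.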
Then this is probably a side effect of my random UMI specification that I did not think through properly or an issue with my browser. |
I've also looked into the UMI thing. I don't think it can ever have worked (unless there is some regression I was unaware of). MultiQC is parsing log lines like:
Those input files are taken directly from the sample sheet. Other processes where FASTQ files have been merged end up using a prefix on their output, so the log for
To fix this, someone will probably have to alter |
OK, I see it now with your parameters, so it's not your browser! Not sure of the fix though as it relates to your params (I haven't dug much into the UMI stuff), and don't think it's MultiQC related. So probably one for a separate issue as well. |
I fear I can't follow you unless you are a tad more specific than "thing". 🙃 I think you are aware that there are two umi-tools steps in the pipeline:
The examples you show evidently refer to |
Yes, I meant your earlier flagged inconsistency in the names on the extract plots. Being more specific, the sample IDs are derived like this, which means they're derived from data lines like:
... i.e. from the actual, bare, FASTQ file names, exactly as supplied to the pipeline. The FASTQs that . I might suggest that the simplest thing would be to work off the
... but obviously we'd then be waiting on a release. MultiQC renaming looks good, and I just had a quick stab, but I can't see how to do it quickly (the MultiQC module really needs a file input for the renaming TSV). To my mind we should probably get this merged (assuming you confirm the FASTQC fix works), and deal with this down the line.
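One way such a renaming file could be prepared, sketched with the standard library only: derive the bare name from each FASTQ path and write a two-column, tab-separated file mapping it to the name the pipeline uses, which is, to my understanding, the shape of input that MultiQC's --replace-names option takes. The sample sheet rows, the file names and the helper names are hypothetical, and the name-cleaning rule is a simplification, not MultiQC's actual logic.

```python
import csv
import io
from pathlib import PurePosixPath

FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def bare_name(fastq_path):
    """File name without directory and FASTQ extension (simplified rule)."""
    name = PurePosixPath(fastq_path).name
    for suffix in FASTQ_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def replace_names_tsv(rows):
    """Build 'bare FASTQ name <TAB> target name' lines from
    (target name, FASTQ path) rows."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for target, fastq in rows:
        writer.writerow([bare_name(fastq), target])
    return buf.getvalue()


# Hypothetical sample sheet entries: (target name, FASTQ path).
rows = [
    ("WT_REP1_1", "/data/SRR0000001_1.fastq.gz"),
    ("WT_REP1_2", "/data/SRR0000001_2.fastq.gz"),
]
tsv = replace_names_tsv(rows)
```

The resulting text would still have to reach the MultiQC process as a file, which is exactly the missing file input mentioned above.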
This draft PR comprises my current progress towards fixing issue #1303.
It modifies the publishDir directives in the FastQC module config such that the reports are consistently published in ${params.outdir}/fastqc/raw and ${params.outdir}/fastqc/trim regardless of the chosen trimmer (TrimGalore!, Fastp), and adapts the custom MultiQC config of the pipeline accordingly.
This is, however, not sufficient to fix the issue, because recent versions of MultiQC have a bug that prevents running the same module twice. There are still separate entries and columns in the General Statistics table, but the modules are not shown in the report and navigation bar:
For both screenshots, I ran MultiQC on the output directory of a test profile run of this pipeline using the custom config in workflows/rnaseq/assets/multiqc/multiqc_config.yml.
It should be stressed that the FastQC module itself works in modern versions, because it is shown if the custom config is omitted. But forcing the module to run twice via a custom config seemingly breaks it. Only in the General Statistics table does it still work like a charm. Thus, the reports are parsed, but the module output is not displayed in the report.
Further issues
In the course of troubleshooting this issue, I discovered more issues that need to be tackled. Help would be greatly appreciated with those:
Inconsistent naming of FastQC output:
For FastP, the file names are retained before and after trimming:
For TrimGalore!, the RAP1_UNINDUCED samples are renamed with a trimmed suffix and the others receive _val1_ and _val2_ suffixes.
Unfortunately, I have no idea why. I have quadruple-checked the publishDir directives and can't explain it. Help and inspiration needed!
Duplicate column is actually shown in the General Statistics table (FIXED!)
According to the config, the duplicate column from FastQC should be hidden in the General Statistics table. However, it is shown. Might be another MultiQC bug or that I just stared myself blind.
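Returning to the TrimGalore! naming issue listed above: Trim Galore! itself names single-end output with a _trimmed suffix and paired-end output with _val_1 / _val_2 suffixes, which would be consistent with only some samples receiving the trimmed suffix if those samples are single-end (an assumption about the test data, not something verified here). A minimal sketch of stripping these suffixes so that trimmed report names mirror the raw ones; the regular expression and the example file names are illustrative only.

```python
import re

# Suffixes Trim Galore! appends before the FastQC suffix: "_trimmed" for
# single-end reads, "_val_1" / "_val_2" for paired-end reads.
TRIM_SUFFIX = re.compile(r"_(?:trimmed|val_[12])(?=_fastqc\.(?:zip|html)$)")


def normalise(report_name):
    """Drop the Trim Galore! suffix from a FastQC report file name."""
    return TRIM_SUFFIX.sub("", report_name)


# Hypothetical report names.
single_end = normalise("RAP1_UNINDUCED_REP1_trimmed_fastqc.html")
paired_end = normalise("WT_REP1_1_val_1_fastqc.zip")
untouched = normalise("WT_REP1_1_fastqc.zip")
```

The lookahead restricts the substitution to the position directly before the FastQC suffix, so a sample ID that happens to contain "trimmed" elsewhere is left alone.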
umi-tools dedup stats not shown (Fixed)
According to our current master/dev branch config, the umi_tools module is not run. Seeing this, I believed that it would be an easy fix for #1277 and added the module to the config. However, no reports are shown. Either the module is broken or the deduplication stats are not channelled to MultiQC. Either way, there is no quick solution in sight here either.
PR checklist
- nf-core lint
- nextflow run . -profile test,docker --outdir <OUTDIR>
- nextflow run . -profile debug,test,docker --outdir <OUTDIR>
- docs/usage.md is updated.
- docs/output.md is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).