Preprocessing step for RSEM #7752

takutosato · 2022-04-04T19:55:50Z

This tool will be used in the neovax project as a pre-processing step before running RSEM, the gene quantification tool, which has stringent requirements for the format of the input bam.

lbergelson

@takutosato I made probably extremely nitpicky comments on this. It's unclear to me if this is supposed to be a general user tool or a specific weird one-off for an internal pipeline. If it's the latter, you could probably just tag it as @Experimental or something and ignore most of my comments.

This makes me want to introduce a ReadNameGroupWalker that does the grouping in the background. SortedPairWalker? I'm not sure what I'd call it... Ted wrote a much more exciting PairWalker class but it doesn't deal with secondary reads in the way you need.

lbergelson · 2022-04-05T16:02:17Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/PostProcessReadsForRSEM.java

+ *
+ * ### Task 2. Removing Reads ###
+ *
+ * If requested, this tool also removes duplicate marked reads and MT reads, which can skew gene expression counts.


I don't understand this set of comments. This says it can remove them, and then the caveat says it doesn't.

contradiction removed.

lbergelson · 2022-04-05T16:02:26Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/PostProcessReadsForRSEM.java

+ *
+ *
+ * Caveat: This tool does not remove duplicate reads; it assumes it's been removed upstream e.g.
+ * MarkDuplicates with R


Mark duplicates with R?

What happened was I left IntelliJ to look up what the exact argument name was (it's REMOVE_DUPLICATES) and then got distracted and went on to other places in the code.

lbergelson · 2022-04-05T16:03:30Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/PostProcessReadsForRSEM.java

+ *
+ * ### Task 2. Removing Reads ###
+ *
+ * If requested, this tool also removes duplicate marked reads and MT reads, which can skew gene expression counts.


This can be done by MarkDuplicates itself (or a read filter), unless I misunderstand what you're doing. If the reads are aligned, can't you use -XL to screen out the mitochondrial reads? It's fine to bake these things into the tool to make it foolproof, but I'm not clear why it's necessary to do specially.

Based on the hardcoded list of mitochondrial transcripts below, it seems like it would be better to pass in an exclusion file with -XL to remove mitochondria instead of hardcoding the list.

Removed hardcoded MT contigs and created an interval list.

lbergelson · 2022-04-05T16:08:58Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/PostProcessReadsForRSEM.java

+@CommandLineProgramProperties(
+        summary = "",
+        oneLineSummary = "",
+        programGroup = ReadDataManipulationProgramGroup.class // Sato: Change to QC when the other PR is merged.


I commented in the other pr but this seems better to me than QC? This isn't really checking anything.

Keeping ReadDataManipulation

lbergelson · 2022-04-05T18:57:35Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/PostProcessReadsForRSEM.java

+    @Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME)
+    public File outSam;
+
+    @Argument(fullName = "keep-MT-reads")


I think we lowercase all our full names.

good to know, this option is now removed

lbergelson · 2022-04-06T17:45:52Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/PostProcessReadsForRSEM.java

+                            r.getStart() == read1.getMateStart() && r.getMateStart() == read1.getStart()).findFirst();
+            if (read2.isPresent()){
+                result.add(new ImmutablePair<>(read1, read2.get()));
+            } else {


Do we also want to warn on the case where there are multiple matches?
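A minimal sketch of how the multiple-match case could be surfaced. It assumes a simplified setup: `SimpleRead` is a hypothetical stand-in for `GATKRead`, and `MateMatcher`, `findMate`, and the `warnings` list are illustrative names, not code from this PR. The filter condition mirrors the one in the diff above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class MateMatcher {
    /** Hypothetical stand-in for GATKRead: only the fields the matching logic needs. */
    static final class SimpleRead {
        final String name;
        final int start;
        final int mateStart;

        SimpleRead(String name, int start, int mateStart) {
            this.name = name;
            this.start = start;
            this.mateStart = mateStart;
        }
    }

    /** Warnings are collected in a list so the sketch needs no logging dependency. */
    static final List<String> warnings = new ArrayList<>();

    /**
     * Returns the first candidate whose start and mate start mirror read1,
     * and records a warning when more than one candidate qualifies.
     */
    static Optional<SimpleRead> findMate(SimpleRead read1, List<SimpleRead> candidates) {
        final List<SimpleRead> matches = candidates.stream()
                .filter(r -> r.start == read1.mateStart && r.mateStart == read1.start)
                .collect(Collectors.toList());
        if (matches.size() > 1) {
            warnings.add("Found " + matches.size() + " candidate mates for "
                    + read1.name + "; using the first");
        }
        return matches.stream().findFirst();
    }
}
```

In the real tool the warning would presumably go to the tool's logger rather than a list; the list is only here to keep the example self-contained.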

lbergelson · 2022-04-06T17:48:17Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/PostProcessReadsForRSEM.java

+                }
+            }
+
+            // Supplementary reads are not handled i.e. removed


I would add a default read filter to remove supplementary reads up front. You can do that by overriding getDefaultReadFilters()

lbergelson · 2022-04-06T17:55:00Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/ReadPair.java

+            throw new UserException("Read names do not match: " + this.queryName + " vs " + read.getName());
+        }
+
+        if (isPrimaryAlignment(read) && read.isFirstOfPair()) {


Should these throw if first of pair is already set?

lbergelson · 2022-04-06T18:07:43Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/PostProcessReadsForRSEM.java

+        // If either of the pair is unmapped, throw out.
+        // With the STAR argument --quantTranscriptomeBan IndelSoftclipSingleend this should not occur,
+        // but we check just to be thorough.
+        if (read1.getContig() == null || read2.getContig() == null){


If either read1 or read2 is missing, this will crash with an NPE. That's probably something that should be handled, since it tends to happen. A crash is a fine answer, but detect it and write a sane error message.

Since it seems like you do require both a first and second in pair, one way to simplify some things would be to push these checks into ReadFilters and then at this point you'll just check if there is a full set of reads. It would change the counting of the elimination reason though from being per fragment group to being by read which might not be what you want.

This doesn't seem to happen. But I'm going to invoke the @droazen rule and do the pragmatic thing, which is that if either read1 or read2 is null, we return false (so these reads will not be output).
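The pragmatic fix described in the reply could look roughly like the sketch below. `SimpleRead`, `PairFilter`, and `passesMappingCheck` are hypothetical names for illustration; the actual method in the PR works on `GATKRead` and may perform further checks.

```java
class PairFilter {
    /** Hypothetical stand-in for GATKRead; contig is null when the read is unmapped. */
    static final class SimpleRead {
        final String contig;

        SimpleRead(String contig) {
            this.contig = contig;
        }
    }

    /**
     * Returns false (the pair is not output) when a mate is missing
     * or when either read is unmapped; returns true otherwise.
     */
    static boolean passesMappingCheck(SimpleRead read1, SimpleRead read2) {
        // Guard first: dereferencing a missing mate below would throw an NPE.
        if (read1 == null || read2 == null) {
            return false;
        }
        return read1.contig != null && read2.contig != null;
    }
}
```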

lbergelson · 2022-04-06T18:08:19Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/PostProcessReadsForRSEM.java

+        final List<CigarElement> cigarElements1 = read1.getCigar().getCigarElements();
+        final List<CigarElement> cigarElements2 = read2.getCigar().getCigarElements();
+
+        if (cigarElements1.size() != 1 || cigarElements2.size() != 1){


I might pull out a little function and call it twice instead of duplicating the checks.
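As a sketch of that refactor: the per-read check moves into one helper that is called once per read. Only the single-element condition is visible in the diff above, so that is all the helper checks here; `CigarChecks`, `hasSingleCigarElement`, and `bothReadsPass` are hypothetical names, and the list is typed as `List<?>` so the example compiles without htsjdk's `CigarElement`.

```java
import java.util.List;

class CigarChecks {
    /**
     * Per-read check. In the tool the argument would be the read's list of
     * CigarElement; any further per-read conditions would also live here.
     */
    static boolean hasSingleCigarElement(List<?> cigarElements) {
        return cigarElements.size() == 1;
    }

    /** The pair passes only if the same helper accepts both reads. */
    static boolean bothReadsPass(List<?> cigarElements1, List<?> cigarElements2) {
        return hasSingleCigarElement(cigarElements1) && hasSingleCigarElement(cigarElements2);
    }
}
```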

takutosato · 2022-04-07T16:19:40Z

@lbergelson thanks again for reviewing, back to you.

gatk-bot · 2022-04-07T16:39:06Z

Travis reported job failures from build 38623
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	openjdk11	38623.12	logs
integration	openjdk8	38623.2	logs

lbergelson · 2022-04-08T16:45:25Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/PostProcessReadsForRSEM.java

+
+    @Override
+    public List<ReadFilter> getDefaultReadFilters() {
+        return Collections.singletonList(ReadFilterLibrary.NOT_SUPPLEMENTARY_ALIGNMENT);


You may or may not want to include WELLFORMED

Yeah, it should be ok to add it, but I'm a bit concerned about it having unintended consequences (e.g. read1 is tossed but its mate isn't), so I want to leave it out for now.

lbergelson · 2022-04-08T16:48:02Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/qc/ReadPair.java

+                    "The primary firstOfPair is already set. Added read = " + read.getName());
+            this.firstOfPair = read;
+        } else if (isPrimaryAlignment(read) && read.isSecondOfPair()) {
+            this.secondOfPair = read;


Heh, do you also want to throw if second of pair is already set?
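A symmetric guard could be sketched as follows, with one helper covering both slots. This is an illustration under assumptions: `SimpleRead` is a hypothetical stand-in for `GATKRead`, `ReadPairSketch` and `requireUnset` are made-up names, `IllegalStateException` stands in for whichever exception type the PR actually throws, and secondary alignments are simply ignored here.

```java
class ReadPairSketch {
    /** Hypothetical stand-in for GATKRead. */
    static final class SimpleRead {
        final String name;
        final boolean primary;
        final boolean firstOfPair;

        SimpleRead(String name, boolean primary, boolean firstOfPair) {
            this.name = name;
            this.primary = primary;
            this.firstOfPair = firstOfPair;
        }
    }

    private SimpleRead firstOfPair;
    private SimpleRead secondOfPair;

    /** Stores a primary read in its slot, refusing to overwrite an occupied slot. */
    void add(SimpleRead read) {
        if (!read.primary) {
            return; // secondary alignments are out of scope for this sketch
        }
        if (read.firstOfPair) {
            firstOfPair = requireUnset(firstOfPair, read, "firstOfPair");
        } else {
            secondOfPair = requireUnset(secondOfPair, read, "secondOfPair");
        }
    }

    /** Returns the incoming read if the slot is empty; throws otherwise. */
    private static SimpleRead requireUnset(SimpleRead existing, SimpleRead incoming, String slot) {
        if (existing != null) {
            throw new IllegalStateException(
                    "The primary " + slot + " is already set. Added read = " + incoming.name);
        }
        return incoming;
    }

    SimpleRead getFirstOfPair() {
        return firstOfPair;
    }

    SimpleRead getSecondOfPair() {
        return secondOfPair;
    }
}
```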

lbergelson

@takutosato A few minor comments. Looks good though. I think it looks a lot cleaner than before. Feel free to merge when ready.

takutosato added 4 commits April 4, 2022 11:03

initial

1e8b28b

cleaned up

a9a5b62

add duplicate and mt filter:

c5ba782

add test files

ea73824

takutosato mentioned this pull request Apr 4, 2022

RNA pipeline with adapter clipping broadinstitute/warp#662

Merged

8 tasks

lbergelson requested changes Apr 6, 2022


takutosato added 2 commits April 7, 2022 11:50

louis

95bd053

fix tests

a2bb2a6

lbergelson reviewed Apr 8, 2022


lbergelson approved these changes Apr 8, 2022


louis2

2f7d12d

takutosato merged commit 3b0bc03 into master Apr 8, 2022

takutosato deleted the ts_rsem branch April 8, 2022 18:48
