Add D2 benchmark specification #171

daneckaw · 2021-09-01T12:55:08Z

Fixes #170

Differential usage benchmark specification and example input/output files.

Regarding input files: at least some tools can detect whether the 3'-UTR is lengthened or shortened, so I added a column for that and we can use this for another metric. If not many tools produce this data or we don't have ground-truth data, we can remove it.

Checklist

I have performed a self-review of my own code
My code follows the templates/style guidelines of the repository
I have commented my code, particularly in hard-to-understand areas
I have tested my code

SamBryce-Smith

Made some small comments in the README. As a more general comment, I'm not keen on having a shortening/lengthening metric as part of this specification; if we decide to implement it, it should go in a separate differential usage specification.

On a practical note regarding shortening/lengthening: we'd need to define a new differential usage output file specification and update all of the execution workflows accordingly. Coding-wise this wouldn't be too challenging, but we'd have to decide pretty quickly to leave time to implement, review code, etc. I absolutely wouldn't want this to be the deciding factor, but it's still a point to consider.

Will share my thoughts on a shortening/lengthening metric here anyway in case we decide to implement. I was chewing over this for a little while so apologies in advance for the long message (and delayed review). I do think a metric like this could be useful, especially if the question is considered as ‘does the tool get the relative direction of change right’? Any higher resolution and I think it would be very difficult/impossible to treat all of the tools consistently.

What is our definition of a shortening/lengthening event? The way I think of it is the direction (and magnitude) of change in relative usage of the polyA sites on a 3’UTR. I think most tools would define it this way as part of a relative usage metric (e.g. % proximal polyA site used), but these metrics aren’t equivalent between all the tools. For example, LABRAT’s psi value is a gene-level relative usage metric where the expression of each PAS is weighted by its 5’-3’ order in the gene, whereas DaPars’s PDUI is the % distal polyA site used. However we decide to calculate our shortening/lengthening metric for the ground truth data, I think we’d either have to include tool-specific re-processing to match our convention or exclude tools from the analysis (if they don’t provide enough information about the individual PAS to re-calculate our relative usage metric).
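To make the "direction of change in relative usage" idea concrete, here is a minimal sketch. It assumes a position-weighted metric in which the most proximal site gets weight 0 and the most distal gets weight 1; this is one way to read "weighted by 5’-3’ order" and is not claimed to be LABRAT's exact formula. The function names and the `min_delta` parameter are hypothetical.

```python
def position_weighted_psi(site_expression):
    """Relative usage for one gene (illustrative, not a tool's official formula).

    site_expression: expression values of the gene's polyA sites,
    ordered 5'->3' (most proximal first).
    Returns a value in [0, 1], or None if it cannot be computed.
    """
    n = len(site_expression)
    total = sum(site_expression)
    if n < 2 or total == 0:
        return None  # a single site or no expression: relative usage undefined
    # Assumed weights: 0 for the most proximal site, 1 for the most distal.
    weighted = sum(i / (n - 1) * expr for i, expr in enumerate(site_expression))
    return weighted / total


def direction_of_change(psi_control, psi_treatment, min_delta=0.0):
    """Classify the change between two conditions by the sign of the difference."""
    if psi_control is None or psi_treatment is None:
        return "undefined"
    delta = psi_treatment - psi_control
    if delta > min_delta:
        return "lengthening"  # usage shifts towards distal sites
    if delta < -min_delta:
        return "shortening"  # usage shifts towards proximal sites
    return "unchanged"


if __name__ == "__main__":
    control = position_weighted_psi([30, 10])
    treatment = position_weighted_psi([10, 10, 20])
    print(control, treatment, direction_of_change(control, treatment))
```

With exactly two sites, this weighting reduces to the fraction of distal-site usage, so in that case it lines up with a "% distal polyA site used" convention. With more sites the metrics diverge, which is where tool-specific re-processing would come in.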

Which value to report for a gene? A number of tools in our challenge operate per transcript model (e.g. both DaPars tools, APAtrap), so in genes where these transcripts have different last exons/3’UTRs we will likely have different shortening/lengthening values. As it stands, when reporting p-values for these tx-by-tx tools we’ve settled on choosing the smallest p-value for each gene, so in line with that we could also pick the shortening/lengthening metric value associated with that row.
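A rough sketch of that selection rule: keep, for each gene, the row with the smallest p-value together with its shortening/lengthening value. The row layout `(gene_id, transcript_id, p_value, delta_usage)` and the function name are assumptions for illustration, not an agreed format.

```python
def collapse_to_gene(rows):
    """Pick, per gene, the transcript-level row with the smallest p-value.

    rows: iterable of (gene_id, transcript_id, p_value, delta_usage) tuples
    (hypothetical layout). Returns {gene_id: (transcript_id, p_value, delta_usage)}.
    On ties, the first row encountered is kept.
    """
    best = {}
    for gene_id, transcript_id, p_value, delta_usage in rows:
        current = best.get(gene_id)
        # current[1] is the p-value of the row kept so far for this gene
        if current is None or p_value < current[1]:
            best[gene_id] = (transcript_id, p_value, delta_usage)
    return best


if __name__ == "__main__":
    example = [
        ("geneA", "tx1", 0.04, -0.2),
        ("geneA", "tx2", 0.001, 0.3),
        ("geneB", "tx3", 0.5, 0.0),
    ]
    for gene, row in collapse_to_gene(example).items():
        print(gene, row)
```

Note that the reported direction then depends entirely on which transcript happens to have the smallest p-value, so two transcripts of one gene with opposite directions would be collapsed to a single call.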

At what level of resolution do we want to assess shortening/lengthening? i.e. does the gene show any 3’UTRs with shortening/lengthening (and its direction), or do we want to ensure the same 3’UTRs are being assessed between the tool and ground truth? I think the latter would be too complicated to implement, as different tools have different conventions for reporting shortening/lengthening and differ in whether they assess polyA sites on different exons. If we’re interested in just 3’UTR shortening/lengthening, we’d also have to be careful about which genes to include in our evaluation for tools that assess alternative last exon APA (e.g. LABRAT).

We may also want to consider a range of effect sizes in our ground truth labels (e.g. +/- 1/5/10 % change). The larger magnitude changes are usually the genes we focus on for further investigation, so it would be useful to know how reliable/reproducible these events are with the ground truth.
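As a sketch of what labelling at several effect sizes could look like: the code below assumes "+/- 1/5/10 % change" means an absolute change in relative usage expressed as a fraction (0.01, 0.05, 0.10), which is my reading rather than a settled definition. The names are hypothetical.

```python
# Assumed cut-offs: absolute change in relative usage, as fractions.
THRESHOLDS = (0.01, 0.05, 0.10)


def label_by_effect_size(delta_usage, thresholds=THRESHOLDS):
    """Return one ground-truth label per threshold for a single gene.

    delta_usage: change in relative usage between conditions
    (positive = shift towards distal sites, under this sketch's convention).
    Labels: 1 = lengthening, -1 = shortening, 0 = change below the threshold.
    """
    labels = {}
    for threshold in thresholds:
        if delta_usage >= threshold:
            labels[threshold] = 1
        elif delta_usage <= -threshold:
            labels[threshold] = -1
        else:
            labels[threshold] = 0
    return labels


if __name__ == "__main__":
    print(label_by_effect_size(0.07))
    print(label_by_effect_size(-0.2))
```

Each threshold then gives its own set of ground-truth labels, so the same tool output can be scored separately against small and large changes.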

Anyway apologies again for the super-long message, most of this isn't directly related to the PR but I wanted to get my thoughts down anyway :)

SamBryce-Smith · 2021-10-07T11:26:40Z

summary_workflows/differential_usage/D2_benchmark/D2_benchmark_specification.md

+#### Format 1
+
+This CSV file contains a list of genes with differentially used poly(A) sites identified from RNA-Seq data by the benchmarked method.  
+Columns:
+
+- gene name
+- is_de: information whether **any** PAS in the gene was differentially expressed; possible values: "0" if no differential expression detected, "1" if differentially expressed
+- is_lengthened: information whether shortening/lengthening events were detected for the gene; possible values: "0" if no shortening/lengthening events detected, "1" if shortening/lengthening events detected


Judging from the other benchmark specifications, the input should match the output file formats specified for the execution workflows where possible. For differential usage output files we just have a 2-column TSV of gene_id & p-value. It's not much of a problem if we can write an adaptor script as part of the summary workflow to generate this CSV file/table, but I'd prefer to stick with the TSV file as input in the interests of consistency.

Also, how would we define a gene as differentially used? My naive assumption is that we'd set a p-value threshold below which genes are considered differentially used, in which case we'd probably want to try several different thresholds.
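For illustration, an adaptor along these lines could turn the 2-column TSV into 0/1 calls at several thresholds. This is only a sketch: it assumes the TSV has no header row and that every p-value is numeric, and the function name and threshold values are placeholders.

```python
import csv
import io


def tsv_to_calls(tsv_text, thresholds=(0.01, 0.05, 0.1)):
    """Convert 'gene_id<TAB>p-value' lines into per-threshold 0/1 calls.

    Assumes no header row and numeric p-values (assumptions of this sketch).
    A gene is called differentially used (1) when its p-value is strictly
    below the threshold, otherwise 0.
    """
    calls = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene_id, p_value = row[0], float(row[1])
        calls[gene_id] = {t: int(p_value < t) for t in thresholds}
    return calls


if __name__ == "__main__":
    example = "ENSG1\t0.001\nENSG2\t0.03\nENSG3\t0.5\n"
    for gene, per_threshold in tsv_to_calls(example).items():
        print(gene, per_threshold)
```

Keeping the threshold as a parameter means the execution workflow output stays untouched and the choice of cut-off lives entirely in the summary workflow.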

Also, would prefer 'differentially used' / 'differential usage' in place of 'differentially expressed' / 'differential expression' to keep consistent with definitions throughout APAeval (e.g. line 42)

I agree with the terminology. I would, however, include the quantitative measure of usage and not just the p-value. This would mean that we have to define and calculate the measure of shortening/lengthening for a gene. But given that a gene can have multiple terminal exons and different tools could call different sites in different terminal exons, it makes more sense to do the analysis at the exon level.

In terms of shortening/lengthening definition, I would probably take the one from LABRAT.

ninsch3000 · 2022-05-19T08:03:31Z

Hi @daneckaw @SamBryce-Smith @mzavolan, can this PR be closed, as we already merged the new specs in #222? Or is there some additional information here?

Add D2 benchmark specification

07e1dbe

daneckaw added the draft Work in progress, not ready for review label Sep 1, 2021

Added example input files

0dec6e6

daneckaw added ready for review and removed draft Work in progress, not ready for review labels Sep 29, 2021

daneckaw requested review from mrgazzara, ninsch3000 and YosephBarash September 29, 2021 10:50

daneckaw mentioned this pull request Sep 29, 2021

feat: D2 benchmark scripts #181

Closed

4 tasks

daneckaw requested a review from SamBryce-Smith September 29, 2021 13:41

SamBryce-Smith requested changes Oct 11, 2021

faricazjj added reviews to address PR reviews complete but changes needed and removed ready for review labels Jan 18, 2022

ninsch3000 closed this Aug 8, 2023

mzavolan Feb 4, 2022
