Add D2 benchmark specification #171
Made some small comments in the README. As a more general comment, I'm not keen on having a shortening/lengthening metric here as part of this specification (it should be in a separate differential usage specification if we decide to implement).
On a practical note regarding shortening/lengthening, we'd need to define a new differential usage output file specification and update all of the execution workflows accordingly. Coding-wise this wouldn't be too challenging, but we'd have to decide pretty quickly to leave time to implement it, review the code, etc. I absolutely wouldn't want this to be the deciding factor, but it's still a point to consider.
Will share my thoughts on a shortening/lengthening metric here anyway in case we decide to implement. I was chewing over this for a little while so apologies in advance for the long message (and delayed review). I do think a metric like this could be useful, especially if the question is considered as ‘does the tool get the relative direction of change right’? Any higher resolution and I think it would be very difficult/impossible to treat all of the tools consistently.
What is our definition of a shortening/lengthening event? The way I think of it is the direction (and magnitude) of change in relative usage of the polyA sites on a 3’UTR. I think most tools would define it this way as part of a relative usage metric (e.g. % proximal polyA site used), but these metrics aren’t equivalent between all the tools. For example, LABRAT’s psi value is a gene-level relative usage metric where the expression of each PAS is weighted by 5’-3’ order in the gene, whereas DaPars’s PDUI is the % distal polyA site used. However we decide to calculate our shortening/lengthening metric for the ground truth data, I think we’d either have to include tool-specific re-processing to match our convention or exclude tools from the analysis (if they don't provide enough information about the individual PAS to re-calculate our relative usage metric).
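To make the difference between the two kinds of metric concrete, here is a minimal Python sketch. It is not taken from either tool's code: `distal_usage` is a simplified two-site stand-in for a PDUI-like value, `position_weighted_usage` is a LABRAT-style illustration that assumes linear weights `i/(n-1)` over sites ordered 5'->3', and the sign convention in `direction` (higher value = more distal usage = lengthening) is an assumption we would need to agree on.

```python
def distal_usage(proximal: float, distal: float) -> float:
    """Fraction of expression coming from the distal site (two-site case)."""
    total = proximal + distal
    if total == 0:
        raise ValueError("no expression at either site")
    return distal / total


def position_weighted_usage(expr_5p_to_3p: list[float]) -> float:
    """Gene-level usage in [0, 1] for sites ordered 5'->3'.

    Assumed weighting (illustrative): site i of n gets weight i / (n - 1),
    so the most proximal site has weight 0 and the most distal weight 1.
    """
    n = len(expr_5p_to_3p)
    if n < 2:
        raise ValueError("need at least two polyA sites")
    total = sum(expr_5p_to_3p)
    if total == 0:
        raise ValueError("no expression at any site")
    weighted = sum(i / (n - 1) * e for i, e in enumerate(expr_5p_to_3p))
    return weighted / total


def direction(usage_control: float, usage_treatment: float,
              min_delta: float = 0.0) -> str:
    """Classify the change, assuming higher usage means more distal sites."""
    delta = usage_treatment - usage_control
    if delta > min_delta:
        return "lengthening"
    if delta < -min_delta:
        return "shortening"
    return "no_change"
```

With two sites the two functions agree, but with three or more sites the position-weighted value also depends on the intermediate sites, which is why tool outputs would need re-processing before they can be compared on one scale.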
Which value to report for a gene? A number of tools in our challenge operate per transcript model (e.g. both DaPars tools, APAtrap), so in genes where these transcripts have different last exons/3’UTRs we will likely have different shortening/lengthening values. As it stands, when reporting p-values for these tx-by-tx tools we’ve settled on choosing the smallest p-value for each gene, so in line with that we could also pick the shortening/lengthening metric value associated with that row.
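As a rough sketch of that selection rule (assuming each tool row has already been parsed into a dict; the keys `gene_id`, `pvalue` and `delta_usage` are hypothetical names, not part of any existing APAeval file specification):

```python
def pick_row_per_gene(rows: list[dict]) -> dict[str, dict]:
    """Keep, for each gene, the transcript-level row with the smallest p-value.

    The shortening/lengthening value reported for the gene is then whatever
    that row carries (here under the hypothetical key "delta_usage").
    Ties keep the first row encountered.
    """
    best: dict[str, dict] = {}
    for row in rows:
        gene = row["gene_id"]
        if gene not in best or row["pvalue"] < best[gene]["pvalue"]:
            best[gene] = row
    return best


rows = [
    {"gene_id": "geneA", "transcript": "tx1", "pvalue": 0.20, "delta_usage": 0.02},
    {"gene_id": "geneA", "transcript": "tx2", "pvalue": 0.01, "delta_usage": -0.15},
    {"gene_id": "geneB", "transcript": "tx3", "pvalue": 0.50, "delta_usage": 0.00},
]
per_gene = pick_row_per_gene(rows)
```

One consequence worth noting: the reported direction for a gene comes from a single transcript, so a gene with opposing changes on different 3’UTRs would be summarised by only one of them.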
At what level of resolution do we want to assess shortening/lengthening? I.e. does the gene show any 3’UTRs with shortening/lengthening (and its direction), or do we want to ensure the same 3’UTRs are being assessed between the tool and ground truth? I think the latter would be too complicated to implement, as different tools have different conventions for reporting shortening/lengthening and differ in whether they assess polyA sites on different exons. If we’re interested in just 3’UTR shortening/lengthening, we’d also have to be careful about which genes to include in our evaluation for tools that assess alternative last exon APA (e.g. LABRAT).
We may also want to consider a range of effect sizes in our ground truth labels (e.g. +/- 1/5/10 % change). The larger magnitude changes are usually the genes we focus on for further investigation, so it would be useful to know how reliable/reproducible these events are with the ground truth.
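A minimal sketch of how such tiered labels could be derived, assuming the ground truth gives one signed change in relative usage per gene as a fraction (the input dict, the gene names and the sign convention, positive = lengthening, are all illustrative assumptions):

```python
THRESHOLDS = (0.01, 0.05, 0.10)  # the 1/5/10 % tiers, as fractions


def label_ground_truth(delta_usage: dict[str, float],
                       thresholds: tuple[float, ...] = THRESHOLDS
                       ) -> dict[float, dict[str, int]]:
    """Label each gene per threshold: 1 lengthened, -1 shortened, 0 neither."""
    labels: dict[float, dict[str, int]] = {}
    for t in thresholds:
        labels[t] = {
            gene: (1 if delta >= t else -1 if delta <= -t else 0)
            for gene, delta in delta_usage.items()
        }
    return labels


labels = label_ground_truth({"geneA": 0.07, "geneB": -0.02, "geneC": 0.005})
```

Evaluating each tool against every tier would then show whether agreement with the ground truth improves for the larger-magnitude changes.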
Anyway apologies again for the super-long message, most of this isn't directly related to the PR but I wanted to get my thoughts down anyway :)
#### Format 1

This CSV file contains a list of genes with differentially used poly(A) sites identified from RNA-Seq data by the benchmarked method.
Columns:

- gene name
- is_de: information whether **any** PAS in the gene was differentially expressed; possible values: "0" if no differential expression detected, "1" if differentially expressed
- is_lengthened: information whether shortening/lengthening events were detected for the gene; possible values: "0" if no shortening/lengthening events detected, "1" if shortening/lengthening events detected
Judging from other benchmark specifications, the input should match the output file formats specified for execution workflows where possible. For differential usage output files we just have a 2-column TSV of gene_id & p-value. It's not much of a problem if we can write an adaptor script as part of the summary workflow to generate this CSV file/table, but I'd prefer to stick with the TSV file as input in the interests of consistency.
Also, how would we define a gene as differentially used? My naive assumption is that we'd set a p-value threshold below which genes are considered differentially used, in which case we'd probably want to try several different thresholds.
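Something like the following could serve as the adaptor step. It is only a sketch: it assumes the 2-column TSV may or may not have a header line and may contain non-numeric p-values such as `NA` (both skipped here), and the threshold values and function names are placeholders.

```python
import csv
import io


def read_pvalues(handle) -> dict[str, float]:
    """Read a 2-column TSV (gene_id, p-value) into a dict.

    Rows without exactly two fields, or whose second field is not a number
    (e.g. a header line or "NA"), are skipped.
    """
    pvalues: dict[str, float] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue
        gene_id, raw_p = row
        try:
            pvalues[gene_id] = float(raw_p)
        except ValueError:
            continue
    return pvalues


def call_differential_usage(pvalues: dict[str, float],
                            thresholds=(0.01, 0.05, 0.1)
                            ) -> dict[float, dict[str, int]]:
    """For each threshold, mark a gene 1 if its p-value is <= threshold."""
    return {
        t: {gene: int(p <= t) for gene, p in pvalues.items()}
        for t in thresholds
    }


example_tsv = "gene_id\tpvalue\ngeneA\t0.003\ngeneB\t0.04\ngeneC\tNA\n"
pvals = read_pvalues(io.StringIO(example_tsv))
calls = call_differential_usage(pvals)
```

Keeping the thresholds as a parameter means the summary workflow can compute metrics at each cut-off from the same TSV without changing the execution workflow outputs.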
Also, would prefer 'differentially used' / 'differential usage' in place of 'differentially expressed' / 'differential expression' to keep consistent with definitions throughout APAeval (e.g. line 42)
I agree with the terminology. I would, however, include the quantitative measure of usage and not just the p-value. This would mean that we will have to define and calculate the measure of shortening/lengthening for a gene. But given that a gene can have multiple terminal exons, and different tools could call different sites in different terminal exons, it makes more sense to do the analysis at the exon level.
In terms of shortening/lengthening definition, I would probably take the one from LABRAT.
Hi @daneckaw @SamBryce-Smith @mzavolan, can this PR be closed, as we already merged the new specs in #222? Or is there some additional information here?
Fixes #170
Differential usage benchmark specification and example input/output files.
Regarding input files: at least some tools can detect whether the 3'-UTR is lengthened or shortened, so I added a column for that and we can use this for another metric. If not many tools produce this data or we don't have ground-truth data, we can remove it.