
"compared to previous week" percentages are high even when absolute change is low #1178

Open
duanecmu opened this issue Jun 3, 2022 · 5 comments
Labels: discussion, enhancement (New feature or request)

Comments

@duanecmu (Contributor) commented Jun 3, 2022

On the COVIDcast dashboard for Allegheny County, the current deaths figure (relative change compared to 7 days ago) is displayed as a very large percentage change (at the time we took the screenshot it was a +424.0% change in the number of deaths). @RoniRos suggested that seeing this large number may be confusing: at first glance it appears deaths are increasing dramatically, when the count only went from 0 to 1-2 deaths. It may be less confusing for viewers to see N/A for such small changes.

Go to https://delphi.cmu.edu/covidcast/?region=42003 for Allegheny County or use any other county dashboard.

A screenshot from June 1 for Allegheny County is included below as an example. When deaths moved from 0 to 1-2, the viewer sees a huge jump like +424%.

[Screenshot: COVIDcast dashboard for Allegheny County, June 1, 2022]

Rating: 1-2 (minor issue)

duanecmu added the bug (Something isn't working) label on Jun 3, 2022
krivard added the enhancement (New feature or request) and discussion labels and removed the bug (Something isn't working) label on Jun 3, 2022
@krivard (Contributor) commented Jun 3, 2022

This may actually be less a small-counts issue and more a batch-reporting issue, which we already know is common in this dataset. Here's a view of the raw death counts for the same region -- it seems the actual increase in incident deaths between May 23 and May 30 is not 1-2, but 15-20. Depending on the reason for the spike on May 25, this may be a good candidate for the anomalies spreadsheet that feeds annotations in the web visualizations.

It is still worth discussing whether to censor certain information for small-population regions or for small-count signals. We should decide:

  • what our choice to focus on data power users means in this case. +424% is the actual relative change -- should we not expect data power users to be familiar with the distinction between relative and absolute change? Censoring this information hides it from everyone, not just those who may be confused by it. Is that fair to people who know what they're looking for?
  • if we do go ahead with censoring this information: whether to censor the figure (top row) or change since last week (bottom row) or both
  • whether population or raw count or both should be the determining factor, and what the thresholds should be
  • what to display instead (I recommend against "NA" because we're already using NA here to mean unavailable, as opposed to un-meaningful or confusing)

@duanecmu (Contributor, Author) commented Jun 3, 2022

Added Roni so he can follow the discussion. @RoniRos

krivard changed the title from "COVIDcast dashboard 7-day average clarification changes" to "'compared to previous week' percentages are high even when absolute change is low" on Jun 3, 2022
@RoniRos (Member) commented Jun 6, 2022

This may actually be less a small-counts issue and more a batch-reporting issue, which we already know is common in this dataset. Here's a view of the raw death counts for the same region -- it seems the actual increase in incident deaths between May 23 and May 30 is not 1-2, but 15-20.

Thanks! Based on the raw counts you shared:

  • total count for 7 days ending May 23 = 1
  • total count for 7 days ending May 30 = 21

I am not sure why this didn't result in a percentage increase of (20/7)/(1/7) = 2000%. Where did the 424% come from? Was there further smoothing before the percentage calculation?

In any case, my point is that a percentage change starting from a total 7-day count of 1 is uninformative, and arguably misleading or at least distracting. We could decide not to calculate the percentage change if the previous 7-day total is less than, say, 10. Note that the condition is only on the previous 7-day total (the denominator in the percentage calculation), not the current 7-day total.

Depending on the reason for the spike on May 25, this may be a good candidate for the anomalies spreadsheet that feeds annotations in the web visualizations.

True. But note that this is fairly orthogonal to my point. My point would have been the same if, in the most recent 7 days, instead of (0,21,0,0,0,0,0) we had, say, (2,4,3,4,3,2,3).

It is still worth discussing whether to censor certain information for small-population regions or for small-count signals. We should decide:

  • what our choice to focus on data power users means in this case. +424% is the actual relative change -- should we not expect data power users to be familiar with the distinction between relative and absolute change? Censoring this information hides it from everyone, not just those who may be confused by it. Is that fair to people who know what they're looking for?

Actually, I prefer not to censor counts, merely to avoid displaying percentages when they are based on a small-count denominator.

  • if we do go ahead with censoring this information: whether to censor the figure (top row) or change since last week (bottom row) or both

I would definitely not censor the figure (top row). That figure is based on the current 7-day total, which may actually be quite large. But even if it's small, I wouldn't censor it.

  • whether population or raw count or both should be the determining factor, and what the thresholds should be

Raw count, and I suggest <10. I don't think population size is very relevant to this issue, except that low-pop counties are more likely to have low raw counts.

  • what to display instead (I recommend against "NA" because we're already using NA here to mean unavailable, as opposed to un-meaningful or confusing)

I agree, and suggest something like "Small Counts", maybe in a two-line, tiny font like the one we use for "per 100k". This will hopefully become recognizable as an icon that means "not calculated because small counts make this value uninformative".

@krivard (Contributor) commented Jun 14, 2022

It is still worth discussing whether to censor certain information for small-population regions or for small-count signals. We should decide:

  • what our choice to focus on data power users means in this case. +424% is the actual relative change -- should we not expect data power users to be familiar with the distinction between relative and absolute change? Censoring this information hides it from everyone, not just those who may be confused by it. Is that fair to people who know what they're looking for?

Actually, I prefer not to censor counts, merely to avoid displaying percentages when they are based on a small-count denominator.

I was talking about censoring any information, not just counts. I don't understand how avoiding displaying percentages is different from censoring those percentages. If the distinction is important to you, could you explain?

  • whether population or raw count or both should be the determining factor, and what the thresholds should be

Raw count, and I suggest <10. I don't think population size is very relevant to this issue, except that low-pop counties are more likely to have low raw counts.

I've looked into what it would take to do this, and we have a few options. The "change since last week" display is based on the covidcast/trend endpoint of the Epidata API, which returns output that looks like this:

# Query: https://api.covidcast.cmu.edu/epidata/covidcast/trend?
#  signal=jhu-csse:deaths_7dav_incidence_prop
#  &geo=nation:*
#  &date=20220612
#  &basis_shift=7
#  &window=20220213-20220613
{
    "geo_type": "nation",
    "geo_value": "us",
    "date": 20220612,
    "value": 0.1152074,
    "basis_date": 20220605,
    "basis_value": 0.080945,
    "basis_trend": "increasing",
    "min_date": 20220603,
    "min_value": 0.0756771,
    "min_trend": "increasing",
    "max_date": 20220213,
    "max_value": 0.7274768,
    "max_trend": "decreasing"
}

The above is taken from the actual query performed by the frontend when determining the "change since last week" for deaths, and it results in "+42.3%" (= value/basis_value - 1 = 0.1152074/0.080945 - 1 ≈ 0.423). Since it queries the 7dav prop signal and not raw counts, we could:

  • Modify the frontend code to query raw counts in addition to 7dav prop, and condition the resulting display on a raw-count threshold (a rough sketch of this option appears after this list).
    • Pros: Changes are limited to one system.
    • Cons: Marginally slower load time (+200ms for the query, plus extra for logic); Complicated special-casing logic.
      • The "change since last week" code is used for all signals, but we only want this extra raw counts query for cases and deaths. Loads of options here along a scale from "hard-coded and painful to maintain but quicker to implement right now" to "configurable and easy to maintain but will require more and more-thoughtful implementation time"
  • Modify the API server to make covidcast/trend suppress or flag responses where the basis value for the requested signal is related to a corresponding raw counts signal below threshold.
    • Pros: May be easier for the API to know which signals are related than for the frontend; No additional load time needed for querying data.
    • Cons: Changes multiple systems; Requires careful rollout procedure; Can't use existing "base name" signal relationship without disrupting other projects.
      • Signal relationships are surfaced in the covidcast/meta endpoint, and correct functioning of Query-Time Computations (JIT) relies on the base name of cases/deaths signals being cumulative. We could store another signal relationship, with loads of options for how to do that -- some would impact additional parts of the frontend through the resulting changes to covidcast/meta output, others would require separate code, special-casing logic, etc.
    • Bonus: Don't need to worry about impact to other API users, since covidcast/trend is not a publicly documented endpoint.

Pinging @sgratzl to weigh in.

@RoniRos (Member) commented Jul 27, 2022

Revisiting this issue.

I understand and appreciate the overhead incurred by these solutions. I am not happy about it, but am also not happy about letting "+424.0%" stand; it doesn't reflect well on our system.

Since the covidcast/trend endpoint is meant to calculate and communicate about trends, and going from 1 to 3 is not quite a trend in the way that going from 300 to 900 is, I think your second option is generally the preferred one: trend should know when ratios are based on small counts and are therefore unreliable, and should signal this condition appropriately, maybe using a special non-numeric value.

I understand this is not trivial to do right. Let's let this issue sleep until we have to revamp related code for other needs, too.
