Quantile dotplots

Introduction

This document describes quantile dotplots, the basic motivation behind them, and how to generate them.

Please cite:

Matthew Kay, Tara Kola, Jessica Hullman, Sean Munson. When (ish) is My Bus? User-centered Visualizations of Uncertainty in Everyday, Mobile Predictive Systems. CHI 2016. DOI: 10.1145/2858036.2858558.

Setup

Required libraries

If you are missing any of the packages below, use install.packages("packagename") to install them. The import:: syntax requires the import package to be installed, and provides a simple way to import specific functions from a package without polluting your entire namespace (unlike library()).

library(ggplot2)
import::from(magrittr, `%>%`, `%<>%`, `%$%`)
import::from(dplyr, 
    transmute, group_by, mutate, filter, select, data_frame,
    left_join, summarise, one_of, arrange, do, ungroup)

Ggplot theme

theme_set(theme_light() + theme(
    panel.grid.major=element_blank(), 
    panel.grid.minor=element_blank(),
    axis.line=element_line(color="black"),
    text=element_text(size=14),
    axis.text=element_text(size=rel(15/16)),
    axis.ticks.length=unit(8, "points"),
    line=element_line(size=.75)
))

Representing a continuous probability distribution as discrete outcomes

We might like to represent a continuous probability distribution (say, a prediction for when a bus is going to arrive) as discrete outcomes, since frequency-based (or "discrete outcome") presentations can be easier for people to interpret. How should we do that?

A problematic solution: random draws

A first pass at representing a continuous probability distribution as discrete outcomes might be to generate random draws from that distribution. However, especially for a small number of samples, this tends not to work well. The problem is that any given random sample of (say) 20 draws isn't always representative of the distribution.

For example, a log-normal distribution might look like this:

mu = log(11.4)
sigma = 0.2
x = seq(from=.01, to=30, length=10001)
d = dlnorm(x, mu, sigma)
density_df = data_frame(x, d)
density_df %>%
    ggplot(aes(x = x, y = d)) +
    geom_line() +
    geom_vline(xintercept = 11, color="red") + 
    xlim(0, 30)

And we could generate random samples of a set size, say 20, to visualize this. But every time we do that it looks different:

runs = 5
samples = 20
binwidth = 1.25
data.frame(
    run = factor(1:runs),
    x = rlnorm(runs * samples, mu, sigma)
) %>%
    ggplot(aes(x=x)) +
        geom_dotplot(binwidth=binwidth) +
        facet_grid(run ~ .) +
        xlim(0, 30)

Sometimes the distribution shape is obscured, the location shifted, or tails over- or under-represented.
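To put a rough number on this variability, the sketch below repeats the "draw 20 samples" experiment many times and looks at how far the sample median wanders from the true median. It is only an illustration: the seed, the number of repetitions (runs = 1000), and the name sample_medians are assumptions introduced here, not part of the original analysis.

```r
# Sketch: how much does the median of 20 random draws vary between samples?
set.seed(1)        # arbitrary seed, used only so the sketch is reproducible
mu = log(11.4)
sigma = 0.2
samples = 20
runs = 1000        # hypothetical number of repeated samples

# Median of each random sample of size `samples`
sample_medians = replicate(runs, median(rlnorm(samples, mu, sigma)))

# The true median of a log-normal distribution is exp(mu)
true_median = qlnorm(0.5, mu, sigma)

true_median
range(sample_medians)
```

The wider the range of sample medians around the true median, the more a single random dotplot can mislead about where the distribution is located.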

A consistent solution: a quantile dotplot

One way to get a more consistent representation is to use the quantile function of the distribution instead of random draws. In essence, this encodes the cumulative distribution function one-dimensionally, and then turns it into a Wilkinsonian dotplot. (Wilkinson originally described dotplots for displaying sample data in Wilkinson, L., Dot Plots, The American Statistician, 1999; we re-use them here for displaying predictive quantiles, hence quantile dotplot.)

First, we generate evenly spaced quantiles in probability space (i.e. from 0 to 1) depending on the number of samples we want. If you are familiar with Q-Q plots, this is essentially the same approach used to generate representative quantiles for a Q-Q plot (a variant of this method is implemented in R in the ppoints function, but spelled out here for completeness).

quantiles = data_frame(
    p_less_than_x = seq(from = 1/samples / 2, to = 1 - (1/samples / 2), length=samples),
    x = qlnorm(p_less_than_x, mu, sigma)
)
quantiles
## # A tibble: 20 x 2
##    p_less_than_x         x
##            <dbl>     <dbl>
## 1          0.025  7.703082
## 2          0.075  8.548083
## 3          0.125  9.057050
## 4          0.175  9.456435
## 5          0.225  9.801450
## 6          0.275 10.115423
## 7          0.325 10.410979
## 8          0.375 10.696167
## 9          0.425 10.976863
## 10         0.475 11.257921
## 11         0.525 11.543872
## 12         0.575 11.839448
## 13         0.625 12.150147
## 14         0.675 12.482976
## 15         0.725 12.847707
## 16         0.775 13.259262
## 17         0.825 13.743022
## 18         0.875 14.349043
## 19         0.925 15.203409
## 20         0.975 16.871168

The table above shows the probability of drawing a value less than x (i.e. P(X < x)) and the corresponding value of x to achieve that probability on the underlying distribution. We generate 20 (or whatever value of samples you like) values of p_less_than_x evenly spaced in (0,1) (i.e. not including 0 or 1), and then find x using the inverse CDF (aka the quantile function) of the predictive distribution at each value of p_less_than_x. If we then take those values of x and plot them as a dotplot, we get a quantile dotplot.

Visualized with the CDF, that looks like this:

x = seq(0.01, 30, length=1001)
p_less_than_x = plnorm(x, mu, sigma)
qf_df = data_frame(x, p_less_than_x)
cdf_plot = ggplot(quantiles, aes(x=x, y=p_less_than_x)) +
    geom_segment(aes(xend=x, yend=p_less_than_x), x = 0, color="gray75", linetype="dashed") +
    geom_segment(aes(xend=x, yend=p_less_than_x), y = -0.05, color="gray75", linetype="dashed",
        arrow = arrow(ends="first", length=unit(5, "pt"), type="closed")) +    
    geom_line(data = qf_df, color="red", size=1) +
    annotate("text", x = 30, y = 1, label = "cumulative distribution function", color="red", hjust=1, vjust=1.5) +
    xlim(c(-3.5, 30))
cdf_plot

quantile_dotplot = quantiles %>%
    ggplot(aes(x = x)) +
    geom_vline(aes(xintercept = x), color="gray75", linetype="dashed") +  
    geom_dotplot(binwidth = 1.25, color=NA) +
    xlim(-3.5,30)
quantile_dotplot

The evenly-spaced probabilities on the y axis are turned into representative quantiles from the distribution on the x axis.
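As a quick cross-check of the remark that ppoints implements a variant of this spacing, the sketch below compares the hand-rolled probabilities with ppoints(samples, a = 1/2), which computes (i - 1/2) / n. The names p_manual, p_ppoints, and x_ppoints are introduced here only for illustration.

```r
# Cross-check sketch: hand-rolled probabilities vs. stats::ppoints()
mu = log(11.4)
sigma = 0.2
samples = 20

# Evenly spaced probabilities, constructed as in this document
p_manual = seq(from = 1/samples / 2, to = 1 - (1/samples / 2), length.out = samples)

# ppoints(n, a) returns (1:n - a) / (n + 1 - 2*a); a = 1/2 gives (i - 1/2) / n
p_ppoints = ppoints(samples, a = 1/2)

# Should be TRUE up to floating-point tolerance
all.equal(p_manual, p_ppoints)

# The corresponding quantiles of the log-normal distribution
x_ppoints = qlnorm(p_ppoints, mu, sigma)
range(x_ppoints)
```

Note that ppoints defaults to a = 3/8 when n is 10 or fewer, so passing a = 1/2 explicitly keeps the spacing consistent for small numbers of dots.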

Properties of the quantile dotplot

Shape approximates density

The shape of a quantile dotplot approximates the shape of the density:

quantile_dotplot +
    geom_line(aes(y = d * 5.1), data = density_df, color="red")
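One way to see why the shapes agree is that each dot carries 1/samples of the probability mass, so the number of dots landing in an interval tracks the probability of that interval. The sketch below checks this with fixed-width bins. This is an assumption made for simplicity: geom_dotplot uses "dotdensity" binning by default, so its stacks will not line up exactly with these bins. The names dots, breaks, counts, and expected are illustrative.

```r
# Sketch: dots per bin vs. probability mass per bin (fixed bins, for illustration)
mu = log(11.4)
sigma = 0.2
samples = 20
binwidth = 1.25

# Quantile "dots": one per evenly spaced probability
dots = qlnorm(ppoints(samples, a = 1/2), mu, sigma)

# Fixed-width bins covering the plotted range 0..30
breaks = seq(0, 30, by = binwidth)
counts = hist(dots, breaks = breaks, plot = FALSE)$counts

# Expected number of dots per bin: samples * P(lower < X <= upper)
upper = breaks[-1]
lower = breaks[-length(breaks)]
expected = samples * (plnorm(upper, mu, sigma) - plnorm(lower, mu, sigma))

data.frame(bin_mid = lower + binwidth / 2, counts, expected = round(expected, 2))
```

Because the dots sit at probabilities (i - 1/2) / samples, the count in any bin differs from its expected value by less than one dot.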

Finding probability intervals reduces to counting

Inferences about one-sided probability intervals on the distribution can be made through counting. If our example distribution represents a probability distribution over predicted time to arrive for my bus, and I am willing to miss the bus 2/20 times, I can count up 2 dots ("buses") from the left to get the time I should arrive at the bus stop. This works because the x-value for P(X < x) = 2/20 will be between the second and third dot. More generally, the kth dot sits at probability (k - 0.5)/20 and the (k + 1)th dot at (k + 0.5)/20, so the x-value for P(X < x) = k/20 will be between the kth and (k + 1)th dot. Here's how this works on the CDF and the dotplot:

cdf_plot +
    geom_segment(x = 0, xend = qlnorm(.1, mu, sigma), y = .1, yend = .1, color = "purple", size = 1) +  
    geom_segment(x = qlnorm(.1, mu, sigma), xend = qlnorm(.1, mu, sigma), y = -0.05, yend = .1, 
        color = "purple", size = 1, arrow = arrow(ends="first", length=unit(5, "pt"), type="closed")) +
    annotate("text", x = 0, y = 0.1, label = "2/20 = 0.1", color = "purple", hjust = 1.1)
quantile_dotplot +
    geom_dotplot(binwidth = 1.25, color=NA, fill="purple", data=filter(quantiles, p_less_than_x < .1)) +
    geom_vline(xintercept = qlnorm(.1, mu, sigma), color="purple", size=1) 

Consequently, this purple line represents the point in time I should arrive at my bus stop if I want to catch the bus about 18 times out of 20 (or miss it 2 times out of 20).
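The counting argument can also be checked numerically. The sketch below rebuilds the 20 dots, confirms that the x-value for P(X < x) = 2/20 falls between the second and third dot, and compares a counting-based probability estimate with the exact CDF. The helper count_prob and the names dots and arrive_by are hypothetical, introduced only for this illustration.

```r
# Sketch: reading probabilities off a quantile dotplot by counting dots
mu = log(11.4)
sigma = 0.2
samples = 20

# The dots: quantiles at evenly spaced probabilities (i - 1/2) / samples
dots = qlnorm(ppoints(samples, a = 1/2), mu, sigma)

# Arrival time that misses the bus k out of `samples` times
k = 2
arrive_by = qlnorm(k / samples, mu, sigma)

# Exactly k dots lie to the left of that time
sum(dots < arrive_by)
dots[k] < arrive_by && arrive_by < dots[k + 1]

# Hypothetical helper: estimate P(X < t) as the fraction of dots below t
count_prob = function(t) sum(dots < t) / samples

# Counting estimate vs. exact CDF at an arbitrary time, t = 10
count_prob(10)
plnorm(10, mu, sigma)
```

With this spacing, the counting estimate can be off by at most half a dot, i.e. 0.5 / samples in probability.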