jsermeno/amby

amby: statistical data visualization

Language: Haskell. License: BSD3.

A statistics visualization library built on top of Chart inspired by Seaborn. Amby provides a high level interface to quickly display attractive visualizations. Amby also provides tools to display Charts from both Amby and the Chart package within GHCi.

Plotting basics

The simplest plotting function is plot'. Here's how you might plot the standard normal distribution.

λ> import Amby
λ> import qualified Statistics.Distribution.Normal as Stats

λ> let x = contDistrDomain Stats.standard 10000
λ> let y = contDistrRange Stats.standard x
λ> plot' x y

normal distribution plot

Notice the tick mark (') at the end of plot'. A function whose name ends in a tick accepts no optional arguments; the unticked version takes the options as an extra argument.

Plotting univariate distributions

This tutorial mirrors the first section of Seaborn's Python tutorial.

Use distPlot to view univariate distributions. By default this will create a histogram and fit a kernel density estimate.

λ> z <- random Stats.standard 100
λ> distPlot' z

distplot

Histograms

The distPlot histogram automatically chooses a reasonable number of bins and counts the data points in each bin. To view the position of each data point you can add a rugplot.

λ> distPlot z $ kde .= False >> rug .= True

histogram with rugplot
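
Option expressions such as kde .= False >> rug .= True read as a sequence of updates applied to a set of defaults. The sketch below models that idea with a small hand-written state monad using only base. It is an assumption about the general pattern, not Amby's implementation: Opts, Setup, setBins, setKde, setRug and resolve are hypothetical names, and the default values are made up.

```haskell
-- Hypothetical sketch of "optional arguments as state updates".
-- None of these names belong to Amby.

-- | The options a plotting function might accept.
data Opts = Opts
  { optBins :: Int
  , optKde  :: Bool
  , optRug  :: Bool
  } deriving (Eq, Show)

-- | Made-up defaults, used when the caller sets nothing.
defaults :: Opts
defaults = Opts { optBins = 10, optKde = True, optRug = False }

-- | A tiny state monad over Opts.
newtype Setup a = Setup { runSetup :: Opts -> (a, Opts) }

instance Functor Setup where
  fmap f (Setup g) = Setup $ \o -> let (a, o') = g o in (f a, o')

instance Applicative Setup where
  pure a = Setup $ \o -> (a, o)
  Setup f <*> Setup g = Setup $ \o ->
    let (h, o1) = f o
        (a, o2) = g o1
    in (h a, o2)

instance Monad Setup where
  Setup g >>= k = Setup $ \o -> let (a, o') = g o in runSetup (k a) o'

-- | Setters, one per option.
setBins :: Int -> Setup ()
setBins n = Setup $ \o -> ((), o { optBins = n })

setKde, setRug :: Bool -> Setup ()
setKde b = Setup $ \o -> ((), o { optKde = b })
setRug b = Setup $ \o -> ((), o { optRug = b })

-- | Apply the caller's updates on top of the defaults.
resolve :: Setup () -> Opts
resolve setup = snd (runSetup setup defaults)
```

In this model, resolve (setBins 20 >> setKde False >> setRug True) yields the customized options, while resolve (pure ()) yields the defaults, which is the role a ticked function such as distPlot' would play.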

Choosing a different number of bins for the histogram can reveal different patterns in the data.

λ> distPlot z $ bins .= 20 >> kde .= False >> rug .= True

histogram with more bins
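
To make the effect of the bins option concrete, here is a minimal base-only sketch of histogram binning: split the data range into equal-width bins and count the points in each. It assumes equal-width bins over the observed range; how Amby chooses bin edges is not stated here, and binCounts is a hypothetical name.

```haskell
-- Hypothetical helper; not part of Amby.

-- | Count how many values fall into each of k equal-width bins spanning
-- the range of the data. The maximum value is placed in the last bin.
binCounts :: Int -> [Double] -> [Int]
binCounts k xs
  | k <= 0 || null xs = []
  | hi == lo          = length xs : replicate (k - 1) 0
  | otherwise         = [ length (filter (== i) indices) | i <- [0 .. k - 1] ]
  where
    lo = minimum xs
    hi = maximum xs
    width = (hi - lo) / fromIntegral k
    -- Bin index of every data point, clamped so the maximum lands in bin k - 1.
    indices = [ min (k - 1) (floor ((x - lo) / width)) | x <- xs ]
```

Calling binCounts with a larger k on the same data spreads the same points over narrower bins, which is why a different bin count can reveal different patterns.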

Kernel density estimation

Kernel density estimation can be a useful tool for plotting the shape of the distribution.

λ> distPlot z $ hist .= False >> rug .= True

Kernel density estimation

A kernel density estimate is a sum of normal distributions, one centered on each data point.

λ> import qualified Statistics.Sample as Stats
λ> import qualified Data.Vector.Unboxed as U
λ> let bandwidth = 1.059 * Stats.stdDev z * fromIntegral (U.length z) ** ((-1) / 5)
λ> let xs = linspace (-6) 6 200
λ> let a = U.take 30 z

λ> let foldFn _ b = plot xs (contDistrRange (Stats.normalDistr b bandwidth) xs)
λ> U.foldM foldFn () a >> rugPlot a (color .= K >> linewidth .= 3) >> xlim (-4, 4)

Kernel density estimation explanation

The resulting curve is normalized so that the area under it equals 1. This is what the kdePlot function provides.

λ> kdePlot z $ shade .= True

Kernel density estimation
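
The sum-and-normalize description above can be written out with only base functions. This is a minimal sketch assuming a Gaussian kernel and the rule-of-thumb bandwidth formula from the GHCi session; stdDev, ruleOfThumbBandwidth, gaussianKernel and kdeAt are hypothetical names, and kdePlot may compute its estimate differently.

```haskell
-- Hypothetical, base-only sketch; not Amby's implementation.

-- | Sample standard deviation (n - 1 denominator). Needs at least two points.
stdDev :: [Double] -> Double
stdDev xs = sqrt (sum [ (x - mean) ^ (2 :: Int) | x <- xs ] / (n - 1))
  where
    n = fromIntegral (length xs)
    mean = sum xs / n

-- | Rule-of-thumb bandwidth: 1.059 * sd * n^(-1/5).
ruleOfThumbBandwidth :: [Double] -> Double
ruleOfThumbBandwidth xs =
  1.059 * stdDev xs * fromIntegral (length xs) ** (-1 / 5)

-- | Standard normal density, used as the kernel.
gaussianKernel :: Double -> Double
gaussianKernel u = exp (negate (u * u) / 2) / sqrt (2 * pi)

-- | Density estimate at x: one kernel per observation, summed and then
-- divided by (n * h) so that the curve integrates to 1.
kdeAt :: Double -> [Double] -> Double -> Double
kdeAt h xs x =
  sum [ gaussianKernel ((x - xi) / h) | xi <- xs ]
    / (fromIntegral (length xs) * h)
```

Evaluating kdeAt h xs over a grid of x values gives the points of the curve. A smaller h makes each kernel narrower, so the curve follows individual observations more closely.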

The bandwidth (bw) parameter of the KDE controls how tightly the estimation is fit to the data, much like the bin size in a histogram. The default behaviour tries to guess a good value, but it may be helpful to try larger or smaller values.

λ> kdePlot' z >> kdePlot z (bw .= BwScalar 0.2) >> kdePlot z (bw .= BwScalar 2)

Kernel density estimation bandwidth

You can also control how far past the range of your dataset the curve is drawn. However, this only influences how the curve is drawn, not how it is fit.

λ> kdePlot z (cut .= 0 >> shade .= True) >> rugPlot' z

Kernel density estimation cut

Plotting categorical data

In this section we'll see how to visualize the relationship between a numeric variable and one or more categorical variables.

Plotting distributions of observations within categories

Boxplots can facilitate easy comparisons across category levels. This kind of plot shows the three quartile values of the distribution along with extreme values. The "whiskers" extend to points that lie within 1.5 IQRs (interquartile ranges) of the lower and upper quartiles, and observations that fall outside this range are displayed independently. Importantly, this means that each value in the boxplot corresponds to an actual observation in the data.
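
The quantities a boxplot displays can be computed by hand. The following base-only sketch assumes quartiles are obtained by linear interpolation between order statistics, which is only one of several common conventions and may differ from what boxPlot uses. quantileSorted, BoxStats and boxStats are hypothetical names.

```haskell
import Data.List (sort)

-- Hypothetical sketch; not Amby's implementation.

-- | Quantile by linear interpolation between order statistics.
-- Expects a non-empty sorted list and 0 <= q <= 1.
quantileSorted :: Double -> [Double] -> Double
quantileSorted q sorted = below + frac * (above - below)
  where
    pos   = q * fromIntegral (length sorted - 1)
    i     = floor pos :: Int
    frac  = pos - fromIntegral i
    below = sorted !! i
    above = sorted !! min (i + 1) (length sorted - 1)

data BoxStats = BoxStats
  { lowerWhisker :: Double
  , q1           :: Double
  , median       :: Double
  , q3           :: Double
  , upperWhisker :: Double
  , outliers     :: [Double]
  } deriving (Eq, Show)

-- | Summary of a non-empty sample. Whiskers end at the most extreme
-- observations within 1.5 IQR of the quartiles; everything beyond
-- that is reported as an outlier.
boxStats :: [Double] -> BoxStats
boxStats xs = BoxStats
  { lowerWhisker = minimum inside
  , q1           = lowQ
  , median       = quantileSorted 0.5 sorted
  , q3           = highQ
  , upperWhisker = maximum inside
  , outliers     = filter (not . inFence) sorted
  }
  where
    sorted = sort xs
    lowQ   = quantileSorted 0.25 sorted
    highQ  = quantileSorted 0.75 sorted
    iqr    = highQ - lowQ
    inFence x = x >= lowQ - 1.5 * iqr && x <= highQ + 1.5 * iqr
    inside = filter inFence sorted
```

In this sketch the whisker ends and the outliers are always actual observations, while the quartiles themselves may fall between two observations because of the interpolation.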

For convenience we'll use the loadDataset function from Amby.Utils to load datasets.

The simplest way to draw a boxplot is to use the boxPlot function.

λ> ds <- loadDataset tips
λ> head ds
Tip
  { totalBill = 16.99
  , tip = 1.01
  , sex = "Female"
  , smoker = "No"
  , day = "Sun"
  , time = "Dinner"
  , tipSize = 2
  }
λ> (b, p, s, k, d, t, _) <- getTipColumns ds

Draw a single horizontal boxplot.

λ> boxPlot' b

single horizontal boxplot

Draw a vertical boxplot grouped by a categorical variable.

λ> boxPlot b $ fac .= d >> axis .= YAxis

boxplot with one factor
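
Conceptually, grouping by a factor means pairing each numeric value with its category label and collecting the values per category, so that one box can be drawn for each group. A minimal base-only sketch of that step follows; groupByFactor is a hypothetical name, and the way Amby represents and orders categories internally is not shown here.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Hypothetical helper; not part of Amby.

-- | Collect values per category label. Categories come out in sorted
-- label order; values keep their original order within each category.
groupByFactor :: Ord k => [k] -> [a] -> [(k, [a])]
groupByFactor labels values =
  [ (k, map snd grp)
  | grp@((k, _) : _) <- groupBy ((==) `on` fst) (sortOn fst (zip labels values))
  ]
```

For example, with toy labels ["Sun", "Sat", "Sun"] and toy values [1.0, 2.0, 3.0], the result has one group for "Sat" holding 2.0 and one for "Sun" holding 1.0 and 3.0.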

Draw a vertical boxplot with nested grouping by two categorical variables.

λ> boxPlot b $ fac .= s >> hue .= d >> axis .= YAxis >> color .= G

boxplot with two factors

Draw a boxplot when some bins are empty.

λ> theme springTheme >> boxPlot b (fac .= d >> hue .= t)

boxplot with empty bins

Control box order.

λ> boxPlot p $ fac .= changeOrder t ["Dinner", "Lunch"]

boxplot with manual order

If you want to compare more than two categorical variables you can use factorPlot.

λ> gridTheme cleanTheme >> factorPlot b (fac .= s >> hue .= d >> col .= k)

boxplot with three factors

We can add labels.

λ> factorPlot b $ fac .= s >> hue .= d >> col .= k >> colLabel .= "smoker"

labeled boxplot with three factors

You can compare up to four categorical variables using factorPlot.

λ> factorPlot b $ fac .= s >> hue .= d >> col .= k >> row .= t

boxplot with four factors

Rendering

There are several ways to render plots.

First, Amby provides the helper functions save and saveSvg, which save a graph to the files .__amby.png and .__amby.svg respectively. save uses the Cairo backend, while saveSvg uses the Diagrams backend. The Diagrams backend produces better looking charts, but is slower.

λ> save $ distPlot' z
λ> saveSvg $ distPlot' z

Second, you can use any rendering methods that the underlying Chart library provides by converting an AmbyChart () or AmbyGrid () to a Renderable (LayoutPick Double Double Double) with the getRenderable function.

λ> import Graphics.Rendering.Chart.Easy (def)
λ> import Graphics.Rendering.Chart.Backend.Cairo as Cairo
λ> import Graphics.Rendering.Chart.Backend.Diagrams as Diagrams
λ> Cairo.renderableToFile def "myFile.png" $ getRenderable $ distPlot' z
λ> Diagrams.renderableToFile def "myFile.svg" $ getRenderable $ distPlot' z

Third, if you have a terminal that supports images, such as iTerm2, you can display charts directly inside the GHCi REPL. Just install the imgcat executable and the pretty-display library. See the Imgcat section below for installation instructions.

λ> distPlot' z

terminal example

Plotting equations

You can also specify graphs using a domain and an equation.

λ> plotEq' [0,0.001..4] sqrt

clean theme equation plot

Multiple container types

Plotting functions work on both lists and generic vectors of doubles.

λ> plotEq' [0,0.001..4] sqrt
λ> plotEq' (linspace 0 4 4000) sqrt
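
Plotting from a domain and an equation amounts to sampling the function at each point of the domain. The base-only sketch below shows that idea for plain lists. linspaceList and sampleEq are hypothetical names, and the assumption that an evenly spaced grid includes both endpoints may not match Amby's linspace.

```haskell
-- Hypothetical, base-only sketch; not Amby's implementation.

-- | n evenly spaced points from lo to hi, both endpoints included
-- (assumed semantics). Requires n >= 2.
linspaceList :: Double -> Double -> Int -> [Double]
linspaceList lo hi n =
  [ lo + fromIntegral i * step | i <- [0 .. n - 1] ]
  where
    step = (hi - lo) / fromIntegral (n - 1)

-- | Sample a function over a domain, yielding (x, f x) pairs to plot.
sampleEq :: [Double] -> (Double -> Double) -> [(Double, Double)]
sampleEq xs f = [ (x, f x) | x <- xs ]
```

One design difference between the two domain styles: an enumeration such as [0,0.001..4] fixes the step size and lets the number of points follow from it, while a linspace-style function fixes the number of points and derives the step.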

Combine graphs using do notation

λ> import Statistics.Distribution.Beta as Stats
λ> :set +m
λ> let plotBeta a b =
λ|       let d = Stats.betaDistr a b
λ|           x = contDistrDomain d 10000
λ|           y = contDistrRange d x
λ|       in plot' x y
λ> do
λ|   theme cleanTheme
λ|   plotBeta 0.5 0.5
λ|   plotBeta 5 1
λ|   plotBeta 1 3
λ|   plotBeta 2 2
λ|   plotBeta 2 5
λ|   ylim (0.0, 2.5)

multiple beta distributions

Dependencies

To use amby you'll first need to install Chart and gtk2hs if you don't already have them.

Chart and Gtk2Hs

Mac OS X

Here are the instructions I used to install Chart and gtk2hs on OS X El Capitan with stack.

stack install Chart-diagrams
brew cask install xquartz
brew install glib cairo gtk gettext fontconfig freetype

Add the line export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig to your .bashrc or a similar shell startup file so that the environment variable is set.

stack install alex happy
stack install gtk2hs-buildtools
stack install glib
stack install -- gtk --flag gtk:have-quartz-gtk
stack install Chart-cairo

Linux and Windows

Instructions for installing gtk2hs on Linux and Windows can be found here.

Then, as on Mac OS X, install the Chart backends:

stack install Chart-diagrams
stack install Chart-cairo

Imgcat

To display charts in GHCi with a terminal such as iTerm2, you'll need imgcat and pretty-display.

Mac OS X

brew tap eddieantonio/eddieantonio
brew install imgcat

Linux and Windows

For more information, visit imgcat's repository.

pretty-display

  1. Add pretty-display to your cabal file.
  2. Run stack build.
  3. Place the following in your .ghci file. If you're using stack you can put this file at the root of your project.
import Text.Display

:set -interactive-print=Text.Display.dPrint
:def pp (\_ -> return ":set -interactive-print=Text.Display.dPrint")
:def npp (\_ -> return ":set -interactive-print=print")
  4. Restart GHCi.

Other tips

Auto-reload files

If you use the save or saveSvg functions because your terminal cannot display images within GHCi, a tool such as entr can run a command like open whenever the file is saved.

ls -d __amby.png | entr -r open /_
