Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram merges last two values when last value coincides with the upper threshold #54

Closed
aendra-rininsland opened this issue Mar 27, 2017 · 4 comments

Comments

@aendra-rininsland
Copy link

aendra-rininsland commented Mar 27, 2017

I wrote a Stack Overflow question about this, which documents the behaviour well. At the request of an answerer, I've created this issue to try and understand this behaviour better (or alternately register it as a bug if it is indeed such).

tl;dr — d3.histogram merges the last two bins when the last value coincides with the upper threshold. However, if you set histogram.domain() to [extentMin -1, extentMax + 1], the problem seems to dissipate.

Example code (courtesy of Gerardo Furtado):

var data = d3.range(100);

const histogram = d3.histogram()
  .value(d => d)
  .thresholds(data);

var bins = histogram(data);

console.log(bins);

The last bin contains both 98 and 99, whereas the other bins only contain one value.

This isn't the case with the other thresholds:

var data = d3.range(100);

const histogram = d3.histogram()
  .value(d => d)
  .thresholds(d3.thresholdFreedmanDiaconis(data, d3.min(data), d3.max(data)));

var bins = histogram(data);

console.log(bins)
var data = d3.range(100);

const histogram = d3.histogram()
  .value(d => d)
  .thresholds(d3.thresholdScott(data, d3.min(data), d3.max(data)));

var bins = histogram(data);

console.log(bins)
var data = d3.range(100);

const histogram = d3.histogram()
  .value(d => d)
  .thresholds(d3.thresholdSturges(data, d3.min(data), d3.max(data)));

var bins = histogram(data);

console.log(bins)

Any idea what's going on with this?

Thanks!

@curran
Copy link

curran commented Apr 12, 2017

I believe including the upper threshold value in the max bin is standard histogram behavior.

I wonder, would histogram libraries in the R or Python worlds give the same result?

@mbostock
Copy link
Member

mbostock commented Apr 12, 2017

Per the API reference:

Thresholds are defined as an array of values [x0, x1, …]. Any value less than x0 will be placed in the first bin; any value greater than or equal to x0 but less than x1 will be placed in the second bin; and so on. Thus, the generated histogram will have thresholds.length + 1 bins.

But, what does not appear to be documented in the API reference is how histogram.domain affects the thresholds. From histogram.js:

// Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] >= x1) tz.pop(), --m;

So, any threshold value that is less than or equal to domain[0] is dropped, as is any threshold value that is greater than or equal to domain[1].

Thus, if you have 100 thresholds [0, 1, 2, … 97, 98, 99] (d3.range(100)), and the default domain [0, 99] (d3.extent(d3.range(100))), you in fact only have 98 thresholds [1, 2, … 97, 98] because the 0 and 99 thresholds are removed.

This means the histogram will generate 99 (98 + 1) bins, with any value less than 1 ([0]) in the first bin, (bins[0]), any value greater than or equal to 1 and less than 2 ([1]) in the second bin (bins[1]), and so on, with the 99th bin (bins[98]) containing any value greater than or equal to 98 ([98, 99]).

I think it’s possible that the logic should be tweaked slightly since the upper bound of the bin is exclusive:

// Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] > x1) tz.pop(), --m;

That would produce 99 thresholds [1, 2, … 98, 99] and 100 bins, [[0], [1], … [98], [99]]. But that’s a little weird because it means that the last bin can only contain values that are exactly equal to the maximum value. This change feels to tailored to the contrived example in question.

So, perhaps it makes more sense to only remove thresholds when those thresholds aren’t specified explicitly; that is, to only remove thresholds when they are specified as a count. This would also apply to the builtin threshold count estimators d3.thresholdFreedmanDiaconis, d3.thresholdScott and d3.thresholdSturges. Then if 100 threshold values are specified explicitly, you’d always get 101 thresholds regardless of the value of histogram.domain. However, depending on the domain and the input values, some of those bins would be empty, as expected. And the default would be [[], [0], [1], … [98], [99]]. I think this change would make a lot of sense, since when you don’t specify the threshold values explicitly, the specified count is only a hint anyway.

mbostock added a commit that referenced this issue Apr 12, 2017
Only filter implicit thresholds by the domain. Fixes #54.
@mbostock
Copy link
Member

I’m still not 100% sure this is a good idea since the first and last bin’s bin.x0 and bin.x1 will be equal. Actually it’s worse than that because the first bin’s bin.x0 is set to domain[0] and the last bin.x1 is set to domain[1], which is wrong if the thresholds are outside or coincident with the domain endpoints.

@mbostock
Copy link
Member

If we don’t filter the thresholds, then probably the first bin’s bin.x0 and the last bin’s bin.x1 should be undefined. This would be consistent with threshold.invertExtent from d3-scale. But this would need a major version bump, and while perhaps more flexible, it also makes it more cumbersome in the common case because you’d need to handle the unbounded extreme bins specially when rendering the histogram; see the example.

So, I think it probably makes more sense to keep the current behavior where the histogram always has a finite domain (either set explicitly or determined from the data), and then the thresholds (even if set explicitly) must be constrained to that domain.

Since the lower bound of the bin is inclusive, we should tweak the threshold filtering of the upper bound (as described above) so that the threshold 99 isn’t dropped if the domain is [0, 99]:

// Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] > x1) tz.pop(), --m;

We should also document this behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants