-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Histogram merges last two values when last value coincides with the upper threshold #54
Comments
I believe including the upper threshold value in the max bin is standard histogram behavior. I wonder, would histogram libraries in the R or Python worlds give the same result? |
Per the API reference:
But, what does not appear to be documented in the API reference is how histogram.domain affects the thresholds. From histogram.js: // Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] >= x1) tz.pop(), --m; So, any threshold value that is less than or equal to Thus, if you have 100 thresholds [0, 1, 2, … 97, 98, 99] ( This means the histogram will generate 99 (98 + 1) bins, with any value less than 1 ( I think it’s possible that the logic should be tweaked slightly since the upper bound of the bin is exclusive: // Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] > x1) tz.pop(), --m; That would produce 99 thresholds [1, 2, … 98, 99] and 100 bins, [[0], [1], … [98], [99]]. But that’s a little weird because it means that the last bin can only contain values that are exactly equal to the maximum value. This change feels to tailored to the contrived example in question. So, perhaps it makes more sense to only remove thresholds when those thresholds aren’t specified explicitly; that is, to only remove thresholds when they are specified as a count. This would also apply to the builtin threshold count estimators d3.thresholdFreedmanDiaconis, d3.thresholdScott and d3.thresholdSturges. Then if 100 threshold values are specified explicitly, you’d always get 101 thresholds regardless of the value of histogram.domain. However, depending on the domain and the input values, some of those bins would be empty, as expected. And the default would be [[], [0], [1], … [98], [99]]. I think this change would make a lot of sense, since when you don’t specify the threshold values explicitly, the specified count is only a hint anyway. |
Only filter implicit thresholds by the domain. Fixes #54.
I’m still not 100% sure this is a good idea since the first and last bin’s bin.x0 and bin.x1 will be equal. Actually it’s worse than that because the first bin’s bin.x0 is set to domain[0] and the last bin.x1 is set to domain[1], which is wrong if the thresholds are outside or coincident with the domain endpoints. |
If we don’t filter the thresholds, then probably the first bin’s bin.x0 and the last bin’s bin.x1 should be undefined. This would be consistent with threshold.invertExtent from d3-scale. But this would need a major version bump, and while perhaps more flexible, it also makes it more cumbersome in the common case because you’d need to handle the unbounded extreme bins specially when rendering the histogram; see the example. So, I think it probably makes more sense to keep the current behavior where the histogram always has a finite domain (either set explicitly or determined from the data), and then the thresholds (even if set explicitly) must be constrained to that domain. Since the lower bound of the bin is inclusive, we should tweak the threshold filtering of the upper bound (as described above) so that the threshold 99 isn’t dropped if the domain is [0, 99]: // Remove any thresholds outside the domain.
var m = tz.length;
while (tz[0] <= x0) tz.shift(), --m;
while (tz[m - 1] > x1) tz.pop(), --m; We should also document this behavior. |
I wrote a Stack Overflow question about this, which documents the behaviour well. At the request of an answerer, I've created this issue to try and understand this behaviour better (or alternately register it as a bug if it is indeed such).
tl;dr — d3.histogram merges the last two bins when the last value coincides with the upper threshold. However, if you set
histogram.domain()
to[extentMin -1, extentMax + 1]
, the problem seems to dissipate.Example code (courtesy of Gerardo Furtado):
The last bin contains both 98 and 99, whereas the other bins only contain one value.
This isn't the case with the other thresholds:
Any idea what's going on with this?
Thanks!
The text was updated successfully, but these errors were encountered: