Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: histogram(nbins=N) off by one error #9687

Closed
1 task done
lostmygithubaccount opened this issue Jul 24, 2024 · 4 comments · Fixed by #9711
Closed
1 task done

bug: histogram(nbins=N) off by one error #9687

lostmygithubaccount opened this issue Jul 24, 2024 · 4 comments · Fixed by #9711
Assignees
Labels
bug Incorrect behavior inside of ibis
Milestone

Comments

@lostmygithubaccount
Copy link
Member

What happened?

import ibis

ibis.options.interactive = True
ibis.options.repr.interactive.max_rows = 20

t = ibis.range(1000).unnest().name("index").as_table()
t.select(hist=t["index"].histogram(nbins=10)).value_counts()
┏━━━━━━━┳━━━━━━━━━━━━┓
┃ hist  ┃ hist_count ┃
┡━━━━━━━╇━━━━━━━━━━━━┩
│ int64 │ int64      │
├───────┼────────────┤
│     5 │        100 │
│     9 │         99 │
│     2 │        100 │
│    10 │          1 │
│     0 │        100 │
│     1 │        100 │
│     3 │        100 │
│     6 │        100 │
│     4 │        100 │
│     7 │        100 │
│     8 │        100 │
└───────┴────────────┘

there is one value in bin 10 and 99 in bin 9

What version of ibis are you using?

main

What backend(s) are you using, if any?

duckdb

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@cpcloud
Copy link
Member

cpcloud commented Jul 24, 2024

Yeah, looks like a bug! Great example, really makes the problem obvious :)

@christophertitchen
Copy link
Contributor

/take

@jitingxu1
Copy link
Contributor

Took a quick look at the code, it'll always put the maximum value of the column in an extra bin based on the current design (self - base) / binwidth.

I'm not sure about the eps, seems like not having real effect on the bin assignment. It slightly alters the binwidth and the value, effectively canceling out any impact.

Also, the base is the min value of the bins, I feel like make more sense to have range(min, max) if we want to let user specify the min and max value of bins.

@christophertitchen
Copy link
Contributor

Took a quick look at the code, it'll always put the maximum value of the column in an extra bin based on the current design (self - base) / binwidth.

I'm not sure about the eps, seems like not having real effect on the bin assignment. It slightly alters the binwidth and the value, effectively canceling out any impact.

Also, the base is the min value of the bins, I feel like make more sense to have range(min, max) if we want to let user specify the min and max value of bins.

Yes, I agree it would be lovely to have that functionality to specify the lower and upper range of the bins, along with a few other useful features like non-uniform bin-widths and weights like in np.histogram. This will be a quick and straightforward fix within the current design though—to make the bounds of the last bin inclusive—so I thought it would make a nice first contribution for me. We can save the more comprehensive redesign for another PR once the bug has been fixed.

gforsyth pushed a commit that referenced this issue Jul 30, 2024
## Description of changes

```
import ibis

ibis.options.interactive = True
ibis.options.repr.interactive.max_rows = 20

t = ibis.range(1000).unnest().name("index").as_table()
t.select(hist=t["index"].histogram(nbins=10)).value_counts()
```

```
┏━━━━━━━┳━━━━━━━━━━━━┓
┃ hist  ┃ hist_count ┃
┡━━━━━━━╇━━━━━━━━━━━━┩
│ int64 │ int64      │
├───────┼────────────┤
│     5 │        100 │
│     9 │        100 │
│     0 │        100 │
│     3 │        100 │
│     6 │        100 │
│     2 │        100 │
│     7 │        100 │
│     8 │        100 │
│     1 │        100 │
│     4 │        100 │
└───────┴────────────┘
```

## Issues closed

* Resolves #9687.

I had to make a slight change to ``histogram`` to account for an edge case that was tested for Impala. It would fail if ``nbins`` was not passed, which is a rather niche use case because ``np.histogram`` for example requires the number of bins to be passed either explicitly or implicitly. I also found a slight quirk with the current design when fixing this because if a ``base`` is passed that is not the minimum value, it would assign those out-of-bound values smaller than the base a negative bin index. It now clips those out-of-bound values to the bin index of -1 to group them together, rather than potentially having bin indices of -1 and -2 for example, so this now aligns with how ``np.histogram`` assigns a bin index of 0 for out-of-bound values.
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jul 30, 2024
@cpcloud cpcloud added this to the 9.3 milestone Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants