-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix median miscalculation for even-sized item list #2224
Conversation
Thank you for your contribution and Welcome to our Open Source Community! To make sure your pull request is accepted successfully, we ask all our open source contributors to sign a Contributor License Agreement; having reviewed our contributor list, we require a CLA for the following people : (@adetsi). In order to sign a CLA with FINOS, just submit a Pull Request with a simple change to this file (adding an empty line, or some random text at the bottom); this will trigger the EasyCLA bot, which will post a comment to the Pull Request stating whether all PR contributors are covered by FINOS CLA; if not covered, the bot will post instructions on how to sign the CLA. Thanks once again for your contribution. Let us work with you to make the CLA process quick, easy and efficient so we can move forward with reviewing and accepting your pull request. Feel free to email [email protected] for any questions. |
@cla-bot[bot] check |
The cla-bot has been summoned, and re-checked this pull request! |
@adetsi There is already an open PR for this issue #2197. That said, neither this PR nor #2197 actually address the issue I described in my response to #1928, which is that the |
Thanks for the clarification, I have added implementations to accommodate for non-numeric types as well (preserves the old behavior). |
@texodus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @adetsi. This implementation is better but the failing tests indicate another issue, that median
can't be calculated for integer
columns without changing the aggregate column's output type to float
; trying to calculate the median without this will generate all 0's. In order to do this, you'd need to set t_tscalar median_average
explicitly, as well as setting the associated metadata for the schema transition in a few other places around the codebase. (this is done already for e.g. count
which goes from any type to integer
, if you need an example to grep for).
In lieu of this, it would be easier to just implement this behavior for float
columns instead of numeric ones, by changing the feature-check from is_numeric
to is_floating_point
. integer
columns will then stay integer
when applied median()
and exhibit the old behavior, while columns which are float will use the new behavior. This would be an acceptable alternative if the former implementation for integer
-column-type-upgrading is too complex (it may involve updating python and the UI as well).
See inline review notes below, there are also a few small style and efficiency improvements, and we should preserve the current test inputs and their integer columns.
EDIT
I've implement these changes here for you to cherry-pick if you choose.
|
||
median_average.set((*first_middle + *second_middle) / static_cast<t_tscalar>(2)); | ||
return median_average; | ||
}else{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting - please run clang-format
.
int size = values.size(); | ||
bool is_even_size = size % 2 == 0; | ||
|
||
if (is_even_size && values[0].is_numeric()){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See review summary - is_numeric()
case here should be exclusively float
column types via is_floating_point()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
std::vector<t_tscalar>::iterator first_middle = values.begin() + ((size - 1) / 2); | ||
std::vector<t_tscalar>::iterator second_middle = values.begin() + (size / 2); | ||
|
||
nth_element(values.begin(), first_middle, values.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nth_element()
does not need to be called twice here, the column is guaranteed to be even and min 2, so this is equivalent to *(second_middle - 1)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
values.begin(), middle, values.end()); | ||
|
||
return *middle; | ||
return get_aggregate_median(values); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a suggestion - if we're going to factor out this logic into a function called get_aggregate_median(values)
, maybe we should also move the 0
and 1
cases into this definition to 1) make it complete and 2) move the entire closure body to a single method call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
@texodus thank you for the review and comments, I hope everything looks good now. |
Thanks for the PR @adetsi! Looks good! |
Fix for issue (#1928) where median value for even-sized item list is miscalculated, see below.
Before
After