Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix median miscalculation for even-sized item list #2224

Merged
merged 4 commits into from
May 26, 2023

Conversation

adetsi
Copy link
Contributor

@adetsi adetsi commented May 17, 2023

Fix for issue (#1928) where median value for even-sized item list is miscalculated, see below.

Before

image

After

image

@finos-cla-bot
Copy link

finos-cla-bot bot commented May 17, 2023

Thank you for your contribution and Welcome to our Open Source Community!

To make sure your pull request is accepted successfully, we ask all our open source contributors to sign a Contributor License Agreement; having reviewed our contributor list, we require a CLA for the following people : (@adetsi).

In order to sign a CLA with FINOS, just submit a Pull Request with a simple change to this file (adding an empty line, or some random text at the bottom); this will trigger the EasyCLA bot, which will post a comment to the Pull Request stating whether all PR contributors are covered by FINOS CLA; if not covered, the bot will post instructions on how to sign the CLA.

Thanks once again for your contribution. Let us work with you to make the CLA process quick, easy and efficient so we can move forward with reviewing and accepting your pull request. Feel free to email [email protected] for any questions.

cc @TheJuanAndOnly99 @mcleo-d

@TheJuanAndOnly99
Copy link
Member

@cla-bot[bot] check

@finos-cla-bot
Copy link

finos-cla-bot bot commented May 17, 2023

The cla-bot has been summoned, and re-checked this pull request!

@texodus
Copy link
Member

texodus commented May 18, 2023

@adetsi There is already an open PR for this issue #2197.

That said, neither this PR nor #2197 actually address the issue I described in my response to #1928, which is that the median() aggregate is also used for non-numeric values and therefore we can't naively calculate the average when the column has an even number of elements.

@adetsi
Copy link
Contributor Author

adetsi commented May 19, 2023

@texodus

Thanks for the clarification, I have added implementations to accommodate for non-numeric types as well (preserves the old behavior).

@adetsi
Copy link
Contributor Author

adetsi commented May 24, 2023

@texodus
Broken test cases have been fixed

Copy link
Member

@texodus texodus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @adetsi. This implementation is better but the failing tests indicate another issue, that median can't be calculated for integer columns without changing the aggregate column's output type to float; trying to calculate the median without this will generate all 0's. In order to do this, you'd need to set t_tscalar median_average explicitly, as well as setting the associated metadata for the schema transition in a few other places around the codebase. (this is done already for e.g. count which goes from any type to integer, if you need an example to grep for).

In lieu of this, it would be easier to just implement this behavior for float columns instead of numeric ones, by changing the feature-check from is_numeric to is_floating_point. integer columns will then stay integer when applied median() and exhibit the old behavior, while columns which are float will use the new behavior. This would be an acceptable alternative if the former implementation for integer-column-type-upgrading is too complex (it may involve updating python and the UI as well).

See inline review notes below, there are also a few small style and efficiency improvements, and we should preserve the current test inputs and their integer columns.

EDIT

I've implement these changes here for you to cherry-pick if you choose.


median_average.set((*first_middle + *second_middle) / static_cast<t_tscalar>(2));
return median_average;
}else{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting - please run clang-format.

int size = values.size();
bool is_even_size = size % 2 == 0;

if (is_even_size && values[0].is_numeric()){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See review summary - is_numeric() case here should be exclusively float column types via is_floating_point().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

std::vector<t_tscalar>::iterator first_middle = values.begin() + ((size - 1) / 2);
std::vector<t_tscalar>::iterator second_middle = values.begin() + (size / 2);

nth_element(values.begin(), first_middle, values.end());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nth_element() does not need to be called twice here, the column is guaranteed to be even and min 2, so this is equivalent to *(second_middle - 1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

values.begin(), middle, values.end());

return *middle;
return get_aggregate_median(values);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion - if we're going to factor out this logic into a function called get_aggregate_median(values), maybe we should also move the 0 and 1 cases into this definition to 1) make it complete and 2) move the entire closure body to a single method call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@adetsi
Copy link
Contributor Author

adetsi commented May 26, 2023

@texodus thank you for the review and comments, I hope everything looks good now.

@texodus
Copy link
Member

texodus commented May 26, 2023

Thanks for the PR @adetsi! Looks good!

@texodus texodus added the bug Concrete, reproducible bugs label May 26, 2023
@texodus texodus merged commit 3116936 into finos:master May 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Concrete, reproducible bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Shouldn't the median value for numbers 1 and 2 be equal to 1.5 instead of 2?
3 participants