Fix median miscalculation for even-sized item list #2224

adetsi · 2023-05-17T16:56:27Z

Fixes issue #1928, where the median value of an even-sized item list is miscalculated; see below.

Before

After

finos-cla-bot · 2023-05-17T16:56:31Z

Thank you for your contribution and Welcome to our Open Source Community!

To make sure your pull request is accepted successfully, we ask all our open source contributors to sign a Contributor License Agreement; having reviewed our contributor list, we require a CLA for the following people : (@adetsi).

In order to sign a CLA with FINOS, just submit a Pull Request with a simple change to this file (adding an empty line, or some random text at the bottom); this will trigger the EasyCLA bot, which will post a comment to the Pull Request stating whether all PR contributors are covered by FINOS CLA; if not covered, the bot will post instructions on how to sign the CLA.

Thanks once again for your contribution. Let us work with you to make the CLA process quick, easy and efficient so we can move forward with reviewing and accepting your pull request. Feel free to email [email protected] for any questions.

cc @TheJuanAndOnly99 @mcleo-d

TheJuanAndOnly99 · 2023-05-17T17:01:53Z

@cla-bot[bot] check

finos-cla-bot · 2023-05-17T17:01:56Z

The cla-bot has been summoned, and re-checked this pull request!

texodus · 2023-05-18T01:20:04Z

@adetsi There is already an open PR for this issue #2197.

That said, neither this PR nor #2197 actually addresses the issue I described in my response to #1928: the median() aggregate is also used for non-numeric values, so we can't naively calculate the average when the column has an even number of elements.

adetsi · 2023-05-19T19:56:28Z

@texodus

Thanks for the clarification. I have added implementations that accommodate non-numeric types as well (preserving the old behavior).

adetsi · 2023-05-24T10:09:23Z

@texodus
The broken test cases have been fixed.

texodus

Thanks for the update @adetsi. This implementation is better, but the failing tests indicate another issue: the median can't be calculated for integer columns without changing the aggregate column's output type to float, and trying to calculate the median without this will generate all 0's. To do this, you'd need to set t_tscalar median_average explicitly, as well as set the associated metadata for the schema transition in a few other places around the codebase (this is already done for e.g. count, which goes from any type to integer, if you need an example to grep for).

In lieu of this, it would be easier to implement this behavior only for float columns instead of all numeric ones, by changing the feature check from is_numeric to is_floating_point. Integer columns will then stay integer when median() is applied and exhibit the old behavior, while float columns will use the new behavior. This would be an acceptable alternative if the former implementation for integer-column-type-upgrading is too complex (it may involve updating Python and the UI as well).
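The float-only alternative can be illustrated with a small standalone sketch. It does not use Perspective's t_tscalar: the overloaded function name median_sketch is hypothetical, and plain double and int vectors stand in for float and integer columns. The "old behavior" is assumed to be returning the upper-middle element, which matches the result of 2 that the linked issue reports for the inputs 1 and 2.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a float column: new behavior, average the two
// middle values when the element count is even. The input is assumed to be
// non-empty; std::sort is used for clarity rather than speed.
double median_sketch(std::vector<double> values) {
    std::sort(values.begin(), values.end());
    const std::size_t n = values.size();
    if (n % 2 == 0) {
        return (values[n / 2 - 1] + values[n / 2]) / 2.0;
    }
    return values[n / 2];
}

// Hypothetical stand-in for an integer column: assumed old behavior, return
// the upper-middle element so the output type stays integer.
int median_sketch(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    return values[values.size() / 2];
}
```

With these overloads, the integer inputs 1 and 2 still produce 2, while the float inputs 1.0 and 2.0 produce 1.5, so no output type has to change.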

See the inline review notes below; there are also a few small style and efficiency improvements, and we should preserve the current test inputs and their integer columns.

EDIT

I've implemented these changes here for you to cherry-pick if you choose.

texodus · 2023-05-26T02:09:22Z

cpp/perspective/src/cpp/sparse_tree.cpp

+
+        median_average.set((*first_middle + *second_middle) / static_cast<t_tscalar>(2));
+        return median_average;
+    }else{


Formatting - please run clang-format.

texodus · 2023-05-26T02:11:09Z

cpp/perspective/src/cpp/sparse_tree.cpp

+    int size = values.size();
+    bool is_even_size = size % 2 == 0;
+
+    if (is_even_size && values[0].is_numeric()){


See review summary: the is_numeric() case here should apply exclusively to float column types, via is_floating_point().

texodus · 2023-05-26T02:13:34Z

cpp/perspective/src/cpp/sparse_tree.cpp

+        std::vector<t_tscalar>::iterator first_middle = values.begin() + ((size - 1) / 2);
+        std::vector<t_tscalar>::iterator second_middle = values.begin() + (size / 2);
+
+        nth_element(values.begin(),  first_middle, values.end());


nth_element() does not need to be called twice here; the column is guaranteed to have an even size of at least 2, so this is equivalent to *(second_middle - 1).
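One way to get both middle values from a single nth_element() call is sketched below on plain doubles rather than t_tscalar. The function name middle_pair is hypothetical, and this is not necessarily what the merged commit does. Because nth_element() only partitions the range and leaves the prefix unsorted, the sketch takes the lower-middle value as the maximum of the elements in front of second_middle.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Returns the (lower, upper) middle values of the input. The input is
// assumed to have an even size of at least 2.
std::pair<double, double> middle_pair(std::vector<double> values) {
    auto second_middle = values.begin() + values.size() / 2;
    // One selection pass puts the upper-middle value in place; every element
    // in front of second_middle is now less than or equal to it.
    std::nth_element(values.begin(), second_middle, values.end());
    // The prefix is partitioned but not sorted, so the lower-middle value is
    // the largest element of that prefix.
    auto first_middle = std::max_element(values.begin(), second_middle);
    return std::make_pair(*first_middle, *second_middle);
}
```

The max_element() scan is linear in half the input, which is typically cheaper than a second selection pass over the whole range.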

texodus · 2023-05-26T02:21:53Z

cpp/perspective/src/cpp/sparse_tree.cpp

-                                values.begin(), middle, values.end());
-
-                            return *middle;
+                            return get_aggregate_median(values);


Just a suggestion - if we're going to factor out this logic into a function called get_aggregate_median(values), maybe we should also move the 0 and 1 cases into this definition to 1) make it complete and 2) move the entire closure body to a single method call.
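As a rough sketch of what a complete helper could look like, the function below also covers the 0 and 1 element cases. It works on plain doubles rather than t_tscalar, the name aggregate_median_sketch is hypothetical, and the empty_value parameter is an assumed placeholder for whatever the real aggregate returns for an empty column.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical complete helper: handles the empty and single-element cases
// itself so the caller needs no special-casing.
double aggregate_median_sketch(std::vector<double> values, double empty_value) {
    if (values.empty()) {
        return empty_value;
    }
    if (values.size() == 1) {
        return values[0];
    }
    auto middle = values.begin() + values.size() / 2;
    std::nth_element(values.begin(), middle, values.end());
    if (values.size() % 2 == 1) {
        return *middle;
    }
    // Even size: average the upper-middle value with the largest element of
    // the (unsorted) prefix, which is the lower-middle value.
    double lower = *std::max_element(values.begin(), middle);
    return (lower + *middle) / 2.0;
}
```

With a helper of this shape, the closure body reduces to a single return statement that calls it.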

adetsi · 2023-05-26T14:41:54Z

@texodus thank you for the review and comments, I hope everything looks good now.

texodus · 2023-05-26T17:42:26Z

Thanks for the PR @adetsi! Looks good!

Fix median miscalculation

57c2540

finos-cla-bot bot added the cla-present label May 17, 2023

Implement median for both numeric and non-numeric data

1fcace9

Fix broken test cases

65884cc

texodus requested changes May 26, 2023

Optimize implementation, resolve failing test cases and format code properly

fe338cb

texodus approved these changes May 26, 2023

texodus added the bug Concrete, reproducible bugs label May 26, 2023

texodus linked an issue May 26, 2023 that may be closed by this pull request

Shouldn't the median value for numbers 1 and 2 be equal to 1.5 instead of 2? #1928

Closed

texodus merged commit 3116936 into finos:master May 26, 2023

texodus mentioned this pull request Jun 4, 2023

Add type-specific aggregates for median and middle #2197

Closed
