ESQL: mv_median_absolute_deviation function by ivancea · Pull Request #112055 · elastic/elasticsearch

ivancea · 2024-08-21T09:57:01Z

Added mv_median_absolute_deviation function
Added possibility of having a fixed param in Multivalue "ascending" functions
Add surrogate to MedianAbsoluteDeviation

Calculations used to avoid overflows

First, a quick recap of how the MAD is calculated:

1. Sort the values and get the median
2. Calculate the absolute difference between each value and the median (abs(median - value))
3. Sort the differences and get their median
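
The three steps above can be sketched for doubles as follows. This is only an illustration of the recap, not the code in this PR: the class name `MadSketch` and both method names are made up for the example, and the input is assumed to be non-empty and free of NaN.

```java
import java.util.Arrays;

class MadSketch {
    /** Median of a copy of the input; for an even count, averages the two middle values. */
    static double median(double[] input) {
        double[] sorted = input.clone();
        Arrays.sort(sorted);
        int middle = sorted.length / 2;
        // Halve each operand before adding so the sum itself cannot overflow to infinity.
        return sorted.length % 2 == 1 ? sorted[middle] : sorted[middle - 1] / 2 + sorted[middle] / 2;
    }

    /** Median absolute deviation: the median of |value - median(values)|. */
    static double mad(double[] values) {
        double median = median(values);
        double[] diffs = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            diffs[i] = Math.abs(values[i] - median);
        }
        return median(diffs);
    }
}
```

For example, `MadSketch.mad(new double[] { 1, 2, 3, 4, 100 })` has a median of 3 and differences 2, 1, 0, 1, 97, so the MAD is 1.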

Calculating a MAD may overflow when calculating the differences (Step 2) if the type is a signed number: the difference is a non-negative value that can be as large as POSITIVE_MAX - NEGATIVE_MIN, which does not fit in the signed type.
To solve this, some types are upcast as follows:

Int: Stored as longs, simple approach
Long: Stored as longs, but switched to unsigned long representation when calculating the differences
Unsigned long: No effect; the resulting range is the same
Doubles: Nothing. If the values overflow to +/-infinity, they're left that way, as we'll just use those outliers to sort
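
A minimal sketch of the int and long cases, assuming plain Java two's-complement arithmetic. The class and method names are hypothetical, and the real evaluators may encode unsigned longs differently; the point is only that the widest possible difference fits in the next representation up.

```java
class MadWideningSketch {
    /** Int inputs: widen to long first, so even MAX - MIN (2^32 - 1) fits. */
    static long absDiffInt(int value, int median) {
        long v = value;
        long m = median;
        return v > m ? v - m : m - v;
    }

    /**
     * Long inputs: the subtraction may wrap as a signed long, but the resulting
     * bit pattern is the correct difference when read as an unsigned 64-bit value.
     */
    static long absDiffLongAsUnsigned(long value, long median) {
        return value > median ? value - median : median - value;
    }
}
```

With these, `absDiffInt(Integer.MIN_VALUE, Integer.MAX_VALUE)` is 4294967295, and `Long.toUnsignedString(absDiffLongAsUnsigned(Long.MAX_VALUE, Long.MIN_VALUE))` is "18446744073709551615". Differences produced this way have to be ordered with `Long.compareUnsigned` rather than the signed comparison.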

Closes #111590

elasticsearchmachine · 2024-08-21T09:57:57Z

Hi @ivancea, I've created a changelog YAML for you.

ivancea · 2024-08-21T10:40:35Z

...c/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedian.java

    }

-    @MvEvaluator(extraName = "Double", finish = "finish")
+    @MvEvaluator(extraName = "Double", finish = "finish", ascending = "ascending")


This evaluator was missing the "ascending" value, so it wasn't calling the optimized ascending function. Just a minor unrelated """fix"""

It'd be nice if we counted the number of times we went this way in the profile output so we could assert on it. But that's a problem for another day.

ivancea · 2024-08-21T10:52:35Z

...lasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedianAbsoluteDeviation.java

+        double median = count % 2 == 1 ? values.getDouble(middle) : (values.getDouble(middle - 1) + values.getDouble(middle)) / 2;
+        for (int i = 0; i < count; i++) {
+            double value = values.getDouble(firstValue + i);
+            doubles.values[i] = value > median ? value - median : median - value;


This currently overflows, as the abs difference between the median and a big/small number may be higher than MAX_TYPE. (E.g. Median = 10, and some element = MIN_DOUBLE)
A "simple" fix would be to change:

For longs, use BigInteger

For doubles, use BigDecimal

For ints, use longs
But using those for every calculation would introduce a lot of overhead (especially for doubles and longs).

Other options I considered:

Throw and return null. This is not a median, and this may overflow. The cases are very... Edgy, so could be fine. No overflow, no overhead.

Detect overflows on a per-element basis, and only throw if one of those overflowed values has to be used. Quite a lot of logic, and I don't think it's possible to have a MAD higher than MAX_VALUE. Which leads me to the next option:

Replace over/underflows with MAX/MIN values, just to sort them, and later ignore them? If the MAD can't be higher than MAX_VALUE, then this should render a correct sorting and a correct MAD

Dumping my ideas here, so others can check and validate them. The last one feels nice to me, but I still have to do some checks to avoid committing a mathematical crime here

Personally, I'm in favor of collecting the overflowing values and returning null + a warning if and only if we determine that the median absolute deviation really has to be out of range.

Off the top of my head, I cannot come up with an edge case where the MAD itself would be a legitimate overflow - maybe with longs we can construct such cases (because the negative and positive max values are asymmetric), but with doubles it may well be impossible (in theory, not accounting for rounding errors). If that's correct, simply keeping track of the cases where the distance to the median is out of range should be sufficient.

After some thought I'm following a mixed approach:

For ints, just using longs instead

For longs, calculating the differences as unsigned longs. Looks like the more strict and realistic approach to me. After all, differences are, indeed, unsigned

For doubles, I'll just let the differences become infinite. As infinities are sortable, they should work well

Both ints and longs are checked on down-conversion to make sure they don't overflow.
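
To illustrate the doubles case and the down-conversion check, here is a sketch under assumptions: `MixedApproachSketch` and its methods are hypothetical names, and `Math.toIntExact` stands in for whatever check the evaluators actually perform.

```java
import java.util.Arrays;

class MixedApproachSketch {
    /** Down-conversion check: throws ArithmeticException if the long does not fit in an int. */
    static int narrowToInt(long mad) {
        return Math.toIntExact(mad);
    }

    /** Differences may become +Infinity; they still sort after every finite value. */
    static double[] sortedAbsDiffs(double[] values, double median) {
        double[] diffs = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            double v = values[i];
            diffs[i] = v > median ? v - median : median - v;
        }
        Arrays.sort(diffs);
        return diffs;
    }
}
```

For the values -Double.MAX_VALUE, 1e308, 1e308 the median is 1e308; one difference overflows to +Infinity, but it sorts last and the middle difference, the MAD, is 0.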

About null+warnings, no test resulted in an overflow, not even max/min longs/ints, so I'd call it a day. I could add a warnExceptions to catch and nullify them, but as I couldn't reproduce them, I'm not sure it's worth it

Sounds very convincing!

Maybe it's worth asserting that the result is never overflowing, resp. never an infinity; I agree that we shouldn't proactively null+warn if we never expect this to be relevant.

elasticsearchmachine · 2024-08-29T16:09:42Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine · 2024-08-29T16:09:42Z

Pinging @elastic/kibana-esql (ES|QL-ui)

astefan · 2024-09-02T17:38:49Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/mv_median_absolute_deviation.csv-spec

+    int = MV_MEDIAN_ABSOLUTE_DEVIATION(salary_change.int),
+    long = MV_MEDIAN_ABSOLUTE_DEVIATION(salary_change.long),
+    double = MV_MEDIAN_ABSOLUTE_DEVIATION(salary_change)
+| KEEP emp_no, int, long, double


I, as a reviewer, would like to see what values led to the said mv_mad calculation.

Suggested change

| KEEP emp_no, int, long, double

| KEEP emp_no, *int, *long, *double

Added the input field to the keep as a reference

astefan · 2024-09-02T17:39:39Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/mv_median_absolute_deviation.csv-spec

+required_capability: fn_mv_median_absolute_deviation
+
+FROM employees
+| WHERE emp_no <= 10002


This interval has better coverage; it also covers null.

Suggested change

| WHERE emp_no <= 10002

| WHERE emp_no <= 10010

Added the emp_no 10009 (null) and 10007 (odd amount of values). The others are identical, so just those to keep the test simple

astefan · 2024-09-02T17:41:14Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/mv_median_absolute_deviation.csv-spec

+nullsAndFolds
+required_capability: fn_mv_median_absolute_deviation
+
+ROW x = [0, 2, 5, 6], single = 300


astefan · 2024-09-02T17:44:07Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats_percentile.csv-spec

 // end::median-absolute-deviation-result[]
 ;

+medianAbsoluteDeviationFold


Please, either change the name of the file or create a new file with all the m_a_d tests and leave stats_percentile.csv-spec with only percentile related tests.

Or move every mv_m_a_d and m_a_d in the same file.

Created a new file for MAD (Better a file per function I think, but still a lot to migrate)

astefan · 2024-09-02T17:48:30Z

...lasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedianAbsoluteDeviation.java

+
+    @FunctionInfo(
+        returnType = { "double", "integer", "long", "unsigned_long" },
+        description = "Converts a multivalued field into a single valued field containing the median absolute deviation.",


Please, include a reference link to some public documentation defining "median" or the formula you used to define this function (wikipedia or similar).

Added the same explanation as MedianAbsoluteDeviation

astefan · 2024-09-02T17:49:24Z

...lasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedianAbsoluteDeviation.java

+        description = "Converts a multivalued field into a single valued field containing the median absolute deviation.",
+        note = "If the field has an even number of values, "
+            + "the medians will be calculated as the average of the middle two values. "
+            + "If the column is not floating point, the averages round towards 0.",


"round towards 0" or "rounds to 0"?

Suggested change

+ "If the column is not floating point, the averages round towards 0.",

+ "If the value is not a floating point number, the averages round towards 0.",

the averages are rounded towards 0 maybe?
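
As an illustration of what "round towards 0" means for integer averages, here is a sketch of an overflow-safe average of two longs. It is an assumption-laden example, not the `avgWithoutOverflow` helper from this PR; the name `averageTowardsZero` is hypothetical.

```java
class IntegerAverageSketch {
    /**
     * Average of two longs without computing a + b, rounded towards zero
     * like Java's integer division (e.g. 3.5 -> 3 and -2.5 -> -2).
     */
    static long averageTowardsZero(long a, long b) {
        // (a & b) + ((a ^ b) >> 1) is floor((a + b) / 2) and cannot overflow.
        long floor = (a & b) + ((a ^ b) >> 1);
        // Floor rounds negative halves away from zero; correct that when the sum is odd.
        if (floor < 0 && ((a ^ b) & 1) != 0) {
            floor++;
        }
        return floor;
    }
}
```

For instance, `averageTowardsZero(2, 5)` is 3 and `averageTowardsZero(-3, -2)` is -2, while `averageTowardsZero(Long.MAX_VALUE, Long.MAX_VALUE)` stays at `Long.MAX_VALUE` instead of wrapping.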

astefan · 2024-09-02T17:54:04Z

...lasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedianAbsoluteDeviation.java

+
+    @Override
+    protected TypeResolution resolveFieldType() {
+        return isType(field(), t -> t.isNumeric() && isRepresentable(t), sourceText(), null, "numeric");


I noticed that median_absolute_deviation doesn't accept unsigned_long while its mv_ sister does. Is this an oversight, or is there a reason?

mv_median allows unsigned_longs, but MedianAbsDev uses QuantileStates. As long as QuantileStates can work with unsigned longs, it should be possible, but I don't know

astefan

With the small things I mentioned yesterday in my review, this PR LGTM!

nik9000

I left a few small things, but LGTM.

nik9000 · 2024-09-05T13:51:42Z

...c/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedian.java

    }

-    @MvEvaluator(extraName = "Double", finish = "finish")
+    @MvEvaluator(extraName = "Double", finish = "finish", ascending = "ascending")


nik9000 · 2024-09-05T13:52:19Z

...c/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedian.java

    }

-    @MvEvaluator(extraName = "Double", finish = "finish")
+    @MvEvaluator(extraName = "Double", finish = "finish", ascending = "ascending")


It'd be nice if we counted the number of times we went this way in the profile output so we could assert on it. But that's a problem for another day.

nik9000 · 2024-09-05T13:53:15Z

...lasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedianAbsoluteDeviation.java

+        return switch (PlannerUtils.toElementType(field().dataType())) {
+            case DOUBLE -> new MvMedianAbsoluteDeviationDoubleEvaluator.Factory(fieldEval);
+            case INT -> new MvMedianAbsoluteDeviationIntEvaluator.Factory(fieldEval);
+            case LONG -> field().dataType() == DataType.UNSIGNED_LONG


We could use switch (field.dataType()) now that it's an enum.

An IntelliJ inspection thinks the opposite! (As it's just an if/else)
As most of us use IntelliJ (I think?), and the other functions also do this, I'd keep it this way.

nik9000 · 2024-09-05T13:54:19Z

...lasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedianAbsoluteDeviation.java

+        longs.values[longs.count++] = v;
+    }
+
+    static int finishInts(Longs longs) {


Worth javadoc just because it says Ints in the method name and takes Longs.

nik9000 · 2024-09-05T13:57:53Z

...lasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedianAbsoluteDeviation.java

+        Arrays.sort(values, 0, count);
+        int middle = count / 2;
+        return count % 2 == 1 ? values[middle] : avgWithoutOverflow(values[middle - 1], values[middle]);
+    }


Maybe make it a function on Longs. I dunno if that's better.

In ascending cases, the Longs/Doubles object doesn't have data, and we artificially fill it without increasing the count (As we would have to reset it at the end again).

Anyway, after writing this, I started to think that it's indeed simpler to just use the full object instead of "juggling" values... So I just changed it for simplicity

nik9000 · 2024-09-05T14:00:13Z

...lasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedianAbsoluteDeviation.java

+        for (int i = 0; i < count; i++) {
+            double value = values.getDouble(firstValue + i);
+            // Double differences between median and the values may potentially result in +/-Infinity.
+            // As we use that value just to sort, the MAD should remain finite.


We sort, but if the median is Infinity we could fail, right?

The first median shouldn't be infinite, as all values used to calculate it are finite.

For the second median, the median of differences: if it can ever be non-finite, we still throw in the doubleMedianOf() function used later. We do that check there for all types

PS: While doing a change for another comment, I saw that I forgot the ascending="ascending" for doubles. And it had an outdated avg calculation (a + b) / 2 instead of a / 2 + b / 2. Just fixed it
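
The difference between the two average formulas mentioned above can be seen in a small sketch (class and method names are hypothetical):

```java
class DoubleAverageSketch {
    /** (a + b) / 2: the intermediate sum can overflow to +/-Infinity. */
    static double naiveAverage(double a, double b) {
        return (a + b) / 2;
    }

    /** a / 2 + b / 2: each half is finite, so the result stays finite for finite inputs. */
    static double halvedAverage(double a, double b) {
        return a / 2 + b / 2;
    }
}
```

With both arguments set to `Double.MAX_VALUE`, `naiveAverage` returns Infinity while `halvedAverage` returns `Double.MAX_VALUE`. The trade-off is that halving first can lose the lowest bit for subnormal inputs.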

nik9000 · 2024-09-05T14:03:01Z

...lasticsearch/xpack/esql/expression/function/scalar/multivalue/MvMedianAbsoluteDeviation.java

+        for (int i = 0; i < longs.count; i++) {
+            long value = longs.values[i];
+            // We know they were ints, so we can calculate differences within a long
+            longs.values[i] = value > median ? Math.subtractExact(value, median) : Math.subtractExact(median, value);


I'd probably use - in that case.

This is the same code as the last half of ascending - could we pull it into a single function? I think it's safe enough from a performance standpoint.

Changed to -.

About the repeated code, one uses longs.values[i], and the other values.getInt(firstValue + i). So I couldn't abstract this much more without having "valueSuppliers" or something like that, which is a bit too much for this, and could affect performance

👍 on the repeated bit. Could you add a comment so I don't make the same mistake again? I have copy-and-paste blindness. If some code looks like it's copy and pasted I often will never spot the subtle difference.

ivancea added 7 commits August 19, 2024 15:56

Use ascending on MvMedian doubles

f36e610

Initial function and tests

a1b958d

Fixed unsigned long ascending

ad31847

Merge branch 'main' into mv-median-absolute-deviation-function

0c1d49a

Added overflow test cases, and double overflow fixes

adfab05

Added overflow checks and extra tests for them

1c4c381

Minor refactor

990ef6c

ivancea added the >feature, Team:Analytics, :Analytics/ES|QL and ES|QL-ui labels Aug 21, 2024

elasticsearchmachine added the v8.16.0 label Aug 21, 2024

Update docs/changelog/112055.yaml

c5cbd0c

Format

25f05cc

ivancea commented Aug 21, 2024

ivancea added 10 commits August 28, 2024 12:07

Merge branch 'main' into mv-median-absolute-deviation-function

a4ce26f

Merge branch 'main' into mv-median-absolute-deviation-function

8c57086

Fix overflows by using the next bigger number type for MAD (in -> long, long -> unsigned long)

557c62f

Ensure exact conversions in longs

60b698b

Avoid using BigIntegers for unsigned longs

880f310

Added function to registry, and updated failing tests

0da4f14

Added CSV tests and docs

5254fee

Added csv tests for all types and using an index

f4c312c

Assert doubles median is finite

8096654

Surrogate in aggregation

8009502

ivancea requested review from alex-spies, astefan and nik9000 August 29, 2024 16:08

ivancea marked this pull request as ready for review August 29, 2024 16:08

astefan mentioned this pull request Aug 30, 2024

ES|QL: Improve aggregation over constants handling #112392

Open

astefan reviewed Sep 2, 2024

astefan approved these changes Sep 3, 2024

ivancea added 4 commits September 3, 2024 11:16

Merge branch 'main' into mv-median-absolute-deviation-function

348eb9b

Extra cases for FROM test

0ca55ec

Improved function documentation

47965c1

Moved MAD agg csv tests to its own file and updated meta tests

a07c58f

ivancea requested a review from a team September 3, 2024 13:20

elastic deleted a comment from jlfernandezfernandez Sep 3, 2024

nik9000 approved these changes Sep 5, 2024

ivancea added 4 commits September 6, 2024 10:58

Merge branch 'main' into mv-median-absolute-deviation-function

f123a2a

Added missing ascending case for doubles, fixed doubles infinites bug, and exceptions not resetting count

09b1f31

Simplified ints calculation by removing long overflow safeguards

f11a707

Add required capability to agg test

a0a61ca

nik9000 approved these changes Sep 6, 2024

Improved docs on ascending functions

42377ce

ivancea merged commit fc2760c into elastic:main Sep 9, 2024

ivancea deleted the mv-median-absolute-deviation-function branch September 9, 2024 08:04

luigidellaquila mentioned this pull request Sep 9, 2024

ES|QL: Fix function metadata tests #112662

Merged

elasticsearchmachine pushed a commit that referenced this pull request Sep 9, 2024

ES|QL: Fix function metadata tests (#112662)

903b6dd

Just a race condition while merging two PRs (#112055 and #112350). Fixes #112659 Fixes #112660 Fixes #112661
