Skip to content

ESQL: mv_median_absolute_deviation function#112055

Merged
ivancea merged 28 commits intoelastic:mainfrom
ivancea:mv-median-absolute-deviation-function
Sep 9, 2024
Merged

ESQL: mv_median_absolute_deviation function#112055
ivancea merged 28 commits intoelastic:mainfrom
ivancea:mv-median-absolute-deviation-function

Conversation

@ivancea
Copy link
Contributor

@ivancea ivancea commented Aug 21, 2024

  • Added mv_median_absolute_deviation function
  • Added possibility of having a fixed param in Multivalue "ascending" functions
  • Add surrogate to MedianAbsoluteDeviation

Calculations used to avoid overflows

First, a quick recap of how the MAD is calculated:

  1. Sort values, and get the median
  2. Calculate the difference between each value with the median (abs(median - value))
  3. Sort the differences, and get their median

Calculating a MAD may overflow when calculating the differences (Step 2), given the type is a signed number, as the difference is a positive value, with potentially the same value as POSITIVE_MAX - NEGATIVE_MIN.
To solve this, some types are up-casted as follow:

  • Int: Stored as longs, simple approach
  • Long: Stored as longs, but switched to unsigned long representation when calculating the differences
  • Unsigned long: No effect; the resulting range is the same
  • Doubles: Nothing. If the values overflow to +/-infinity, they're left that way, as we'll just use those outliers to sort

Closes #111590

@ivancea ivancea added >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL ES|QL-ui Impacts ES|QL UI labels Aug 21, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @ivancea, I've created a changelog YAML for you.

}

@MvEvaluator(extraName = "Double", finish = "finish")
@MvEvaluator(extraName = "Double", finish = "finish", ascending = "ascending")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This evaluator was missing the "ascending" value, so it wasn't calling the optimized ascending function. Just a minor unrelated """fix"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice if we counted the number of times we went this way in the profile output so we could assert on it. But that's a problem for another dya.

double median = count % 2 == 1 ? values.getDouble(middle) : (values.getDouble(middle - 1) + values.getDouble(middle)) / 2;
for (int i = 0; i < count; i++) {
double value = values.getDouble(firstValue + i);
doubles.values[i] = value > median ? value - median : median - value;
Copy link
Contributor Author

@ivancea ivancea Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently overflows, as the abs difference between the median and a big/small number may be higher than MAX_TYPE. (E.g. Median = 10, and some element = MIN_DOUBLE)
A "simple" fix would be to change:

  • For longs, use BigInteger
  • For doubles, use BigDecimal
  • For ints, use longs
    But using those for every calculation would introduce a lot of overhead (Specially for doubles and longs).

Other options I considered:

  • Throw and return null. This is not a median, and this may overflow. The cases are very... Edgy, so could be fine. No overflow, no overhead.
  • Detect overflows in a per-element basis, and only throw if one of those overflows has to be used. Quite the logic, and I don't think it's possible to have a MAD higher than MAX_VALUE. Which leads me to the next option:
  • Replace over/underflows with MAX/MIN values, just to sort them, and later ignore them? If the MAD can't be higher than MAX_VALUE, then this should render a correct sorting and a correct MAD

Dumping my ideas here, so others can check and validate them. The last one feels nice to me, but I still have to do some checks to avoid committing a mathematical crime here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'm in favor of collecting the overflowing values and return null + a warning if and only if we determine that the median absolute deviation has to be properly out of range.

From the top of my head, I cannot determine an edge case where the MAD itself would be a legitimate overflow - maybe with longs we can construct such cases (because the negative and positive max values are asymmetric), but with doubles it may be properly impossible (in theory, not accounting for rounding errors). If that's correct, trying to simply keep track of the cases when the distance to the median is out of range should be sufficient.

Copy link
Contributor Author

@ivancea ivancea Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some thought I'm following a mixed approach:

  • For ints, just using longs instead
  • For longs, calculating the differences as unsigned longs. Looks like the more strict and realistic approach to me. After all, differences are, indeed, unsigned
  • For doubles, I'll just let it infinite. As Infinities are sortable, they should work well

Both ints and longs are checked on down-conversion to make sure they don't overflow.

About null+warnings, no test resulted in an overflow, not even max/min longs/ints, so I'd call it a day. I could add a warnExceptions to catch and nullify them, but as I couldn't reproduce them, I'm not sure it's worth it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds very convincing!

Maybe it's worth asserting that the result is never overflowing, resp. never an infinity; I agree that we shouldn't proactively null+warn if we never expect this to be relevant.

@ivancea ivancea marked this pull request as ready for review August 29, 2024 16:08
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/kibana-esql (ES|QL-ui)

int = MV_MEDIAN_ABSOLUTE_DEVIATION(salary_change.int),
long = MV_MEDIAN_ABSOLUTE_DEVIATION(salary_change.long),
double = MV_MEDIAN_ABSOLUTE_DEVIATION(salary_change)
| KEEP emp_no, int, long, double
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I, as a reviewer, would like to see what values led to the said mv_mad calculus.

Suggested change
| KEEP emp_no, int, long, double
| KEEP emp_no, *int, *long, *double

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the input field to the keep as a reference

required_capability: fn_mv_median_absolute_deviation

FROM employees
| WHERE emp_no <= 10002
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interval has a better coverage, it's also covering null.

Suggested change
| WHERE emp_no <= 10002
| WHERE emp_no <= 10010

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the emp_no 10009 (null) and 10007 (odd amount of values). The others are identical, so just those to keep the test simple

nullsAndFolds
required_capability: fn_mv_median_absolute_deviation

ROW x = [0, 2, 5, 6], single = 300
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// end::median-absolute-deviation-result[]
;

medianAbsoluteDeviationFold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, either change the name of the file or create a new file with all the m_a_d tests and leave stats_percentile.csv-spec with only percentile related tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or move every mv_m_a_d and m_a_d in the same file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created a new file for MAD (Better a file per function I think, but still a lot to migrate)


@FunctionInfo(
returnType = { "double", "integer", "long", "unsigned_long" },
description = "Converts a multivalued field into a single valued field containing the median absolute deviation.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, include a reference link to some public documentation defining "median" or the formula you used to define this function (wikipedia or similar).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the same explanation as MedianAbsoluteDeviation

description = "Converts a multivalued field into a single valued field containing the median absolute deviation.",
note = "If the field has an even number of values, "
+ "the medians will be calculated as the average of the middle two values. "
+ "If the column is not floating point, the averages round towards 0.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"round towards 0" or "rounds to 0"?

Suggested change
+ "If the column is not floating point, the averages round towards 0.",
+ "If the value is not a floating point number, the averages round towards 0.",

Copy link
Contributor Author

@ivancea ivancea Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the averages are rounded towards 0 maybe?


@Override
protected TypeResolution resolveFieldType() {
return isType(field(), t -> t.isNumeric() && isRepresentable(t), sourceText(), null, "numeric");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that median_absolute_deviation doesn't accept unsigned_long while its mv_ sister does. Is this a miss or has a reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mv_median allows unsigned_longs, but MedianAbsDev uses QuantileStates. As long as QuantileStates can work with unsigned longs, it should be possible, but I don't know

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the small things I mentioned yesterday in my review, this PR LGTM!

@ivancea ivancea requested a review from a team September 3, 2024 13:20
@elastic elastic deleted a comment from jlfernandezfernandez Sep 3, 2024
Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few small things, but LGTM.

}

@MvEvaluator(extraName = "Double", finish = "finish")
@MvEvaluator(extraName = "Double", finish = "finish", ascending = "ascending")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

@MvEvaluator(extraName = "Double", finish = "finish")
@MvEvaluator(extraName = "Double", finish = "finish", ascending = "ascending")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice if we counted the number of times we went this way in the profile output so we could assert on it. But that's a problem for another dya.

return switch (PlannerUtils.toElementType(field().dataType())) {
case DOUBLE -> new MvMedianAbsoluteDeviationDoubleEvaluator.Factory(fieldEval);
case INT -> new MvMedianAbsoluteDeviationIntEvaluator.Factory(fieldEval);
case LONG -> field().dataType() == DataType.UNSIGNED_LONG
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use switch (field.dataType()) now that it's an enum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An IntelliJ inspection thinks the opposite! (As it's just an if/else)
As most of use use IntelliJ (I think?), and the other functions also do this, I'd keep it this way.

longs.values[longs.count++] = v;
}

static int finishInts(Longs longs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth javadoc just because it says Ints in the method name and takes Longs.

Arrays.sort(values, 0, count);
int middle = count / 2;
return count % 2 == 1 ? values[middle] : avgWithoutOverflow(values[middle - 1], values[middle]);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make it a function on Longs. I dunno if that's better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ascending cases, the Longs/Doubles object doesn't have data, and we artificially fill it without increasing the count (As we would have to reset it at the end again).

Anyway, after writing this, I started to think that it's indeed simpler to just use the full object instead of being "juggling" with values... So just changed it for simplicity

for (int i = 0; i < count; i++) {
double value = values.getDouble(firstValue + i);
// Double differences between median and the values may potentially result in +/-Infinity.
// As we use that value just to sort, the MAD should remain finite.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We sort, but if the median is Infinity we could fail, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first median shouldn't be infinite, as all values used to calculate it are finite.

For the second median, the median of differences, I believe if it can happen that the median is non-finite, we still do a throw in the doubleMedianOf() function used later. For all types, we do that check there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: While doing a change for another comment, I saw that I forgot the ascending="ascending" for doubles. And it had an outdated avg calculation (a + b) / 2 instead of a / 2 + b / 2. Just fixed it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

for (int i = 0; i < longs.count; i++) {
long value = longs.values[i];
// We know they were ints, so we can calculate differences within a long
longs.values[i] = value > median ? Math.subtractExact(value, median) : Math.subtractExact(median, value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably use - in that case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same code as the last half of ascending - could we pull it into a single function? I think it's safe enough from a performance standpoint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to -.

About the repeated code, one uses longs.values[i], and the other values.getInt(firstValue + i). So I couldn't abstract this much more without having "valueSuppliers" or something like that, which is a bit too much for this, and could affect performance

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on the repeated bit. Could you add a comment so I don't make the same mistake again? I have copy-and-paste blindness. If some code looks like it's copy and pasted I often will never spot the subtle difference.

@ivancea ivancea merged commit fc2760c into elastic:main Sep 9, 2024
@ivancea ivancea deleted the mv-median-absolute-deviation-function branch September 9, 2024 08:04
elasticsearchmachine pushed a commit that referenced this pull request Sep 9, 2024
Just a race condition while merging two PRs (#112055 and #112350). 

Fixes #112659 Fixes #112660 Fixes #112661
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL ES|QL-ui Impacts ES|QL UI >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.16.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ESQL: implement mv_median_absolute_deviation function

5 participants