[verifier] Add option to validate array(float) via error margin.#22143
[verifier] Add option to validate array(float) via error margin.#22143spershin merged 1 commit intoprestodb:masterfrom
Conversation
|
Submitted PR a bit too early - the validation part has not been implemented yet. But the PR is good to see where it is all going. |
9e3445e to
352c585
Compare
352c585 to
efb5bdb
Compare
| public ArrayColumnChecksum(@Nullable Object checksum, @Nullable Object cardinalityChecksum, long cardinalitySum) | ||
| public ArrayColumnChecksum( | ||
| @Nullable Object checksum, | ||
| @Nullable Object sum, |
There was a problem hiding this comment.
Why is this an Object instead of a more specific type? For that matter why is checksum an Object? I assume there's a reason, but this is surprising me.
There was a problem hiding this comment.
@elharo
I'm not sue, I was following the existing code from FloatingPointColumnValidator.
As I can see classes have long for counts and Object for everything else, why - not sure. :)
There was a problem hiding this comment.
Let's watch out for Chesterton's Fence. Before proceeding we should figure out why checksum is an Object. Presumably there's a reason but maybe that reason applies to the new cardinality checksum and maybe it doesn't. It's also possible checksum shouldn't have been an Object in the first place, in which case we don't have to fix this now, but let's not repeat the mistake. Any idea who could shed some light on this?
There was a problem hiding this comment.
I can't see a good reason why sum is an Object in FloatingPointColumnChecksum. the only places it's used, it's blindly cast to a double https://github.com/prestodb/presto/blob/master/presto-verifier/src/main/java/com/facebook/presto/verifier/checksum/FloatingPointColumnValidator.java#L145-L146. For arrays, the sum is not necessarily a double, and I think it would be better not to assume it's a double here. (i mean you've only implemented it for double type, but that's more incidental. You should be able to do sum validation on arrays of other numeric types)
Regarding why Objects are used for checksums throughout, I'm not sure. I assume it was most convenient, but I also suspect the code could be improved to use more specific types. In case it's helpful for gleaning more, the object usage seems to have been introduced here 04d45ea and here: 7c48103 if that helps at all.
There was a problem hiding this comment.
Some additional thoughts - since we could imagine wanting to add special validation for other types. It might be better to make this argument a ColumnChecksum object and pass in a FloatingPointColumnChecksum instead of adding all the fields from FloatingPointColumnChecksum directly here. That way later we can add validation for other kinds of arrays, and just pass in the appropriate columnchecksum object and do the corresponding validation.
It might be interesting even to flip the logic a bit so ColumnChecksum would have a validate method that implementations override to call the validation appropriate for the type (something like columnChecksum.validate(Checksum other, Column column). That way you don't even need to worry about what type it might be here. you would just call validate on whatever ColumnChecksum you end up with
It's actually weird that currently validation is done by object equality, especially considering that for float checksums that's definitely not what we want. (From a quick read through, I'm pretty sure that FloatingPointColumnChecksum at least is not used in object equality checks, but it's not at all clear and definitely not enforced) I would be very in favor of moving away from this essentially untyped everything is just an Object land.
Finally, I think we have the same problem you're solving here for map types, where it does checksums on the keys and values.
There was a problem hiding this comment.
Let's watch out for Chesterton's Fence.
If we follow that rule you mentioned - we should not deviate of what already is in there. I suggest approach of 'as is' and then follow up if we decide to change this. I would like this PR to stay focused on its task, if possible.
Some additional thoughts - since we could imagine wanting to add special validation for other types.
Yes, we would likely go that way. Maps of doubles is the next candidate, especially for values part. I was thinking about it, however wanted to stay focused on just arrays now to unblock NGA use case quickly. We can adjust few things to fit Maps more as we go and introduce more unconventional support.
Finally, I think we have the same problem you're solving here for map types, where it does checksums on the keys and values.
Not only that, but the structural types. like rows, but not sure if we want to really go there. Let's fo one step at a time? Harder to break existing stuff that way.
There was a problem hiding this comment.
@elharo @rschlussel
I believe objects are being used, because the aggregation can return NULL when there are no rows.
I'm currently debugging exactly such case.
To handle these nulls we use Object rather than a direct type.
There was a problem hiding this comment.
Interesting. but then why not use a Double type (which can be null since it's an object) rather than a double primitive type?
There was a problem hiding this comment.
Good question.
We didn't think about in the beginning?
presto-verifier/src/main/java/com/facebook/presto/verifier/checksum/ArrayColumnValidator.java
Outdated
Show resolved
Hide resolved
| public ArrayColumnChecksum(@Nullable Object checksum, @Nullable Object cardinalityChecksum, long cardinalitySum) | ||
| public ArrayColumnChecksum( | ||
| @Nullable Object checksum, | ||
| @Nullable Object sum, |
There was a problem hiding this comment.
I can't see a good reason why sum is an Object in FloatingPointColumnChecksum. the only places it's used, it's blindly cast to a double https://github.com/prestodb/presto/blob/master/presto-verifier/src/main/java/com/facebook/presto/verifier/checksum/FloatingPointColumnValidator.java#L145-L146. For arrays, the sum is not necessarily a double, and I think it would be better not to assume it's a double here. (i mean you've only implemented it for double type, but that's more incidental. You should be able to do sum validation on arrays of other numeric types)
Regarding why Objects are used for checksums throughout, I'm not sure. I assume it was most convenient, but I also suspect the code could be improved to use more specific types. In case it's helpful for gleaning more, the object usage seems to have been introduced here 04d45ea and here: 7c48103 if that helps at all.
efb5bdb to
1484bca
Compare
|
The latest update includes:
|
1484bca to
acb8c44
Compare
|
Fixed Checkstyle errors. |
acb8c44 to
5892fa1
Compare
|
More Checkstyle errors... |
5892fa1 to
b6b090a
Compare
|
Aaaand Checkstyle again... |
b6b090a to
b91473d
Compare
|
Fixed the checksum query. |
5b90982 to
bf4a7b1
Compare
kewang1024
left a comment
There was a problem hiding this comment.
Thanks @spershin for helping improve the verification of array(double)
presto-verifier/src/main/java/com/facebook/presto/verifier/framework/VerifierConfig.java
Show resolved
Hide resolved
presto-verifier/src/main/java/com/facebook/presto/verifier/checksum/ArrayColumnValidator.java
Outdated
Show resolved
Hide resolved
| boolean useFloatingPointPath = useFloatingPointPath(column); | ||
|
|
||
| ArrayColumnChecksum controlChecksum = toColumnChecksum(column, controlResult, useFloatingPointPath); | ||
| ArrayColumnChecksum testChecksum = toColumnChecksum(column, testResult, useFloatingPointPath); | ||
|
|
||
| // Not floating point case (we have '$checksum' column). | ||
| if (!useFloatingPointPath) { |
There was a problem hiding this comment.
We don't need useFloatingPointPath here, we can infer this information from ArrayColumnChecksum structure
There was a problem hiding this comment.
if ArrayColumnChecksum 's checksum is not null, then it's regular ArrayColumnChecksum
If it is null, then it's the float point check
There was a problem hiding this comment.
if
ArrayColumnChecksum's checksum is not null, then it's regular ArrayColumnChecksum If it is null, then it's the float point check
That was my 1st choice, but it all breaks on the corner case when we get checksums on empty tables (no rows) and then a bunch of checksum columns have nulls.
I realized after that if we are triggering floating point path checking the array element type - we should be consitent and always check using that approach. :)
presto-verifier/src/main/java/com/facebook/presto/verifier/checksum/ChecksumResult.java
Outdated
Show resolved
Hide resolved
| } | ||
| } | ||
|
|
||
| public static ValidateResult validate( |
There was a problem hiding this comment.
Although changed to static function, I think it might make more sense to still put this function in FloatingPointColumnValidator.java
FloatingPointColumnChecksum is only the data structure, FloatingPointColumnValidator will handle the validation logic
There was a problem hiding this comment.
Initially I kept it in the FloatingPointColumnValidator, but then I thought that this logic is better to be incapsulated in the structure that keeps the data. It already implements equals(), toString().
Make sense to put validate() here too, imho.
Let me know if you really prefer it back to the validator class.
There was a problem hiding this comment.
I totally get why we move it to the data structure, but might still prefer it in the validator class
There was a problem hiding this comment.
I think either all validation in should be in the respective checksum classes or none of them(also, might fit better in the validate class if it were a non-static method like equals e.g. floatingPointChecksum.validate(FloatingPointChecksum other)). As things stand it seems a bit out of place, and I agree with @kewang1024 that it would be better in the validate class.
bf4a7b1 to
620920d
Compare
|
The last update:
|
|
@kewang1024 , @rschlussel |
kewang1024
left a comment
There was a problem hiding this comment.
Thanks again! Overall looks good, some NIT
| } | ||
| } | ||
|
|
||
| public static ValidateResult validate( |
There was a problem hiding this comment.
I totally get why we move it to the data structure, but might still prefer it in the validator class
| } | ||
|
|
||
| // Class being used to return a result from the validate() method. | ||
| public static class ValidateResult |
There was a problem hiding this comment.
Can we move this class to the validator as well?
| return new ArrayColumnChecksum(); | ||
| } | ||
|
|
||
| long rowCount = checksumResult.getRowCount(); |
There was a problem hiding this comment.
we can remove this line
| private final FloatingPointColumnChecksum floatingPointChecksum; | ||
|
|
||
| public ArrayColumnChecksum(@Nullable Object checksum, @Nullable Object cardinalityChecksum, long cardinalitySum) | ||
| // Constructor for array(non- floating point) types. |
| private final Object cardinalityChecksum; | ||
| private final long cardinalitySum; | ||
| // For array(floating point) we have extra aggregations collected. | ||
| private final FloatingPointColumnChecksum floatingPointChecksum; |
There was a problem hiding this comment.
Let's make it Optional
| this.checksum = checksum; | ||
| this.cardinalityChecksum = cardinalityChecksum; | ||
| this.cardinalitySum = cardinalitySum; | ||
| this.floatingPointChecksum = new FloatingPointColumnChecksum(); |
There was a problem hiding this comment.
this.floatingPointChecksum = Optional.empty()
| this.checksum = null; | ||
| this.cardinalityChecksum = null; | ||
| this.cardinalitySum = 0; | ||
| this.floatingPointChecksum = new FloatingPointColumnChecksum(); |
There was a problem hiding this comment.
this.floatingPointChecksum = Optional.empty()
| this.rowCount = rowCount; | ||
| } | ||
|
|
||
| // Constructor to handle cases when the result table is empty (sum aggregations returns null). |
There was a problem hiding this comment.
We can remove this one
There was a problem hiding this comment.
Actually, I would prefer to leave it.
We use it, check the updated changes.
| ArrayColumnChecksum testChecksum = toColumnChecksum(column, testResult, useFloatingPointPath); | ||
|
|
||
| return ImmutableList.of(new ColumnMatchResult<>(Objects.equals(controlChecksum, testChecksum), column, controlChecksum, testChecksum)); | ||
| // Not floating point case (we have '$checksum' column). |
There was a problem hiding this comment.
Non-floating point to be consistent?
presto-verifier/src/main/java/com/facebook/presto/verifier/checksum/ArrayColumnChecksum.java
Show resolved
Hide resolved
| } | ||
| } | ||
|
|
||
| public static FloatingPointColumnChecksum.ValidateResult validateFloatingPoint( |
There was a problem hiding this comment.
it's weird to me that this logic is over here, but the rest of the validation logic is in ArrayColumnValidator. I would do one or the other.
There was a problem hiding this comment.
Moved validate() logic to Validate classes.
| // Check the non floating point members first. | ||
| if (!Objects.equals(controlChecksum.getCardinalityChecksum(), testChecksum.getCardinalityChecksum()) || | ||
| !Objects.equals(controlChecksum.getCardinalitySum(), testChecksum.getCardinalitySum())) { | ||
| return new FloatingPointColumnChecksum.ValidateResult(false, Optional.of("cardinality mismatch")); |
There was a problem hiding this comment.
Do we need this new class? Can we use ColumnMatchResult instead?
There was a problem hiding this comment.
ColumnMatchResult is a templated class and we must return different instantiations of it from ArrayColumnChecksum and FloatingPointColumnChecksum.
I decide to have that small class to carry actual validation result, except all the columns info, which will be provided by concrete validators.
There was a problem hiding this comment.
The ValidateResult still feel not needed, I don't see particular bad effect of using ColumnMatchResult
There was a problem hiding this comment.
Ok, I guess, we can get rid of it.
| { | ||
| private final double relativeErrorMargin; | ||
| private final double absoluteErrorMargin; | ||
| private final boolean arrayFloatingPointErrorMargin; |
There was a problem hiding this comment.
I find this name unclear. Maybe something like "useErrorMarginForFloatArrays"?
There was a problem hiding this comment.
Replaced by useErrorMarginForFloatingPointArrays and "use-error-margin-for-floating-point-arrays".
| } | ||
| } | ||
|
|
||
| public static ValidateResult validate( |
There was a problem hiding this comment.
I think either all validation in should be in the respective checksum classes or none of them(also, might fit better in the validate class if it were a non-static method like equals e.g. floatingPointChecksum.validate(FloatingPointChecksum other)). As things stand it seems a bit out of place, and I agree with @kewang1024 that it would be better in the validate class.
| @Inject | ||
| public ArrayColumnValidator(VerifierConfig config) | ||
| { | ||
| this.relativeErrorMargin = config.getRelativeErrorMargin(); |
There was a problem hiding this comment.
I wonder if we should inject FloatingPointColumnValidator instead, and then can just call floatingPointColumnValidator.validate() and leave these configs as implementation details of the floatingpointcolumnvalidator.
There was a problem hiding this comment.
@rschlussel
Do you mean make FloatingPointColumnValidator a member of ArrayColumnValidator and pass the former to the constructor of the latter?
We can do this.
didn't mean to approve. clicked the wrong button
620920d to
351c344
Compare
|
Thanks for the review. |
d732c5f to
039b69e
Compare
| return column.getName() + "$checksum"; | ||
| } | ||
|
|
||
| private static String getSumColumnAlias(Column column) |
|
|
||
| return new ArrayColumnChecksum( | ||
| checksumResult.getChecksum(getChecksumColumnAlias(column)), | ||
| checksumResult.getChecksum(getSumColumnAlias(column)), |
There was a problem hiding this comment.
we can directly do FloatingPointColumnValidator.getSumColumnAlias(column)
| // Check the non floating point members first. | ||
| if (!Objects.equals(controlChecksum.getCardinalityChecksum(), testChecksum.getCardinalityChecksum()) || | ||
| !Objects.equals(controlChecksum.getCardinalitySum(), testChecksum.getCardinalitySum())) { | ||
| return new FloatingPointColumnChecksum.ValidateResult(false, Optional.of("cardinality mismatch")); |
There was a problem hiding this comment.
The ValidateResult still feel not needed, I don't see particular bad effect of using ColumnMatchResult
presto-verifier/src/main/java/com/facebook/presto/verifier/checksum/ArrayColumnChecksum.java
Show resolved
Hide resolved
44d1c91 to
b8c6c4b
Compare
|
Addressed ArrayColumnChecksum ctor comments. |
rschlussel
left a comment
There was a problem hiding this comment.
Thanks for the update. A few comments, but generally looks good!
presto-verifier/src/main/java/com/facebook/presto/verifier/checksum/ArrayColumnValidator.java
Outdated
Show resolved
Hide resolved
| { | ||
| ArrayColumnChecksum controlChecksum = toColumnChecksum(column, controlResult); | ||
| ArrayColumnChecksum testChecksum = toColumnChecksum(column, testResult); | ||
| checkArgument( |
There was a problem hiding this comment.
Just want to confirm there's some other part of verifier that checks that row counts match before getting here, right? and so we have this argument check to sanity check the code, but we shouldn't hit it.
There was a problem hiding this comment.
I'm not sure about it, honestly - repeating what FloatingPointColumnValidator is doing.
There was a problem hiding this comment.
Yes, it looks like a sanity check to me.
I think we can remove it from ArrayColumnValidator and move it in FloatingPointColumnValidator from overridden validate() to the public validate(), which is called from the ArrayColumnValidator.
Actually, cannot do that. It needs ChecksumResult and in the public validate() we only have ColumnChecksum.
Got to leave it as is.
| @@ -42,10 +44,10 @@ public TestStructuredColumnMismatchResolver() | |||
| @Test | |||
| public void testResolveArray() | |||
There was a problem hiding this comment.
Can you update the StructuredColumnsMismatchResolver and then add a test that we don't resolve column mismatches for float arrays if we're using error margin validation.
There was a problem hiding this comment.
Can I get a bit more context on that?
What does StructuredColumnMismatchResolver do and why it should not resolve column mismatches for float arrays if we're using error margin validation?
Thanks!
| } | ||
|
|
||
| @Test | ||
| public void testFloatingPointArray() |
There was a problem hiding this comment.
can you add a test for the empty table case too?
There was a problem hiding this comment.
Yes, I was going to, but hit that snag with nulls in the Map.
I can see if I can use that HashMap to generate the nulls.
| public String toString() | ||
| { | ||
| return format("checksum: %s, cardinality_checksum: %s, cardinality_sum: %s", checksum, cardinalityChecksum, cardinalitySum); | ||
| if (!floatingPointChecksum.isPresent()) { |
There was a problem hiding this comment.
nit: instead of repeating everything for constructing the string, you can use MoreObject.toStringHelper to construct a nicely printed string with all the fields. Something like:
MoreObjects.ToStringHelper toStringHelper = MoreObjects.toStringHelper(this)
.add("cardinality_checksum", cardinalityChecksum)
.add("cardinality_sum", cardinalitySum)
if(floatingPointChecksum.isPresent() {
toStringHelper.add("floating_point_checksum", floatingPointchecksum.get());
}
else {
toStringHelper.add("checksum", checksum);
}
return toStringHelper.toString();
There was a problem hiding this comment.
not sure if it's important that this toString matches the toStrings of the other checksums, so I'll add the caveat that the generated string will look slightly different than what you have, see the docs for example output https://guava.dev/releases/19.0/api/docs/com/google/common/base/MoreObjects.html#toStringHelper(java.lang.Object)
There was a problem hiding this comment.
@rschlussel
I don't know. I don't feel that this change worth the trouble at the moment.
You are saying "to construct a nicely printed string with all the fields" - are they going to be more nicely printed than with the current code? How?
Maybe we can consider a small refactor followup to move all Checksums to such format?
There was a problem hiding this comment.
not necessarily nicer than the current code. just helps you with not duplicating for the floating point vs. non-floating point case. It would look something like ArrayColumnChecksum{cardinality_checksum=123, cardinality_sum=345, ...}. anyway, fine to leave it as is.
presto-verifier/src/main/java/com/facebook/presto/verifier/checksum/ArrayColumnChecksum.java
Outdated
Show resolved
Hide resolved
b8c6c4b to
f7e5e3d
Compare
Basically allow verifier to validate array(float) and array(double) columns in the same way as we validate float and double columns. Later we can extend this to the maps as well.
f7e5e3d to
0c896e3
Compare
Description
Basically allow verifier to validate array(float) and array(double) columns in the same way as we validate float and double columns. Later we can extend this to the maps as well.
Motivation and Context
We are planning to use this when comparing Presto Native with Presto Java to reduce the noise from array(double) columns. This will also strengthen the Presto verification as non-deterministic array(double) column could hide a valid correctness issue in another column by marking query SKIPPED in the verifier.
Test Plan
Added new unit test and updated existing one.
Ran verifier on a query producing non-deterministic array(double) (artificial query).
Result is successful: the query passes correctness in the new verifier, while getting skipped in the old one.
Successfully run real use case verifier suite on the new verifier.