Allow Verifier to validate varchar as floating point #23312

gggrace14 merged 1 commit into prestodb:master
Conversation
@gggrace14: Thanks for this code. Did you intend to advance Velox in this PR? Please can you separate it into its own PR? We suggest that as a best practice.
Hi @aditi-pandit, I didn't mean to advance Velox with this change. I'll update the PR to leave that part out. I forgot about this. Thanks for the reminder!
Suggested a change to the release note entry to follow the Release Notes Guidelines. Revised.
amitkdutta
left a comment
Generally looks OK to me. Thanks so much @gggrace14 for working on it. I have one question: why don't we make validate-string-as-double default to true? Do you envision any issue with making it the default?
mbasmanova
left a comment
@gggrace14 Ge, let's document this enhancement in https://prestodb.io/docs/current/admin/verifier.html
Do we have other recent enhancements documented already? If not, let's prioritize documenting these as well.
CC: @steveburnett
if (useStringAsDoublePath) {
    Column asDoubleArrayColumn = getAsDoubleArrayColumn(column);
    return new ArrayColumnChecksum(
            checksum,
why is this null for floating point but not null for "useStringAsDouble"?
Revised, and I can set it to null. This field is not used in the useStringAsDouble branch; I included it for debugging.
    }
}
// Check if the column is derived from a floating point column by comparing the count of the original null values and the count of null values when casting the column back
I had a hard time understanding this comment because it assumes that you already know what's going on. Could you clarify it?
I think what you are saying is that we compute:

count_if(x IS NULL)
count_if(CAST(x AS DOUBLE) IS NULL)

and if the null counts are the same, that means we were able to cast all the varchar values as doubles.
Revised with your suggestion. Yes, that's exactly what I meant.
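The detection heuristic discussed in this thread can be sketched outside the Verifier. Below is a minimal Python illustration only; the actual implementation builds SQL count_if aggregates in Java, and the function names here are hypothetical. The cast mimics TRY_CAST semantics (a failed cast yields null rather than an error):

```python
def try_cast_double(value):
    """Mimic TRY_CAST(x AS DOUBLE): return None when the cast fails or input is null."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None

def is_string_as_double(values):
    """Compare count_if(x IS NULL) with count_if(TRY_CAST(x AS DOUBLE) IS NULL).

    Equal counts mean every non-null varchar value cast back to double
    successfully, so the column was likely derived from casting doubles.
    """
    original_nulls = sum(1 for v in values if v is None)
    cast_nulls = sum(1 for v in values if try_cast_double(v) is None)
    return original_nulls == cast_nulls
```

For example, `["1.5", None, "2.0E10"]` is detected as a cast floating point column, while `["1.5", "abc"]` is not, because casting "abc" introduces a new null.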
// Meaning we cannot resolve column mismatch, because value type is not a floating type and a mismatch in the values could indicate a real correctness issue.
// In order to resolve column mismatch in such a situation, generate an extra checksum for the values.
- if (isFloatingPointType(keyType) && !isFloatingPointType(valueType)) {
+ if ((isFloatingPointType(keyType) || shouldValidateStringAsDouble(keyType)) && !isFloatingPointType(valueType)) {
should we also skip this if the value type is a string that's really a double?
Hi Rebecca, as noted in another comment, after this MapColumnValidator component there is a related component called FailureResolver. Today MapColumnValidator does an exact checksum match instead of floating point validation. When MapColumnValidator returns a mismatch, if FailureResolver sees that the map is composed of floating points, it auto-resolves the mismatch and considers it correct.
The reason we skip here today for floating point is this FailureResolver approach. I think we should get rid of FailureResolver and make MapColumnValidator do floating point validation for floating point types, as ArrayColumnValidator does with #22143.
In this PR, which covers map of varchar, I directly do floating point validation for varchar. So we need to check both the key column and the value column, and we do floating point validation if either of them is a floating point column or a varchar column that's really a double.
    valuesFloatingPointChecksum = Optional.of(FloatingPointColumnValidator.toColumnChecksum(valuesColumn, checksumResult, checksumResult.getRowCount()));
}
Object valuesChecksum = null;
if ((isFloatingPointType(keyType) || isDoubleKeyAsString) && !isFloatingPointType(valueType)) {
Same question about why we don't skip this if the value type is a double-as-string (isDoubleValueAsString).
builder.add(new SingleColumn(checksum, Optional.of(delimitedIdentifier(getChecksumColumnAlias(column)))));
builder.add(new SingleColumn(keysChecksum, Optional.of(delimitedIdentifier(getKeysChecksumColumnAlias(column)))));

// We need values checksum in one case only: when key is a floating point type and value is not.
If we're doing floating point validation on the map now, does this still apply? Wouldn't we always want to do floating point validation on the values if they are floating point types?
No, for map of floating points we're not doing floating point validation today, and I think we need to add that. I see you discussed this in #22793 (comment), and I agree. Either Sergey or I could add that later, but it needs a separate PR, given this PR is already very large.
I think I'm a bit confused because we are generating all the columns for floating point values in maps as we would for floating point validation. Is the difference that we are then comparing that they are exactly equal rather than using an error margin?
Hi Rebecca, for the branch of map of floating point, we're not generating columns for floating point validation today, and I keep it as-is in this PR; we do the exact checksum match. So this MapColumnValidator component probably fails often, I guess. (Attaching the logic block below.) There's another component following this ColumnValidator in the verifier workflow, called FailureResolver. For a map mismatch returned by ColumnValidator, if FailureResolver sees the map has a floating point key, it auto-resolves the mismatch and considers it correct.
I think this MapColumnValidator + FailureResolver combination for maps is too inaccurate. We should do floating point validation in MapColumnValidator and get rid of FailureResolver. The recent #22143 improved ArrayColumnValidator to do floating point validation for arrays, and we should make the same change for map; that needs a separate PR.
Then in this PR, which is for map of varchar, I make the path for map of varchar do floating point validation directly. We're generating floating point columns only for this path.
I see. So am I correct that, with your change, if it's a double string we'll use floating point validation when the config property is set, but if it's an actual double we don't do floating point validation? Can we have floating point types go through the same path?
@rschlussel Rebecca, yes, exactly, that is the case.
Regarding having the true floating point type go through the same path: yes, we could and should do that next. The floating point validation is abstracted into a method in FloatingPointColumnValidator, and we could call that after refactoring a bit of MapColumnValidator.validate(). We prefer to make it a separate PR; these changes need quite a bit of testing to make sure they are free of false positives and negatives, so I would like to separate the change.
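For context, the error-margin idea behind floating point validation discussed here can be illustrated with a small sketch. This is a hypothetical simplification: the function name and the margin value are assumptions, and as I understand it the Verifier's FloatingPointColumnValidator compares column-level aggregates with an error bound rather than raw values pairwise.

```python
# Assumed relative error margin, for illustration only.
RELATIVE_ERROR_MARGIN = 1e-4

def matches_with_margin(control, test, margin=RELATIVE_ERROR_MARGIN):
    """Compare two floating point results with a relative error margin.

    Nulls must match exactly; zeros must match exactly (relative error is
    undefined); otherwise accept values whose relative difference is within
    the margin.
    """
    if control is None or test is None:
        return control is None and test is None
    if control == 0.0 or test == 0.0:
        return control == test
    return abs(control - test) / max(abs(control), abs(test)) <= margin
```

Under this scheme, values that differ only by accumulated floating point error (e.g. 1.0 vs 1.00001) match, while genuinely different results (e.g. 1.0 vs 2.0) still mismatch.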
@mbasmanova Masha, sure, I will add it to verifier.rst and include it in this PR.
@amitkdutta, thanks a lot for the review!
steveburnett
left a comment
Thanks for the doc! A few formatting nits and rephrasing suggestions, plus some errors in the local doc build about indented lists and nested lists.
steveburnett
left a comment
LGTM! (docs)
Pull updated branch, new local doc build, everything looks good. Thanks!
Treat a varchar column as double and apply floating point column validation if it is detected that the varchar column is derived from casting floating points. This applies to simple varchar columns, array(varchar), and map(varchar, varchar). More deeply nested varchars are not supported. Add verifier config --validate-string-as-double.
rschlussel
left a comment
Thanks for the explanations. The current state after this PR is pretty weird, where maps with numeric strings will use floating point validation but maps with actual double types won't. Like you mentioned, we need a follow-up to bring floating point keys/values in line with string-as-double and consolidate the code. As long as you have plans to do that, it's fine to merge this first.
I see. I agree that the state is weird and takes a lot of detail to communicate to others. We will add floating point validation for map as the next step. Thank you, @rschlussel, for the review and suggestions!
Description
Treat a varchar column as double and apply floating point column validation if it is detected that the varchar column is derived from casting floating points. This applies to simple varchar columns, array(varchar) columns, and map(varchar, varchar) columns. Structured column types with more deeply nested varchars are not supported.
Add verifier config --validate-string-as-double.
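The resulting behavior can be sketched as follows. This is a hedged illustration with hypothetical names, not the Verifier's actual API: the new config gates floating point validation for varchar columns detected as castable to double, while true floating point columns already use error-margin validation and everything else keeps the exact checksum match.

```python
def choose_validation(column_type, validate_string_as_double, castable_to_double):
    """Pick a validation strategy for a simple column (illustrative only)."""
    if column_type in ("real", "double"):
        # True floating point columns already get error-margin validation.
        return "floating-point"
    if column_type == "varchar" and validate_string_as_double and castable_to_double:
        # New in this PR: varchar derived from casting floating points.
        return "floating-point"
    # Everything else keeps the exact checksum match.
    return "exact-checksum"
```

For instance, a varchar column whose values all cast back to double gets floating point validation only when --validate-string-as-double is enabled; with the flag off, it falls back to the exact checksum match as before.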
Resolve #23348
Motivation and Context
Today Presto Verifier reports MISMATCH in ColumnValidator for simple or structured output columns composed of varchar in real-world test cases. This raises false negatives and creates overhead for Verifier users consuming the results. Verifier needs to apply a floating point error margin in this case.
Impact
Removes false negatives when validating varchar columns derived from casting floating points. When --validate-string-as-double is set to true, Verifier control and test checksum queries will include more columns and become more expensive.
Test Plan
Unit tests and production tests.
Contributor checklist
Release Notes