
Allow Verifier to validate varchar as floating point#23312

Merged
gggrace14 merged 1 commit into prestodb:master from gggrace14:resolvesimple
Aug 7, 2024

Conversation

@gggrace14
Contributor

@gggrace14 gggrace14 commented Jul 26, 2024

Description

Treat a varchar column as double and apply floating point column validation if it is detected that the varchar column is derived from casting floating points. This applies to simple varchar columns, array(varchar) columns, and map(varchar, varchar) columns. Structured column types with more deeply nested varchars are not supported.

Add verifier config --validate-string-as-double.

Resolve #23348
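As an illustration of what the floating point validation above amounts to, here is a minimal sketch (the class name and the relative-error bound are illustrative, not the Verifier's actual implementation): values are compared within a relative error margin instead of requiring exact checksum equality.

```java
public class StringAsDoubleValidation {
    // Illustrative relative error bound; the real Verifier uses its own configured tolerance.
    private static final double RELATIVE_ERROR = 1e-4;

    // Compare two varchar values as doubles within a relative error margin,
    // instead of requiring an exact string/checksum match.
    static boolean matches(String control, String test) {
        double c = Double.parseDouble(control);
        double t = Double.parseDouble(test);
        if (c == t) {
            return true; // covers exact matches, including both values being zero
        }
        double denominator = Math.max(Math.abs(c), Math.abs(t));
        return Math.abs(c - t) / denominator < RELATIVE_ERROR;
    }

    public static void main(String[] args) {
        // "1.00000001" and "1.0" differ as strings but match as doubles.
        System.out.println(matches("1.00000001", "1.0")); // true
        System.out.println(matches("1.0", "2.0"));        // false
    }
}
```

This is why an exact-match checksum reports MISMATCH for such columns while a margin-based comparison does not.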

Motivation and Context

Today Presto Verifier reports MISMATCH in ColumnValidator for simple columns or structured output columns composed of varchar, for real-world test cases. This raises false negatives and creates overhead for users consuming the Verifier results. Verifier needs to apply a floating point error margin in this case.

Impact

Removes false negatives when validating varchar columns derived from casting floating points. When --validate-string-as-double is set to true, Verifier control and test checksum queries will include more columns and become more expensive.

Test Plan

Unit tests and production tests.

Contributor checklist

Release Notes

== RELEASE NOTES ==

Verifier Changes
* Add verifier config --validate-string-as-double to control applying floating point validation to columns composed of varchar, if it is detected that the varchar column is derived from casting floating points. :pr:`23312`

@gggrace14 gggrace14 requested review from a team as code owners July 26, 2024 19:39
@gggrace14 gggrace14 requested a review from presto-oss July 26, 2024 19:39
@gggrace14 gggrace14 force-pushed the resolvesimple branch 2 times, most recently from d253542 to dee840f on July 26, 2024 20:05
@aditi-pandit
Contributor

@gggrace14 : Thanks for this code.

Did you intend to advance Velox in this PR? If not, please separate that change into its own PR; we suggest that as a best practice.

@gggrace14
Contributor Author

gggrace14 commented Jul 26, 2024

@gggrace14 : Thanks for this code.

Did you intend to advance Velox in this PR? If not, please separate that change into its own PR; we suggest that as a best practice.

Hi @aditi-pandit , I didn't mean to advance Velox with this change. I'll update the PR to leave that part out. I forgot about this. Thanks for the reminder!

@gggrace14 gggrace14 force-pushed the resolvesimple branch 3 times, most recently from 585af59 to 4709f38 on July 30, 2024 00:20
@steveburnett
Contributor

steveburnett commented Jul 30, 2024

Suggest change to release note entry to follow the Release Notes Guidelines:

== RELEASE NOTES ==

Verifier Changes
* Add configuration property --validate-string-as-double to control applying floating point validation to columns composed of varchar, if it is detected that the varchar column is derived from casting floating points. :pr:`23312`

Revised.

@amitkdutta amitkdutta requested a review from rschlussel August 1, 2024 05:38
Contributor

@amitkdutta amitkdutta left a comment

Generally looks OK to me. Thanks so much @gggrace14 for working on it. I have one question: why don't we make validate-string-as-double default to true? Do you envision any issue with making it the default?

Contributor

@mbasmanova mbasmanova left a comment

@gggrace14 Ge, let's document this enhancement in https://prestodb.io/docs/current/admin/verifier.html

Do we have other recent enhancements documented already? If not, let's prioritize documenting these as well.

CC: @steveburnett

if (useStringAsDoublePath) {
    Column asDoubleArrayColumn = getAsDoubleArrayColumn(column);
    return new ArrayColumnChecksum(
            checksum,
Contributor

why is this null for floating point but not null for "useStringAsDouble"?

Contributor Author

Revised, and I can set it to null. This field will not be used in the useStringAsDouble branch; I included it for debugging.

}
}

// Check if the column is derived from a floating point column by comparing the count of the original null values and the count of null values when casting the column back
Contributor

I had a hard time understanding this comment because it assumes that you already know what's going on. Could you clarify it?

I think what you are saying is that we do the following:
count_if(x IS NULL)
count_if(CAST(x AS DOUBLE) IS NULL)
and if the null counts are the same, that means we were able to cast all the varchar values as doubles.
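The null-count detection described here could be sketched as follows (a minimal illustration with hypothetical names; the real Verifier expresses this as count_if aggregations inside the checksum query, and Presto's cast semantics differ slightly from Java's Double.parseDouble):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class StringAsDoubleDetector {
    // A varchar column is treated as derived from floating points when casting
    // it back to double produces no new nulls, i.e. every non-null value parses
    // as a double. This mirrors comparing count_if(x IS NULL) with
    // count_if(CAST(x AS DOUBLE) IS NULL).
    public static boolean looksLikeDouble(List<String> values) {
        long originalNulls = values.stream().filter(Objects::isNull).count();
        long nullsAfterCast = values.stream()
                .filter(v -> v == null || !parsesAsDouble(v))
                .count();
        return nullsAfterCast == originalNulls;
    }

    // Note: Double.parseDouble is only an approximation of Presto's cast rules.
    private static boolean parsesAsDouble(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDouble(Arrays.asList("1.5", null, "2.0"))); // true
        System.out.println(looksLikeDouble(Arrays.asList("1.5", "abc")));       // false
    }
}
```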

Contributor Author

Revised with your suggestion. Yes, that's exactly what I meant.

// Meaning we cannot resolve column mismatch, because value type is not a floating type and a mismatch in the values could indicate a real correctness issue.
// In order to resolve column mismatch in such a situation, generate an extra checksum for the values.
if (isFloatingPointType(keyType) && !isFloatingPointType(valueType)) {
if ((isFloatingPointType(keyType) || shouldValidateStringAsDouble(keyType)) && !isFloatingPointType(valueType)) {
Contributor

should we also skip this if the value type is a string that's really a double?

Contributor Author

Hi Rebecca, as in another comment, after this MapColumnValidator component there is another related component called FailureResolver. Today MapColumnValidator does an exact checksum match instead of floating point validation. When MapColumnValidator returns a mismatch, if FailureResolver sees that the map is comprised of floating points, it auto-resolves the mismatch and considers it correct.

The reason we skip here today for floating point is this FailureResolver approach. I think we should get rid of FailureResolver and make MapColumnValidator do floating point validation for floating points, as ArrayColumnValidator does with #22143.

In this PR, for map of varchar, I directly do floating point validation for varchar. So we need to check both the key column and the value column, and do floating point validation if either of them is a floating point column or a varchar column that's really a double.

valuesFloatingPointChecksum = Optional.of(FloatingPointColumnValidator.toColumnChecksum(valuesColumn, checksumResult, checksumResult.getRowCount()));
}
Object valuesChecksum = null;
if ((isFloatingPointType(keyType) || isDoubleKeyAsString) && !isFloatingPointType(valueType)) {
Contributor

Same question: why don't we skip if the value type isDoubleValueAsString?

builder.add(new SingleColumn(checksum, Optional.of(delimitedIdentifier(getChecksumColumnAlias(column)))));
builder.add(new SingleColumn(keysChecksum, Optional.of(delimitedIdentifier(getKeysChecksumColumnAlias(column)))));

// We need values checksum in one case only: when key is a floating point type and value is not.
Contributor

If we're doing floating point validation on the map now does this still apply? wouldn't we always want to do floating point validation on the values if they are floating point types?

Contributor Author

Nope, for map of floating points, we're not doing floating point validation today, and I think we need to add that. I see you discussed this in #22793 (comment), and I agree. Either Sergey or I could add that later, but it needs a separate PR, given this PR is already very large.

Contributor

I think I'm a bit confused because we are generating all the columns for floating point values in maps as we would for floating point validation. Is the difference that we are then comparing that they are exactly equal rather than using an error margin?

Contributor Author

Hi Rebecca, for the map-of-floating-point branch, we're not generating columns for floating point validation today, and I keep it as it is with this PR: we do the exact checksum match. So this MapColumnValidator component probably often fails, I guess (attaching the logic block below). There's another component following this ColumnValidator in the verifier workflow, called FailureResolver. For a map mismatch returned by ColumnValidator, if FailureResolver sees that the map has a floating point key, it auto-resolves the mismatch and considers it correct.

I think this MapColumnValidator + FailureResolver approach for maps is too inaccurate, and we should do floating point validation in MapColumnValidator and get rid of FailureResolver. The recent #22143 improved ArrayColumnValidator to do floating point validation for arrays, and we should make the same change for map. That needs a separate PR.

Then in this PR, which is for map of varchar, I make the map-of-varchar path directly do floating point validation. We're generating floating point columns only for this path.

builder.add(new SingleColumn(checksum, Optional.of(delimitedIdentifier(getChecksumColumnAlias(column)))));
builder.add(new SingleColumn(keysChecksum, Optional.of(delimitedIdentifier(getKeysChecksumColumnAlias(column)))));
// We need values checksum in one case only: when key is a floating point type and value is not.
// In such case, when both column checksum and the key checksum do not match, we cannot tell if the values match or not.
// Meaning we cannot resolve column mismatch, because value type is not a floating type and a mismatch in the values could indicate a real correctness issue.
// In order to resolve column mismatch in such a situation, generate an extra checksum for the values.
if ((isFloatingPointType(keyType) || shouldValidateStringAsDouble(keyType)) && !isFloatingPointType(valueType)) {
    Expression valuesChecksum = generateArrayChecksum(functionCall("map_values", column.getExpression()), new ArrayType(valueType));
    builder.add(new SingleColumn(valuesChecksum, Optional.of(delimitedIdentifier(getValuesChecksumColumnAlias(column)))));

boolean keyContainsFloatingPoint = containsFloatingPointType(((MapType) columnType).getKeyType());
boolean valueContainsFloatingPoint = containsFloatingPointType(((MapType) columnType).getValueType());

Contributor

I see, so am I correct that with your change, if it's a double string we'll use floating point validation if the config property is set, but if it's just a double we don't do floating point validation? Can we have floating point types go through the same path?

Contributor Author

@gggrace14 gggrace14 Aug 6, 2024

I see, so am I correct that with your change, if it's a double string we'll use floating point validation if the config property is set, but if it's just a double we don't do floating point validation?

@rschlussel Rebecca, yes, exactly, that is the case.

Regarding having the true floating point types go through the same path: yes, we could and should do that next. The floating point validation is abstracted into a method in FloatingPointColumnValidator, and we could call that after refactoring a bit of MapColumnValidator.validate(). We prefer to make it a separate PR. These changes need quite a bit of testing to make sure they are free of false positives and negatives, so I would like to separate the change.

@gggrace14
Contributor Author

@gggrace14 Ge, let's document this enhancement in https://prestodb.io/docs/current/admin/verifier.html

Do we have other recent enhancements documented already? If not, let's prioritize documenting these as well.

CC: @steveburnett

@mbasmanova Masha, sure, I will add it to verifier.rst and include it in this PR.
For the recent enhancements --control/test.reuse-table and --function-substitutes, I've already added documentation to verifier.rst. I will check again whether I need to add more to other parts of verifier.rst.

@gggrace14
Contributor Author

Generally looks OK to me. Thanks so much @gggrace14 for working on it. I have one question: why don't we make validate-string-as-double default to true? Do you envision any issue with making it the default?

@amitkdutta, thanks a lot for the review!
Right now, turning on --validate-string-as-double adds a couple of aggregate columns to the CHECKSUM queries. They're potentially expensive and make the CHECKSUM queries run longer on the control cluster. We have already observed slowness of the CHECKSUM queries, and we have 3 pairs of CONTROL & TEST CHECKSUM queries. So our approach is to only turn on this config when retesting cases that previously reported COLUMN_MISMATCH.

Contributor

@steveburnett steveburnett left a comment

Thanks for the doc! A few formatting nits and rephrasing suggestions, and some errors in the local doc build about indenting lists and nested lists.

steveburnett
steveburnett previously approved these changes Aug 6, 2024
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, new local doc build, everything looks good. Thanks!

Treat varchar column as double and apply floating point column
validation, if it is detected that the varchar column is derived
from casting floating points. This applies to simple varchar columns,
array(varchar) and map(varchar, varchar). More deeply nested varchars are
not supported.

Add verifier config --validate-string-as-double.
Contributor

@rschlussel rschlussel left a comment

Thanks for the explanations. The current state after this PR is pretty weird where maps with numeric strings will use floating point validation but maps with actual double types don't do that. Like you mentioned, we need a follow up to bring floating point keys/values in line with string as double and consolidate the code. As long as you have plans to do that, it's fine to merge this first.

@gggrace14
Contributor Author

gggrace14 commented Aug 7, 2024

Thanks for the explanations. The current state after this PR is pretty weird where maps with numeric strings will use floating point validation but maps with actual double types don't do that. Like you mentioned, we need a follow up to bring floating point keys/values in line with string as double and consolidate the code. As long as you have plans to do that, it's fine to merge this first.

I see. Agreed that the state is weird and takes a lot of detail to communicate to others. We will add floating point validation for map as the next step.

Thank you, @rschlussel , for the review and suggestion!

@gggrace14 gggrace14 merged commit 937e863 into prestodb:master Aug 7, 2024
@gggrace14 gggrace14 deleted the resolvesimple branch August 7, 2024 03:19
@tdcmeehan tdcmeehan mentioned this pull request Aug 23, 2024


Successfully merging this pull request may close these issues.

Improve ColumnValidator to validate a varchar as a floating point column if the column values are derived from floating points