
Allow Verifier to validate varchar as floating point#23312

Merged
gggrace14 merged 1 commit into prestodb:master from gggrace14:resolvesimple
Aug 7, 2024

Conversation

@gggrace14
Contributor

@gggrace14 gggrace14 commented Jul 26, 2024

Description

Treat a varchar column as double and apply floating point column validation if it is detected that the varchar column is derived from casting floating points. This applies to simple varchar columns, array(varchar) columns, and map(varchar, varchar) columns. Structured column types with more deeply nested varchars are not supported.

Add verifier config --validate-string-as-double.

Resolve #23348
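As an illustration of what the floating point validation above amounts to, here is a minimal sketch (the class name and the relative-error bound are illustrative, not the Verifier's actual implementation): values are compared within a relative error margin instead of requiring exact checksum equality.

```java
public class StringAsDoubleValidation {
    // Illustrative relative error bound; the real Verifier uses its own configured tolerance.
    private static final double RELATIVE_ERROR = 1e-4;

    // Compare two varchar values as doubles within a relative error margin,
    // instead of requiring an exact string/checksum match.
    static boolean matches(String control, String test) {
        double c = Double.parseDouble(control);
        double t = Double.parseDouble(test);
        if (c == t) {
            return true; // covers exact matches, including both values being zero
        }
        double denominator = Math.max(Math.abs(c), Math.abs(t));
        return Math.abs(c - t) / denominator < RELATIVE_ERROR;
    }

    public static void main(String[] args) {
        // "1.00000001" and "1.0" differ as strings but match as doubles.
        System.out.println(matches("1.00000001", "1.0")); // true
        System.out.println(matches("1.0", "2.0"));        // false
    }
}
```

This is why an exact-match checksum reports MISMATCH for such columns while a margin-based comparison does not.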

Motivation and Context

Today Presto Verifier reports MISMATCH in ColumnValidator for simple columns or structured output columns composed of varchar, for real-world test cases. This raises false negatives and creates overhead for users consuming the Verifier results. Verifier needs to apply a floating point error margin in this case.

Impact

Removes false negatives when validating varchar columns derived from casting floating points. When --validate-string-as-double is set to true, Verifier control and test checksum queries will include more columns and become more expensive.

Test Plan

Unit tests and production tests.

Contributor checklist

Release Notes

== RELEASE NOTES ==

Verifier Changes
* Add verifier config --validate-string-as-double to control applying floating point validation to columns composed of varchar, if it is detected that the varchar column is derived from casting floating points. :pr:`23312`

@gggrace14 gggrace14 requested review from a team as code owners July 26, 2024 19:39
@gggrace14 gggrace14 requested a review from presto-oss July 26, 2024 19:39
@gggrace14 gggrace14 force-pushed the resolvesimple branch 2 times, most recently from d253542 to dee840f on July 26, 2024 20:05
@aditi-pandit
Contributor

@gggrace14 : Thanks for this code.

Did you intend to advance Velox in this PR? If not, please separate that change into its own PR; we suggest that as a best practice.

@gggrace14
Contributor Author

gggrace14 commented Jul 26, 2024

@gggrace14 : Thanks for this code.

Did you intend to advance Velox in this PR? If not, please separate that change into its own PR; we suggest that as a best practice.

Hi @aditi-pandit , I didn't mean to advance Velox with this change. I'll update the PR to leave that part out. I forgot about this. Thanks for the reminder!

@gggrace14 gggrace14 force-pushed the resolvesimple branch 3 times, most recently from 585af59 to 4709f38 on July 30, 2024 00:20
@steveburnett
Contributor

steveburnett commented Jul 30, 2024

Suggest change to release note entry to follow the Release Notes Guidelines:

== RELEASE NOTES ==

Verifier Changes
* Add configuration property --validate-string-as-double to control applying floating point validation to columns composed of varchar, if it is detected that the varchar column is derived from casting floating points. :pr:`23312`

Revised.

@amitkdutta amitkdutta requested a review from rschlussel August 1, 2024 05:38
Contributor

@amitkdutta amitkdutta left a comment

Generally looks OK to me. Thanks so much @gggrace14 for working on it. I have one question: why don't we make validate-string-as-double default to true? Do you envision any issue with making it the default?

Contributor

@mbasmanova mbasmanova left a comment

@gggrace14 Ge, let's document this enhancement in https://prestodb.io/docs/current/admin/verifier.html

Do we have other recent enhancements documented already? If not, let's prioritize documenting these as well.

CC: @steveburnett

if (useStringAsDoublePath) {
    Column asDoubleArrayColumn = getAsDoubleArrayColumn(column);
    return new ArrayColumnChecksum(
            checksum,
Contributor

why is this null for floating point but not null for "useStringAsDouble"?

Contributor Author

Revised, and I can set it to null. This field will not be used in the useStringAsDouble branch; I included it for debugging.

}
}

// Check if the column is derived from a floating point column by comparing the count of the original null values and the count of null values when casting the column back
Contributor

I had a hard time understanding this comment because it assumes that you already know what's going on. Could you clarify it?

I think what you are saying is that we do the following:
count_if(x IS NULL)
count_if(CAST(x AS DOUBLE) IS NULL)
and if the null counts are the same, that means we were able to cast all the varchar values as doubles.
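The null-count detection described here could be sketched as follows (a minimal illustration with hypothetical names; the real Verifier expresses this as count_if aggregations inside the checksum query, and Presto's cast semantics differ slightly from Java's Double.parseDouble):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class StringAsDoubleDetector {
    // A varchar column is treated as derived from floating points when casting
    // it back to double produces no new nulls, i.e. every non-null value parses
    // as a double. This mirrors comparing count_if(x IS NULL) with
    // count_if(CAST(x AS DOUBLE) IS NULL).
    public static boolean looksLikeDouble(List<String> values) {
        long originalNulls = values.stream().filter(Objects::isNull).count();
        long nullsAfterCast = values.stream()
                .filter(v -> v == null || !parsesAsDouble(v))
                .count();
        return nullsAfterCast == originalNulls;
    }

    // Note: Double.parseDouble is only an approximation of Presto's cast rules.
    private static boolean parsesAsDouble(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDouble(Arrays.asList("1.5", null, "2.0"))); // true
        System.out.println(looksLikeDouble(Arrays.asList("1.5", "abc")));       // false
    }
}
```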

Contributor Author

Revised with your suggestion. Yes, that's exactly what I meant.

// Meaning we cannot resolve column mismatch, because value type is not a floating type and a mismatch in the values could indicate a real correctness issue.
// In order to resolve column mismatch in such a situation, generate an extra checksum for the values.
if (isFloatingPointType(keyType) && !isFloatingPointType(valueType)) {
if ((isFloatingPointType(keyType) || shouldValidateStringAsDouble(keyType)) && !isFloatingPointType(valueType)) {
Contributor

should we also skip this if the value type is a string that's really a double?

Contributor Author

Hi Rebecca, as in another comment, after this MapColumnValidator component there is another related component called FailureResolver. Today MapColumnValidator does an exact checksum match instead of floating point validation. When MapColumnValidator returns a mismatch, if FailureResolver sees that the map is comprised of floating points, it auto-resolves the mismatch and considers it correct.

The reason we skip here today for floating point is this FailureResolver approach. I think we should get rid of FailureResolver and make MapColumnValidator do floating point validation for floating points, as ArrayColumnValidator does with #22143.

In this PR, for map of varchar, I directly do floating point validation for varchar. So we need to check both the key column and the value column, and do floating point validation if either of them is a floating point column or a varchar column that's really a double.

valuesFloatingPointChecksum = Optional.of(FloatingPointColumnValidator.toColumnChecksum(valuesColumn, checksumResult, checksumResult.getRowCount()));
}
Object valuesChecksum = null;
if ((isFloatingPointType(keyType) || isDoubleKeyAsString) && !isFloatingPointType(valueType)) {
Contributor

Same question: why don't we skip if the value type isDoubleValueAsString?

builder.add(new SingleColumn(checksum, Optional.of(delimitedIdentifier(getChecksumColumnAlias(column)))));
builder.add(new SingleColumn(keysChecksum, Optional.of(delimitedIdentifier(getKeysChecksumColumnAlias(column)))));

// We need values checksum in one case only: when key is a floating point type and value is not.
Contributor

If we're doing floating point validation on the map now does this still apply? wouldn't we always want to do floating point validation on the values if they are floating point types?

Contributor Author

Nope, for map of floating points, we're not doing floating point validation today, and I think we need to add that. I see you discussed this in #22793 (comment), and I agree. Either Sergey or I could add that later, but it needs a separate PR, given this PR is already very large.

Contributor

I think I'm a bit confused because we are generating all the columns for floating point values in maps as we would for floating point validation. Is the difference that we are then comparing that they are exactly equal rather than using an error margin?

Contributor Author

Hi Rebecca, for the map-of-floating-point branch, we're not generating columns for floating point validation today, and I keep it as it is with this PR: we do the exact checksum match. So this MapColumnValidator component probably often fails, I guess (attaching the logic block below). There's another component following this ColumnValidator in the verifier workflow, called FailureResolver. For a map mismatch returned by ColumnValidator, if FailureResolver sees that the map has a floating point key, it auto-resolves the mismatch and considers it correct.

I think this MapColumnValidator + FailureResolver approach for maps is too inaccurate, and we should do floating point validation in MapColumnValidator and get rid of FailureResolver. The recent #22143 improved ArrayColumnValidator to do floating point validation for arrays, and we should make the same change for map. That needs a separate PR.

Then in this PR, which is for map of varchar, I make the map-of-varchar path directly do floating point validation. We're generating floating point columns only for this path.

builder.add(new SingleColumn(checksum, Optional.of(delimitedIdentifier(getChecksumColumnAlias(column)))));
builder.add(new SingleColumn(keysChecksum, Optional.of(delimitedIdentifier(getKeysChecksumColumnAlias(column)))));
// We need values checksum in one case only: when key is a floating point type and value is not.
// In such case, when both column checksum and the key checksum do not match, we cannot tell if the values match or not.
// Meaning we cannot resolve column mismatch, because value type is not a floating type and a mismatch in the values could indicate a real correctness issue.
// In order to resolve column mismatch in such a situation, generate an extra checksum for the values.
if ((isFloatingPointType(keyType) || shouldValidateStringAsDouble(keyType)) && !isFloatingPointType(valueType)) {
    Expression valuesChecksum = generateArrayChecksum(functionCall("map_values", column.getExpression()), new ArrayType(valueType));
    builder.add(new SingleColumn(valuesChecksum, Optional.of(delimitedIdentifier(getValuesChecksumColumnAlias(column)))));

boolean keyContainsFloatingPoint = containsFloatingPointType(((MapType) columnType).getKeyType());
boolean valueContainsFloatingPoint = containsFloatingPointType(((MapType) columnType).getValueType());

Contributor

I see, so am I correct that with your change, if it's a double string we'll use floating point validation if the config property is set, but if it's just a double we don't do floating point validation? Can we have floating point types go through the same path?

Contributor Author

@gggrace14 gggrace14 Aug 6, 2024

I see, so am I correct that with your change, if it's a double string we'll use floating point validation if the config property is set, but if it's just a double we don't do floating point validation?

@rschlussel Rebecca, yes, exactly, that is the case.

Regarding having the true floating point types go through the same path: yes, we could and should do that next. The floating point validation is abstracted into a method in FloatingPointColumnValidator, and we could call that after refactoring a bit of MapColumnValidator.validate(). We prefer to make it a separate PR. These changes need quite a bit of testing to make sure they are free of false positives and negatives, so I would like to separate the change.

@gggrace14
Contributor Author

@gggrace14 Ge, let's document this enhancement in https://prestodb.io/docs/current/admin/verifier.html

Do we have other recent enhancements documented already? If not, let's prioritize documenting these as well.

CC: @steveburnett

@mbasmanova Masha, sure, I will add it to verifier.rst and include it in this PR.
For the recent enhancements --control/test.reuse-table and --function-substitutes, I've already added documentation to verifier.rst. I will check again whether I need to add more to other parts of verifier.rst.

@gggrace14
Contributor Author

Generally looks OK to me. Thanks so much @gggrace14 for working on it. I have one question: why don't we make validate-string-as-double default to true? Do you envision any issue with making it the default?

@amitkdutta, thanks a lot for the review!
Right now, turning on --validate-string-as-double adds a couple of aggregate columns to the CHECKSUM queries. They're potentially expensive and make the CHECKSUM queries run longer on the control cluster. We have already observed slowness of the CHECKSUM queries, and we have 3 pairs of CONTROL & TEST CHECKSUM queries. So our approach is to only turn on this config when retesting cases that previously reported COLUMN_MISMATCH.

Contributor

@steveburnett steveburnett left a comment

Thanks for the doc! A few formatting nits and rephrasing suggestions, and some errors in the local doc build about indenting lists and nested lists.

steveburnett
steveburnett previously approved these changes Aug 6, 2024
Contributor

@steveburnett steveburnett left a comment

LGTM! (docs)

Pull updated branch, new local doc build, everything looks good. Thanks!

Treat varchar column as double and apply floating point column
validation, if it is detected that the varchar column is derived
from casting floating points. This applies to simple varchar columns,
array(varchar) and map(varchar, varchar). More deeply nested varchars are
not supported.

Add verifier config --validate-string-as-double.
Contributor

@rschlussel rschlussel left a comment

Thanks for the explanations. The current state after this PR is pretty weird where maps with numeric strings will use floating point validation but maps with actual double types don't do that. Like you mentioned, we need a follow up to bring floating point keys/values in line with string as double and consolidate the code. As long as you have plans to do that, it's fine to merge this first.

@gggrace14
Contributor Author

gggrace14 commented Aug 7, 2024

Thanks for the explanations. The current state after this PR is pretty weird where maps with numeric strings will use floating point validation but maps with actual double types don't do that. Like you mentioned, we need a follow up to bring floating point keys/values in line with string as double and consolidate the code. As long as you have plans to do that, it's fine to merge this first.

I see. Agreed that the state is weird and takes a lot of detail to communicate to others. We will add floating point validation for map as the next step.

Thank you, @rschlussel , for the review and suggestion!

@gggrace14 gggrace14 merged commit 937e863 into prestodb:master Aug 7, 2024
@gggrace14 gggrace14 deleted the resolvesimple branch August 7, 2024 03:19
@tdcmeehan tdcmeehan mentioned this pull request Aug 23, 2024


Successfully merging this pull request may close these issues.

Improve ColumnValidator to validate a varchar as a floating point column if the column values are derived from floating points