-
Notifications
You must be signed in to change notification settings - Fork 1.1k
fix: liberal parsing of zero scale decimals #8700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
arrow-cast/src/parse.rs
Outdated
| if scale == 0 && fractionals > 0 { | ||
| // The input string contained some fractional digits after the decimal point despite | ||
| // the scale being zero. Eject all the fractional digits from the number. | ||
| result = result.div_wrapping(base.pow_wrapping(fractionals as _)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be correct if there are many fractional digits, e.g. 20+ ?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would, added another test case on that note (("123.000000000000000000004", 123)).
95a0e5a to
124441b
Compare
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good if we could update the docstring of this function to document this expected behaviour
| } | ||
| } | ||
|
|
||
| if !is_e_notation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a test case with e notation, for a sanity check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, added a couple of e-notation test cases now
("0.123e2", 12, 3),
("0.123e4", 1230, 10),
("1.23e4", 12300, 10),
("12.3e4", 123000, 10),
("123e4", 1230000, 10),
|
The current implementation has an overflow issue. You can test with the following code: The first assert will pass. The second assert will fail: There is apparently an interger overflow. |
fc36673 to
3dbfe0d
Compare
Great catch, thanks! I revised the implementation a bit, and tried a drive-by |
|
That said, the following (valid) example will lead to an overflow again let result_128 = parse_decimal::<Decimal128Type>("20000000000000000000000000000000000002.0e-1", 38, 0).unwrap();
assert_eq!(result_128, 20000000000000000000000000000000000000);
assertion `left == right` failed
left: -1402823669209384634633746074317682114
right: 20000000000000000000000000000000000000In fact this also happens with scale 1, and it happens on The main problem there is the way e-notation is parsed—by the time we hit e-notation processing we've already applied all the significant digits to the result without checking whether some of them are not needed (e.g. because of the negative exponent moving the decimal point to the left). Hence we may have overflowed by the point we want to do To fix this i think we need to |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies it took so long to re-review, it seems the whole fix has changed in terms of code and I'm not as familiar with this part of the codebase so it was hard to separate what was a refactor vs what was the actual fix code (would greatly appreciate it if there were some pointer comments to this effect).
It would be good if we could update the docstring of this function to document this expected behaviour
Just bumping this previous suggestion.
This however goes beyond the scope of this PR/issue I think, and probably requires more discussion (especially around performance implications).
It would be great to raise an issue for this 👍
arrow-cast/src/parse.rs
Outdated
| let mut bs = s.as_bytes().iter().skip(index + fractionals as usize); | ||
|
|
||
| while let Some(b) = bs.next() { | ||
| // complete parsing of any unprocessed fractionals up to the exponent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must admit I'm confused by this line; above we state we skip to the exponent or after any processed fractionals, but here we say we parse unprocessed fractionals somehow?
After checking the code it seems it corresponds to this check:
Where we previously skipped fractionals for exceeding the scale, but now we include them since we parse the exponent. I feel we should capture this detail here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct—this is indeed (more than) slightly confusing, and it stems from unclear separation of concerns. We skip parsing any fractionals after scale in parse_decimal, unless we hit into the e-notation and realise we may actually need those fractionals after all, in which case we parse them in parse_e_notation.
This is also one more argument for going with the above suggestion:
To fix this i think we need to s.split_once(['e', 'E']) first, then optionally process the exponent, and only then move on to the mantissa, since then we know exactly what numbers can be thrown away. This however goes beyond the scope of this PR/issue I think, and probably requires more discussion (especially around performance implications).
Besides fixing another category of edge cases, this would also simplify the logic. I'm only worried about performance though, since that would change the current one-pass into a two-pass algorithm.
For now I'll try to elaborate in the comment a bit more. I'll also open an issue for a cleaner parse_decimal/parse_e_notation separation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the issue capturing the broader problems of parse_decimal/parse_e_notation: #9170
| bs = s.as_bytes().iter().skip(index); | ||
| } | ||
| // skip to the exponent index directly or just after any processed fractionals | ||
| let mut bs = s.as_bytes().iter().skip(index + fractionals as usize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're refactoring perhaps we should change the type of fractions to be unsigned? It was a little confusing to have it be signed to me as I had to verify that your refactor to collapse the if/else made sense since fractionals was never negative (and the + 1 in the if block is accounted for by adjusting index at the callsite)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, though this leads to some cascading casting and type changes, as we use fractionals to compute/compare with other variables, notably with scale which is i8.
So we can then make scale an unsigned type too, since it looks like negative scale is not supported right now.
However I'd rather retain the possibility of supporting negative scales at some point, so I went with changing the scale type to i16 instead. That way fractionals: u8 can be safely cast to i16 for comparison/arithmetic (plus parse_e_notation already expects scale to be i16).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, scale being i8 is more widespread than I thought, I reverted the commit that changes it.
I don't think we can seamlessly change fractionals to an unsigned type then.
|
I tested this change with the current datafusion tests (I used apache/datafusion#19728 (review)) and all the tests passed for what it is worth |
3dbfe0d to
f4f3ff5
Compare
c30841a to
f4f3ff5
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gruuya -- this is looking good
arrow-cast/src/parse.rs
Outdated
| /// The result value can't be out of bounds. | ||
| /// Expected behavior: | ||
| /// - the result value can't be out of bounds. | ||
| /// - when parsing a decimal with scale 0, any digits pass the decimal point will be discarded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for e-notation the digits after the decimal do affect the value (e.g., "1.23e4" → 12300) as shown in the tests. Can we clarify this slightly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, let me re-word this.
| for (s, i, precision) in zero_scale_tests { | ||
| let result_128 = parse_decimal::<Decimal128Type>(s, precision, 0).unwrap(); | ||
| assert_eq!(i, result_128); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add some error tests, for example
.blag
Should both fail to parse
Also how about these values? Should they be valid?
"1e", "1e+", "1e-", ".", and ".123"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, these cases turned out to be very valuable.
For one thing "." case wasn't being correctly handled for zero scale decimals, I fixed that now by factoring out various validation points throughout the code into a common unsupported check at the start
if matches!(s, "" | "-" | "+" | ".") {
return Err(ArrowError::ParseError(format!(
"can't parse the string value {s} to decimal"
)));
}I found this is a lot easier to grok, since previously these validations were scattered throughout parse_decimal and deduced implicitly based on side-effects (e.g. we finished parsing a decimal with a decimal point but the digit count is zero). Let me know if you have performance concerns about this, and I can resort to the old (implicit) manner on deducing unsupported strings.
For another thing, the test cases "1e", "1e+" and "1e-" are curious. Strictly speaking they aren't valid, but I wouldn't be surprised if some system can parse/generate these. The question is what should parse_decimal output for them. I tested these on both main and this branch:
| input | output(main) | output(this branch) |
|------------------------------------------------|-------------------------------------------------------------------|---------------------|
| parse_decimal::<Decimal128Type>("1e", 5, 0) | Err(ParseError("can't parse the string value 1e to decimal")) | Ok(1) |
| parse_decimal::<Decimal128Type>("1e+", 5, 0) | Ok(1) | Ok(1) |
| parse_decimal::<Decimal128Type>("1e-", 5, 0) | Ok(1) | Ok(1) |
| parse_decimal::<Decimal128Type>("1e", 5, 2) | Err(ParseError("can't parse the string value 1e to decimal")) | Ok(100) |
| parse_decimal::<Decimal128Type>("1e+", 5, 2) | Ok(100) | Ok(100) |
| parse_decimal::<Decimal128Type>("1e-", 5, 2) | Ok(100) | Ok(100) |
To make these consistent I think we should error out on all of them (that's what Postgres does), at least until there's a worthwhile precedent to parse some of these.
So if the matches! approach from above is fine with you, and if you agree we should make these unsupported, then I can just append these patterns there. Otherwise, let me know which ones would you think it'd make sense to parse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if the matches! approach from above is fine with you, and if you agree we should make these unsupported, then I can just append these patterns there. Otherwise, let me know which ones would you think it'd make sense to parse.
I think that sounds good to me. I tested some other systems and they also can't parse these examples
D select '1e'::decimal;
Conversion Error:
Could not convert string "1e" to DECIMAL(18,3)
LINE 1: select '1e'::decimal;spark also doesn't parse them
spark-sql (default)> SELECT cast ('1' as DECIMAL(5,3));
1.000
Time taken: 0.048 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT cast ('1e' as DECIMAL(5,3));
NULL
Time taken: 0.052 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT cast ('e' as DECIMAL(5,3));
NULL
Time taken: 0.034 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT cast ('1e+' as DECIMAL(5,3));
NULL
Time taken: 0.033 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT cast ('1e' as DECIMAL(5,0));
NULL
Time taken: 0.038 seconds, Fetched 1 row(s)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I've appended these cases to unsupported check and test cases.
|
Spark 4 throws an error spark-sql (default)> SELECT cast ('1e' as DECIMAL(5,0));
[CAST_INVALID_INPUT] The value '1e' of the type "STRING" cannot be cast to "DECIMAL(5,0)" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. SQLSTATE: 22018
== SQL (line 1, position 8) ==
SELECT cast ('1e' as DECIMAL(5,0))
^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
By chance would this PR also address Or is it a different cause? |
24a56b9 to
3216e33
Compare
No, actually this is the same problem I laid out here #8700 (comment). And as mentioned one potential solution for it is #9170, where we could avoid redundant multiplication by base factor, followed immediately by division when we parse the e-notation (since then we'd know the exponent in advance, and could stop processing digits on time to avoid the overflow). |
Thanks for checking, I'll update the issue to link these two. |
| let mut digits: u8 = 0; | ||
| let base = T::Native::usize_as(10); | ||
|
|
||
| if matches!(s, "" | "-" | "+" | "." | "1e" | "1e+" | "1e-") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels like an oddly specific list of error conditions to check -- It seems like it should be checking that s has only digits and is not empty?
Which issue does this PR close?
parse_decimalwith zero scale #8699.Alternative to #8702 (see #8699 for a discussion).
Rationale for this change
Make parsing of zero-scale decimals more correct/consistent.
What changes are included in this PR?
When a decimal with scale 0 is being parsed, but it has some digits pass the decimal point those will effectively be discarded.
Are these changes tested?
Yes.
Are there any user-facing changes?
Aligned parsing of zero-scale decimals.