fix: liberal parsing of zero scale decimals #8700

gruuya · 2025-10-23T20:59:06Z

Which issue does this PR close?

Closes Incorrect behavior of parse_decimal with zero scale #8699.

Alternative to #8702 (see #8699 for a discussion).

Rationale for this change

Make parsing of zero-scale decimals more correct/consistent.

What changes are included in this PR?

When a decimal with scale 0 is being parsed, but it has some digits pass the decimal point those will effectively be discarded.

Are these changes tested?

Yes.

Are there any user-facing changes?

Aligned parsing of zero-scale decimals.

arrow-cast/src/parse.rs

martin-g · 2025-10-24T11:26:57Z

arrow-cast/src/parse.rs

+        if scale == 0 && fractionals > 0 {
+            // The input string contained some fractional digits after the decimal point despite
+            // the scale being zero. Eject all the fractional digits from the number.
+            result = result.div_wrapping(base.pow_wrapping(fractionals as _));


Would this be correct if there are many fractional digits, e.g. 20+ ?!

I think it would, added another test case on that note (("123.000000000000000000004", 123)).

Jefffrey

It would be good if we could update the docstring of this function to document this expected behaviour

Jefffrey · 2025-12-26T23:46:22Z

arrow-cast/src/parse.rs

        }
    }

    if !is_e_notation {


Could we add a test case with e notation, for a sanity check?

Yup, added a couple of e-notation test cases now

("0.123e2", 12, 3), ("0.123e4", 1230, 10), ("1.23e4", 12300, 10), ("12.3e4", 123000, 10), ("123e4", 1230000, 10),

chenhao-db · 2025-12-28T00:40:33Z

The current implementation has an overflow issue. You can test with the following code:

let result = parse_decimal::<Decimal128Type>("20000000000000000000000000000000000002", 38, 0).unwrap();
assert_eq!(result, 20000000000000000000000000000000000002i128);
let result = parse_decimal::<Decimal128Type>("20000000000000000000000000000000000002.0", 38, 0).unwrap();
assert_eq!(result, 20000000000000000000000000000000000002i128);

The first assert will pass. The second assert will fail:

assertion `left == right` failed
  left: -14028236692093846346337460743176821143
 right: 20000000000000000000000000000000000002

There is apparently an interger overflow.

gruuya · 2025-12-29T15:48:14Z

The current implementation has an overflow issue. You can test with the following code:

let result = parse_decimal::<Decimal128Type>("20000000000000000000000000000000000002", 38, 0).unwrap();
assert_eq!(result, 20000000000000000000000000000000000002i128);
let result = parse_decimal::<Decimal128Type>("20000000000000000000000000000000000002.0", 38, 0).unwrap();
assert_eq!(result, 20000000000000000000000000000000000002i128);

The first assert will pass. The second assert will fail:

assertion `left == right` failed
  left: -14028236692093846346337460743176821143
 right: 20000000000000000000000000000000000002

There is apparently an interger overflow.

Great catch, thanks!

I revised the implementation a bit, and tried a drive-by parse_e_notation simplification. We won't hit the integer overflow in this case anymore (added a test for it too), since we now avoid the redundant result = result.mul_wrapping(base) unless we hit an e-notation (in which case we need to do it since we don;t know ahead of time what are the actual non-fractional digits then).

gruuya · 2025-12-29T16:05:14Z

That said, the following (valid) example will lead to an overflow again

let result_128 = parse_decimal::<Decimal128Type>("20000000000000000000000000000000000002.0e-1", 38, 0).unwrap();
assert_eq!(result_128, 20000000000000000000000000000000000000);

assertion `left == right` failed
  left: -1402823669209384634633746074317682114
 right: 20000000000000000000000000000000000000

In fact this also happens with scale 1, and it happens on main as well (so this PR isn't introducing it).

The main problem there is the way e-notation is parsed—by the time we hit e-notation processing we've already applied all the significant digits to the result without checking whether some of them are not needed (e.g. because of the negative exponent moving the decimal point to the left). Hence we may have overflowed by the point we want to do div_wrapping.

To fix this i think we need to s.split_once(['e', 'E']) first, then optionally process the exponent, and only then move on to the mantissa, since then we know exactly what numbers can be thrown away. This however goes beyond the scope of this PR/issue I think, and probably requires more discussion (especially around performance implications).

Jefffrey

Apologies it took so long to re-review, it seems the whole fix has changed in terms of code and I'm not as familiar with this part of the codebase so it was hard to separate what was a refactor vs what was the actual fix code (would greatly appreciate it if there were some pointer comments to this effect).

It would be good if we could update the docstring of this function to document this expected behaviour

Just bumping this previous suggestion.

This however goes beyond the scope of this PR/issue I think, and probably requires more discussion (especially around performance implications).

It would be great to raise an issue for this 👍

Jefffrey · 2026-01-09T14:13:42Z

arrow-cast/src/parse.rs

+    let mut bs = s.as_bytes().iter().skip(index + fractionals as usize);

-    while let Some(b) = bs.next() {
+    // complete parsing of any unprocessed fractionals up to the exponent


I must admit I'm confused by this line; above we state we skip to the exponent or after any processed fractionals, but here we say we parse unprocessed fractionals somehow?

After checking the code it seems it corresponds to this check:

https://github.com/splitgraph/arrow-rs/blob/3dbfe0d16447d4344ef1f6caeb2befddc0e91aed/arrow-cast/src/parse.rs#L900-L905

Where we previously skipped fractionals for exceeding the scale, but now we include them since we parse the exponent. I feel we should capture this detail here.

That is correct—this is indeed (more than) slightly confusing, and it stems from unclear separation of concerns. We skip parsing any fractionals after scale in parse_decimal, unless we hit into the e-notation and realise we may actually need those fractionals after all, in which case we parse them in parse_e_notation.

This is also one more argument for going with the above suggestion:

To fix this i think we need to s.split_once(['e', 'E']) first, then optionally process the exponent, and only then move on to the mantissa, since then we know exactly what numbers can be thrown away. This however goes beyond the scope of this PR/issue I think, and probably requires more discussion (especially around performance implications).

Besides fixing another category of edge cases, this would also simplify the logic. I'm only worried about performance though, since that would change the current one-pass into a two-pass algorithm.

For now I'll try to elaborate in the comment a bit more. I'll also open an issue for a cleaner parse_decimal/parse_e_notation separation.

Here's the issue capturing the broader problems of parse_decimal/parse_e_notation: #9170

Jefffrey · 2026-01-09T14:15:18Z

arrow-cast/src/parse.rs

-        bs = s.as_bytes().iter().skip(index);
-    }
+    // skip to the exponent index directly or just after any processed fractionals
+    let mut bs = s.as_bytes().iter().skip(index + fractionals as usize);


If we're refactoring perhaps we should change the type of fractions to be unsigned? It was a little confusing to have it be signed to me as I had to verify that your refactor to collapse the if/else made sense since fractionals was never negative (and the + 1 in the if block is accounted for by adjusting index at the callsite)

That makes sense, though this leads to some cascading casting and type changes, as we use fractionals to compute/compare with other variables, notably with scale which is i8.

So we can then make scale an unsigned type too, since it looks like negative scale is not supported right now.

However I'd rather retain the possibility of supporting negative scales at some point, so I went with changing the scale type to i16 instead. That way fractionals: u8 can be safely cast to i16 for comparison/arithmetic (plus parse_e_notation already expects scale to be i16).

Hmm, scale being i8 is more widespread than I thought, I reverted the commit that changes it.

I don't think we can seamlessly change fractionals to an unsigned type then.

alamb · 2026-01-10T14:08:44Z

I tested this change with the current datafusion tests (I used apache/datafusion#19728 (review)) and all the tests passed for what it is worth

Co-authored-by: Martin Grigorov <[email protected]>

alamb

Thanks @gruuya -- this is looking good

alamb · 2026-01-14T22:31:10Z

arrow-cast/src/parse.rs

-/// The result value can't be out of bounds.
+/// Expected behavior:
+/// - the result value can't be out of bounds.
+/// - when parsing a decimal with scale 0, any digits pass the decimal point will be discarded


I think for e-notation the digits after the decimal do affect the value (e.g., "1.23e4" → 12300) as shown in the tests. Can we clarify this slightly?

Good point, let me re-word this.

alamb · 2026-01-14T22:31:54Z

arrow-cast/src/parse.rs

+        for (s, i, precision) in zero_scale_tests {
+            let result_128 = parse_decimal::<Decimal128Type>(s, precision, 0).unwrap();
+            assert_eq!(i, result_128);
+        }


Could you also add some error tests, for example

.

blag

Should both fail to parse

Also how about these values? Should they be valid?

"1e", "1e+", "1e-", ".", and ".123"

Thanks, these cases turned out to be very valuable.

For one thing "." case wasn't being correctly handled for zero scale decimals, I fixed that now by factoring out various validation points throughout the code into a common unsupported check at the start

if matches!(s, "" | "-" | "+" | ".") { return Err(ArrowError::ParseError(format!( "can't parse the string value {s} to decimal" ))); }

I found this is a lot easier to grok, since previously these validations were scattered throughout parse_decimal and deduced implicitly based on side-effects (e.g. we finished parsing a decimal with a decimal point but the digit count is zero). Let me know if you have performance concerns about this, and I can resort to the old (implicit) manner on deducing unsupported strings.

For another thing, the test cases "1e", "1e+" and "1e-" are curious. Strictly speaking they aren't valid, but I wouldn't be surprised if some system can parse/generate these. The question is what should parse_decimal output for them. I tested these on both main and this branch:

| input | output(main) | output(this branch) | |------------------------------------------------|-------------------------------------------------------------------|---------------------| | parse_decimal::<Decimal128Type>("1e", 5, 0) | Err(ParseError("can't parse the string value 1e to decimal")) | Ok(1) | | parse_decimal::<Decimal128Type>("1e+", 5, 0) | Ok(1) | Ok(1) | | parse_decimal::<Decimal128Type>("1e-", 5, 0) | Ok(1) | Ok(1) | | parse_decimal::<Decimal128Type>("1e", 5, 2) | Err(ParseError("can't parse the string value 1e to decimal")) | Ok(100) | | parse_decimal::<Decimal128Type>("1e+", 5, 2) | Ok(100) | Ok(100) | | parse_decimal::<Decimal128Type>("1e-", 5, 2) | Ok(100) | Ok(100) |

To make these consistent I think we should error out on all of them (that's what Postgres does), at least until there's a worthwhile precedent to parse some of these.

So if the matches! approach from above is fine with you, and if you agree we should make these unsupported, then I can just append these patterns there. Otherwise, let me know which ones would you think it'd make sense to parse.

So if the matches! approach from above is fine with you, and if you agree we should make these unsupported, then I can just append these patterns there. Otherwise, let me know which ones would you think it'd make sense to parse.

I think that sounds good to me. I tested some other systems and they also can't parse these examples

D select '1e'::decimal; Conversion Error: Could not convert string "1e" to DECIMAL(18,3) LINE 1: select '1e'::decimal;

spark also doesn't parse them

spark-sql (default)> SELECT cast ('1' as DECIMAL(5,3)); 1.000 Time taken: 0.048 seconds, Fetched 1 row(s) spark-sql (default)> SELECT cast ('1e' as DECIMAL(5,3)); NULL Time taken: 0.052 seconds, Fetched 1 row(s) spark-sql (default)> SELECT cast ('e' as DECIMAL(5,3)); NULL Time taken: 0.034 seconds, Fetched 1 row(s) spark-sql (default)> SELECT cast ('1e+' as DECIMAL(5,3)); NULL Time taken: 0.033 seconds, Fetched 1 row(s) spark-sql (default)> SELECT cast ('1e' as DECIMAL(5,0)); NULL Time taken: 0.038 seconds, Fetched 1 row(s)

Nice, I've appended these cases to unsupported check and test cases.

alamb · 2026-01-15T16:44:27Z

Spark 4 throws an error

spark-sql (default)> SELECT cast ('1e' as DECIMAL(5,0));
[CAST_INVALID_INPUT] The value '1e' of the type "STRING" cannot be cast to "DECIMAL(5,0)" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. SQLSTATE: 22018
== SQL (line 1, position 8) ==
SELECT cast ('1e' as DECIMAL(5,0))
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Jefffrey · 2026-01-16T01:36:21Z

By chance would this PR also address

Parsing Decimal With Negative Exponent And Zero Scale Can Overflow #5762

Or is it a different cause?

gruuya · 2026-01-16T06:28:08Z

By chance would this PR also address

* [Parsing Decimal With Negative Exponent And Zero Scale Can Overflow #5762](https://github.com/apache/arrow-rs/issues/5762)

Or is it a different cause?

No, actually this is the same problem I laid out here #8700 (comment).

And as mentioned one potential solution for it is #9170, where we could avoid redundant multiplication by base factor, followed immediately by division when we parse the e-notation (since then we'd know the exponent in advance, and could stop processing digits on time to avoid the overflow).

Jefffrey · 2026-01-16T07:23:00Z

By chance would this PR also address
* [Parsing Decimal With Negative Exponent And Zero Scale Can Overflow #5762](https://github.com/apache/arrow-rs/issues/5762)
Or is it a different cause?
No, actually this is the same problem I laid out here #8700 (comment).

And as mentioned one potential solution for it is #9170, where we could avoid redundant multiplication by base factor, followed immediately by division when we parse the e-notation (since then we'd know the exponent in advance, and could stop processing digits on time to avoid the overflow).

Thanks for checking, I'll update the issue to link these two.

alamb · 2026-01-17T16:18:51Z

arrow-cast/src/parse.rs

    let mut digits: u8 = 0;
    let base = T::Native::usize_as(10);

+    if matches!(s, "" | "-" | "+" | "." | "1e" | "1e+" | "1e-") {


this feels like an oddly specific list of error conditions to check -- It seems like it should be checking that s has only digits and is not empty?

github-actions bot added the arrow Changes to the arrow crate label Oct 23, 2025

This was referenced Oct 23, 2025

Incorrect behavior of parse_decimal with zero scale #8699

Open

fix: strict parsing of decimals with zero scale #8702

Closed

coderabbitai bot mentioned this pull request Oct 24, 2025

8700: fix: liberal parsing of zero scale decimals martin-augment/arrow-rs#4

Open

martin-g reviewed Oct 24, 2025

View reviewed changes

gruuya force-pushed the fix-zero-scale-decimal-parse branch from 95a0e5a to 124441b Compare October 24, 2025 13:34

Jefffrey reviewed Dec 26, 2025

View reviewed changes

gruuya force-pushed the fix-zero-scale-decimal-parse branch 3 times, most recently from fc36673 to 3dbfe0d Compare December 29, 2025 15:35

Jefffrey reviewed Jan 9, 2026

View reviewed changes

gruuya and others added 4 commits January 14, 2026 09:07

fix: liberal parsing of zero scale decimals

780ae35

Update arrow-cast/src/parse.rs

fef1ddd

Co-authored-by: Martin Grigorov <[email protected]>

fix: extend test cases

3a5d0dc

docs: improve docstring and comment

f4f3ff5

gruuya force-pushed the fix-zero-scale-decimal-parse branch from 3dbfe0d to f4f3ff5 Compare January 14, 2026 09:07

gruuya mentioned this pull request Jan 14, 2026

[arrow-cast] consider simplifying parse_decimal and parse_e_notation #9170

Open

gruuya force-pushed the fix-zero-scale-decimal-parse branch 2 times, most recently from c30841a to f4f3ff5 Compare January 14, 2026 10:13

Jefffrey approved these changes Jan 14, 2026

View reviewed changes

alamb reviewed Jan 14, 2026

View reviewed changes

test: add error asserts for zero scale decimal parsing

3216e33

gruuya force-pushed the fix-zero-scale-decimal-parse branch from 24a56b9 to 3216e33 Compare January 16, 2026 06:24

alamb reviewed Jan 17, 2026

View reviewed changes

fix: liberal parsing of zero scale decimals #8700

Are you sure you want to change the base?

fix: liberal parsing of zero scale decimals #8700

Conversation

gruuya commented Oct 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Jefffrey left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chenhao-db commented Dec 28, 2025

Uh oh!

gruuya commented Dec 29, 2025

Uh oh!

gruuya commented Dec 29, 2025

Uh oh!

Jefffrey left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Jan 10, 2026

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb Jan 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gruuya Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Jan 15, 2026

Uh oh!

Jefffrey commented Jan 16, 2026

Uh oh!

gruuya commented Jan 16, 2026

Uh oh!

Jefffrey commented Jan 16, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

gruuya commented Oct 23, 2025 •

edited

Loading

alamb Jan 14, 2026 •

edited

Loading

gruuya Jan 15, 2026 •

edited

Loading

alamb Jan 15, 2026 •

edited

Loading