Skip to content

Conversation

@gruuya
Copy link
Contributor

@gruuya gruuya commented Oct 23, 2025

Which issue does this PR close?

Alternative to #8702 (see #8699 for a discussion).

Rationale for this change

Make parsing of zero-scale decimals more correct/consistent.

What changes are included in this PR?

When a decimal with scale 0 is being parsed, but it has some digits pass the decimal point those will effectively be discarded.

Are these changes tested?

Yes.

Are there any user-facing changes?

Aligned parsing of zero-scale decimals.

if scale == 0 && fractionals > 0 {
// The input string contained some fractional digits after the decimal point despite
// the scale being zero. Eject all the fractional digits from the number.
result = result.div_wrapping(base.pow_wrapping(fractionals as _));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be correct if there are many fractional digits, e.g. 20+ ?!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would, added another test case on that note (("123.000000000000000000004", 123)).

@gruuya gruuya force-pushed the fix-zero-scale-decimal-parse branch from 95a0e5a to 124441b Compare October 24, 2025 13:34
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good if we could update the docstring of this function to document this expected behaviour

}
}

if !is_e_notation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a test case with e notation, for a sanity check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, added a couple of e-notation test cases now

            ("0.123e2", 12, 3),
            ("0.123e4", 1230, 10),
            ("1.23e4", 12300, 10),
            ("12.3e4", 123000, 10),
            ("123e4", 1230000, 10),

@chenhao-db
Copy link

The current implementation has an overflow issue. You can test with the following code:

let result = parse_decimal::<Decimal128Type>("20000000000000000000000000000000000002", 38, 0).unwrap();
assert_eq!(result, 20000000000000000000000000000000000002i128);
let result = parse_decimal::<Decimal128Type>("20000000000000000000000000000000000002.0", 38, 0).unwrap();
assert_eq!(result, 20000000000000000000000000000000000002i128);

The first assert will pass. The second assert will fail:

assertion `left == right` failed
  left: -14028236692093846346337460743176821143
 right: 20000000000000000000000000000000000002

There is apparently an interger overflow.

@gruuya gruuya force-pushed the fix-zero-scale-decimal-parse branch 3 times, most recently from fc36673 to 3dbfe0d Compare December 29, 2025 15:35
@gruuya
Copy link
Contributor Author

gruuya commented Dec 29, 2025

The current implementation has an overflow issue. You can test with the following code:

let result = parse_decimal::<Decimal128Type>("20000000000000000000000000000000000002", 38, 0).unwrap();
assert_eq!(result, 20000000000000000000000000000000000002i128);
let result = parse_decimal::<Decimal128Type>("20000000000000000000000000000000000002.0", 38, 0).unwrap();
assert_eq!(result, 20000000000000000000000000000000000002i128);

The first assert will pass. The second assert will fail:

assertion `left == right` failed
  left: -14028236692093846346337460743176821143
 right: 20000000000000000000000000000000000002

There is apparently an interger overflow.

Great catch, thanks!

I revised the implementation a bit, and tried a drive-by parse_e_notation simplification. We won't hit the integer overflow in this case anymore (added a test for it too), since we now avoid the redundant result = result.mul_wrapping(base) unless we hit an e-notation (in which case we need to do it since we don;t know ahead of time what are the actual non-fractional digits then).

@gruuya
Copy link
Contributor Author

gruuya commented Dec 29, 2025

That said, the following (valid) example will lead to an overflow again

let result_128 = parse_decimal::<Decimal128Type>("20000000000000000000000000000000000002.0e-1", 38, 0).unwrap();
assert_eq!(result_128, 20000000000000000000000000000000000000);

assertion `left == right` failed
  left: -1402823669209384634633746074317682114
 right: 20000000000000000000000000000000000000

In fact this also happens with scale 1, and it happens on main as well (so this PR isn't introducing it).

The main problem there is the way e-notation is parsed—by the time we hit e-notation processing we've already applied all the significant digits to the result without checking whether some of them are not needed (e.g. because of the negative exponent moving the decimal point to the left). Hence we may have overflowed by the point we want to do div_wrapping.

To fix this i think we need to s.split_once(['e', 'E']) first, then optionally process the exponent, and only then move on to the mantissa, since then we know exactly what numbers can be thrown away. This however goes beyond the scope of this PR/issue I think, and probably requires more discussion (especially around performance implications).

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies it took so long to re-review, it seems the whole fix has changed in terms of code and I'm not as familiar with this part of the codebase so it was hard to separate what was a refactor vs what was the actual fix code (would greatly appreciate it if there were some pointer comments to this effect).


It would be good if we could update the docstring of this function to document this expected behaviour

Just bumping this previous suggestion.


This however goes beyond the scope of this PR/issue I think, and probably requires more discussion (especially around performance implications).

It would be great to raise an issue for this 👍

let mut bs = s.as_bytes().iter().skip(index + fractionals as usize);

while let Some(b) = bs.next() {
// complete parsing of any unprocessed fractionals up to the exponent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must admit I'm confused by this line; above we state we skip to the exponent or after any processed fractionals, but here we say we parse unprocessed fractionals somehow?

After checking the code it seems it corresponds to this check:

https://github.com/splitgraph/arrow-rs/blob/3dbfe0d16447d4344ef1f6caeb2befddc0e91aed/arrow-cast/src/parse.rs#L900-L905

Where we previously skipped fractionals for exceeding the scale, but now we include them since we parse the exponent. I feel we should capture this detail here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct—this is indeed (more than) slightly confusing, and it stems from unclear separation of concerns. We skip parsing any fractionals after scale in parse_decimal, unless we hit into the e-notation and realise we may actually need those fractionals after all, in which case we parse them in parse_e_notation.

This is also one more argument for going with the above suggestion:

To fix this i think we need to s.split_once(['e', 'E']) first, then optionally process the exponent, and only then move on to the mantissa, since then we know exactly what numbers can be thrown away. This however goes beyond the scope of this PR/issue I think, and probably requires more discussion (especially around performance implications).

Besides fixing another category of edge cases, this would also simplify the logic. I'm only worried about performance though, since that would change the current one-pass into a two-pass algorithm.

For now I'll try to elaborate in the comment a bit more. I'll also open an issue for a cleaner parse_decimal/parse_e_notation separation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the issue capturing the broader problems of parse_decimal/parse_e_notation: #9170

bs = s.as_bytes().iter().skip(index);
}
// skip to the exponent index directly or just after any processed fractionals
let mut bs = s.as_bytes().iter().skip(index + fractionals as usize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're refactoring perhaps we should change the type of fractions to be unsigned? It was a little confusing to have it be signed to me as I had to verify that your refactor to collapse the if/else made sense since fractionals was never negative (and the + 1 in the if block is accounted for by adjusting index at the callsite)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, though this leads to some cascading casting and type changes, as we use fractionals to compute/compare with other variables, notably with scale which is i8.

So we can then make scale an unsigned type too, since it looks like negative scale is not supported right now.

However I'd rather retain the possibility of supporting negative scales at some point, so I went with changing the scale type to i16 instead. That way fractionals: u8 can be safely cast to i16 for comparison/arithmetic (plus parse_e_notation already expects scale to be i16).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, scale being i8 is more widespread than I thought, I reverted the commit that changes it.

I don't think we can seamlessly change fractionals to an unsigned type then.

@alamb
Copy link
Contributor

alamb commented Jan 10, 2026

I tested this change with the current datafusion tests (I used apache/datafusion#19728 (review)) and all the tests passed for what it is worth

@gruuya gruuya force-pushed the fix-zero-scale-decimal-parse branch from 3dbfe0d to f4f3ff5 Compare January 14, 2026 09:07
@gruuya gruuya force-pushed the fix-zero-scale-decimal-parse branch 2 times, most recently from c30841a to f4f3ff5 Compare January 14, 2026 10:13
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gruuya -- this is looking good

/// The result value can't be out of bounds.
/// Expected behavior:
/// - the result value can't be out of bounds.
/// - when parsing a decimal with scale 0, any digits pass the decimal point will be discarded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for e-notation the digits after the decimal do affect the value (e.g., "1.23e4" → 12300) as shown in the tests. Can we clarify this slightly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, let me re-word this.

for (s, i, precision) in zero_scale_tests {
let result_128 = parse_decimal::<Decimal128Type>(s, precision, 0).unwrap();
assert_eq!(i, result_128);
}
Copy link
Contributor

@alamb alamb Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add some error tests, for example

  • .
  • blag

Should both fail to parse

Also how about these values? Should they be valid?

"1e", "1e+", "1e-", ".", and ".123"

Copy link
Contributor Author

@gruuya gruuya Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, these cases turned out to be very valuable.

For one thing "." case wasn't being correctly handled for zero scale decimals, I fixed that now by factoring out various validation points throughout the code into a common unsupported check at the start

    if matches!(s, "" | "-" | "+" | ".") {
        return Err(ArrowError::ParseError(format!(
            "can't parse the string value {s} to decimal"
        )));
    }

I found this is a lot easier to grok, since previously these validations were scattered throughout parse_decimal and deduced implicitly based on side-effects (e.g. we finished parsing a decimal with a decimal point but the digit count is zero). Let me know if you have performance concerns about this, and I can resort to the old (implicit) manner on deducing unsupported strings.

For another thing, the test cases "1e", "1e+" and "1e-" are curious. Strictly speaking they aren't valid, but I wouldn't be surprised if some system can parse/generate these. The question is what should parse_decimal output for them. I tested these on both main and this branch:

| input                                          | output(main)                                                      | output(this branch) |
|------------------------------------------------|-------------------------------------------------------------------|---------------------|
| parse_decimal::<Decimal128Type>("1e", 5, 0)    | Err(ParseError("can't parse the string value 1e to decimal"))     | Ok(1)               |
| parse_decimal::<Decimal128Type>("1e+", 5, 0)   | Ok(1)                                                             | Ok(1)               |
| parse_decimal::<Decimal128Type>("1e-", 5, 0)   | Ok(1)                                                             | Ok(1)               |
| parse_decimal::<Decimal128Type>("1e", 5, 2)    | Err(ParseError("can't parse the string value 1e to decimal"))     | Ok(100)             |
| parse_decimal::<Decimal128Type>("1e+", 5, 2)   | Ok(100)                                                           | Ok(100)             |
| parse_decimal::<Decimal128Type>("1e-", 5, 2)   | Ok(100)                                                           | Ok(100)             |

To make these consistent I think we should error out on all of them (that's what Postgres does), at least until there's a worthwhile precedent to parse some of these.

So if the matches! approach from above is fine with you, and if you agree we should make these unsupported, then I can just append these patterns there. Otherwise, let me know which ones would you think it'd make sense to parse.

Copy link
Contributor

@alamb alamb Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the matches! approach from above is fine with you, and if you agree we should make these unsupported, then I can just append these patterns there. Otherwise, let me know which ones would you think it'd make sense to parse.

I think that sounds good to me. I tested some other systems and they also can't parse these examples

D select '1e'::decimal;
Conversion Error:
Could not convert string "1e" to DECIMAL(18,3)

LINE 1: select '1e'::decimal;

spark also doesn't parse them

spark-sql (default)> SELECT cast ('1' as DECIMAL(5,3));
1.000
Time taken: 0.048 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT cast ('1e' as DECIMAL(5,3));
NULL
Time taken: 0.052 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT cast ('e' as DECIMAL(5,3));
NULL
Time taken: 0.034 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT cast ('1e+' as DECIMAL(5,3));
NULL
Time taken: 0.033 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT cast ('1e' as DECIMAL(5,0));
NULL
Time taken: 0.038 seconds, Fetched 1 row(s)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I've appended these cases to unsupported check and test cases.

@alamb
Copy link
Contributor

alamb commented Jan 15, 2026

Spark 4 throws an error

spark-sql (default)> SELECT cast ('1e' as DECIMAL(5,0));
[CAST_INVALID_INPUT] The value '1e' of the type "STRING" cannot be cast to "DECIMAL(5,0)" because it is malformed. Correct the value as per the syntax, or change its target type. Use `try_cast` to tolerate malformed input and return NULL instead. SQLSTATE: 22018
== SQL (line 1, position 8) ==
SELECT cast ('1e' as DECIMAL(5,0))
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^

@Jefffrey
Copy link
Contributor

By chance would this PR also address

Or is it a different cause?

@gruuya gruuya force-pushed the fix-zero-scale-decimal-parse branch from 24a56b9 to 3216e33 Compare January 16, 2026 06:24
@gruuya
Copy link
Contributor Author

gruuya commented Jan 16, 2026

By chance would this PR also address

* [Parsing Decimal With Negative Exponent And Zero Scale Can Overflow #5762](https://github.com/apache/arrow-rs/issues/5762)

Or is it a different cause?

No, actually this is the same problem I laid out here #8700 (comment).

And as mentioned one potential solution for it is #9170, where we could avoid redundant multiplication by base factor, followed immediately by division when we parse the e-notation (since then we'd know the exponent in advance, and could stop processing digits on time to avoid the overflow).

@Jefffrey
Copy link
Contributor

By chance would this PR also address

* [Parsing Decimal With Negative Exponent And Zero Scale Can Overflow #5762](https://github.com/apache/arrow-rs/issues/5762)

Or is it a different cause?

No, actually this is the same problem I laid out here #8700 (comment).

And as mentioned one potential solution for it is #9170, where we could avoid redundant multiplication by base factor, followed immediately by division when we parse the e-notation (since then we'd know the exponent in advance, and could stop processing digits on time to avoid the overflow).

Thanks for checking, I'll update the issue to link these two.

let mut digits: u8 = 0;
let base = T::Native::usize_as(10);

if matches!(s, "" | "-" | "+" | "." | "1e" | "1e+" | "1e-") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels like an oddly specific list of error conditions to check -- It seems like it should be checking that s has only digits and is not empty?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect behavior of parse_decimal with zero scale

5 participants