Retain precision when casting JSON number to VARCHAR #28917

Draft
findepi wants to merge 6 commits into trinodb:master from findepi:findepi/retain-precision-when-casting-json-number-to-varchar-806435

findepi commented Mar 30, 2026

overview

Normalize JSON numeric values using Java BigDecimal in the JSON to VARCHAR cast. Some examples:

1.34e1 -> 13.4
1.34567890123e1 -> 13.4567890123
1.34567890123e8 -> 134567890.123
1.34567890123e11 -> 134567890123
1.34567890123e12 -> 1.34567890123E+12

0.000000000000000 -> 0.0
0e1000 -> 0.0
0e-1000 -> 0.0

1 -> 1
100000000000000000000000000000000000000000000000000000000000000000000e-68 -> 1.0
0.100000000000000 -> 0.1
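
The mapping above can be approximated with a small sketch. It is not the code from this PR: the class and method names are hypothetical, and the rules (pass integer tokens through, print zero as `0.0`, strip trailing fractional zeros but keep at least one fractional digit, otherwise rely on `BigDecimal.toString()`) are only inferred from the listed examples.

```java
import java.math.BigDecimal;

// Hypothetical helper, inferred from the examples in the PR description.
final class JsonNumberNormalizer {
    private JsonNumberNormalizer() {}

    static String normalize(String jsonNumber) {
        // Assumption: tokens without a fraction or exponent are integers
        // and are returned unchanged (for example "1" stays "1").
        if (jsonNumber.indexOf('.') < 0
                && jsonNumber.indexOf('e') < 0
                && jsonNumber.indexOf('E') < 0) {
            return jsonNumber;
        }
        BigDecimal value = new BigDecimal(jsonNumber);
        // Assumption: every spelling of zero collapses to "0.0".
        if (value.signum() == 0) {
            return "0.0";
        }
        if (value.scale() > 0) {
            // Drop trailing fractional zeros, but keep one fractional digit
            // so the value still reads as a non-integer number.
            value = value.stripTrailingZeros();
            if (value.scale() < 1) {
                value = value.setScale(1);
            }
        }
        // BigDecimal.toString() switches to scientific notation when the
        // scale is negative, which yields forms such as 1.34567890123E+12.
        return value.toString();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "1.34567890123e8", "1.34567890123e11", "1.34567890123e12",
            "0e1000", "0.100000000000000", "1"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + normalize(input));
        }
    }
}
```

Whether small magnitudes such as `1e-10` should print in plain or scientific form is not covered by the examples, so the sketch simply keeps whatever `toString()` produces.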

related

.

release notes

General
* Improve precision when casting JSON numbers with decimal point to VARCHAR. #28881 

@cla-bot cla-bot bot added the cla-signed label Mar 30, 2026
@findepi findepi marked this pull request as draft March 30, 2026 07:51
@findepi findepi force-pushed the findepi/retain-precision-when-casting-json-number-to-varchar-806435 branch from d45a3c0 to ccc9ea0 Compare March 30, 2026 11:12

findepi commented Mar 30, 2026

Added a benchmark. Results follow.

Before

Benchmark                                               (jsonType)  (varcharLength)  Mode  Cnt     Score    Error  Units
BenchmarkJsonOperators.benchmarkCastToVarchar         STRING_SHORT            10000  avgt   30    57.470 ±  0.752  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar         STRING_SHORT       2147483647  avgt   30    56.193 ±  1.297  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar        STRING_MEDIUM            10000  avgt   30   176.703 ±  3.310  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar        STRING_MEDIUM       2147483647  avgt   30   174.852 ±  2.693  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar          STRING_LONG            10000  avgt   30  1098.212 ± 20.769  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar          STRING_LONG       2147483647  avgt   30  1101.451 ± 24.898  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar  STRING_WITH_UNICODE            10000  avgt   30    91.809 ±  0.299  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar  STRING_WITH_UNICODE       2147483647  avgt   30    91.903 ±  0.293  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar       NUMBER_INTEGER            10000  avgt   30    62.553 ±  0.780  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar       NUMBER_INTEGER       2147483647  avgt   30    61.504 ±  0.534  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar       NUMBER_DECIMAL            10000  avgt   30   235.296 ±  1.065  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar       NUMBER_DECIMAL       2147483647  avgt   30   236.338 ±  0.889  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar    NUMBER_SCIENTIFIC            10000  avgt   30   178.933 ±  0.826  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar    NUMBER_SCIENTIFIC       2147483647  avgt   30   179.991 ±  1.060  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar         BOOLEAN_TRUE            10000  avgt   30    38.901 ±  0.272  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar         BOOLEAN_TRUE       2147483647  avgt   30    39.677 ±  1.132  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar        BOOLEAN_FALSE            10000  avgt   30    41.334 ±  1.634  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar        BOOLEAN_FALSE       2147483647  avgt   30    39.803 ±  0.112  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar                 NULL            10000  avgt   30    37.529 ±  0.102  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar                 NULL       2147483647  avgt   30    37.812 ±  0.358  ns/op

After

Benchmark                                               (jsonType)  (varcharLength)  Mode  Cnt     Score    Error  Units
BenchmarkJsonOperators.benchmarkCastToVarchar         STRING_SHORT            10000  avgt   30    60.485 ±  4.568  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar         STRING_SHORT       2147483647  avgt   30    55.325 ±  0.272  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar        STRING_MEDIUM            10000  avgt   30   174.716 ±  2.581  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar        STRING_MEDIUM       2147483647  avgt   30   173.241 ±  2.266  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar          STRING_LONG            10000  avgt   30  1115.777 ± 25.443  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar          STRING_LONG       2147483647  avgt   30  1106.917 ± 22.895  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar  STRING_WITH_UNICODE            10000  avgt   30    91.915 ±  0.193  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar  STRING_WITH_UNICODE       2147483647  avgt   30    91.778 ±  0.419  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar       NUMBER_INTEGER            10000  avgt   30    68.107 ±  0.756  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar       NUMBER_INTEGER       2147483647  avgt   30    68.553 ±  0.301  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar       NUMBER_DECIMAL            10000  avgt   30    95.879 ±  0.436  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar       NUMBER_DECIMAL       2147483647  avgt   30    96.344 ±  0.789  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar    NUMBER_SCIENTIFIC            10000  avgt   30    82.827 ±  0.582  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar    NUMBER_SCIENTIFIC       2147483647  avgt   30    82.897 ±  0.924  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar         BOOLEAN_TRUE            10000  avgt   30    38.935 ±  0.156  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar         BOOLEAN_TRUE       2147483647  avgt   30    38.784 ±  0.201  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar        BOOLEAN_FALSE            10000  avgt   30    42.150 ±  0.896  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar        BOOLEAN_FALSE       2147483647  avgt   30    39.383 ±  0.304  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar                 NULL            10000  avgt   30    37.661 ±  0.102  ns/op
BenchmarkJsonOperators.benchmarkCastToVarchar                 NULL       2147483647  avgt   30    37.562 ±  0.187  ns/op

https://jmh.morethan.io/?sources=https://gist.githubusercontent.com/findepi/3eab5435b181fe7df19ffb0a952daa95/raw/7114825186c5a172ef472b9724c87358af7c4873/castToVarchar.01.before.json,https://gist.githubusercontent.com/findepi/3eab5435b181fe7df19ffb0a952daa95/raw/7114825186c5a172ef472b9724c87358af7c4873/castToVarchar.02.after.json


findepi commented Mar 30, 2026

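For some reason, the benchmark shows a performance improvement for the affected cases: NUMBER_DECIMAL drops from about 235 ns/op to about 96 ns/op, and NUMBER_SCIENTIFIC from about 179 ns/op to about 83 ns/op.

I think these results address @dain's concern (#28882 (comment)).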

@findepi findepi requested review from dain, losipiuk and wendigo March 30, 2026 11:17
@findepi findepi marked this pull request as ready for review March 30, 2026 11:17
@findepi findepi force-pushed the findepi/retain-precision-when-casting-json-number-to-varchar-806435 branch 3 times, most recently from 6df9050 to 6c10312 Compare March 30, 2026 11:56
@findepi findepi marked this pull request as draft March 30, 2026 16:07

findepi commented Mar 30, 2026

The current implementation suffers from the problem that cast(JSON '...' as ARRAY(VARCHAR)) yields different results than cast(json_parse('...') as ARRAY(VARCHAR)).

findepi added 4 commits March 30, 2026 18:26
- add test cases with numbers with leading/trailing zeros.
- verify that casting string -> JSON -> array(varchar) and an optimized
  path behave the same.
Before the change, when casting a JSON number containing a decimal point
to VARCHAR, the number was first converted to DOUBLE, which resulted
in an unnecessary loss of information.
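
The information loss described in this commit message can be reproduced outside Trino with a minimal sketch. The class and method names are hypothetical, and `Double.toString` only stands in for whatever formatting the old cast path used; the point is that a `double` cannot hold more than about 17 significant decimal digits, while `BigDecimal` keeps every digit of the input.

```java
import java.math.BigDecimal;

// Hypothetical demo: compares a round trip through double with one
// through BigDecimal for a number with many significant digits.
final class DoubleRoundTripDemo {
    private DoubleRoundTripDemo() {}

    // Parse as a binary floating-point value and print it again.
    static String viaDouble(String jsonNumber) {
        return Double.toString(Double.parseDouble(jsonNumber));
    }

    // Parse as an arbitrary-precision decimal and print it again.
    static String viaBigDecimal(String jsonNumber) {
        return new BigDecimal(jsonNumber).toString();
    }

    public static void main(String[] args) {
        String input = "0.12345678901234567890123";
        System.out.println("input:      " + input);
        System.out.println("double:     " + viaDouble(input));
        System.out.println("BigDecimal: " + viaBigDecimal(input));
    }
}
```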
@findepi findepi force-pushed the findepi/retain-precision-when-casting-json-number-to-varchar-806435 branch from 6c10312 to 14700b4 Compare March 30, 2026 16:49

dain commented Mar 30, 2026

IMO we should just retain the original text from the JSON. I don't see an upside to normalizing the text, and the downsides are:

  1. we lose information that might be meaningful to the user. For example, they might be in an environment where numbers in exponent notation are important. Or possibly the trailing zeros imply the precision of a measurement.
  2. performance. Parsing and printing through BigDecimal is not cheap, and is object-heavy.
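
The first point can be illustrated with plain `BigDecimal` calls. This is a sketch only, with hypothetical class and method names, and it does not show what Trino does: two spellings of the same value are numerically equal, yet stripping trailing zeros discards the digits that distinguished them.

```java
import java.math.BigDecimal;

// Hypothetical demo: "1.50" and "1.5" denote the same value, but only the
// original text records that the author wrote two fractional digits.
final class TrailingZerosDemo {
    private TrailingZerosDemo() {}

    // True when both texts denote the same numeric value.
    static boolean sameValue(String left, String right) {
        return new BigDecimal(left).compareTo(new BigDecimal(right)) == 0;
    }

    // BigDecimal.equals also compares scale, so it separates 1.50 from 1.5.
    static boolean sameRepresentation(String left, String right) {
        return new BigDecimal(left).equals(new BigDecimal(right));
    }

    // A normalization step that drops trailing zeros.
    static String stripped(String text) {
        return new BigDecimal(text).stripTrailingZeros().toString();
    }

    public static void main(String[] args) {
        System.out.println(sameValue("1.50", "1.5"));
        System.out.println(sameRepresentation("1.50", "1.5"));
        System.out.println(stripped("1.50"));
    }
}
```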

findepi added 2 commits March 30, 2026 21:44
The general purpose `jsonParse` utility was lossy when it comes to
numbers containing a decimal point.

This affected `json_parse` SQL function, `JSON` SQL type constructor and
connectors which use `jsonParse` to canonicalize JSON representation on
remote data read (e.g. PostgreSQL).
Together with `Fix numeric precision loss in JSON parsing`, this now
works.

findepi commented Mar 30, 2026

> 2. performance. Parsing and printing through BigDecimal is not cheap, and is object-heavy

Per the benchmarks, this turned out not to be an issue?
Can you maybe run them too and compare the results?

> I don't see an upside to normalizing the text

The "upside" is that cast(JSON '...' as ARRAY(VARCHAR)) and cast(json_parse('...') as ARRAY(VARCHAR)) should yield the same results. Can we agree this is expected behavior worth maintaining?

I found a bug in the current implementation that invalidated this assumption. However, combined with #28916 (cherry-picked here to run CI), this works as expected, via decimals.

However, with `case VALUE_NUMBER_FLOAT -> utf8Slice(parser.getText())`, the equality cast(JSON '...' as ARRAY(VARCHAR)) = cast(json_parse('...') as ARRAY(VARCHAR)) no longer holds.

> we lose information that might be meaningful to the user.

We at least agree that a lossy cast to varchar is a problem, i.e. #28881 is a bug.
Going through doubles is definitely the most lossy of all the options considered.

Development

Successfully merging this pull request may close these issues.

Cast from JSON number with decimal point to VARCHAR loses precision