
Fix precision loss when dealing with JSON numbers containing decimal point #28882

Closed
findepi wants to merge 9 commits into trinodb:master from findepi:findepi/json-precision-loss

Conversation

@findepi
Member

@findepi findepi commented Mar 26, 2026

General
* Improve precision when casting JSON numbers with decimal point to VARCHAR. #28881
* Fix incorrect result when using `json_parse` or the JSON type constructor
  and the document contains numbers with a decimal point and more than 16 significant digits. #28867

MySQL, PostgreSQL, Mongo, Pinot, SingleStore
* Fix incorrect result when reading a JSON column and the document contains numbers
  with a decimal point and more than 16 significant digits. #28867
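The core failure mode can be sketched with the plain JDK (illustrative only, not Trino code): routing a decimal literal with more than 16 significant digits through `double` silently drops digits, while `BigDecimal` preserves them.

```java
import java.math.BigDecimal;

public class PrecisionLossDemo {
    public static void main(String[] args) {
        // A JSON number literal with more than 16 significant digits
        String raw = "1234567890.12345678901";

        // Lossy path: parse as double, then render back to text
        String viaDouble = Double.toString(Double.parseDouble(raw));

        // Lossless path: parse as BigDecimal
        String viaDecimal = new BigDecimal(raw).toString();

        System.out.println(viaDouble);  // trailing digits are lost
        System.out.println(viaDecimal); // all 21 significant digits survive
    }
}
```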

// An alternative is calling getLongValue and then BigintOperators.castToVarchar.
// It doesn't work as well because it can result in overflow and underflow exceptions for large integral numbers.
case VALUE_NUMBER_INT -> utf8Slice(parser.getText());
case VALUE_NUMBER_INT, VALUE_NUMBER_FLOAT -> utf8Slice(parser.getDecimalValue().toString());
Member Author


The returned text might be longer than the `Double.toString` output we used to use.
This can cause query failures when casting JSON to varchar(n) for certain n values.
I think that's fine: the format of the returned string is not guaranteed to be stable.
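For illustration (plain JDK, not the Trino code path), here is how the BigDecimal-based text can be longer than the old double-based text; a hypothetical varchar(4) target would have fit the old output but not the new one:

```java
import java.math.BigDecimal;

public class TextLengthDemo {
    public static void main(String[] args) {
        String raw = "12.300000000000000000001";

        // Old behavior: the nearest double is 12.3, so the rendered text is short
        String oldText = Double.toString(Double.parseDouble(raw)); // "12.3"

        // New behavior: every digit of the source literal is kept
        String newText = new BigDecimal(raw).toString();

        System.out.println(oldText.length() + " vs " + newText.length()); // 4 vs 24
    }
}
```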

Member


Does this actually have to go through getDecimalValue, or can you just fetch the raw text out of the JsonParser and avoid the BigDecimal code entirely?

Member Author


Different JSON syntaxes encode the same number: 100.0, 100.000, 1e2, 10e1, 00000000100.0.
I don't expect the JSON-number-to-VARCHAR cast to simply "leak" the JSON's internal syntax.
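A quick stdlib sketch of the point (illustrative only; `stripTrailingZeros().toPlainString()` is one way to collapse the spellings, not necessarily the exact form Trino emits):

```java
import java.math.BigDecimal;

public class SameNumberManySpellings {
    public static void main(String[] args) {
        String[] spellings = {"100.0", "100.000", "1e2", "10e1", "00000000100.0"};
        BigDecimal hundred = new BigDecimal(100);

        for (String s : spellings) {
            BigDecimal value = new BigDecimal(s);
            // Numerically, every spelling is the same number...
            assert value.compareTo(hundred) == 0;
            // ...and can be collapsed to one canonical text form
            System.out.println(s + " -> " + value.stripTrailingZeros().toPlainString());
        }
    }
}
```

Every spelling prints `-> 100`, which is why leaking the raw text would surface arbitrary source formatting to the user.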

Member


Why not?

Member Author


In theory, JSON values are supposed to be normalized, so it shouldn't matter.
In practice, SpecializeCastWithJsonParse assumes the cast normalizes the values,
so the query result could depend on whether SpecializeCastWithJsonParse kicks in or not. That feels wrong.
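A toy illustration of the instability (both paths are hypothetical stand-ins, not the actual optimizer code): if one path leaks the raw JSON text while the other normalizes it, the same input yields two different strings.

```java
import java.math.BigDecimal;

public class DivergingCastPaths {
    public static void main(String[] args) {
        String rawJsonNumber = "1e2";

        // Path 1 (hypothetical): the cast leaks the raw JSON spelling
        String leaked = rawJsonNumber;

        // Path 2 (hypothetical): a rewrite like SpecializeCastWithJsonParse
        // runs the value through a normalizing parse first
        String normalized = new BigDecimal(rawJsonNumber)
                .stripTrailingZeros()
                .toPlainString();

        // Same query, two possible answers depending on which path runs
        System.out.println(leaked + " vs " + normalized); // 1e2 vs 100
    }
}
```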

Member Author


findepi added 5 commits March 26, 2026 12:52
The general-purpose `jsonParse` utility was lossy for numbers containing a decimal point.

This affected the `json_parse` SQL function, the `JSON` SQL type constructor, and
connectors that use `jsonParse` to canonicalize the JSON representation when
reading remote data (e.g. PostgreSQL).
Before the change, when casting a JSON number containing a decimal point
to VARCHAR, the number was first converted to DOUBLE, resulting in
unnecessary loss of information.
@findepi findepi force-pushed the findepi/json-precision-loss branch from 2a3a0d6 to 924b465 on March 26, 2026 11:52
@findepi findepi force-pushed the findepi/json-precision-loss branch from 924b465 to 65591ea on March 26, 2026 12:12
@dain
Member

dain commented Mar 26, 2026

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Mar 26, 2026

✅ Actions performed

Full review triggered.

@coderabbitai

coderabbitai bot commented Mar 26, 2026

📝 Walkthrough


This PR addresses precision loss when casting JSON numbers to VARCHAR and other types, and when parsing JSON with decimal points. The changes include:
* enabling USE_BIG_DECIMAL_FOR_FLOATS in JsonTypeUtil to parse floating-point JSON numbers as BigDecimal instead of Double,
* refactoring numeric token handling in JsonUtil to consolidate the VALUE_NUMBER_INT and VALUE_NUMBER_FLOAT paths using getDecimalValue(),
* removing the unused ConnectorSession parameter from the JSON cast methods in JsonToArrayCast, JsonToMapCast, and JsonToRowCast, and
* introducing comprehensive test coverage for JSON↔ARRAY/MAP casting with updated expected outputs reflecting the precision fix.

Assessment against linked issues

Objective Addressed Explanation
Preserve decimal precision when parsing JSON numbers in json_parse and JSON constructors [#28867]
Prevent scientific notation conversion when casting JSON numbers to VARCHAR [#28881]

Out-of-scope changes

Code Change | Explanation
Removal of the ConnectorSession parameter from the toArray, toMap, and toRow methods (JsonToArrayCast.java, JsonToMapCast.java, JsonToRowCast.java) | The linked issues focus on precision loss in JSON number casting and do not mention refactoring method signatures to remove session parameters. While this refactoring may be necessary preparatory work, it extends beyond the stated scope of fixing precision issues.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.41.1)
core/trino-main/src/test/java/io/trino/type/TestArrayOperators.java

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/trino-main/src/test/java/io/trino/type/TestArrayOperators.java`:
- Around line 1141-1163: The test block in TestArrayOperators.java exercises
JSON->ARRAY(VARCHAR) for integers, booleans, empty strings and short/normalized
decimals but misses a regression case for long-decimal values (>16 significant
digits); add another assertion in the same group using
assertions.expression("cast(a as ARRAY(VARCHAR))").binding("a", "JSON '[...]'")
that includes a long decimal literal (e.g. a numeric value with >16 significant
digits) and assert .hasType(new ArrayType(VARCHAR)).matches(...) where the
expected CAST(ARRAY[... ] AS ARRAY(VARCHAR)) contains the long decimal preserved
exactly as a string (not truncated/normalized), so the test fails if precision
is lost. Ensure the new assertion follows the same style and placement as the
other cases in the block.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 536b1ec1-d907-49e3-9ce9-044a10567c0d

📥 Commits

Reviewing files that changed from the base of the PR and between d92f188 and 65591ea.

📒 Files selected for processing (10)
  • core/trino-main/src/main/java/io/trino/operator/scalar/JsonToArrayCast.java
  • core/trino-main/src/main/java/io/trino/operator/scalar/JsonToMapCast.java
  • core/trino-main/src/main/java/io/trino/operator/scalar/JsonToRowCast.java
  • core/trino-main/src/main/java/io/trino/sql/ir/optimizer/rule/SpecializeCastWithJsonParse.java
  • core/trino-main/src/main/java/io/trino/util/JsonUtil.java
  • core/trino-main/src/test/java/io/trino/type/TestArrayOperators.java
  • core/trino-main/src/test/java/io/trino/type/TestJsonOperators.java
  • core/trino-main/src/test/java/io/trino/type/TestMapOperators.java
  • lib/trino-plugin-toolkit/src/main/java/io/trino/plugin/base/util/JsonTypeUtil.java
  • lib/trino-plugin-toolkit/src/test/java/io/trino/plugin/base/util/TestJsonTypeUtil.java

Comment on lines +1141 to +1163
// array with number elements (converted to string)
assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
.binding("a", "JSON '[128, 12345678901234567890]'"))
.hasType(new ArrayType(VARCHAR))
.matches("CAST(ARRAY['128', '12345678901234567890'] AS ARRAY(VARCHAR))");

// array with boolean elements
assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
.binding("a", "JSON '[true, false]'"))
.hasType(new ArrayType(VARCHAR))
.matches("CAST(ARRAY['true', 'false'] AS ARRAY(VARCHAR))");

// array with empty string
assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
.binding("a", "JSON '[\"test\", \"\", \"data\"]'"))
.hasType(new ArrayType(VARCHAR))
.matches("CAST(ARRAY['test', '', 'data'] AS ARRAY(VARCHAR))");

// array with various types including scientific notation and string "null"
assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
.binding("a", "JSON '[true, false, 12, 12.3, 1.23E1, \"puppies\", \"kittens\", \"null\", null]'"))
.hasType(new ArrayType(VARCHAR))
.matches("CAST(ARRAY['true', 'false', '12', '12.3', '12.3', 'puppies', 'kittens', 'null', null] AS ARRAY(VARCHAR))");


🛠️ Refactor suggestion | 🟠 Major

Add the actual long-decimal JSON -> ARRAY(VARCHAR) regression case.

Lines 1143-1163 only cover integers, short decimals, and exponent normalization, so they still pass if the original >16-significant-digit decimal regression comes back.

➕ Suggested regression assertion
         assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
                 .binding("a", "JSON '[true, false, 12, 12.3, 1.23E1, \"puppies\", \"kittens\", \"null\", null]'"))
                 .hasType(new ArrayType(VARCHAR))
                 .matches("CAST(ARRAY['true', 'false', '12', '12.3', '12.3', 'puppies', 'kittens', 'null', null] AS ARRAY(VARCHAR))");
+
+        assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
+                .binding("a", "JSON '[123456789012345678901234567890.12345678]'"))
+                .hasType(new ArrayType(VARCHAR))
+                .matches("CAST(ARRAY['123456789012345678901234567890.12345678'] AS ARRAY(VARCHAR))");

Member

@dain dain left a comment


I'm curious/concerned about the performance impact of cycling through BigDecimal in this code. My experience with BigDecimal is that it's a literal performance cliff and should be avoided at all costs, so I am pretty concerned about negatively affecting everyone using JSON. Can we add some benchmarks, or is there something I'm missing here? Maybe the code you've changed isn't on a normal path, or something like that.
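A crude way to eyeball the cost difference (JMH would be the proper tool; this nanoTime loop is only a rough sketch, and the digit string is made up):

```java
import java.math.BigDecimal;

public class ParseCostSketch {
    private static final int ITERATIONS = 1_000_000;

    public static void main(String[] args) {
        String number = "1234567.8901234567";

        long t0 = System.nanoTime();
        double dSink = 0;
        for (int i = 0; i < ITERATIONS; i++) {
            dSink += Double.parseDouble(number);
        }
        long doubleNanos = System.nanoTime() - t0;

        t0 = System.nanoTime();
        int scaleSink = 0;
        for (int i = 0; i < ITERATIONS; i++) {
            scaleSink += new BigDecimal(number).scale();
        }
        long decimalNanos = System.nanoTime() - t0;

        // Sinks keep the JIT from eliding the loops entirely
        System.out.println(dSink + " " + scaleSink);
        System.out.println("double: " + doubleNanos + " ns, BigDecimal: " + decimalNanos + " ns");
    }
}
```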

private static final JsonMapper SORTED_MAPPER = new JsonMapperProvider().get()
.rebuild()
.configure(ORDER_MAP_ENTRIES_BY_KEYS, true)
.configure(USE_BIG_DECIMAL_FOR_FLOATS, true)
Member


What's the performance impact of this?

Member Author


I don't know. However, I value our correctness-first approach, and this is the correct thing to do.

The whole SORTED_MAPPER.readValue(parser, Object.class) call looks like an optimization candidate. We could, for example, operate on a TreeNode, or work on top of the JSON parser tokens, to avoid unnecessary conversions. There seems to be a lot of optimization potential here.

Member


I think this is worth figuring out. This could be a big regression for our users. JSON is one of the most common formats in practice, and it is a shame none of the standard benchmarks exercise it.

Member


See my top-level comment on the PR. I expect this should not have an impact, and may not be necessary at all.

@findepi findepi force-pushed the findepi/json-precision-loss branch from 65591ea to a2dd63c on March 27, 2026 08:36
@findepi findepi requested a review from dain on March 27, 2026 08:36
@findepi
Member Author

findepi commented Mar 27, 2026

Member

@losipiuk losipiuk left a comment


I like the code, but I agree some benchmarking would be nice to have.

@dain
Member

dain commented Mar 27, 2026

I did more research on this with ChatGPT. USE_BIG_DECIMAL_FOR_FLOATS should only impact code paths that ask for the value as a generic Object or Number. For the code paths modified in this PR, I don't see any obvious usages of methods like that. If so, is it possible to not set this?

The second concern is round-tripping through BigDecimal in the to-VARCHAR cast. I think this comes down to a design choice. If we want to canonicalize the value, then we need the code as it is in the PR. If we are instead ok with (or want) the value preserved as it was written in the original JSON, then we should use getText. Note that in both cases the text will be a valid numeric value, because Jackson enforces this in the parser. The downside of canonicalizing is the performance penalty of BigDecimal parsing and object construction, compared to a simple text copy in the getText version.

I feel strongly that we should not canonicalize here and should instead preserve the original value as-is. I think that is what end users want when converting a numeric value to text.
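The tradeoff can be seen in one line of JDK code (illustrative; getText and getDecimalValue are the Jackson accessors under discussion, but the BigDecimal round-trip below is a plain-JDK stand-in):

```java
import java.math.BigDecimal;

public class PreserveVersusCanonicalize {
    public static void main(String[] args) {
        String rawText = "00000000100.0"; // as written in the original JSON

        // getText-style: preserve the spelling exactly as the user wrote it
        String preserved = rawText;

        // getDecimalValue-style: round-trip through BigDecimal
        String canonical = new BigDecimal(rawText).toString();

        System.out.println(preserved + " / " + canonical); // 00000000100.0 / 100.0
    }
}
```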

@findepi
Member Author

findepi commented Mar 30, 2026

I did more research on this with ChatGPT. USE_BIG_DECIMAL_FOR_FLOATS should only impact code paths that ask for the value as a generic Object or Number. For the code paths modified in this PR, I don't see any obvious usages of methods like that. If so, is it possible to not set this?

This suggests that USE_BIG_DECIMAL_FOR_FLOATS is unnecessary,
whereas it is in fact the fix for the lossy json_parse function (#28867).

The second concern is round-tripping through BigDecimal in the to-VARCHAR cast. I think this comes down to a design choice. If we want to canonicalize the value, then we need the code as it is in the PR. If we are instead ok with (or want) the value preserved as it was written in the original JSON, then we should use getText. Note that in both cases the text will be a valid numeric value, because Jackson enforces this in the parser. The downside of canonicalizing is the performance penalty of BigDecimal parsing and object construction, compared to a simple text copy in the getText version.

The downside is potentially unstable query results: two different code paths that are supposed to be equivalent will now produce different results. Details in #28882 (comment)

@findepi
Member Author

findepi commented Mar 30, 2026

I extracted the commits that are not being disputed into a separate PR:

@findepi
Member Author

findepi commented Mar 30, 2026

The meaty changes are two things that are related, but can be reviewed and discussed independently. I will split the PR into two.

@findepi
Member Author

findepi commented Mar 30, 2026

Created

Let's start the review there once the cleanup (#28915) is merged. I will tag you as reviewers when it's ready.

@findepi findepi closed this Mar 30, 2026
@findepi findepi deleted the findepi/json-precision-loss branch March 30, 2026 07:53

Development

Successfully merging this pull request may close these issues.

Cast from JSON number with decimal point to VARCHAR loses precision
Loss of decimal number precision in json_parse, JSON constructor and connectors

3 participants