Fix precision loss when dealing with JSON numbers containing decimal point #28882
findepi wants to merge 9 commits into trinodb:master
Conversation
// An alternative is calling getLongValue and then BigintOperators.castToVarchar.
// It doesn't work as well because it can result in overflow and underflow exceptions for large integral numbers.
- case VALUE_NUMBER_INT -> utf8Slice(parser.getText());
+ case VALUE_NUMBER_INT, VALUE_NUMBER_FLOAT -> utf8Slice(parser.getDecimalValue().toString());
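The overflow concern mentioned in the comment can be reproduced with the standard library alone (a sketch, not the actual Jackson code path): parsing an integral number beyond the long range fails, while BigDecimal preserves every digit.

```java
import java.math.BigDecimal;

public class LargeIntegral {
    public static void main(String[] args) {
        // Exceeds Long.MAX_VALUE (9223372036854775807), so a getLongValue-style path overflows
        String large = "12345678901234567890";
        try {
            Long.parseLong(large); // throws NumberFormatException
        } catch (NumberFormatException e) {
            System.out.println("long overflow for " + large);
        }
        // BigDecimal round-trips the digits exactly
        System.out.println(new BigDecimal(large)); // prints 12345678901234567890
    }
}
```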
The returned text might be longer than the double.toString output we used to produce.
This can cause query failures when casting json to varchar(n) for certain n values.
I think it's fine. The format of the returned string is not guaranteed to be stable.
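The length difference is easy to see with the standard library alone (a sketch; since the exact digits of the double rendering depend on shortest-round-trip formatting, only the lengths are compared):

```java
import java.math.BigDecimal;

public class TextLength {
    public static void main(String[] args) {
        String exact = "123456789012345678901234567890.12345678";
        // Old behavior: go through double, which renders in short scientific notation
        String viaDouble = Double.toString(Double.parseDouble(exact));
        // New behavior: BigDecimal keeps every digit, so the text is much longer
        String viaDecimal = new BigDecimal(exact).toString();
        System.out.println(viaDouble.length() + " vs " + viaDecimal.length());
        // A longer text is why CAST(json AS varchar(n)) can now fail for small n
    }
}
```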
Does this actually have to go through getDecimalValue, or can you just fetch the raw text out of the JsonParser and avoid the BigDecimal code entirely?
Different JSON syntaxes encode the same number: 100.0, 100.000, 1e2, 10e1, 00000000100.0.
I don't expect the JSON number to VARCHAR cast to simply "leak" JSON's internal syntax.
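The claim can be checked with plain BigDecimal (a standalone sketch; the actual cast goes through Jackson's getDecimalValue): all five spellings denote the same numeric value even though their source texts differ.

```java
import java.math.BigDecimal;

public class SameNumber {
    public static void main(String[] args) {
        String[] spellings = {"100.0", "100.000", "1e2", "10e1", "00000000100.0"};
        BigDecimal hundred = new BigDecimal(100);
        for (String s : spellings) {
            // compareTo ignores scale, so all of these are numerically equal to 100
            System.out.println(s + " == 100 : " + (new BigDecimal(s).compareTo(hundred) == 0));
        }
    }
}
```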
In theory, JSON values are supposed to be normalized, so it shouldn't matter.
In practice, SpecializeCastWithJsonParse assumes the cast normalizes the values.
A query result could depend on whether SpecializeCastWithJsonParse kicks in or not. This feels wrong.
Added a benchmark and posted results:
The general purpose `jsonParse` utility was lossy for numbers containing a decimal point. This affected the `json_parse` SQL function, the `JSON` SQL type constructor, and connectors which use `jsonParse` to canonicalize the JSON representation of remote data on read (e.g. PostgreSQL).
Before the change, when casting a JSON number containing a decimal point to VARCHAR, the number was first converted to DOUBLE, resulting in unnecessary loss of information.
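The old DOUBLE round trip can be reproduced with the standard library alone (a sketch; the literal below is an arbitrary value with more digits than a double can hold):

```java
import java.math.BigDecimal;

public class DoubleRoundTrip {
    public static void main(String[] args) {
        String json = "1234567890.1234567890123456789"; // more digits than a double can hold
        // Old path: JSON number -> double -> text (lossy)
        String viaDouble = Double.toString(Double.parseDouble(json));
        // New path: JSON number -> BigDecimal -> text (exact)
        String viaDecimal = new BigDecimal(json).toString();
        // The double round trip no longer compares equal to the original value
        System.out.println(new BigDecimal(viaDouble).compareTo(new BigDecimal(json)) != 0); // true
        System.out.println(viaDecimal.equals(json)); // true
    }
}
```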
Force-pushed 2a3a0d6 to 924b465
Force-pushed 924b465 to 65591ea
@coderabbitai full review
✅ Actions performed: Full review triggered.
📝 Walkthrough: This PR addresses precision loss when casting JSON numbers to VARCHAR and other types, and when parsing JSON with decimal points.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@core/trino-main/src/test/java/io/trino/type/TestArrayOperators.java`:
- Around line 1141-1163: The test block in TestArrayOperators.java exercises
JSON->ARRAY(VARCHAR) for integers, booleans, empty strings and short/normalized
decimals but misses a regression case for long-decimal values (>16 significant
digits); add another assertion in the same group using
assertions.expression("cast(a as ARRAY(VARCHAR))").binding("a", "JSON '[...]'")
that includes a long decimal literal (e.g. a numeric value with >16 significant
digits) and assert .hasType(new ArrayType(VARCHAR)).matches(...) where the
expected CAST(ARRAY[... ] AS ARRAY(VARCHAR)) contains the long decimal preserved
exactly as a string (not truncated/normalized), so the test fails if precision
is lost. Ensure the new assertion follows the same style and placement as the
other cases in the block.
📒 Files selected for processing (10)
core/trino-main/src/main/java/io/trino/operator/scalar/JsonToArrayCast.java
core/trino-main/src/main/java/io/trino/operator/scalar/JsonToMapCast.java
core/trino-main/src/main/java/io/trino/operator/scalar/JsonToRowCast.java
core/trino-main/src/main/java/io/trino/sql/ir/optimizer/rule/SpecializeCastWithJsonParse.java
core/trino-main/src/main/java/io/trino/util/JsonUtil.java
core/trino-main/src/test/java/io/trino/type/TestArrayOperators.java
core/trino-main/src/test/java/io/trino/type/TestJsonOperators.java
core/trino-main/src/test/java/io/trino/type/TestMapOperators.java
lib/trino-plugin-toolkit/src/main/java/io/trino/plugin/base/util/JsonTypeUtil.java
lib/trino-plugin-toolkit/src/test/java/io/trino/plugin/base/util/TestJsonTypeUtil.java
// array with number elements (converted to string)
assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
        .binding("a", "JSON '[128, 12345678901234567890]'"))
        .hasType(new ArrayType(VARCHAR))
        .matches("CAST(ARRAY['128', '12345678901234567890'] AS ARRAY(VARCHAR))");

// array with boolean elements
assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
        .binding("a", "JSON '[true, false]'"))
        .hasType(new ArrayType(VARCHAR))
        .matches("CAST(ARRAY['true', 'false'] AS ARRAY(VARCHAR))");

// array with empty string
assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
        .binding("a", "JSON '[\"test\", \"\", \"data\"]'"))
        .hasType(new ArrayType(VARCHAR))
        .matches("CAST(ARRAY['test', '', 'data'] AS ARRAY(VARCHAR))");

// array with various types including scientific notation and string "null"
assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
        .binding("a", "JSON '[true, false, 12, 12.3, 1.23E1, \"puppies\", \"kittens\", \"null\", null]'"))
        .hasType(new ArrayType(VARCHAR))
        .matches("CAST(ARRAY['true', 'false', '12', '12.3', '12.3', 'puppies', 'kittens', 'null', null] AS ARRAY(VARCHAR))");
🛠️ Refactor suggestion | 🟠 Major
Add the actual long-decimal JSON -> ARRAY(VARCHAR) regression case.
Lines 1143-1163 only cover integers, short decimals, and exponent normalization, so they still pass if the original >16-significant-digit decimal regression comes back.
➕ Suggested regression assertion
  assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
          .binding("a", "JSON '[true, false, 12, 12.3, 1.23E1, \"puppies\", \"kittens\", \"null\", null]'"))
          .hasType(new ArrayType(VARCHAR))
          .matches("CAST(ARRAY['true', 'false', '12', '12.3', '12.3', 'puppies', 'kittens', 'null', null] AS ARRAY(VARCHAR))");
+
+ assertThat(assertions.expression("cast(a as ARRAY(VARCHAR))")
+         .binding("a", "JSON '[123456789012345678901234567890.12345678]'"))
+         .hasType(new ArrayType(VARCHAR))
+         .matches("CAST(ARRAY['123456789012345678901234567890.12345678'] AS ARRAY(VARCHAR))");
dain left a comment
I'm curious/concerned about the performance impact of cycling through BigDecimal for this code. My experience with BigDecimal is that it's literally the performance cliff and should be avoided at all costs. So I am pretty concerned about us negatively affecting everyone using JSON. Can we add some benchmarks or something, or is there just something I'm missing here? Maybe the code that you've changed isn't used by a normal path or something like that.
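To make the concern concrete, here is a stdlib-only timing sketch (not a JMH benchmark: no warmup control or dead-code elimination safeguards, so treat the numbers as indicative only):

```java
import java.math.BigDecimal;

public class ParseCostSketch {
    public static void main(String[] args) {
        String number = "12345.6789";
        int iterations = 1_000_000;

        long t0 = System.nanoTime();
        double doubleSum = 0;
        for (int i = 0; i < iterations; i++) {
            doubleSum += Double.parseDouble(number);
        }
        long doubleNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        BigDecimal decimalSum = BigDecimal.ZERO;
        for (int i = 0; i < iterations; i++) {
            decimalSum = decimalSum.add(new BigDecimal(number));
        }
        long decimalNanos = System.nanoTime() - t1;

        // Print both sums so the JIT cannot drop the loops entirely
        System.out.printf("double: %d ms (sum %.1f), BigDecimal: %d ms (sum %s)%n",
                doubleNanos / 1_000_000, doubleSum, decimalNanos / 1_000_000, decimalSum);
    }
}
```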
private static final JsonMapper SORTED_MAPPER = new JsonMapperProvider().get()
        .rebuild()
        .configure(ORDER_MAP_ENTRIES_BY_KEYS, true)
        .configure(USE_BIG_DECIMAL_FOR_FLOATS, true)
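USE_BIG_DECIMAL_FOR_FLOATS makes Jackson bind floating-point JSON numbers to BigDecimal rather than double. A stdlib-only sketch of the difference the flag matters for (the Jackson mapper itself is not used here):

```java
import java.math.BigDecimal;

public class WhyBigDecimal {
    public static void main(String[] args) {
        // 18 significant digits: more than a double can represent
        String text = "1.00000000000000001";
        double asDouble = Double.parseDouble(text);
        System.out.println(asDouble == 1.0); // true: the tail digits are silently lost
        System.out.println(new BigDecimal(text).compareTo(BigDecimal.ONE) != 0); // true: BigDecimal keeps them
    }
}
```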
What's the performance impact of this?
I don't know. However, I value our correctness-first approach. This is the correct thing to do.
The whole SORTED_MAPPER.readValue(parser, Object.class) looks like optimization candidate. I think we could e.g. operate on TreeNode or something, to avoid unnecessary conversions, or work on top of json parser tokens. There seems to be a lot of optimization potential here.
I think this is worth figuring out. This could be a big regression for our users. JSON is one of the most common formats in practice, and it is a shame none of the standard benchmarks exercise it.
See my top level comment on the PR. I expect this should not have impact, and may not be necessary at all.
benchmark added, see results:
(new PR per #28882 (comment) )
Force-pushed 65591ea to a2dd63c
losipiuk left a comment
I like the code, but I agree some benchmarking would be nice to have.
I did more research on this with ChatGPT. The second concern is round-tripping through BigDecimal in the to… I feel strongly that we should not canonicalize here and should instead preserve the original value as is. I think that is what end users want in the case of converting a numeric value to text.
It is suggesting that …
The downside is potentially unstable query results: two different code paths that are supposed to be equivalent will now produce different results. Details in #28882 (comment)
I extracted the commits that are not being disputed into a separate PR:
The meaty changes are two things that are related, but can be reviewed and discussed independently. I will split the PR into two.
Created json_parse, JSON constructor and connectors #28867
Let's start the review there when the cleanup (#28915) is merged. I will tag you as reviewers there when ready.