
Force to use Spark's system default precision and scale when the inferred data type contains DecimalType. #1904

Merged
2 commits merged into databricks:master from decimal on Nov 12, 2020

Conversation

@ueshin (Collaborator) commented Nov 11, 2020

The return type inference sometimes fails when the inferred type contains a DecimalType, because it can infer a precision or scale smaller than the actual data requires.

>>> from decimal import Decimal
>>> kdf = ks.DataFrame({'a': [Decimal('1.1'), Decimal('2.2'), Decimal('10.01')]})
>>> kdf
       a
0   1.10
1   2.20
2  10.01
>>> kdf.spark.print_schema()
root
 |-- a: decimal(4,2) (nullable = false)

>>> with ks.option_context("compute.shortcut_limit", 1):
...   kdf.transform(lambda x: x).spark.print_schema()
...
root
 |-- a: decimal(3,2) (nullable = true)

>>> with ks.option_context("compute.shortcut_limit", 1):
...   kdf.transform(lambda x: x)
...
Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Decimal type with precision 4 does not fit into precision inferred from first array element: 3
...

We should instead use Spark's system default precision and scale to avoid as many such cases as possible:

>>> with ks.option_context("compute.shortcut_limit", 1):
...   kdf.transform(lambda x: x).spark.print_schema()
...
root
 |-- a: decimal(38,18) (nullable = true)

>>> with ks.option_context("compute.shortcut_limit", 1):
...   kdf.transform(lambda x: x)
...
                       a
0   1.100000000000000000
1   2.200000000000000000
2  10.010000000000000000
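
The idea is to replace any DecimalType in the inferred return schema with Spark's system default, decimal(38, 18), so the actual values always fit. A minimal sketch of that idea follows; the helper name with_default_decimal and its exact shape are hypothetical, and the real code added to databricks/koalas/spark/utils.py in this PR may differ.

# A hedged sketch, not the PR's actual implementation: recursively replace any
# DecimalType in an inferred schema with Spark's system default decimal(38, 18).
# The helper name and structure are hypothetical; see
# databricks/koalas/spark/utils.py in the PR for the real code.
from pyspark.sql.types import (
    ArrayType, DataType, DecimalType, MapType, StructField, StructType
)


def with_default_decimal(dt: DataType) -> DataType:
    if isinstance(dt, DecimalType):
        # Spark's system default precision and scale.
        return DecimalType(38, 18)
    elif isinstance(dt, ArrayType):
        return ArrayType(with_default_decimal(dt.elementType), containsNull=True)
    elif isinstance(dt, MapType):
        return MapType(
            with_default_decimal(dt.keyType),
            with_default_decimal(dt.valueType),
            valueContainsNull=True,
        )
    elif isinstance(dt, StructType):
        return StructType(
            [
                StructField(field.name, with_default_decimal(field.dataType), nullable=True)
                for field in dt.fields
            ]
        )
    else:
        return dt

Applied to the inferred schema in the example above, with_default_decimal would turn decimal(3,2) for column a into decimal(38,18), matching the printed schema shown here.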

@itholic (Contributor) left a comment

Everything looks clear to me except one nit.

databricks/koalas/spark/utils.py (review thread resolved)
@codecov-io

Codecov Report

Merging #1904 (bcb403d) into master (44d45f2) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #1904   +/-   ##
=======================================
  Coverage   94.16%   94.17%           
=======================================
  Files          41       41           
  Lines        9975     9989   +14     
=======================================
+ Hits         9393     9407   +14     
  Misses        582      582           
Impacted Files Coverage Δ
databricks/koalas/accessors.py 93.00% <100.00%> (ø)
databricks/koalas/frame.py 96.73% <100.00%> (ø)
databricks/koalas/groupby.py 91.48% <100.00%> (ø)
databricks/koalas/namespace.py 83.33% <100.00%> (+0.03%) ⬆️
databricks/koalas/spark/utils.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 44d45f2...bcb403d.

@ueshin (Collaborator, Author) commented Nov 12, 2020

Thanks! merging.

@ueshin ueshin merged commit ae0cf40 into databricks:master Nov 12, 2020
@ueshin ueshin deleted the decimal branch November 12, 2020 01:37