
Force to use Spark's system default precision and scale when the inferred data type contains DecimalType. #1904

Merged
2 commits merged into databricks:master from decimal on Nov 12, 2020

Conversation

@ueshin (Collaborator) commented Nov 11, 2020

The return type inference sometimes fails when the inferred type contains a DecimalType, because it can infer a precision or scale smaller than the actual data requires.

>>> from decimal import Decimal
>>> kdf = ks.DataFrame({'a': [Decimal('1.1'), Decimal('2.2'), Decimal('10.01')]})
>>> kdf
       a
0   1.10
1   2.20
2  10.01
>>> kdf.spark.print_schema()
root
 |-- a: decimal(4,2) (nullable = false)

>>> with ks.option_context("compute.shortcut_limit", 1):
...   kdf.transform(lambda x: x).spark.print_schema()
...
root
 |-- a: decimal(3,2) (nullable = true)

>>> with ks.option_context("compute.shortcut_limit", 1):
...   kdf.transform(lambda x: x)
...
Traceback (most recent call last):
...
pyarrow.lib.ArrowInvalid: Decimal type with precision 4 does not fit into precision inferred from first array element: 3
...

We should instead use Spark's system default precision and scale to avoid as many such cases as possible:

>>> with ks.option_context("compute.shortcut_limit", 1):
...   kdf.transform(lambda x: x).spark.print_schema()
...
root
 |-- a: decimal(38,18) (nullable = true)

>>> with ks.option_context("compute.shortcut_limit", 1):
...   kdf.transform(lambda x: x)
...
                       a
0   1.100000000000000000
1   2.200000000000000000
2  10.010000000000000000
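
The idea is to replace any DecimalType in the inferred return schema with Spark's system default, decimal(38, 18), so the actual values always fit. A minimal sketch of that idea follows; the helper name with_default_decimal and its exact shape are hypothetical, and the real code added to databricks/koalas/spark/utils.py in this PR may differ.

# A hedged sketch, not the PR's actual implementation: recursively replace any
# DecimalType in an inferred schema with Spark's system default decimal(38, 18).
# The helper name and structure are hypothetical; see
# databricks/koalas/spark/utils.py in the PR for the real code.
from pyspark.sql.types import (
    ArrayType, DataType, DecimalType, MapType, StructField, StructType
)


def with_default_decimal(dt: DataType) -> DataType:
    if isinstance(dt, DecimalType):
        # Spark's system default precision and scale.
        return DecimalType(38, 18)
    elif isinstance(dt, ArrayType):
        return ArrayType(with_default_decimal(dt.elementType), containsNull=True)
    elif isinstance(dt, MapType):
        return MapType(
            with_default_decimal(dt.keyType),
            with_default_decimal(dt.valueType),
            valueContainsNull=True,
        )
    elif isinstance(dt, StructType):
        return StructType(
            [
                StructField(field.name, with_default_decimal(field.dataType), nullable=True)
                for field in dt.fields
            ]
        )
    else:
        return dt

Applied to the inferred schema in the example above, with_default_decimal would turn decimal(3,2) for column a into decimal(38,18), matching the printed schema shown here.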

@itholic (Contributor) left a comment

Everything looks clear to me except one nit.

databricks/koalas/spark/utils.py (review thread resolved)
@codecov-io

Codecov Report

Merging #1904 (bcb403d) into master (44d45f2) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #1904   +/-   ##
=======================================
  Coverage   94.16%   94.17%           
=======================================
  Files          41       41           
  Lines        9975     9989   +14     
=======================================
+ Hits         9393     9407   +14     
  Misses        582      582           
Impacted Files Coverage Δ
databricks/koalas/accessors.py 93.00% <100.00%> (ø)
databricks/koalas/frame.py 96.73% <100.00%> (ø)
databricks/koalas/groupby.py 91.48% <100.00%> (ø)
databricks/koalas/namespace.py 83.33% <100.00%> (+0.03%) ⬆️
databricks/koalas/spark/utils.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 44d45f2...bcb403d.

@ueshin (Collaborator, Author) commented Nov 12, 2020

Thanks! merging.

@ueshin ueshin merged commit ae0cf40 into databricks:master Nov 12, 2020
@ueshin ueshin deleted the decimal branch November 12, 2020 01:37