Add basic extension dtypes support. #2039

Merged on Feb 17, 2021 (19 commits)

ueshin (Collaborator) commented Feb 4, 2021

Adds basic extension dtypes support.

The following types are supported if the underlying pandas supports them:

  • pandas >= 0.24
    • Int8Dtype
    • Int16Dtype
    • Int32Dtype
    • Int64Dtype
  • pandas >= 1.0
    • BooleanDtype
    • StringDtype
  • pandas >= 1.2
    • Float32Dtype
    • Float64Dtype
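
The version gating in the list above can be sketched as a small lookup. This is only an illustration of the table, not the check Koalas itself performs; `SUPPORT_TABLE` and `supported_extension_dtypes` are hypothetical names, and the version parsing assumes a plain `major.minor[.patch]` string.

```python
from typing import List, Tuple

# Minimum pandas (major, minor) version -> extension dtypes listed in this PR.
# Hypothetical table for illustration; it mirrors the bullet list above.
SUPPORT_TABLE: List[Tuple[Tuple[int, int], List[str]]] = [
    ((0, 24), ["Int8Dtype", "Int16Dtype", "Int32Dtype", "Int64Dtype"]),
    ((1, 0), ["BooleanDtype", "StringDtype"]),
    ((1, 2), ["Float32Dtype", "Float64Dtype"]),
]


def supported_extension_dtypes(pandas_version: str) -> List[str]:
    """Return the listed dtype names available for a given pandas version string."""
    # Only the first two components matter, e.g. "1.1.5" -> (1, 1).
    major, minor = (int(part) for part in pandas_version.split(".")[:2])
    names: List[str] = []
    for minimum, dtypes in SUPPORT_TABLE:
        if (major, minor) >= minimum:
            names.extend(dtypes)
    return names


print(supported_extension_dtypes("1.1.5"))
```

With pandas installed, the same question could be answered for the running environment by passing `pandas.__version__` to the helper.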

Internally, InternalFrame gains two new attributes, index_dtypes and data_dtypes, which record the dtypes of the index columns and the data columns respectively.

>>> import databricks.koalas as ks
>>> kdf = ks.DataFrame({'a': [1, 2, None, 3], 'b': [4, 5, 6, None]}).astype({'a': 'Int32', 'b': 'Int64'})
>>> kdf
      a     b
0     1     4
1     2     5
2  <NA>     6
3     3  <NA>
>>> kdf.dtypes
a    Int32
b    Int64
dtype: object

>>> kdf._internal.index_dtypes
[dtype('int64')]
>>> kdf._internal.data_dtypes
[Int32Dtype(), Int64Dtype()]

Currently, binary operations and type casting are supported:

>>> kdf.a + kdf.b
0       5
1       7
2    <NA>
3    <NA>
dtype: Int64
>>> kdf + kdf
      a     b
0     2     8
1     4    10
2  <NA>    12
3     6  <NA>
>>> kdf.a.astype('Float64')
0     1.0
1     2.0
2    <NA>
3     3.0
Name: a, dtype: Float64
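
The outputs above follow one rule: a missing value in either operand, or in the value being cast, stays missing in the result. A minimal pure-Python sketch of that rule, using `None` in place of `<NA>`, could look like the following. The helper names are hypothetical, and this says nothing about how Koalas or Spark actually evaluates these expressions.

```python
from typing import List, Optional


def add_nullable(left: Optional[int], right: Optional[int]) -> Optional[int]:
    """Add two values, propagating a missing value (None) to the result."""
    if left is None or right is None:
        return None
    return left + right


def cast_to_float(values: List[Optional[int]]) -> List[Optional[float]]:
    """Cast integers to floats while keeping missing values missing."""
    return [None if value is None else float(value) for value in values]


# Same data as the 'a' and 'b' columns in the example above.
a = [1, 2, None, 3]
b = [4, 5, 6, None]

summed = [add_nullable(x, y) for x, y in zip(a, b)]
print(summed)            # [5, 7, None, None]
print(cast_to_float(a))  # [1.0, 2.0, None, 3.0]
```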

Resolves #2009.

codecov-io commented Feb 4, 2021

Codecov Report

Merging #2039 (bb766a4) into master (dcf5275) will decrease coverage by 5.27%.
The diff coverage is 69.96%.

@@            Coverage Diff             @@
##           master    #2039      +/-   ##
==========================================
- Coverage   94.71%   89.43%   -5.28%     
==========================================
  Files          54       54              
  Lines       11503    11566      +63     
==========================================
- Hits        10895    10344     -551     
- Misses        608     1222     +614     

| Impacted Files | Coverage Δ |
|---|---|
| databricks/koalas/namespace.py | 78.49% <ø> (-5.92%) ⬇️ |
| databricks/koalas/strings.py | 82.35% <ø> (ø) |
| databricks/koalas/typedef/typehints.py | 65.92% <25.71%> (-27.89%) ⬇️ |
| databricks/koalas/base.py | 93.46% <75.00%> (-3.83%) ⬇️ |
| databricks/koalas/indexing.py | 92.60% <91.66%> (-0.02%) ⬇️ |
| databricks/koalas/internal.py | 94.07% <94.33%> (-2.03%) ⬇️ |
| databricks/koalas/frame.py | 93.43% <100.00%> (-3.15%) ⬇️ |
| databricks/koalas/indexes/base.py | 97.23% <100.00%> (-0.21%) ⬇️ |
| databricks/koalas/indexes/multi.py | 91.45% <100.00%> (-4.35%) ⬇️ |
| databricks/koalas/series.py | 95.64% <100.00%> (-1.15%) ⬇️ |

... and 33 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ueshin marked this pull request as ready for review on February 9, 2021, 03:05
ueshin (Collaborator, Author) commented Feb 9, 2021

I think this is ready for review.
I'll address boolean operations in a separate PR, because the "boolean" extension type has SQL-like semantics, whereas "bool" follows Python semantics.
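
To make the distinction concrete: assuming "SQL-ish" here refers to three-valued (Kleene) logic, which is what the pandas "boolean" extension dtype uses, a missing operand yields a missing result unless the other operand already decides the outcome. A minimal pure-Python sketch, with `None` standing in for `<NA>` and hypothetical helper names:

```python
from typing import Optional


def kleene_and(left: Optional[bool], right: Optional[bool]) -> Optional[bool]:
    """Three-valued AND: False wins, otherwise a missing operand gives missing."""
    if left is False or right is False:
        return False
    if left is None or right is None:
        return None
    return True


def kleene_or(left: Optional[bool], right: Optional[bool]) -> Optional[bool]:
    """Three-valued OR: True wins, otherwise a missing operand gives missing."""
    if left is True or right is True:
        return True
    if left is None or right is None:
        return None
    return False


print(kleene_and(True, None))   # None: the result depends on the unknown value
print(kleene_and(False, None))  # False: already decided by the known operand
print(kleene_or(True, None))    # True: already decided by the known operand
print(kleene_or(False, None))   # None: the result depends on the unknown value
```

By contrast, a plain Python `bool` has no missing state: `bool(None)` is `False`, so coercing first would turn an unknown value into a definite one.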

ueshin requested review from itholic, HyukjinKwon and xinrong-meng and removed request for HyukjinKwon on February 9, 2021, 03:19
ueshin (Collaborator, Author) commented Feb 9, 2021

DataFrame operations with different anchors do not seem to work yet. I'll address that soon, but you can still start reviewing this PR.

HyukjinKwon (Member) commented

I haven't taken a close look, but I think it's good to go.

ueshin (Collaborator, Author) commented Feb 17, 2021

Thanks! Let me merge this for now. Please feel free to leave comments if you have any.

ueshin merged commit 2618c52 into databricks:master on Feb 17, 2021
ueshin deleted the type_mapping branch on February 17, 2021, 19:30
Successfully merging this pull request may close these issues.

Supporting nullable data types for float data, Float32 and Float64