Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable index operations. #1955

Merged
merged 11 commits into from
Dec 10, 2020
Merged

Enable index operations. #1955

merged 11 commits into from
Dec 10, 2020

Conversation

ueshin
Copy link
Collaborator

@ueshin ueshin commented Dec 8, 2020

Currently Koalas can't handle index operations very well.

>>> kidx = ks.Index([1, 2, 3, 4, 5])
>>> kidx + kidx
Int64Index([2, 4, 6, 8, 10], dtype='int64')
>>> kidx + kidx + kidx
Traceback (most recent call last):
...
AssertionError: args should be single DataFrame or single/multiple Series

or

>>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
Traceback (most recent call last):
...
AssertionError: args should be single DataFrame or single/multiple Series

This PR enables those operations:

>>> kidx + kidx + kidx
Int64Index([3, 6, 9, 12, 15], dtype='int64')

>>> ks.options.compute.ops_on_diff_frames = True
>>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
Int64Index([7, 9, 13, 11, 15], dtype='int64')

@codecov-io
Copy link

codecov-io commented Dec 8, 2020

Codecov Report

Merging #1955 (c4bc4bc) into master (01ada38) will decrease coverage by 0.16%.
The diff coverage is 90.90%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1955      +/-   ##
==========================================
- Coverage   94.63%   94.47%   -0.17%     
==========================================
  Files          49       49              
  Lines       10829    10855      +26     
==========================================
+ Hits        10248    10255       +7     
- Misses        581      600      +19     
Impacted Files Coverage Δ
databricks/koalas/utils.py 95.37% <83.33%> (-0.35%) ⬇️
databricks/koalas/base.py 96.38% <86.66%> (-0.51%) ⬇️
databricks/koalas/indexes.py 96.49% <91.66%> (-0.49%) ⬇️
databricks/koalas/series.py 96.77% <100.00%> (-0.09%) ⬇️
databricks/koalas/__init__.py 89.06% <0.00%> (-1.57%) ⬇️
databricks/conftest.py 98.55% <0.00%> (-1.45%) ⬇️
databricks/koalas/plot/matplotlib.py 93.06% <0.00%> (-0.29%) ⬇️
databricks/koalas/generic.py 92.63% <0.00%> (-0.29%) ⬇️
databricks/koalas/frame.py 96.55% <0.00%> (-0.21%) ⬇️
databricks/koalas/namespace.py 83.99% <0.00%> (-0.21%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 01ada38...5e57eb3. Read the comment docs.

@ueshin ueshin marked this pull request as ready for review December 9, 2020 02:56
# only work between at most two `Index`s. We might need to fix it in the future.
self_len = len(self)
if any(len(col) != self_len for col in args if isinstance(col, IndexOpsMixin)):
raise ValueError("operands could not be broadcast together with shapes")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does length comparison relate to broadcast in the error message?

Copy link
Collaborator Author

@ueshin ueshin Dec 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term "broadcast" here is not the same as Spark's "broadcast".
Maybe the term should've been "broad-cast", which means cast to broader types, or in this case cast to larger size. I'd just follow pandas' error message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, maybe that's what we are talking about

>>> z= np.arange(12).reshape(3,4)
>>> m= np.arange(9).reshape(3,3)
>>> z * m
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    z * m
ValueError: operands could not be broadcast together with shapes (3,4) (3,3)

The current error message also looks good.

"Cannot combine the series or dataframe because it comes from a different dataframe. "
"In order to allow this operation, enable 'compute.ops_on_diff_frames' option."
)
raise ValueError(ERROR_MESSAGE_CANNOT_COMBINE)


def align_diff_frames(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thank you!

@@ -376,12 +385,11 @@ def align_diff_frames(
return kdf


def align_diff_series(func, this_series, *args, how="full"):
from databricks.koalas.base import IndexOpsMixin
def align_diff_series(func, this_series: "Series", *args, how: str = "full") -> "Series":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering whether we should keep align_diff_series here or move to series.py since it is only for Series.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe we can move it, but I'd leave it to the future PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

@xinrong-meng xinrong-meng self-requested a review December 10, 2020 22:12
@ueshin
Copy link
Collaborator Author

ueshin commented Dec 10, 2020

Thanks! I'd merge this now. Please feel free to leave comments if any. @HyukjinKwon @itholic

@ueshin ueshin merged commit a68717d into databricks:master Dec 10, 2020
@ueshin ueshin deleted the index_operations branch December 10, 2020 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants