Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make DataFrame and Series/Index manage the connection with each other. #1592

Merged
merged 8 commits into from
Jun 19, 2020

Conversation

ueshin
Copy link
Collaborator

@ueshin ueshin commented Jun 18, 2020

Making DataFrame and Series/Index manage the connection with each other to support inplace updates.

Series and Index don't manage its own InternalFrame anymore, basically they refer the anchor DataFrame and create the InternalFrame only when needed.

E.g.,

>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pser.fillna(0, inplace=True)
>>> pser
0    0.0
1    2.0
2    3.0
3    4.0
4    0.0
5    6.0
Name: x, dtype: float64
>>> pdf
     x    y
0  0.0  NaN
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  0.0  NaN
5  6.0  6.0

Here, pser.fillna(0, inplace=True) should also update pdf, whereas:

>>> kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> kser = kdf.x
>>> kser.fillna(0, inplace=True)
>>> kser
0    0.0
1    2.0
2    3.0
3    4.0
4    0.0
5    6.0
Name: x, dtype: float64
>>> kdf
     x    y
0  NaN  NaN
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  NaN  NaN
5  6.0  6.0

Other examples:

  • The update of pser should be refected to pdf.

    >>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
    >>> pser = pdf.x
    >>> pser.loc[2] = 30
    >>> pser
    0     NaN
    1     2.0
    2    30.0
    3     4.0
    4     NaN
    5     6.0
    Name: x, dtype: float64
    >>> pdf
          x    y
    0   NaN  NaN
    1   2.0  2.0
    2  30.0  3.0
    3   4.0  4.0
    4   NaN  NaN
    5   6.0  6.0
  • The update of pdf should be reflected to pser.

    >>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
    >>> pser = pdf.x
    >>> pdf.loc[2, 'x'] = 30
    >>> pser
    0     NaN
    1     2.0
    2    30.0
    3     4.0
    4     NaN
    5     6.0
    Name: x, dtype: float64
    >>> pdf
          x    y
    0   NaN  NaN
    1   2.0  2.0
    2  30.0  3.0
    3   4.0  4.0
    4   NaN  NaN
    5   6.0  6.0

@ueshin ueshin requested a review from HyukjinKwon June 18, 2020 01:07
@itholic
Copy link
Contributor

itholic commented Jun 18, 2020

hmm... I also have experienced same problem with iLocIndexer test before.

It was resolved automatically without any fix, all I did just push an empty commit.

weird.

@ueshin
Copy link
Collaborator Author

ueshin commented Jun 18, 2020

@itholic Thanks for letting me know. I'm investigating the reason.

@ueshin
Copy link
Collaborator Author

ueshin commented Jun 18, 2020

Let me re-run tests, just in case.

for old_label, new_label in zip_longest(
self._internal.column_labels, internal.column_labels
):
if old_label is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin, can you add some comments here? It's a bit difficult to follow here ..

self._internal_frame = internal

@property
def _ksers(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin, shell we add some dostrings here too?

):
if old_label is not None:
kser = self._ksers[old_label]
if old_label != new_label or (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we assume the anchor is different 1. if a position of a label was changed 2. if the label was changed (?). Might be best to add comments here ..

result.name = key
result = first_series(
DataFrame(InternalFrame(spark_frame=sdf, index_map=None)).T
).rename(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin, out of curiosity, was it changed only because of the style or because ser.name = key doesn't work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At that time I was working around here, ser.name = key didn't work.
We can revert it but I thinks .rename(key) is more natural here.

@@ -812,12 +812,12 @@ def resolved_copy(self):
def with_new_sdf(
self, spark_frame: spark.DataFrame, data_columns: Optional[List[str]] = None
) -> "InternalFrame":
""" Copy the immutable _InternalFrame with the updates by the specified Spark DataFrame.
""" Copy the immutable InternalFrame with the updates by the specified Spark DataFrame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we just change:

databricks/koalas/frame.py:    :type _internal: _InternalFrame
databricks/koalas/frame.py:        The given label must be verified to exist in `_InternalFrame.column_labels`.
databricks/koalas/frame.py:        `self._kser_for(label)` can be used with `_InternalFrame.column_labels`:
databricks/koalas/groupby.py:                # TODO: deduplicate this logic with _InternalFrame.from_pandas

too while we're here?

def _internal(self) -> InternalFrame:
internal = self._kdf._internal
return internal.copy(
spark_column=internal.index_spark_columns[0],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe completely remove spark_column in InternalFrame since Series and Index don't hold them directly anymore. Maybe in a separate pr ..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I'm planning a clean-up later.

@@ -1276,15 +1291,15 @@ def _NotImplemented(description):

@lazy_property
def _internal(self):
internal = super(iLocIndexer, self)._internal
internal = super(iLocIndexer, self)._internal.resolved_copy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we leave a short comment why it should use resolved_copy?

internal.spark_frame.select(internal.spark_columns)
)
self._kdf._update_internal_frame(
self._kdf._internal.resolved_copy, requires_same_anchor=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also leave a comment why we should use resolved_copy?

# pser.name = None
# kser.name = None
# self.assertEqual(kser.name, None)
# self.assert_eq(kser, pser)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm .. so this case doesn't work anymore?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seires without the name is not working properly now anyway.
We should revisit later.

@@ -443,7 +443,9 @@ def cache(self):
"""
from databricks.koalas.frame import CachedDataFrame

self._kdf._internal = self._kdf._internal.resolved_copy
self._kdf._update_internal_frame(
self._kdf._internal.resolved_copy, requires_same_anchor=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we need to set requires_same_anchor=False?

@HyukjinKwon
Copy link
Member

The approach looks fine in general but had some questions on the details.

@HyukjinKwon
Copy link
Member

I am going to merge this to unblock the release.

@HyukjinKwon HyukjinKwon merged commit 1b23012 into databricks:master Jun 19, 2020
@ueshin ueshin deleted the inplace branch June 19, 2020 03:06
@ueshin ueshin mentioned this pull request Jun 19, 2020
pass

@property
def _kdf(self) -> DataFrame:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some question here !

For IndexOpsMixin class, we have _anchor and _kdf to indicates DataFrame that corresponds with Series or Index, and they are exactly same.

Is there a special reason that we have both property even they indicates same object ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants