Make DataFrame and Series/Index manage the connection with each other. #1592

ueshin · 2020-06-18T01:07:05Z

Make DataFrame and Series/Index manage their connection with each other to support in-place updates.

Series and Index no longer manage their own InternalFrame; they refer to the anchor DataFrame and create an InternalFrame only when needed.

E.g.,

>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pser.fillna(0, inplace=True)
>>> pser
0    0.0
1    2.0
2    3.0
3    4.0
4    0.0
5    6.0
Name: x, dtype: float64
>>> pdf
     x    y
0  0.0  NaN
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  0.0  NaN
5  6.0  6.0

In pandas, pser.fillna(0, inplace=True) also updates pdf, whereas Koalas currently leaves kdf unchanged:

>>> kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> kser = kdf.x
>>> kser.fillna(0, inplace=True)
>>> kser
0    0.0
1    2.0
2    3.0
3    4.0
4    0.0
5    6.0
Name: x, dtype: float64
>>> kdf
     x    y
0  NaN  NaN
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0
4  NaN  NaN
5  6.0  6.0

Other examples:

An update to pser should be reflected in pdf.

>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pser.loc[2] = 30
>>> pser
0     NaN
1     2.0
2    30.0
3     4.0
4     NaN
5     6.0
Name: x, dtype: float64
>>> pdf
      x    y
0   NaN  NaN
1   2.0  2.0
2  30.0  3.0
3   4.0  4.0
4   NaN  NaN
5   6.0  6.0

An update to pdf should be reflected in pser.

>>> pdf = pd.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6], "y": [np.nan, 2, 3, 4, np.nan, 6]})
>>> pser = pdf.x
>>> pdf.loc[2, 'x'] = 30
>>> pser
0     NaN
1     2.0
2    30.0
3     4.0
4     NaN
5     6.0
Name: x, dtype: float64
>>> pdf
      x    y
0   NaN  NaN
1   2.0  2.0
2  30.0  3.0
3   4.0  4.0
4   NaN  NaN
5   6.0  6.0
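
The anchor idea described above can be illustrated with a small, self-contained sketch. This is not the Koalas implementation: `Frame`, `Column`, `set_value` and `values` are hypothetical stand-ins invented for illustration, and the sketch assumes only that a column keeps a reference to its parent frame and resolves its data on access instead of holding its own copy.

```python
import math
from typing import Dict, List


class Frame:
    """Hypothetical stand-in for a DataFrame: it owns the column data."""

    def __init__(self, data: Dict[str, List[float]]) -> None:
        self._data = {name: list(values) for name, values in data.items()}

    def __getitem__(self, name: str) -> "Column":
        # The column keeps a reference to this frame (its anchor), not a copy.
        return Column(anchor=self, name=name)

    def set_value(self, name: str, position: int, value: float) -> None:
        self._data[name][position] = value


class Column:
    """Hypothetical stand-in for a Series: resolved through its anchor."""

    def __init__(self, anchor: Frame, name: str) -> None:
        self._anchor = anchor
        self._name = name

    @property
    def values(self) -> List[float]:
        # Looked up on every access, so updates to the anchor are visible here.
        return self._anchor._data[self._name]

    def fillna(self, value: float) -> None:
        # In-place update: the result is written back through the anchor.
        self._anchor._data[self._name] = [
            value if math.isnan(v) else v for v in self.values
        ]


nan = float("nan")
frame = Frame({"x": [nan, 2.0, 3.0], "y": [nan, 2.0, 3.0]})
col = frame["x"]

col.fillna(0.0)                # update made through the column
frame.set_value("x", 2, 30.0)  # update made through the frame
```

After both updates, `col.values` and `frame["x"].values` agree, while column `y` is untouched. Resolving through the anchor on each access is one way to keep the two objects consistent without copying state between them.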

itholic · 2020-06-18T03:50:04Z

Hmm... I ran into the same problem with the iLocIndexer test before.

It resolved itself without any fix; all I did was push an empty commit.

Weird.

ueshin · 2020-06-18T03:50:43Z

@itholic Thanks for letting me know. I'm investigating the reason.

ueshin · 2020-06-18T07:06:41Z

Let me re-run tests, just in case.

HyukjinKwon · 2020-06-18T09:27:29Z

databricks/koalas/frame.py

+        for old_label, new_label in zip_longest(
+            self._internal.column_labels, internal.column_labels
+        ):
+            if old_label is not None:


@ueshin, can you add some comments here? It's a bit difficult to follow here ..

HyukjinKwon · 2020-06-18T11:46:39Z

databricks/koalas/frame.py

+        self._internal_frame = internal
+
+    @property
+    def _ksers(self):


@ueshin, shall we add some docstrings here too?

HyukjinKwon · 2020-06-18T12:10:12Z

databricks/koalas/frame.py

+        ):
+            if old_label is not None:
+                kser = self._ksers[old_label]
+                if old_label != new_label or (


Do we assume the anchor is different (1) if the position of a label was changed, or (2) if the label itself was changed? It might be best to add comments here.
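
For readers following the question, here is a minimal sketch of how `itertools.zip_longest` pairs old and new column labels by position. It only illustrates the two cases asked about. The helper name `labels_that_differ` is hypothetical, and the second clause of the real condition (cut off in the diff after `or (`) is not reproduced.

```python
from itertools import zip_longest
from typing import List, Sequence, Tuple

Label = Tuple[str, ...]


def labels_that_differ(
    old_labels: Sequence[Label], new_labels: Sequence[Label]
) -> List[Label]:
    """Return old labels whose positional counterpart in new_labels differs."""
    differing = []
    # zip_longest pads the shorter sequence with None, so every old label is
    # visited even when the new frame has fewer columns.
    for old_label, new_label in zip_longest(old_labels, new_labels):
        if old_label is None:
            continue  # column exists only in the new frame
        if old_label != new_label:
            differing.append(old_label)
    return differing


renamed = labels_that_differ([("x",), ("y",)], [("x",), ("w",)])
reordered = labels_that_differ([("x",), ("y",)], [("y",), ("x",)])
dropped = labels_that_differ([("x",), ("y",), ("z",)], [("x",), ("y",)])
added = labels_that_differ([("x",)], [("x",), ("y",)])
```

With a purely positional comparison, a renamed label, a label that moved to another position, and a dropped label all count as different, while a column that is only added to the new frame does not.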

HyukjinKwon · 2020-06-18T12:25:50Z

databricks/koalas/frame.py

-            result.name = key
+            result = first_series(
+                DataFrame(InternalFrame(spark_frame=sdf, index_map=None)).T
+            ).rename(key)


@ueshin, out of curiosity, was it changed only because of the style or because ser.name = key doesn't work?

When I was working on this part, ser.name = key didn't work.
We can revert it, but I think .rename(key) is more natural here.

HyukjinKwon · 2020-06-18T12:30:20Z

databricks/koalas/internal.py

@@ -812,12 +812,12 @@ def resolved_copy(self):
    def with_new_sdf(
        self, spark_frame: spark.DataFrame, data_columns: Optional[List[str]] = None
    ) -> "InternalFrame":
-        """ Copy the immutable _InternalFrame with the updates by the specified Spark DataFrame.
+        """ Copy the immutable InternalFrame with the updates by the specified Spark DataFrame.


Shall we just change:

databricks/koalas/frame.py: :type _internal: _InternalFrame
databricks/koalas/frame.py: The given label must be verified to exist in `_InternalFrame.column_labels`.
databricks/koalas/frame.py: `self._kser_for(label)` can be used with `_InternalFrame.column_labels`:
databricks/koalas/groupby.py: # TODO: deduplicate this logic with _InternalFrame.from_pandas

too while we're here?

HyukjinKwon · 2020-06-18T12:58:07Z

databricks/koalas/indexes.py

+    def _internal(self) -> InternalFrame:
+        internal = self._kdf._internal
+        return internal.copy(
+            spark_column=internal.index_spark_columns[0],


Should we maybe completely remove spark_column from InternalFrame, since Series and Index don't hold it directly anymore? Maybe in a separate PR ..

Yes, I'm planning a clean-up later.

HyukjinKwon · 2020-06-18T13:00:23Z

databricks/koalas/indexing.py

@@ -1276,15 +1291,15 @@ def _NotImplemented(description):

    @lazy_property
    def _internal(self):
-        internal = super(iLocIndexer, self)._internal
+        internal = super(iLocIndexer, self)._internal.resolved_copy


Shall we leave a short comment why it should use resolved_copy?

HyukjinKwon · 2020-06-18T13:10:15Z

databricks/koalas/indexing.py

-                internal.spark_frame.select(internal.spark_columns)
-            )
+        self._kdf._update_internal_frame(
+            self._kdf._internal.resolved_copy, requires_same_anchor=False


Can we also leave a comment why we should use resolved_copy?

HyukjinKwon · 2020-06-18T13:17:47Z

databricks/koalas/tests/test_series.py

+        # pser.name = None
+        # kser.name = None
+        # self.assertEqual(kser.name, None)
+        # self.assert_eq(kser, pser)


Hm .. so this case doesn't work anymore?

A Series without a name is not working properly now anyway.
We should revisit this later.

HyukjinKwon · 2020-06-18T13:39:43Z

databricks/koalas/spark/accessors.py

@@ -443,7 +443,9 @@ def cache(self):
        """
        from databricks.koalas.frame import CachedDataFrame

-        self._kdf._internal = self._kdf._internal.resolved_copy
+        self._kdf._update_internal_frame(
+            self._kdf._internal.resolved_copy, requires_same_anchor=False


When do we need to set requires_same_anchor=False?

HyukjinKwon · 2020-06-18T13:53:20Z

The approach looks fine in general but had some questions on the details.

HyukjinKwon · 2020-06-19T03:05:33Z

I am going to merge this to unblock the release.

itholic · 2020-06-21T05:47:06Z

databricks/koalas/base.py

+        pass
+
+    @property
+    def _kdf(self) -> DataFrame:


I have a question here!

In the IndexOpsMixin class, we have both _anchor and _kdf to indicate the DataFrame that corresponds to a Series or Index, and they are exactly the same.

Is there a special reason we have both properties even though they refer to the same object?

Fix inplace updates.

9873435

ueshin requested a review from HyukjinKwon June 18, 2020 01:07

ueshin added 3 commits June 17, 2020 18:13

Fix.

742eb3a

Try.

ffd6d2e

Fix.

86b8e95

ueshin added 3 commits June 17, 2020 22:47

Merge branch 'master' into inplace

d154ec9

Fix flakiness.

e0735fb

Test.

542be48

HyukjinKwon reviewed Jun 18, 2020

Address comments and small fixes.

529495e

HyukjinKwon merged commit 1b23012 into databricks:master Jun 19, 2020

ueshin deleted the inplace branch June 19, 2020 03:06

ueshin mentioned this pull request Jun 19, 2020

datarfame.loc question #1378

Closed

itholic reviewed Jun 21, 2020

ueshin mentioned this pull request Jan 15, 2021

Implement DataFrame.insert #1983

Merged

Yikun mentioned this pull request Apr 26, 2022

[SPARK-38946][PYTHON][PS] Generates a new dataframe instead of operating inplace in setitem apache/spark#36353

Closed
