Allow assigning index as a column #1696

HyukjinKwon · 2020-08-02T06:08:33Z

This PR proposes to allow assigning an index as a column:

>>> kdf = koalas.range(3)
>>> kdf["col"] = kdf.index
>>> kdf
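
For comparison, here is a minimal pandas sketch of the semantics this change targets. It assumes that Koalas is meant to match pandas here, and it uses a pandas frame with a single `id` column as a stand-in for `koalas.range(3)`; it is not taken from the PR itself.

```python
import pandas as pd

# Assumed stand-in for koalas.range(3): one "id" column with a default index.
pdf = pd.DataFrame({"id": range(3)})

# Assign the index as a regular column, as in the example above.
pdf["col"] = pdf.index

print(pdf)
```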

Note that this is rather a bandaid fix. It does not yet work when the index is modified before the assignment:

>>> kdf["col"] = kdf.index + 1
>>> kdf

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../koalas/databricks/koalas/frame.py", line 9979, in __repr__
    pdf = self._get_or_create_repr_pandas_cache(max_display_count)
  File "/.../koalas/databricks/koalas/frame.py", line 9971, in _get_or_create_repr_pandas_cache
    self._repr_pandas_cache = {n: self.head(n + 1)._to_internal_pandas()}
  File "/.../koalas/databricks/koalas/frame.py", line 4985, in head
    sdf = self._internal.resolved_copy.spark_frame
  File "/.../koalas/databricks/koalas/utils.py", line 477, in wrapped_lazy_property
    setattr(self, attr_name, fn(self))
  File "/.../koalas/databricks/koalas/internal.py", line 789, in resolved_copy
    sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS))
  File "/.../python/pyspark/sql/dataframe.py", line 1401, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Resolved attribute(s) __index_level_0__#67 missing from __index_level_0__#36,id#33L,__natural_order__#41L in operator !Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]. Attribute(s) with the same name appear in the operation: __index_level_0__. Please check if the right attribute(s) are used.;;
!Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]
+- Project [__index_level_0__#36, id#33L, monotonically_increasing_id() AS __natural_order__#41L]
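
The error above says that the new column refers to `__index_level_0__#67`, while the frame being projected only exposes `__index_level_0__#36`: same name, different attribute ID. The toy model below sketches that kind of check in plain Python. `Attr`, `ToyPlan`, and `new_attr` are hypothetical names made up for illustration and are not Spark or Koalas APIs. That the shifted index comes from a separately copied plan is an assumption.

```python
from dataclasses import dataclass
from itertools import count

_ids = count(1)  # stand-in for the numeric suffixes such as "#36" in the error


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int


def new_attr(name):
    """Create an attribute with a fresh ID (hypothetical helper)."""
    return Attr(name, next(_ids))


class ToyPlan:
    """Toy logical plan that only knows which attributes it outputs."""

    def __init__(self, output):
        self.output = list(output)

    def project(self, exprs):
        """exprs is a list of (referenced Attr, alias or None) pairs."""
        available = set(self.output)
        missing = [ref for ref, _ in exprs if ref not in available]
        if missing:
            names = ", ".join(f"{a.name}#{a.expr_id}" for a in missing)
            raise LookupError(f"Resolved attribute(s) {names} missing from child output")
        # Aliased expressions get a fresh attribute; plain references pass through.
        return ToyPlan([ref if alias is None else new_attr(alias) for ref, alias in exprs])


index = new_attr("__index_level_0__")
id_col = new_attr("id")
frame = ToyPlan([index, id_col])

# Case 1: the index itself is referenced, so it resolves against the frame.
ok = frame.project([(index, None), (id_col, None), (index, "col")])

# Case 2 (assumption): the shifted index belongs to a copied plan, so it
# carries the same name but a different ID and cannot be resolved.
shifted = new_attr("__index_level_0__")
error = None
try:
    frame.project([(index, None), (id_col, None), (shifted, "col")])
except LookupError as exc:
    error = str(exc)

print([a.name for a in ok.output])
print(error)
```

The first projection succeeds because it reuses the attribute the frame already exposes, while the second fails even though the names are identical.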

It looks like we should fix https://github.com/databricks/koalas/blob/master/databricks/koalas/indexes.py#L139-L144, and maybe think about changing it back so that it doesn't copy its internal frame.

Resolves #1690

itholic · 2020-08-03T02:11:39Z

This seems fine to me as a quick fix for now. After that, it seems more tests will be needed.

ueshin

LGTM as a bandaid fix.

For kdf["col"] = kdf.index + 1, I guess assigning a modified index, or the result of a computation between indices, is not a simple issue. We will need to modify InternalFrame to store the change in the index Spark columns.

HyukjinKwon · 2020-08-04T01:35:34Z

Thanks @itholic and @ueshin. Yes, I tried to fix it but ended up suggesting a bandaid fix. I'll merge this one for now.

Allow assining index as a column

c503d16

HyukjinKwon requested a review from ueshin August 2, 2020 06:08

ueshin mentioned this pull request Aug 3, 2020

reset_index is super slow #1690

Closed

ueshin reviewed Aug 3, 2020

HyukjinKwon merged commit 1070cdc into databricks:master Aug 4, 2020

HyukjinKwon deleted the index-assignment branch September 11, 2020 07:50
