Allow assigning index as a column #1696

Merged 1 commit on Aug 4, 2020


HyukjinKwon (Member) commented Aug 2, 2020

This PR proposes to allow assigning an index as a column:

>>> kdf = koalas.range(3)
>>> kdf["col"] = kdf.index
>>> kdf
   id  col
0   0    0
1   1    1
2   2    2

Note that this is only a bandaid fix. Assigning a modified index does not currently work:

>>> kdf["col"] = kdf.index + 1
>>> kdf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../koalas/databricks/koalas/frame.py", line 9979, in __repr__
    pdf = self._get_or_create_repr_pandas_cache(max_display_count)
  File "/.../koalas/databricks/koalas/frame.py", line 9971, in _get_or_create_repr_pandas_cache
    self._repr_pandas_cache = {n: self.head(n + 1)._to_internal_pandas()}
  File "/.../koalas/databricks/koalas/frame.py", line 4985, in head
    sdf = self._internal.resolved_copy.spark_frame
  File "/.../koalas/databricks/koalas/utils.py", line 477, in wrapped_lazy_property
    setattr(self, attr_name, fn(self))
  File "/.../koalas/databricks/koalas/internal.py", line 789, in resolved_copy
    sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS))
  File "/.../python/pyspark/sql/dataframe.py", line 1401, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Resolved attribute(s) __index_level_0__#67 missing from __index_level_0__#36,id#33L,__natural_order__#41L in operator !Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]. Attribute(s) with the same name appear in the operation: __index_level_0__. Please check if the right attribute(s) are used.;;
!Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]
+- Project [__index_level_0__#36, id#33L, monotonically_increasing_id() AS __natural_order__#41L]
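One way to read the error above: Spark matches attributes by their numeric expression ID (the part after `#`), not by name, so `__index_level_0__#67` is not satisfied by the child's `__index_level_0__#36`. The following is a toy model of that check in plain Python, using the IDs from the plan in the traceback. The `Attribute` class and `missing_attributes` helper are hypothetical names made up for illustration; this is not Spark's or Koalas' actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Attribute:
    """Hypothetical stand-in for a column reference: a name plus a unique expression ID."""
    name: str
    expr_id: int

    def __str__(self) -> str:
        return f"{self.name}#{self.expr_id}"


def missing_attributes(references: List[Attribute], child_output: List[Attribute]) -> List[Attribute]:
    """Return the references the child plan does not produce, matching by ID rather than name."""
    available = {attr.expr_id for attr in child_output}
    return [attr for attr in references if attr.expr_id not in available]


# Output of the child Project in the plan from the traceback.
child_output = [
    Attribute("__index_level_0__", 36),
    Attribute("id", 33),
    Attribute("__natural_order__", 41),
]

# Attributes the failing Project reads; col#73 is an alias of __index_level_0__#67.
references = [
    Attribute("__index_level_0__", 36),
    Attribute("id", 33),
    Attribute("__index_level_0__", 67),
    Attribute("__natural_order__", 41),
]

missing = missing_attributes(references, child_output)
print([str(attr) for attr in missing])  # prints ['__index_level_0__#67']
```

The name `__index_level_0__` appears on both sides, which is why the message adds "Attribute(s) with the same name appear in the operation".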

It looks like we should fix https://github.com/databricks/koalas/blob/master/databricks/koalas/indexes.py#L139-L144 and perhaps consider changing it back so that it does not copy its internal frame.

Resolves #1690

@HyukjinKwon HyukjinKwon requested a review from ueshin August 2, 2020 06:08
itholic (Contributor) commented Aug 3, 2020

Seems fine to me as a quick fix for now. After that, it seems more tests will be needed.

@ueshin ueshin mentioned this pull request Aug 3, 2020
ueshin (Collaborator) left a comment

LGTM as a bandaid fix.

For kdf["col"] = kdf.index + 1, I guess assigning a modified index, or the result of a computation between indices, is not a simple issue. We will need to modify InternalFrame to store the change in index Spark columns.
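For reference, here is the plain pandas behavior for both cases discussed in this PR. It is only a sketch of the semantics, on the assumption that Koalas aims to mirror pandas here; it says nothing about how InternalFrame would need to change. The names `pdf` and `col_plus_one` are chosen for illustration.

```python
import pandas as pd

# Same shape as the example in the description: a single "id" column with a default index.
pdf = pd.DataFrame({"id": range(3)})

# Case 1: assigning the index itself as a column (what this PR enables in Koalas).
pdf["col"] = pdf.index

# Case 2: assigning a modified index (the case that still fails in Koalas after this PR).
pdf["col_plus_one"] = pdf.index + 1

print(pdf)
```

In pandas both assignments succeed, with `col` equal to the index values and `col_plus_one` offset by one.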

HyukjinKwon (Member, Author) commented

Thanks @itholic and @ueshin. Yes, I tried to fix it but ended up suggesting a bandaid fix. I'll merge this one for now.

@HyukjinKwon HyukjinKwon merged commit 1070cdc into databricks:master Aug 4, 2020
@HyukjinKwon HyukjinKwon deleted the index-assignment branch September 11, 2020 07:50
Development

Successfully merging this pull request may close these issues.

reset_index is super slow