
reset_index is super slow #1690

Closed
luistelmocosta opened this issue Jul 31, 2020 · 2 comments · Fixed by #1696
Labels
bug Something isn't working

Comments

@luistelmocosta

Hello, I am trying to load a dataset that does not have a particular index and I would like to know if my approach is correct.

I read that due to koalas index issues we should always specify the index, so I did it like this:

df = df_tmp.to_koalas(index_col=['A', 'B']) since A and B are the columns that make a row unique.

However, I need to drop the index because I will need these columns later, so I tried reset_index():

df = df.sort_values('Date').reset_index()

But this operation takes 32 minutes, which is completely infeasible.

My dataset comes from a CSV file:

df_tmp = spark.read.option("header", "true").csv(data_lake_path+"test.csv")

And I have already added this line:

ks.set_option('compute.default_index_type', 'distributed-sequence')

Any idea what I am doing wrong?
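
For reference, here is the full workflow from this report pieced together (a minimal sketch; `spark`, `data_lake_path`, and the column names `A`, `B`, and `Date` are taken from the snippets above):

```python
import databricks.koalas as ks

# Default index type recommended for avoiding single-node sequence computation.
ks.set_option('compute.default_index_type', 'distributed-sequence')

# Read the CSV with Spark and convert to Koalas, indexing by the
# columns that uniquely identify a row.
df_tmp = spark.read.option("header", "true").csv(data_lake_path + "test.csv")
df = df_tmp.to_koalas(index_col=['A', 'B'])

# Moving the index levels back into regular columns is the step
# reported to take ~32 minutes.
df = df.sort_values('Date').reset_index()
```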

@HyukjinKwon
Member

I think the root cause is that we don't support assigning an index to a column. I will make a quick fix.

```
>>> kdf = koalas.range(1000)
>>> kdf["col"] = kdf.index
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../koalas/databricks/koalas/frame.py", line 10075, in __setitem__
    kdf = self._assign({key: value})
  File "/.../koalas/databricks/koalas/frame.py", line 4301, in _assign
    "Column assignment doesn't support type " "{0}".format(type(v).__name__)
TypeError: Column assignment doesn't support type Index
```

HyukjinKwon added the bug label Aug 2, 2020
@ueshin
Collaborator

ueshin commented Aug 3, 2020

I'm fine with the fix at #1696, but I'm not sure it will solve the issue here.

If the index after reset_index() needs to have sequential values, the fix won't help.
Attaching the distributed-sequence default index could still be heavy; caching could help:

df = df.sort_values('Date').spark.cache().reset_index()

If the values of the index after reset_index() don't matter, you can just use the distributed default index when resetting:

with ks.option_context('compute.default_index_type', 'distributed'):
    df = df.reset_index()

Btw, if it's okay to leave the indices as they are, you can also use reset_index(drop=False), which won't cause the default index to be attached. Never mind, I misunderstood its usage.
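
Putting the two suggestions together (a sketch only, assuming sequential index values are not required; `Date` is the column from the report above):

```python
# Cache the sorted frame and reset the index under the cheaper
# 'distributed' default index instead of 'distributed-sequence'.
with ks.option_context('compute.default_index_type', 'distributed'):
    df = df.sort_values('Date').spark.cache().reset_index()
```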

HyukjinKwon added a commit that referenced this issue Aug 4, 2020
This PR proposes to allow assigning an index as a column:

```python
>>> kdf = koalas.range(3)
>>> kdf["col"] = kdf.index
>>> kdf
```

```
   id  col
0   0    0
1   1    1
2   2    2
```

Note that this is rather a band-aid fix. If the index is modified before the assignment, it currently doesn't work:

```python
>>> kdf["col"] = kdf.index + 1
>>> kdf
```
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../koalas/databricks/koalas/frame.py", line 9979, in __repr__
    pdf = self._get_or_create_repr_pandas_cache(max_display_count)
  File "/.../koalas/databricks/koalas/frame.py", line 9971, in _get_or_create_repr_pandas_cache
    self._repr_pandas_cache = {n: self.head(n + 1)._to_internal_pandas()}
  File "/.../koalas/databricks/koalas/frame.py", line 4985, in head
    sdf = self._internal.resolved_copy.spark_frame
  File "/.../koalas/databricks/koalas/utils.py", line 477, in wrapped_lazy_property
    setattr(self, attr_name, fn(self))
  File "/.../koalas/databricks/koalas/internal.py", line 789, in resolved_copy
    sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS))
  File "/.../python/pyspark/sql/dataframe.py", line 1401, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Resolved attribute(s) __index_level_0__#67 missing from __index_level_0__#36,id#33L,__natural_order__#41L in operator !Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]. Attribute(s) with the same name appear in the operation: __index_level_0__. Please check if the right attribute(s) are used.;;
!Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]
+- Project [__index_level_0__#36, id#33L, monotonically_increasing_id() AS __natural_order__#41L]
```

Looks like we should fix https://github.com/databricks/koalas/blob/master/databricks/koalas/indexes.py#L139-L144 and maybe think about changing it back to not copy its internal.

Resolves #1690