
reset_index is super slow #1690

Closed
luistelmocosta opened this issue Jul 31, 2020 · 2 comments · Fixed by #1696
Labels
bug Something isn't working

Comments

@luistelmocosta

Hello, I am trying to load a dataset that does not have a particular index and I would like to know if my approach is correct.

I read that due to koalas index issues we should always specify the index, so I did it like this:

df = df_tmp.to_koalas(index_col=['A', 'B']) since A and B are the columns that make a row unique.

However, I need to drop the index because I will need these columns later, so I tried reset_index():

df = df.sort_values('Date').reset_index()

But this operation takes 32 minutes, which is completely infeasible.

My dataset comes from a CSV file:

df_tmp = spark.read.option("header", "true").csv(data_lake_path+"test.csv")

And I have already added this line:

ks.set_option('compute.default_index_type', 'distributed-sequence')

Any idea what I am doing wrong?
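
For reference, here is the full workflow from this report pieced together (a minimal sketch; `spark`, `data_lake_path`, and the column names `A`, `B`, and `Date` are taken from the snippets above):

```python
import databricks.koalas as ks

# Default index type recommended for avoiding single-node sequence computation.
ks.set_option('compute.default_index_type', 'distributed-sequence')

# Read the CSV with Spark and convert to Koalas, indexing by the
# columns that uniquely identify a row.
df_tmp = spark.read.option("header", "true").csv(data_lake_path + "test.csv")
df = df_tmp.to_koalas(index_col=['A', 'B'])

# Moving the index levels back into regular columns is the step
# reported to take ~32 minutes.
df = df.sort_values('Date').reset_index()
```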

@HyukjinKwon
Member

I think the root cause is that we don't support assigning an index to a column. I will make a quick fix.

```
>>> kdf = koalas.range(1000)
>>> kdf["col"] = kdf.index
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../koalas/databricks/koalas/frame.py", line 10075, in __setitem__
    kdf = self._assign({key: value})
  File "/.../koalas/databricks/koalas/frame.py", line 4301, in _assign
    "Column assignment doesn't support type " "{0}".format(type(v).__name__)
TypeError: Column assignment doesn't support type Index
```

HyukjinKwon added the bug label Aug 2, 2020
@ueshin
Collaborator

ueshin commented Aug 3, 2020

I'm fine with the fix at #1696, but I'm not sure it will solve the issue here.

If the index after reset_index() needs to have sequential values, the fix won't help.
Attaching the distributed-sequence default index could still be heavy; caching could help:

df = df.sort_values('Date').spark.cache().reset_index()

If the values of the index after reset_index() don't matter, you can just use the distributed default index when resetting:

with ks.option_context('compute.default_index_type', 'distributed'):
    df = df.reset_index()

Btw, if it's okay to leave the indices as they are, you can also use reset_index(drop=False), which won't cause the default index to be attached. Never mind, I misunderstood its usage.
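
Putting the two suggestions together (a sketch only, assuming sequential index values are not required; `Date` is the column from the report above):

```python
# Cache the sorted frame and reset the index under the cheaper
# 'distributed' default index instead of 'distributed-sequence'.
with ks.option_context('compute.default_index_type', 'distributed'):
    df = df.sort_values('Date').spark.cache().reset_index()
```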

HyukjinKwon added a commit that referenced this issue Aug 4, 2020
This PR proposes to allow assigning an index as a column:

```python
>>> kdf = koalas.range(3)
>>> kdf["col"] = kdf.index
>>> kdf
```

```
   id  col
0   0    0
1   1    1
2   2    2
```

Note that this is rather a band-aid fix. If the index is modified before the assignment, it currently doesn't work:

```python
>>> kdf["col"] = kdf.index + 1
>>> kdf
```
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../koalas/databricks/koalas/frame.py", line 9979, in __repr__
    pdf = self._get_or_create_repr_pandas_cache(max_display_count)
  File "/.../koalas/databricks/koalas/frame.py", line 9971, in _get_or_create_repr_pandas_cache
    self._repr_pandas_cache = {n: self.head(n + 1)._to_internal_pandas()}
  File "/.../koalas/databricks/koalas/frame.py", line 4985, in head
    sdf = self._internal.resolved_copy.spark_frame
  File "/.../koalas/databricks/koalas/utils.py", line 477, in wrapped_lazy_property
    setattr(self, attr_name, fn(self))
  File "/.../koalas/databricks/koalas/internal.py", line 789, in resolved_copy
    sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS))
  File "/.../python/pyspark/sql/dataframe.py", line 1401, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Resolved attribute(s) __index_level_0__#67 missing from __index_level_0__#36,id#33L,__natural_order__#41L in operator !Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]. Attribute(s) with the same name appear in the operation: __index_level_0__. Please check if the right attribute(s) are used.;;
!Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]
+- Project [__index_level_0__#36, id#33L, monotonically_increasing_id() AS __natural_order__#41L]
```

Looks like we should fix https://github.com/databricks/koalas/blob/master/databricks/koalas/indexes.py#L139-L144 and maybe think about changing it back to not copy its internal.

Resolves #1690