reset_index is super slow #1690
I think the root cause is that we don't support assigning the index to a column. I will make a quick fix.
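For context, `reset_index` is essentially "copy each index level into a data column, then replace the index". A minimal pandas sketch (toy data, not Koalas internals) of the assignment that was missing:

```python
import pandas as pd

pdf = pd.DataFrame({'x': [10, 20]}, index=pd.Index(['a', 'b'], name='k'))

out = pdf.copy()
out['k'] = out.index                          # the index-to-column assignment
out = out.reset_index(drop=True)[['k', 'x']]  # then drop the old index

assert out.equals(pdf.reset_index())          # matches pandas' reset_index
```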
---
I'm fine with the fix at #1696, but I'm not sure it will solve the issue here. Could you try caching before resetting the index:

```python
df = df.sort_values('Date').spark.cache().reset_index()
```

If the index after `reset_index` doesn't have to be a consecutive sequence, you can also use the `distributed` default index type:

```python
with ks.option_context('compute.default_index_type', 'distributed'):
    df = df.reset_index()
```
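To illustrate the trade-off, here is a rough sketch of the idea (not Koalas's exact internals): a `distributed-sequence` index needs an extra Spark job to make the ids consecutive across partitions, while a `distributed` index assigns ids locally per row, which is much cheaper but leaves gaps:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(10)

# 'distributed'-style index: computed per row with no extra scan of the
# data, but the ids have gaps between partitions.
sdf.withColumn('idx', monotonically_increasing_id()).show()

# 'distributed-sequence'-style index: consecutive ids, but zipWithIndex()
# launches an extra job to count the rows in each partition first.
rdd = sdf.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
spark.createDataFrame(rdd, ['id', 'idx']).show()
```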
---
This PR proposes to allow assigning an index as a column:

```python
>>> kdf = koalas.range(3)
>>> kdf["col"] = kdf.index
>>> kdf
```
```
   id  col
0   0    0
1   1    1
2   2    2
```

Note that this is rather a band-aid fix. If we apply an operation to the index, it doesn't currently work:

```python
>>> kdf["col"] = kdf.index + 1
>>> kdf
```
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../koalas/databricks/koalas/frame.py", line 9979, in __repr__
    pdf = self._get_or_create_repr_pandas_cache(max_display_count)
  File "/.../koalas/databricks/koalas/frame.py", line 9971, in _get_or_create_repr_pandas_cache
    self._repr_pandas_cache = {n: self.head(n + 1)._to_internal_pandas()}
  File "/.../koalas/databricks/koalas/frame.py", line 4985, in head
    sdf = self._internal.resolved_copy.spark_frame
  File "/.../koalas/databricks/koalas/utils.py", line 477, in wrapped_lazy_property
    setattr(self, attr_name, fn(self))
  File "/.../koalas/databricks/koalas/internal.py", line 789, in resolved_copy
    sdf = self.spark_frame.select(self.spark_columns + list(HIDDEN_COLUMNS))
  File "/.../python/pyspark/sql/dataframe.py", line 1401, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Resolved attribute(s) __index_level_0__#67 missing from __index_level_0__#36,id#33L,__natural_order__#41L in operator !Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]. Attribute(s) with the same name appear in the operation: __index_level_0__. Please check if the right attribute(s) are used.;;
!Project [__index_level_0__#36, id#33L, __index_level_0__#67 AS col#73, __natural_order__#41L]
+- Project [__index_level_0__#36, id#33L, monotonically_increasing_id() AS __natural_order__#41L]
```

Looks like we should fix https://github.com/databricks/koalas/blob/master/databricks/koalas/indexes.py#L139-L144 and maybe think about changing it back so it doesn't copy its internal.

Resolves #1690
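The failure mode in the traceback can be reproduced in plain PySpark (a hypothetical minimal repro, not taken from the PR): selecting a column whose attribute belongs to a different plan than the DataFrame it is selected from raises the same class of `AnalysisException`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(3)
df2 = spark.range(3)  # same schema, but a different plan with different attribute ids

# df2['id'] resolves against df2's plan, so selecting it from df1 fails with
# "Resolved attribute(s) id#... missing from id#...", analogous to the Koalas
# case where the copied index attribute no longer belongs to the frame's plan.
df1.select(df2['id'])  # raises pyspark.sql.utils.AnalysisException
```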
---
Hello, I am trying to load a dataset that does not have a natural index, and I would like to know if my approach is correct.

I read that, because of the Koalas index issues, we should always specify the index, so I did it like this, since `A` and `B` are the columns that make a row unique:

```python
df = df_tmp.to_koalas(index_col=['A', 'B'])
```
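For reference, my understanding of what `index_col` does, checked on a toy frame (the data here is made up):

```python
import databricks.koalas as ks

# A tiny Spark frame standing in for my real data.
sdf = spark.createDataFrame([(1, 'x', 0.5), (2, 'y', 1.5)], ['A', 'B', 'val'])

# With index_col, A and B become the Koalas index instead of Koalas
# generating a default index for the frame.
kdf = sdf.to_koalas(index_col=['A', 'B'])
print(kdf.index.names)  # ['A', 'B']
```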
However, I need to drop the index because I will need these columns later, so I tried `reset_index()`:
```python
df = df.sort_values('Date').reset_index()
```
But this operation takes 32 minutes, which is completely infeasible.
My dataset comes from a CSV file:

```python
df_tmp = spark.read.option("header", "true").csv(data_lake_path + "test.csv")
```
And I already added this line:

```python
ks.set_option('compute.default_index_type', 'distributed-sequence')
```
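For what it's worth, the option can be double-checked with `ks.get_option` (a quick sanity check, assuming the option was set in the same session):

```python
import databricks.koalas as ks

ks.set_option('compute.default_index_type', 'distributed-sequence')
assert ks.get_option('compute.default_index_type') == 'distributed-sequence'
```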
Any idea what I am doing wrong?