
Return actual values via toPandas. #2077

Merged · 9 commits · Mar 19, 2021

Conversation

@ueshin ueshin (Collaborator) commented Mar 3, 2021

Since sdf.toPandas() has special handling for the timestamp type, we should always use sdf.toPandas() when returning the actual values.
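As background on why that special handling matters: Spark stores timestamps internally as UTC and, on collection, renders them in the session time zone as tz-naive values. The sketch below (pandas only, no Spark required; the America/Los_Angeles zone is just an assumed session time zone for illustration) shows the kind of conversion a toPandas()-style path performs, and why bypassing it could yield different wall-clock values:

```python
import pandas as pd

# A UTC timestamp as Spark would hold it internally.
utc_value = pd.Timestamp("2021-03-03 00:00:00", tz="UTC")

# A toPandas()-style conversion: render in the session time zone
# (assumed here to be America/Los_Angeles, UTC-8 on this date),
# then drop the tz info so the result is tz-naive.
as_local_naive = utc_value.tz_convert("America/Los_Angeles").tz_localize(None)

print(as_local_naive)  # 2021-03-02 16:00:00
print(as_local_naive.tzinfo)  # None
```

Collecting the raw values without this conversion would surface the UTC wall-clock time instead, which is the discrepancy the PR avoids by always going through sdf.toPandas().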

@ueshin ueshin requested a review from HyukjinKwon March 5, 2021 19:32
@ueshin ueshin marked this pull request as ready for review March 6, 2021 00:37
@codecov-io codecov-io commented Mar 6, 2021

Codecov Report

Merging #2077 (f984efd) into master (557fb77) will decrease coverage by 5.48%.
The diff coverage is 96.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2077      +/-   ##
==========================================
- Coverage   95.27%   89.78%   -5.49%     
==========================================
  Files          57       57              
  Lines       13257    13187      -70     
==========================================
- Hits        12630    11840     -790     
- Misses        627     1347     +720     
Impacted Files Coverage Δ
databricks/koalas/tests/indexes/test_base.py 91.76% <94.11%> (-8.24%) ⬇️
databricks/koalas/utils.py 94.09% <95.00%> (-1.36%) ⬇️
databricks/koalas/frame.py 93.41% <100.00%> (-3.11%) ⬇️
databricks/koalas/generic.py 89.76% <100.00%> (-3.49%) ⬇️
databricks/koalas/indexes/base.py 97.10% <100.00%> (-0.20%) ⬇️
databricks/koalas/namespace.py 78.88% <100.00%> (-5.72%) ⬇️
databricks/koalas/series.py 95.68% <100.00%> (-1.18%) ⬇️
databricks/koalas/testing/utils.py 78.23% <100.00%> (-3.08%) ⬇️
databricks/koalas/plot/plotly.py 15.78% <0.00%> (-81.06%) ⬇️
...bricks/koalas/tests/plot/test_frame_plot_plotly.py 23.33% <0.00%> (-76.67%) ⬇️
... and 33 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

first_valid_row = (
self._internal.spark_frame.filter(cond)
.select(self._internal.index_spark_columns)
.limit(1)
Member:

Can we mimic what we do in Koalas' DataFrame.head?

Collaborator Author:

Do you mean we should add .orderBy(NATURAL_ORDER_COLUMN_NAME) instead of disabling Arrow execution?

Member:

I was just wondering ...

Member:

Hm, I still don't get why the order gets changed though.

Collaborator Author:

The underlying plans, as well as the execution methods, differ depending on whether Arrow is enabled:

e.g.,

sdf = spark.range(10)
sdf.limit(5).toPandas()

  • Without Arrow:

collect() -> plan.executeCollect().map(fromRow)

The plan is:

== Physical Plan ==
CollectLimit 5
+- *(1) Range (0, 10, step=1, splits=12)

There is no shuffle, and CollectLimitExec.executeCollect() runs several Spark jobs, starting from the first partition, until the specified number of rows has been collected. Thus the row order is preserved.


  • With Arrow:

collectAsArrowToPython() -> toArrowBatchRdd(plan) -> plan.execute().mapPartitionsInternal ...

The plan is:

== Physical Plan ==
*(2) Project [id#0L AS col_0#12L]
+- *(2) GlobalLimit 5
   +- Exchange SinglePartition, true, [id=#63]
      +- *(1) LocalLimit 5
         +- *(1) Range (0, 10, step=1, splits=12)

There is a shuffle to limit the number of rows, which does not preserve row order, and plan.execute() just returns the generated RDD as-is.
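The two execution strategies above can be simulated in plain Python (a toy model I'm sketching for illustration, not Spark code; partition contents and function names are made up):

```python
import random

# range(10) split across 5 partitions, as in the sdf = spark.range(10) example.
partitions = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

def collect_limit(parts, n):
    """Toy model of CollectLimitExec.executeCollect(): scan partitions in
    order, stopping once n rows are collected. Row order is preserved."""
    out = []
    for part in parts:
        for row in part:
            out.append(row)
            if len(out) == n:
                return out
    return out

def shuffled_limit(parts, n, rng):
    """Toy model of LocalLimit -> Exchange SinglePartition -> GlobalLimit:
    each partition is truncated locally, then the pieces arrive at the
    single reducer in arbitrary order, so the original order can be lost."""
    local = [part[:n] for part in parts]
    rng.shuffle(local)  # nondeterministic arrival order at the reducer
    merged = [row for part in local for row in part]
    return merged[:n]

print(collect_limit(partitions, 5))   # [0, 1, 2, 3, 4]
print(shuffled_limit(partitions, 5, random.Random(0)))  # some 5 rows of 0..9
```

The first path always yields the first n rows in order; the second yields a correct count of rows but with no ordering guarantee, which matches the behavior described for the Arrow plan.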

Member:

Gotcha. We gotta fix it in Apache Spark, I guess .. anyway, thanks. LGTM

@ueshin ueshin requested a review from HyukjinKwon March 17, 2021 18:06
@HyukjinKwon HyukjinKwon (Member) left a comment:

I am good with it as is. Looks fine.

@ueshin ueshin (Collaborator Author) commented Mar 19, 2021

Thanks! Let me merge this now.

@ueshin ueshin merged commit c8a3c88 into databricks:master Mar 19, 2021
@ueshin ueshin deleted the toPandas branch March 19, 2021 02:18