Return actual values via toPandas. #2077
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #2077      +/-   ##
==========================================
- Coverage   95.27%   89.78%    -5.49%
==========================================
  Files          57       57
  Lines       13257    13187      -70
==========================================
- Hits        12630    11840     -790
- Misses        627     1347     +720
Continue to review full report at Codecov.
first_valid_row = (
    self._internal.spark_frame.filter(cond)
    .select(self._internal.index_spark_columns)
    .limit(1)
Can we mimic what we do in Koalas' DataFrame.head?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we should add .orderBy(NATURAL_ORDER_COLUMN_NAME) instead of disabling Arrow execution?
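For reference, a sketch of what that variant could look like, rewriting the snippet under review. This is hypothetical, not the merged fix; it assumes NATURAL_ORDER_COLUMN_NAME comes from databricks.koalas.internal, and it places the orderBy before the select so the ordering column is still in scope:

```python
# Hypothetical sketch of the orderBy variant discussed above, not the merged fix.
# NATURAL_ORDER_COLUMN_NAME is assumed to come from databricks.koalas.internal.
from databricks.koalas.internal import NATURAL_ORDER_COLUMN_NAME

first_valid_row = (
    self._internal.spark_frame.filter(cond)
    .orderBy(NATURAL_ORDER_COLUMN_NAME)  # pin the natural row order before limiting
    .select(self._internal.index_spark_columns)
    .limit(1)
)
```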
I was just wondering ...
Hm, I still don't get why the order gets changed though.
The underlying plans differ between with and without Arrow, as do the execution methods. For example:
sdf = spark.range(10)
sdf.limit(5).toPandas()
- Without Arrow
collect()
-> plan.executeCollect().map(fromRow)
The plan is:
== Physical Plan ==
CollectLimit 5
+- *(1) Range (0, 10, step=1, splits=12)
There is no shuffle, and CollectLimitExec.executeCollect()
runs several Spark jobs, starting from the first partition, until the specified number of rows is collected. Thus the row order is preserved.
- With Arrow:
collectAsArrowToPython()
-> toArrowBatchRdd(plan)
-> plan.execute().mapPartitionsInternal ...
The plan is:
== Physical Plan ==
*(2) Project [id#0L AS col_0#12L]
+- *(2) GlobalLimit 5
+- Exchange SinglePartition, true, [id=#63]
+- *(1) LocalLimit 5
+- *(1) Range (0, 10, step=1, splits=12)
There is a shuffle to limit the number of rows, which does not preserve the row order, and plan.execute()
just returns the generated RDD as-is.
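For anyone who wants to see this directly, a minimal repro sketch. It assumes Spark 3.x, where the Arrow toggle is spark.sql.execution.arrow.pyspark.enabled (on Spark 2.x it is spark.sql.execution.arrow.enabled), and whether the reordering actually manifests depends on partitioning and the shuffle:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(10)

# Without Arrow: CollectLimit collects partitions in order, so rows come back 0..4.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
print(sdf.limit(5).toPandas())

# With Arrow: the shuffle-based GlobalLimit still returns 5 rows,
# but their order is not guaranteed.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
print(sdf.limit(5).toPandas())

# The physical plans quoted above can be inspected with:
sdf.limit(5).explain()
```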
Gotcha. We gotta fix it in Apache Spark, I guess. Anyway, thanks. LGTM
I am good with it as-is. Looks fine.
Thanks! Let me merge this now.
As sdf.toPandas() has special handling for the timestamp type, we should always use sdf.toPandas() when returning the actual values.
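A rough illustration of that special handling. This is a hedged sketch assuming Arrow-enabled Spark 3.x, where toPandas() renders TimestampType values in spark.sql.session.timeZone as tz-naive pandas timestamps, while collect() converts via the Python process's local time zone, so the two can disagree when the zones differ:

```python
import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "America/New_York")

sdf = spark.createDataFrame([(datetime.datetime(2021, 1, 1, 12, 0),)], ["ts"])

# toPandas() applies the session-time-zone handling for TimestampType,
# which is why this patch routes value extraction through it.
print(sdf.toPandas()["ts"][0])

# collect() bypasses that handling and converts using the local time zone.
print(sdf.collect()[0]["ts"])
```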