Return actual values via toPandas. #2077
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #2077      +/-   ##
==========================================
- Coverage   95.27%   89.78%    -5.49%
==========================================
  Files          57       57
  Lines       13257    13187      -70
==========================================
- Hits        12630    11840     -790
- Misses        627     1347     +720
Continue to review full report at Codecov.
first_valid_row = (
    self._internal.spark_frame.filter(cond)
    .select(self._internal.index_spark_columns)
    .limit(1)
Can we mimic what we do in Koalas' DataFrame.head?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean we should add .orderBy(NATURAL_ORDER_COLUMN_NAME) instead of disabling Arrow execution?
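For reference, a sketch of what that variant could look like, rewriting the snippet under review. This is hypothetical, not the merged fix; it assumes NATURAL_ORDER_COLUMN_NAME comes from databricks.koalas.internal, and it places the orderBy before the select so the ordering column is still in scope:

```python
# Hypothetical sketch of the orderBy variant discussed above, not the merged fix.
# NATURAL_ORDER_COLUMN_NAME is assumed to come from databricks.koalas.internal.
from databricks.koalas.internal import NATURAL_ORDER_COLUMN_NAME

first_valid_row = (
    self._internal.spark_frame.filter(cond)
    .orderBy(NATURAL_ORDER_COLUMN_NAME)  # pin the natural row order before limiting
    .select(self._internal.index_spark_columns)
    .limit(1)
)
```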
I was just wondering ...
Hm, I still don't get why the order gets changed though.
The underlying plans differ between with and without Arrow, as do the execution methods. For example:
sdf = spark.range(10)
sdf.limit(5).toPandas()
- Without Arrow
collect()
-> plan.executeCollect().map(fromRow)
The plan is:
== Physical Plan ==
CollectLimit 5
+- *(1) Range (0, 10, step=1, splits=12)
There is no shuffle, and CollectLimitExec.executeCollect()
runs several Spark jobs, starting from the first partition, until the specified number of rows is collected. Thus the row order is preserved.
- With Arrow:
collectAsArrowToPython()
-> toArrowBatchRdd(plan)
-> plan.execute().mapPartitionsInternal ...
The plan is:
== Physical Plan ==
*(2) Project [id#0L AS col_0#12L]
+- *(2) GlobalLimit 5
+- Exchange SinglePartition, true, [id=#63]
+- *(1) LocalLimit 5
+- *(1) Range (0, 10, step=1, splits=12)
There is a shuffle to limit the number of rows, which does not preserve the row order, and plan.execute()
just returns the generated RDD as-is.
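For anyone who wants to see this directly, a minimal repro sketch. It assumes Spark 3.x, where the Arrow toggle is spark.sql.execution.arrow.pyspark.enabled (on Spark 2.x it is spark.sql.execution.arrow.enabled), and whether the reordering actually manifests depends on partitioning and the shuffle:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(10)

# Without Arrow: CollectLimit collects partitions in order, so rows come back 0..4.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
print(sdf.limit(5).toPandas())

# With Arrow: the shuffle-based GlobalLimit still returns 5 rows,
# but their order is not guaranteed.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
print(sdf.limit(5).toPandas())

# The physical plans quoted above can be inspected with:
sdf.limit(5).explain()
```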
Gotcha. We gotta fix it in Apache Spark, I guess. Anyway, thanks. LGTM
I am good with it as-is. Looks fine.
Thanks! Let me merge this now.
As sdf.toPandas() has special handling for the timestamp type, we should always use sdf.toPandas() when returning the actual values.
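A rough illustration of that special handling. This is a hedged sketch assuming Arrow-enabled Spark 3.x, where toPandas() renders TimestampType values in spark.sql.session.timeZone as tz-naive pandas timestamps, while collect() converts via the Python process's local time zone, so the two can disagree when the zones differ:

```python
import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "America/New_York")

sdf = spark.createDataFrame([(datetime.datetime(2021, 1, 1, 12, 0),)], ["ts"])

# toPandas() applies the session-time-zone handling for TimestampType,
# which is why this patch routes value extraction through it.
print(sdf.toPandas()["ts"][0])

# collect() bypasses that handling and converts using the local time zone.
print(sdf.collect()[0]["ts"])
```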