
Conversation

@HyukjinKwon (Member) commented Jun 25, 2020

What changes were proposed in this pull request?

When you use floats in the index of a pandas DataFrame, createDataFrame produces a Spark DataFrame with wrong results, as below, when Arrow is enabled:

```bash
./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
```

```python
>>> import pandas as pd
>>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
+---+
|  a|
+---+
|  1|
|  1|
|  2|
+---+
```

This is because direct slicing selects by the index value, not by position, when the index contains floats:

```python
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
     a
2.0  1
3.0  2
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
     a
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
   a
4  3
```
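For reference (my own check, not part of the PR): the label-based behavior above is what `.loc` does explicitly, while `.iloc` is always positional:

```python
>>> import pandas as pd
>>> pdf = pd.DataFrame({'a': [1, 2, 3]}, index=[2., 3., 4.])
>>> pdf.loc[2:]   # label-based: every row whose label is >= 2
     a
2.0  1
3.0  2
4.0  3
>>> pdf.iloc[2:]  # positional: rows from position 2 onward
     a
4.0  3
```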

This PR proposes to explicitly use `iloc` to positionally slice when we create a DataFrame from a pandas DataFrame with Arrow enabled.

FWIW, I tried to investigate why direct slicing sometimes refers to the index value and sometimes to the positional index, but I stopped digging further after reading https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection:

> While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and `.iloc`.
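
For illustration, here is a minimal, self-contained sketch of the fixed batching logic (not the actual Spark source; `parallelism` is a hypothetical stand-in for `self.sparkContext.defaultParallelism`, and Python 3's `range` replaces `xrange`):

```python
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3]}, index=[2., 3., 4.])
parallelism = 2  # hypothetical stand-in for self.sparkContext.defaultParallelism

# Ceiling division: round the batch size up so every row lands in a batch.
step = -(-len(pdf) // parallelism)

# iloc slices strictly by position, regardless of the index dtype;
# pdf[start:start + step] would slice by label on this float index.
pdf_slices = [pdf.iloc[start:start + step] for start in range(0, len(pdf), step)]

assert sum(len(s) for s in pdf_slices) == len(pdf)  # no rows lost or duplicated
```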

Why are the changes needed?

To create the correct Spark DataFrame from a pandas DataFrame without data loss.

Does this PR introduce any user-facing change?

Yes, it is a bug fix.

```bash
./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
```

```python
import pandas as pd
spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
```

Before:

```
+---+
|  a|
+---+
|  1|
|  1|
|  2|
+---+
```

After:

```
+---+
|  a|
+---+
|  1|
|  2|
|  3|
+---+
```

How was this patch tested?

Manually tested, and a unit test was added.

@HyukjinKwon (Member, Author)

I think this should be ported back through branch-2.4 ...

@SparkQA commented Jun 25, 2020

Test build #124513 has finished for PR 28928 at commit 612426e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin (Member) left a comment

LGTM.

@BryanCutler (Member) left a comment

LGTM. I don't know all the ins and outs of using different kinds of index types, but `iloc` is the preferred way to select by position now.

BryanCutler pushed a commit that referenced this pull request Jun 25, 2020
…ct slicing in createDataFrame with Arrow

When you use floats as the index of a pandas DataFrame, createDataFrame produces a Spark DataFrame with wrong results, as below, when Arrow is enabled:

```bash
./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
```

```python
>>> import pandas as pd
>>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
+---+
|  a|
+---+
|  1|
|  1|
|  2|
+---+
```

This is because direct slicing selects by the index value, not by position, when the index contains floats:

```python
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
     a
2.0  1
3.0  2
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
     a
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
   a
4  3
```

This PR proposes to explicitly use `iloc` to positionally slice when we create a DataFrame from a pandas DataFrame with Arrow enabled.

FWIW, I tried to investigate why direct slicing sometimes refers to the index value and sometimes to the positional index, but I stopped digging further after reading https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection:

> While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and `.iloc`.

To create the correct Spark DataFrame from a pandas DataFrame without data loss.

Yes, it is a bug fix.

```bash
./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
```
```python
import pandas as pd
spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
```

Before:

```
+---+
|  a|
+---+
|  1|
|  1|
|  2|
+---+
```

After:

```
+---+
|  a|
+---+
|  1|
|  2|
|  3|
+---+
```

Manually tested, and a unit test was added.

Closes #28928 from HyukjinKwon/SPARK-32098.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
(cherry picked from commit 1af19a7)
Signed-off-by: Bryan Cutler <[email protected]>
BryanCutler pushed a commit that referenced this pull request Jun 25, 2020
…ct slicing in createDataFrame with Arrow

@BryanCutler (Member)

merged to master, branch-3.0 and branch-2.4

@HyukjinKwon (Member, Author)

Thank you @BryanCutler and @ueshin!

```diff
 # Slice the DataFrame to be batched
 step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
-pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step))
+pdf_slices = (pdf.iloc[start:start + step] for start in xrange(0, len(pdf), step))
```
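
As a side note on the unchanged line above, `-(-len(pdf) // ...)` is the usual ceiling-division idiom in Python; a quick illustration (not from the PR):

```python
def ceil_div(n, k):
    # Floor-dividing the negation rounds the quotient up.
    return -(-n // k)

assert (ceil_div(3, 2), ceil_div(4, 2), ceil_div(5, 2)) == (2, 2, 3)
```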
Member

Thank you for fixing this!

> While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and `.iloc`.

Is it the only place?

Member Author

As far as I can tell, yes.

@HyukjinKwon HyukjinKwon deleted the SPARK-32098 branch July 27, 2020 07:44