
Conversation

@HyukjinKwon (Member) commented Jun 25, 2020

What changes were proposed in this pull request?

When you use floats in the index of a pandas DataFrame, createDataFrame produces a Spark DataFrame with wrong results, as below, when Arrow is enabled:

```bash
./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
```

```python
>>> import pandas as pd
>>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
+---+
|  a|
+---+
|  1|
|  1|
|  2|
+---+
```

This is because direct slicing selects by the index value, not by position, when the index contains floats:

```python
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
     a
2.0  1
3.0  2
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
     a
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
   a
4  3
```
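For reference (my own check, not part of the PR): the label-based behavior above is what `.loc` does explicitly, while `.iloc` is always positional:

```python
>>> import pandas as pd
>>> pdf = pd.DataFrame({'a': [1, 2, 3]}, index=[2., 3., 4.])
>>> pdf.loc[2:]   # label-based: every row whose label is >= 2
     a
2.0  1
3.0  2
4.0  3
>>> pdf.iloc[2:]  # positional: rows from position 2 onward
     a
4.0  3
```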

This PR proposes to explicitly use `iloc` to positionally slice when we create a DataFrame from a pandas DataFrame with Arrow enabled.

FWIW, I tried to investigate why direct slicing sometimes refers to the index value and sometimes to the positional index, but I stopped digging further after reading https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection:

> While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and `.iloc`.
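
For illustration, here is a minimal, self-contained sketch of the fixed batching logic (not the actual Spark source; `parallelism` is a hypothetical stand-in for `self.sparkContext.defaultParallelism`, and Python 3's `range` replaces `xrange`):

```python
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3]}, index=[2., 3., 4.])
parallelism = 2  # hypothetical stand-in for self.sparkContext.defaultParallelism

# Ceiling division: round the batch size up so every row lands in a batch.
step = -(-len(pdf) // parallelism)

# iloc slices strictly by position, regardless of the index dtype;
# pdf[start:start + step] would slice by label on this float index.
pdf_slices = [pdf.iloc[start:start + step] for start in range(0, len(pdf), step)]

assert sum(len(s) for s in pdf_slices) == len(pdf)  # no rows lost or duplicated
```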

Why are the changes needed?

To create the correct Spark DataFrame from a pandas DataFrame without data loss.

Does this PR introduce any user-facing change?

Yes, it is a bug fix.

```bash
./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
```

```python
import pandas as pd
spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
```

Before:

```
+---+
|  a|
+---+
|  1|
|  1|
|  2|
+---+
```

After:

```
+---+
|  a|
+---+
|  1|
|  2|
|  3|
+---+
```

How was this patch tested?

Manually tested, and a unit test was added.

@HyukjinKwon (Member, Author)

I think this should be ported back through branch-2.4 ...

@SparkQA commented Jun 25, 2020

Test build #124513 has finished for PR 28928 at commit 612426e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin (Member) left a comment

LGTM.

@BryanCutler (Member) left a comment

LGTM. I don't know all the ins and outs of using different kinds of index types, but `iloc` is the preferred way to select by position now.

BryanCutler pushed a commit that referenced this pull request Jun 25, 2020
…ct slicing in createDataFrame with Arrow

When you use floats as the index of a pandas DataFrame, createDataFrame produces a Spark DataFrame with wrong results, as below, when Arrow is enabled:

```bash
./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
```

```python
>>> import pandas as pd
>>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
+---+
|  a|
+---+
|  1|
|  1|
|  2|
+---+
```

This is because direct slicing selects by the index value, not by position, when the index contains floats:

```python
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
     a
2.0  1
3.0  2
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
     a
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
   a
4  3
```

This PR proposes to explicitly use `iloc` to positionally slice when we create a DataFrame from a pandas DataFrame with Arrow enabled.

FWIW, I tried to investigate why direct slicing sometimes refers to the index value and sometimes to the positional index, but I stopped digging further after reading https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection:

> While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and `.iloc`.

To create the correct Spark DataFrame from a pandas DataFrame without data loss.

Yes, it is a bug fix.

```bash
./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
```
```python
import pandas as pd
spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
```

Before:

```
+---+
|  a|
+---+
|  1|
|  1|
|  2|
+---+
```

After:

```
+---+
|  a|
+---+
|  1|
|  2|
|  3|
+---+
```

Manually tested, and a unit test was added.

Closes #28928 from HyukjinKwon/SPARK-32098.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
(cherry picked from commit 1af19a7)
Signed-off-by: Bryan Cutler <[email protected]>
BryanCutler pushed a commit that referenced this pull request Jun 25, 2020
…ct slicing in createDataFrame with Arrow

@BryanCutler (Member)

merged to master, branch-3.0 and branch-2.4

@HyukjinKwon (Member, Author)

Thank you @BryanCutler and @ueshin!

```diff
 # Slice the DataFrame to be batched
 step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
-pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step))
+pdf_slices = (pdf.iloc[start:start + step] for start in xrange(0, len(pdf), step))
```
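
As a side note on the unchanged line above, `-(-len(pdf) // ...)` is the usual ceiling-division idiom in Python; a quick illustration (not from the PR):

```python
def ceil_div(n, k):
    # Floor-dividing the negation rounds the quotient up.
    return -(-n // k)

assert (ceil_div(3, 2), ceil_div(4, 2), ceil_div(5, 2)) == (2, 2, 3)
```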
Member

Thank you for fixing this!

> While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and `.iloc`.

Is it the only place?

Member Author

As far as I can tell, yes.

@HyukjinKwon HyukjinKwon deleted the SPARK-32098 branch July 27, 2020 07:44