Conversation

@ELHoussineT
Contributor

@ELHoussineT ELHoussineT commented Sep 7, 2022

What changes were proposed in this pull request?

Use `bool` instead of `np.bool`, as `np.bool` is deprecated and slated for removal (see: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations)

Using `np.bool` generates this warning:

```
UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
3070E                     `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
3071E                   Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```
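
For reference, a minimal sketch of the kind of change involved, in the type-mapping helper that toPandas uses (simplified and hypothetical, not the verbatim diff):

```python
from pyspark.sql.types import BooleanType

def _to_corrected_pandas_type(dt):
    # Sketch of the BooleanType branch in pyspark/sql/pandas/conversion.py:
    # return the builtin bool instead of the deprecated np.bool alias.
    if isinstance(dt, BooleanType):
        return bool  # was: np.bool  # type: ignore[attr-defined]
    return None
```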

Why are the changes needed?

The `np.bool` alias is already deprecated and slated for removal: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations.

Does this PR introduce any user-facing change?

The warning will be suppressed

How was this patch tested?

Existing tests should suffice.

@ELHoussineT ELHoussineT changed the title Avoid Numpy deprecation warning [SPARK-40376] Avoid Numpy deprecation warning Sep 7, 2022
@srowen
Member

srowen commented Sep 7, 2022

Seems fine, though is this just an alias for previous versions of numpy that are currently supported too?

@ELHoussineT
Contributor Author

> Seems fine, though is this just an alias for previous versions of numpy that are currently supported too?

Correct.

@itholic
Contributor

itholic commented Sep 8, 2022

Can we add a [PYTHON] tag to the title?

Also, check out https://spark.apache.org/contributing.html when you find some time.

@itholic
Contributor

itholic commented Sep 8, 2022

Would you check the "Workflow run detection failed" message in https://github.com/apache/spark/pull/37817/checks?check_run_id=8226916282 to enable GitHub Actions?

@itholic
Contributor

itholic commented Sep 8, 2022

I think "How was this patch tested?" in the PR description should describe how we verify this patch within the Apache Spark code base.

In this case, we can simply say something like "Using the existing tests", for example.

@itholic
Contributor

itholic commented Sep 8, 2022

The PR description usually starts with "What changes were proposed in this pull request?"

So, can we put the description

"""
Using np.bool generates this warning:

UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
3070E                     `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
3071E                   Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

"""

into "What changes were proposed in this pull request?" ??

e.g.

What changes were proposed in this pull request?

Using np.bool generates this warning:

UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
3070E                     `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
3071E                   Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Use bool instead of np.bool as np.bool will be deprecated (see: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations)

Why are the changes needed?

...

@itholic
Contributor

itholic commented Sep 8, 2022

Looks fine otherwise.

Thanks for your first contribution to Apache Spark!

@ELHoussineT ELHoussineT changed the title [SPARK-40376] Avoid Numpy deprecation warning [SPARK-40376][PYTHON] Avoid Numpy deprecation warning Sep 8, 2022
@ELHoussineT
Contributor Author

ELHoussineT commented Sep 8, 2022

@itholic

> Can we add a [PYTHON] tag to the title?

Done.

> Also, check out https://spark.apache.org/contributing.html when you find some time.

I am sorry, you are right.

> Would you check the "Workflow run detection failed" message in https://github.com/apache/spark/pull/37817/checks?check_run_id=8226916282 to enable GitHub Actions?

I did that, and the build in my fork went through: https://github.com/ELHoussineT/spark/actions/workflows/build_main.yml
But the actions in this PR are still red. Thoughts?

> I think "How was this patch tested?" in the PR description should describe how we verify this patch within the Apache Spark code base.

Updated.

> So, can we put the description into the "What changes were proposed in this pull request?" section?

Done.

> Thanks for your first contribution to Apache Spark!

It's a tiny one! You're welcome :)

@srowen
Member

srowen commented Sep 8, 2022

Hm, try pushing an empty commit?

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Sep 9, 2022

Oh, remove the `type: ignore` comment:

```
annotations failed mypy checks:
python/pyspark/sql/pandas/conversion.py:298: error: unused "type: ignore" comment
Found 1 error in 1 file (checked 339 source files)
```
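
In other words, switching to the builtin makes the old suppression comment itself an error (the CI's mypy config evidently flags unused ignores). A before/after sketch of the line in question:

```python
def _to_corrected_pandas_type(dt):
    # Before this follow-up fix, mypy reported on the line below:
    #     error: unused "type: ignore" comment
    #     return np.bool  # type: ignore[attr-defined]
    # After removing both the deprecated alias and the stale suppression:
    return bool
```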

@srowen
Member

srowen commented Sep 12, 2022

Ping @ELHoussineT

@ELHoussineT
Contributor Author

@srowen Sorry for the late reply.

Updated, let's see if it will go through.

@srowen srowen closed this in 12d984c Sep 13, 2022
@srowen
Member

srowen commented Sep 13, 2022

Merged to master

@itholic
Contributor

itholic commented Sep 13, 2022

Cool!

@joaoleveiga

Hello all. Is this only going to be released in PySpark 3.4?

I looked at the branch-3.3 code and did not see this change there, unless I missed it.

Thanks

@srowen
Member

srowen commented Feb 22, 2023

@aimtsou
Contributor

aimtsou commented Feb 24, 2023

@srowen: This is causing an issue, though:

If you try to build your own Docker image of Spark including pyspark while staying compliant with Databricks, you will observe that Databricks Runtime 12.1 and 12.2 (the latter currently in beta) both officially support only up to Spark 3.3.1 (while the current version is 3.3.2).

None of the LTS versions in the support matrix are EOL yet, and numpy 1.20.0 was released in 01/2021, which means that most Databricks-compliant Spark versions carry this bug. If you use Pandas via toPandas() you end up with the numpy error, and are consequently blocked from upgrading your Spark version.

Is there any chance of back-porting this commit into previous pyspark versions?

@srowen
Member

srowen commented Feb 24, 2023

This is just a deprecation warning, not an error, right? I don't see a particular urgency here.
I don't think this is related to Databricks, particularly, either - Databricks can do what it likes with patches, etc. It will have a runtime based on 3.4 shortly after it's released.

@aimtsou
Contributor

aimtsou commented Feb 24, 2023

Hi @srowen,

Thank you for your very prompt reply.

You are not correct about the error; after 1.20.0 it eventually becomes an AttributeError:

```
attr = 'bool'

    def __getattr__(attr):
        # Warn for expired attributes, and return a dummy function
        # that always raises an exception.
        import warnings
        try:
            msg = __expired_functions__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
    
            def _expired(*args, **kwds):
                raise RuntimeError(msg)
    
            return _expired
    
        # Emit warnings for deprecated attributes
        try:
            val, msg = __deprecated_attrs__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            return val
    
        if attr in __future_scalars__:
            # And future warnings for those that will change, but also give
            # the AttributeError
            warnings.warn(
                f"In the future `np.{attr}` will be defined as the "
                "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
    
        if attr in __former_attrs__:
>           raise AttributeError(__former_attrs__[attr])
E           AttributeError: module 'numpy' has no attribute 'bool'.
E           `np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
E           The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
E               https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

/usr/local/lib/python3.9/site-packages/numpy/__init__.py:305: AttributeError
```

This is the tail of an error raised after calling the toPandas() function in my tests:

```
/usr/local/lib/python3.9/site-packages/<my-pkg>/unit/test_case_runner.py:26: in run_test
    self.assert_df_are_equal(expected_df, actual)
/usr/local/lib/python3.9/site-packages/<my-pkg>/unit/test_case_runner.py:58: in assert_df_are_equal
    self.handler.compare_df(result, expected, config=self.compare_config)
/usr/local/lib/python3.9/site-packages/<my-pkg>/spark_test_handler.py:38: in compare_df
    actual_pd = actual.toPandas().sort_values(by=sort_columns, ignore_index=True)
/usr/local/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:216: in toPandas
    pandas_type = PandasConversionMixin._to_corrected_pandas_type(field.dataType)
/usr/local/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:298: in _to_corrected_pandas_type
    return np.bool  # type: ignore[attr-defined]
```

The error comes from the numpy installed on the system, which gets called by the `_to_corrected_pandas_type` function inside pyspark.

I agree with the comments about Databricks, but as shown above this does not work on Spark 3.3.1, regardless of whether you want to be compliant with Databricks.
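
For anyone who wants to reproduce this outside of Spark, a minimal sketch (it assumes a numpy version new enough to have removed the alias):

```python
import numpy as np

print(np.__version__)
try:
    np.bool  # the removed alias raises AttributeError on recent numpy
except AttributeError as err:
    print(err)

# The replacement spellings suggested by numpy's message keep working:
assert np.dtype(bool) == np.dtype(np.bool_)
```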

@srowen
Copy link
Member

srowen commented Feb 24, 2023

Well, I think we're talking about numpy 1.20 here, not >1.20. You're correct that you therefore would not use the latest versions of numpy with Spark 3.3, but they would work with 3.4. If that presents a significant problem during the lifetime of Spark 3.3, sure, I think that's a decent argument to back-port. Do you know what version of numpy actually removed this? If it's not a recent removal, yeah, I think we should back-port this simple change.

@aimtsou
Contributor

aimtsou commented Feb 24, 2023

Yes, we agree that users can limit their numpy installation to < 1.20.0 if they use Spark 3.3.

I will have to check and test the different versions, but according to the notes from numpy I believe it should be numpy 1.20.0 [1][2]. I will have to verify it to be sure, though; a quick per-version probe is sketched below.
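
A sketch of such a per-version check:

```python
import numpy as np

# Prints True while the deprecated alias still exists (it also emits a
# DeprecationWarning on 1.20+), and False once numpy has removed it.
print(np.__version__, hasattr(np, "bool"))
```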

Well, numpy 1.20.0 was released in 01/2021, which makes it two years old, but the final decision is up to you.

@srowen
Member

srowen commented Feb 24, 2023

Looks like 1.22 removed it, actually. That's still not recent. Yeah, I think this is worth back-porting.

srowen pushed a commit that referenced this pull request Feb 25, 2023
### What changes were proposed in this pull request?

Use `bool` instead of `np.bool` as `np.bool` will be deprecated (see: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations)

Using `np.bool` generates this warning:

```
UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
3070E                     `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
3071E                   Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```

### Why are the changes needed?
Deprecation soon: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations.

### Does this PR introduce _any_ user-facing change?
The warning will be suppressed

### How was this patch tested?
Existing tests should suffice.

Closes #37817 from ELHoussineT/patch-1.

Authored-by: ELHoussineT <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
@srowen
Member

srowen commented Feb 25, 2023

Also merged to 3.3

@aimtsou
Contributor

aimtsou commented Feb 25, 2023

Thank you @srowen, really appreciated

@joaoleveiga

> Also merged to 3.3

Thank you so much! Here I was assuming I would pick up this thread on Monday, but you delivered it 😄

Cheers

srowen pushed a commit that referenced this pull request Mar 3, 2023
…ypes

### Problem description
Numpy has started removing the aliases for some of its data types. This means that users on the latest version of numpy will face either warnings or errors, depending on the type they are using. This affects all users on numpy > 1.20.0.
One of the types was fixed back in September with this [pull request](#37817)

[numpy 1.24.0](numpy/numpy#22607): The scalar type aliases ending in a 0 bit size: np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0 as well as np.bool8 are now deprecated and will eventually be removed.
[numpy 1.20.0](numpy/numpy#14882): Using the aliases of builtin types like np.int is deprecated

### What changes were proposed in this pull request?
From numpy 1.20.0 we receive a deprecation warning on np.object (https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations), and from numpy 1.24.0 we receive an attribute error:

```
attr = 'object'

    def __getattr__(attr):
        # Warn for expired attributes, and return a dummy function
        # that always raises an exception.
        import warnings
        try:
            msg = __expired_functions__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)

            def _expired(*args, **kwds):
                raise RuntimeError(msg)

            return _expired

        # Emit warnings for deprecated attributes
        try:
            val, msg = __deprecated_attrs__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            return val

        if attr in __future_scalars__:
            # And future warnings for those that will change, but also give
            # the AttributeError
            warnings.warn(
                f"In the future `np.{attr}` will be defined as the "
                "corresponding NumPy scalar.", FutureWarning, stacklevel=2)

        if attr in __former_attrs__:
>           raise AttributeError(__former_attrs__[attr])
E           AttributeError: module 'numpy' has no attribute 'object'.
E           `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
E           The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
E               https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```

From numpy version 1.24.0 we receive a deprecation warning on np.object0 (and on every np.<type>0 alias, as well as np.bool8):

```
>>> np.object0(123)
<stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead.  (Deprecated NumPy 1.24)`.  (Deprecated NumPy 1.24)
```

### Why are the changes needed?
The changes are needed so pyspark can be compatible with the latest numpy and avoid

- attribute errors on data types being deprecated from version 1.20.0: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
- warnings on deprecated data types from version 1.24.0: https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations

### Does this PR introduce _any_ user-facing change?
The change will suppress the warning coming from numpy 1.24.0 and the error coming from numpy 1.22.0

### How was this patch tested?
I assume that the existing tests should catch this (see also the "Extra questions" section).

I found this to be a problem in my work project, where our unit tests use the toPandas() function and end up hitting np.object. Attaching the run result of our test:

```

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.9/dist-packages/<my-pkg>/unit/spark_test.py:64: in run_testcase
    self.handler.compare_df(result, expected, config=self.compare_config)
/usr/local/lib/python3.9/dist-packages/<my-pkg>/spark_test_handler.py:38: in compare_df
    actual_pd = actual.toPandas().sort_values(by=sort_columns, ignore_index=True)
/usr/local/lib/python3.9/dist-packages/pyspark/sql/pandas/conversion.py:232: in toPandas
    corrected_dtypes[index] = np.object  # type: ignore[attr-defined]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

attr = 'object'

    def __getattr__(attr):
        # Warn for expired attributes, and return a dummy function
        # that always raises an exception.
        import warnings
        try:
            msg = __expired_functions__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)

            def _expired(*args, **kwds):
                raise RuntimeError(msg)

            return _expired

        # Emit warnings for deprecated attributes
        try:
            val, msg = __deprecated_attrs__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            return val

        if attr in __future_scalars__:
            # And future warnings for those that will change, but also give
            # the AttributeError
            warnings.warn(
                f"In the future `np.{attr}` will be defined as the "
                "corresponding NumPy scalar.", FutureWarning, stacklevel=2)

        if attr in __former_attrs__:
>           raise AttributeError(__former_attrs__[attr])
E           AttributeError: module 'numpy' has no attribute 'object'.
E           `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
E           The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
E               https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

/usr/local/lib/python3.9/dist-packages/numpy/__init__.py:305: AttributeError
```

Although I cannot provide the code, running the following in Python should show the problem:
```
>>> import numpy as np
>>> np.object0(123)
<stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead.  (Deprecated NumPy 1.24)`.  (Deprecated NumPy 1.24)
123
>>> np.object(123)
<stdin>:1: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/dist-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'object'.
`np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```

I do not have a use case in my tests for np.object0, but I fixed it following the suggestion from numpy.

### Supported Versions:
I propose this fix be included in pyspark 3.3 and onwards.

### JIRA
I know a JIRA ticket should be created; I sent an email and am waiting for an answer so I can document the case there as well.

### Extra questions:
By grepping for np.bool and np.object I see that the tests include them. Shall we change them also? Data types with a trailing _ are, I think, not affected.

```
git grep np.object
python/pyspark/ml/functions.py:        return data.dtype == np.object_ and isinstance(data.iloc[0], (np.ndarray, list))
python/pyspark/ml/functions.py:        return any(data.dtypes == np.object_) and any(
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[4], np.object)  # datetime.date
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[1], np.object)
python/pyspark/sql/tests/test_dataframe.py:                self.assertEqual(types[6], np.object)
python/pyspark/sql/tests/test_dataframe.py:                self.assertEqual(types[7], np.object)

git grep np.bool
python/docs/source/user_guide/pandas_on_spark/types.rst:np.bool       BooleanType
python/pyspark/pandas/indexing.py:            isinstance(key, np.bool_) for key in cols_sel
python/pyspark/pandas/tests/test_typedef.py:            np.bool: (np.bool, BooleanType()),
python/pyspark/pandas/tests/test_typedef.py:            bool: (np.bool, BooleanType()),
python/pyspark/pandas/typedef/typehints.py:    elif tpe in (bool, np.bool_, "bool", "?"):
python/pyspark/sql/connect/expressions.py:                assert isinstance(value, (bool, np.bool_))
python/pyspark/sql/connect/expressions.py:                elif isinstance(value, np.bool_):
python/pyspark/sql/tests/test_dataframe.py:        self.assertEqual(types[2], np.bool)
python/pyspark/sql/tests/test_functions.py:            (np.bool_, [("true", "boolean")]),
```

If yes, concerning bool, which was merged already: should we fix those too?
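
On the last question: the trailing-underscore scalar types are indeed unaffected; only the bare aliases go away. A small sanity check of the equivalences involved (a sketch, not the actual test changes):

```python
import numpy as np

# Valid on both old and new numpy: the builtins and the np.*_ scalar
# types describe the same dtypes the bare aliases used to.
assert np.dtype(bool) == np.dtype(np.bool_)
assert np.dtype(object) == np.dtype(np.object_)
assert np.dtype(int) == np.dtype(np.int_)
```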

Closes #40220 from aimtsou/numpy-patch.

Authored-by: Aimilios Tsouvelekakis <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
srowen pushed a commit that referenced this pull request Mar 3, 2023
srowen pushed a commit that referenced this pull request Mar 3, 2023
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023