[SPARK-40376][PYTHON] Avoid Numpy deprecation warning #37817

ELHoussineT · 2022-09-07T11:14:42Z

What changes were proposed in this pull request?

Use bool instead of np.bool as np.bool will be deprecated (see: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations)

Using np.bool generates this warning:

UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
3070E                     `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
3071E                   Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Why are the changes needed?

Deprecation soon: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations.

Does this PR introduce any user-facing change?

The warning will be suppressed

How was this patch tested?

Existing tests should suffice.

srowen · 2022-09-07T13:34:02Z

Seems fine, though is this just an alias for previous versions of numpy that are currently supported too?

ELHoussineT · 2022-09-07T14:01:43Z

Seems fine, though is this just an alias for previous versions of numpy that are currently supported too?

Correct.

itholic · 2022-09-08T02:21:18Z

Can we add a [PYTHON] tag for the title ?

Also check the https://spark.apache.org/contributing.html out when you find some time.

itholic · 2022-09-08T02:23:52Z

Would you check the "Workflow run detection failed" in https://github.com/apache/spark/pull/37817/checks?check_run_id=8226916282 for enabling Github Actions ??

itholic · 2022-09-08T02:25:50Z

I think How was this patch tested? in the PR description should contain how do we verify this patch within Apache Spark code base.

In this case, we can just simply mention like: Using the existing test for example.

itholic · 2022-09-08T02:32:18Z

The PR description usually started from "What changes were proposed in this pull request?"

So, can we put the description

"""
Using np.bool generates this warning:

UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
3070E                     `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
3071E                   Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

"""

into "What changes were proposed in this pull request?" ??

e.g.

What changes were proposed in this pull request?

Using np.bool generates this warning:

UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
3070E                     `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
3071E                   Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Use bool instead of np.bool as np.bool will be deprecated (see: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations)

Why are the changes needed?

...

itholic · 2022-09-08T02:33:06Z

Looks fine otherwise.

Thanks for the your first contribution to Apache Spark!

ELHoussineT · 2022-09-08T09:13:00Z

@itholic

Can we add a [PYTHON] tag for the title ?

Done

Also check the https://spark.apache.org/contributing.html out when you find some time.

I am sorry, you are right.

Would you check the "Workflow run detection failed" in https://github.com/apache/spark/pull/37817/checks?check_run_id=8226916282 for enabling Github Actions ??

I did that and the build in my fork went through: https://github.com/ELHoussineT/spark/actions/workflows/build_main.yml
But the actions in the PR are still red, thoughts?

I think How was this patch tested? in the PR description should contain how do we verify this patch within Apache Spark code base.

Updated.

So, can we put the description into "What changes were proposed in this pull request?" ??

Done

Thanks for the your first contribution to Apache Spark!

Its a tiny one! You're welcome :)

srowen · 2022-09-08T23:10:46Z

Hm, try pushing an empty commit?

AmplabJenkins · 2022-09-09T11:16:21Z

Can one of the admins verify this patch?

srowen · 2022-09-09T13:14:56Z

Oh, remove the type ignore comment:

annotations failed mypy checks:
python/pyspark/sql/pandas/conversion.py:298: error: unused "type: ignore" comment
Found 1 error in 1 file (checked 339 source files)

srowen · 2022-09-12T12:39:46Z

Ping @ELHoussineT

ELHoussineT · 2022-09-12T14:33:02Z

@srowen Sorry for the late reply.

Updated, let's see if it will go through.

srowen · 2022-09-13T01:46:58Z

Merged to master

itholic · 2022-09-13T02:44:16Z

Cool !

joaoleveiga · 2023-02-22T12:32:50Z

Hello all. Is this only going to be released in PySpark 3.4?

I looked at the branch-3.3 code and failed to see this change, if I saw correctly.

Thanks

srowen · 2023-02-22T14:36:12Z

It's in 3.4, not 3.3, yes: https://github.com/apache/spark/blob/v3.4.0-rc1/python/pyspark/sql/pandas/conversion.py#L301
See the JIRA

aimtsou · 2023-02-24T21:21:32Z

@srowen: Although this is causing an issue:

If you try to build your own docker image of Spark including pyspark while trying to be compliant with Databricks you will observe that Databricks Runtime 12.1 and 12.2(which is currently in beta), both support officially until Spark 3.3.1 (while current version is 3.3.2).

Actually all of the LTS versions in the support matrix are not EOLed and since numpy 1.20.0 was released in 01/2021, which means that most spark compliant versions carry this bug. If you try to use Pandas by using toPandas() you end up with the numpy error, consequently being blocked from upgrading your spark versions.

Is there any chance of back-porting this commit into previous pyspark versions?

srowen · 2023-02-24T21:29:27Z

This is just a deprecation warning, not an error, right? I don't see a particular urgency here.
I don't think this is related to Databricks, particularly, either - Databricks can do what it likes with patches, etc. It will have a runtime based on 3.4 shortly after it's released.

aimtsou · 2023-02-24T21:39:16Z

Hi @srowen,

Thank you for your very prompt reply.

You are not correct about the error, after 1.20.0 it creates an attribute error

attr = 'bool'

    def __getattr__(attr):
        # Warn for expired attributes, and return a dummy function
        # that always raises an exception.
        import warnings
        try:
            msg = __expired_functions__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
    
            def _expired(*args, **kwds):
                raise RuntimeError(msg)
    
            return _expired
    
        # Emit warnings for deprecated attributes
        try:
            val, msg = __deprecated_attrs__[attr]
        except KeyError:
            pass
        else:
            warnings.warn(msg, DeprecationWarning, stacklevel=2)
            return val
    
        if attr in __future_scalars__:
            # And future warnings for those that will change, but also give
            # the AttributeError
            warnings.warn(
                f"In the future `np.{attr}` will be defined as the "
                "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
    
        if attr in __former_attrs__:
>           raise AttributeError(__former_attrs__[attr])
E           AttributeError: module 'numpy' has no attribute 'bool'.
E           `np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
E           The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
E               https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

/usr/local/lib/python3.9/site-packages/numpy/__init__.py:305: AttributeError

This is the end of an error, coming after calling the function toPandas() from my tests:

/usr/local/lib/python3.9/site-packages/<my-pkg>/unit/test_case_runner.py:26: in run_test
    self.assert_df_are_equal(expected_df, actual)
/usr/local/lib/python3.9/site-packages/<my-pkg>/unit/test_case_runner.py:58: in assert_df_are_equal
    self.handler.compare_df(result, expected, config=self.compare_config)
/usr/local/lib/python3.9/site-packages/<my-pkg>/spark_test_handler.py:38: in compare_df
    actual_pd = actual.toPandas().sort_values(by=sort_columns, ignore_index=True)
/usr/local/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:216: in toPandas
    pandas_type = PandasConversionMixin._to_corrected_pandas_type(field.dataType)
/usr/local/lib/python3.9/site-packages/pyspark/sql/pandas/conversion.py:298: in _to_corrected_pandas_type
    return np.bool  # type: ignore[attr-defined]

And the error comes from the numpy in the system which gets called by the _to_corrected_pandas_type function inside pyspark.

I agree about the comments on databricks but as shown above this does not work on Spark 3.3.1 without taking into consideration if you want to be compliant with Databricks.

srowen · 2023-02-24T21:46:36Z

Well, I think we're talking about numpy 1.20 here, not >1.20. You're correct that you therefore would not use the latest versions of numpy with Spark 3.3, but would work with 3.4. If that presents a significant problem during the lifetime of Spark 3.3, sure I think that's a decent argument to back-port. Do you know what version of numpy actually removed this ? if it not a recent removal, yeah I think we should back port this simple change

aimtsou · 2023-02-24T22:11:14Z

Yes we agree that users can limit their numpy system installation to < 1.20.0, if they use Spark 3.3

I will have to check and test the different versions but I believe according to the notes from numpy, should be from the numpy 1.20.0 1,2. I will have to verify it to be sure though.

Well numpy 1.20.0 was released in 01/2021 which makes it 2 year old but the final decision is up to you.

srowen · 2023-02-24T22:26:42Z

Looks like 1.22 removed it actually. That's still not recent. Yeah I think this is worth back porting

### What changes were proposed in this pull request? Use `bool` instead of `np.bool` as `np.bool` will be deprecated (see: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations) Using `np.bool` generates this warning: ``` UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation. 3070E `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. 3071E Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` ### Why are the changes needed? Deprecation soon: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations. ### Does this PR introduce _any_ user-facing change? The warning will be suppressed ### How was this patch tested? Existing tests should suffice. Closes #37817 from ELHoussineT/patch-1. Authored-by: ELHoussineT <[email protected]> Signed-off-by: Sean Owen <[email protected]>

srowen · 2023-02-25T16:52:38Z

Also merged to 3.3

aimtsou · 2023-02-25T19:41:18Z

Thank you @srowen, really appreciated

joaoleveiga · 2023-02-25T22:24:11Z

Also merged to 3.3

Thank you so much! Here I was assuming I would pick up this thread on monday but you delivered it 😄

Cheers

…ypes ### Problem description Numpy has started changing the alias to some of its data-types. This means that users with the latest version of numpy they will face either warnings or errors according to the type that they are using. This affects all the users using numoy > 1.20.0 One of the types was fixed back in September with this [pull](#37817) request [numpy 1.24.0](numpy/numpy#22607): The scalar type aliases ending in a 0 bit size: np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0 as well as np.bool8 are now deprecated and will eventually be removed. [numpy 1.20.0](numpy/numpy#14882): Using the aliases of builtin types like np.int is deprecated ### What changes were proposed in this pull request? From numpy 1.20.0 we receive a deprecattion warning on np.object(https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations) and from numpy 1.24.0 we received an attribute error: ``` attr = 'object' def __getattr__(attr): # Warn for expired attributes, and return a dummy function # that always raises an exception. import warnings try: msg = __expired_functions__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) def _expired(*args, **kwds): raise RuntimeError(msg) return _expired # Emit warnings for deprecated attributes try: val, msg = __deprecated_attrs__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) return val if attr in __future_scalars__: # And future warnings for those that will change, but also give # the AttributeError warnings.warn( f"In the future `np.{attr}` will be defined as the " "corresponding NumPy scalar.", FutureWarning, stacklevel=2) if attr in __former_attrs__: > raise AttributeError(__former_attrs__[attr]) E AttributeError: module 'numpy' has no attribute 'object'. E `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. E The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: E https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` From numpy version 1.24.0 we receive a deprecation warning on np.object0 and every np.datatype0 and np.bool8 >>> np.object0(123) <stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24) ### Why are the changes needed? The changes are needed so pyspark can be compatible with the latest numpy and avoid - attribute errors on data types being deprecated from version 1.20.0: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations - warnings on deprecated data types from version 1.24.0: https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations ### Does this PR introduce _any_ user-facing change? The change will suppress the warning coming from numpy 1.24.0 and the error coming from numpy 1.22.0 ### How was this patch tested? I assume that the existing tests should catch this. (see all section Extra questions) I found this to be a problem in my work's project where we use for our unit tests the toPandas() function to convert to np.object. Attaching the run result of our test: ``` _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ /usr/local/lib/python3.9/dist-packages/<my-pkg>/unit/spark_test.py:64: in run_testcase self.handler.compare_df(result, expected, config=self.compare_config) /usr/local/lib/python3.9/dist-packages/<my-pkg>/spark_test_handler.py:38: in compare_df actual_pd = actual.toPandas().sort_values(by=sort_columns, ignore_index=True) /usr/local/lib/python3.9/dist-packages/pyspark/sql/pandas/conversion.py:232: in toPandas corrected_dtypes[index] = np.object # type: ignore[attr-defined] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ attr = 'object' def __getattr__(attr): # Warn for expired attributes, and return a dummy function # that always raises an exception. import warnings try: msg = __expired_functions__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) def _expired(*args, **kwds): raise RuntimeError(msg) return _expired # Emit warnings for deprecated attributes try: val, msg = __deprecated_attrs__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) return val if attr in __future_scalars__: # And future warnings for those that will change, but also give # the AttributeError warnings.warn( f"In the future `np.{attr}` will be defined as the " "corresponding NumPy scalar.", FutureWarning, stacklevel=2) if attr in __former_attrs__: > raise AttributeError(__former_attrs__[attr]) E AttributeError: module 'numpy' has no attribute 'object'. E `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. E The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: E https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations /usr/local/lib/python3.9/dist-packages/numpy/__init__.py:305: AttributeError ``` Although i cannot provide the code doing in python the following should show the problem: ``` >>> import numpy as np >>> np.object0(123) <stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24) 123 >>> np.object(123) <stdin>:1: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar. Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.9/dist-packages/numpy/__init__.py", line 305, in __getattr__ raise AttributeError(__former_attrs__[attr]) AttributeError: module 'numpy' has no attribute 'object'. `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` I do not have a use-case in my tests for np.object0 but I fixed like the suggestion from numpy ### Supported Versions: I propose this fix to be included in all pyspark 3.3 and onwards ### JIRA I know a JIRA ticket should be created I sent an email and I am waiting for the answer to document the case also there. ### Extra questions: By grepping for np.bool and np.object I see that the tests include them. Shall we change them also? Data types with _ I think they are not affected. ``` git grep np.object python/pyspark/ml/functions.py: return data.dtype == np.object_ and isinstance(data.iloc[0], (np.ndarray, list)) python/pyspark/ml/functions.py: return any(data.dtypes == np.object_) and any( python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[1], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[4], np.object) # datetime.date python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[1], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[6], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[7], np.object) git grep np.bool python/docs/source/user_guide/pandas_on_spark/types.rst:np.bool BooleanType python/pyspark/pandas/indexing.py: isinstance(key, np.bool_) for key in cols_sel python/pyspark/pandas/tests/test_typedef.py: np.bool: (np.bool, BooleanType()), python/pyspark/pandas/tests/test_typedef.py: bool: (np.bool, BooleanType()), python/pyspark/pandas/typedef/typehints.py: elif tpe in (bool, np.bool_, "bool", "?"): python/pyspark/sql/connect/expressions.py: assert isinstance(value, (bool, np.bool_)) python/pyspark/sql/connect/expressions.py: elif isinstance(value, np.bool_): python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[2], np.bool) python/pyspark/sql/tests/test_functions.py: (np.bool_, [("true", "boolean")]), ``` If yes concerning bool was merged already should we fix it too? Closes #40220 from aimtsou/numpy-patch. Authored-by: Aimilios Tsouvelekakis <[email protected]> Signed-off-by: Sean Owen <[email protected]>

…ypes ### Problem description Numpy has started changing the alias to some of its data-types. This means that users with the latest version of numpy they will face either warnings or errors according to the type that they are using. This affects all the users using numoy > 1.20.0 One of the types was fixed back in September with this [pull](#37817) request [numpy 1.24.0](numpy/numpy#22607): The scalar type aliases ending in a 0 bit size: np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0 as well as np.bool8 are now deprecated and will eventually be removed. [numpy 1.20.0](numpy/numpy#14882): Using the aliases of builtin types like np.int is deprecated ### What changes were proposed in this pull request? From numpy 1.20.0 we receive a deprecattion warning on np.object(https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations) and from numpy 1.24.0 we received an attribute error: ``` attr = 'object' def __getattr__(attr): # Warn for expired attributes, and return a dummy function # that always raises an exception. import warnings try: msg = __expired_functions__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) def _expired(*args, **kwds): raise RuntimeError(msg) return _expired # Emit warnings for deprecated attributes try: val, msg = __deprecated_attrs__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) return val if attr in __future_scalars__: # And future warnings for those that will change, but also give # the AttributeError warnings.warn( f"In the future `np.{attr}` will be defined as the " "corresponding NumPy scalar.", FutureWarning, stacklevel=2) if attr in __former_attrs__: > raise AttributeError(__former_attrs__[attr]) E AttributeError: module 'numpy' has no attribute 'object'. E `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. E The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: E https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` From numpy version 1.24.0 we receive a deprecation warning on np.object0 and every np.datatype0 and np.bool8 >>> np.object0(123) <stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24) ### Why are the changes needed? The changes are needed so pyspark can be compatible with the latest numpy and avoid - attribute errors on data types being deprecated from version 1.20.0: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations - warnings on deprecated data types from version 1.24.0: https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations ### Does this PR introduce _any_ user-facing change? The change will suppress the warning coming from numpy 1.24.0 and the error coming from numpy 1.22.0 ### How was this patch tested? I assume that the existing tests should catch this. (see all section Extra questions) I found this to be a problem in my work's project where we use for our unit tests the toPandas() function to convert to np.object. Attaching the run result of our test: ``` _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ /usr/local/lib/python3.9/dist-packages/<my-pkg>/unit/spark_test.py:64: in run_testcase self.handler.compare_df(result, expected, config=self.compare_config) /usr/local/lib/python3.9/dist-packages/<my-pkg>/spark_test_handler.py:38: in compare_df actual_pd = actual.toPandas().sort_values(by=sort_columns, ignore_index=True) /usr/local/lib/python3.9/dist-packages/pyspark/sql/pandas/conversion.py:232: in toPandas corrected_dtypes[index] = np.object # type: ignore[attr-defined] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ attr = 'object' def __getattr__(attr): # Warn for expired attributes, and return a dummy function # that always raises an exception. import warnings try: msg = __expired_functions__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) def _expired(*args, **kwds): raise RuntimeError(msg) return _expired # Emit warnings for deprecated attributes try: val, msg = __deprecated_attrs__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) return val if attr in __future_scalars__: # And future warnings for those that will change, but also give # the AttributeError warnings.warn( f"In the future `np.{attr}` will be defined as the " "corresponding NumPy scalar.", FutureWarning, stacklevel=2) if attr in __former_attrs__: > raise AttributeError(__former_attrs__[attr]) E AttributeError: module 'numpy' has no attribute 'object'. E `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. E The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: E https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations /usr/local/lib/python3.9/dist-packages/numpy/__init__.py:305: AttributeError ``` Although i cannot provide the code doing in python the following should show the problem: ``` >>> import numpy as np >>> np.object0(123) <stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24) 123 >>> np.object(123) <stdin>:1: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar. Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.9/dist-packages/numpy/__init__.py", line 305, in __getattr__ raise AttributeError(__former_attrs__[attr]) AttributeError: module 'numpy' has no attribute 'object'. `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` I do not have a use-case in my tests for np.object0 but I fixed like the suggestion from numpy ### Supported Versions: I propose this fix to be included in all pyspark 3.3 and onwards ### JIRA I know a JIRA ticket should be created I sent an email and I am waiting for the answer to document the case also there. ### Extra questions: By grepping for np.bool and np.object I see that the tests include them. Shall we change them also? Data types with _ I think they are not affected. ``` git grep np.object python/pyspark/ml/functions.py: return data.dtype == np.object_ and isinstance(data.iloc[0], (np.ndarray, list)) python/pyspark/ml/functions.py: return any(data.dtypes == np.object_) and any( python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[1], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[4], np.object) # datetime.date python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[1], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[6], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[7], np.object) git grep np.bool python/docs/source/user_guide/pandas_on_spark/types.rst:np.bool BooleanType python/pyspark/pandas/indexing.py: isinstance(key, np.bool_) for key in cols_sel python/pyspark/pandas/tests/test_typedef.py: np.bool: (np.bool, BooleanType()), python/pyspark/pandas/tests/test_typedef.py: bool: (np.bool, BooleanType()), python/pyspark/pandas/typedef/typehints.py: elif tpe in (bool, np.bool_, "bool", "?"): python/pyspark/sql/connect/expressions.py: assert isinstance(value, (bool, np.bool_)) python/pyspark/sql/connect/expressions.py: elif isinstance(value, np.bool_): python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[2], np.bool) python/pyspark/sql/tests/test_functions.py: (np.bool_, [("true", "boolean")]), ``` If yes concerning bool was merged already should we fix it too? Closes #40220 from aimtsou/numpy-patch. Authored-by: Aimilios Tsouvelekakis <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit b3c26b8) Signed-off-by: Sean Owen <[email protected]>

…ypes ### Problem description Numpy has started changing the alias to some of its data-types. This means that users with the latest version of numpy they will face either warnings or errors according to the type that they are using. This affects all the users using numoy > 1.20.0 One of the types was fixed back in September with this [pull](apache#37817) request [numpy 1.24.0](numpy/numpy#22607): The scalar type aliases ending in a 0 bit size: np.object0, np.str0, np.bytes0, np.void0, np.int0, np.uint0 as well as np.bool8 are now deprecated and will eventually be removed. [numpy 1.20.0](numpy/numpy#14882): Using the aliases of builtin types like np.int is deprecated ### What changes were proposed in this pull request? From numpy 1.20.0 we receive a deprecattion warning on np.object(https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations) and from numpy 1.24.0 we received an attribute error: ``` attr = 'object' def __getattr__(attr): # Warn for expired attributes, and return a dummy function # that always raises an exception. import warnings try: msg = __expired_functions__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) def _expired(*args, **kwds): raise RuntimeError(msg) return _expired # Emit warnings for deprecated attributes try: val, msg = __deprecated_attrs__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) return val if attr in __future_scalars__: # And future warnings for those that will change, but also give # the AttributeError warnings.warn( f"In the future `np.{attr}` will be defined as the " "corresponding NumPy scalar.", FutureWarning, stacklevel=2) if attr in __former_attrs__: > raise AttributeError(__former_attrs__[attr]) E AttributeError: module 'numpy' has no attribute 'object'. E `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. E The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: E https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` From numpy version 1.24.0 we receive a deprecation warning on np.object0 and every np.datatype0 and np.bool8 >>> np.object0(123) <stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24) ### Why are the changes needed? The changes are needed so pyspark can be compatible with the latest numpy and avoid - attribute errors on data types being deprecated from version 1.20.0: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations - warnings on deprecated data types from version 1.24.0: https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations ### Does this PR introduce _any_ user-facing change? The change will suppress the warning coming from numpy 1.24.0 and the error coming from numpy 1.22.0 ### How was this patch tested? I assume that the existing tests should catch this. (see all section Extra questions) I found this to be a problem in my work's project where we use for our unit tests the toPandas() function to convert to np.object. Attaching the run result of our test: ``` _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ /usr/local/lib/python3.9/dist-packages/<my-pkg>/unit/spark_test.py:64: in run_testcase self.handler.compare_df(result, expected, config=self.compare_config) /usr/local/lib/python3.9/dist-packages/<my-pkg>/spark_test_handler.py:38: in compare_df actual_pd = actual.toPandas().sort_values(by=sort_columns, ignore_index=True) /usr/local/lib/python3.9/dist-packages/pyspark/sql/pandas/conversion.py:232: in toPandas corrected_dtypes[index] = np.object # type: ignore[attr-defined] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ attr = 'object' def __getattr__(attr): # Warn for expired attributes, and return a dummy function # that always raises an exception. import warnings try: msg = __expired_functions__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) def _expired(*args, **kwds): raise RuntimeError(msg) return _expired # Emit warnings for deprecated attributes try: val, msg = __deprecated_attrs__[attr] except KeyError: pass else: warnings.warn(msg, DeprecationWarning, stacklevel=2) return val if attr in __future_scalars__: # And future warnings for those that will change, but also give # the AttributeError warnings.warn( f"In the future `np.{attr}` will be defined as the " "corresponding NumPy scalar.", FutureWarning, stacklevel=2) if attr in __former_attrs__: > raise AttributeError(__former_attrs__[attr]) E AttributeError: module 'numpy' has no attribute 'object'. E `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. E The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: E https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations /usr/local/lib/python3.9/dist-packages/numpy/__init__.py:305: AttributeError ``` Although i cannot provide the code doing in python the following should show the problem: ``` >>> import numpy as np >>> np.object0(123) <stdin>:1: DeprecationWarning: `np.object0` is a deprecated alias for ``np.object0` is a deprecated alias for `np.object_`. `object` can be used instead. (Deprecated NumPy 1.24)`. (Deprecated NumPy 1.24) 123 >>> np.object(123) <stdin>:1: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar. Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.9/dist-packages/numpy/__init__.py", line 305, in __getattr__ raise AttributeError(__former_attrs__[attr]) AttributeError: module 'numpy' has no attribute 'object'. `np.object` was a deprecated alias for the builtin `object`. To avoid this error in existing code, use `object` by itself. Doing this will not modify any behavior and is safe. The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations ``` I do not have a use-case in my tests for np.object0 but I fixed like the suggestion from numpy ### Supported Versions: I propose this fix to be included in all pyspark 3.3 and onwards ### JIRA I know a JIRA ticket should be created I sent an email and I am waiting for the answer to document the case also there. ### Extra questions: By grepping for np.bool and np.object I see that the tests include them. Shall we change them also? Data types with _ I think they are not affected. ``` git grep np.object python/pyspark/ml/functions.py: return data.dtype == np.object_ and isinstance(data.iloc[0], (np.ndarray, list)) python/pyspark/ml/functions.py: return any(data.dtypes == np.object_) and any( python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[1], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[4], np.object) # datetime.date python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[1], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[6], np.object) python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[7], np.object) git grep np.bool python/docs/source/user_guide/pandas_on_spark/types.rst:np.bool BooleanType python/pyspark/pandas/indexing.py: isinstance(key, np.bool_) for key in cols_sel python/pyspark/pandas/tests/test_typedef.py: np.bool: (np.bool, BooleanType()), python/pyspark/pandas/tests/test_typedef.py: bool: (np.bool, BooleanType()), python/pyspark/pandas/typedef/typehints.py: elif tpe in (bool, np.bool_, "bool", "?"): python/pyspark/sql/connect/expressions.py: assert isinstance(value, (bool, np.bool_)) python/pyspark/sql/connect/expressions.py: elif isinstance(value, np.bool_): python/pyspark/sql/tests/test_dataframe.py: self.assertEqual(types[2], np.bool) python/pyspark/sql/tests/test_functions.py: (np.bool_, [("true", "boolean")]), ``` If yes concerning bool was merged already should we fix it too? Closes apache#40220 from aimtsou/numpy-patch. Authored-by: Aimilios Tsouvelekakis <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit b3c26b8) Signed-off-by: Sean Owen <[email protected]>

Avoid Numpy deprecation warning

0ec4ee9

ELHoussineT changed the title ~~Avoid Numpy deprecation warning~~ [SPARK-40376] Avoid Numpy deprecation warning Sep 7, 2022

github-actions bot added CORE PYTHON SQL labels Sep 7, 2022

ELHoussineT changed the title ~~[SPARK-40376] Avoid Numpy deprecation warning~~ [SPARK-40376][PYTHON] Avoid Numpy deprecation warning Sep 8, 2022

Merge branch 'apache:master' into patch-1

3007281

Merge branch 'apache:master' into patch-1

8596cbb

remove type ignore

674daee

Merge branch 'apache:master' into patch-1

9f43d44

srowen closed this in 12d984c Sep 13, 2022

AlenkaF mentioned this pull request Jan 16, 2023

[CI][Python] Nightly test for PySpark 3.2.0 fail with AttributeError on numpy.bool apache/arrow#33697

Closed

blrchen mentioned this pull request Jan 19, 2023

Resolve pyspark / numpy conflicts feathr-ai/feathr#992

Merged

2 tasks

sbrugman mentioned this pull request Feb 1, 2023

Numpy deprecations astral-sh/ruff#2455

Closed

aimtsou mentioned this pull request Feb 28, 2023

[SPARK-42647][PYTHON] Change alias for numpy deprecated and removed types #40220

Closed

[SPARK-40376][PYTHON] Avoid Numpy deprecation warning #37817

[SPARK-40376][PYTHON] Avoid Numpy deprecation warning #37817

Uh oh!

Conversation

ELHoussineT commented Sep 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

srowen commented Sep 7, 2022

Uh oh!

ELHoussineT commented Sep 7, 2022

Uh oh!

itholic commented Sep 8, 2022

Uh oh!

itholic commented Sep 8, 2022

Uh oh!

itholic commented Sep 8, 2022

Uh oh!

itholic commented Sep 8, 2022

What changes were proposed in this pull request?

Why are the changes needed?

Uh oh!

itholic commented Sep 8, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ELHoussineT commented Sep 8, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

srowen commented Sep 8, 2022

Uh oh!

AmplabJenkins commented Sep 9, 2022

Uh oh!

srowen commented Sep 9, 2022

Uh oh!

srowen commented Sep 12, 2022

Uh oh!

ELHoussineT commented Sep 12, 2022

Uh oh!

srowen commented Sep 13, 2022

Uh oh!

itholic commented Sep 13, 2022

Uh oh!

joaoleveiga commented Feb 22, 2023

Uh oh!

srowen commented Feb 22, 2023

Uh oh!

aimtsou commented Feb 24, 2023

Uh oh!

srowen commented Feb 24, 2023

Uh oh!

aimtsou commented Feb 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

srowen commented Feb 24, 2023

Uh oh!

aimtsou commented Feb 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

srowen commented Feb 24, 2023

Uh oh!

srowen commented Feb 25, 2023

Uh oh!

aimtsou commented Feb 25, 2023

Uh oh!

joaoleveiga commented Feb 25, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

ELHoussineT commented Sep 7, 2022 •

edited

Loading

itholic commented Sep 8, 2022 •

edited

Loading

ELHoussineT commented Sep 8, 2022 •

edited

Loading

aimtsou commented Feb 24, 2023 •

edited

Loading

aimtsou commented Feb 24, 2023 •

edited

Loading