|  | 
|  | 1 | +# PDEP-6: Ban upcasting in setitem-like operations | 
|  | 2 | + | 
|  | 3 | +- Created: 23 December 2022 | 
|  | 4 | +- Status: Draft | 
|  | 5 | +- Discussion: [#50402](https://github.com/pandas-dev/pandas/pull/50402) | 
|  | 6 | +- Author: [Marco Gorelli](https://github.com/MarcoGorelli) ([original issue](https://github.com/pandas-dev/pandas/issues/39584) by [Joris Van den Bossche](https://github.com/jorisvandenbossche)) | 
|  | 7 | +- Revision: 1 | 
|  | 8 | + | 
|  | 9 | +## Abstract | 
|  | 10 | + | 
|  | 11 | +The suggestion is that setitem-like operations would | 
|  | 12 | +not change a ``Series``' dtype. | 
|  | 13 | + | 
|  | 14 | +Current behaviour: | 
|  | 15 | +```python | 
|  | 16 | +In [1]: ser = pd.Series([1, 2, 3], dtype='int64') | 
|  | 17 | + | 
|  | 18 | +In [2]: ser[2] = 'potage' | 
|  | 19 | + | 
|  | 20 | +In [3]: ser  # dtype changed to 'object'! | 
|  | 21 | +Out[3]: | 
|  | 22 | +0         1 | 
|  | 23 | +1         2 | 
|  | 24 | +2    potage | 
|  | 25 | +dtype: object | 
|  | 26 | +``` | 
|  | 27 | + | 
|  | 28 | +Suggested behaviour: | 
|  | 29 | + | 
|  | 30 | +```python | 
|  | 31 | +In [1]: ser = pd.Series([1, 2, 3]) | 
|  | 32 | + | 
|  | 33 | +In [2]: ser[2] = 'potage'  # raises! | 
|  | 34 | +--------------------------------------------------------------------------- | 
|  | 35 | +TypeError: Invalid value 'potage' for dtype int64 | 
|  | 36 | +``` | 
|  | 37 | + | 
|  | 38 | +## Motivation and Scope | 
|  | 39 | + | 
|  | 40 | +Currently, pandas is extremely flexible in handling different dtypes. | 
|  | 41 | +However, this can potentially hide bugs, break user expectations, and unnecessarily copy data. | 
|  | 42 | + | 
|  | 43 | +An example of it hiding a bug is: | 
|  | 44 | +```python | 
|  | 45 | +In [9]: ser = pd.Series(pd.date_range('2000', periods=3)) | 
|  | 46 | + | 
|  | 47 | +In [10]: ser[2] = '2000-01-04'  # works, is converted to datetime64 | 
|  | 48 | + | 
|  | 49 | +In [11]: ser[2] = '2000-01-04x'  # almost certainly a typo - but pandas doesn't error, it upcasts to object | 
|  | 50 | +``` | 
|  | 51 | + | 
|  | 52 | +The scope of this PDEP is limited to setitem-like operations which would operate inplace, such as: | 
|  | 53 | +- ``ser[0] == 2``; | 
|  | 54 | +- ``ser.fillna(0, inplace=True)``; | 
|  | 55 | +- ``ser.where(ser.isna(), 0, inplace=True)`` | 
|  | 56 | + | 
|  | 57 | +There may be more. What is explicitly excluded from this PDEP is any operation would have no change | 
|  | 58 | +of operating inplace to begin with, such as: | 
|  | 59 | +- ``ser.diff()``; | 
|  | 60 | +- ``pd.concat([ser, other])``; | 
|  | 61 | +- ``ser.mean()``. | 
|  | 62 | + | 
|  | 63 | +These would keep being allowed to change Series' dtypes. | 
|  | 64 | + | 
|  | 65 | +## Detailed description | 
|  | 66 | + | 
|  | 67 | +Concretely, the suggestion is: | 
|  | 68 | +- if a ``Series`` is of a given dtype, then a ``setitem``-like operation should not change its dtype; | 
|  | 69 | +- if a ``setitem``-like operation would previously have changed a ``Series``' dtype, it would now raise. | 
|  | 70 | + | 
|  | 71 | +For a start, this would involve: | 
|  | 72 | + | 
|  | 73 | +1. changing ``Block.setitem`` such that it doesn't have an ``except`` block in | 
|  | 74 | + | 
|  | 75 | +  ```python | 
|  | 76 | +  value = extract_array(value, extract_numpy=True) | 
|  | 77 | +  try: | 
|  | 78 | +      casted = np_can_hold_element(values.dtype, value) | 
|  | 79 | +  except LossySetitemError: | 
|  | 80 | +      # current dtype cannot store value, coerce to common dtype | 
|  | 81 | +      nb = self.coerce_to_target_dtype(value) | 
|  | 82 | +      return nb.setitem(indexer, value) | 
|  | 83 | +  else: | 
|  | 84 | +  ``` | 
|  | 85 | + | 
|  | 86 | +2. making a similar change in ``Block.where``, ``EABlock.setitem``, ``EABlock.where``, and probably more places. | 
|  | 87 | + | 
|  | 88 | +The above would already require several hundreds of tests to be adjusted. | 
|  | 89 | + | 
|  | 90 | +### Ban upcasting altogether, or just upcasting to ``object``? | 
|  | 91 | + | 
|  | 92 | +The trickiest part of this proposal concerns what to do when setting a float in an integer column: | 
|  | 93 | + | 
|  | 94 | +```python | 
|  | 95 | +In [1]: ser = pd.Series([1, 2, 3]) | 
|  | 96 | + | 
|  | 97 | +In [2]: ser[0] = 1.5 | 
|  | 98 | +``` | 
|  | 99 | + | 
|  | 100 | +Possibly options could be: | 
|  | 101 | +1. just raise; | 
|  | 102 | +2. convert the float value to ``int``, preserve the Series' dtype; | 
|  | 103 | +3. upcast to ``float``, even if upcasting in setitem-like is banned for other conversions. | 
|  | 104 | + | 
|  | 105 | +Let us compare with what other libraries do: | 
|  | 106 | +- ``numpy``: option 2 | 
|  | 107 | +- ``cudf``: option 2 | 
|  | 108 | +- ``polars``: option 2 | 
|  | 109 | +- ``R data.frame``: option 3 | 
|  | 110 | +- ``pandas`` (nullable dtype): option 1 | 
|  | 111 | + | 
|  | 112 | +If the objective of this PDEP is to prevent bugs, then option 2 is also not desirable: | 
|  | 113 | +someone might set ``1.5`` and later be surprised to learn that they actually set ``1``. | 
|  | 114 | + | 
|  | 115 | +Option ``3`` would be inconsistent with the nullable dtypes' behaviour, would add complexity | 
|  | 116 | +to the codebase and to tests, and would be confusing to teach. | 
|  | 117 | + | 
|  | 118 | +Option ``1`` is the maximally safe one in terms of protecting users from bugs, and would | 
|  | 119 | +also be consistent with the current behaviour of nullable dtypes. It would also be simple to teach: | 
|  | 120 | +"if you try to set an element of a ``Series`` to a new value, then that value must be compatible | 
|  | 121 | +with the Series' dtype, otherwise it will raise" is easy to understand. If we make an exception for | 
|  | 122 | +``int`` to ``float`` (and presumably also for ``interval[int]``, ``interval[float]``), then the rule | 
|  | 123 | +starts to become confusing. | 
|  | 124 | + | 
|  | 125 | +## Usage and Impact | 
|  | 126 | + | 
|  | 127 | +This would make pandas stricter, so there should not be any risk of introducing bugs. If anything, this would help prevent bugs. | 
|  | 128 | + | 
|  | 129 | +Unfortunately, it would also risk annoy users who might have been intentionally upcasting. | 
|  | 130 | + | 
|  | 131 | +Given that users can get around this as simply as with an ``.astype({'my_column': float})`` call, | 
|  | 132 | +I think it would be more beneficial to the community at large to err on the side of strictness. | 
|  | 133 | + | 
|  | 134 | +## Timeline | 
|  | 135 | + | 
|  | 136 | +Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0. | 
|  | 137 | + | 
|  | 138 | +### PDEP History | 
|  | 139 | + | 
|  | 140 | +- 23 December 2022: Initial draft | 
0 commit comments