| 
 | 1 | +# PDEP-6: Ban upcasting in setitem-like operations  | 
 | 2 | + | 
 | 3 | +- Created: 23 December 2022  | 
 | 4 | +- Status: Draft  | 
 | 5 | +- Discussion: [#50402](https://github.com/pandas-dev/pandas/pull/50402)  | 
 | 6 | +- Author: [Marco Gorelli](https://github.com/MarcoGorelli) ([original issue](https://github.com/pandas-dev/pandas/issues/39584) by [Joris Van den Bossche](https://github.com/jorisvandenbossche))  | 
 | 7 | +- Revision: 1  | 
 | 8 | + | 
 | 9 | +## Abstract  | 
 | 10 | + | 
 | 11 | +The suggestion is that setitem-like operations would  | 
 | 12 | +not change a ``Series``' dtype.  | 
 | 13 | + | 
 | 14 | +Current behaviour:  | 
 | 15 | +```python  | 
 | 16 | +In [1]: ser = pd.Series([1, 2, 3], dtype='int64')  | 
 | 17 | + | 
 | 18 | +In [2]: ser[2] = 'potage'  | 
 | 19 | + | 
 | 20 | +In [3]: ser  # dtype changed to 'object'!  | 
 | 21 | +Out[3]:  | 
 | 22 | +0         1  | 
 | 23 | +1         2  | 
 | 24 | +2    potage  | 
 | 25 | +dtype: object  | 
 | 26 | +```  | 
 | 27 | + | 
 | 28 | +Suggested behaviour:  | 
 | 29 | + | 
 | 30 | +```python  | 
 | 31 | +In [1]: ser = pd.Series([1, 2, 3])  | 
 | 32 | + | 
 | 33 | +In [2]: ser[2] = 'potage'  # raises!  | 
 | 34 | +---------------------------------------------------------------------------  | 
 | 35 | +TypeError: Invalid value 'potage' for dtype int64  | 
 | 36 | +```  | 
 | 37 | + | 
 | 38 | +## Motivation and Scope  | 
 | 39 | + | 
 | 40 | +Currently, pandas is extremely flexible in handling different dtypes.  | 
 | 41 | +However, this can potentially hide bugs, break user expectations, and unnecessarily copy data.  | 
 | 42 | + | 
 | 43 | +An example of it hiding a bug is:  | 
 | 44 | +```python  | 
 | 45 | +In [9]: ser = pd.Series(pd.date_range('2000', periods=3))  | 
 | 46 | + | 
 | 47 | +In [10]: ser[2] = '2000-01-04'  # works, is converted to datetime64  | 
 | 48 | + | 
 | 49 | +In [11]: ser[2] = '2000-01-04x'  # almost certainly a typo - but pandas doesn't error, it upcasts to object  | 
 | 50 | +```  | 
 | 51 | + | 
 | 52 | +The scope of this PDEP is limited to setitem-like operations which would operate inplace, such as:  | 
 | 53 | +- ``ser[0] == 2``;  | 
 | 54 | +- ``ser.fillna(0, inplace=True)``;  | 
 | 55 | +- ``ser.where(ser.isna(), 0, inplace=True)``  | 
 | 56 | + | 
 | 57 | +There may be more. What is explicitly excluded from this PDEP is any operation would have no change  | 
 | 58 | +of operating inplace to begin with, such as:  | 
 | 59 | +- ``ser.diff()``;  | 
 | 60 | +- ``pd.concat([ser, other])``;  | 
 | 61 | +- ``ser.mean()``.  | 
 | 62 | + | 
 | 63 | +These would keep being allowed to change Series' dtypes.  | 
 | 64 | + | 
 | 65 | +## Detailed description  | 
 | 66 | + | 
 | 67 | +Concretely, the suggestion is:  | 
 | 68 | +- if a ``Series`` is of a given dtype, then a ``setitem``-like operation should not change its dtype;  | 
 | 69 | +- if a ``setitem``-like operation would previously have changed a ``Series``' dtype, it would now raise.  | 
 | 70 | + | 
 | 71 | +For a start, this would involve:  | 
 | 72 | + | 
 | 73 | +1. changing ``Block.setitem`` such that it doesn't have an ``except`` block in  | 
 | 74 | + | 
 | 75 | +  ```python  | 
 | 76 | +  value = extract_array(value, extract_numpy=True)  | 
 | 77 | +  try:  | 
 | 78 | +      casted = np_can_hold_element(values.dtype, value)  | 
 | 79 | +  except LossySetitemError:  | 
 | 80 | +      # current dtype cannot store value, coerce to common dtype  | 
 | 81 | +      nb = self.coerce_to_target_dtype(value)  | 
 | 82 | +      return nb.setitem(indexer, value)  | 
 | 83 | +  else:  | 
 | 84 | +  ```  | 
 | 85 | + | 
 | 86 | +2. making a similar change in ``Block.where``, ``EABlock.setitem``, ``EABlock.where``, and probably more places.  | 
 | 87 | + | 
 | 88 | +The above would already require several hundreds of tests to be adjusted.  | 
 | 89 | + | 
 | 90 | +### Ban upcasting altogether, or just upcasting to ``object``?  | 
 | 91 | + | 
 | 92 | +The trickiest part of this proposal concerns what to do when setting a float in an integer column:  | 
 | 93 | + | 
 | 94 | +```python  | 
 | 95 | +In [1]: ser = pd.Series([1, 2, 3])  | 
 | 96 | + | 
 | 97 | +In [2]: ser[0] = 1.5  | 
 | 98 | +```  | 
 | 99 | + | 
 | 100 | +Possibly options could be:  | 
 | 101 | +1. just raise;  | 
 | 102 | +2. convert the float value to ``int``, preserve the Series' dtype;  | 
 | 103 | +3. upcast to ``float``, even if upcasting in setitem-like is banned for other conversions.  | 
 | 104 | + | 
 | 105 | +Let us compare with what other libraries do:  | 
 | 106 | +- ``numpy``: option 2  | 
 | 107 | +- ``cudf``: option 2  | 
 | 108 | +- ``polars``: option 2  | 
 | 109 | +- ``R data.frame``: option 3  | 
 | 110 | +- ``pandas`` (nullable dtype): option 1  | 
 | 111 | + | 
 | 112 | +If the objective of this PDEP is to prevent bugs, then option 2 is also not desirable:  | 
 | 113 | +someone might set ``1.5`` and later be surprised to learn that they actually set ``1``.  | 
 | 114 | + | 
 | 115 | +Option ``3`` would be inconsistent with the nullable dtypes' behaviour, would add complexity  | 
 | 116 | +to the codebase and to tests, and would be confusing to teach.  | 
 | 117 | + | 
 | 118 | +Option ``1`` is the maximally safe one in terms of protecting users from bugs, and would  | 
 | 119 | +also be consistent with the current behaviour of nullable dtypes. It would also be simple to teach:  | 
 | 120 | +"if you try to set an element of a ``Series`` to a new value, then that value must be compatible  | 
 | 121 | +with the Series' dtype, otherwise it will raise" is easy to understand. If we make an exception for  | 
 | 122 | +``int`` to ``float`` (and presumably also for ``interval[int]``, ``interval[float]``), then the rule  | 
 | 123 | +starts to become confusing.  | 
 | 124 | + | 
 | 125 | +## Usage and Impact  | 
 | 126 | + | 
 | 127 | +This would make pandas stricter, so there should not be any risk of introducing bugs. If anything, this would help prevent bugs.  | 
 | 128 | + | 
 | 129 | +Unfortunately, it would also risk annoy users who might have been intentionally upcasting.  | 
 | 130 | + | 
 | 131 | +Given that users can get around this as simply as with an ``.astype({'my_column': float})`` call,  | 
 | 132 | +I think it would be more beneficial to the community at large to err on the side of strictness.  | 
 | 133 | + | 
 | 134 | +## Timeline  | 
 | 135 | + | 
 | 136 | +Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0.  | 
 | 137 | + | 
 | 138 | +### PDEP History  | 
 | 139 | + | 
 | 140 | +- 23 December 2022: Initial draft  | 
0 commit comments