Matching data types to internal PySpark and pandas' behavior. #1870
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1870      +/-   ##
==========================================
+ Coverage   94.18%   94.20%   +0.01%
==========================================
  Files          40       40
  Lines        9853     9867      +14
==========================================
+ Hits         9280     9295      +15
+ Misses        573      572       -1
Continue to review full report at Codecov.
elif tpe in (str, "str", "string"):
    return types.StringType()
# TimestampType
elif tpe in (datetime.datetime, np.datetime64, "datetime64[ns]"):
also pd.Timestamp?
I think then we might have behavior that is inconsistent with pandas for astype:
>>> pd.Series(["2020-10-20"]).astype(pd.Timestamp)
Traceback (most recent call last):
...
TypeError: dtype '<class 'pandas._libs.tslibs.timestamps.Timestamp'>' not understood
>>> ks.Series(["2020-10-20"]).astype(pd.Timestamp)
0 2020-10-20
dtype: datetime64[ns]
As shown above, it seems pandas doesn't support pd.Timestamp as a dtype when type casting. What do you think about this?
Got it. Let's not add it for now. We might need to add it later anyway, since the function is used not only in astype.
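If pd.Timestamp does get added later, a minimal sketch of what that branch might look like (hypothetical function name; this is not the koalas implementation, and pd.Timestamp is deliberately not part of this PR):

import datetime

import numpy as np
import pandas as pd
from pyspark.sql import types

def as_spark_type_sketch(tpe):
    # Rough sketch of the dispatch shown in the diff above.
    if tpe in (str, "str", "string"):
        return types.StringType()
    # TimestampType: pd.Timestamp is included here only to illustrate the
    # "add it later" idea discussed above; it is not part of this PR.
    elif tpe in (datetime.datetime, np.datetime64, pd.Timestamp, "datetime64[ns]"):
        return types.TimestampType()
    raise TypeError("Type %s was not understood." % tpe)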
Also, could you check whether pandas supports string notation like:
'?' | boolean
'b' | (signed) byte
'B' | unsigned byte
'i' | (signed) integer
'u' | unsigned integer
'f' | floating-point
'c' | complex-floating point
'm' | timedelta
'M' | datetime
'O' | (Python) objects
'S', 'a' | zero-terminated bytes (not recommended)
'U' | Unicode string
'V' | raw data (void)
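A quick local check of which of these single-character codes pandas' astype accepts could look like this (just a sketch, not part of the PR):

import pandas as pd

# Try each numpy array-interface typecode with astype and report the outcome.
codes = ["?", "b", "B", "i", "u", "f", "c", "m", "M", "O", "S", "a", "U", "V"]
s = pd.Series([1, 2, 3])
for code in codes:
    try:
        s.astype(code)
        print(code, "-> supported")
    except (TypeError, ValueError) as exc:
        print(code, "-> not supported:", exc)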
Thanks, @ueshin. Added the missing types.
Yep, I'll check it!
Otherwise, LGTM so far.
if tpe in (str, "str", "string"):
    return types.StringType()
elif tpe in (bytes,):
# TODO: Add "boolean" and "string" types.
I added "boolean" and "string" as TODO items since they work differently from "bool" and "str" respectively.
First, "boolean" casts the type from bool-like values to boolean, whereas "bool" casts the type from various types to bool.
>>> pd.Series([1, 2, 3]).astype("bool")
0 True
1 True
2 True
dtype: bool
>>> pd.Series([1, 2, 3]).astype("boolean")
Traceback (most recent call last):
...
TypeError: Need to pass bool-like values
>>> pd.Series([True, False, True]).astype("boolean")
0 True
1 False
2 True
dtype: boolean
Next, "string" casts the type from a sequence of strings or pandas.NA to string, whereas "str" casts the type from various types to object.
>>> pd.Series([1, 2, 3]).astype("str")
0 1
1 2
2 3
dtype: object
>>> pd.Series([1, 2, 3]).astype("string")
Traceback (most recent call last):
...
ValueError: StringArray requires a sequence of strings or pandas.NA
>>> pd.Series(['1', '2', '3']).astype("string")
0 1
1 2
2 3
dtype: string
In short, we need to add the new types boolean and string to match these behaviors.
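If that TODO is tackled later, the nullable extension dtypes could presumably be mapped alongside the existing branches; a rough sketch (hypothetical function name, not code from this PR):

import pandas as pd
from pyspark.sql import types

def as_spark_type_for_extension_dtypes_sketch(tpe):
    # Hypothetical additions: keep pandas' nullable "boolean"/"string" dtypes
    # distinct from the plain "bool"/"str" branches shown above.
    if tpe in ("boolean", pd.BooleanDtype):
        return types.BooleanType()
    elif tpe in ("string", pd.StringDtype):
        return types.StringType()
    raise TypeError("Type %s was not understood." % tpe)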
I think it's good to go for now. How about we discuss the remaining things in separate PRs, like #1870 (comment) and #1870 (comment)?
LGTM.
Thanks! I'd merge this now.
Thanks, @ueshin!
I just made sure I had the latest Koalas version (1.5.0) and tested this, and it is not working as described above. It produced the following error:
Hmm... I couldn't reproduce the same error. Could you let me know how you installed Koalas? I installed it simply, and this is what I get:

>>> ks.__version__
'1.5.0'
>>> ks.Series([10]).astype(np.float32)
0    10.0
dtype: float32

And this usage has been tested in CI on every code fix since this PR was merged.
Could you please check whether there is more than one Koalas installed in your execution environment? (Maybe an old version downloaded directly from GitHub is running instead, or something?)
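As a side note, one quick way to see which Koalas installation the interpreter is actually picking up (a sketch; nothing koalas-specific beyond the import path):

import databricks.koalas as ks

# The reported version and the import path reveal whether an unexpected
# (e.g. older or locally checked-out) installation is shadowing the intended one.
print(ks.__version__)
print(ks.__file__)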
Well... upon checking again, I realized that the version I had was in fact not the latest one. The update I ran must not have gone through for some reason, and I did not realize it. Now that I carefully ran the update, it is working exactly as intended. My apologies!
@nanotellez No problem. Glad to hear that you fixed it!! :D
This PR basically includes several fixes for matching data types to internal PySpark and pandas' behavior:
- np.float32 and "float32" (matched to FloatType)
- np.datetime64 and "datetime64[ns]" (matched to TimestampType)
- np.int to match LongType, not IntegerType
- np.float to match DoubleType, not FloatType

Tests:
- test_typedef.py::TypeHintTests::test_as_spark_type
- test_series.py::SeriesTest::test_astype

Should resolve #1840
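For illustration, a usage sketch of the behavior described above (assumes a working PySpark/Koalas setup; to_frame().to_spark().dtypes is used here only as one way to inspect the resulting Spark types):

import numpy as np
import databricks.koalas as ks

kser = ks.Series([10])

# np.float32 / "float32" are matched to FloatType.
print(kser.astype(np.float32).to_frame().to_spark().dtypes)  # a 'float' column

# np.float (alias of the builtin float) keeps matching DoubleType,
# and np.int (alias of the builtin int) keeps matching LongType.
print(kser.astype(float).to_frame().to_spark().dtypes)  # a 'double' column
print(kser.astype(int).to_frame().to_spark().dtypes)    # a 'bigint' column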