Matching data types to internal PySpark and pandas' behavior. #1870
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1870      +/-   ##
==========================================
+ Coverage   94.18%   94.20%   +0.01%
==========================================
  Files          40       40
  Lines        9853     9867      +14
==========================================
+ Hits         9280     9295      +15
+ Misses        573      572       -1
Continue to review full report at Codecov.
elif tpe in (str, "str", "string"):
    return types.StringType()
# TimestampType
elif tpe in (datetime.datetime, np.datetime64, "datetime64[ns]"):
also pd.Timestamp?
I think then we might have behavior that is inconsistent with pandas for astype:
>>> pd.Series(["2020-10-20"]).astype(pd.Timestamp)
Traceback (most recent call last):
...
TypeError: dtype '<class 'pandas._libs.tslibs.timestamps.Timestamp'>' not understood
>>> ks.Series(["2020-10-20"]).astype(pd.Timestamp)
0 2020-10-20
dtype: datetime64[ns]
As shown above, it seems pandas doesn't support pd.Timestamp as a dtype when type casting. What do you think about this?
Got it. Let's not add it for now. We might need to add it later anyway, since the function is used not only in astype.
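If pd.Timestamp does get added later, a minimal sketch of what that branch might look like (hypothetical function name; this is not the koalas implementation, and pd.Timestamp is deliberately not part of this PR):

import datetime

import numpy as np
import pandas as pd
from pyspark.sql import types

def as_spark_type_sketch(tpe):
    # Rough sketch of the dispatch shown in the diff above.
    if tpe in (str, "str", "string"):
        return types.StringType()
    # TimestampType: pd.Timestamp is included here only to illustrate the
    # "add it later" idea discussed above; it is not part of this PR.
    elif tpe in (datetime.datetime, np.datetime64, pd.Timestamp, "datetime64[ns]"):
        return types.TimestampType()
    raise TypeError("Type %s was not understood." % tpe)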
Also, could you check whether pandas supports string notation like:
'?' | boolean
'b' | (signed) byte
'B' | unsigned byte
'i' | (signed) integer
'u' | unsigned integer
'f' | floating-point
'c' | complex-floating point
'm' | timedelta
'M' | datetime
'O' | (Python) objects
'S', 'a' | zero-terminated bytes (not recommended)
'U' | Unicode string
'V' | raw data (void)
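A quick local check of which of these single-character codes pandas' astype accepts could look like this (just a sketch, not part of the PR):

import pandas as pd

# Try each numpy array-interface typecode with astype and report the outcome.
codes = ["?", "b", "B", "i", "u", "f", "c", "m", "M", "O", "S", "a", "U", "V"]
s = pd.Series([1, 2, 3])
for code in codes:
    try:
        s.astype(code)
        print(code, "-> supported")
    except (TypeError, ValueError) as exc:
        print(code, "-> not supported:", exc)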
Thanks, @ueshin. Added the missing types.
Yep, I'll check it!
Otherwise, LGTM so far.
if tpe in (str, "str", "string"):
    return types.StringType()
elif tpe in (bytes,):
# TODO: Add "boolean" and "string" types.
I added "boolean" and "string" as TODO items since they work differently from "bool" and "str" respectively.
First, "boolean" casts the type from bool-like values to boolean, whereas "bool" casts the type from various types to bool.
>>> pd.Series([1, 2, 3]).astype("bool")
0 True
1 True
2 True
dtype: bool
>>> pd.Series([1, 2, 3]).astype("boolean")
Traceback (most recent call last):
...
TypeError: Need to pass bool-like values
>>> pd.Series([True, False, True]).astype("boolean")
0 True
1 False
2 True
dtype: boolean
Next, "string" casts the type from a sequence of strings or pandas.NA to string, whereas "str" casts the type from various types to object.
>>> pd.Series([1, 2, 3]).astype("str")
0 1
1 2
2 3
dtype: object
>>> pd.Series([1, 2, 3]).astype("string")
Traceback (most recent call last):
...
ValueError: StringArray requires a sequence of strings or pandas.NA
>>> pd.Series(['1', '2', '3']).astype("string")
0 1
1 2
2 3
dtype: string
In short, we need to add the new types boolean and string to match these behaviors.
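If that TODO is tackled later, the nullable extension dtypes could presumably be mapped alongside the existing branches; a rough sketch (hypothetical function name, not code from this PR):

import pandas as pd
from pyspark.sql import types

def as_spark_type_for_extension_dtypes_sketch(tpe):
    # Hypothetical additions: keep pandas' nullable "boolean"/"string" dtypes
    # distinct from the plain "bool"/"str" branches shown above.
    if tpe in ("boolean", pd.BooleanDtype):
        return types.BooleanType()
    elif tpe in ("string", pd.StringDtype):
        return types.StringType()
    raise TypeError("Type %s was not understood." % tpe)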
I think it's good to go for now. How about we discuss the remaining things in separate PRs, like #1870 (comment) and #1870 (comment)?
LGTM.
Thanks! I'd merge this now.
Thanks, @ueshin!
I just made sure I had the latest Koalas version (1.5.0) and tested this, and it is not working as described above. It produced the following error:
Hmm... I couldn't reproduce the same error. Could you let me know how you installed Koalas? I installed it simply, and this is what I get:

>>> ks.__version__
'1.5.0'
>>> ks.Series([10]).astype(np.float32)
0    10.0
dtype: float32

And this usage has been tested in CI on every code fix since this PR was merged.
Could you please check whether there is more than one Koalas installed in your execution environment? (Maybe an old version downloaded directly from GitHub is running instead, or something?)
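As a side note, one quick way to see which Koalas installation the interpreter is actually picking up (a sketch; nothing koalas-specific beyond the import path):

import databricks.koalas as ks

# The reported version and the import path reveal whether an unexpected
# (e.g. older or locally checked-out) installation is shadowing the intended one.
print(ks.__version__)
print(ks.__file__)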
Well... upon checking again, I realized that the version I had was in fact not the latest one. The update I ran must not have gone through for some reason, and I did not realize it. Now that I carefully ran the update, it is working exactly as intended. My apologies!
@nanotellez No problem. Glad to hear that you fixed it!! :D
This PR basically includes several fixes for matching data types to internal PySpark and pandas' behavior:
- np.float32 and "float32" (matched to FloatType)
- np.datetime64 and "datetime64[ns]" (matched to TimestampType)
- np.int to match LongType, not IntegerType
- np.float to match DoubleType, not FloatType

Tests:
- test_typedef.py::TypeHintTests::test_as_spark_type
- test_series.py::SeriesTest::test_astype

Should resolve #1840
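For illustration, a usage sketch of the behavior described above (assumes a working PySpark/Koalas setup; to_frame().to_spark().dtypes is used here only as one way to inspect the resulting Spark types):

import numpy as np
import databricks.koalas as ks

kser = ks.Series([10])

# np.float32 / "float32" are matched to FloatType.
print(kser.astype(np.float32).to_frame().to_spark().dtypes)  # a 'float' column

# np.float (alias of the builtin float) keeps matching DoubleType,
# and np.int (alias of the builtin int) keeps matching LongType.
print(kser.astype(float).to_frame().to_spark().dtypes)  # a 'double' column
print(kser.astype(int).to_frame().to_spark().dtypes)    # a 'bigint' column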