
Matching data types to internal PySpark and pandas' behavior. #1870

Merged
merged 8 commits into from
Nov 4, 2020

Conversation

itholic
Contributor

@itholic itholic commented Oct 27, 2020

This PR includes several fixes to match data types to internal PySpark and pandas behavior:

  • Added np.float32 and "float32" (matched to FloatType)
    >>> ks.Series([10]).astype(np.float32)
    0    10.0
    dtype: float32
    
    >>> ks.Series([10]).astype("float32")
    0    10.0
    dtype: float32
  • Added np.datetime64 and "datetime64[ns]" (matched to TimestampType)
    >>> ks.Series(["2020-10-26"]).astype(np.datetime64)
    0   2020-10-26
    dtype: datetime64[ns]
    
    >>> ks.Series(["2020-10-26"]).astype("datetime64[ns]")
    0   2020-10-26
    dtype: datetime64[ns]
  • Fixed np.int to match LongType, not IntegerType.
    >>> pd.Series([100]).astype(np.int)
    0    100
    dtype: int64

    >>> ks.Series([100]).astype(np.int)
    0    100
    dtype: int32  # Now fixed to `int64`.
  • Fixed np.float to match DoubleType, not FloatType.
    >>> pd.Series([100]).astype(np.float)
    0    100.0
    dtype: float64
    
    >>> ks.Series([100]).astype(np.float)
    0    100.0
    dtype: float32  # Now fixed to `float64`.
  • Added test, test_typedef.py::TypeHintTests::test_as_spark_type
  • Added more tests for test_series.py::SeriesTest::test_astype
  • Fixed related doc tests

Should resolve #1840
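
For reference, a minimal sketch of the kind of dtype-to-Spark-type mapping the bullets above describe; the dictionary name and structure below are illustrative assumptions, not the actual code in databricks/koalas/typedef/typehints.py:

import numpy as np
from pyspark.sql import types

# Illustrative mapping only; the real table lives in
# databricks/koalas/typedef/typehints.py.
DTYPE_TO_SPARK_TYPE = {
    np.float32: types.FloatType(),            # added by this PR
    "float32": types.FloatType(),             # added by this PR
    np.datetime64: types.TimestampType(),     # added by this PR
    "datetime64[ns]": types.TimestampType(),  # added by this PR
    int: types.LongType(),      # np.int is an alias of the builtin int
    float: types.DoubleType(),  # np.float is an alias of the builtin float
}

print(DTYPE_TO_SPARK_TYPE[np.float32])  # FloatType()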

@HyukjinKwon HyukjinKwon marked this pull request as draft October 28, 2020 05:32
@codecov-io

codecov-io commented Oct 28, 2020

Codecov Report

Merging #1870 into master will increase coverage by 0.01%.
The diff coverage is 95.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1870      +/-   ##
==========================================
+ Coverage   94.18%   94.20%   +0.01%     
==========================================
  Files          40       40              
  Lines        9853     9867      +14     
==========================================
+ Hits         9280     9295      +15     
+ Misses        573      572       -1     
Impacted Files Coverage Δ
databricks/koalas/accessors.py 93.06% <ø> (ø)
databricks/koalas/groupby.py 91.17% <ø> (ø)
databricks/koalas/series.py 96.95% <ø> (ø)
databricks/koalas/strings.py 82.35% <ø> (ø)
databricks/koalas/typedef/typehints.py 92.98% <95.23%> (+0.87%) ⬆️
databricks/koalas/frame.py 96.73% <0.00%> (+<0.01%) ⬆️
databricks/koalas/spark/accessors.py 94.33% <0.00%> (+0.37%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

    elif tpe in (str, "str", "string"):
        return types.StringType()
    # TimestampType
    elif tpe in (datetime.datetime, np.datetime64, "datetime64[ns]"):
Collaborator


also pd.Timestamp?

Contributor Author

@itholic itholic Oct 29, 2020


I think we might then have behavior inconsistent with pandas for astype:

>>> pd.Series(["2020-10-20"]).astype(pd.Timestamp)
Traceback (most recent call last):
...
TypeError: dtype '<class 'pandas._libs.tslibs.timestamps.Timestamp'>' not understood

>>> ks.Series(["2020-10-20"]).astype(pd.Timestamp)
0   2020-10-20
dtype: datetime64[ns]

As shown above, it seems pandas doesn't support pd.Timestamp as a dtype when type casting.
What do you think?

Collaborator


Got it. Let's not add it for now. We might need to add it later anyway since the function is used not only in astype.

@itholic itholic marked this pull request as ready for review October 29, 2020 03:43
Collaborator

@ueshin ueshin left a comment


Also could you check whether pandas supports string notation like:

'?' | boolean
'b' | (signed) byte
'B' | unsigned byte
'i' | (signed) integer
'u' | unsigned integer
'f' | floating-point
'c' | complex-floating point
'm' | timedelta
'M' | datetime
'O' | (Python) objects
'S', 'a' | zero-terminated bytes (not recommended)
'U' | Unicode string
'V' | raw data (void)

https://numpy.org/doc/stable/reference/arrays.dtypes.html
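
One quick way to check this (a throwaway sketch, not part of the PR) is to feed each single-character code to pandas' astype and see which ones it accepts:

import pandas as pd

codes = ["?", "b", "B", "i", "u", "f", "c", "m", "M", "O", "S", "a", "U", "V"]
s = pd.Series([1, 2, 3])

for code in codes:
    try:
        # pandas resolves these codes through numpy's dtype machinery,
        # so most of them map to a concrete numpy dtype.
        print(code, "->", s.astype(code).dtype)
    except (TypeError, ValueError) as exc:
        print(code, "-> rejected:", exc)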

@itholic
Contributor Author

itholic commented Oct 30, 2020

Thanks, @ueshin. Added the missing types.

@itholic
Contributor Author

itholic commented Oct 30, 2020

Also could you check whether pandas supports string notation like:

'?' | boolean
'b' | (signed) byte
'B' | unsigned byte
'i' | (signed) integer
'u' | unsigned integer
'f' | floating-point
'c' | complex-floating point
'm' | timedelta
'M' | datetime
'O' | (Python) objects
'S', 'a' | zero-terminated bytes (not recommended)
'U' | Unicode string
'V' | raw data (void)

https://numpy.org/doc/stable/reference/arrays.dtypes.html

Yep, I'll check it!

Collaborator

@ueshin ueshin left a comment


Otherwise, LGTM so far.

    if tpe in (str, "str", "string"):
        return types.StringType()
    elif tpe in (bytes,):
    # TODO: Add "boolean" and "string" types.
Contributor Author

@itholic itholic Oct 31, 2020


I added "boolean" and "string" as TODO item since they work differently from "bool" and "str" respectively.

First, "boolean" casts the type from "bool-like values" to boolean whereas "bool" casts the type from various types to bool.

>>> pd.Series([1, 2, 3]).astype("bool")
0    True
1    True
2    True
dtype: bool

>>> pd.Series([1, 2, 3]).astype("boolean")
Traceback (most recent call last):
...
TypeError: Need to pass bool-like values

>>> pd.Series([True, False, True]).astype("boolean")
0     True
1    False
2     True
dtype: boolean

Next, "string" casts the type from "sequence of string or pandas.NA" to object whereas "str" casts the type from various types to object .

>>> pd.Series([1, 2, 3]).astype("str")
0    1
1    2
2    3
dtype: object

>>> pd.Series([1, 2, 3]).astype("string")
Traceback (most recent call last):
...
ValueError: StringArray requires a sequence of strings or pandas.NA

>>> pd.Series(['1', '2', '3']).astype("string")
0    1
1    2
2    3
dtype: string

In short, we need to add the new boolean and string types to match these behaviors.
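
For reference, these map to pandas' nullable extension dtypes; a quick plain-pandas sketch (independent of the Koalas code) of how they can be requested directly:

import pandas as pd

# "boolean" is the nullable BooleanDtype: missing values stay as pd.NA
# instead of being coerced the way the plain "bool" dtype would coerce them.
s_bool = pd.Series([True, False, None], dtype="boolean")
print(s_bool.dtype)  # boolean

# "string" is the nullable StringDtype: it requires strings or pd.NA,
# unlike "str", which produces a plain object-dtype Series.
s_str = pd.Series(["a", "b", None], dtype="string")
print(s_str.dtype)  # string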

@itholic
Contributor Author

itholic commented Nov 4, 2020

I think it's good to go for now.

How about discussing the remaining items in separate PRs, like #1870 (comment) and #1870 (comment)?

Collaborator

@ueshin ueshin left a comment


LGTM.

@ueshin
Collaborator

ueshin commented Nov 4, 2020

Thanks! I'd merge this now.
@itholic Could you move on to the remaining topics in separate PRs?

@ueshin ueshin merged commit d4954b6 into databricks:master Nov 4, 2020
@itholic
Contributor Author

itholic commented Nov 5, 2020

Thanks, @ueshin!
Sure, I'll open a PR soon.

@nanotellez

I just made sure I had the latest Koalas version (1.5.0) and tested this, and it is not working as described above.

ks.Series([10]).astype(np.float32)

Produced the following error:


ks.Series([10]).astype(np.float32)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-4576f10d772a> in <module>
----> 1 ks.Series([10]).astype(np.float32)

~/opt/anaconda3/envs/pipelines_env/lib/python3.7/site-packages/databricks/koalas/series.py in astype(self, dtype)
   1002         from databricks.koalas.typedef import as_spark_type
   1003 
-> 1004         spark_type = as_spark_type(dtype)
   1005         if not spark_type:
   1006             raise ValueError("Type {} not understood".format(dtype))

~/opt/anaconda3/envs/pipelines_env/lib/python3.7/site-packages/databricks/koalas/typedef/typehints.py in as_spark_type(tpe)
    123         return types.ArrayType(types.StringType())
    124     else:
--> 125         raise TypeError("Type %s was not understood." % tpe)
    126 
    127 

TypeError: Type <class 'numpy.float32'> was not understood.

@itholic
Contributor Author

itholic commented Jan 22, 2021

@nanotellez

Hmm... I couldn't reproduce the same error.

Could you let me know how you installed Koalas?

I simply installed it via pip install koalas, tested it, and it worked fine.

>>> ks.__version__
'1.5.0'

>>> ks.Series([10]).astype(np.float32)
0    10.0
dtype: float32

And this usage has been tested in CI for every code fix since this PR was merged:

np.float32: FloatType(),

Could you please check if there is more than one Koalas installed in your execution environment?

(Maybe an old version downloaded directly from GitHub is running instead?)
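
A quick, generic way to check which installation Python is actually importing (a debugging sketch, not from this PR):

import databricks.koalas as ks

# Shows both the reported version and the file the package was loaded from,
# which helps spot a stale copy shadowing the pip-installed one.
print(ks.__version__)
print(ks.__file__)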

@nanotellez

Well... upon checking again, I realized that the version I had was in fact not the latest one. The update I ran must not have gone through for some reason, and I did not realize it. Now that I carefully ran the update, it is working exactly as intended. My apologies!

@itholic
Contributor Author

itholic commented Jan 22, 2021

@nanotellez No problem. Glad to hear you fixed it! :D
