Fix DataFrame.koalas.transform_batch to support additional dtypes. #2132
Conversation
```diff
 def pandas_extract(pdf, name):
     # This is for output to work around a DataFrame for struct
     # from Spark 3.0. See SPARK-23836
     return pdf[name]

-def pandas_series_func(f):
+def pandas_series_func(f, by_pass):
```
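For context, a minimal standalone sketch of the workaround `pandas_extract` supports: before Spark 3.0 a scalar pandas UDF could not return a struct (SPARK-23836), so the full batch function is run once per output field and a single column is extracted each time. `output_func` and the toy schema below are illustrative assumptions, not the Koalas internals:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 4.0), (2, 5.0)], "a long, b double")

def output_func(a, b):
    # Stand-in for the user's batch function; it returns a whole
    # pandas DataFrame, which pre-3.0 scalar UDFs cannot emit.
    return pd.DataFrame({"a": a + 1, "b": b + 1})

def pandas_extract(pdf, name):
    # Work around the missing struct output by pulling one column
    # out of the full result per UDF invocation.
    return pdf[name]

# One scalar pandas UDF per output field, each returning a plain Series.
applied = [
    pandas_udf(
        lambda a, b, name=name: pandas_extract(output_func(a, b), name),
        returnType=dtype,
        functionType=PandasUDFType.SCALAR,
    )("a", "b").alias(name)
    for name, dtype in [("a", "bigint"), ("b", "double")]
]
sdf.select(*applied).show()
```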
Why is it called `by_pass`? Would you please help me understand?
It uses some new Spark APIs to "by pass" a workaround. You can see:

koalas/databricks/koalas/accessors.py, Lines 646 to 666 (and the identical block at Lines 715 to 735) in 80a5893:

```python
if should_by_pass:
    pudf = pandas_udf(
        output_func, returnType=return_schema, functionType=PandasUDFType.SCALAR
    )
    temp_struct_column = verify_temp_column_name(
        self_applied._internal.spark_frame, "__temp_struct__"
    )
    applied = pudf(F.struct(*columns)).alias(temp_struct_column)
    sdf = self_applied._internal.spark_frame.select(applied)
    sdf = sdf.selectExpr("%s.*" % temp_struct_column)
else:
    applied = []
    for field in return_schema.fields:
        applied.append(
            pandas_udf(
                pandas_frame_func(output_func, field.name),
                returnType=field.dataType,
                functionType=PandasUDFType.SCALAR,
            )(*columns).alias(field.name)
        )
    sdf = self_applied._internal.spark_frame.select(*applied)
```
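For contrast, a minimal standalone sketch of the struct-column path the `should_by_pass` branch takes: on Spark 3.0+, a scalar pandas UDF can accept a struct column and return a struct directly (both materialized as a pandas DataFrame on the worker), so the result only needs one `selectExpr` to expand. The toy DataFrame and schema are illustrative assumptions, not the Koalas internals:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 4.0), (2, 5.0)], "a long, b double")

# On Spark 3.0+, a scalar pandas UDF can take a struct column and
# return a struct; pdf is a pandas DataFrame on the worker.
@pandas_udf("a long, b double", PandasUDFType.SCALAR)
def plus_one(pdf):
    return pdf + 1

temp_struct_column = "__temp_struct__"
result = sdf.select(plus_one(F.struct("a", "b")).alias(temp_struct_column))
# Expand the struct back into top-level columns.
result.selectExpr("%s.*" % temp_struct_column).show()
```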
Looks great, thank you!

Thanks! Merging.
Fix `DataFrame.koalas.transform_batch` to support additional dtypes.

After this, additional dtypes can be specified in the return type annotation of the UDFs for `DataFrame.koalas.transform_batch`.
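A hedged usage sketch of what this enables; the exact set of newly supported dtypes is an assumption here, with `np.int32` used purely for illustration:

```python
import numpy as np
import databricks.koalas as ks

kdf = ks.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# The return type annotation on the batch function now accepts more
# dtypes; np.int32 stands in for one of the additionally supported ones.
def plus_one(pdf) -> ks.DataFrame[np.int32, np.int32]:
    return pdf + 1

kdf.koalas.transform_batch(plus_one)
```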