9 changes: 9 additions & 0 deletions docs/sql-programming-guide.md
@@ -1752,6 +1752,15 @@ To use `groupBy().apply()`, the user needs to define the following:
* A Python function that defines the computation for each group.
* A `StructType` object or a string that defines the schema of the output `DataFrame`.

The output schema will be applied to the columns of the returned `pandas.DataFrame` in order by position,
not by name. This means that the columns in the `pandas.DataFrame` must be indexed so that their
position matches the corresponding field in the schema.
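
A minimal sketch of what positional matching implies, using pandas alone. It does not call Spark; the positional relabeling below is only a simulation of the behavior described above, and `schema_fields`, `result`, `by_position`, and `fixed` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical schema field names, listed in schema order.
schema_fields = ["id", "v"]

# A result whose columns are in a different order than the schema.
result = pd.DataFrame({"v": [1.0, 2.0], "id": [1, 2]}, columns=["v", "id"])

# Simulate positional matching: field i labels column i, names are ignored.
by_position = result.copy()
by_position.columns = schema_fields
# by_position["id"] now holds the values that were originally in "v".

# Selecting the columns in schema order before returning avoids the mismatch.
fixed = result[schema_fields]
```

Reordering with `result[schema_fields]` keeps each column's data attached to its name, so the position of every column lines up with the intended field.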

Note that when creating a new `pandas.DataFrame` using a dictionary, the actual position of the column
can differ from the order that it was placed in the dictionary. It is recommended in this case to
explicitly define the column order using the `columns` keyword, e.g.
`pandas.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])`, or alternatively use an `OrderedDict`.
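
The two options mentioned above can be sketched as follows. The values of `ids` and `data` are made-up sample inputs, and the variable names `explicit` and `ordered` are illustrative only.

```python
from collections import OrderedDict

import pandas as pd

# Made-up sample values standing in for the group's data.
ids = [1, 2]
data = [0.5, 1.5]

# Option 1: fix the column order explicitly with the `columns` keyword.
explicit = pd.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])

# Option 2: build the frame from an OrderedDict, which preserves insertion order.
ordered = pd.DataFrame(OrderedDict([('id', ids), ('a', data)]))
```

Either form gives a frame whose columns are `id` followed by `a`, regardless of how a plain dictionary would have been ordered.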

Note that all data for a group will be loaded into memory before the function is applied. This can
lead to out of memory exceptions, especially if the group sizes are skewed. The configuration for
[maxRecordsPerBatch](#setting-arrow-batch-size) is not applied on groups and it is up to the user
9 changes: 8 additions & 1 deletion python/pyspark/sql/functions.py
@@ -2500,7 +2500,8 @@ def pandas_udf(f=None, returnType=None, functionType=None):
A grouped map UDF defines transformation: A `pandas.DataFrame` -> A `pandas.DataFrame`
The returnType should be a :class:`StructType` describing the schema of the returned
`pandas.DataFrame`.
The length of the returned `pandas.DataFrame` can be arbitrary.
The length of the returned `pandas.DataFrame` can be arbitrary and the columns must be
indexed so that their position matches the corresponding field in the schema.

Grouped map UDFs are used with :meth:`pyspark.sql.GroupedData.apply`.

@@ -2548,6 +2549,12 @@ def pandas_udf(f=None, returnType=None, functionType=None):
| 2|6.0|
+---+---+

.. note:: If returning a new `pandas.DataFrame` constructed with a dictionary, it is
Contributor

Can we do a "warning"?

Member Author

I don't know if there is a "warning" directive, but the notes show up highlighted, so it's pretty clear.

Contributor

OK. I think Sphinx supports these: http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#directives

"caution" or "warning" seems more appropriate, but "note" works too.

recommended to explicitly index the columns by name to ensure the positions are correct,
or alternatively use an `OrderedDict`.
For example, `pd.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])` or
`pd.DataFrame(OrderedDict([('id', ids), ('a', data)]))`.

.. seealso:: :meth:`pyspark.sql.GroupedData.apply`

3. GROUPED_AGG