Commit e5b4b32

[SPARK-40434][SS][PYTHON][FOLLOWUP] Address review comments
### What changes were proposed in this pull request?

This PR addresses HyukjinKwon's review comments from the last round of review in #37893.

### Why are the changes needed?

Better documentation and removal of unnecessary code.

### Does this PR introduce _any_ user-facing change?

Slight documentation change.

### How was this patch tested?

N/A

Closes #37964 from HeartSaVioR/SPARK-40434-followup.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
1 parent c4a0360 commit e5b4b32

File tree

2 files changed: +7 −20 lines

python/pyspark/sql/pandas/group_ops.py

Lines changed: 7 additions & 6 deletions
```diff
@@ -231,9 +231,9 @@ def applyInPandasWithState(
         per-group state. The result Dataset will represent the flattened record returned by the
         function.
 
-        For a streaming Dataset, the function will be invoked first for all input groups and then
-        for all timed out states where the input data is set to be empty. Updates to each group's
-        state will be saved across invocations.
+        For a streaming :class:`DataFrame`, the function will be invoked first for all input groups
+        and then for all timed out states where the input data is set to be empty. Updates to each
+        group's state will be saved across invocations.
 
         The function should take parameters (key, Iterator[`pandas.DataFrame`], state) and
         return another Iterator[`pandas.DataFrame`]. The grouping key(s) will be passed as a tuple
@@ -257,10 +257,10 @@ def applyInPandasWithState(
         user-defined state. The value of the state will be presented as a tuple, as well as the
         update should be performed with the tuple. The corresponding Python types for
         :class:`DataType` are supported. Please refer to the page
-        https://spark.apache.org/docs/latest/sql-ref-datatypes.html (python tab).
+        https://spark.apache.org/docs/latest/sql-ref-datatypes.html (Python tab).
 
-        The size of each DataFrame in both the input and output can be arbitrary. The number of
-        DataFrames in both the input and output can also be arbitrary.
+        The size of each `pandas.DataFrame` in both the input and output can be arbitrary. The
+        number of `pandas.DataFrame` in both the input and output can also be arbitrary.
 
         .. versionadded:: 3.4.0
 
@@ -294,6 +294,7 @@ def applyInPandasWithState(
         ...     total_len += len(pdf)
         ...     state.update((total_len,))
         ...     yield pd.DataFrame({"id": [key[0]], "countAsString": [str(total_len)]})
+        ...
         >>> df.groupby("id").applyInPandasWithState(
         ...     count_fn, outputStructType="id long, countAsString string",
         ...     stateStructType="len long", outputMode="Update",
```
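
For readers less familiar with the API whose docstring is edited above, here is a minimal, self-contained sketch that expands the doctest fragment into a runnable streaming job. It is not part of this commit: the rate source, the console sink, and the `query` variable are assumptions chosen purely for illustration.

```python
# A hedged sketch of applyInPandasWithState (PySpark 3.4+), not from this
# commit. The rate source and console sink are illustrative assumptions.
from typing import Iterator, Tuple

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

spark = SparkSession.builder.appName("applyInPandasWithState-demo").getOrCreate()

# A toy streaming DataFrame with a long "id" grouping column.
df = spark.readStream.format("rate").load().selectExpr("value % 10 AS id")


def count_fn(
    key: Tuple[int], pdfs: Iterator[pd.DataFrame], state: GroupState
) -> Iterator[pd.DataFrame]:
    # The state value is a tuple matching stateStructType ("len long");
    # resume from previously saved state, if any.
    total_len = state.get[0] if state.exists else 0
    for pdf in pdfs:  # any number of pandas.DataFrames, each of any size
        total_len += len(pdf)
    state.update((total_len,))  # saved across invocations / micro-batches
    yield pd.DataFrame({"id": [key[0]], "countAsString": [str(total_len)]})


query = (
    df.groupBy("id")
    .applyInPandasWithState(
        count_fn,
        outputStructType="id long, countAsString string",
        stateStructType="len long",
        outputMode="Update",
        timeoutConf=GroupStateTimeout.NoTimeout,
    )
    .writeStream.outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()
```

The revised docstring's point shows up directly here: any number of `pandas.DataFrame`s may flow in and out per group, and the tuple passed to `state.update` must match `stateStructType`.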

sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala

Lines changed: 0 additions & 14 deletions
```diff
@@ -98,16 +98,6 @@ class ArrowWriter(val root: VectorSchemaRoot, fields: Array[ArrowFieldWriter]) {
     count += 1
   }
 
-  def sizeInBytes(): Int = {
-    var i = 0
-    var bytes = 0
-    while (i < fields.size) {
-      bytes += fields(i).getSizeInBytes()
-      i += 1
-    }
-    bytes
-  }
-
   def finish(): Unit = {
     root.setRowCount(count)
     fields.foreach(_.finish())
@@ -142,10 +132,6 @@ private[arrow] abstract class ArrowFieldWriter {
     count += 1
   }
 
-  def getSizeInBytes(): Int = {
-    valueVector.getBufferSizeFor(count)
-  }
-
   def finish(): Unit = {
     valueVector.setValueCount(count)
   }
```
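
The two deleted helpers simply summed `getBufferSizeFor(count)` over every field writer to estimate the size of the pending Arrow batch. As a hedged aside, not part of this commit and using pyarrow rather than the Java Arrow API that `ArrowWriter` wraps, the sketch below shows the same per-field buffer accounting on Arrow's public types.

```python
# Hedged illustration only: reproduces, via pyarrow, the accounting that the
# removed ArrowWriter.sizeInBytes() performed over its field writers.
import pyarrow as pa

# A small record batch standing in for the vectors an ArrowWriter fills.
batch = pa.RecordBatch.from_pydict(
    {
        "id": pa.array([1, 2, 3], type=pa.int64()),
        "countAsString": pa.array(["3", "1", "2"]),
    }
)

# Per-field buffer sizes, analogous to ArrowFieldWriter.getSizeInBytes().
per_field = {
    name: column.nbytes
    for name, column in zip(batch.schema.names, batch.columns)
}
# Their sum, analogous to the removed ArrowWriter.sizeInBytes().
print(per_field, sum(per_field.values()))
```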
