[SPARK-40142][PYTHON][SQL][FOLLOW-UP] Make pyspark.sql.functions examples self-contained (part 7, ~30 functions) #37850
Conversation
Can one of the admins verify this patch?

@HyukjinKwon @srowen @itholic Please review
python/pyspark/sql/functions.py
Outdated
Returns
-------
:class:`~pyspark.sql.Column`
    a string representatio of a :class:`StructType` parsed from given JSON.
typo in a few places: representatio -> representation
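For reference, a minimal self-contained sketch of the `schema_of_json` example under review (assuming an active SparkSession named `spark`; the exact output formatting varies by Spark version):

```python
from pyspark.sql.functions import schema_of_json, lit

# assumes an active SparkSession named `spark`
df = spark.range(1)
df.select(schema_of_json(lit('{"a": 0}')).alias("json")).show(truncate=False)
# prints the inferred DDL string for the sample JSON,
# e.g. STRUCT<a: BIGINT> (exact casing varies by Spark version)
```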
python/pyspark/sql/functions.py
Outdated
Returns
-------
:class:`~pyspark.sql.Column`
    an array of values from first array that is not in the second.
nit: that are not
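For context, a minimal sketch of the `array_except` behavior this docstring describes (assuming an active `spark` session):

```python
from pyspark.sql import Row
from pyspark.sql.functions import array_except

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([Row(c1=["b", "a", "c"], c2=["c", "d", "a", "f"])])
# keeps the values of c1 that do not appear in c2, without duplicates
df.select(array_except(df.c1, df.c2)).collect()
# [Row(array_except(c1, c2)=['b'])]
```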
python/pyspark/sql/functions.py
Outdated
col : :class:`~pyspark.sql.Column` or str
    target column to work on.
delimiter : str
    delimiter to use concatanate elements
concatenate, here and below
Also: "delimiter used to concatenate..."
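A minimal sketch of `array_join` with a delimiter, matching the wording suggested above (assuming an active `spark` session):

```python
from pyspark.sql.functions import array_join

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([(["a", "b", "c"],), (["a", None],)], ["data"])
# concatenates array elements with the delimiter; nulls are skipped
# unless a null_replacement is supplied
df.select(array_join(df.data, ",").alias("joined")).collect()
# [Row(joined='a,b,c'), Row(joined='a')]
df.select(array_join(df.data, ",", "NULL").alias("joined")).collect()
# [Row(joined='a,b,c'), Row(joined='a,NULL')]
```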
itholic
left a comment
Looks fine otherwise
python/pyspark/sql/functions.py
Outdated
Returns
-------
:class:`~pyspark.sql.Column`
    a column of `boolean` type.
nit: if we want to use single quotation for the type name, why don't we use it in other docstrings?
e.g.
- a column of array type.
+ a column of `array` type.

python/pyspark/sql/functions.py

Returns
-------
:class:`~pyspark.sql.Column`
    a column of array type. Subset of array.
nit: since we're here, can we also fix the minor mistake in the description?
I found there are two spaces between "containing" and "all".
- Collection function: returns an array containing  all the elements in `x` from index `start`
+ Collection function: returns an array containing all the elements in `x` from index `start`
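For context, a minimal sketch of the `slice` behavior that the description above refers to (assuming an active `spark` session):

```python
from pyspark.sql.functions import slice

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([([1, 2, 3],), ([4, 5],)], ["x"])
# elements of `x` from 1-based index `start` (negative counts from the end),
# for up to `length` elements
df.select(slice(df.x, 2, 2).alias("sliced")).collect()
# [Row(sliced=[2, 3]), Row(sliced=[5])]
```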
python/pyspark/sql/functions.py
Outdated
Concatenates multiple input columns together into a single column.
- The function works with strings, binary and compatible array columns.
+ The function works with strings, numeric, binary and compatible array columns.
+ Or any type that can be converted to string is good candidate as input value.
If there are supported types other than string, numeric, and binary, can we list them all ?
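A minimal sketch of `concat` with mixed string and numeric input, illustrating the wording under discussion (assuming an active `spark` session):

```python
from pyspark.sql.functions import concat

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([("abcd", 123)], ["s", "d"])
# the numeric column is cast to string before concatenation
df.select(concat(df.s, df.d).alias("s")).collect()
# [Row(s='abcd123')]
```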
python/pyspark/sql/functions.py
Outdated
Returns
-------
:class:`~pyspark.sql.Column`
    concatatened values. Type of the `Column` depends on input columns' type.
Another typo here: "concatatened" -> "concatenated" :-)
python/pyspark/sql/functions.py
Outdated
See Also
--------
:meth:`pyspark.sql.functions.array_join` : to concatanate string columns with delimiter
ditto
>>> df.select(element_at(df.data, -4)).collect()
[Row(element_at(data, -4)=None)]
Can we add a short description of why this returns None?
e.g.
>>> df.select(element_at(df.data, -1)).collect()
[Row(element_at(data, -1)='c')]
Returns `None` if there is no value corresponding to the given `extraction`.
>>> df.select(element_at(df.data, -4)).collect()
[Row(element_at(data, -4)=None)]
python/pyspark/sql/functions.py
Outdated
Returns
-------
:class:`~pyspark.sql.Column`
    a string representatio of a :class:`StructType` parsed from given CSV.
typo in a few places: representatio -> representation
Here, too
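For reference, a minimal self-contained sketch of the `schema_of_csv` example being discussed (assuming an active `spark` session; the exact output formatting varies by Spark version):

```python
from pyspark.sql.functions import schema_of_csv, lit

# assumes an active SparkSession named `spark`
df = spark.range(1)
df.select(schema_of_csv(lit("1|a"), {"sep": "|"}).alias("csv")).show(truncate=False)
# prints the inferred DDL string for the sample CSV row,
# e.g. STRUCT<_c0: INT, _c1: STRING> (exact casing varies by Spark version)
```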
Merged to master.
…ples self-contained (part 7, ~30 functions)

### What changes were proposed in this pull request?
It's part of the Pyspark docstrings improvement series (apache#37592, apache#37662, apache#37686, apache#37786, apache#37797)
In this PR I mainly covered missing parts in the docstrings, adding some more examples where needed.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes apache#37850 from khalidmammadov/docstrings_funcs_part_7.

Lead-authored-by: Khalid Mammadov <[email protected]>
Co-authored-by: khalidmammadov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…ple in element_at

### What changes were proposed in this pull request?
This PR is a followup of #37850 that removes the non-ANSI-compliant example in `element_at`.

### Why are the changes needed?
The ANSI build fails to run the example.
https://github.com/apache/spark/actions/runs/3094607589/jobs/5008176959
```
Caused by: org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX_IN_ELEMENT_AT] The index -4 is out of bounds. The array has 3 elements. Use `try_element_at` to tolerate accessing element at invalid index and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.invalidElementAtIndexError(QueryExecutionErrors.scala:264)
...
/usr/local/pypy/pypy3.7/lib-python/3/runpy.py:125: RuntimeWarning: 'pyspark.sql.functions' found in sys.modules after import of package 'pyspark.sql', but prior to execution of 'pyspark.sql.functions'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
/__w/spark/spark/python/pyspark/context.py:310: FutureWarning: Python 3.7 support is deprecated in Spark 3.4.
  warnings.warn("Python 3.7 support is deprecated in Spark 3.4.", FutureWarning)
**********************************************************************
1 of 6 in pyspark.sql.functions.element_at
```

### Does this PR introduce _any_ user-facing change?
No. The example added is not exposed to end users yet.

### How was this patch tested?
Manually tested with the ANSI configuration (`spark.sql.ansi.enabled`) enabled

Closes #37959 from HyukjinKwon/SPARK-40142-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
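For illustration, a minimal sketch of the ANSI behavior described in the commit message above (assuming an active `spark` session; the exception text is taken from the quoted log):

```python
from pyspark.sql.functions import element_at

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([(["a", "b", "c"],)], ["data"])

# under ANSI mode an out-of-bounds index raises instead of returning NULL,
# which is why the doctest example had to be removed
spark.conf.set("spark.sql.ansi.enabled", "true")
# df.select(element_at(df.data, -4)).collect()
#   -> SparkArrayIndexOutOfBoundsException [INVALID_ARRAY_INDEX_IN_ELEMENT_AT]

# with ANSI disabled, the original example returns NULL
spark.conf.set("spark.sql.ansi.enabled", "false")
df.select(element_at(df.data, -4)).collect()
# [Row(element_at(data, -4)=None)]
```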
…ples self-contained (FINAL)

### What changes were proposed in this pull request?
It's part of the Pyspark docstrings improvement series (#37592, #37662, #37686, #37786, #37797, #37850)
In this PR I mainly covered missing parts in the docstrings, adding some more examples where needed.
I have also made all examples self-explanatory by providing the DataFrame creation command where it was missing, for clarity to the user.
This should complete "my take" on `functions.py` docstrings & example improvements.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37988 from khalidmammadov/docstrings_funcs_part_8.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
What changes were proposed in this pull request?
It's part of the Pyspark docstrings improvement series (#37592, #37662, #37686, #37786, #37797)
In this PR I mainly covered missing parts in the docstrings, adding some more examples where needed.
Why are the changes needed?
To improve PySpark documentation
Does this PR introduce any user-facing change?
Yes, documentation
How was this patch tested?
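```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```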