Conversation

@khalidmammadov
Contributor

What changes were proposed in this pull request?

It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786, #37797)

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

Why are the changes needed?

To improve PySpark documentation

Does this PR introduce any user-facing change?

Yes, documentation

How was this patch tested?

```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

@AmplabJenkins

Can one of the admins verify this patch?

@khalidmammadov
Contributor Author

@HyukjinKwon @srowen @itholic Please review

Returns
-------
:class:`~pyspark.sql.Column`
a string representatio of a :class:`StructType` parsed from given JSON.
Member

typo in a few places: representatio -> representation
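
For reference, a minimal self-contained sketch of the function this `Returns` section documents, `schema_of_json`; it assumes an active SparkSession named `spark`, and the shown output is approximate:

```python
from pyspark.sql.functions import lit, schema_of_json

df = spark.range(1)
# Infer the DDL-formatted schema of a JSON literal.
df.select(schema_of_json(lit('{"a": 1}')).alias("json")).collect()
# [Row(json='STRUCT<a: BIGINT>')]
```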

Returns
-------
:class:`~pyspark.sql.Column`
an array of values from first array that is not in the second.
Member

nit: that are not
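
For reference, a self-contained sketch of `array_except`, mirroring the docstring under review (assumes an active SparkSession `spark`):

```python
from pyspark.sql import Row
from pyspark.sql.functions import array_except

df = spark.createDataFrame([Row(c1=["b", "a", "c"], c2=["c", "d", "a", "f"])])
# Values from the first array that are not in the second, without duplicates.
df.select(array_except(df.c1, df.c2).alias("diff")).collect()
# [Row(diff=['b'])]
```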

col : :class:`~pyspark.sql.Column` or str
target column to work on.
delimiter : str
delimiter to use concatanate elements
Member

concatenate, here and below
Also: "delimiter used to concatenate..."

Contributor

@itholic left a comment

Looks fine otherwise

Returns
-------
:class:`~pyspark.sql.Column`
a column of `boolean` type.
Contributor

nit: if we want to use single quotes for the type name, why don't we use them in other docstrings too?

e.g.

-  a column of array type.
+  a column of `array` type.

Returns
-------
:class:`~pyspark.sql.Column`
a column of array type. Subset of array.
Contributor

nit: since we're here, can we also fix the minor mistake in description?

I found there are two spaces between "containing" and "all".

-  Collection function: returns an array containing  all the elements in `x` from index `start`
+  Collection function: returns an array containing all the elements in `x` from index `start`
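
For reference, a self-contained sketch of the `slice` function discussed here (assumes `spark`; note the 1-based `start` index):

```python
from pyspark.sql.functions import slice

df = spark.createDataFrame([([1, 2, 3],), ([4, 5],)], ["x"])
# All elements of x from index start=2 (1-based) for length=2.
df.select(slice(df.x, 2, 2).alias("sliced")).collect()
# [Row(sliced=[2, 3]), Row(sliced=[5])]
```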

Concatenates multiple input columns together into a single column.
- The function works with strings, binary and compatible array columns.
+ The function works with strings, numeric, binary and compatible array columns.
+ Or any type that can be converted to string is good candidate as input value.
Contributor

If there are supported types other than string, numeric, and binary, can we list them all? (A sketch of the two main cases is below.)
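
For reference, a hedged sketch of the two main `concat` cases, strings and compatible arrays (assumes `spark` exists):

```python
from pyspark.sql.functions import concat

# String columns concatenate into a single string.
df = spark.createDataFrame([("abcd", "123")], ["s", "d"])
df.select(concat(df.s, df.d).alias("s")).collect()
# [Row(s='abcd123')]

# Compatible array columns concatenate into a single array;
# a null input yields a null result.
df2 = spark.createDataFrame([([1, 2], [3, 4]), ([5], None)], ["a", "b"])
df2.select(concat(df2.a, df2.b).alias("arr")).collect()
# [Row(arr=[1, 2, 3, 4]), Row(arr=None)]
```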

Returns
-------
:class:`~pyspark.sql.Column`
concatatened values. Type of the `Column` depends on input columns' type.
Contributor

Another typo for "concatatened" here :-)

See Also
--------
:meth:`pyspark.sql.functions.array_join` : to concatanate string columns with delimiter
Contributor

ditto

Comment on lines +5713 to +5714
>>> df.select(element_at(df.data, -4)).collect()
[Row(element_at(data, -4)=None)]
Contributor

Can we add a short description of why this returns None?

e.g.

    >>> df.select(element_at(df.data, -1)).collect()
    [Row(element_at(data, -1)='c')]

    Returns `None` if there is no value corresponding to the given `extraction`.

    >>> df.select(element_at(df.data, -4)).collect()
    [Row(element_at(data, -4)=None)]
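
A self-contained version of the suggested example (assumes `spark`; the out-of-bounds case above returns `None` only with ANSI mode off, which the follow-up commit further down addresses):

```python
from pyspark.sql.functions import element_at

df = spark.createDataFrame([(["a", "b", "c"],)], ["data"])
# Indexes are 1-based; negative indexes count from the end.
df.select(element_at(df.data, 1)).collect()
# [Row(element_at(data, 1)='a')]
df.select(element_at(df.data, -1)).collect()
# [Row(element_at(data, -1)='c')]
```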

Returns
-------
:class:`~pyspark.sql.Column`
a string representatio of a :class:`StructType` parsed from given CSV.
Contributor

typo in a few places: representatio -> representation

Here, too
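
And the CSV counterpart, `schema_of_csv`, as a self-contained sketch (assumes `spark`; output approximate):

```python
from pyspark.sql.functions import lit, schema_of_csv

df = spark.range(1)
# Infer the DDL-formatted schema of a CSV literal with a custom separator.
df.select(schema_of_csv(lit("1|a"), {"sep": "|"}).alias("csv")).collect()
# [Row(csv='STRUCT<_c0: INT, _c1: STRING>')]
```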

@HyukjinKwon
Member

Merged to master.

LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Sep 20, 2022
…ples self-contained (part 7, ~30 functions)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (apache#37592, apache#37662, apache#37686, apache#37786, apache#37797)

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes apache#37850 from khalidmammadov/docstrings_funcs_part_7.

Lead-authored-by: Khalid Mammadov <[email protected]>
Co-authored-by: khalidmammadov <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Sep 22, 2022
…ple in element_at

### What changes were proposed in this pull request?

This PR is a followup of #37850 that removes non-ANSI compliant example in `element_at`.

### Why are the changes needed?

ANSI build fails to run the example.

https://github.com/apache/spark/actions/runs/3094607589/jobs/5008176959

```
    Caused by: org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX_IN_ELEMENT_AT] The index -4 is out of bounds. The array has 3 elements. Use `try_element_at` to tolerate accessing element at invalid index and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
    	at org.apache.spark.sql.errors.QueryExecutionErrors$.invalidElementAtIndexError(QueryExecutionErrors.scala:264)
    	...

/usr/local/pypy/pypy3.7/lib-python/3/runpy.py:125: RuntimeWarning: 'pyspark.sql.functions' found in sys.modules after import of package 'pyspark.sql', but prior to execution of 'pyspark.sql.functions'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
/__w/spark/spark/python/pyspark/context.py:310: FutureWarning: Python 3.7 support is deprecated in Spark 3.4.
  warnings.warn("Python 3.7 support is deprecated in Spark 3.4.", FutureWarning)
**********************************************************************
   1 of   6 in pyspark.sql.functions.element_at

```
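
For reference, a hedged sketch of the `try_element_at` alternative the error message suggests, called here through a SQL expression (assumes an active SparkSession `spark`):

```python
from pyspark.sql.functions import expr

df = spark.createDataFrame([(["a", "b", "c"],)], ["data"])
# Unlike element_at under ANSI mode, try_element_at tolerates an
# out-of-bounds index and returns NULL instead of raising an error.
df.select(expr("try_element_at(data, -4)").alias("v")).collect()
# [Row(v=None)]
```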

### Does this PR introduce _any_ user-facing change?

No. The example added is not exposed to end users yet.

### How was this patch tested?
Manually tested with the ANSI configuration (`spark.sql.ansi.enabled`) enabled.

Closes #37959 from HyukjinKwon/SPARK-40142-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
srowen pushed a commit that referenced this pull request Sep 25, 2022
…ples self-contained (FINAL)

### What changes were proposed in this pull request?
It's part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786, #37797, #37850)

In this PR I mainly covered missing parts in the docstrings, adding more examples where needed.

I have also made all examples self-explanatory by providing the DataFrame creation command where it was missing, for clarity to the user.

This should complete "my take" on `functions.py` docstrings & example improvements.

### Why are the changes needed?
To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?
Yes, documentation

### How was this patch tested?
```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37988 from khalidmammadov/docstrings_funcs_part_8.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>