Member

@HyukjinKwon HyukjinKwon commented Jan 7, 2021

What changes were proposed in this pull request?

This PR proposes to:

  • Add a link to the PySpark quick start under "Programming Guides" in the main Spark docs
  • Rename ML / MLlib to MLlib (DataFrame-based) / MLlib (RDD-based) in the API reference page
  • Mention other user guides as well, such as the ML and SQL guides.
  • Mention other migration guides as well, because PySpark can be affected by them.

Why are the changes needed?

For better documentation.

Does this PR introduce any user-facing change?

Yes, it fixes user-facing docs. However, the affected docs have not been released yet.

How was this patch tested?

Manually tested by running:

cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
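
The manual check above could in principle be complemented by a small script. The following is only a minimal sketch, not part of this patch: it assumes the navigation links use the `dropdown-item` class seen in the diff, and the names `DropdownLinkCollector`, `dropdown_hrefs` and `SAMPLE_NAV` are hypothetical helpers introduced for illustration.

```python
from html.parser import HTMLParser


class DropdownLinkCollector(HTMLParser):
    """Collect href values of <a class="dropdown-item"> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        # Only keep anchors that are dropdown entries and carry an href.
        if tag == "a" and "dropdown-item" in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def dropdown_hrefs(html_text):
    """Return the hrefs of all dropdown entries, in document order."""
    collector = DropdownLinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs


# Sample markup copied from the diff in this PR (an assumption about the
# rendered page; a real check would read the generated HTML file instead).
SAMPLE_NAV = """
<a class="dropdown-item" href="ml-guide.html">MLlib (Machine Learning)</a>
<a class="dropdown-item" href="graphx-programming-guide.html">GraphX (Graph Processing)</a>
<a class="dropdown-item" href="sparkr.html">SparkR (R on Spark)</a>
<a class="dropdown-item" href="api/python/getting_started/index.html">PySpark (Python on Spark)</a>
"""

if __name__ == "__main__":
    hrefs = dropdown_hrefs(SAMPLE_NAV)
    # Report whether the new PySpark entry is present in the dropdown.
    print("api/python/getting_started/index.html" in hrefs)
```

Such a check would only confirm that the link is present in the markup; it would not confirm that the target page exists in the built site.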

@HyukjinKwon HyukjinKwon marked this pull request as draft January 7, 2021 09:02
SparkQA commented Jan 7, 2021

Test build #133785 has finished for PR 31082 at commit 003b61f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<a class="dropdown-item" href="ml-guide.html">MLlib (Machine Learning)</a>
<a class="dropdown-item" href="graphx-programming-guide.html">GraphX (Graph Processing)</a>
<a class="dropdown-item" href="sparkr.html">SparkR (R on Spark)</a>
<a class="dropdown-item" href="api/python/getting_started/index.html">PySpark (Python on Spark)</a>
Member Author

[Screenshot: Screen Shot 2021-01-07 at 6 45 02 PM]

* [MLlib](ml-guide.html): applying machine learning algorithms
* [GraphX](graphx-programming-guide.html): processing graphs
* [SparkR](sparkr.html): processing data with Spark in R
* [PySpark](api/python/getting_started/index.html): processing data with Spark in Python
Member Author

@HyukjinKwon HyukjinKwon Jan 7, 2021

[Screenshot: Screen Shot 2021-01-07 at 6 49 59 PM]

This page summarizes the basic steps required to setup and get started with PySpark.
There are more guides shared with other languages such as
`Quick Start <http://spark.apache.org/docs/latest/quick-start.html>`_ in Programming Guides
at `the Spark documentation <http://spark.apache.org/docs/latest/index.html#where-to-go-from-here>`_.
Member Author

[Screenshot: Screen Shot 2021-01-07 at 6 46 05 PM]

- `Structured Streaming Programming Guide <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>`_
- `Spark Streaming Programming Guide <http://spark.apache.org/docs/latest/streaming-programming-guide.html>`_
- `Machine Learning Library (MLlib) Guide <http://spark.apache.org/docs/latest/ml-guide.html>`_

Member Author

[Screenshot: Screen Shot 2021-01-07 at 6 46 32 PM]

- `Migration Guide: SQL, Datasets and DataFrame <http://spark.apache.org/docs/latest/sql-migration-guide.html>`_
- `Migration Guide: Structured Streaming <http://spark.apache.org/docs/latest/ss-migration-guide.html>`_
- `Migration Guide: MLlib (Machine Learning) <http://spark.apache.org/docs/latest/ml-migration-guide.html>`_

Member Author

[Screenshot: Screen Shot 2021-01-07 at 6 46 56 PM]

@HyukjinKwon HyukjinKwon marked this pull request as ready for review January 7, 2021 09:47
Member Author

HyukjinKwon commented Jan 7, 2021

Thanks for post-reviewing @mengxr. Can you take a look please? @viirya, @srowen and @zero323 too FYI.

SparkQA commented Jan 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38373/

SparkQA commented Jan 7, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38373/

SparkQA commented Jan 7, 2021

Test build #133789 has finished for PR 31082 at commit 0cacfd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jan 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38378/

SparkQA commented Jan 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38378/

ML
==
MLlib (DataFrame-based)
Member

I have mixed feelings about it. I am aware that MLlib is the official name applied to both, but the majority of users I have interacted with prefer to say ML when talking about the DataFrame-based API.

Member Author

This one actually came as feedback from @mengxr, who has made many major contributions to ML :-). I think this name was picked based on what we documented at https://spark.apache.org/docs/latest/ml-guide.html, which I think makes sense:

  • What is “Spark ML”?

    “Spark ML” is not an official name but occasionally used to refer to the MLlib DataFrame-based API. This is majorly due to the org.apache.spark.ml Scala package name used by the DataFrame-based API, and the “Spark ML Pipelines” term we used initially to emphasize the pipeline concept.


This page summarizes the basic steps required to setup and get started with PySpark.
There are more guides shared with other languages such as
`Quick Start <http://spark.apache.org/docs/latest/quick-start.html>`_ in Programming Guides
Member

Does this need to be an absolute path? Can't we use a relative path?

Member Author

Yeah ... I tried hard to find a good way but failed. The link target lies outside the PySpark documentation build, so the link cannot be resolved when the PySpark documentation is built.
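
To illustrate where the link target sits relative to the PySpark pages, here is a minimal sketch of the URL arithmetic on the published site. It assumes the published layout matches the paths in this PR's diff; the relative link `../../../quick-start.html` and the `resolve` helper are hypothetical and used only for illustration. The sketch says nothing about how the PySpark documentation build itself treats such a link.

```python
from urllib.parse import urljoin

# Published location of the PySpark getting-started page, assembled from the
# docs root and the relative href that appear in this PR's diff (assumption).
PYSPARK_PAGE = (
    "http://spark.apache.org/docs/latest/"
    "api/python/getting_started/index.html"
)

# Hypothetical relative link that climbs out of the PySpark docs tree
# (getting_started -> python -> api -> docs root).
RELATIVE_LINK = "../../../quick-start.html"


def resolve(base_url, link):
    """Resolve a link the way a browser would, relative to the page URL."""
    return urljoin(base_url, link)


if __name__ == "__main__":
    print(resolve(PYSPARK_PAGE, RELATIVE_LINK))
```

The target resolves to a page three directory levels above the PySpark page, that is, outside the tree that the PySpark documentation build produces, which is consistent with the explanation above for using an absolute URL.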

Member

@viirya viirya left a comment

Looks good. Just one question about the absolute path.

Member Author

Thanks guys. Let me merge this in.

HyukjinKwon added a commit that referenced this pull request Jan 8, 2021
…umentation

### What changes were proposed in this pull request?

This PR proposes to:
- Add a link to the PySpark quick start under "Programming Guides" in the main Spark docs
- Rename `ML` / `MLlib` to `MLlib (DataFrame-based)` / `MLlib (RDD-based)` in the API reference page
- Mention other user guides as well, such as [ML](http://spark.apache.org/docs/latest/ml-guide.html) and [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html).
- Mention other migration guides as well, because PySpark can be affected by them.

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes user-facing docs. However, the affected docs have not been released yet.

### How was this patch tested?

Manually tested by running:

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

Closes #31082 from HyukjinKwon/SPARK-34041.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit aa388cf)
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-34041 branch January 4, 2022 00:55