Conversation


@HyukjinKwon HyukjinKwon commented Jul 22, 2020

What changes were proposed in this pull request?

This PR proposes to redesign the PySpark documentation.

I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html.

Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html.

In more detail, this PR proposes to:

  1. Use the pydata_sphinx_theme theme - pandas and Koalas use this theme. The CSS overrides are ported from Koalas. The colours in the CSS were actually chosen by designers for use in Spark.

  2. Use the Sphinx option to separate the source and build directories, as the documentation pages will likely grow.

  3. Port the current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively.

    One disadvantage of this approach is that you have to list APIs or classes explicitly; however, I don't think this is a big issue in PySpark since we are conservative about adding APIs. I also intentionally listed only classes, instead of functions, in ML and MLlib to make them relatively easier to manage.
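For a sense of what "listing APIs explicitly" means in the new style, a reference page is roughly shaped like the following sketch (the heading and the exact entries here are illustrative, not the PR's actual files):

```rst
DataFrame APIs
--------------

.. currentmodule:: pyspark.sql

.. autosummary::
   :toctree: api/

   DataFrame.alias
   DataFrame.cache
   DataFrame.collect
```

Each name listed under the `autosummary` directive then gets its own generated page under `api/`.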

Why are the changes needed?

I often hear complaints from users that the current PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty messy to read compared to other projects such as pandas and Koalas.

It would be nicer if we could make it more organised, instead of just listing all classes, methods and attributes, to make it easier to navigate.

Also, the documentation has been there since almost the very first version of PySpark. Maybe it's time to update it.

Does this PR introduce any user-facing change?

Yes, PySpark API documentation will be redesigned.

How was this patch tested?

Manually tested, and the demo site was built to show the result.

python/.eggs/
python/deps
python/docs/_site/
python/docs/source/reference/api/
Member Author


This is generated by the autosummary plugin in Sphinx when autosummary_generate is enabled in conf.py. Each API or class listed under an autosummary directive, for example DataFrame.alias, is generated as an RST file by that plugin.
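As a minimal sketch, enabling this in conf.py looks roughly like the following (the option names are standard Sphinx ones; the surrounding settings are illustrative, not the exact file in this PR):

```python
# conf.py (excerpt): enable autosummary so that each entry under an
# `autosummary` directive gets its own generated RST stub page.
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.autosummary",
]

# Generate stub RST files automatically at build time, e.g.
# reference/api/pyspark.sql.DataFrame.alias.rst for DataFrame.alias.
autosummary_generate = True
```

With `autosummary_generate = True`, running the normal Sphinx build is enough; no stubs need to be committed.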


@HyukjinKwon HyukjinKwon left a comment


The docs/img/spark-logo-reverse.png image is from the "white logo" at http://spark.apache.org/faq.html.


{% endif %}
{% endblock %}


@HyukjinKwon HyukjinKwon Jul 22, 2020


This is needed to let the autosummary plugin document the methods in a class. For example, when we use this template, it renders the method documentation at the bottom of the page. See pyspark.ml.Transformer as an example.

Without this template, it only lists the methods and attributes without showing their documentation in detail. See pyspark.sql.DataFrameNaFunctions as an example.
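For context, a custom autosummary class template that expands method documentation typically follows the common Sphinx/pandas template pattern sketched below (this is illustrative; it is not the exact template file added in this PR):

```rst
{{ objname | escape | underline }}

.. currentmodule:: {{ module }}

.. autoclass:: {{ objname }}

{% block methods %}
{% if methods %}
.. rubric:: Methods

.. autosummary::
{% for item in methods %}
   ~{{ name }}.{{ item }}
{% endfor %}

{% for item in methods %}
.. automethod:: {{ name }}.{{ item }}
{% endfor %}
{% endif %}
{% endblock %}
```

The first loop produces the summary table of methods; the `automethod` loop is what makes the full docstring of each method appear at the bottom of the class page.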

# Remove previously generated RST files. Ignore errors so that a failure
# here does not stop generating the whole docs.
shutil.rmtree(
    "%s/reference/api" % os.path.dirname(os.path.abspath(__file__)),
    ignore_errors=True)
Member Author


autosummary generates RST files but does not remove them afterwards. Here we always remove the generated RST files so that leftovers do not cause any side effects.
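As a small runnable sketch of why `ignore_errors=True` matters here (the directory and file names below are illustrative): removing the generated directory must succeed whether or not a previous build left files behind.

```python
import os
import shutil
import tempfile

# Simulate the conf.py cleanup: a directory of autosummary-generated RST
# files from a previous build, which must be removed before regenerating.
root = tempfile.mkdtemp()
api_dir = os.path.join(root, "reference", "api")
os.makedirs(api_dir)
open(os.path.join(api_dir, "pyspark.sql.DataFrame.alias.rst"), "w").close()

# The first call removes the leftovers; the second is a silent no-op
# because ignore_errors=True swallows the missing-directory failure.
shutil.rmtree(api_dir, ignore_errors=True)
shutil.rmtree(api_dir, ignore_errors=True)

print(os.path.exists(api_dir))  # False
```

Without `ignore_errors=True`, the second call would raise `FileNotFoundError` and abort the docs build.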

@HyukjinKwon
Member Author

@BryanCutler, @huaxingao, @ueshin, @viirya, @srowen, @dongjoon-hyun, @WeichenXu123, @zhengruifeng, @holdenk, @zero323, can you guys take a look when you are available?

@SparkQA

This comment has been minimized.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-32179 branch 3 times, most recently from 54ffa45 to 1c6fe6c Compare July 22, 2020 12:40

@srowen srowen left a comment


Looks nice! Yes, it will take a little more work to maintain the module / class lists, so whatever we can do to keep it simple is welcome.

@holdenk
Contributor

holdenk commented Jul 22, 2020

Excited to see the site improve. I’ll take some time to review it this week.



@viirya viirya left a comment


It looks great! Besides visual effects like colors, it looks more structured.

@zero323
Member

zero323 commented Jul 22, 2020

Looks nice. I miss direct access to docstrings a bit, but I guess that's a reasonable trade-off.

I wonder if there is some non-hacky way to organize functions into logical groups, similarly to what ScalaDoc does.

@huaxingao
Contributor

Looks really nice! It's more organized this way.

@dongjoon-hyun
Member

The demo website looks nice, although I didn't generate it manually from this PR~

@HyukjinKwon
Member Author

> I wonder if there is some non-hacky way to organize functions into logical groups, similarly to what ScalaDoc does.

I tried hard, but it looked difficult to do. I will take one more look.

By default, it follows casting rules to :class:`pyspark.sql.types.DateType` if the format
is omitted. Equivalent to ``col.cast("date")``.
.. _datetime pattern: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
Member Author


This is needed because we're now creating a separate page for each API. For example, see https://hyukjin-spark.readthedocs.io/en/stable/reference/api/pyspark.sql.functions.to_date.html
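Since each API now renders as its own page, a named hyperlink target has to live in the docstring itself rather than once per module. A minimal reST sketch of how such a docstring reads (the wording here is illustrative, not the exact to_date docstring):

```rst
Converts a :class:`Column` into :class:`pyspark.sql.types.DateType`
using the optionally specified format. The format follows the
`datetime pattern`_.

.. _datetime pattern: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
```

If the `.. _datetime pattern:` target were defined only at module level, the reference would break on the standalone per-API page.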


@HyukjinKwon HyukjinKwon force-pushed the SPARK-32179 branch 3 times, most recently from 1f121d5 to 3c89dab Compare July 23, 2020 08:22

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 23, 2020

I believe this is ready for a look or possibly ready to go.


@SparkQA

SparkQA commented Jul 24, 2020

Test build #126449 has finished for PR 29188 at commit d6d0117.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon deleted the SPARK-32179 branch July 27, 2020 07:43
@HyukjinKwon HyukjinKwon restored the SPARK-32179 branch July 27, 2020 07:47
@HyukjinKwon HyukjinKwon reopened this Jul 27, 2020
@HyukjinKwon
Member Author

retest this please

@HyukjinKwon
Member Author

I will merge and go ahead, given the multiple pieces of positive feedback here.

@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon
Member Author

Let me know if you have any concerns about this. I will be working on completing the other pages for a while.

@SparkQA

SparkQA commented Jul 27, 2020

Test build #126627 has finished for PR 29188 at commit d6d0117.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor

The demo website looks great!

HyukjinKwon added a commit that referenced this pull request Aug 5, 2020
### What changes were proposed in this pull request?

This PR proposes to write the main page of PySpark documentation. The base work is finished at #29188.

### Why are the changes needed?

For better usability and readability in PySpark documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it creates a new main page as below:

![Screen Shot 2020-07-31 at 10 02 44 PM](https://user-images.githubusercontent.com/6477701/89037618-d2d68880-d379-11ea-9a44-562f2aa0e3fd.png)

### How was this patch tested?

Manually built the PySpark documentation.

```bash
cd python
make clean html
```

Closes #29320 from HyukjinKwon/SPARK-32507.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-32179 branch December 7, 2020 02:06
