2 changes: 1 addition & 1 deletion python/docs/source/development/index.rst
@@ -27,7 +27,7 @@ Development
debugging
setting_ide

For pandas APIs on Spark:
For pandas API on Spark:

.. toctree::
:maxdepth: 2
10 changes: 5 additions & 5 deletions python/docs/source/development/ps_contributing.rst
@@ -17,9 +17,9 @@ The largest amount of work consists simply of implementing the pandas API using

3. Improve the project's documentation.

4. Write blog posts or tutorial articles evangelizing pandas APIs on Spark and help new users learn pandas APIs on Spark.
4. Write blog posts or tutorial articles evangelizing pandas API on Spark and help new users learn pandas API on Spark.

5. Give a talk about pandas APIs on Spark at your local meetup or a conference.
5. Give a talk about pandas API on Spark at your local meetup or a conference.


Step-by-step Guide For Code Contributions
@@ -48,7 +48,7 @@ Environment Setup
Conda
-----

If you are using Conda, the pandas APIs on Spark installation and development environment are as follows.
If you are using Conda, the pandas API on Spark installation and development environment are as follows.

.. code-block:: bash

@@ -131,7 +131,7 @@ We follow `PEP 8 <https://www.python.org/dev/peps/pep-0008/>`_ with one exceptio
Doctest Conventions
===================

When writing doctests, usually the doctests in pandas are converted into pandas APIs on Spark to make sure the same codes work in pandas APIs on Spark.
When writing doctests, usually the doctests in pandas are converted into pandas API on Spark to make sure the same codes work in pandas API on Spark.
In general, doctests should be grouped logically by separating a newline.

For instance, the first block is for the statements for preparation, the second block is for using the function with a specific argument,
@@ -176,7 +176,7 @@ Only project maintainers can do the following to publish a release.
# for release
python3 -m twine upload --repository-url https://upload.pypi.org/legacy/ dist/koalas-$package_version-py3-none-any.whl dist/koalas-$package_version.tar.gz

5. Verify the uploaded package can be installed and executed. One unofficial tip is to run the doctests of pandas APIs on Spark within a Python interpreter after installing it.
5. Verify the uploaded package can be installed and executed. One unofficial tip is to run the doctests of pandas API on Spark within a Python interpreter after installing it.

.. code-block:: python

30 changes: 15 additions & 15 deletions python/docs/source/development/ps_design.rst
@@ -4,16 +4,16 @@ Design Principles

.. currentmodule:: pyspark.pandas

This section outlines design principles guiding the pandas APIs on Spark.
This section outlines design principles guiding the pandas API on Spark.

Be Pythonic
-----------

Pandas APIs on Spark target Python data scientists. We want to stick to the convention that users are already familiar with as much as possible. Here are some examples:
Pandas API on Spark targets Python data scientists. We want to stick to the convention that users are already familiar with as much as possible. Here are some examples:

- Function names and parameters use snake_case, rather than CamelCase. This is different from PySpark's design. For example, pandas APIs on Spark have `to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we also provide Spark's variant as an alias.
- Function names and parameters use snake_case, rather than CamelCase. This is different from PySpark's design. For example, pandas API on Spark has `to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we also provide Spark's variant as an alias.

- Pandas APIs on Spark respect to the largest extent the conventions of the Python numerical ecosystem, and allows the use of NumPy types, etc. that can be supported by Spark.
- Pandas API on Spark respects to the largest extent the conventions of the Python numerical ecosystem, and allows the use of NumPy types, etc. that can be supported by Spark.

- pandas-on-Spark docs' style and infrastructure simply follow rest of the PyData projects'.
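
As an illustration of the snake_case point in the first bullet above, a minimal sketch (the `psdf`/`sdf`/`pdf` names are assumptions for illustration, and `pyspark.pandas` is the module path referenced elsewhere in these docs):

.. code-block:: python

    import pyspark.pandas as ps  # assumed module path after the Koalas port

    psdf = ps.DataFrame({"x": [1, 2, 3]})   # pandas-on-Spark DataFrame
    pdf = psdf.to_pandas()                  # snake_case, pandas-style naming

    sdf = psdf.to_spark()                   # plain PySpark DataFrame
    pdf_again = sdf.toPandas()              # camelCase, PySpark-style naming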

@@ -26,13 +26,13 @@ There are different classes of functions:

1. Functions that are found in both Spark and pandas under the same name (`count`, `dtypes`, `head`). The return value is the same as the return type in pandas (and not Spark's).

2. Functions that are found in Spark but that have a clear equivalent in pandas, e.g. `alias` and `rename`. These functions will be implemented as the alias of the pandas function, but should be marked that they are aliases of the same functions. They are provided so that existing users of PySpark can get the benefits of pandas APIs on Spark without having to adapt their code.
2. Functions that are found in Spark but that have a clear equivalent in pandas, e.g. `alias` and `rename`. These functions will be implemented as the alias of the pandas function, but should be marked that they are aliases of the same functions. They are provided so that existing users of PySpark can get the benefits of pandas API on Spark without having to adapt their code.

3. Functions that are only found in pandas. When these functions are appropriate for distributed datasets, they should become available in pandas APIs on Spark.
3. Functions that are only found in pandas. When these functions are appropriate for distributed datasets, they should become available in pandas API on Spark.

4. Functions that are only found in Spark that are essential to controlling the distributed nature of the computations, e.g. `cache`. These functions should be available in pandas APIs on Spark.
4. Functions that are only found in Spark that are essential to controlling the distributed nature of the computations, e.g. `cache`. These functions should be available in pandas API on Spark.

We are still debating whether data transformation functions only available in Spark should be added to pandas APIs on Spark, e.g. `select`. We would love to hear your feedback on that.
We are still debating whether data transformation functions only available in Spark should be added to pandas API on Spark, e.g. `select`. We would love to hear your feedback on that.

Return pandas-on-Spark data structure for big data, and pandas data structure for small data
--------------------------------------------------------------------------------------------
@@ -46,40 +46,40 @@ At the risk of overgeneralization, there are two API design approaches: the firs

One example is value count (count by some key column), one of the most common operations in data science. pandas `DataFrame.value_count` returns the result in sorted order, which in 90% of the cases is what users prefer when exploring data, whereas Spark's does not sort, which is more desirable when building data pipelines, as users can accomplish the pandas behavior by adding an explicit `orderBy`.

Similar to pandas, pandas APIs on Spark should also lean more towards the former, providing discoverable APIs for common data science tasks. In most cases, this principle is well taken care of by simply implementing pandas' APIs. However, there will be circumstances in which pandas' APIs don't address a specific need, e.g. plotting for big data.
Similar to pandas, pandas API on Spark should also lean more towards the former, providing discoverable APIs for common data science tasks. In most cases, this principle is well taken care of by simply implementing pandas' APIs. However, there will be circumstances in which pandas' APIs don't address a specific need, e.g. plotting for big data.
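
A hedged sketch of the value count example above (`psdf` and `sdf` are hypothetical pandas-on-Spark and PySpark DataFrames, respectively):

.. code-block:: python

    # pandas API on Spark follows pandas: counts come back sorted, which is
    # convenient for interactive exploration.
    psdf["key"].value_counts()

    # Plain PySpark leaves the ordering to the caller, which suits pipelines.
    sdf.groupBy("key").count().orderBy("count", ascending=False)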

Provide well documented APIs, with examples
-------------------------------------------

All functions and parameters should be documented. Most functions should be documented with examples, because those are the easiest to understand than a blob of text explaining what the function does.

A recommended way to add documentation is to start with the docstring of the corresponding function in PySpark or pandas, and adapt it for pandas APIs on Spark. If you are adding a new function, also add it to the API reference doc index page in `docs/source/reference` directory. The examples in docstring also improve our test coverage.
A recommended way to add documentation is to start with the docstring of the corresponding function in PySpark or pandas, and adapt it for pandas API on Spark. If you are adding a new function, also add it to the API reference doc index page in `docs/source/reference` directory. The examples in docstring also improve our test coverage.
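
As an illustration of the docstring-with-examples convention described above, a hedged sketch (the function, names, and values are invented for illustration only):

.. code-block:: python

    def add_one(col):
        """
        Return a new Series with one added to every value.

        Examples
        --------
        >>> import pyspark.pandas as ps  # assumed module path
        >>> psser = ps.Series([1, 2, 3])
        >>> add_one(psser).sort_index()
        0    2
        1    3
        2    4
        dtype: int64
        """
        return col + 1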

Guardrails to prevent users from shooting themselves in the foot
----------------------------------------------------------------

Certain operations in pandas are prohibitively expensive as data scales, and we don't want to give users the illusion that they can rely on such operations in pandas APIs on Spark. That is to say, methods implemented in pandas APIs on Spark should be safe to perform by default on large datasets. As a result, the following capabilities are not implemented in pandas APIs on Spark:
Certain operations in pandas are prohibitively expensive as data scales, and we don't want to give users the illusion that they can rely on such operations in pandas API on Spark. That is to say, methods implemented in pandas API on Spark should be safe to perform by default on large datasets. As a result, the following capabilities are not implemented in pandas API on Spark:

1. Capabilities that are fundamentally not parallelizable: e.g. imperatively looping over each element
2. Capabilities that require materializing the entire working set in a single node's memory. This is why we do not implement `pandas.DataFrame.to_xarray <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_xarray.html>`_. Another example is the `_repr_html_` call caps the total number of records shown to a maximum of 1000, to prevent users from blowing up their driver node simply by typing the name of the DataFrame in a notebook.

A few exceptions, however, exist. One common pattern with "big data science" is that while the initial dataset is large, the working set becomes smaller as the analysis goes deeper. For example, data scientists often perform aggregation on datasets and want to then convert the aggregated dataset to some local data structure. To help data scientists, we offer the following:

- :func:`DataFrame.to_pandas`: returns a pandas DataFrame, koalas only
- :func:`DataFrame.to_numpy`: returns a numpy array, works with both pandas and pandas APIs on Spark
- :func:`DataFrame.to_numpy`: returns a numpy array, works with both pandas and pandas API on Spark

Note that it is clear from the names that these functions return some local data structure that would require materializing data in a single node's memory. For these functions, we also explicitly document them with a warning note that the resulting data structure must be small.
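
A minimal sketch of the "aggregate first, then bring the small result local" pattern described above (`psdf` is an assumed pandas-on-Spark DataFrame with a `key` column):

.. code-block:: python

    # The aggregation itself stays distributed on the cluster.
    per_key = psdf.groupby("key").sum()

    # Only the (small) aggregated result is materialized on the driver.
    local_pdf = per_key.to_pandas()   # pandas DataFrame
    local_arr = per_key.to_numpy()    # NumPy array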

Be a lean API layer and move fast
---------------------------------

Pandas APIs on Spark are designed as an API overlay layer on top of Spark. The project should be lightweight, and most functions should be implemented as wrappers
Pandas API on Spark is designed as an API overlay layer on top of Spark. The project should be lightweight, and most functions should be implemented as wrappers
around Spark or pandas - the pandas-on-Spark library is designed to be used only in the Spark's driver side in general.
Pandas APIs on Spark do not accept heavyweight implementations, e.g. execution engine changes.
Pandas API on Spark does not accept heavyweight implementations, e.g. execution engine changes.

This approach enables us to move fast. For the considerable future, we aim to be making monthly releases. If we find a critical bug, we will be making a new release as soon as the bug fix is available.

High test coverage
------------------

Pandas APIs on Spark should be well tested. The project tracks its test coverage with over 90% across the entire codebase, and close to 100% for critical parts. Pull requests will not be accepted unless they have close to 100% statement coverage from the codecov report.
Pandas API on Spark should be well tested. The project tracks its test coverage with over 90% across the entire codebase, and close to 100% for critical parts. Pull requests will not be accepted unless they have close to 100% statement coverage from the codecov report.
2 changes: 1 addition & 1 deletion python/docs/source/getting_started/index.rst
@@ -31,7 +31,7 @@ at `the Spark documentation <https://spark.apache.org/docs/latest/index.html#whe
install
quickstart

For pandas APIs on Spark:
For pandas API on Spark:

.. toctree::
:maxdepth: 2
6 changes: 3 additions & 3 deletions python/docs/source/getting_started/install.rst
@@ -159,9 +159,9 @@ Package Minimum supported version Note
`NumPy` 1.7 Required for MLlib DataFrame-based API
`pyarrow` 1.0.0 Optional for Spark SQL
`Py4J` 0.10.9.2 Required
`pandas` 0.23.2 Required for pandas APIs on Spark
`pyarrow` 1.0.0 Required for pandas APIs on Spark
`Numpy` 1.14(<1.20.0) Required for pandas APIs on Spark
`pandas` 0.23.2 Required for pandas API on Spark
`pyarrow` 1.0.0 Required for pandas API on Spark
`Numpy` 1.14(<1.20.0) Required for pandas API on Spark
============= ========================= ======================================

Note that PySpark requires Java 8 or later with ``JAVA_HOME`` properly set.
16 changes: 8 additions & 8 deletions python/docs/source/getting_started/ps_10mins.ipynb
@@ -4,11 +4,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# 10 minutes to pandas APIs on Spark\n",
"# 10 minutes to pandas API on Spark\n",
"\n",
"This is a short introduction to pandas APIs on Spark, geared mainly for new users. This notebook shows you some key differences between pandas and pandas APIs on Spark. You can run this examples by yourself on a live notebook [here](https://mybinder.org/v2/gh/pyspark.pandas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb). For Databricks Runtime, you can import and run [the current .ipynb file](https://raw.githubusercontent.com/databricks/koalas/master/docs/source/getting_started/10min.ipynb) out of the box. Try it on [Databricks Community Edition](https://community.cloud.databricks.com/) for free.\n",
"This is a short introduction to pandas API on Spark, geared mainly for new users. This notebook shows you some key differences between pandas and pandas API on Spark. You can run this examples by yourself on a live notebook [here](https://mybinder.org/v2/gh/pyspark.pandas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb). For Databricks Runtime, you can import and run [the current .ipynb file](https://raw.githubusercontent.com/databricks/koalas/master/docs/source/getting_started/10min.ipynb) out of the box. Try it on [Databricks Community Edition](https://community.cloud.databricks.com/) for free.\n",
"\n",
"Customarily, we import pandas APIs on Spark as follows:"
"Customarily, we import pandas API on Spark as follows:"
]
},
{
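
The import cell that follows is collapsed in this diff; as a hedged sketch, the customary imports under the `pyspark.pandas` module path used elsewhere in this PR would look like:

.. code-block:: python

    import pandas as pd
    import numpy as np
    import pyspark.pandas as ps  # assumed module path after the Koalas port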
@@ -35,7 +35,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Creating a pandas-on-Spark Series by passing a list of values, letting pandas APIs on Spark create a default integer index:"
"Creating a pandas-on-Spark Series by passing a list of values, letting pandas API on Spark create a default integer index:"
]
},
{
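
A hedged sketch of the Series-creation step described in the markdown cell above (the values are illustrative; `pyspark.pandas` is the assumed module path):

.. code-block:: python

    import numpy as np
    import pyspark.pandas as ps  # assumed module path

    # pandas API on Spark assigns a default integer index automatically.
    s = ps.Series([1, 3, 5, np.nan, 6, 8])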
@@ -531,7 +531,7 @@
"metadata": {},
"source": [
"Creating pandas-on-Spark DataFrame from Spark DataFrame.\n",
"`to_koalas()` is automatically attached to Spark DataFrame and available as an API when pandas APIs on Spark are imported."
"`to_koalas()` is automatically attached to Spark DataFrame and available as an API when pandas API on Spark is imported."
]
},
{
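
A hedged sketch of the conversion described in the cell above (assumes an active SparkSession; the example data is illustrative):

.. code-block:: python

    from pyspark.sql import SparkSession
    import pyspark.pandas  # noqa: F401 -- assumed module path; per the text above,
                           # importing the pandas API on Spark attaches `to_koalas()`

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    psdf = sdf.to_koalas()  # now a pandas-on-Spark DataFrame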
@@ -1287,7 +1287,7 @@
"metadata": {},
"source": [
"## Missing Data\n",
"Pandas APIs on Spark primarily use the value `np.nan` to represent missing data. It is by default not included in computations. \n"
"Pandas API on Spark primarily uses the value `np.nan` to represent missing data. It is by default not included in computations. \n"
]
},
{
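
A hedged sketch of the missing-data behaviour described above (values and column names are illustrative):

.. code-block:: python

    import numpy as np
    import pyspark.pandas as ps  # assumed module path

    psdf = ps.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

    psdf.dropna(how="any")  # drop rows that contain any missing value
    psdf.fillna(value=0)    # fill missing values with a constant
    psdf.mean()             # np.nan is excluded from the computation by default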
@@ -1621,7 +1621,7 @@
"source": [
"### Spark Configurations\n",
"\n",
"Various configurations in PySpark could be applied internally in pandas APIs on Spark.\n",
"Various configurations in PySpark could be applied internally in pandas API on Spark.\n",
"For example, you can enable Arrow optimization to hugely speed up internal pandas conversion. See <a href=\"https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html\">PySpark Usage Guide for Pandas with Apache Arrow</a>."
]
},
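
A hedged sketch of applying such a configuration (the Arrow config key name differs across Spark versions; the Spark 3.x spelling is assumed here):

.. code-block:: python

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Enable Arrow-based conversion between Spark and pandas.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)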
@@ -14312,7 +14312,7 @@
"source": [
"### Spark IO\n",
"\n",
"In addition, pandas APIs on Spark fully support Spark's various datasources such as ORC and an external datasource. See <a href=\"https://koalas.readthedocs.io/en/latest/reference/api/pyspark.pandas.DataFrame.to_spark_io.html#databricks.koalas.DataFrame.to_spark_io\">here</a> to write it to the specified datasource and <a href=\"https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_spark_io.html#databricks.koalas.read_spark_io\">here</a> to read it from the datasource."
"In addition, pandas API on Spark fully supports Spark's various datasources such as ORC and an external datasource. See <a href=\"https://koalas.readthedocs.io/en/latest/reference/api/pyspark.pandas.DataFrame.to_spark_io.html#databricks.koalas.DataFrame.to_spark_io\">here</a> to write it to the specified datasource and <a href=\"https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_spark_io.html#databricks.koalas.read_spark_io\">here</a> to read it from the datasource."
]
},
{
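
A hedged sketch of round-tripping through a Spark datasource with the `to_spark_io`/`read_spark_io` pair linked above (the path, format, and data are placeholders):

.. code-block:: python

    import pyspark.pandas as ps  # assumed module path

    psdf = ps.DataFrame({"id": [1, 2, 3]})

    # Write through Spark's datasource API.
    psdf.to_spark_io(path="/tmp/example.orc", format="orc", mode="overwrite")

    # Read it back as a pandas-on-Spark DataFrame.
    psdf2 = ps.read_spark_io(path="/tmp/example.orc", format="orc")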
16 changes: 8 additions & 8 deletions python/docs/source/getting_started/ps_install.rst
@@ -2,9 +2,9 @@
Installation
============

Pandas APIs on Spark require PySpark so please make sure your PySpark is available.
Pandas API on Spark requires PySpark so please make sure your PySpark is available.

To install pandas APIs on Spark, you can use:
To install pandas API on Spark, you can use:

- `Conda <https://anaconda.org/conda-forge/koalas>`__
- `PyPI <https://pypi.org/project/koalas>`__
@@ -24,12 +24,12 @@ Python version support
Officially Python 3.5 to 3.8.

.. note::
Pandas APIs on Spark support for Python 3.5 is deprecated and will be dropped in the future release.
At that point, existing Python 3.5 workflows that use pandas APIs on Spark will continue to work without
Python 3.5 support is deprecated and will be dropped in the future release.
At that point, existing Python 3.5 workflows that use pandas API on Spark will continue to work without
modification, but Python 3.5 users will no longer get access to the latest pandas-on-Spark features
and bugfixes. We recommend that you upgrade to Python 3.6 or newer.

Installing pandas APIs on Spark
Installing pandas API on Spark
-------------------------------

Installing with Conda
@@ -47,20 +47,20 @@ To put your self inside this environment run::

conda activate koalas-dev-env

The final step required is to install pandas APIs on Spark. This can be done with the
The final step required is to install pandas API on Spark. This can be done with the
following command::

conda install -c conda-forge koalas

To install a specific version of pandas APIs on Spark::
To install a specific version of pandas API on Spark::

conda install -c conda-forge koalas=1.3.0


Installing from PyPI
~~~~~~~~~~~~~~~~~~~~

Pandas APIs on Spark can be installed via pip from
Pandas API on Spark can be installed via pip from
`PyPI <https://pypi.org/project/koalas>`__::

pip install koalas