diff --git a/python/docs/source/development/index.rst b/python/docs/source/development/index.rst index eb80cc8aaed24..9c54d52d4d0b9 100644 --- a/python/docs/source/development/index.rst +++ b/python/docs/source/development/index.rst @@ -27,7 +27,7 @@ Development debugging setting_ide -For pandas APIs on Spark: +For pandas API on Spark: .. toctree:: :maxdepth: 2 diff --git a/python/docs/source/development/ps_contributing.rst b/python/docs/source/development/ps_contributing.rst index e87c1cfa8e4f4..c83057b01f6e5 100644 --- a/python/docs/source/development/ps_contributing.rst +++ b/python/docs/source/development/ps_contributing.rst @@ -17,9 +17,9 @@ The largest amount of work consists simply of implementing the pandas API using 3. Improve the project's documentation. -4. Write blog posts or tutorial articles evangelizing pandas APIs on Spark and help new users learn pandas APIs on Spark. +4. Write blog posts or tutorial articles evangelizing pandas API on Spark and help new users learn pandas API on Spark. -5. Give a talk about pandas APIs on Spark at your local meetup or a conference. +5. Give a talk about pandas API on Spark at your local meetup or a conference. Step-by-step Guide For Code Contributions @@ -48,7 +48,7 @@ Environment Setup Conda ----- -If you are using Conda, the pandas APIs on Spark installation and development environment are as follows. +If you are using Conda, the pandas API on Spark installation and development environment are as follows. .. code-block:: bash @@ -131,7 +131,7 @@ We follow `PEP 8 `_ with one exceptio Doctest Conventions =================== -When writing doctests, usually the doctests in pandas are converted into pandas APIs on Spark to make sure the same codes work in pandas APIs on Spark. +When writing doctests, usually the doctests in pandas are converted into pandas API on Spark to make sure the same codes work in pandas API on Spark. In general, doctests should be grouped logically by separating a newline. For instance, the first block is for the statements for preparation, the second block is for using the function with a specific argument, @@ -176,7 +176,7 @@ Only project maintainers can do the following to publish a release. # for release python3 -m twine upload --repository-url https://upload.pypi.org/legacy/ dist/koalas-$package_version-py3-none-any.whl dist/koalas-$package_version.tar.gz -5. Verify the uploaded package can be installed and executed. One unofficial tip is to run the doctests of pandas APIs on Spark within a Python interpreter after installing it. +5. Verify the uploaded package can be installed and executed. One unofficial tip is to run the doctests of pandas API on Spark within a Python interpreter after installing it. .. code-block:: python diff --git a/python/docs/source/development/ps_design.rst b/python/docs/source/development/ps_design.rst index 41515283437d1..b131e60587a22 100644 --- a/python/docs/source/development/ps_design.rst +++ b/python/docs/source/development/ps_design.rst @@ -4,16 +4,16 @@ Design Principles .. currentmodule:: pyspark.pandas -This section outlines design principles guiding the pandas APIs on Spark. +This section outlines design principles guiding the pandas API on Spark. Be Pythonic ----------- -Pandas APIs on Spark target Python data scientists. We want to stick to the convention that users are already familiar with as much as possible. Here are some examples: +Pandas API on Spark targets Python data scientists. 
We want to stick to the convention that users are already familiar with as much as possible. Here are some examples: -- Function names and parameters use snake_case, rather than CamelCase. This is different from PySpark's design. For example, pandas APIs on Spark have `to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we also provide Spark's variant as an alias. +- Function names and parameters use snake_case, rather than CamelCase. This is different from PySpark's design. For example, pandas API on Spark has `to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we also provide Spark's variant as an alias. -- Pandas APIs on Spark respect to the largest extent the conventions of the Python numerical ecosystem, and allows the use of NumPy types, etc. that can be supported by Spark. +- Pandas API on Spark respects to the largest extent the conventions of the Python numerical ecosystem, and allows the use of NumPy types, etc. that can be supported by Spark. - pandas-on-Spark docs' style and infrastructure simply follow rest of the PyData projects'. @@ -26,13 +26,13 @@ There are different classes of functions: 1. Functions that are found in both Spark and pandas under the same name (`count`, `dtypes`, `head`). The return value is the same as the return type in pandas (and not Spark's). - 2. Functions that are found in Spark but that have a clear equivalent in pandas, e.g. `alias` and `rename`. These functions will be implemented as the alias of the pandas function, but should be marked that they are aliases of the same functions. They are provided so that existing users of PySpark can get the benefits of pandas APIs on Spark without having to adapt their code. + 2. Functions that are found in Spark but that have a clear equivalent in pandas, e.g. `alias` and `rename`. These functions will be implemented as the alias of the pandas function, but should be marked that they are aliases of the same functions. They are provided so that existing users of PySpark can get the benefits of pandas API on Spark without having to adapt their code. - 3. Functions that are only found in pandas. When these functions are appropriate for distributed datasets, they should become available in pandas APIs on Spark. + 3. Functions that are only found in pandas. When these functions are appropriate for distributed datasets, they should become available in pandas API on Spark. - 4. Functions that are only found in Spark that are essential to controlling the distributed nature of the computations, e.g. `cache`. These functions should be available in pandas APIs on Spark. + 4. Functions that are only found in Spark that are essential to controlling the distributed nature of the computations, e.g. `cache`. These functions should be available in pandas API on Spark. -We are still debating whether data transformation functions only available in Spark should be added to pandas APIs on Spark, e.g. `select`. We would love to hear your feedback on that. +We are still debating whether data transformation functions only available in Spark should be added to pandas API on Spark, e.g. `select`. We would love to hear your feedback on that. 
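To make the naming convention above concrete, here is a minimal sketch (assuming a running Spark session, as in the rest of these docs):

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> kdf = ks.DataFrame({'a': [1, 2, 3]})
    >>> pdf = kdf.to_pandas()   # snake_case, following pandas
    >>> sdf = kdf.to_spark()    # a PySpark DataFrame
    >>> pdf = sdf.toPandas()    # the equivalent PySpark call uses camelCase
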
Return pandas-on-Spark data structure for big data, and pandas data structure for small data -------------------------------------------------------------------------------------------- @@ -46,19 +46,19 @@ At the risk of overgeneralization, there are two API design approaches: the firs One example is value count (count by some key column), one of the most common operations in data science. pandas `DataFrame.value_count` returns the result in sorted order, which in 90% of the cases is what users prefer when exploring data, whereas Spark's does not sort, which is more desirable when building data pipelines, as users can accomplish the pandas behavior by adding an explicit `orderBy`. -Similar to pandas, pandas APIs on Spark should also lean more towards the former, providing discoverable APIs for common data science tasks. In most cases, this principle is well taken care of by simply implementing pandas' APIs. However, there will be circumstances in which pandas' APIs don't address a specific need, e.g. plotting for big data. +Similar to pandas, pandas API on Spark should also lean more towards the former, providing discoverable APIs for common data science tasks. In most cases, this principle is well taken care of by simply implementing pandas' APIs. However, there will be circumstances in which pandas' APIs don't address a specific need, e.g. plotting for big data. Provide well documented APIs, with examples ------------------------------------------- All functions and parameters should be documented. Most functions should be documented with examples, because those are the easiest to understand than a blob of text explaining what the function does. -A recommended way to add documentation is to start with the docstring of the corresponding function in PySpark or pandas, and adapt it for pandas APIs on Spark. If you are adding a new function, also add it to the API reference doc index page in `docs/source/reference` directory. The examples in docstring also improve our test coverage. +A recommended way to add documentation is to start with the docstring of the corresponding function in PySpark or pandas, and adapt it for pandas API on Spark. If you are adding a new function, also add it to the API reference doc index page in `docs/source/reference` directory. The examples in docstring also improve our test coverage. Guardrails to prevent users from shooting themselves in the foot ---------------------------------------------------------------- -Certain operations in pandas are prohibitively expensive as data scales, and we don't want to give users the illusion that they can rely on such operations in pandas APIs on Spark. That is to say, methods implemented in pandas APIs on Spark should be safe to perform by default on large datasets. As a result, the following capabilities are not implemented in pandas APIs on Spark: +Certain operations in pandas are prohibitively expensive as data scales, and we don't want to give users the illusion that they can rely on such operations in pandas API on Spark. That is to say, methods implemented in pandas API on Spark should be safe to perform by default on large datasets. As a result, the following capabilities are not implemented in pandas API on Spark: 1. Capabilities that are fundamentally not parallelizable: e.g. imperatively looping over each element 2. Capabilities that require materializing the entire working set in a single node's memory. This is why we do not implement `pandas.DataFrame.to_xarray `_. 
Another example is the `_repr_html_` call caps the total number of records shown to a maximum of 1000, to prevent users from blowing up their driver node simply by typing the name of the DataFrame in a notebook. @@ -66,20 +66,20 @@ Certain operations in pandas are prohibitively expensive as data scales, and we A few exceptions, however, exist. One common pattern with "big data science" is that while the initial dataset is large, the working set becomes smaller as the analysis goes deeper. For example, data scientists often perform aggregation on datasets and want to then convert the aggregated dataset to some local data structure. To help data scientists, we offer the following: - :func:`DataFrame.to_pandas`: returns a pandas DataFrame, koalas only -- :func:`DataFrame.to_numpy`: returns a numpy array, works with both pandas and pandas APIs on Spark +- :func:`DataFrame.to_numpy`: returns a numpy array, works with both pandas and pandas API on Spark Note that it is clear from the names that these functions return some local data structure that would require materializing data in a single node's memory. For these functions, we also explicitly document them with a warning note that the resulting data structure must be small. Be a lean API layer and move fast --------------------------------- -Pandas APIs on Spark are designed as an API overlay layer on top of Spark. The project should be lightweight, and most functions should be implemented as wrappers +Pandas API on Spark is designed as an API overlay layer on top of Spark. The project should be lightweight, and most functions should be implemented as wrappers around Spark or pandas - the pandas-on-Spark library is designed to be used only in the Spark's driver side in general. -Pandas APIs on Spark do not accept heavyweight implementations, e.g. execution engine changes. +Pandas API on Spark does not accept heavyweight implementations, e.g. execution engine changes. This approach enables us to move fast. For the considerable future, we aim to be making monthly releases. If we find a critical bug, we will be making a new release as soon as the bug fix is available. High test coverage ------------------ -Pandas APIs on Spark should be well tested. The project tracks its test coverage with over 90% across the entire codebase, and close to 100% for critical parts. Pull requests will not be accepted unless they have close to 100% statement coverage from the codecov report. +Pandas API on Spark should be well tested. The project tracks its test coverage with over 90% across the entire codebase, and close to 100% for critical parts. Pull requests will not be accepted unless they have close to 100% statement coverage from the codecov report. diff --git a/python/docs/source/getting_started/index.rst b/python/docs/source/getting_started/index.rst index 5d2d1a81c3dfd..ed23f5a3d48a0 100644 --- a/python/docs/source/getting_started/index.rst +++ b/python/docs/source/getting_started/index.rst @@ -31,7 +31,7 @@ at `the Spark documentation PySpark Usage Guide for Pandas with Apache Arrow." ] }, @@ -14312,7 +14312,7 @@ "source": [ "### Spark IO\n", "\n", - "In addition, pandas APIs on Spark fully support Spark's various datasources such as ORC and an external datasource. See here to write it to the specified datasource and here to read it from the datasource." + "In addition, pandas API on Spark fully supports Spark's various datasources such as ORC and an external datasource. 
See here to write it to the specified datasource and here to read it from the datasource." ] }, { diff --git a/python/docs/source/getting_started/ps_install.rst b/python/docs/source/getting_started/ps_install.rst index 538f012e40868..974895a0732aa 100644 --- a/python/docs/source/getting_started/ps_install.rst +++ b/python/docs/source/getting_started/ps_install.rst @@ -2,9 +2,9 @@ Installation ============ -Pandas APIs on Spark require PySpark so please make sure your PySpark is available. +Pandas API on Spark requires PySpark so please make sure your PySpark is available. -To install pandas APIs on Spark, you can use: +To install pandas API on Spark, you can use: - `Conda `__ - `PyPI `__ @@ -24,12 +24,12 @@ Python version support Officially Python 3.5 to 3.8. .. note:: - Pandas APIs on Spark support for Python 3.5 is deprecated and will be dropped in the future release. - At that point, existing Python 3.5 workflows that use pandas APIs on Spark will continue to work without + Python 3.5 support is deprecated and will be dropped in the future release. + At that point, existing Python 3.5 workflows that use pandas API on Spark will continue to work without modification, but Python 3.5 users will no longer get access to the latest pandas-on-Spark features and bugfixes. We recommend that you upgrade to Python 3.6 or newer. -Installing pandas APIs on Spark +Installing pandas API on Spark ------------------------------- Installing with Conda @@ -47,12 +47,12 @@ To put your self inside this environment run:: conda activate koalas-dev-env -The final step required is to install pandas APIs on Spark. This can be done with the +The final step required is to install pandas API on Spark. This can be done with the following command:: conda install -c conda-forge koalas -To install a specific version of pandas APIs on Spark:: +To install a specific version of pandas API on Spark:: conda install -c conda-forge koalas=1.3.0 @@ -60,7 +60,7 @@ To install a specific version of pandas APIs on Spark:: Installing from PyPI ~~~~~~~~~~~~~~~~~~~~ -Pandas APIs on Spark can be installed via pip from +Pandas API on Spark can be installed via pip from `PyPI `__:: pip install koalas diff --git a/python/docs/source/getting_started/quickstart.ipynb b/python/docs/source/getting_started/quickstart.ipynb index 550b532fefc14..4325ce3e35136 100644 --- a/python/docs/source/getting_started/quickstart.ipynb +++ b/python/docs/source/getting_started/quickstart.ipynb @@ -448,7 +448,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "PySpark DataFrame also provides the conversion back to a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to leverage pandas APIs. Note that `toPandas` also collects all data into the driver side that can easily cause an out-of-memory-error when the data is too large to fit into the driver side." + "PySpark DataFrame also provides the conversion back to a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to leverage pandas API. Note that `toPandas` also collects all data into the driver side that can easily cause an out-of-memory-error when the data is too large to fit into the driver side." ] }, { @@ -831,7 +831,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You can also apply a Python native function against each group by using pandas APIs." + "You can also apply a Python native function against each group by using pandas API." 
] }, { diff --git a/python/docs/source/index.rst b/python/docs/source/index.rst index d9cb42b9f81e4..7f650b79a1ad5 100644 --- a/python/docs/source/index.rst +++ b/python/docs/source/index.rst @@ -38,9 +38,9 @@ Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrame and can also act as distributed SQL query engine. -**pandas APIs on Spark** +**pandas API on Spark** -pandas APIs on Spark allow you to scale your pandas workload out. +pandas API on Spark allows you to scale your pandas workload out. With this package, you can: * Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas. @@ -74,4 +74,3 @@ and in-memory computing capabilities. reference/index development/index migration_guide/index - diff --git a/python/docs/source/reference/pyspark.pandas/frame.rst b/python/docs/source/reference/pyspark.pandas/frame.rst index ad18e72ef9421..fb39b1dc47910 100644 --- a/python/docs/source/reference/pyspark.pandas/frame.rst +++ b/python/docs/source/reference/pyspark.pandas/frame.rst @@ -316,7 +316,7 @@ specific plotting methods of the form ``DataFrame.plot.``. Pandas-on-Spark specific ------------------------ -``DataFrame.pandas_on_spark`` provides pandas-on-Spark specific features that exists only in pandas APIs on Spark. +``DataFrame.pandas_on_spark`` provides pandas-on-Spark specific features that exists only in pandas API on Spark. These can be accessed by ``DataFrame.pandas_on_spark.``. .. autosummary:: diff --git a/python/docs/source/reference/pyspark.pandas/index.rst b/python/docs/source/reference/pyspark.pandas/index.rst index 8e4ca48394374..1e7177bb8031c 100644 --- a/python/docs/source/reference/pyspark.pandas/index.rst +++ b/python/docs/source/reference/pyspark.pandas/index.rst @@ -16,11 +16,11 @@ under the License. -==================== -Pandas APIs on Spark -==================== +=================== +Pandas API on Spark +=================== -This page gives an overview of all public pandas APIs on Spark. +This page gives an overview of all public pandas API on Spark. .. toctree:: :maxdepth: 2 diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst index 0f6d06173a01a..a199d70aea541 100644 --- a/python/docs/source/reference/pyspark.pandas/series.rst +++ b/python/docs/source/reference/pyspark.pandas/series.rst @@ -258,7 +258,7 @@ in Spark. These can be accessed by ``Series.spark.``. Accessors --------- -Pandas APIs on Spark provide dtype-specific methods under various accessors. +Pandas API on Spark provides dtype-specific methods under various accessors. These are separate namespaces within :class:`Series` that only apply to specific data types. @@ -444,7 +444,7 @@ Serialization / IO / Conversion Pandas-on-Spark specific ------------------------ -``Series.pandas_on_spark`` provides pandas-on-Spark specific features that exists only in pandas APIs on Spark. +``Series.pandas_on_spark`` provides pandas-on-Spark specific features that exists only in pandas API on Spark. These can be accessed by ``Series.pandas_on_spark.``. .. 
autosummary:: diff --git a/python/docs/source/user_guide/pandas_on_spark/best_practices.rst b/python/docs/source/user_guide/pandas_on_spark/best_practices.rst index 051249dca4de5..6bd0e87f9af51 100644 --- a/python/docs/source/user_guide/pandas_on_spark/best_practices.rst +++ b/python/docs/source/user_guide/pandas_on_spark/best_practices.rst @@ -5,15 +5,15 @@ Best Practices Leverage PySpark APIs --------------------- -Pandas APIs on Spark use Spark under the hood; therefore, many features and performance optimization are available -in pandas APIs on Spark as well. Leverage and combine those cutting-edge features with pandas APIs on Spark. +Pandas API on Spark uses Spark under the hood; therefore, many features and performance optimization are available +in pandas API on Spark as well. Leverage and combine those cutting-edge features with pandas API on Spark. -Existing Spark context and Spark sessions are used out of the box in pandas APIs on Spark. If you already have your own -configured Spark context or sessions running, pandas APIs on Spark use them. +Existing Spark context and Spark sessions are used out of the box in pandas API on Spark. If you already have your own +configured Spark context or sessions running, pandas API on Spark uses them. If there is no Spark context or session running in your environment (e.g., ordinary Python interpreter), such configurations can be set to ``SparkContext`` and/or ``SparkSession``. -Once Spark context and/or session is created, pandas APIs on Spark can use this context and/or session automatically. +Once Spark context and/or session is created, pandas API on Spark can use this context and/or session automatically. For example, if you want to configure the executor memory in Spark, you can do as below: .. code-block:: python @@ -21,7 +21,7 @@ For example, if you want to configure the executor memory in Spark, you can do a from pyspark import SparkConf, SparkContext conf = SparkConf() conf.set('spark.executor.memory', '2g') - # Pandas APIs on Spark automatically use this Spark context with the configurations set. + # Pandas API on Spark automatically uses this Spark context with the configurations set. SparkContext(conf=conf) import pyspark.pandas as ks @@ -35,13 +35,13 @@ it can be set into Spark session as below: from pyspark.sql import SparkSession builder = SparkSession.builder.appName("pandas-on-spark") builder = builder.config("spark.sql.execution.arrow.enabled", "true") - # Pandas APIs on Spark automatically use this Spark session with the configurations set. + # Pandas API on Spark automatically uses this Spark session with the configurations set. builder.getOrCreate() import pyspark.pandas as ks ... -All Spark features such as history server, web UI and deployment modes can be used as are with pandas APIs on Spark. +All Spark features such as history server, web UI and deployment modes can be used as are with pandas API on Spark. If you are interested in performance tuning, please see also `Tuning Spark `_. @@ -49,7 +49,7 @@ Check execution plans --------------------- Expensive operations can be predicted by leveraging PySpark API `DataFrame.spark.explain()` -before the actual computation since pandas APIs on Spark are based on lazy execution. For example, see below. +before the actual computation since pandas API on Spark is based on lazy execution. For example, see below. .. 
code-block:: python @@ -65,14 +65,14 @@ before the actual computation since pandas APIs on Spark are based on lazy execu Whenever you are not sure about such cases, you can check the actual execution plans and foresee the expensive cases. -Even though pandas APIs on Spark try its best to optimize and reduce such shuffle operations by leveraging Spark +Even though pandas API on Spark tries its best to optimize and reduce such shuffle operations by leveraging Spark optimizers, it is best to avoid shuffling in the application side whenever possible. Use checkpoint -------------- -After a bunch of operations on pandas APIs on Spark objects, the underlying Spark planner can slow down due to the huge and complex plan. +After a bunch of operations on pandas API on Spark objects, the underlying Spark planner can slow down due to the huge and complex plan. If the Spark plan becomes huge or it takes the planning long time, ``DataFrame.spark.checkpoint()`` or ``DataFrame.spark.local_checkpoint()`` would be helpful. @@ -157,14 +157,14 @@ as it is less expensive because data can be distributed and computed for each gr Avoid reserved column names --------------------------- -Columns with leading ``__`` and trailing ``__`` are reserved in pandas APIs on Spark. To handle internal behaviors for, such as, index, -pandas APIs on Spark use some internal columns. Therefore, it is discouraged to use such column names and not guaranteed to work. +Columns with leading ``__`` and trailing ``__`` are reserved in pandas API on Spark. To handle internal behaviors for, such as, index, +pandas API on Spark uses some internal columns. Therefore, it is discouraged to use such column names and not guaranteed to work. Do not use duplicated column names ---------------------------------- -It is disallowed to use duplicated column names because Spark SQL does not allow this in general. Pandas APIs on Spark inherit +It is disallowed to use duplicated column names because Spark SQL does not allow this in general. Pandas API on Spark inherits this behavior. For instance, see below: .. code-block:: python @@ -175,7 +175,7 @@ this behavior. For instance, see below: ... Reference 'a' is ambiguous, could be: a, a.; -Additionally, it is strongly discouraged to use case sensitive column names. Pandas APIs on Spark disallow it by default. +Additionally, it is strongly discouraged to use case sensitive column names. Pandas API on Spark disallows it by default. .. code-block:: python @@ -204,8 +204,8 @@ However, you can turn on ``spark.sql.caseSensitive`` in Spark configuration to e Specify the index column in conversion from Spark DataFrame to pandas-on-Spark DataFrame ---------------------------------------------------------------------------------------- -When pandas APIs on Spark Dataframe are converted from Spark DataFrame, it loses the index information, which results in using -the default index in pandas APIs on Spark DataFrame. The default index is inefficient in general comparing to explicitly specifying +When pandas-on-Spark Dataframe is converted from Spark DataFrame, it loses the index information, which results in using +the default index in pandas API on Spark DataFrame. The default index is inefficient in general comparing to explicitly specifying the index column. Specify the index column whenever possible. 
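For instance, a sketch of the difference (it assumes a Spark session named ``spark`` and that the Spark-to-pandas-on-Spark conversion helper accepts an ``index_col`` argument, as described in the page linked below):

.. code-block:: python

    import pyspark.pandas as ks

    sdf = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])

    # Default index: pandas API on Spark has to attach one, which can be expensive.
    kdf = ks.DataFrame(sdf)

    # Preferred: reuse an existing column as the index instead.
    # (``to_pandas_on_spark`` and its ``index_col`` argument are assumed here;
    # see the page linked below for the exact conversion APIs.)
    kdf = sdf.to_pandas_on_spark(index_col='id')
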
See `working with PySpark `_ @@ -214,7 +214,7 @@ See `working with PySpark `_ Use ``distributed`` or ``distributed-sequence`` default index ------------------------------------------------------------- -One common issue when pandas-on-Spark users face is the slow performance by default index. Pandas APIs on Spark attache +One common issue when pandas-on-Spark users face is the slow performance by default index. Pandas API on Spark attaches a default index when the index is unknown, for example, Spark DataFrame is directly converted to pandas-on-Spark DataFrame. This default index is ``sequence`` which requires the computation on single partition which is discouraged. If you plan @@ -227,19 +227,19 @@ See `Default Index Type `_ for more details abou Reduce the operations on different DataFrame/Series --------------------------------------------------- -Pandas APIs on Spark disallow the operations on different DataFrames (or Series) by default to prevent expensive operations. +Pandas API on Spark disallows the operations on different DataFrames (or Series) by default to prevent expensive operations. It internally performs a join operation which can be expensive in general, which is discouraged. Whenever possible, this operation should be avoided. See `Operations on different DataFrames `_ for more details. -Use pandas APIs on Spark directly whenever possible +Use pandas API on Spark directly whenever possible --------------------------------------------------- -Although pandas APIs on Spark have most of the pandas-equivalent APIs, there are several APIs not implemented yet or explicitly unsupported. +Although pandas API on Spark has most of the pandas-equivalent APIs, there are several APIs not implemented yet or explicitly unsupported. -As an example, pandas APIs on Spark do not implement ``__iter__()`` to prevent users from collecting all data into the client (driver) side from the whole cluster. +As an example, pandas API on Spark does not implement ``__iter__()`` to prevent users from collecting all data into the client (driver) side from the whole cluster. Unfortunately, many external APIs such as Python built-in functions such as min, max, sum, etc. require the given argument to be iterable. In case of pandas, it works properly out of the box as below: @@ -291,7 +291,7 @@ Therefore, it works seamlessly in pandas as below: Helsinki 144.0 dtype: float64 -However, for pandas APIs on Spark it do not work as the same reason above. +However, for pandas API on Spark it does not work as the same reason above. The example above can be also changed to directly using pandas-on-Spark APIs as below: .. code-block:: python diff --git a/python/docs/source/user_guide/pandas_on_spark/from_to_dbms.rst b/python/docs/source/user_guide/pandas_on_spark/from_to_dbms.rst index 2126d0fe058df..1f9e79495e9e9 100644 --- a/python/docs/source/user_guide/pandas_on_spark/from_to_dbms.rst +++ b/python/docs/source/user_guide/pandas_on_spark/from_to_dbms.rst @@ -4,8 +4,8 @@ From/to other DBMSes .. currentmodule:: pyspark.pandas -The APIs interacting with other DBMSes in pandas APIs on Spark are slightly different from the ones in pandas -because pandas APIs on Spark leverage JDBC APIs in PySpark to read and write from/to other DBMSes. +The APIs interacting with other DBMSes in pandas API on Spark are slightly different from the ones in pandas +because pandas API on Spark leverages JDBC APIs in PySpark to read and write from/to other DBMSes. 
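For example, reading a table over JDBC looks like the following sketch (it assumes the SQLite ``example`` database and JDBC driver that are set up later on this page):

.. code-block:: python

    import os
    import pyspark.pandas as ks

    # Read the ``example`` table through Spark's JDBC source.
    kdf = ks.read_sql("example", con="jdbc:sqlite:{}/example.db".format(os.getcwd()))
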
The APIs to read/write from/to external DBMSes are as follows: @@ -48,13 +48,13 @@ Firstly, create the ``example`` database as below via Python's SQLite library. T con.commit() con.close() -Pandas APIs on Spark require a JDBC driver to read so it requires the driver for your particular database to be on the Spark's classpath. For SQLite JDBC driver, you can download it, for example, as below: +Pandas API on Spark requires a JDBC driver to read so it requires the driver for your particular database to be on the Spark's classpath. For SQLite JDBC driver, you can download it, for example, as below: .. code-block:: bash curl -O https://repo1.maven.org/maven2/org/xerial/sqlite-jdbc/3.34.0/sqlite-jdbc-3.34.0.jar -After that, you should add it into your Spark session first. Once you add them, pandas APIs on Spark will automatically detect the Spark session and leverage it. +After that, you should add it into your Spark session first. Once you add them, pandas API on Spark will automatically detect the Spark session and leverage it. .. code-block:: python diff --git a/python/docs/source/user_guide/pandas_on_spark/index.rst b/python/docs/source/user_guide/pandas_on_spark/index.rst index eedc1f7d7aa43..c623200d78279 100644 --- a/python/docs/source/user_guide/pandas_on_spark/index.rst +++ b/python/docs/source/user_guide/pandas_on_spark/index.rst @@ -16,9 +16,9 @@ under the License. -==================== -Pandas APIs on Spark -==================== +=================== +Pandas API on Spark +=================== .. toctree:: :maxdepth: 2 diff --git a/python/docs/source/user_guide/pandas_on_spark/options.rst b/python/docs/source/user_guide/pandas_on_spark/options.rst index 26839c8c2d2f9..66b7a93486751 100644 --- a/python/docs/source/user_guide/pandas_on_spark/options.rst +++ b/python/docs/source/user_guide/pandas_on_spark/options.rst @@ -3,7 +3,7 @@ Options and settings ==================== .. currentmodule:: pyspark.pandas -Pandas APIs on Spark have an options system that lets you customize some aspects of its behaviour, +Pandas API on Spark has an options system that lets you customize some aspects of its behaviour, display-related options being those the user is most likely to adjust. Options have a full "dotted-style", case-insensitive name (e.g. ``display.max_rows``). @@ -92,7 +92,7 @@ are restored automatically when you exit the `with` block: Operations on different DataFrames ---------------------------------- -Pandas APIs on Spark disallow the operations on different DataFrames (or Series) by default to prevent expensive +Pandas API on Spark disallows the operations on different DataFrames (or Series) by default to prevent expensive operations. It internally performs a join operation which can be expensive in general. This can be enabled by setting `compute.ops_on_diff_frames` to `True` to allow such cases. @@ -134,8 +134,8 @@ See the examples below. Default Index type ------------------ -In pandas APIs on Spark, the default index is used in several cases, for instance, -when Spark DataFrame is converted into pandas-on-Spark DataFrame. In this case, internally pandas APIs on Spark attache a +In pandas API on Spark, the default index is used in several cases, for instance, +when Spark DataFrame is converted into pandas-on-Spark DataFrame. In this case, internally pandas API on Spark attaches a default index into pandas-on-Spark DataFrame. 
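As a small sketch of how this is typically configured through the options system described above:

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> ks.set_option('compute.default_index_type', 'distributed-sequence')
    >>> ks.reset_option('compute.default_index_type')
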
There are several types of the default index that can be configured by `compute.default_index_type` as below: diff --git a/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst b/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst index c360946373f67..6796cbb0aa9c6 100644 --- a/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst +++ b/python/docs/source/user_guide/pandas_on_spark/pandas_pyspark.rst @@ -5,15 +5,15 @@ From/to pandas and PySpark DataFrames .. currentmodule:: pyspark.pandas Users from pandas and/or PySpark face API compatibility issue sometimes when they -work with pandas APIs on Spark. Since pandas APIs on Spark do not target 100% compatibility of both pandas and +work with pandas API on Spark. Since pandas API on Spark does not target 100% compatibility of both pandas and PySpark, users need to do some workaround to port their pandas and/or PySpark codes or -get familiar with pandas APIs on Spark in this case. This page aims to describe it. +get familiar with pandas API on Spark in this case. This page aims to describe it. pandas ------ -pandas users can access to full pandas APIs by calling :func:`DataFrame.to_pandas`. +pandas users can access to full pandas API by calling :func:`DataFrame.to_pandas`. pandas-on-Spark DataFrame and pandas DataFrame are similar. However, the former is distributed and the latter is in a single machine. When converting to each other, the data is transferred between multiple machines and the single client machine. @@ -57,7 +57,7 @@ pandas DataFrame can be a pandas-on-Spark DataFrame easily as below: 9 9 Note that converting pandas-on-Spark DataFrame to pandas requires to collect all the data into the client machine; therefore, -if possible, it is recommended to use pandas APIs on Spark or PySpark APIs instead. +if possible, it is recommended to use pandas API on Spark or PySpark APIs instead. PySpark diff --git a/python/docs/source/user_guide/pandas_on_spark/transform_apply.rst b/python/docs/source/user_guide/pandas_on_spark/transform_apply.rst index a127c5676847d..46d598080d8a3 100644 --- a/python/docs/source/user_guide/pandas_on_spark/transform_apply.rst +++ b/python/docs/source/user_guide/pandas_on_spark/transform_apply.rst @@ -34,7 +34,7 @@ to return the same length of the input and the latter does not require this. See ... >>> kdf.apply(pandas_plus) -In this case, each function takes a pandas Series, and pandas APIs on Spark compute the functions in a distributed manner as below. +In this case, each function takes a pandas Series, and pandas API on Spark computes the functions in a distributed manner as below. .. image:: https://user-images.githubusercontent.com/6477701/80076790-a1cf0680-8587-11ea-8b08-8dc694071ba0.png :alt: transform and apply @@ -86,7 +86,7 @@ then applies the given function with pandas DataFrame or Series as input and out >>> kdf.koalas.apply_batch(pandas_plus) The functions in both examples take a pandas DataFrame as a chunk of pandas-on-Spark DataFrame, and output a pandas DataFrame. -Pandas APIs on Spark combine the pandas DataFrames as a pandas-on-Spark DataFrame. +Pandas API on Spark combines the pandas DataFrames as a pandas-on-Spark DataFrame. Note that :func:`DataFrame.koalas.transform_batch` has the length restriction - the length of input and output should be the same whereas :func:`DataFrame.koalas.apply_batch` does not. 
However, it is important to know that diff --git a/python/docs/source/user_guide/pandas_on_spark/typehints.rst b/python/docs/source/user_guide/pandas_on_spark/typehints.rst index b6038f294f26d..cb1227115c742 100644 --- a/python/docs/source/user_guide/pandas_on_spark/typehints.rst +++ b/python/docs/source/user_guide/pandas_on_spark/typehints.rst @@ -1,27 +1,27 @@ -================================== -Type Hints in Pandas APIs on Spark -================================== +================================= +Type Hints in Pandas API on Spark +================================= .. currentmodule:: pyspark.pandas -Pandas APIs on Spark, by default, infers the schema by taking some top records from the output, +Pandas API on Spark, by default, infers the schema by taking some top records from the output, in particular, when you use APIs that allow users to apply a function against pandas-on-Spark DataFrame such as :func:`DataFrame.transform`, :func:`DataFrame.apply`, :func:`DataFrame.koalas.apply_batch`, :func:`DataFrame.koalas.apply_batch`, :func:`Series.koalas.apply_batch`, etc. However, this is potentially expensive. If there are several expensive operations such as a shuffle -in the upstream of the execution plan, pandas APIs on Spark will end up with executing the Spark job twice, once +in the upstream of the execution plan, pandas API on Spark will end up with executing the Spark job twice, once for schema inference, and once for processing actual data with the schema. -To avoid the consequences, pandas APIs on Spark have its own type hinting style to specify the schema to avoid -schema inference. Pandas APIs on Spark understand the type hints specified in the return type and converts it +To avoid the consequences, pandas API on Spark has its own type hinting style to specify the schema to avoid +schema inference. Pandas API on Spark understands the type hints specified in the return type and converts it as a Spark schema for pandas UDFs used internally. The way of type hinting has been evolved over the time. In this chapter, it covers the recommended way and the supported ways in details. .. note:: - The variadic generics support is experimental and unstable in pandas APIs on Spark. + The variadic generics support is experimental and unstable in pandas API on Spark. The way of typing can change between minor releases without a warning. See also `PEP 646 `_ for variadic generics in Python. @@ -43,7 +43,7 @@ it as a Spark schema. As an example, you can specify the return type hint as bel >>> df.groupby('A').apply(pandas_div) The function ``pandas_div`` actually takes and outputs a pandas DataFrame instead of pandas-on-Spark :class:`DataFrame`. -However, pandas APIs on Spark have to force to set the mismatched type hints. +However, pandas API on Spark has to force to set the mismatched type hints. From pandas-on-Spark 1.0 with Python 3.7+, now you can specify the type hints by using pandas instances. @@ -66,7 +66,7 @@ Likewise, pandas Series can be also used as a type hints: >>> df = ks.DataFrame([[4, 9]] * 3, columns=['A', 'B']) >>> df.apply(sqrt, axis=0) -Currently, both pandas APIs on Spark and pandas instances can be used to specify the type hints; however, pandas-on-Spark +Currently, both pandas API on Spark and pandas instances can be used to specify the type hints; however, pandas-on-Spark plans to move gradually towards using pandas instances only as the stability becomes proven. 
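To make the benefit concrete, a small sketch (same imports as the surrounding examples): with the return type hinted as below, the hint is converted straight into a Spark schema, so no sampling job is needed to infer it.

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> kdf = ks.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
    >>> def pandas_plus(pdf) -> ks.DataFrame[int, int]:
    ...     return pdf + 1
    >>> kdf.koalas.apply_batch(pandas_plus)
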
@@ -75,7 +75,7 @@ Type Hinting with Names In pandas-on-Spark 1.0, the new style of type hinting was introduced to overcome the limitations in the existing type hinting especially for DataFrame. When you use a DataFrame as the return type hint, for example, -``DataFrame[int, int]``, there is no way to specify the names of each Series. In the old way, pandas APIs on Spark just generate +``DataFrame[int, int]``, there is no way to specify the names of each Series. In the old way, pandas API on Spark just generates the column names as ``c#`` and this easily leads users to lose or forgot the Series mappings. See the example below: .. code-block:: python @@ -95,7 +95,7 @@ the column names as ``c#`` and this easily leads users to lose or forgot the Ser 3 3 4 4 4 5 -The new style of type hinting in pandas APIs on Spark are similar with the regular Python type hints in variables. The Series name +The new style of type hinting in pandas API on Spark is similar with the regular Python type hints in variables. The Series name is specified as a string, and the type is specified after a colon. The following example shows a simple case with the Series names, ``id`` and ``A``, and ``int`` types respectively. @@ -116,7 +116,7 @@ the Series names, ``id`` and ``A``, and ``int`` types respectively. 3 3 4 4 4 5 -In addition, pandas APIs on Spark also dynamically support ``dtype`` instance and the column index in pandas so that users can +In addition, pandas API on Spark also dynamically supports ``dtype`` instance and the column index in pandas so that users can programmatically generate the return type and schema. .. code-block:: python @@ -126,7 +126,7 @@ programmatically generate the return type and schema. ... >>> kdf.koalas.apply_batch(transform) -Likewise, ``dtype`` instances from pandas DataFrame can be used alone and let pandas APIs on Spark generate column names. +Likewise, ``dtype`` instances from pandas DataFrame can be used alone and let pandas API on Spark generate column names. .. code-block:: python @@ -134,4 +134,3 @@ Likewise, ``dtype`` instances from pandas DataFrame can be used alone and let pa ... return pdf + 1 ... >>> kdf.koalas.apply_batch(transform) - diff --git a/python/docs/source/user_guide/pandas_on_spark/types.rst b/python/docs/source/user_guide/pandas_on_spark/types.rst index 4ba8b461ce872..194a0ad4efd53 100644 --- a/python/docs/source/user_guide/pandas_on_spark/types.rst +++ b/python/docs/source/user_guide/pandas_on_spark/types.rst @@ -1,14 +1,14 @@ -==================================== -Type Support in Pandas APIs on Spark -==================================== +=================================== +Type Support in Pandas API on Spark +=================================== .. currentmodule:: pyspark.pandas In this chapter, we will briefly show you how data types change when converting pandas-on-Spark DataFrame from/to PySpark DataFrame or pandas DataFrame. -Type casting between PySpark and pandas APIs on Spark ------------------------------------------------------ +Type casting between PySpark and pandas API on Spark +---------------------------------------------------- When converting a pandas-on-Spark DataFrame from/to PySpark DataFrame, the data types are automatically casted to the appropriate type. 
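For a quick illustration of the automatic casting in the PySpark-to-pandas-on-Spark direction (a sketch; the exact dtype-to-Spark-type pairs are listed in the tables further down this page):

.. code-block:: python

    >>> import pyspark.pandas as ks
    >>> sdf = spark.range(3)          # a PySpark DataFrame with a bigint column
    >>> ks.DataFrame(sdf).dtypes      # becomes int64 on the pandas-on-Spark side
    id    int64
    dtype: object
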
@@ -80,8 +80,8 @@ The example below shows how data types are casted from pandas-on-Spark DataFrame DataFrame[int8: tinyint, bool: boolean, float32: float, float64: double, int32: int, int64: bigint, int16: smallint, datetime: timestamp, object_string: string, object_decimal: decimal(2,1), object_date: date] -Type casting between pandas and pandas APIs on Spark ----------------------------------------------------- +Type casting between pandas and pandas API on Spark +--------------------------------------------------- When converting pandas-on-Spark DataFrame to pandas DataFrame, and the data types are basically same as pandas. @@ -110,7 +110,7 @@ However, there are several data types only provided by pandas. .. code-block:: python - # pd.Catrgorical type is not supported in pandas APIs on Spark yet. + # pd.Catrgorical type is not supported in pandas API on Spark yet. >>> ks.Series([pd.Categorical([1, 2, 3])]) Traceback (most recent call last): ... @@ -118,14 +118,14 @@ However, there are several data types only provided by pandas. Categories (3, int64): [1, 2, 3] with type Categorical: did not recognize Python value type when inferring an Arrow data type -These kind of pandas specific data types below are not currently supported in pandas APIs on Spark but planned to be supported. +These kind of pandas specific data types below are not currently supported in pandas API on Spark but planned to be supported. * pd.Timedelta * pd.Categorical * pd.CategoricalDtype -The pandas specific data types below are not planned to be supported in pandas APIs on Spark yet. +The pandas specific data types below are not planned to be supported in pandas API on Spark yet. * pd.SparseDtype * pd.DatetimeTZDtype @@ -137,7 +137,7 @@ The pandas specific data types below are not planned to be supported in pandas A Internal type mapping --------------------- -The table below shows which NumPy data types are matched to which PySpark data types internally in pandas APIs on Spark. +The table below shows which NumPy data types are matched to which PySpark data types internally in pandas API on Spark. ============= ======================= NumPy PySpark @@ -162,7 +162,7 @@ np.ndarray ArrayType(StringType()) ============= ======================= -The table below shows which Python data types are matched to which PySpark data types internally in pandas APIs on Spark. +The table below shows which Python data types are matched to which PySpark data types internally in pandas API on Spark. ================= =================== Python PySpark @@ -177,7 +177,7 @@ datetime.date DateType decimal.Decimal DecimalType(38, 18) ================= =================== -For decimal type, pandas APIs on Spark use Spark's system default precision and scale. +For decimal type, pandas API on Spark uses Spark's system default precision and scale. You can check this mapping by using `as_spark_type` function. @@ -218,7 +218,7 @@ You can also check the underlying PySpark data type of `Series` or schema of `Da .. note:: - Pandas APIs on Spark currently do not support multiple types of data in single column. + Pandas API on Spark currently does not support multiple types of data in single column. .. code-block:: python diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py index de9219058a710..ceb5529b8a2e5 100644 --- a/python/pyspark/pandas/groupby.py +++ b/python/pyspark/pandas/groupby.py @@ -1005,7 +1005,7 @@ def apply(self, func, *args, **kwargs) -> Union[DataFrame, Series]: `_. .. 
note:: the dataframe within ``func`` is actually a pandas dataframe. Therefore, - any pandas APIs within this function is allowed. + any pandas API within this function is allowed. Parameters ---------- @@ -2048,7 +2048,7 @@ def transform(self, func, *args, **kwargs) -> Union[DataFrame, Series]: `_. .. note:: the series within ``func`` is actually a pandas series. Therefore, - any pandas APIs within this function is allowed. + any pandas API within this function is allowed. Parameters