
Commit cd51cf2

Move Design Principles and FAQ to the official documentation (#944)
The current README.md is too long. This PR adds only a simple example and installation instructions, and moves everything else to our official documentation; the links are kept in README.md.
1 parent a924df6 commit cd51cf2

7 files changed: +173 -167 lines changed

README.md (+22 -164)
@@ -25,7 +25,6 @@ This project is currently in beta and is rapidly evolving, with a bi-weekly rele
This hunk only removes a stray blank line before the badges. Context:

Try the Koalas 10 minutes tutorial on a live Jupyter notebook [here](https://mybinder.org/v2/gh/databricks/koalas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb). The initial launch can take up to several minutes.

[![Build Status](https://travis-ci.com/databricks/koalas.svg?token=Rzzgd1itxsPZRuhKGnhD&branch=master)](https://travis-ci.com/databricks/koalas)
[![codecov](https://codecov.io/gh/databricks/koalas/branch/master/graph/badge.svg)](https://codecov.io/gh/databricks/koalas)
[![Documentation Status](https://readthedocs.org/projects/koalas/badge/?version=latest)](https://koalas.readthedocs.io/en/latest/?badge=latest)
@@ -34,185 +33,42 @@ Try the Koalas 10 minutes tutorial on a live Jupyter notebook [here](https://myb
Unchanged context:

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/databricks/koalas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb)
Removed from README.md (all of this now lives in the official documentation):

## Table of Contents <!-- omit in toc -->

- [Dependencies](#dependencies)
- [Get Started](#get-started)
- [Documentation](#documentation)
- [Mailing List](#mailing-list)
- [Development Guide](#development-guide)
- [Design Principles](#design-principles)
  - [Be Pythonic](#be-pythonic)
  - [Unify small data (pandas) API and big data (Spark) API, but pandas first](#unify-small-data-pandas-api-and-big-data-spark-api-but-pandas-first)
  - [Return Koalas data structure for big data, and pandas data structure for small data](#return-koalas-data-structure-for-big-data-and-pandas-data-structure-for-small-data)
  - [Provide discoverable APIs for common data science tasks](#provide-discoverable-apis-for-common-data-science-tasks)
  - [Provide well documented APIs, with examples](#provide-well-documented-apis-with-examples)
  - [Guardrails to prevent users from shooting themselves in the foot](#guardrails-to-prevent-users-from-shooting-themselves-in-the-foot)
  - [Be a lean API layer and move fast](#be-a-lean-api-layer-and-move-fast)
  - [High test coverage](#high-test-coverage)
- [FAQ](#faq)
  - [What's the project's status?](#whats-the-projects-status)
  - [Is it Koalas or koalas?](#is-it-koalas-or-koalas)
  - [Should I use PySpark's DataFrame API or Koalas?](#should-i-use-pysparks-dataframe-api-or-koalas)
  - [How can I request support for a method?](#how-can-i-request-support-for-a-method)
  - [How is Koalas different from Dask?](#how-is-koalas-different-from-dask)
  - [How can I contribute to Koalas?](#how-can-i-contribute-to-koalas)
  - [Why a new project (instead of putting this in Apache Spark itself)?](#why-a-new-project-instead-of-putting-this-in-apache-spark-itself)
  - [How do I use this on Databricks?](#how-do-i-use-this-on-databricks)
## Dependencies

See [Dependencies](https://koalas.readthedocs.io/en/latest/getting_started/install.html#dependencies) in the installation guide.

## Get Started

See [Getting Started](https://koalas.readthedocs.io/en/latest/getting_started/index.html).

## Documentation

Project docs are published here: https://koalas.readthedocs.io

## Mailing List

We use Google Groups for the mailing list: https://groups.google.com/forum/#!forum/koalas-dev

## Development Guide

See [Contributing Guide](https://koalas.readthedocs.io/en/latest/development/contributing.html).
## Design Principles

This section outlines the design principles guiding the Koalas project.

### Be Pythonic

Koalas targets Python data scientists. We want to stick to the conventions users are already familiar with as much as possible. Here are some examples:

- Function names and parameters use snake_case, rather than CamelCase. This is different from PySpark's design. For example, Koalas has `to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we also provide Spark's variant as an alias.

- Koalas respects, to the largest extent possible, the conventions of the Python numerical ecosystem, and allows the use of NumPy types and other types that Spark can support.

- The style and infrastructure of the Koalas docs simply follow those of the rest of the PyData projects.
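For illustration, a minimal sketch of the naming difference in practice (not part of the original README; it assumes a default SparkSession, which Koalas creates on demand, and uses `to_spark()` as available in Koalas at the time of this commit):

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3]})

# Koalas follows the pandas/PEP 8 snake_case convention:
pdf = kdf.to_pandas()

# The equivalent round trip through PySpark uses camelCase:
sdf = kdf.to_spark()   # Koalas DataFrame -> Spark DataFrame
pdf2 = sdf.toPandas()  # Spark's camelCase variant of the same conversion
```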
### Unify small data (pandas) API and big data (Spark) API, but pandas first

The Koalas DataFrame is meant to provide the best of pandas and Spark under a single API, with easy and clear conversions between the two when necessary. When Spark and pandas have similar APIs with subtle differences, the principle is to honor the contract of the pandas API first.

There are different classes of functions:

1. Functions that are found in both Spark and pandas under the same name (`count`, `dtypes`, `head`). The return type is the same as in pandas (and not as in Spark).

2. Functions that are found in Spark but have a clear equivalent in pandas, e.g. `alias` and `rename`. These are implemented as aliases of the pandas functions, but should be marked as such. They are provided so that existing users of PySpark can get the benefits of Koalas without having to adapt their code.

3. Functions that are only found in pandas. When these functions are appropriate for distributed datasets, they should become available in Koalas.

4. Functions that are only found in Spark and are essential to controlling the distributed nature of the computations, e.g. `cache`. These functions should be available in Koalas (see the sketch below).

We are still debating whether data transformation functions only available in Spark should be added to Koalas, e.g. `select`. We would love to hear your feedback on that.
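As a rough illustration of classes 1 and 4 (a hedged sketch, not from the original README; it assumes `head` and `cache` as implemented in Koalas at the time of this commit):

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'x': [1, 2, 2]})

# Class 1: `head` exists in both APIs; Koalas honors the pandas contract
# and returns a DataFrame, not Spark's list of Row objects.
first_two = kdf.head(2)

# Class 4: `cache` is Spark-only but essential for controlling
# distributed execution, so Koalas exposes it too.
kdf.cache()
```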
### Return Koalas data structure for big data, and pandas data structure for small data

Developers often face the question of whether a particular function should return a Koalas DataFrame/Series or a pandas DataFrame/Series. The principle is: if the returned object can be large, use a Koalas DataFrame/Series; if the data is bound to be small, use a pandas DataFrame/Series. For example, `DataFrame.dtypes` returns a pandas Series, because the number of columns in a DataFrame is bounded and small, whereas `DataFrame.head()` and `Series.unique()` return a Koalas DataFrame and Series respectively, because the resulting objects can be large.
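A quick check of this principle (a sketch based directly on the return types described in the paragraph above):

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'b']})

print(type(kdf.dtypes))   # a pandas Series: column metadata is always small
print(type(kdf.head(2)))  # a Koalas DataFrame: the result can still be large
```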
### Provide discoverable APIs for common data science tasks

At the risk of overgeneralization, there are two API design approaches: the first focuses on providing APIs for common tasks; the second starts with abstractions and enables users to accomplish their tasks by composing primitives. While the world is not black and white, pandas takes more of the former approach, while Spark has taken more of the latter.

One example is value counts (counting by some key column), one of the most common operations in data science. pandas' `DataFrame.value_counts` returns the result in sorted order, which in 90% of the cases is what users prefer when exploring data, whereas Spark's does not sort, which is more desirable when building data pipelines, as users can accomplish the pandas behavior by adding an explicit `orderBy`.

Similar to pandas, Koalas should also lean more towards the former, providing discoverable APIs for common data science tasks. In most cases, this principle is well taken care of by simply implementing pandas' APIs. However, there will be circumstances in which pandas' APIs don't address a specific need, e.g. plotting for big data.
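For instance, a minimal sketch of the sorted, pandas-style behavior as surfaced by Koalas (`value_counts` on a Series; the output shape is assumed from the pandas contract):

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'y': ['a', 'b', 'b', 'b', 'a']})

# Sorted by count, as in pandas -- no explicit orderBy required.
print(kdf['y'].value_counts())
# b    3
# a    2
# Name: y, dtype: int64
```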
### Provide well documented APIs, with examples

All functions and parameters should be documented. Most functions should be documented with examples, because examples are easier to understand than a blob of text explaining what the function does.

A recommended way to add documentation is to start with the docstring of the corresponding function in PySpark or pandas, and adapt it for Koalas. If you are adding a new function, also add it to the API reference doc index page in the `docs/source/reference` directory. The examples in docstrings also improve our test coverage.
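A minimal sketch of the expected docstring shape (the method body and the numbers here are hypothetical; the `Examples` section doubles as a doctest, which is how docstring examples feed test coverage):

```python
def nlargest(self, n=5):
    """
    Return the top `n` rows ordered by the largest values.

    Parameters
    ----------
    n : int, default 5
        Number of rows to return.

    Examples
    --------
    >>> s = ks.Series([1, 3, 2])
    >>> s.nlargest(2)
    1    3
    2    2
    dtype: int64
    """
    ...
```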
### Guardrails to prevent users from shooting themselves in the foot

Certain operations in pandas are prohibitively expensive as data scales, and we don't want to give users the illusion that they can rely on such operations in Koalas. That is to say, methods implemented in Koalas should be safe to perform by default on large datasets. As a result, the following capabilities are not implemented in Koalas:

1. Capabilities that are fundamentally not parallelizable, e.g. imperatively looping over each element.
2. Capabilities that require materializing the entire working set in a single node's memory. This is why we do not implement [`pandas.DataFrame.values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html#pandas.DataFrame.values). Another example: the `_repr_html_` call caps the total number of records shown to a maximum of 1000, to prevent users from blowing up their driver node simply by typing the name of the DataFrame in a notebook.

A few exceptions, however, exist. One common pattern with "big data science" is that while the initial dataset is large, the working set becomes smaller as the analysis goes deeper. For example, data scientists often perform aggregation on datasets and then want to convert the aggregated dataset to some local data structure. To help them, we offer the following:

- [`DataFrame.to_pandas()`](https://koalas.readthedocs.io/en/stable/reference/api/databricks.koalas.DataFrame.to_pandas.html): returns a pandas DataFrame, Koalas only
- [`DataFrame.to_numpy()`](https://koalas.readthedocs.io/en/stable/reference/api/databricks.koalas.DataFrame.to_numpy.html): returns a numpy array, works with both pandas and Koalas

Note that it is clear from the names that these functions return some local data structure that requires materializing data in a single node's memory. For these functions, we also explicitly document them with a warning note that the resulting data structure must be small.
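A sketch of the aggregate-then-localize pattern these two methods are meant for (the column names are illustrative; `groupby`, `to_pandas`, and `to_numpy` are as described in the bullets above):

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'key': ['a', 'b', 'a'], 'value': [1, 2, 3]})

# Aggregate on the cluster first: the result has one row per key,
# so it is small enough to bring back to a single node.
agg = kdf.groupby('key').sum()

local_pdf = agg.to_pandas()  # pandas DataFrame on the driver
local_arr = agg.to_numpy()   # numpy array on the driver
```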
### Be a lean API layer and move fast

Koalas is designed as an API overlay layer on top of Spark. The project should be lightweight, and most functions should be implemented as wrappers around Spark or pandas functions. Koalas does not accept heavyweight implementations, e.g. execution engine changes.

This approach enables us to move fast. For the foreseeable future, we aim to make weekly releases. If we find a critical bug, we will make a new release as soon as the bug fix is available.

### High test coverage

Koalas should be well tested. The project keeps test coverage above 90% across the entire codebase, and close to 100% for critical parts. Pull requests will not be accepted unless they have close to 100% statement coverage in the codecov report.

Added to README.md:

## Getting Started

Conda is the recommended way to install Koalas, though several other installation methods are supported. See [Installation](https://koalas.readthedocs.io/en/latest/getting_started/install.html) for full instructions.

```bash
conda install koalas -c conda-forge
```

Now you can turn a pandas DataFrame into a Koalas DataFrame that is API-compliant with the former:

```python
import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame({'x': range(3), 'y': ['a', 'b', 'b'], 'z': ['a', 'b', 'b']})

# Create a Koalas DataFrame from a pandas DataFrame
df = ks.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']

# Do some operations in place:
df['x2'] = df.x * df.x
```

For more details, see [Getting Started](https://koalas.readthedocs.io/en/latest/getting_started/index.html) and [Dependencies](https://koalas.readthedocs.io/en/latest/getting_started/install.html#dependencies) in the official documentation.

## Contributing Guide

See [Contributing Guide](https://koalas.readthedocs.io/en/latest/development/contributing.html) and [Design Principles](https://koalas.readthedocs.io/en/latest/development/design.html) in the official documentation.
Unchanged context:

## FAQ
Removed from README.md (the FAQ now lives in the official documentation):

### What's the project's status?

This project is currently in beta and is rapidly evolving. We plan to do weekly releases at this stage. You should expect the following differences:

- Some functions may be missing. Please create a GitHub issue if your favorite function is not yet supported. We also document all the functions that are not yet supported in the [missing directory](https://github.com/databricks/koalas/tree/master/databricks/koalas/missing).

- Some behavior may be different, in particular in the treatment of nulls: pandas uses the special constant Not a Number (NaN) to indicate missing values, while Spark marks each value with a special flag to indicate that it is missing. We would love to hear from you if you come across any discrepancies.

- Because Spark is lazy in nature, some operations, such as creating new columns, only get performed when Spark needs to print or write the dataframe (see the sketch below).
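For the laziness point in the last bullet, a minimal sketch (the column-assignment syntax matches the Getting Started example added by this commit):

```python
import databricks.koalas as ks

kdf = ks.DataFrame({'x': [1, 2, 3]})

# Nothing is computed yet: the new column is only a lazy Spark expression.
kdf['y'] = kdf.x + 1

# Printing forces Spark to execute the plan and materialize the rows shown.
print(kdf.head())
```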
### Is it Koalas or koalas?

It's Koalas. Unlike pandas, we use upper case here.

### Should I use PySpark's DataFrame API or Koalas?

If you are already familiar with pandas and want to leverage Spark for big data, we recommend using Koalas. If you are learning Spark from the ground up, we recommend you start with PySpark's API.

### How can I request support for a method?

File a GitHub issue: https://github.com/databricks/koalas/issues

Databricks customers are also welcome to file a support ticket to request a new feature.

### How is Koalas different from Dask?

Different projects have different focuses. Spark is already deployed in virtually every organization, and is often the primary interface to the massive amount of data stored in data lakes. Koalas was inspired by Dask, and aims to make the transition from pandas to Spark easy for data scientists.

### How can I contribute to Koalas?

See [Contributing Guide](https://koalas.readthedocs.io/en/latest/development/contributing.html).

### Why a new project (instead of putting this in Apache Spark itself)?

Two reasons:

1. We want a venue in which we can rapidly iterate and make new releases. The overhead of making a release as a separate project is minuscule (on the order of minutes). A release of Spark takes a lot longer (on the order of days).

2. Koalas takes a different approach that might contradict Spark's API design principles, and those principles cannot be changed lightly given the large user base of Spark. A new, separate project provides an opportunity for us to experiment with new design principles.
Unchanged context:

### How do I use this on Databricks?

Koalas requires Databricks Runtime 5.x or above. For the regular Databricks Runtime, you can install Koalas using the Libraries tab on the cluster UI, or using dbutils in a notebook:
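The notebook snippet itself sits outside the changed hunk (only its last line appears as context in the hunk header below), but based on that context line it is presumably the standard Databricks Runtime pattern:

```python
# Install Koalas from PyPI into the notebook's library scope, then
# restart the Python process so the new package is importable.
dbutils.library.installPyPI("koalas")
dbutils.library.restartPython()
```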
@@ -224,3 +80,5 @@ dbutils.library.restartPython()
Unchanged context:

In the future, we will package Koalas out-of-the-box in both the regular Databricks Runtime and Databricks Runtime for Machine Learning.

Added to README.md:

See also the full list of frequently asked questions at [FAQ](https://koalas.readthedocs.io/en/latest/user_guide/faq.html) in the official documentation.

docs/source/development/contributing.rst (+1 -1)
@@ -25,7 +25,7 @@ The largest amount of work consists simply of implementing the pandas API using
Unchanged context:

Step-by-step Guide For Code Contributions
=========================================

Removed:

1. Read and understand the `Design Principles <https://github.com/databricks/koalas/blob/master/README.md#design-principles>`_ for the project. Contributions should follow these principles.

Added:

1. Read and understand the `Design Principles <design.rst>`_ for the project. Contributions should follow these principles.

Unchanged context:

2. Signaling your work: If you are working on something, comment on the relevant ticket that you are doing so to avoid multiple people taking on the same work at the same time. It is also a good practice to signal that your work has stalled or you have moved on and want somebody else to take over.