Comparing changes

base repository: databricks/koalas, base: v1.5.0
head repository: databricks/koalas, compare: master

Commits on Dec 14, 2020

  1. Use OpenJDK instead of OracleJDK in Binder (#1969)

    This PR proposes to use OpenJDK instead. The current Oracle JDK download with `wget` is broken:
    
        ```
        --2020-12-14 08:16:14--  http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
        Resolving download.oracle.com (download.oracle.com)... 184.50.116.99
        Connecting to download.oracle.com (download.oracle.com)|184.50.116.99|:80... connected.
        HTTP request sent, awaiting response... 302 Moved Temporarily
        Location: https://edelivery.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz [following]
        --2020-12-14 08:16:14--  https://edelivery.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
        Resolving edelivery.oracle.com (edelivery.oracle.com)... 2.17.191.76, 2a02:26f0:1700:58b::366, 2a02:26f0:1700:591::366
        Connecting to edelivery.oracle.com (edelivery.oracle.com)|2.17.191.76|:443... connected.
        HTTP request sent, awaiting response... 403 Forbidden
        2020-12-14 08:20:14 ERROR 403: Forbidden.
    
        tar (child): jdk-8u131-linux-x64.tar.gz: Cannot open: No such file or directory
        tar (child): Error is not recoverable: exiting now
        tar: Child returned status 2
        tar: Error is not recoverable: exiting now
        ./postBuild: line 9: cd: jdk1.8.0_131: No such file or directory
        ./postBuild: line 11: cd: bin: No such file or directory
        ```
    
    This was tested in https://mybinder.org/v2/gh/hyukjinkwon/koalas/fix-binder?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb
    
    Apache Spark uses the same approach as well.
    HyukjinKwon authored Dec 14, 2020 (commit d1babcc)

Commits on Dec 16, 2020

  1. Fix DataFrame.replace with NaN/None values #1907 (#1962)

    This PR closes #1907.
    LucasG0 authored Dec 16, 2020 (commit fab671c)
  2. Fix stat functions with no numeric columns. (#1967)

    Some statistical functions fail if there are no numeric columns.
    
    ```py
    >>> kdf = ks.DataFrame({"A": pd.date_range("2020-01-01", periods=3), "B": pd.date_range("2021-01-01", periods=3)})
    >>> kdf.mean()
    Traceback (most recent call last):
    ...
    ValueError: Current DataFrame has more then the given limit 1 rows. Please set 'compute.max_rows' by using 'databricks.koalas.config.set_option' to retrieve to retrieve more than 1 rows. Note that, before changing the 'compute.max_rows', this operation is considerably expensive.
    ```
    
    The functions which allow non-numeric columns by default are:
    
    - `count`
    - `min`
    - `max`
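
    For reference, pandas drops such non-numeric ("nuisance") columns in these reductions and returns an empty result instead of raising. A minimal pandas comparison (pandas 1.x behavior, shown only for context):

    ```py
    >>> import pandas as pd
    >>> pdf = pd.DataFrame({"A": pd.date_range("2020-01-01", periods=3)})
    >>> pdf.mean()
    Series([], dtype: float64)
    ```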
    ueshin authored Dec 16, 2020 (commit bd73c30)
  3. Simplify plot backend support (#1970)

    This PR proposes to simplify the plot implementation. The current Koalas implementation attempts to map arguments between plotting backends (e.g., matplotlib vs. plotly). Keeping this map is a huge maintenance cost, and it is unrealistic to track backend changes and keep updating it.

    pandas itself does not keep such a map either:
    
    ```python
    >>> import pandas as pd
    >>> pd.DataFrame([1,2,3]).plot.line(logx=1)
    <AxesSubplot:>
    >>> pd.options.plotting.backend = "plotly"
    >>> pd.DataFrame([1,2,3]).plot.line(logx=1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/.../opt/miniconda3/envs/python3.8/lib/python3.8/site-packages/pandas/plotting/_core.py", line 1017, in line
        return self(kind="line", x=x, y=y, **kwargs)
      File "/.../opt/miniconda3/envs/python3.8/lib/python3.8/site-packages/pandas/plotting/_core.py", line 879, in __call__
        return plot_backend.plot(self._parent, x=x, y=y, kind=kind, **kwargs)
      File "/.../miniconda3/envs/python3.8/lib/python3.8/site-packages/plotly/__init__.py", line 102, in plot
        return line(data_frame, **kwargs)
    TypeError: line() got an unexpected keyword argument 'logx'
    ```
    HyukjinKwon authored Dec 16, 2020 (commit 37f7e50)

Commits on Dec 18, 2020

  1. Implement (DataFrame|Series).plot.pie in plotly (#1971)

    This PR implements `DataFrame.plot.pie` in plotly as below:
    
    ```python
    from databricks import koalas as ks
    kdf = ks.DataFrame(
        {'a': [1, 2, 3, 4, 5, 6],
         'b': [100, 200, 300, 400, 500, 600]},
        index=[10, 20, 30, 40, 50, 60])
    ks.options.plotting.backend = 'plotly'
    kdf.plot.pie(y="b")
    ```
    
    ![Screen Shot 2020-12-17 at 3 28 12 PM](https://user-images.githubusercontent.com/6477701/102451779-87005380-407c-11eb-85f3-aa2d8e62c991.png)
    
    Binder to test: https://mybinder.org/v2/gh/HyukjinKwon/koalas/plotly-pie?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb
    HyukjinKwon authored Dec 18, 2020 (commit b81afcc)

Commits on Dec 21, 2020

  1. Refine Frame._reduce_for_stat_function. (#1975)

    Refines `DataFrame/Series._reduce_for_stat_function` to avoid special-casing individual functions.
    
    Also:
    - Consolidates the implementations of `count` and supports the `numeric_only` parameter.
    - Adds argument type annotations.
    ueshin authored Dec 21, 2020 (commit c973195)

Commits on Dec 22, 2020

  1. Add min_count parameter for Frame.sum. (#1978)

    Adds a `min_count` parameter to `Frame.sum`.
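
    For context, `min_count` follows the pandas semantics: if fewer than `min_count` valid (non-null) values are present, the result is NaN. A minimal pandas illustration (not output from this PR):

    ```python
    >>> import pandas as pd
    >>> pd.Series([1.0, None]).sum(min_count=2)
    nan
    >>> pd.Series([1.0, None]).sum(min_count=1)
    1.0
    ```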
    ueshin authored Dec 22, 2020 (commit dd9661f)

Commits on Dec 23, 2020

  1. Fix cumsum and cumprod. (#1982)

    Fixes `DataFrame/Series/GroupBy.cumsum` and `cumprod`.
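
    For reference, the pandas semantics these functions target (illustrative pandas output, not taken from this PR):

    ```python
    >>> import pandas as pd
    >>> pd.Series([1, 2, 3]).cumsum()
    0    1
    1    3
    2    6
    dtype: int64
    >>> pd.Series([1, 2, 3]).cumprod()
    0    1
    1    2
    2    6
    dtype: int64
    ```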
    ueshin authored Dec 23, 2020 (commit f9465aa)
  2. Fix Frame.abs to support bool type and disallow non-numeric types. (#1980)

    Fixes `Frame.abs` to support bool type and disallow non-numeric types.
    ueshin authored Dec 23, 2020 (commit 4c86f3c)
  3. Refine DataFrame/Series.product. (#1979)

    Refines `DataFrame/Series.product` to:
    
    - Consolidate and reuse `_reduce_for_stat_function`.
    - Support `axis`, `numeric_only`, and `min_count` parameters.
    - Correctly compute products over values that include negatives or zeros (see the sketch below).
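
    Since Spark has no built-in product aggregation, a common way to compute one is via log-sums, tracking zeros and the sign separately. A minimal Python sketch of that technique (an illustration of the idea only, not necessarily the actual Koalas implementation):

    ```python
    import math

    def product(values):
        # Compute the product as exp(sum(log|x|)), handling zeros and the
        # overall sign separately so negatives and zeros are supported.
        if any(v == 0 for v in values):
            return 0.0
        sign = -1.0 if sum(v < 0 for v in values) % 2 else 1.0
        return sign * math.exp(sum(math.log(abs(v)) for v in values))
    ```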
    ueshin authored Dec 23, 2020 (commit ccc637b)
  4. Fix the build error. (#1984)

    The build failed due to conflicts between recent PRs.
    
    ```
    flake8 checks failed:
    ./databricks/koalas/series.py:5741:45: F821 undefined name 'BooleanType'
            if isinstance(kser.spark.data_type, BooleanType):
                                                ^
    ./databricks/koalas/series.py:5750:45: F821 undefined name 'BooleanType'
            if isinstance(self.spark.data_type, BooleanType):
                                                ^
    ```
    ueshin authored Dec 23, 2020 (commit 0d3d216)
  5. Refine DataFrame/Series.quantile. (#1977)

    Refines `DataFrame/Series.quantile` to:
    
    - Reuse `_reduce_for_stat_function` when `q` is `float`.
    - Consolidate the logic when `q` is `Iterable`.
    
    Also supports `numeric_only` for `DataFrame`.
    ueshin authored Dec 23, 2020 (commit 5c44ecc)

Commits on Dec 24, 2020

  1. Support ddof parameter for std and var. (#1986)

    Supports the `ddof` parameter for `Frame.std` and `Frame.var`.
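
    For context, `ddof` (delta degrees of freedom) changes the divisor of the variance to `N - ddof`: `ddof=1` gives the sample variance and `ddof=0` the population variance. A minimal sketch of the formula:

    ```python
    def variance(values, ddof=1):
        # The divisor is N - ddof: ddof=1 -> sample, ddof=0 -> population.
        n = len(values)
        mean = sum(values) / n
        return sum((v - mean) ** 2 for v in values) / (n - ddof)
    ```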
    ueshin authored Dec 24, 2020 (commit 5f27857)

Commits on Dec 26, 2020

  1. Use Python type name instead of Spark's in error messages. (#1985)

    Addresses #1980 (comment) by adding pandas dtypes.
    ueshin authored Dec 26, 2020 (commit 1e51477)

Commits on Dec 28, 2020

  1. Fix wrong condition for almostequals (#1988)

    Fixed several wrong conditions in `if` statements for `assertPandasAlmostEqual`.
    itholic authored Dec 28, 2020 (commit 6796aa4)

Commits on Dec 29, 2020

  1. Support setattr for DataFrame. (#1989)

    Supports setting attributes on a DataFrame.
    
    ```py
    >>> kdf = ks.DataFrame({'A': [1, 2, 3, None]})
    >>> kdf.A = kdf.A.fillna(kdf.A.median())
    >>> kdf
         A
    0  1.0
    1  2.0
    2  3.0
    3  2.0
    ```
    ueshin authored Dec 29, 2020 (commit 9a9c178)

Commits on Dec 30, 2020

  1. Use object.__setattr__ in Series. (#1991)

    This is a follow-up of #1989.
    There were some more places where attributes were being set through the overridden `DataFrame.__setattr__`.
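
    A generic sketch of the pattern (not the exact Koalas code; `_set_column` is a hypothetical helper): once `__setattr__` is overridden to route attribute assignment to column assignment, internal bookkeeping attributes must be set with `object.__setattr__` to bypass the override.

    ```python
    class Frame:
        def __setattr__(self, name, value):
            # Attribute assignment is routed to column assignment,
            # as DataFrame.__setattr__ does after #1989.
            self._set_column(name, value)  # hypothetical helper

    frame = Frame.__new__(Frame)
    # Internal state must bypass the override, or it would be
    # treated as a column assignment:
    object.__setattr__(frame, "_internal_state", {})
    ```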
    ueshin authored Dec 30, 2020 (commit 0e44bc7)

Commits on Jan 5, 2021

  1. Add note about missing mixed type support to docs (#1990)

    Added a note to the documentation about the missing support for mixed types.
    
    ![Screen Shot 2021-01-05 at 9 36 39 AM](https://user-images.githubusercontent.com/44108233/103610184-d7bfe980-4f62-11eb-9cbb-744623bcbc4d.png)
    
    
    Resolves #1981
    itholic authored Jan 5, 2021 (commit ae5c8d8)

Commits on Jan 6, 2021

  1. Implemented sem() for Series and DataFrame (#1993)

    This PR proposes `Series.sem()` and `DataFrame.sem()`:
    
    ```python
    >>> kdf = ks.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    >>> kdf
       a  b
    0  1  4
    1  2  5
    2  3  6
    
    >>> kdf.sem()
    a    0.57735
    b    0.57735
    dtype: float64
    
    >>> kdf.sem(ddof=0)
    a    0.471405
    b    0.471405
    dtype: float64
    
    >>> kdf.sem(axis=1)
    0    1.5
    1    1.5
    2    1.5
    dtype: float64
    
    # Support for Series
    
    >>> kser = kdf.a
    >>> kser
    0    1
    1    2
    2    3
    Name: a, dtype: int64
    
    >>> kser.sem()
    0.5773502691896258
    
    >>> kser.sem(ddof=0)
    0.47140452079103173
    ```
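
    For reference, `sem` is the standard deviation divided by √N; here `std([1, 2, 3], ddof=1)` is 1.0 and N is 3, hence:

    ```python
    >>> import math
    >>> 1.0 / math.sqrt(3)
    0.5773502691896258
    ```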
    itholic authored Jan 6, 2021 (commit 29deaf1)

Commits on Jan 7, 2021

  1. Added ddof parameter for GroupBy.std() and GroupBy.var() (#1994)

    Added missing parameter `ddof` for `GroupBy.std()` and `GroupBy.var()`.
    
    ```python
    >>> kdf = ks.DataFrame(
    ...     {
    ...         "a": [1, 2, 6, 4, 4, 6, 4, 3, 7],
    ...         "b": [4, 2, 7, 3, 3, 1, 1, 1, 2],
    ...         "c": [4, 2, 7, 3, None, 1, 1, 1, 2],
    ...         "d": list("abcdefght"),
    ...     },
    ...     index=[0, 1, 3, 5, 6, 8, 9, 9, 9],
    ... )
    >>> kdf
       a  b    c  d
    0  1  4  4.0  a
    1  2  2  2.0  b
    3  6  7  7.0  c
    5  4  3  3.0  d
    6  4  3  NaN  e
    8  6  1  1.0  f
    9  4  1  1.0  g
    9  3  1  1.0  h
    9  7  2  2.0  t
    
    # std
    >>> kdf.groupby("a").std(ddof=1)
              b         c
    a
    7       NaN       NaN
    6  4.242641  4.242641
    1       NaN       NaN
    3       NaN       NaN
    2       NaN       NaN
    4  1.154701  1.414214
    
    >>> kdf.groupby("a").std(ddof=0)
              b    c
    a
    7  0.000000  0.0
    6  3.000000  3.0
    1  0.000000  0.0
    3  0.000000  0.0
    2  0.000000  0.0
    4  0.942809  1.0
    
    # var
    >>> kdf.groupby("a").var(ddof=1)
               b     c
    a
    7        NaN   NaN
    6  18.000000  18.0
    1        NaN   NaN
    3        NaN   NaN
    2        NaN   NaN
    4   1.333333   2.0
    
    >>> kdf.groupby("a").var(ddof=0)
              b    c
    a
    7  0.000000  0.0
    6  9.000000  9.0
    1  0.000000  0.0
    3  0.000000  0.0
    2  0.000000  0.0
    4  0.888889  1.0
    ```
    itholic authored Jan 7, 2021 (commit f7afe12)
  2. Adjust Series.mode to match pandas Series.mode (#1995)

    Currently, Series.mode preserves the name of the Series in the result, whereas pandas Series.mode doesn't:
    ```
    >>> kser1
    x    1
    y    2
    Name: z, dtype: int64
    >>> kser1.mode()
    0    1
    1    2
    Name: z, dtype: int64  # name preserved
    >>> pser1 = kser1.to_pandas()
    >>> pser1.mode()
    0    1
    1    2
    dtype: int64  # name not preserved
    ```
    
    In addition, unit tests are added.
    xinrong-meng authored Jan 7, 2021 (commit ddbdc9a)

Commits on Jan 10, 2021

  1. Commit 66bda08

Commits on Jan 12, 2021

  1. Optimize histogram calculation as a single pass (#1997)

    This PR optimizes the histogram plot in Koalas by unioning the transformed results so that the computation happens in a single pass.

    Previously, when `DataFrame.plot.hist` was called, each column triggered a separate job.
    Now it is done in a single pass, even for `DataFrame`s.

    I also verified that the results are still the same:
    
    ![Screen Shot 2021-01-08 at 1 15 47 PM](https://user-images.githubusercontent.com/6477701/103973784-a68a2800-51b3-11eb-86ba-90141346434d.png)
    
    ![Screen Shot 2021-01-08 at 1 16 19 PM](https://user-images.githubusercontent.com/6477701/103973813-b99cf800-51b3-11eb-8dc9-4d6a6cc26e3c.png)
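
    A hedged sketch of the single-pass idea in PySpark (illustrative only, not the actual Koalas code): tag each column's values with a group id, union them, and compute all the counts with one aggregation.

    ```python
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame([(1, 10), (2, 20), (1, 20)], ["a", "b"])

    # Tag each column's values with a group id and union them so a single
    # aggregation covers every column (bucketing values into histogram
    # bins is omitted here for brevity).
    unioned = None
    for i, col in enumerate(sdf.columns):
        tagged = sdf.select(F.lit(i).alias("__group_id"), F.col(col).alias("__value"))
        unioned = tagged if unioned is None else unioned.union(tagged)

    counts = unioned.groupBy("__group_id", "__value").count()
    ```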
    HyukjinKwon authored Jan 12, 2021 (commit f5f88bd)
  2. Refactor and extract hist calculation logic from matplotlib (#1998)

    This PR extracts the histogram calculation logic from `matplotlib.py` to `core.py`.
    It depends on #1997.
    HyukjinKwon authored Jan 12, 2021 (commit ea6ad98)
  3. Support operations between Series and Index. (#1996)

    Supports operations between `Series` and `Index`.
    
    ```py
    >>> kser = ks.Series([1, 2, 3, 4, 5, 6, 7])
    >>> kidx = ks.Index([0, 1, 2, 3, 4, 5, 6])
    
    >>> (kser + 1 + 10 * kidx).sort_index()
    0     2
    1    13
    2    24
    3    35
    4    46
    5    57
    6    68
    dtype: int64
    >>> (kidx + 1 + 10 * kser).sort_index()
    0    11
    1    22
    2    33
    3    44
    4    55
    5    66
    6    77
    dtype: int64
    ```
    ueshin authored Jan 12, 2021 (commit 8d4157d)

Commits on Jan 13, 2021

  1. Implement (DataFrame|Series).plot.hist in plotly (#1999)

    This PR implements `(DataFrame|Series).plot.hist` in plotly:
    
    This can be tested via: https://mybinder.org/v2/gh/HyukjinKwon/koalas/plotly-histogram?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb
    
    Example:
    
    ```python
    # Koalas
    import databricks.koalas as ks
    ks.options.plotting.backend = "plotly"
    kdf = ks.DataFrame({
        'a c': [1, 2, 3, 4, 5, 6, 7, 8, 9, 15, 50],
        'b': [2, 3, 4, 5, 7, 9, 10, 15, 10, 20, 20]
    })
    (kdf + 100).plot.hist()
    
    # pandas
    import pandas as pd
    pd.options.plotting.backend = "plotly"
    pdf = pd.DataFrame({
        'a c': [1, 2, 3, 4, 5, 6, 7, 8, 9, 15, 50],
        'b': [2, 3, 4, 5, 7, 9, 10, 15, 10, 20, 20]
    })
    (pdf + 100).plot.hist()
    ```
    
    ![Screen Shot 2021-01-12 at 10 12 47 PM](https://user-images.githubusercontent.com/6477701/104318885-644e4700-5523-11eb-81bc-f56ea1dbe797.png)
    
    ![Screen Shot 2021-01-12 at 10 12 52 PM](https://user-images.githubusercontent.com/6477701/104318888-657f7400-5523-11eb-8d06-da40206b4d01.png)
    
    Note that the output is a bit different because:
    - We use Spark for the histogram calculation, which differs slightly from pandas'.
    - Plotly's histogram plot cannot be used directly in our case; we work around it with bar charts, since we do not use plotly to compute the histogram (see https://plotly.com/python/histograms/).
    HyukjinKwon authored Jan 13, 2021 (commit 3ce2d87)
  2. Adjust data when all the values in a column are nulls. (#2004)

    For Spark < 3.0, when all the values in a column are null, every value will be `None` regardless of the column's data type.
    
    ```py
    >>> pdf = pd.DataFrame(
    ...             {
    ...                 "a": [None, None, None, "a"],
    ...                 "b": [None, None, None, 1],
    ...                 "c": [None, None, None] + list(np.arange(1, 2).astype("i1")),
    ...                 "d": [None, None, None, 1.0],
    ...                 "e": [None, None, None, True],
    ...                 "f": [None, None, None] + list(pd.date_range("20130101", periods=1)),
    ...             },
    ...         )
    >>>
    >>> kdf = ks.from_pandas(pdf)
    >>> kdf.iloc[:-1]
          a     b     c     d     e     f
    0  None  None  None  None  None  None
    1  None  None  None  None  None  None
    2  None  None  None  None  None  None
    ```
    
    whereas for pandas:
    
    ```py
    >>> pdf.iloc[:-1]
          a   b   c   d     e   f
    0  None NaN NaN NaN  None NaT
    1  None NaN NaN NaN  None NaT
    2  None NaN NaN NaN  None NaT
    ```
    
    With Spark >= 3.0 it seems fine:
    
    ```py
    >>> kdf.iloc[:-1]
          a   b   c   d     e   f
    0  None NaN NaN NaN  None NaT
    1  None NaN NaN NaN  None NaT
    2  None NaN NaN NaN  None NaT
    ```
    ueshin authored Jan 13, 2021 (commit 3cde582)
  3. Implement Series.factorize() (#1972)

    ref #1929
    ```
    >>> kser = ks.Series(['b', None, 'a', 'c', 'b'])
    >>> codes, uniques = kser.factorize()
    >>> codes
    0    1
    1   -1
    2    0
    3    2
    4    1
    dtype: int64
    >>> uniques
    Index(['a', 'b', 'c'], dtype='object')

    >>> codes, uniques = kser.factorize(na_sentinel=None)
    >>> codes
    0    1
    1    3
    2    0
    3    2
    4    1
    dtype: int64
    >>> uniques
    Index(['a', 'b', 'c', None], dtype='object')

    >>> codes, uniques = kser.factorize(na_sentinel=-2)
    >>> codes
    0    1
    1   -2
    2    0
    3    2
    4    1
    dtype: int64
    >>> uniques
    Index(['a', 'b', 'c'], dtype='object')
    ```
    xinrong-meng authored Jan 13, 2021 (commit ce2d260)

Commits on Jan 14, 2021

  1. Refactor to use one similar logic to call plot backends (#2005)

    This PR proposes:
    - Remove `koalas_plotting_backends`. We don't currently have a mechanism like pandas-dev/pandas@e9a60bb.
    - Load and use plotting backends in the same way:
      - If a plot module has `plot`, we use it after converting the Koalas instance to pandas.
      - If a plot module has `plot_koalas` (Koalas' `matplotlib` and `plotly` modules, for example), we just pass Koalas instances to it.
    
    Now `databricks.koalas.plot.plotly` and `databricks.koalas.plot.matplotlib` modules work like external plotting backends.
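
    A minimal sketch of that dispatch (function and attribute names other than `plot_koalas` and `plot` are assumptions for illustration):

    ```python
    import importlib

    def call_plot_backend(data, backend_name, **kwargs):
        module = importlib.import_module(backend_name)
        if hasattr(module, "plot_koalas"):
            # The backend understands Koalas objects directly.
            return module.plot_koalas(data, **kwargs)
        # A generic pandas plotting backend: convert first.
        return module.plot(data.to_pandas(), **kwargs)
    ```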
    HyukjinKwon authored Jan 14, 2021 (commit e77ee69)
  2. Extract box computing logic from matplotlib (#2006)

    This PR moves the core computation logic from the matplotlib module to the core module in plot.
    HyukjinKwon authored Jan 14, 2021 (commit 690a4f2)

Commits on Jan 15, 2021

  1. Fix build error. (#2008)

    Set the upper bound for `nbformat`.
    ueshin authored Jan 15, 2021 (commit d0f4ad2)
  2. Implement Series.plot.box (#2007)

    This PR implements Series.plot.box with plotly. Note that DataFrame.plot.box is not supported in Koalas yet.
    This can be tested via the link: https://mybinder.org/v2/gh/HyukjinKwon/koalas/plot-box-ser?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb
    
    Note that you should manually install plotly to test with mybinder above:
    
    ```
    %%bash
    pip install plotly
    ```
    
    Example:
    
    ```python
    # Koalas
    from databricks import koalas as ks
    ks.options.plotting.backend = "plotly"
    kdf = ks.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 15, 50],}, index=[0, 1, 3, 5, 6, 8, 9, 9, 9, 10, 10])
    kdf.a.plot.box()
    
    # pandas
    import pandas as pd
    pd.options.plotting.backend = "plotly"
    pdf = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 15, 50],}, index=[0, 1, 3, 5, 6, 8, 9, 9, 9, 10, 10])
    pdf.a.plot.box()
    ```
    
    ![Screen Shot 2021-01-14 at 6 56 19 PM](https://user-images.githubusercontent.com/6477701/104575700-acdc4080-569a-11eb-8d55-0ac3db800ddd.png)
    ![Screen Shot 2021-01-14 at 6 56 24 PM](https://user-images.githubusercontent.com/6477701/104575705-ad74d700-569a-11eb-9b7c-a37e04f77ec7.png)
    
    For the same reason as #1999, the output is slightly different from pandas'.
    I referred to "Box Plot With Precomputed Quartiles" in https://plotly.com/python/box-plots/.
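
    A minimal standalone sketch of that approach, with illustrative precomputed statistics (not values computed by Koalas):

    ```python
    import plotly.graph_objects as go

    fig = go.Figure()
    # Build the box from precomputed statistics instead of raw data.
    fig.add_trace(go.Box(
        name="a",
        q1=[3.5], median=[6.0], q3=[9.5],
        lowerfence=[1.0], upperfence=[15.0],
        boxpoints=False,
    ))
    fig.show()
    ```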
    HyukjinKwon authored Jan 15, 2021 (commit 26e5a2d)

Commits on Jan 17, 2021

  1. Extract kde computing logic from matplotlib (#2010)

    This PR moves the core KDE computation logic from the matplotlib module to the core plot module.
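
    For context, the computation being moved evaluates a Gaussian kernel density estimate; a generic sketch of that math (an illustration of the technique, not the Koalas code):

    ```python
    import math

    def gaussian_kde(sample, points, bandwidth):
        # density(x) = (1 / (n * h * sqrt(2*pi))) * sum_s exp(-(x - s)^2 / (2*h^2))
        n = len(sample)
        norm = bandwidth * math.sqrt(2 * math.pi)
        return [
            sum(math.exp(-((x - s) ** 2) / (2 * bandwidth ** 2)) for s in sample)
            / (n * norm)
            for x in points
        ]
    ```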
    HyukjinKwon authored Jan 17, 2021 (commit 74b4892)

Commits on Jan 19, 2021

  1. Fix as_spark_type to not support "bigint". (#2011)

    Fix `as_spark_type` to not support "bigint".
    
    The string "bigint" is not recognizable by `np.dtype` and it causes an unexpected error:
    
    ```py
    >>> import numpy as np
    >>> from databricks.koalas.typedef import as_spark_type
    >>> as_spark_type(np.dtype("datetime64[ns]"))
    Traceback (most recent call last):
    ...
    TypeError: data type "bigint" not understood
    ```
    
    Also, it doesn't work in pandas:
    
    ```py
    >>> pd.Series([1, 2, 3], dtype="bigint")
    Traceback (most recent call last):
    ...
    TypeError: data type "bigint" not understood
    ```
    ueshin authored Jan 19, 2021 (commit 0e39097)

Commits on Jan 20, 2021

  1. Reuse as_spark_type in infer_pd_series_spark_type. (#2012)

    Now that `as_spark_type` is good enough for Koalas, we should reuse it in `infer_pd_series_spark_type` to avoid inconsistency.
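
    A simplified sketch of the consolidation (the real `infer_pd_series_spark_type` does more, such as inferring types for object dtypes; `pser` stands for an assumed pandas Series):

    ```python
    from databricks.koalas.typedef import as_spark_type

    def infer_pd_series_spark_type(pser):
        # Delegate the dtype-to-Spark-type mapping to as_spark_type
        # instead of duplicating the logic.
        return as_spark_type(pser.dtype)
    ```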
    ueshin authored Jan 20, 2021 (commit c38c96f)
  2. Implement DataFrame.insert (#1983)

    ref #1929
    
    Insert column into DataFrame at a specified location.
    
    ```
    >>> kdf = ks.DataFrame([1, 2, 3])
    >>> kdf.insert(0, 'x', 4)
    >>> kdf.sort_index()
       x  0
    0  4  1
    1  4  2
    2  4  3

    >>> from databricks.koalas.config import set_option, reset_option
    >>> set_option("compute.ops_on_diff_frames", True)

    >>> kdf.insert(1, 'y', [5, 6, 7])
    >>> kdf.sort_index()
       x  y  0
    0  4  5  1
    1  4  6  2
    2  4  7  3

    >>> kdf.insert(2, 'z', ks.Series([8, 9, 10]))
    >>> kdf.sort_index()
       x  y   z  0
    0  4  5   8  1
    1  4  6   9  2
    2  4  7  10  3

    >>> reset_option("compute.ops_on_diff_frames")
    ```
    xinrong-meng authored Jan 20, 2021 (commit 8803344)

Commits on Jan 22, 2021

  1. Set upperbound for pandas 1.2.0 (#2016)

    Sets an upper bound to exclude pandas 1.2.0 until we fully support it.

    Refer to #1987.
    itholic authored Jan 22, 2021 (commit 1b87f30)
  2. Bump up version to 1.6.0

    HyukjinKwon committed Jan 22, 2021 (commit 9232f29)

Commits on Jan 26, 2021

  1. Commit 118384a

Commits on Jan 27, 2021

  1. Commit 885fdfc

Commits on Jan 28, 2021

  1. Implemented ks.read_orc (#2017)

    This PR proposes `ks.read_orc` to support creating a `DataFrame` from ORC files.
    
    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_orc.html#pandas.read_orc
    
    ```python
    >>> ks.read_orc("example.orc")
       i32  i64    f  bhello
    0    0    0  0.0  people
    
    >>> pd.read_orc("example.orc")
       i32  i64    f  bhello
    0    0    0  0.0  people
    
    # with columns
    >>> ks.read_orc("example.orc", columns=["i32", "f"])
       i32    f
    0    0  0.0
    
    >>> pd.read_orc("example.orc", columns=["i32", "f"])
       i32    f
    0    0  0.0
    >>>
    ```
    itholic authored Jan 28, 2021 (commit b8e2924)

Commits on Jan 29, 2021

  1. Implemented DataFrame.to_orc (#2024)

    This PR proposes `DataFrame.to_orc` to write the ORC file.
    
    pandas doesn't support this, but we provide it for convenience.
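
    A hedged usage sketch (the path is illustrative):

    ```python
    import databricks.koalas as ks

    kdf = ks.DataFrame({"i32": [0], "f": [0.0]})
    kdf.to_orc("/tmp/example_output.orc")
    ```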
    itholic authored Jan 29, 2021 (commit 6c254f7)

Commits on Feb 1, 2021

  1. [HOTFIX] set upperbounds numpy to fix CI failure (#2027)

    This PR quickly fixes the `mypy` test failure caused by the NumPy 1.20.0 release, to unblock other PRs.

    This upper bound should be removed again once #2026 is finished and merged.
    itholic authored Feb 1, 2021 (commit c5dbc9b)
  2. Commit 060fee3

Commits on Feb 2, 2021

  1. Change matplotlib as an optional dependency (#2029)

    This PR proposes to make matplotlib an optional dependency.
    
    ```python
    >>> from databricks import koalas as ks
    >>> ks.range(100).plot.bar()
    Traceback (most recent call last):
      ...
    ImportError: matplotlib is required for plotting when the default backend 'matplotlib' is selected.
    ```
    
    Resolves #issues
    HyukjinKwon authored Feb 2, 2021 (commit 5e28195)
  2. Add Int64Index, Float64Index, DatetimeIndex. (#2025)

    Adds `Int64Index`, `Float64Index`, and `DatetimeIndex` as a placeholder.
    We should still add specific attributes and methods in follow-up PRs.
    
    Before:
    
    ```py
    >>> kdf = ks.DataFrame([1,2,3])
    >>> type(kdf.index)
    <class 'databricks.koalas.indexes.Index'>
    ```
    
    After:
    
    ```py
    >>> type(kdf.index)
    <class 'databricks.koalas.indexes.numeric.Int64Index'>
    ```
    ueshin authored Feb 2, 2021 (commit c023792)
  3. Use NullType for empty or null dataset. (#2013)

    Experimental.
    ueshin authored Feb 2, 2021 (commit 5dbc5ec)
  4. Remove pypandoc hack (#2034)

    JessicaTegner/pypandoc#154 is merged and released, so we can remove the hack for file name handling.
    HyukjinKwon authored Feb 2, 2021 (commit aef4f48)
  5. Commit 88212f3
  6. Preserve index for statistical functions with axis==1. (#2036)

    Preserves `index` for statistical functions with `axis==1`.
    
    ```py
    >>> kdf = ks.DataFrame(
    ...     {
    ...         "A": [1, -2, 3, -4, 5],
    ...         "B": [1.0, -2, 3, -4, 5],
    ...         "C": [-6.0, -7, -8, -9, 10],
    ...         "D": [True, False, True, False, False],
    ...     },
    ...     index=[10, 20, 30, 40, 50]
    ... )
    >>> kdf.count(axis=1)
    10    4
    20    4
    30    4
    40    4
    50    4
    dtype: int64
    ```
    
    whereas, with a small `compute.shortcut_limit`, the index was not preserved:
    
    ```py
    >>> ks.set_option("compute.shortcut_limit", 2)
    >>> kdf.count(axis=1)
    0    4
    1    4
    2    4
    3    4
    4    4
    dtype: int64
    ```
    
    After:
    
    ```py
    >>> ks.set_option("compute.shortcut_limit", 2)
    >>> kdf.count(axis=1)
    10    4
    20    4
    30    4
    40    4
    50    4
    dtype: int64
    ```
    ueshin authored Feb 2, 2021 (commit 96f04aa)
Showing 88 changed files with 24,681 additions and 4,228 deletions.
  1. +49 −16 .github/workflows/master.yml
  2. +3 −1 README.md
  3. +1 −0 apt.txt
  4. +0 −7 databricks/conftest.py
  5. +58 −3 databricks/koalas/__init__.py
  6. +114 −35 databricks/koalas/accessors.py
  7. +536 −67 databricks/koalas/base.py
  8. +163 −0 databricks/koalas/categorical.py
  9. +6 −6 databricks/koalas/config.py
  10. +4 −4 databricks/koalas/extensions.py
  11. +1,033 −374 databricks/koalas/frame.py
  12. +641 −148 databricks/koalas/generic.py
  13. +119 −87 databricks/koalas/groupby.py
  14. +19 −0 databricks/koalas/indexes/__init__.py
  15. +205 −1,181 databricks/koalas/{indexes.py → indexes/base.py}
  16. +187 −0 databricks/koalas/indexes/category.py
  17. +741 −0 databricks/koalas/indexes/datetimes.py
  18. +1,169 −0 databricks/koalas/indexes/multi.py
  19. +146 −0 databricks/koalas/indexes/numeric.py
  20. +192 −40 databricks/koalas/indexing.py
  21. +415 −80 databricks/koalas/internal.py
  22. +0 −7 databricks/koalas/missing/frame.py
  23. +49 −5 databricks/koalas/missing/indexes.py
  24. +0 −7 databricks/koalas/missing/series.py
  25. +1 −1 databricks/koalas/mlflow.py
  26. +249 −6 databricks/koalas/namespace.py
  27. +0 −1 databricks/koalas/plot/__init__.py
  28. +520 −323 databricks/koalas/plot/core.py
  29. +61 −233 databricks/koalas/plot/matplotlib.py
  30. +217 −0 databricks/koalas/plot/plotly.py
  31. +504 −255 databricks/koalas/series.py
  32. +50 −1 databricks/koalas/spark/accessors.py
  33. +8 −4 databricks/koalas/strings.py
  34. +23 −27 databricks/koalas/testing/utils.py
  35. +15 −0 databricks/koalas/tests/indexes/__init__.py
  36. +300 −46 databricks/koalas/tests/{test_indexes.py → indexes/test_base.py}
  37. +110 −0 databricks/koalas/tests/indexes/test_category.py
  38. +218 −0 databricks/koalas/tests/indexes/test_datetime.py
  39. +43 −4 databricks/koalas/tests/plot/test_frame_plot.py
  40. +20 −0 databricks/koalas/tests/plot/test_frame_plot_matplotlib.py
  41. +109 −0 databricks/koalas/tests/plot/test_frame_plot_plotly.py
  42. +49 −2 databricks/koalas/tests/plot/test_series_plot.py
  43. +29 −45 databricks/koalas/tests/plot/test_series_plot_matplotlib.py
  44. +125 −0 databricks/koalas/tests/plot/test_series_plot_plotly.py
  45. +461 −0 databricks/koalas/tests/test_categorical.py
  46. +515 −32 databricks/koalas/tests/test_dataframe.py
  47. +14 −1 databricks/koalas/tests/test_dataframe_conversion.py
  48. +78 −0 databricks/koalas/tests/test_dataframe_spark_io.py
  49. +1 −1 databricks/koalas/tests/test_extension.py
  50. +61 −14 databricks/koalas/tests/test_groupby.py
  51. +35 −4 databricks/koalas/tests/test_indexing.py
  52. +57 −0 databricks/koalas/tests/test_namespace.py
  53. +568 −77 databricks/koalas/tests/test_ops_on_diff_frames.py
  54. +9 −0 databricks/koalas/tests/test_ops_on_diff_frames_groupby.py
  55. +27 −8 databricks/koalas/tests/test_reshape.py
  56. +530 −72 databricks/koalas/tests/test_series.py
  57. +33 −0 databricks/koalas/tests/test_series_datetime.py
  58. +8 −8 databricks/koalas/tests/test_series_string.py
  59. +230 −26 databricks/koalas/tests/test_stats.py
  60. +210 −62 databricks/koalas/tests/test_typedef.py
  61. +316 −73 databricks/koalas/typedef/typehints.py
  62. +17 −2 databricks/koalas/usage_logging/__init__.py
  63. +105 −44 databricks/koalas/utils.py
  64. +1 −1 databricks/koalas/version.py
  65. +18 −10 databricks/koalas/window.py
  66. +4 −39 dev/gendoc.py
  67. +7 −0 dev/lint-python
  68. +2 −2 dev/pytest
  69. +1 −1 dev/tox.ini
  70. +6 −7 docs/source/conf.py
  71. +3 −10 docs/source/development/contributing.rst
  72. +12,550 −675 docs/source/getting_started/10min.ipynb
  73. +6 −1 docs/source/getting_started/install.rst
  74. +3 −0 docs/source/index.rst
  75. +7 −0 docs/source/reference/frame.rst
  76. +1 −1 docs/source/reference/general_functions.rst
  77. +89 −6 docs/source/reference/indexing.rst
  78. +8 −0 docs/source/reference/io.rst
  79. +30 −6 docs/source/reference/series.rst
  80. +107 −0 docs/source/user_guide/from_to_dbms.rst
  81. +1 −0 docs/source/user_guide/index.rst
  82. +6 −7 docs/source/user_guide/options.rst
  83. +1 −1 docs/source/user_guide/transform_apply.rst
  84. +5 −4 docs/source/user_guide/typehints.rst
  85. +11 −0 docs/source/user_guide/types.rst
  86. +19 −9 postBuild
  87. +12 −3 requirements-dev.txt
  88. +7 −5 setup.py
65 changes: 49 additions & 16 deletions .github/workflows/master.yml
@@ -20,18 +20,33 @@ jobs:
spark-version: 2.3.4
pandas-version: 0.23.4
pyarrow-version: 0.16.0
- python-version: 3.5
numpy-version: 1.18.5
- python-version: 3.6
spark-version: 2.3.4
pandas-version: 0.24.2
pyarrow-version: 0.10.0
numpy-version: 1.19.5
default-index-type: 'distributed-sequence'
- python-version: 3.9
spark-version: 3.1.2
pandas-version: 1.2.5
pyarrow-version: 3.0.0
numpy-version: 1.20.3
- python-version: 3.9
spark-version: 3.2.0
pandas-version: 1.2.5
pyarrow-version: 4.0.1
numpy-version: 1.21.2
default-index-type: 'distributed-sequence'
env:
PYTHON_VERSION: ${{ matrix.python-version }}
SPARK_VERSION: ${{ matrix.spark-version }}
PANDAS_VERSION: ${{ matrix.pandas-version }}
PYARROW_VERSION: ${{ matrix.pyarrow-version }}
NUMPY_VERSION: ${{ matrix.numpy-version }}
DEFAULT_INDEX_TYPE: ${{ matrix.default-index-type }}
KOALAS_TESTING: 1
SPARK_LOCAL_IP: 127.0.0.1
# DISPLAY=0.0 does not work in Github Actions with Python 3.5. Here we work around with xvfb-run
PYTHON_EXECUTABLE: xvfb-run python
# Github token is required to auto-generate the release notes from Github release notes
@@ -61,8 +76,12 @@ jobs:
# as Black only works with Python 3.6+. This is hacky but we will drop
# Python 3.5 soon so it's fine.
if [[ "$PYTHON_VERSION" < "3.6" ]]; then sed -i '/black/d' requirements-dev.txt; fi
# sphinx-plotly-directive supports Python 3.6+
if [[ "$PYTHON_VERSION" < "3.6" ]]; then sed -i '/sphinx-plotly-directive/d' requirements-dev.txt; fi
# Disable mypy check for PySpark 3.1
if [[ "$SPARK_VERSION" > "3.1" ]]; then sed -i '/mypy/d' requirements-dev.txt; fi
pip install -r requirements-dev.txt
pip install pandas==$PANDAS_VERSION pyarrow==$PYARROW_VERSION pyspark==$SPARK_VERSION
pip install pandas==$PANDAS_VERSION pyarrow==$PYARROW_VERSION pyspark==$SPARK_VERSION numpy==$NUMPY_VERSION
# matplotlib dropped Python 3.5 support from 3.1.x; however, 3.0.3 only supports sphinx 2.x.
# It forces the sphinx version to 2.x.
if [[ "$PYTHON_VERSION" < "3.6" ]]; then pip install "sphinx<3.0.0"; fi
@@ -86,37 +105,45 @@ jobs:
spark-version: 2.4.7
pandas-version: 0.24.2
pyarrow-version: 0.14.1
numpy-version: 1.19.5
logger: databricks.koalas.usage_logging.usage_logger
- python-version: 3.6
spark-version: 2.4.7
pandas-version: 0.25.3
pyarrow-version: 0.15.1
default-index-type: 'distributed-sequence'
- python-version: 3.7
spark-version: 2.4.7
pandas-version: 0.25.3
pyarrow-version: 0.14.1
- python-version: 3.7
spark-version: 2.4.7
pandas-version: 1.0.5
pyarrow-version: 0.15.1
numpy-version: 1.19.5
default-index-type: 'distributed-sequence'
- python-version: 3.7
spark-version: 3.0.1
pandas-version: 0.25.3
spark-version: 3.0.2
pandas-version: 1.0.5
pyarrow-version: 1.0.1
numpy-version: 1.19.5
- python-version: 3.7
spark-version: 3.1.1
pandas-version: 1.1.5
pyarrow-version: 2.0.0
numpy-version: 1.19.5
default-index-type: 'distributed-sequence'
- python-version: 3.8
spark-version: 3.0.1
spark-version: 3.0.2
pandas-version: 1.1.5
pyarrow-version: 2.0.0
numpy-version: 1.19.5
- python-version: 3.8
spark-version: 3.1.1
pandas-version: 1.2.5
pyarrow-version: 3.0.0
numpy-version: 1.20.3
default-index-type: 'distributed-sequence'
env:
PYTHON_VERSION: ${{ matrix.python-version }}
SPARK_VERSION: ${{ matrix.spark-version }}
PANDAS_VERSION: ${{ matrix.pandas-version }}
PYARROW_VERSION: ${{ matrix.pyarrow-version }}
NUMPY_VERSION: ${{ matrix.numpy-version }}
DEFAULT_INDEX_TYPE: ${{ matrix.default-index-type }}
KOALAS_TESTING: 1
SPARK_LOCAL_IP: 127.0.0.1
# `QT_QPA_PLATFORM` for resolving 'QXcbConnection: Could not connect to display :0.0'
DISPLAY: 0.0
QT_QPA_PLATFORM: offscreen
@@ -141,15 +168,21 @@ jobs:
conda config --env --add pinned_packages python=$PYTHON_VERSION
conda config --env --add pinned_packages pandas==$PANDAS_VERSION
conda config --env --add pinned_packages pyarrow==$PYARROW_VERSION
conda config --env --add pinned_packages numpy==$NUMPY_VERSION
conda config --env --add pinned_packages pyspark==$SPARK_VERSION
if [[ "$SPARK_VERSION" < "3.0" ]]; then
pip install pyspark==$SPARK_VERSION
else
conda install -c conda-forge --yes pyspark==$SPARK_VERSION
fi
conda install -c conda-forge --yes pandas==$PANDAS_VERSION pyarrow==$PYARROW_VERSION
sed -i -e "/pandas/d" -e "/pyarrow/d" requirements-dev.txt
conda install -c conda-forge --yes pandas==$PANDAS_VERSION pyarrow==$PYARROW_VERSION numpy==$NUMPY_VERSION
sed -i -e "/pandas/d" -e "/pyarrow/d" -e "/numpy>=/d" requirements-dev.txt
# Disable mypy check for PySpark 3.1
if [[ "$SPARK_VERSION" > "3.1" ]]; then sed -i '/mypy/d' requirements-dev.txt; fi
# sphinx-plotly-directive is not available on Conda.
sed -i '/sphinx-plotly-directive/d' requirements-dev.txt
conda install -c conda-forge --yes --file requirements-dev.txt
pip install sphinx-plotly-directive # pip-only dependency
conda list
- name: Run tests
run: |
4 changes: 3 additions & 1 deletion README.md
@@ -1,3 +1,5 @@
## DEPRECATED: Koalas supports Apache Spark 3.1 and below as it is [officially included to PySpark in Apache Spark 3.2](https://issues.apache.org/jira/browse/SPARK-34849). This repository is now in maintenance mode. For Apache Spark 3.2 and above, please use [PySpark](https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html) directly.

<p align="center">
<img src="https://raw.githubusercontent.com/databricks/koalas/master/icons/koalas-logo.png" width="140"/>
</p>
@@ -52,7 +54,7 @@ pip install koalas

See [Installation](https://koalas.readthedocs.io/en/latest/getting_started/install.html) for more details.

For Databricks Runtime users, Koalas is pre-installed in Databricks Runtime 7.1 and above, or you can follow these [steps](https://docs.databricks.com/libraries/index.html) to install a library on Databricks.
For Databricks Runtime, Koalas is pre-installed in Databricks Runtime 7.1 and above. Try [Databricks Community Edition](https://community.cloud.databricks.com/) for free. You can also follow these [steps](https://docs.databricks.com/libraries/index.html) to manually install a library on Databricks.

Lastly, if your PyArrow version is 0.15+ and your PySpark version is lower than 3.0, it is best for you to set `ARROW_PRE_0_15_IPC_FORMAT` environment variable to `1` manually.
Koalas will try its best to set it for you but it is impossible to set it if there is a Spark context already launched.
1 change: 1 addition & 0 deletions apt.txt
@@ -0,0 +1 @@
openjdk-8-jre
7 changes: 0 additions & 7 deletions databricks/conftest.py
@@ -25,7 +25,6 @@

import pandas as pd
import pyarrow as pa
import matplotlib.pyplot as plt
from pyspark import __version__

from databricks import koalas as ks
@@ -102,12 +101,6 @@ def add_caplog(caplog):
yield


@pytest.fixture(autouse=True)
def close_figs():
yield
plt.close("all")


@pytest.fixture(autouse=True)
def check_options():
orig_default_index_type = ks.options.compute.default_index_type
61 changes: 58 additions & 3 deletions databricks/koalas/__init__.py
@@ -20,10 +20,31 @@
from databricks.koalas.version import __version__ # noqa: F401


def assert_python_version():
import warnings

major = 3
minor = 5
deprecated_version = (major, minor)
min_supported_version = (major, minor + 1)

if sys.version_info[:2] <= deprecated_version:
warnings.warn(
"Koalas support for Python {dep_ver} is deprecated and will be dropped in "
"the future release. At that point, existing Python {dep_ver} workflows "
"that use Koalas will continue to work without modification, but Python {dep_ver} "
"users will no longer get access to the latest Koalas features and bugfixes. "
"We recommend that you upgrade to Python {min_ver} or newer.".format(
dep_ver=".".join(map(str, deprecated_version)),
min_ver=".".join(map(str, min_supported_version)),
),
FutureWarning,
)


def assert_pyspark_version():
import logging

pyspark_ver = None
try:
import pyspark
except ImportError:
@@ -33,17 +54,42 @@ def assert_pyspark_version():
)
else:
pyspark_ver = getattr(pyspark, "__version__")
if pyspark_ver is None or pyspark_ver < "2.4":
if pyspark_ver is None or LooseVersion(pyspark_ver) < LooseVersion("2.4"):
logging.warning(
'Found pyspark version "{}" installed. pyspark>=2.4.0 is recommended.'.format(
pyspark_ver if pyspark_ver is not None else "<unknown version>"
)
)
elif LooseVersion(pyspark_ver) >= LooseVersion("3.2"):
logging.warning(
'Found pyspark version "{}" installed. The pyspark version 3.2 and above has '
'a built-in "pandas APIs on Spark" module ported from Koalas. '
"Try `import pyspark.pandas as ps` instead. ".format(pyspark_ver)
)


assert_python_version()
assert_pyspark_version()

import pyspark
import numpy

if LooseVersion(pyspark.__version__) < LooseVersion("3.1") and LooseVersion(
numpy.__version__
) >= LooseVersion("1.20"):
import logging

logging.warning(
'Found numpy version "{numpy_version}" installed with pyspark version "{pyspark_version}". '
"Some functions will not work well with this combination of "
'numpy version "{numpy_version}" and pyspark version "{pyspark_version}". '
"Please try to upgrade pyspark version to 3.1 or above, "
"or downgrade numpy version to below 1.20.".format(
numpy_version=numpy.__version__, pyspark_version=pyspark.__version__
)
)


import pyarrow

if LooseVersion(pyspark.__version__) < LooseVersion("3.0"):
@@ -86,20 +132,29 @@ def assert_pyspark_version():
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

from databricks.koalas.frame import DataFrame
from databricks.koalas.indexes import Index, MultiIndex
from databricks.koalas.indexes.base import Index
from databricks.koalas.indexes.category import CategoricalIndex
from databricks.koalas.indexes.datetimes import DatetimeIndex
from databricks.koalas.indexes.multi import MultiIndex
from databricks.koalas.indexes.numeric import Float64Index, Int64Index
from databricks.koalas.series import Series
from databricks.koalas.groupby import NamedAgg

__all__ = [ # noqa: F405
"read_csv",
"read_parquet",
"to_datetime",
"date_range",
"from_pandas",
"get_dummies",
"DataFrame",
"Series",
"Index",
"MultiIndex",
"Int64Index",
"Float64Index",
"CategoricalIndex",
"DatetimeIndex",
"sql",
"range",
"concat",