Introduce default index with three new index types #639
Conversation
Will fix the tests soon.
assert column_index is None
assert column_index_names is None

if "__index_level_0__" not in sdf.schema.names:
I will make a separate PR to completely disallow no-index Koalas DataFrames. It seems like there are multiple places to fix.
One side question is how we will handle the round trip between Koalas and Spark DataFrames. If a Koalas DataFrame has an index, to_spark()
loses the index information and we have nowhere to store it.
I think, until we figure out a way to store the index information properly, we should strip the index when we convert to a Spark DataFrame.
Currently it's kind of funny:
>>> import databricks.koalas as ks
>>> ks.DataFrame(ks.DataFrame({'a': [1,2,3]}).to_spark())
   __index_level_0__  a
0                  0  1
1                  1  2
2                  2  3
So basically I would like to propose the following until we figure out a clever way to round-trip:
>>> ks.DataFrame({'a': [1,2,3]}).to_spark().show()
+---+
| a|
+---+
| 1|
| 2|
| 3|
+---+
How about dropping only if the index is not named, otherwise retain it with the name?
But to do that, we would have to keep the index mapping information somewhere. After converting a Koalas DataFrame into a Spark DataFrame, it is lost.
Yeah, losing the mapping should be okay. I just meant that if the index is named, then the name should be the column name in the Spark DataFrame.
Oh, oh right. Yes.
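For illustration, here is a minimal sketch of the conversion behavior agreed above: drop unnamed index columns, keep named ones under their names. The helper name, signature, and `index_map` shape are hypothetical, loosely modeled on Koalas' internal `(column, name)` index metadata, not the actual implementation:

```python
from pyspark.sql import DataFrame

def strip_index_for_spark(sdf: DataFrame, index_map) -> DataFrame:
    # index_map: list of (internal Spark column name, index name or None)
    # pairs, loosely modeled on Koalas' internal metadata. Hypothetical.
    index_names = dict(index_map)
    selected = []
    for column in sdf.columns:
        if column not in index_names:
            selected.append(sdf[column])  # plain data column: keep as-is
        elif index_names[column] is not None:
            # named index: retain it under its user-visible name
            selected.append(sdf[column].alias(index_names[column]))
        # unnamed index (e.g. "__index_level_0__"): drop it on conversion
    return sdf.select(*selected)
```

With this, `ks.DataFrame({'a': [1, 2, 3]}).to_spark()` would produce only the `a` column, as in the proposal above.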
1 foo 1 foo 5
2 foo 1 foo 8
3 foo 5 foo 5
4 foo 5 foo 8
This actually matches pandas's result, since the output is sorted.
@@ -273,7 +273,7 @@ def read_delta(path: str, version: Optional[str] = None, timestamp: Optional[str
 Examples
 --------
 >>> ks.range(1).to_delta('%s/read_delta/foo' % path)
->>> ks.read_delta('%s/read_delta/foo' % path)
+>>> ks.read_delta('%s/read_delta/foo' % path)  # doctest: +SKIP
Those tests are related to https://github.com/databricks/koalas/pull/639/files#r313350742.
I will make a separate PR to fix it.
Window.orderBy(F.monotonically_increasing_id().asc())) - 1
scols = [scol_for(sdf, column) for column in sdf.columns]
return sdf.select(sequential_index.alias("__index_level_0__"), *scols)
elif default_index_type == "distributed-one-by-one":
Here, it actually mimics `zipWithIndex` in the RDD API.
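For reference, a tiny sketch of what `zipWithIndex`-style indexing looks like with the plain RDD API; this shows the behavior being mimicked, not the Koalas code path:

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(5).toDF("a")

# zipWithIndex makes one pass to count rows per partition and a second
# to assign offsets, yielding a global 0..n-1 sequence without a global sort.
indexed = (sdf.rdd
           .zipWithIndex()
           .map(lambda pair: Row(__index_level_0__=pair[1], **pair[0].asDict()))
           .toDF())
indexed.show()
```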
Codecov Report

@@            Coverage Diff             @@
##           master     #639      +/-   ##
==========================================
- Coverage   92.95%   92.85%   -0.11%
==========================================
  Files          31       31
  Lines        5093     5119      +26
==========================================
+ Hits         4734     4753      +19
- Misses        359      366       +7
I am merging this too. Let me know if you guys have more comments.
… DataFrame with no index (#655)

This PR is a follow-up and proposes two things:
- Exclude Index columns for exposed Spark DataFrame
- Disallow Koalas DataFrame with no index

So, for instance, `to_spark()` now shows:

```diff
-    __index_level_0__  x
- 0                  0  0
- 1                  1  1
+    x
+ 0  0
+ 1  1
```

and `index_map` is no longer expected to ever be empty in a Koalas DataFrame. It sets the default index explicitly per #639.
This PR proposes a default index so that we can now forget about the case when the index is missing in a Koalas DataFrame - when a Koalas DataFrame is created directly from a Spark DataFrame.

There are three types of default index, controlled by the `DEFAULT_INDEX` environment variable.

one-by-one: It implements a one-by-one sequence with a Window function without specifying a partition. Therefore, it ends up moving the whole dataset into a single partition on a single node. This index type should be avoided when the data is large. This is the default.
TL;DR: `Window` with `row_number` without a partitioning spec (see the sketch below).
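A minimal sketch of the one-by-one idea on a plain Spark DataFrame, assuming only PySpark; the unpartitioned Window pulls all rows into a single partition, which is why this type should be avoided for large data:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(3).toDF("a")

# row_number over an unpartitioned Window: globally sequential, single node.
w = Window.orderBy(F.monotonically_increasing_id())
indexed = sdf.select(
    (F.row_number().over(w) - 1).alias("__index_level_0__"), "*"
)
indexed.show()
```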
distributed-one-by-one: It implements a one-by-one sequence with a group-by and group-map approach. It still generates a one-by-one sequential index globally. If the default index must be a one-by-one sequence over a large dataset, this index has to be used.
TL;DR: `groupby(partition_id).count().collect()` and `groupby(partition_id).apply(f)` (see the sketch below).
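A rough sketch of the distributed-one-by-one idea, again on a plain Spark DataFrame: count rows per partition, compute each partition's starting offset on the driver, then attach offset plus local position. The details here are illustrative and differ from the actual Koalas implementation:

```python
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(10).repartition(3).toDF("a")

# First pass: rows per partition (the groupby(partition_id).count().collect()
# step from the TL;DR above).
counts = dict(
    sdf.groupBy(F.spark_partition_id().alias("pid")).count().collect()
)
offsets, start = {}, 0
for pid in sorted(counts):
    offsets[pid] = start
    start += counts[pid]

# Second pass: each partition assigns offset + local position, so the
# result is still a global 0..n-1 sequence without a single-node sort.
def attach_index(pid, rows):
    for i, row in enumerate(rows):
        yield Row(__index_level_0__=offsets[pid] + i, **row.asDict())

indexed = sdf.rdd.mapPartitionsWithIndex(attach_index).toDF()
indexed.show()
```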
distributed: It implements a monotonically increasing sequence simply by using Spark's `monotonically_increasing_id` function. If the index does not have to be a one-by-one sequence, this index should be used. Performance-wise, this index has almost no penalty compared to the other index types.
TL;DR: `monotonically_increasing_id()` (see the sketch below).
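And a sketch of the distributed type, which only attaches `monotonically_increasing_id`; the resulting index is increasing and unique but not consecutive:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(3).toDF("a")

# No shuffle and no global ordering pass, hence almost no penalty.
indexed = sdf.select(
    F.monotonically_increasing_id().alias("__index_level_0__"), "*"
)
indexed.show()
```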