
Introduce default index with new three index types #639

Merged
merged 4 commits into from
Aug 16, 2019

Conversation

@HyukjinKwon (Member) commented Aug 13, 2019

This PR proposes a default index so that we no longer have to worry about a missing index when a Koalas DataFrame is created directly from a Spark DataFrame.

There are three types of default index, controlled by the DEFAULT_INDEX environment variable.

  • one-by-one: Implements a one-by-one sequence via a Window function without
    specifying a partition. As a result, the whole dataset ends up in a single
    partition on a single node, so this index type should be avoided when the
    data is large. This is the default.
    TL;DR: row_number over a Window without a partitioning spec.

  • distributed-one-by-one: Implements a one-by-one sequence via a group-by and
    group-map approach. It still generates a globally sequential one-by-one index.
    If the default index must be a one-by-one sequence over a large dataset, this
    index type has to be used.
    TL;DR: groupby(partition_id).count().collect() and groupby(partition_id).apply(f).

  • distributed: Implements a monotonically increasing sequence simply by using
    Spark's monotonically_increasing_id function. If the index does not have to be
    a one-by-one sequence, this index type should be used. Performance-wise, it
    has almost no penalty compared to the other index types.
    TL;DR: monotonically_increasing_id().

@HyukjinKwon (Member Author): will fix the tests soon.

assert column_index is None
assert column_index_names is None

if "__index_level_0__" not in sdf.schema.names:
@HyukjinKwon (Member Author) commented:

I will make a separate PR to completely disallow a no-index Koalas DataFrame. It seems there are multiple places to fix.

One side question is how we will handle the roundtrip between Koalas and Spark DataFrames. If the Koalas DataFrame has an index, to_spark() loses the index information and we have nowhere to store it.

I think, until we figure out a way to store the index information properly, we should strip the index when converting to a Spark DataFrame.

Currently it's kind of funny:

>>> import databricks.koalas as ks
>>> ks.DataFrame(ks.DataFrame({'a': [1,2,3]}).to_spark())
   __index_level_0__  a
0                  0  1
1                  1  2
2                  2  3

So, basically, I would like to propose the behavior below until we figure out a clever way to do the roundtrip.

>>> ks.DataFrame({'a': [1,2,3]}).to_spark().show()
+---+
|  a|
+---+
|  1|
|  2|
|  3|
+---+
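The column selection behind such stripping can be sketched in plain Python. This is a hedged illustration only: the helper `data_columns` and the prefix constant are hypothetical stand-ins, not the actual Koalas internals:

```python
# Koalas stores internal index columns under names like "__index_level_0__";
# a stripped to_spark() would expose only the remaining data columns.
INDEX_PREFIX = "__index_level_"

def data_columns(all_columns):
    """Return only the user-facing data columns (hypothetical helper)."""
    return [c for c in all_columns if not c.startswith(INDEX_PREFIX)]

print(data_columns(["__index_level_0__", "a"]))  # → ['a']
```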

@HyukjinKwon (Member Author): cc @ueshin and @rxin.

(Collaborator): How about dropping only if the index is not named, otherwise retain it with the name?

@HyukjinKwon (Member Author): But to do that, we would have to keep the index mapping information somewhere. After converting a Koalas DataFrame into a Spark DataFrame, it is lost.

(Collaborator): Yeah, losing the mapping should be okay. I just meant that if the index is named, then its name should become the column name in the Spark DataFrame.

@HyukjinKwon (Member Author): Oh, right. Yes.

1 foo 1 foo 5
2 foo 1 foo 8
3 foo 5 foo 5
4 foo 5 foo 8
@HyukjinKwon (Member Author): This actually matches pandas's result, since the output is sorted.

@@ -273,7 +273,7 @@ def read_delta(path: str, version: Optional[str] = None, timestamp: Optional[str
Examples
--------
>>> ks.range(1).to_delta('%s/read_delta/foo' % path)
>>> ks.read_delta('%s/read_delta/foo' % path)
>>> ks.read_delta('%s/read_delta/foo' % path) # doctest: +SKIP
@HyukjinKwon (Member Author): Those tests are related to https://github.com/databricks/koalas/pull/639/files#r313350742.

I will make a separate PR to fix it.

@HyukjinKwon HyukjinKwon requested a review from thunterdb August 13, 2019 12:25
Window.orderBy(F.monotonically_increasing_id().asc())) - 1
scols = [scol_for(sdf, column) for column in sdf.columns]
return sdf.select(sequential_index.alias("__index_level_0__"), *scols)
elif default_index_type == "distributed-one-by-one":
@HyukjinKwon (Member Author): Here, it actually mimics zipWithIndex in the RDD API.
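The zipWithIndex-style arithmetic can be simulated in plain Python, with lists standing in for Spark partitions (names here are illustrative, not the Koalas internals): count rows per partition, compute each partition's cumulative starting offset, then add the offset to each row's local position.

```python
def attach_sequential_index(partitions):
    """Simulate zipWithIndex: a global sequential index from per-partition counts."""
    counts = [len(p) for p in partitions]   # like groupby(partition_id).count()
    offsets = [0]
    for c in counts[:-1]:                   # cumulative starting offset
        offsets.append(offsets[-1] + c)     # for each subsequent partition
    return [
        (offset + pos, row)                 # global index = offset + local position
        for offset, part in zip(offsets, partitions)
        for pos, row in enumerate(part)
    ]

print(attach_sequential_index([["a", "b"], ["c"], ["d", "e"]]))
# → [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]
```

Only the small per-partition counts are collected to the driver; the index attachment itself stays distributed, which is why this scheme scales where the unpartitioned Window does not.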



@codecov-io commented Aug 14, 2019

Codecov Report

Merging #639 into master will decrease coverage by 0.1%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master     #639      +/-   ##
==========================================
- Coverage   92.95%   92.85%   -0.11%     
==========================================
  Files          31       31              
  Lines        5093     5119      +26     
==========================================
+ Hits         4734     4753      +19     
- Misses        359      366       +7
Impacted Files Coverage Δ
databricks/koalas/sql.py 93.75% <ø> (-1.05%) ⬇️
databricks/koalas/namespace.py 88% <ø> (-1.21%) ⬇️
databricks/koalas/frame.py 94.65% <ø> (+0.01%) ⬆️
databricks/koalas/internal.py 96.66% <85.71%> (-2.65%) ⬇️
databricks/koalas/series.py 92.94% <0%> (+0.21%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 82e2e41...4978f80. Read the comment docs.

@HyukjinKwon HyukjinKwon merged commit 795b4ed into databricks:master Aug 16, 2019
@HyukjinKwon (Member Author): I am merging this too. Let me know if you guys have more comments.

ueshin added a commit that referenced this pull request Aug 16, 2019
Currently the master build is failing.
It seems there are conflicting changes between #633 and #639.
I'd skip the test for now to unblock other PRs.
HyukjinKwon added a commit that referenced this pull request Aug 19, 2019
… DataFrame with no index (#655)

This PR is a follow-up and proposes two things:
  - Exclude Index columns for exposed Spark DataFrame
  - Disallow Koalas DataFrame with no index

So, for instance, `to_spark()`  now shows:

```diff
-       __index_level_0__  x
-    0                  0  0
-    1                  1  1
+       x
+    0  0
+    1  1
```

and `index_map` is now always expected to be non-empty in a Koalas DataFrame. It sets the default index explicitly, per #639.
@HyukjinKwon HyukjinKwon deleted the default-index branch November 6, 2019 02:23