Introduce default index with three new index types #639
Conversation
Will fix the tests soon.
assert column_index is None
assert column_index_names is None

if "__index_level_0__" not in sdf.schema.names:
I will make a separate PR to completely disallow no-index Koalas DataFrames. It seems like there are multiple places to fix.
One side question is how we will handle the round trip between Koalas and Spark DataFrames. If a Koalas DataFrame has an index, to_spark()
loses the index information and we have nowhere to store it.
I think, until we figure out a way to store the index information properly, we should strip the index when we convert to a Spark DataFrame.
Currently it's kind of funny:
>>> import databricks.koalas as ks
>>> ks.DataFrame(ks.DataFrame({'a': [1,2,3]}).to_spark())
   __index_level_0__  a
0                  0  1
1                  1  2
2                  2  3
So basically I would like to propose the following until we figure out a clever way to round-trip:
>>> ks.DataFrame({'a': [1,2,3]}).to_spark().show()
+---+
| a|
+---+
| 1|
| 2|
| 3|
+---+
How about dropping only if the index is not named, otherwise retain it with the name?
But to do that, we would have to keep the index mapping information somewhere. After converting a Koalas DataFrame into a Spark DataFrame, it is lost.
Yeah, losing the mapping should be okay. I just meant that if the index is named, then the name should be the column name in the Spark DataFrame.
Oh, oh right. Yes.
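For illustration, here is a minimal sketch of the conversion behavior agreed above: drop unnamed index columns, keep named ones under their names. The helper name, signature, and `index_map` shape are hypothetical, loosely modeled on Koalas' internal `(column, name)` index metadata, not the actual implementation:

```python
from pyspark.sql import DataFrame

def strip_index_for_spark(sdf: DataFrame, index_map) -> DataFrame:
    # index_map: list of (internal Spark column name, index name or None)
    # pairs, loosely modeled on Koalas' internal metadata. Hypothetical.
    index_names = dict(index_map)
    selected = []
    for column in sdf.columns:
        if column not in index_names:
            selected.append(sdf[column])  # plain data column: keep as-is
        elif index_names[column] is not None:
            # named index: retain it under its user-visible name
            selected.append(sdf[column].alias(index_names[column]))
        # unnamed index (e.g. "__index_level_0__"): drop it on conversion
    return sdf.select(*selected)
```

With this, `ks.DataFrame({'a': [1, 2, 3]}).to_spark()` would produce only the `a` column, as in the proposal above.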
1 foo 1 foo 5
2 foo 1 foo 8
3 foo 5 foo 5
4 foo 5 foo 8
This actually matches pandas's result, since the output is sorted.
@@ -273,7 +273,7 @@ def read_delta(path: str, version: Optional[str] = None, timestamp: Optional[str
 Examples
 --------
 >>> ks.range(1).to_delta('%s/read_delta/foo' % path)
->>> ks.read_delta('%s/read_delta/foo' % path)
+>>> ks.read_delta('%s/read_delta/foo' % path)  # doctest: +SKIP
Those tests are related to https://github.com/databricks/koalas/pull/639/files#r313350742.
I will make a separate PR to fix it.
Window.orderBy(F.monotonically_increasing_id().asc())) - 1
scols = [scol_for(sdf, column) for column in sdf.columns]
return sdf.select(sequential_index.alias("__index_level_0__"), *scols)
elif default_index_type == "distributed-one-by-one":
Here, it actually mimics `zipWithIndex` in the RDD API.
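For reference, a tiny sketch of what `zipWithIndex`-style indexing looks like with the plain RDD API; this shows the behavior being mimicked, not the Koalas code path:

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(5).toDF("a")

# zipWithIndex makes one pass to count rows per partition and a second
# to assign offsets, yielding a global 0..n-1 sequence without a global sort.
indexed = (sdf.rdd
           .zipWithIndex()
           .map(lambda pair: Row(__index_level_0__=pair[1], **pair[0].asDict()))
           .toDF())
indexed.show()
```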
Codecov Report

@@            Coverage Diff             @@
##           master     #639      +/-   ##
==========================================
- Coverage   92.95%   92.85%   -0.11%
==========================================
  Files          31       31
  Lines        5093     5119      +26
==========================================
+ Hits         4734     4753      +19
- Misses        359      366       +7
I am merging this too. Let me know if you guys have more comments.
… DataFrame with no index (#655)

This PR is a follow-up and proposes two things:
- Exclude Index columns for exposed Spark DataFrame
- Disallow Koalas DataFrame with no index

So, for instance, `to_spark()` now shows:

```diff
-    __index_level_0__  x
- 0                  0  0
- 1                  1  1
+    x
+ 0  0
+ 1  1
```

and `index_map` is no longer expected to ever be empty in a Koalas DataFrame. It sets the default index explicitly per #639.
This PR proposes a default index so that we can now forget about the case when the index is missing in a Koalas DataFrame - when a Koalas DataFrame is created directly from a Spark DataFrame.

There are three types of default index, controlled by the `DEFAULT_INDEX` environment variable.

one-by-one: It implements a one-by-one sequence with a Window function without specifying a partition. Therefore, it ends up moving the whole dataset into a single partition on a single node. This index type should be avoided when the data is large. This is the default.
TL;DR: `Window` with `row_number` without a partitioning spec (see the sketch below).
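A minimal sketch of the one-by-one idea on a plain Spark DataFrame, assuming only PySpark; the unpartitioned Window pulls all rows into a single partition, which is why this type should be avoided for large data:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(3).toDF("a")

# row_number over an unpartitioned Window: globally sequential, single node.
w = Window.orderBy(F.monotonically_increasing_id())
indexed = sdf.select(
    (F.row_number().over(w) - 1).alias("__index_level_0__"), "*"
)
indexed.show()
```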
distributed-one-by-one: It implements a one-by-one sequence with a group-by and group-map approach. It still generates a one-by-one sequential index globally. If the default index must be a one-by-one sequence over a large dataset, this index has to be used.
TL;DR: `groupby(partition_id).count().collect()` and `groupby(partition_id).apply(f)` (see the sketch below).
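A rough sketch of the distributed-one-by-one idea, again on a plain Spark DataFrame: count rows per partition, compute each partition's starting offset on the driver, then attach offset plus local position. The details here are illustrative and differ from the actual Koalas implementation:

```python
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(10).repartition(3).toDF("a")

# First pass: rows per partition (the groupby(partition_id).count().collect()
# step from the TL;DR above).
counts = dict(
    sdf.groupBy(F.spark_partition_id().alias("pid")).count().collect()
)
offsets, start = {}, 0
for pid in sorted(counts):
    offsets[pid] = start
    start += counts[pid]

# Second pass: each partition assigns offset + local position, so the
# result is still a global 0..n-1 sequence without a single-node sort.
def attach_index(pid, rows):
    for i, row in enumerate(rows):
        yield Row(__index_level_0__=offsets[pid] + i, **row.asDict())

indexed = sdf.rdd.mapPartitionsWithIndex(attach_index).toDF()
indexed.show()
```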
distributed: It implements a monotonically increasing sequence simply by using Spark's `monotonically_increasing_id` function. If the index does not have to be a one-by-one sequence, this index should be used. Performance-wise, this index has almost no penalty compared to the other index types.
TL;DR: `monotonically_increasing_id()` (see the sketch below).
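And a sketch of the distributed type, which only attaches `monotonically_increasing_id`; the resulting index is increasing and unique but not consecutive:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(3).toDF("a")

# No shuffle and no global ordering pass, hence almost no penalty.
indexed = sdf.select(
    F.monotonically_increasing_id().alias("__index_level_0__"), "*"
)
indexed.show()
```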