
[DataFrame] Fix bug with consistency between IndexMetadata and partitions #2088

Closed
pschafhalter wants to merge 1 commit into ray-project:master from pschafhalter:df-fix-correct-column-dtypes

@pschafhalter
Contributor

Calling _correct_column_dtypes rebuilds the partitions using create_blocks_helper. However, create_blocks_helper creates partitions of equal size, which results in an inconsistency between the lengths recorded in IndexMetadata and the actual column partitions. This PR aims to fix the issue by ensuring that _correct_column_dtypes constructs partitions of the same sizes as before.

Code which demonstrates the bug:

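import ray
import ray.dataframe as rdf

# Write a single-column DataFrame to CSV, then read it back.
df = rdf.DataFrame({"col": list(range(1000000))})
df.to_csv("test_df.csv")
read_df = rdf.read_csv("test_df.csv")

# Row-partition lengths recorded in the index metadata.
print(read_df._row_metadata._lengths)
# Actual length of each row partition; with the bug, these differ from the line above.
print(ray.get(rdf.utils._map_partitions(lambda df: len(df), read_df._row_partitions)))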
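
The kind of mismatch the snippet above prints can be illustrated without Ray. The following is a minimal sketch in plain pandas, assuming that re-blocking splits rows into equal-size chunks. The function `equal_blocks` and the length values are hypothetical and are not taken from the actual `create_blocks_helper` implementation.

```python
import pandas as pd


def equal_blocks(df, num_blocks):
    """Hypothetical stand-in: split df into row blocks of (nearly) equal size."""
    block_size = -(-len(df) // num_blocks)  # ceiling division
    return [df.iloc[i:i + block_size] for i in range(0, len(df), block_size)]


df = pd.DataFrame({"col": range(10)})

# Made-up lengths standing in for what the metadata recorded before re-blocking.
recorded_lengths = [5, 3, 2]

# Rebuilding with equal-size blocks ignores the recorded lengths.
rebuilt = equal_blocks(df, len(recorded_lengths))
actual_lengths = [len(block) for block in rebuilt]

print(recorded_lengths)  # [5, 3, 2]
print(actual_lengths)    # [4, 4, 2]
```

Both lists sum to the same row count, so the data is intact, but any lookup that relies on the recorded lengths would land in the wrong block.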

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5465/

@pschafhalter force-pushed the df-fix-correct-column-dtypes branch from a2ef058 to 9802010 on May 18, 2018 23:06
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5484/

Member

Can this not be fixed in create_blocks_helper? It looks like that would be the source of the problem.

Contributor Author

It could, if we add another argument to create_blocks_helper that specifies the number of rows in each partition.

I wasn't sure if we wanted to change the behavior of create_blocks_helper.
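
For illustration, the alternative discussed here (passing the desired row count of each partition explicitly) could look roughly like the sketch below. It uses plain pandas rather than Ray, and `blocks_with_lengths` is a hypothetical name, not the signature of `create_blocks_helper`.

```python
import pandas as pd


def blocks_with_lengths(df, lengths):
    """Hypothetical helper: split df into row blocks with the given lengths."""
    if sum(lengths) != len(df):
        raise ValueError("lengths must sum to the number of rows")
    blocks, start = [], 0
    for length in lengths:
        blocks.append(df.iloc[start:start + length])
        start += length
    return blocks


df = pd.DataFrame({"col": range(10)})

# Made-up lengths standing in for the partition sizes recorded before rebuilding.
old_lengths = [5, 3, 2]

new_blocks = blocks_with_lengths(df, old_lengths)
new_lengths = [len(block) for block in new_blocks]
print(new_lengths)  # [5, 3, 2]
```

Because the rebuilt blocks reuse the old lengths, the recorded metadata stays valid without being recomputed. The trade-off raised in the thread still applies: this changes what the helper accepts, which other callers may not expect.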

@pschafhalter force-pushed the df-fix-correct-column-dtypes branch from 9802010 to 8dd2fb3 on May 19, 2018 03:48
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5493/

@pschafhalter force-pushed the df-fix-correct-column-dtypes branch from 8dd2fb3 to 3f5a0d8 on May 19, 2018 06:40
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5494/

@pschafhalter
Contributor Author

All tests passed on my private Travis build for the current commit.

@pschafhalter
Contributor Author

Deprecated due to #2118.
