Skip to content

Conversation

@simon-mo
Copy link
Contributor

What do these changes do?

  • Implement loc and iloc's getitem methods

Note

  • Tests passed in private repo.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5078/
Test PASSed.

Copy link
Member

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great! I left some comments and questions.

Metadata for the new dataframe's columns
partial (boolean):
Internal: row_metadata and col_metadata only covers part of the
block partitions. (Used in index 'vew' accessor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vew -> view.

copy=False, col_partitions=None, row_partitions=None,
block_partitions=None, row_metadata=None, col_metadata=None):
block_partitions=None, row_metadata=None, col_metadata=None,
partial=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to have a DataFrameView object as a subclass of this one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I'm reading through this, I think it might make more sense to have it as a subclass. Changing the way that _block_partitions is handled in the case that it's a view makes complicated code more complicated.

Alternatively, if we decide otherwise, I would still like to see more comments about the _block_partitions changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should have it as a subclass. DataFrameView object make sense.

I'm refactoring parts of my indexing.py to make enlargement inside the parent class of the _LocIndexer and _iLocIndexer. So that we have cleaner code in indexing.py

if block_partitions is not None:
# put in numpy array here to make accesses easier since it's 2D
self._block_partitions = np.array(block_partitions)
if row_metadata is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this in two places?

Returns:
The dtypes for this DataFrame.
"""
# Deal with empty column case
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer comment to say Deal with a DataFrame with no columns or something along those lines. Empty columns feels a bit ambiguous (could mean all NaN in some cases).

Internal: row_metadata and col_metadata only covers part of the
block partitions. (Used in index 'vew' accessor)
"""
self.partial = partial
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial -> _partial

if is_2d(row_loc) and is_2d(col_loc):
return self._get_dataframe_view(row_loc, col_loc)

def _get_scaler(self, row_loc, col_loc):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning to implement these? It would be better to have a NotImplementedError than just pass. That way we don't have some silent internal error or something.

axis = 0
columns = pd_df.columns
index = pd_df.index
self._row_metadata = self._col_metadata = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated (Remove this in favor of Line 73)

retrieved_rows_remote = self._map_partition(
lookup_dict, col_label, indexer='loc')
joined_df = pd.concat(ray.get(retrieved_rows_remote))
def _get_scaler(self, row_loc, col_loc):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be able to add some comments about what these methods do (this and methods below)? Just so that we can quickly look at the code and maintain it better.

# The returned result need to be indexed series/df
# Re-index is needed.
joined_df.index = index_loc.index
def _get_scaler(self, row_loc, col_loc):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same on this file about method level documentation.

@simon-mo
Copy link
Contributor Author

simon-mo commented May 5, 2018

@devin-petersohn I'll make the following change in the indexing.py, removing the triage redundancy:
iloc_refactor

@simon-mo
Copy link
Contributor Author

simon-mo commented May 9, 2018

Closed via #2020

@simon-mo simon-mo closed this May 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants