-
Notifications
You must be signed in to change notification settings - Fork 7k
[DataFrame] Implement Indexer getitem #1955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test PASSed. |
devin-petersohn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great! I left some comments and questions.
| Metadata for the new dataframe's columns | ||
| partial (boolean): | ||
| Internal: row_metadata and col_metadata only covers part of the | ||
| block partitions. (Used in index 'vew' accessor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vew -> view.
| copy=False, col_partitions=None, row_partitions=None, | ||
| block_partitions=None, row_metadata=None, col_metadata=None): | ||
| block_partitions=None, row_metadata=None, col_metadata=None, | ||
| partial=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense to have a DataFrameView object as a subclass of this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I'm reading through this, I think it might make more sense to have it as a subclass. Changing the way that _block_partitions is handled in the case that it's a view makes complicated code more complicated.
Alternatively, if we decide otherwise, I would still like to see more comments about the _block_partitions changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should have it as a subclass. DataFrameView object make sense.
I'm refactoring parts of my indexing.py to make enlargement inside the parent class of the _LocIndexer and _iLocIndexer. So that we have cleaner code in indexing.py
| if block_partitions is not None: | ||
| # put in numpy array here to make accesses easier since it's 2D | ||
| self._block_partitions = np.array(block_partitions) | ||
| if row_metadata is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this in two places?
| Returns: | ||
| The dtypes for this DataFrame. | ||
| """ | ||
| # Deal with empty column case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer comment to say Deal with a DataFrame with no columns or something along those lines. Empty columns feels a bit ambiguous (could mean all NaN in some cases).
| Internal: row_metadata and col_metadata only covers part of the | ||
| block partitions. (Used in index 'vew' accessor) | ||
| """ | ||
| self.partial = partial |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial -> _partial
| if is_2d(row_loc) and is_2d(col_loc): | ||
| return self._get_dataframe_view(row_loc, col_loc) | ||
|
|
||
| def _get_scaler(self, row_loc, col_loc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning to implement these? It would be better to have a NotImplementedError than just pass. That way we don't have some silent internal error or something.
| axis = 0 | ||
| columns = pd_df.columns | ||
| index = pd_df.index | ||
| self._row_metadata = self._col_metadata = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated (Remove this in favor of Line 73)
| retrieved_rows_remote = self._map_partition( | ||
| lookup_dict, col_label, indexer='loc') | ||
| joined_df = pd.concat(ray.get(retrieved_rows_remote)) | ||
| def _get_scaler(self, row_loc, col_loc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be able to add some comments about what these methods do (this and methods below)? Just so that we can quickly look at the code and maintain it better.
| # The returned result need to be indexed series/df | ||
| # Re-index is needed. | ||
| joined_df.index = index_loc.index | ||
| def _get_scaler(self, row_loc, col_loc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same on this file about method level documentation.
|
@devin-petersohn I'll make the following change in the |
|
Closed via #2020 |

What do these changes do?
locandiloc's getitem methodsNote