-
Notifications
You must be signed in to change notification settings - Fork 63
refactor: Migrate DataFrame display to use IPython's _repr_mimebundle_() protocol for anywidget mode #2271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
bigframes/streaming/dataframe.py
Outdated
|
|
||
| def _repr_html_(self, *args, **kwargs): | ||
| return _return_type_wrapper(self._df._repr_html_, StreamingDataFrame)( | ||
| def _repr_html_fallback_(self, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need the trailing _ because we aren't trying to get ipython to automatically call this. In fact, we do not want it to call this directly.
bigframes/core/indexes/base.py
Outdated
| # metadata, like we do with DataFrame. | ||
| opts = bigframes.options.display | ||
| max_results = opts.max_rows | ||
| # anywdiget mode uses the same display logic as the "deferred" mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is out-of-date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed outdated comments
bigframes/dataframe.py
Outdated
| return "\n".join(lines) | ||
|
|
||
| def _repr_html_(self) -> str: | ||
| def _repr_html_fallback_(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, remove the trailing _. Just, _repr_html_fallback.
bigframes/dataframe.py
Outdated
| widget_repr["text/plain"] = formatter.repr_query_job(df._compute_dry_run()) | ||
| widget_repr["text/html"] = self._repr_html_fallback_() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, we have already executed the query as part of the widget construction. Let's use the information available to render the HTML and plain text versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optimized to use the data already fetched by the widget. Now it generates the HTML and plain text representations form the widget's cached data, avoiding redundant queries
bigframes/dataframe.py
Outdated
| except (AttributeError, ValueError, ImportError): | ||
| # Fallback: let IPython use _repr_html_fallback_() instead | ||
| warnings.warn( | ||
| "Anywidget mode is not available. " | ||
| "Please `pip install anywidget traitlets` or `pip install 'bigframes[anywidget]'` to use interactive tables. " | ||
| f"Falling back to static HTML. Error: {traceback.format_exc()}" | ||
| ) | ||
| # Don't return anything - let IPython fall back to _repr_html_fallback_() | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this except block. I would prefer we explicitly check for anywidget in particular to be installed. Right now, this could be hiding all sorts of errors.
If we must have it, then please remove all exception types except for ImportError.
bigframes/dataframe.py
Outdated
| # Don't return anything - let IPython fall back to _repr_html_fallback_() | ||
| pass | ||
|
|
||
| return {"text/html": self._repr_html_fallback_(), "text/plain": repr(self)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be calling more BQ API calls than necessary. Let's call to_pandas_batches() and fetch the first page, just like we do in the anywidget version.
This reverts commit 61a9903.
bigframes/dataframe.py
Outdated
| # Process blob columns if needed | ||
| self._cached() | ||
| df = self.copy() | ||
| if bigframes.options.display.blob_display: | ||
| blob_cols = [ | ||
| series_name | ||
| for series_name, series in df.items() | ||
| if series.dtype == bigframes.dtypes.OBJ_REF_DTYPE | ||
| ] | ||
| for col in blob_cols: | ||
| df[col] = df[col].blob._get_runtime(mode="R", with_metadata=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks duplicated with _repr_mimebundle_. Please pull it out into a separate function.
bigframes/dataframe.py
Outdated
| # Re-create the text representation from what we know. | ||
| opts = bigframes.options.display | ||
| with display_options.pandas_repr(opts): | ||
| import pandas.io.formats | ||
|
|
||
| # safe to mutate this, this dict is owned by this code, and does not affect global config | ||
| to_string_kwargs = ( | ||
| pandas.io.formats.format.get_dataframe_repr_params() # type: ignore | ||
| ) | ||
| if not self._has_index: | ||
| to_string_kwargs.update({"index": False}) | ||
| repr_string = widget._cached_data.to_string(**to_string_kwargs) | ||
|
|
||
| lines = repr_string.split("\n") | ||
| row_count = widget.row_count | ||
| if row_count is not None and row_count > len(widget._cached_data): | ||
| lines.append("...") | ||
|
|
||
| lines.append("") | ||
| column_count = len(self.columns) | ||
| lines.append(f"[{row_count or '?'} rows x {column_count} columns]") | ||
| widget_repr["text/plain"] = "\n".join(lines) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, this looks duplicated as well. Please get the common parameters out and refactor to a shared function.
bigframes/dataframe.py
Outdated
| # Generate HTML representation | ||
| with display_options.pandas_repr(opts): | ||
| if bigframes.options.display.blob_display and blob_cols: | ||
|
|
||
| def obj_ref_rt_to_html(obj_ref_rt) -> str: | ||
| obj_ref_rt_json = json.loads(obj_ref_rt) | ||
| obj_ref_details = obj_ref_rt_json["objectref"]["details"] | ||
| if "gcs_metadata" in obj_ref_details: | ||
| gcs_metadata = obj_ref_details["gcs_metadata"] | ||
| content_type = typing.cast( | ||
| str, gcs_metadata.get("content_type", "") | ||
| ) | ||
| if content_type.startswith("image"): | ||
| size_str = "" | ||
| if bigframes.options.display.blob_display_width: | ||
| size_str = f' width="{bigframes.options.display.blob_display_width}"' | ||
| if bigframes.options.display.blob_display_height: | ||
| size_str = ( | ||
| size_str | ||
| + f' height="{bigframes.options.display.blob_display_height}"' | ||
| ) | ||
| url = obj_ref_rt_json["access_urls"]["read_url"] | ||
| return f'<img src="{url}"{size_str}>' | ||
|
|
||
| return f'uri: {obj_ref_rt_json["objectref"]["uri"]}, authorizer: {obj_ref_rt_json["objectref"]["authorizer"]}' | ||
|
|
||
| formatters = {blob_col: obj_ref_rt_to_html for blob_col in blob_cols} | ||
| with pandas.option_context("display.max_colwidth", None): | ||
| html_string = pandas_df.to_html( | ||
| escape=False, | ||
| notebook=True, | ||
| max_rows=pandas.get_option("display.max_rows"), | ||
| max_cols=pandas.get_option("display.max_columns"), | ||
| show_dimensions=pandas.get_option("display.show_dimensions"), | ||
| formatters=formatters, # type: ignore | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. Shouldn't we be using the _repr_html_fallback method you created here for this purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing it out. It was confusing and the logic was duplicated. I have refactored the display logic. A new private function _create_html_represenetation for the shared logic. _repr_html_fallback and repr_mimebundle now call this new method. The code duplication is removed.
c79425c to
86e0b6a
Compare
| def _repr_html_fallback(self) -> str: | ||
| """ | ||
| Returns an html string primarily for use by notebooks for displaying | ||
| a representation of the DataFrame. Displays 20 rows by default since | ||
| many notebooks are not configured for large tables. | ||
| """ | ||
| opts = bigframes.options.display | ||
| max_results = opts.max_rows | ||
| if opts.repr_mode == "deferred": | ||
| return formatter.repr_query_job(self._compute_dry_run()) | ||
|
|
||
| # Process blob columns first, regardless of display mode | ||
| # Process blob columns first for non-deferred modes | ||
| df, blob_cols = self._process_blob_columns() | ||
|
|
||
| pandas_df, row_count, query_job = df._block.retrieve_repr_request_results( | ||
| max_results | ||
| ) | ||
|
|
||
| self._set_internal_query_job(query_job) | ||
| column_count = len(pandas_df.columns) | ||
|
|
||
| return self._create_html_representation( | ||
| pandas_df, row_count, column_count, blob_cols | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not seeing references to this _repr_html_fallback function. I think it is dead code and can be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_repr_html_fallback is referenced in bigframes/streaming/dataframe.py. This chunk of code is needed.
| df = self | ||
| blob_cols = [] | ||
| if bigframes.options.display.blob_display: | ||
| blob_cols = [ | ||
| series_name | ||
| for series_name, series in df.items() | ||
| for series_name, series in self.items() | ||
| if series.dtype == bigframes.dtypes.OBJ_REF_DTYPE | ||
| ] | ||
| for col in blob_cols: | ||
| # TODO(garrettwu): Not necessary to get access urls for all the rows. Update when having a to get URLs from local data. | ||
| df[col] = df[col].blob._get_runtime(mode="R", with_metadata=True) | ||
| if blob_cols: | ||
| df = self.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we define self twice? copy() is a pretty cheap operation (just makes a new block referencing the same expression), so I think we can just define it once.
| df = self | |
| blob_cols = [] | |
| if bigframes.options.display.blob_display: | |
| blob_cols = [ | |
| series_name | |
| for series_name, series in df.items() | |
| for series_name, series in self.items() | |
| if series.dtype == bigframes.dtypes.OBJ_REF_DTYPE | |
| ] | |
| for col in blob_cols: | |
| # TODO(garrettwu): Not necessary to get access urls for all the rows. Update when having a to get URLs from local data. | |
| df[col] = df[col].blob._get_runtime(mode="R", with_metadata=True) | |
| if blob_cols: | |
| df = self.copy() | |
| df = self.copy() | |
| blob_cols = [] | |
| if bigframes.options.display.blob_display: | |
| blob_cols = [ | |
| series_name | |
| for series_name, series in df.items() | |
| if series.dtype == bigframes.dtypes.OBJ_REF_DTYPE | |
| ] | |
| if blob_cols: |
bigframes/dataframe.py
Outdated
|
|
||
| def _process_blob_columns(self) -> tuple[DataFrame, list[str]]: | ||
| """Process blob columns for display.""" | ||
| self._cached() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like the wrong function in which to call _cached(). _cached() can be very expensive (causes a query). Seems better to just call it once if needed in repr_mimebundle`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call to _cached() was removed from _process_blob_columns (now _get_display_df_and_blob_cols) to avoid unnecessary queries, as suggested.
bigframes/dataframe.py
Outdated
| """ | ||
| from bigframes import display | ||
|
|
||
| df, _ = self._process_blob_columns() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we doing when we are "processing"? Seems like we should have a more descriptive name here that describes the purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've renamed _process_blob_columns to _get_display_df_and_blob_cols to better reflect its purpose of preparing the DataFrame for display, including handling blob columns.
bigframes/dataframe.py
Outdated
| # Fallback: let IPython use _repr_html_fallback() instead | ||
| warnings.warn( | ||
| "Anywidget mode is not available. " | ||
| "Please `pip install anywidget traitlets` or `pip install 'bigframes[anywidget]'` to use interactive tables. " | ||
| f"Falling back to deferred mode. Error: {traceback.format_exc()}" | ||
| f"Falling back to static HTML. Error: {traceback.format_exc()}" | ||
| ) | ||
| return formatter.repr_query_job(self._compute_dry_run()) | ||
| # Don't return anything - let IPython fall back to _repr_html_fallback() | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: pass is only necessary if there is no body, but we do have a body with the warning above.
Also, we are manually falling back.
| # Fallback: let IPython use _repr_html_fallback() instead | |
| warnings.warn( | |
| "Anywidget mode is not available. " | |
| "Please `pip install anywidget traitlets` or `pip install 'bigframes[anywidget]'` to use interactive tables. " | |
| f"Falling back to deferred mode. Error: {traceback.format_exc()}" | |
| f"Falling back to static HTML. Error: {traceback.format_exc()}" | |
| ) | |
| return formatter.repr_query_job(self._compute_dry_run()) | |
| # Don't return anything - let IPython fall back to _repr_html_fallback() | |
| pass | |
| # Anywidget is an optional dependency, so warn rather than fail. | |
| # TODO(shuowei): When Anywidget becomes the default for all repr modes, | |
| # remove this warning. | |
| warnings.warn( | |
| "Anywidget mode is not available. " | |
| "Please `pip install anywidget traitlets` or `pip install 'bigframes[anywidget]'` to use interactive tables. " | |
| f"Falling back to static HTML. Error: {traceback.format_exc()}" | |
| ) | |
bigframes/dataframe.py
Outdated
| """ | ||
| from bigframes import display | ||
|
|
||
| df, _ = self._process_blob_columns() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| df, _ = self._process_blob_columns() | |
| # TODO(shuowei): Keep blob_cols and pass them to TableWidget so that they can render properly. | |
| df, _ = self._process_blob_columns() |
This PR refactors the DataFrame display system to properly implement IPython's repr_mimebundle() protocol for anywidget mode, moving widget handling logic out of repr_html() and into the appropriate display method.
See design doc here: screen/8XWbJDa2d9mb6r5
Fixes #<458796812> 🦕