Very poor html repr performance on large multi-indexes #5529

Closed
max-sixty opened this issue Jun 25, 2021 · 5 comments · Fixed by #6400

@max-sixty
Collaborator

What happened:

We have catastrophic performance on the html repr of some long multi-indexed data arrays. Here's a case of it taking 12 s.

Minimal Complete Verifiable Example:

import xarray as xr

ds = xr.tutorial.load_dataset("air_temperature")
da = ds["air"].stack(z=[...])

da.shape
# (3869000,)

%timeit -n 1 -r 1 da._repr_html_()
# 12.4 s !!
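(Added for context, not part of the original report: if the tutorial dataset can't be downloaded, a synthetic array of the same size should show the same behaviour; the dimension sizes below are chosen to match air_temperature, 2920 * 25 * 53 = 3,869,000 points.)

import numpy as np
import xarray as xr

# Synthetic stand-in for the tutorial dataset, stacked into one MultiIndex.
da = xr.DataArray(
    np.zeros((2920, 25, 53)),
    dims=("time", "lat", "lon"),
    coords={"time": np.arange(2920), "lat": np.arange(25), "lon": np.arange(53)},
    name="air",
).stack(z=[...])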

Anything else we need to know?:

I thought we'd fixed some issues here: https://github.com/pydata/xarray/pull/4846/files

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.10 (default, May 9 2021, 13:21:55)
[Clang 12.0.5 (clang-1205.0.22.9)]
python-bits: 64
OS: Darwin
OS-release: 20.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.3
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.3
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.06.1
distributed: 2021.06.1
matplotlib: 3.4.2
cartopy: None
seaborn: 0.11.1
numbagg: 0.2.1
pint: None
setuptools: 56.0.0
pip: 21.1.2
conda: None
pytest: 6.2.4
IPython: 7.24.0
sphinx: 4.0.1

@Illviljan
Contributor

I think it's some lazy calculation that kicks in, since I can reproduce it using np.asarray.

import numpy as np
import xarray as xr

ds = xr.tutorial.load_dataset("air_temperature")
da = ds["air"].stack(z=[...])

coord = da.z.variable.to_index_variable()

# This is very slow:
a = np.asarray(coord)

da._repr_html_()

[screenshot: Spyder profiler output for the snippet above]
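(A hypothetical follow-up, not from the original comment: the same hotspot can be inspected without Spyder by profiling the slow conversion directly.)

import cProfile
import numpy as np

# `coord` is the IndexVariable created in the snippet above; profiling the
# conversion shows where the time is spent.
cProfile.run("np.asarray(coord)", sort="cumtime")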

@max-sixty
Collaborator Author

Yes, I think it's materializing the MultiIndex as an array of tuples, which we definitely shouldn't be doing for reprs.
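(An illustration of what "materializing as an array of tuples" means, added here for context rather than taken from the thread:)

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([range(3), ["a", "b"]])

# Converting a MultiIndex to a numpy array yields an object array of Python
# tuples, one per row; for millions of rows this is very expensive.
np.asarray(idx)
# array([(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')],
#       dtype=object)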

@Illviljan nice profiling view! What is that?

@Illviljan
Contributor

One way of solving it could be to slice the arrays down to a smaller size while still showing the same repr, since coords[0:12] seems easy to print. I'm not sure how tricky it is to slice it in this way, though.
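(A rough sketch of that idea; the function and threshold below are hypothetical, not xarray's actual repr code:)

import numpy as np

def preview_values(coord, max_items=12):
    # Materialize only a small head/tail slice of the coordinate for display;
    # slicing an index is cheap, converting the whole thing is not.
    if coord.size <= max_items:
        return np.asarray(coord)
    half = max_items // 2
    return np.concatenate([np.asarray(coord[:half]), np.asarray(coord[-half:])])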

I'm using https://github.com/spyder-ide/spyder for the profiling and general hacking.

@max-sixty
Copy link
Collaborator Author

Yes, very much so @Illviljan. But weirdly, the linked PR is attempting to do exactly that, so maybe this code path doesn't hit that change?

Spyder's profiler looks good!

@benbovy
Member

benbovy commented Mar 22, 2022

But weirdly, the linked PR is attempting to do exactly that, so maybe this code path doesn't hit that change?

I think the linked PR only fixed the summary (inline) repr. The bottleneck here is in formatting the detailed array view for the multi-index coordinates, which triggers converting the whole pandas MultiIndex (tuple elements) and each of its levels to numpy arrays.
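(For context, not from the original comment, roughly the two conversions described above:)

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([range(2000), range(2000)])  # 4,000,000 rows

# Whole MultiIndex -> object array of tuples (the expensive part).
tuples = np.asarray(idx)

# Each level -> its own numpy array.
levels = [np.asarray(idx.get_level_values(i)) for i in range(idx.nlevels)]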
