Very poor html repr performance on large multi-indexes #5529

Closed
max-sixty opened this issue Jun 25, 2021 · 5 comments · Fixed by #6400

@max-sixty
Collaborator

What happened:

We have catastrophic performance on the html repr of some long multi-indexed data arrays. Here's a case of it taking 12 s.

Minimal Complete Verifiable Example:

import xarray as xr

ds = xr.tutorial.load_dataset("air_temperature")
da = ds["air"].stack(z=[...])

da.shape
# (3869000,)

%timeit -n 1 -r 1 da._repr_html_()
# 12.4 s !!
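(Added for context, not part of the original report: if the tutorial dataset can't be downloaded, a synthetic array of the same size should show the same behaviour; the dimension sizes below are chosen to match air_temperature, 2920 * 25 * 53 = 3,869,000 points.)

import numpy as np
import xarray as xr

# Synthetic stand-in for the tutorial dataset, stacked into one MultiIndex.
da = xr.DataArray(
    np.zeros((2920, 25, 53)),
    dims=("time", "lat", "lon"),
    coords={"time": np.arange(2920), "lat": np.arange(25), "lon": np.arange(53)},
    name="air",
).stack(z=[...])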

Anything else we need to know?:

I thought we'd fixed some issues here: https://github.com/pydata/xarray/pull/4846/files

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.10 (default, May 9 2021, 13:21:55)
[Clang 12.0.5 (clang-1205.0.22.9)]
python-bits: 64
OS: Darwin
OS-release: 20.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.3
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.3
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.06.1
distributed: 2021.06.1
matplotlib: 3.4.2
cartopy: None
seaborn: 0.11.1
numbagg: 0.2.1
pint: None
setuptools: 56.0.0
pip: 21.1.2
conda: None
pytest: 6.2.4
IPython: 7.24.0
sphinx: 4.0.1

@Illviljan
Contributor

I think it's some lazy calculation that kicks in, since I can reproduce it using np.asarray.

import numpy as np
import xarray as xr

ds = xr.tutorial.load_dataset("air_temperature")
da = ds["air"].stack(z=[...])

coord = da.z.variable.to_index_variable()

# This is very slow:
a = np.asarray(coord)

da._repr_html_()

[screenshot: Spyder profiler output for the snippet above]
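(A hypothetical follow-up, not from the original comment: the same hotspot can be inspected without Spyder by profiling the slow conversion directly.)

import cProfile
import numpy as np

# `coord` is the IndexVariable created in the snippet above; profiling the
# conversion shows where the time is spent.
cProfile.run("np.asarray(coord)", sort="cumtime")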

@max-sixty
Collaborator Author

Yes, I think it's materializing the MultiIndex as an array of tuples, which we definitely shouldn't be doing for reprs.
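(An illustration of what "materializing as an array of tuples" means, added here for context rather than taken from the thread:)

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([range(3), ["a", "b"]])

# Converting a MultiIndex to a numpy array yields an object array of Python
# tuples, one per row; for millions of rows this is very expensive.
np.asarray(idx)
# array([(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')],
#       dtype=object)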

@Illviljan nice profiling view! What is that?

@Illviljan
Contributor

One way of solving it could be to slice the arrays down to a smaller size while still showing the same repr, since coords[0:12] seems easy to print. I'm not sure how tricky it is to slice it in this way, though.
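(A rough sketch of that idea; the function and threshold below are hypothetical, not xarray's actual repr code:)

import numpy as np

def preview_values(coord, max_items=12):
    # Materialize only a small head/tail slice of the coordinate for display;
    # slicing an index is cheap, converting the whole thing is not.
    if coord.size <= max_items:
        return np.asarray(coord)
    half = max_items // 2
    return np.concatenate([np.asarray(coord[:half]), np.asarray(coord[-half:])])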

I'm using https://github.com/spyder-ide/spyder for the profiling and general hacking.

@max-sixty
Copy link
Collaborator Author

Yes, very much so @Illviljan. But weirdly, the linked PR is attempting to do exactly that, so maybe this code path doesn't hit that change?

Spyder's profiler looks good!

@benbovy
Member

benbovy commented Mar 22, 2022

But weirdly, the linked PR is attempting to do exactly that, so maybe this code path doesn't hit that change?

I think the linked PR only fixed the summary (inline) repr. The bottleneck here is in formatting the detailed array view for the multi-index coordinates, which triggers converting the whole pandas MultiIndex (tuple elements) and each of its levels to numpy arrays.
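(For context, not from the original comment, roughly the two conversions described above:)

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([range(2000), range(2000)])  # 4,000,000 rows

# Whole MultiIndex -> object array of tuples (the expensive part).
tuples = np.asarray(idx)

# Each level -> its own numpy array.
levels = [np.asarray(idx.get_level_values(i)) for i in range(idx.nlevels)]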
