
Long import time #6726

Closed
leroyvn opened this issue Jun 25, 2022 · 9 comments · Fixed by #7179
Labels
dependencies Pull requests that update a dependency file topic-internals

Comments

@leroyvn

leroyvn commented Jun 25, 2022

What is your issue?

Importing the xarray package takes a significant amount of time. For instance:

❯ time python -c "import xarray"
python -c "import xarray"  1.44s user 0.52s system 132% cpu 1.476 total

compared to other packages:

❯ time python -c "import pandas"
python -c "import pandas"  0.45s user 0.35s system 177% cpu 0.447 total

❯ time python -c "import scipy"
python -c "import scipy"  0.29s user 0.23s system 297% cpu 0.175 total

❯ time python -c "import numpy"
python -c "import numpy"  0.29s user 0.43s system 313% cpu 0.229 total

❯ time python -c "import datetime"
python -c "import datetime"  0.05s user 0.00s system 99% cpu 0.051 total

I am obviously not surprised that importing xarray takes longer than importing pandas, NumPy or the datetime module, but 1.5 s is clearly noticeable when the import happens in, e.g., a command-line application.

I looked into import performance and found the lazy module loader proposal by the Scientific Python community. AFAIK SciPy uses a similar system to populate its namespaces without an import-time penalty. Would it be possible for xarray to use delayed imports where relevant?
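For reference, the standard library already ships a building block for this kind of deferral: importlib.util.LazyLoader postpones a module's execution until the first attribute access. A minimal sketch of the documented recipe, using json as a stand-in for a heavy dependency (this is not what the lazy loader proposal or xarray does, just an illustration of the mechanism):

```python
import importlib.util
import sys
import types


def lazy_import(name: str) -> types.ModuleType:
    """Return a module object whose code runs only on first attribute access."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        raise ModuleNotFoundError(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # registers the module; execution is deferred
    return module


json = lazy_import("json")       # nothing has been executed yet
result = json.dumps({"a": 1})    # the module is actually loaded here
```

The trade-off is that import errors surface at first use rather than at program start, which is exactly the question raised later in this thread for the backends.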

@leroyvn leroyvn added the needs triage Issue that has not been reviewed by xarray team member label Jun 25, 2022
@mathause
Collaborator

Thanks for the report. I think one reason is that we import all the IO libraries non-lazily (I think since the backend refactor). And many of the dependencies still use pkg_resources instead of importlib.metadata (pkg_resources is considerably slower).

We'd need to take a look at the lazy loader.

@mathause mathause added topic-internals dependencies Pull requests that update a dependency file and removed needs triage Issue that has not been reviewed by xarray team member labels Jun 25, 2022
@headtr1ck
Collaborator

Useful for debugging:
python -X importtime -c "import xarray"

@mathause
Collaborator

mathause commented Jul 30, 2022

I just had another look at this using

python -X importtime -c "import xarray" 2> import.log

and tuna for the visualization.

  • pseudoNETCDF adds quite some overhead, but I think only a few people have it installed (it could be made faster, but I'm not sure it's worth it)
  • llvmlite (required by numba) seems to be the last dependency relying on pkg_resources, but this is fixed in the new version, which should be out soonish
  • dask recently merged a PR that avoids a slow import (Only import IPython if type checking, dask/dask#9230), which we should benefit from

Together these should bring the import time down by another ~0.25 s, but I agree it would be nice to get it even lower.

@eendebakpt
Contributor

Some other projects are considering lazy imports as well: https://scientific-python.org/specs/spec-0001/

@headtr1ck
Collaborator

I think we could rework our backend machinery to do the imports lazily:
to check whether a file might be openable via some backend, we usually do not need to import the backend's dependency module.

@headtr1ck
Collaborator

headtr1ck commented Sep 28, 2022

I just checked: many backends import their external dependencies at module level inside a try-except block.
This could be replaced by importlib.util.find_spec.

However, many backends also catch ImportError (not just ModuleNotFoundError), which occurs when a library is installed but broken. I am not sure whether the backend should simply be disabled in that case, as it is now (at least cfgrib raises a warning instead)?
Would it be a problem if this error only appeared when actually trying to open a file? If not, we could move to lazy external-library loading for the backends.

Not sure how much it actually saves, but it should be ~0.2 s (at least on my machine; it depends on the number of installed backends: the fewer are installed, the faster the import should be).
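As a sketch of what the find_spec approach could look like (the module names below are only illustrative; xarray's real guards live in its backend entry points):

```python
from importlib.util import find_spec


def module_available(name: str) -> bool:
    """Report whether *name* can be located, without executing any of its code."""
    return find_spec(name) is not None


# Module-level try/except imports become cheap probes. A broken install
# would now only raise when the backend actually imports the library
# to open a file, which is the behavior change discussed above.
HAS_NETCDF4 = module_available("netCDF4")  # illustrative dependency name
```

Unlike a real import, find_spec only searches the import machinery's finders, so it sidesteps both the import-time cost and any ImportError raised by the library's own module code.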

@dcherian
Contributor

dcherian commented Oct 3, 2022

This could be replaced by importlib.util.find_spec.

Nice. Does it work on python 3.8?

However, many backends also check for ImportErrors (not ModuleNotFoundError) that occur when a library is not correctly installed. I am not sure if in this case the backend should simply be disabled like it is now (At least cfgrib is raising a warning instead)?

Would it be a problem if this error is only appearing when actually trying to open a file

Sounds OK to error when trying to use the backend.

@headtr1ck
Collaborator

Nice. Does it work on python 3.8?

According to the docs, it has existed since Python 3.4.

@hmaarrfk
Contributor

In developing #7172, there are also some places where class types are used to check for features:
https://github.com/pydata/xarray/blob/main/xarray/core/pycompat.py#L35

Dask and sparse are big contributors here because of the need to resolve the class names in question.

Ultimately, I think it is important to constrain the problem:

Are we ok with 100 ms over numpy + pandas? 20 ms?

On my machines, the ~0.5 s that xarray is close to seems long... but every time I look at it, it seems to "just be a python problem".
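One way to avoid resolving those classes at import time is to test the type's module name instead of the type itself. A hedged sketch of that idea (dask is used only as an example here; this is not xarray's actual pycompat implementation):

```python
def looks_like_dask_array(obj) -> bool:
    """Heuristic type check that never imports dask.

    Inspecting type(obj).__module__ avoids importing dask just to get a
    class to compare against: real dask arrays report a module name such
    as "dask.array.core". If dask was never imported, obj cannot be a
    dask array, so the string check alone is sufficient.
    """
    module = type(obj).__module__
    return module == "dask" or module.startswith("dask.")
```

The cost is that this is a name-based heuristic rather than a real isinstance check, so a genuine class check would still be needed on the slow path once the library is known to be loaded.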
