
Improve performance for backend datetime handling #7374

Merged: 34 commits, Jan 13, 2023

Conversation

@Illviljan (Contributor) commented Dec 11, 2022

I was hunting some low-hanging performance fruit when reading in files.

  • Use Variable(..., fastpath=True) for cases where a Variable has been unpacked and only slightly modified.
  • Don't check whether the variable is already a Variable in decode_cf_variable.
  • Don't import DataArray until necessary in as_compatible_data.
  • Add typing to the touched files to make sure only Variables are used.
  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
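The fastpath idea in the first bullet can be sketched in plain Python. This is a hypothetical illustration of the pattern, not xarray's actual Variable implementation; the class body and the shift_data helper are invented for the example:

```python
class Variable:
    # Sketch of a constructor with a fastpath flag: when the caller
    # guarantees the inputs are already normalized, skip validation.
    def __init__(self, dims, data, fastpath=False):
        if fastpath:
            # Caller promises dims is a tuple and data is already valid.
            self._dims = dims
            self._data = data
        else:
            # Slow path: validate and normalize arbitrary user input
            # (stand-in for xarray's real as_compatible_data machinery).
            self._dims = (dims,) if isinstance(dims, str) else tuple(dims)
            self._data = list(data)


def shift_data(var):
    # Backend-style helper: var is already a Variable whose pieces were
    # only slightly modified, so rebuild it via the fast path and avoid
    # re-running the validation in __init__.
    return Variable(var._dims, [x + 1 for x in var._data], fastpath=True)


v = Variable("x", [1, 2, 3])
w = shift_data(v)
print(w._dims, w._data)  # ('x',) [2, 3, 4]
```

The speedup comes from skipping redundant validation on hot paths where the invariants are already known to hold; the trade-off is that a buggy fastpath caller can construct an invalid object.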

@Illviljan Illviljan changed the title Add typing to conventions.py Improve typing in datetime handling Dec 11, 2022
Comment on lines 607 to 614:

    if (
        decode_coords
        and "coordinates" in attributes
        and isinstance(attributes["coordinates"], str)
    ):
        attributes = dict(attributes)
        crds = attributes.pop("coordinates")
        coord_names.update(crds.split())
@Illviljan (Contributor, Author) commented Dec 11, 2022:

This previously would have crashed when trying to call attrs["coordinates"].split() on a non-string value.
Might this be a functionality change? Should it raise an error instead if the value is not a string?

A contributor replied:

Yes it would be good to raise a nice error. "coordinates" is expected to be a string with variable names separated by spaces: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#attribute-appendix
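A validation along those lines could look like the sketch below. This is a hypothetical illustration of the suggested error, not the code merged in this PR; pop_coordinate_names is an invented helper name:

```python
def pop_coordinate_names(attributes: dict) -> set:
    # Hypothetical helper: per the CF conventions, the "coordinates"
    # attribute must be a single string of space-separated variable names.
    coords = attributes.pop("coordinates")
    if not isinstance(coords, str):
        raise ValueError(
            "'coordinates' attribute must be a space-separated string "
            f"of variable names, got {type(coords).__name__}: {coords!r}"
        )
    return set(coords.split())


names = pop_coordinate_names({"coordinates": "lat lon time"})
```

Raising eagerly here gives the user a message pointing at the malformed attribute, instead of an AttributeError from deep inside decoding when .split() is called on a non-string.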

Review thread on xarray/coding/times.py (outdated, resolved).
@Illviljan Illviljan marked this pull request as ready for review December 11, 2022 21:49
@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Dec 11, 2022
@github-actions github-actions bot removed the run-benchmark Run the ASV benchmark workflow label Dec 13, 2022
@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Dec 15, 2022
@Illviljan Illviljan changed the title Improve typing in datetime handling Improve performance for backend datetime handling Dec 15, 2022
@headtr1ck (Collaborator) left a comment:

+100 for all the typing :)

Do we need some benchmark for this?

Review thread on xarray/coding/times.py (outdated, resolved).
@Illviljan (Contributor, Author):
The benchmarks are improved if I understand the logs correctly. Unfortunately, the change is not significant enough for ASV to report it: the ratio has to be >1.5, and the improvements on .time_open_dataset are around 1.3-1.4.

# Run 1, PR:
[ 50.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 50.85%] ··· ======== =========
               chunks           
              -------- ---------
                None    130±1ms 
                 {}     689±6ms 
              ======== =========

[ 54.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 54.69%] ··· ========= ============= =============
              --                   chunks          
              --------- ---------------------------
                engine       None           {}     
              ========= ============= =============
                scipy    5.48±0.04ms   6.91±0.01ms 
               netcdf4   2.93±0.04ms   4.32±0.02ms 
              ========= ============= =============

# Run 1, Baseline:
[ 75.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 75.85%] ··· ======== ===========
               chunks             
              -------- -----------
                None    177±0.5ms 
                 {}      737±3ms  
              ======== ===========

[ 79.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 79.69%] ··· ========= ============ ============
              --                  chunks         
              --------- -------------------------
                engine      None          {}     
              ========= ============ ============
                scipy    4.47±0.6ms   5.74±0.7ms 
               netcdf4   4.39±0.7ms   5.82±0.6ms 
              ========= ============ ============
# Run 2, PR:
[ 50.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 50.85%] ··· ======== ==========
               chunks            
              -------- ----------
                None    149±4ms  
                 {}     797±20ms 
              ======== ==========

[ 54.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 54.69%] ··· ========= ============ =============
              --                  chunks          
              --------- --------------------------
                engine      None           {}     
              ========= ============ =============
                scipy    6.57±0.2ms   7.77±0.01ms 
               netcdf4   3.71±0.1ms    6.17±0.5ms 
              ========= ============ =============

# Run 2, Baseline:
[ 75.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 75.85%] ··· ======== ==========
               chunks            
              -------- ----------
                None    204±2ms  
                 {}     857±20ms 
              ======== ==========

[ 79.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 79.69%] ··· ========= ============ ============
              --                  chunks         
              --------- -------------------------
                engine      None          {}     
              ========= ============ ============
                scipy     5.53±1ms    7.12±0.8ms 
               netcdf4   4.96±0.6ms   6.74±0.8ms 
              ========= ============ ============

# Run 3, PR:
[ 50.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 50.85%] ··· ======== ============
               chunks              
              -------- ------------
                None     204±8ms   
                 {}     1.20±0.04s 
              ======== ============

[ 54.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 54.69%] ··· ========= ============ ============
              --                  chunks         
              --------- -------------------------
                engine      None          {}     
              ========= ============ ============
               netcdf4   6.86±0.7ms   9.81±0.6ms 
                scipy     6.74±1ms     9.10±1ms  
              ========= ============ ============

# Run 3, Baseline:
[ 75.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 75.85%] ··· ======== ============
               chunks              
              -------- ------------
                None     282±5ms   
                 {}     1.20±0.04s 
              ======== ============


[ 79.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 79.69%] ··· ========= ============ ============
              --                  chunks         
              --------- -------------------------
                engine      None          {}     
              ========= ============ ============
               netcdf4    6.91±1ms    9.77±0.7ms 
                scipy    6.91±0.7ms    9.11±1ms  
              ========= ============ ============

@Illviljan Illviljan added the plan to merge Final call for comments label Jan 11, 2023
Review thread on xarray/conventions.py (outdated, resolved).
@dcherian (Contributor):

Looks like an improvement on my machine:

       before           after         ratio
     [4f3128bb]       [898b8728]
     <main>           <mypy_conventions>
-         244±7ms          184±3ms     0.76  dataset_io.IOReadCustomEngine.time_open_dataset(None)

IOReadSingleFile hasn't changed significantly

@dcherian (Contributor) left a comment:
Thanks @Illviljan

@Illviljan Illviljan enabled auto-merge (squash) January 13, 2023 14:50
@Illviljan Illviljan disabled auto-merge January 13, 2023 14:50
@Illviljan Illviljan merged commit 6c5840e into pydata:main Jan 13, 2023
@headtr1ck (Collaborator):
Great, thanks!

dcherian added a commit to dcherian/xarray that referenced this pull request Jan 18, 2023
* main: (41 commits)
  v2023.01.0 whats-new (pydata#7440)
  explain keep_attrs in docstring of apply_ufunc (pydata#7445)
  Add sentence to open_dataset docstring (pydata#7438)
  pin scipy version in doc environment (pydata#7436)
  Improve performance for backend datetime handling (pydata#7374)
  fix typo (pydata#7433)
  Add lazy backend ASV test (pydata#7426)
  Pull Request Labeler - Workaround sync-labels bug (pydata#7431)
  see also : groupby in resample doc and vice-versa (pydata#7425)
  Some alignment optimizations (pydata#7382)
  Make `broadcast` and `concat` work with the Array API (pydata#7387)
  remove `numbagg` and `numba` from the upstream-dev CI (pydata#7416)
  [pre-commit.ci] pre-commit autoupdate (pydata#7402)
  Preserve original dtype when accessing MultiIndex levels (pydata#7393)
  [pre-commit.ci] pre-commit autoupdate (pydata#7389)
  [pre-commit.ci] pre-commit autoupdate (pydata#7360)
  COMPAT: Adjust CFTimeIndex.get_loc for pandas 2.0 deprecation enforcement (pydata#7361)
  Avoid loading entire dataset by getting the nbytes in an array (pydata#7356)
  `keep_attrs` for pad (pydata#7267)
  Bump pypa/gh-action-pypi-publish from 1.5.1 to 1.6.4 (pydata#7375)
  ...
@Illviljan Illviljan deleted the mypy_conventions branch January 18, 2023 22:45