
Improve performance for backend datetime handling #7374

Merged: 34 commits, Jan 13, 2023

Conversation

@Illviljan (Contributor) commented Dec 11, 2022

I was hunting some low-hanging performance fruit when reading in files.

  • Use Variable(..., fastpath=True) for cases where a Variable has been unpacked and only slightly modified.
  • Don't check whether the variable is already a Variable in decode_cf_variable.
  • Don't import DataArray until necessary in as_compatible_data.
  • Add typing to the touched files to make sure only Variables are used.
  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
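The fastpath idea in the first bullet can be sketched in plain Python. This is a hypothetical illustration of the pattern, not xarray's actual Variable implementation; the class body and the shift_data helper are invented for the example:

```python
class Variable:
    # Sketch of a constructor with a fastpath flag: when the caller
    # guarantees the inputs are already normalized, skip validation.
    def __init__(self, dims, data, fastpath=False):
        if fastpath:
            # Caller promises dims is a tuple and data is already valid.
            self._dims = dims
            self._data = data
        else:
            # Slow path: validate and normalize arbitrary user input
            # (stand-in for xarray's real as_compatible_data machinery).
            self._dims = (dims,) if isinstance(dims, str) else tuple(dims)
            self._data = list(data)


def shift_data(var):
    # Backend-style helper: var is already a Variable whose pieces were
    # only slightly modified, so rebuild it via the fast path and avoid
    # re-running the validation in __init__.
    return Variable(var._dims, [x + 1 for x in var._data], fastpath=True)


v = Variable("x", [1, 2, 3])
w = shift_data(v)
print(w._dims, w._data)  # ('x',) [2, 3, 4]
```

The speedup comes from skipping redundant validation on hot paths where the invariants are already known to hold; the trade-off is that a buggy fastpath caller can construct an invalid object.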

@Illviljan Illviljan changed the title Add typing to conventions.py Improve typing in datetime handling Dec 11, 2022
Comment on lines 607 to 614:

    if (
        decode_coords
        and "coordinates" in attributes
        and isinstance(attributes["coordinates"], str)
    ):
        attributes = dict(attributes)
        crds = attributes.pop("coordinates")
        coord_names.update(crds.split())
@Illviljan (Contributor, Author) commented Dec 11, 2022:

This previously would have crashed when trying to call attrs["coordinates"].split() on a non-string value.
Might this be a functionality change? Should it raise an error instead if the value is not a string?

A contributor replied:

Yes it would be good to raise a nice error. "coordinates" is expected to be a string with variable names separated by spaces: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#attribute-appendix
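A validation along those lines could look like the sketch below. This is a hypothetical illustration of the suggested error, not the code merged in this PR; pop_coordinate_names is an invented helper name:

```python
def pop_coordinate_names(attributes: dict) -> set:
    # Hypothetical helper: per the CF conventions, the "coordinates"
    # attribute must be a single string of space-separated variable names.
    coords = attributes.pop("coordinates")
    if not isinstance(coords, str):
        raise ValueError(
            "'coordinates' attribute must be a space-separated string "
            f"of variable names, got {type(coords).__name__}: {coords!r}"
        )
    return set(coords.split())


names = pop_coordinate_names({"coordinates": "lat lon time"})
```

Raising eagerly here gives the user a message pointing at the malformed attribute, instead of an AttributeError from deep inside decoding when .split() is called on a non-string.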

Review thread on xarray/coding/times.py (outdated, resolved).
@Illviljan Illviljan marked this pull request as ready for review December 11, 2022 21:49
@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Dec 11, 2022
@github-actions github-actions bot removed the run-benchmark Run the ASV benchmark workflow label Dec 13, 2022
@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Dec 15, 2022
@Illviljan Illviljan changed the title Improve typing in datetime handling Improve performance for backend datetime handling Dec 15, 2022
@headtr1ck (Collaborator) left a comment:

+100 for all the typing :)

Do we need some benchmark for this?

Review thread on xarray/coding/times.py (outdated, resolved).
@Illviljan (Contributor, Author):
The benchmarks are improved if I understand the logs correctly. Unfortunately, the change is not significant enough for ASV to report it: the ratio has to be >1.5, and the improvements on .time_open_dataset are around 1.3-1.4.

# Run 1, PR:
[ 50.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 50.85%] ··· ======== =========
               chunks           
              -------- ---------
                None    130±1ms 
                 {}     689±6ms 
              ======== =========

[ 54.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 54.69%] ··· ========= ============= =============
              --                   chunks          
              --------- ---------------------------
                engine       None           {}     
              ========= ============= =============
                scipy    5.48±0.04ms   6.91±0.01ms 
               netcdf4   2.93±0.04ms   4.32±0.02ms 
              ========= ============= =============

# Run 1, Baseline:
[ 75.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 75.85%] ··· ======== ===========
               chunks             
              -------- -----------
                None    177±0.5ms 
                 {}      737±3ms  
              ======== ===========

[ 79.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 79.69%] ··· ========= ============ ============
              --                  chunks         
              --------- -------------------------
                engine      None          {}     
              ========= ============ ============
                scipy    4.47±0.6ms   5.74±0.7ms 
               netcdf4   4.39±0.7ms   5.82±0.6ms 
              ========= ============ ============
# Run 2, PR:
[ 50.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 50.85%] ··· ======== ==========
               chunks            
              -------- ----------
                None    149±4ms  
                 {}     797±20ms 
              ======== ==========

[ 54.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 54.69%] ··· ========= ============ =============
              --                  chunks          
              --------- --------------------------
                engine      None           {}     
              ========= ============ =============
                scipy    6.57±0.2ms   7.77±0.01ms 
               netcdf4   3.71±0.1ms    6.17±0.5ms 
              ========= ============ =============

# Run 2, Baseline:
[ 75.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 75.85%] ··· ======== ==========
               chunks            
              -------- ----------
                None    204±2ms  
                 {}     857±20ms 
              ======== ==========

[ 79.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 79.69%] ··· ========= ============ ============
              --                  chunks         
              --------- -------------------------
                engine      None          {}     
              ========= ============ ============
                scipy     5.53±1ms    7.12±0.8ms 
               netcdf4   4.96±0.6ms   6.74±0.8ms 
              ========= ============ ============

# Run 3, PR:
[ 50.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 50.85%] ··· ======== ============
               chunks              
              -------- ------------
                None     204±8ms   
                 {}     1.20±0.04s 
              ======== ============

[ 54.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 54.69%] ··· ========= ============ ============
              --                  chunks         
              --------- -------------------------
                engine      None          {}     
              ========= ============ ============
               netcdf4   6.86±0.7ms   9.81±0.6ms 
                scipy     6.74±1ms     9.10±1ms  
              ========= ============ ============

# Run 3, Baseline:
[ 75.85%] ··· dataset_io.IOReadCustomEngine.time_open_dataset                 ok
[ 75.85%] ··· ======== ============
               chunks              
              -------- ------------
                None     282±5ms   
                 {}     1.20±0.04s 
              ======== ============


[ 79.69%] ··· dataset_io.IOReadSingleFile.time_read_dataset                   ok
[ 79.69%] ··· ========= ============ ============
              --                  chunks         
              --------- -------------------------
                engine      None          {}     
              ========= ============ ============
               netcdf4    6.91±1ms    9.77±0.7ms 
                scipy    6.91±0.7ms    9.11±1ms  
              ========= ============ ============

@Illviljan Illviljan added the plan to merge Final call for comments label Jan 11, 2023
Review thread on xarray/conventions.py (outdated, resolved).
@dcherian (Contributor):

Looks like an improvement on my machine:

       before           after         ratio
     [4f3128bb]       [898b8728]
     <main>           <mypy_conventions>
-         244±7ms          184±3ms     0.76  dataset_io.IOReadCustomEngine.time_open_dataset(None)

IOReadSingleFile hasn't changed significantly

@dcherian (Contributor) left a comment:
Thanks @Illviljan

@Illviljan Illviljan enabled auto-merge (squash) January 13, 2023 14:50
@Illviljan Illviljan disabled auto-merge January 13, 2023 14:50
@Illviljan Illviljan merged commit 6c5840e into pydata:main Jan 13, 2023
@headtr1ck (Collaborator):
Great, thanks!

dcherian added a commit to dcherian/xarray that referenced this pull request Jan 18, 2023
* main: (41 commits)
  v2023.01.0 whats-new (pydata#7440)
  explain keep_attrs in docstring of apply_ufunc (pydata#7445)
  Add sentence to open_dataset docstring (pydata#7438)
  pin scipy version in doc environment (pydata#7436)
  Improve performance for backend datetime handling (pydata#7374)
  fix typo (pydata#7433)
  Add lazy backend ASV test (pydata#7426)
  Pull Request Labeler - Workaround sync-labels bug (pydata#7431)
  see also : groupby in resample doc and vice-versa (pydata#7425)
  Some alignment optimizations (pydata#7382)
  Make `broadcast` and `concat` work with the Array API (pydata#7387)
  remove `numbagg` and `numba` from the upstream-dev CI (pydata#7416)
  [pre-commit.ci] pre-commit autoupdate (pydata#7402)
  Preserve original dtype when accessing MultiIndex levels (pydata#7393)
  [pre-commit.ci] pre-commit autoupdate (pydata#7389)
  [pre-commit.ci] pre-commit autoupdate (pydata#7360)
  COMPAT: Adjust CFTimeIndex.get_loc for pandas 2.0 deprecation enforcement (pydata#7361)
  Avoid loading entire dataset by getting the nbytes in an array (pydata#7356)
  `keep_attrs` for pad (pydata#7267)
  Bump pypa/gh-action-pypi-publish from 1.5.1 to 1.6.4 (pydata#7375)
  ...
@Illviljan Illviljan deleted the mypy_conventions branch January 18, 2023 22:45