@bouweandela
Member

@bouweandela bouweandela commented Jul 3, 2025

Description

Add an interface for adding new data sources. Documentation of the new interface is available here: esmvalcore.io.

The existing esmvalcore.local and esmvalcore.esgf modules have been modified to use the new interface. As an example use case, support for using intake-esgf to find input data has been added.

The plan is to use this interface to add support for xcube and intake-esm next.

Several commands have been added:

  • esmvaltool config show: print the current configuration
  • esmvaltool config list: list available example configuration files
  • esmvaltool config copy: copy an example configuration file to your configuration directory, i.e. ~/.config/esmvaltool or the path defined by the ESMVALTOOL_CONFIG_DIR environment variable.
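
As a rough illustration of the lookup rule described in the last item (environment variable first, then the default directory), here is a minimal sketch. The helper name `resolve_config_dir` is hypothetical and is not part of ESMValCore; only the `ESMVALTOOL_CONFIG_DIR` variable and the `~/.config/esmvaltool` default come from the description above.

```python
import os
from pathlib import Path


def resolve_config_dir(environ=None):
    """Return the configuration directory (hypothetical helper).

    The ESMVALTOOL_CONFIG_DIR environment variable takes precedence;
    otherwise fall back to ~/.config/esmvaltool.
    """
    env = os.environ if environ is None else environ
    override = env.get("ESMVALTOOL_CONFIG_DIR")
    if override:
        # Expand a leading "~" so that values like "~/my-config" work.
        return Path(override).expanduser()
    return Path.home() / ".config" / "esmvaltool"


if __name__ == "__main__":
    print(resolve_config_dir())
```

Passing a plain dictionary as `environ` makes the rule easy to check without touching the real environment.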

To try the new intake-esgf data source, configure esmvaltool to use it by running the command esmvaltool config copy data-intake-esgf.yml.

This pull request also adds the option to skip tests that require an internet connection by marking those as online. E.g. pytest -m 'not online' will skip those.

Related to #2584

Deprecations

The following configuration settings are no longer needed because the same behaviour can be configured using the new data sources format. They are deprecated and will be removed in v2.16:

  • rootpath
  • drs
  • download_dir
  • search_esgf: the 'never' option is now expressed through the data sources you configure. The 'when_missing' and 'always' values are equivalent to configuring both local and ESGF data sources and setting the new search_data option to 'quick' and 'complete', respectively.

and in config-developer.yml

  • input_dir
  • input_file
  • ignore_warnings
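
The search_esgf mapping described above can be summarised in a small sketch. The function name and the returned dictionary layout are hypothetical and only illustrate the correspondence; reading 'never' as "configure local data sources only and leave search_data untouched" is an assumption, not something stated in this pull request.

```python
# Legacy search_esgf values that map directly onto the new search_data option.
LEGACY_TO_SEARCH_DATA = {
    "when_missing": "quick",
    "always": "complete",
}


def translate_search_esgf(value):
    """Translate a legacy search_esgf value (hypothetical helper).

    Returns a dict naming the kinds of data source to configure and the
    search_data value to use (None means: leave search_data unset).
    """
    if value == "never":
        # Assumption: 'never' corresponds to configuring local sources only.
        return {"data_sources": ("local",), "search_data": None}
    if value in LEGACY_TO_SEARCH_DATA:
        return {
            "data_sources": ("local", "esgf"),
            "search_data": LEGACY_TO_SEARCH_DATA[value],
        }
    raise ValueError(f"Unknown search_esgf value: {value!r}")


if __name__ == "__main__":
    for legacy in ("never", "when_missing", "always"):
        print(legacy, "->", translate_search_esgf(legacy))
```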

The following Python functions are deprecated and will be removed in v2.16:

  • esmvalcore.local.DataSource: use esmvalcore.local.LocalDataSource instead
  • esmvalcore.local.find_files: use esmvalcore.local.LocalDataSource.find_data instead
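
For readers unfamiliar with how a renamed class is usually kept available during a deprecation period, here is a generic sketch of the pattern. The two classes below are stand-ins defined for this example only; they are not the ESMValCore implementations, and the constructor argument and warning text are assumptions.

```python
import warnings


class LocalDataSource:
    """Stand-in for the new class (not the real ESMValCore class)."""

    def __init__(self, name):
        self.name = name

    def find_data(self, **facets):
        # A real data source would search for matching files here.
        return []


class DataSource(LocalDataSource):
    """Deprecated alias that warns when it is instantiated."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "DataSource is deprecated and will be removed in v2.16; "
            "use LocalDataSource instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        super().__init__(*args, **kwargs)


if __name__ == "__main__":
    warnings.simplefilter("always")
    source = DataSource("example")
    print(source.find_data(short_name="tas"))
```

Because the alias subclasses the new class, existing code keeps working while users see a warning that points to the replacement.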

Documentation

Follow up ideas

  • Improve control over loading of data by intake-esgf (chunks, streaming)
  • Move the modules esmvalcore.esgf and esmvalcore.local into esmvalcore.io. To avoid introducing even more changes in this pull request, I will do this in a follow-up pull request (see #2882, "Move the local and esgf modules from esmvalcore to esmvalcore.io").
  • Make the fixes module configurable per data source
  • Add a site configuration setting that selects defaults appropriate to that site, to simplify configuration of the tool. For example, site: levante would select data sources and dask settings appropriate to Levante, and site: jasmin would do the same for Jasmin (see #1706, "Add a site option to the get_config_user command").
  • Improve validation of the data source configuration

Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

@valeriupredoi
Contributor

I'll work with you on this one @bouweandela 🍺

@bouweandela bouweandela force-pushed the add-intake-esgf-support branch from e91e383 to 9d67ed5 Compare July 22, 2025 13:56
@bouweandela bouweandela added the enhancement New feature or request label Jul 23, 2025
Contributor

@valeriupredoi valeriupredoi left a comment

having a dive in this, bud - let me know how I can help!

@valeriupredoi
Contributor

valeriupredoi commented Jul 24, 2025

this one here ties in very well with this PR, bud #2785 - enjoy your time off 🏖️

@valeriupredoi
Contributor

hey @bouweandela hope you're enjoying your holiday time! I kept myself busy and we now have Zarr support (in _io.load) and have done other improvements, hence the conflicts with main; let me fix those for you now. Also, you can now pass an Intake catalog via this PR, and if that has Zarr files in S3 buckets, then we can load them and test this one 😃

@codecov

codecov bot commented Aug 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.66%. Comparing base (64a741e) to head (71d763c).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2765      +/-   ##
==========================================
+ Coverage   95.50%   95.66%   +0.15%     
==========================================
  Files         259      263       +4     
  Lines       15099    15505     +406     
==========================================
+ Hits        14421    14833     +412     
+ Misses        678      672       -6     
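
The percentages in the table above can be reproduced from the hit and line counts. The short check below suggests the report truncates to two decimals rather than rounding; that reading is an inference from these numbers, not documented behaviour, and the helper name is made up for this example.

```python
import math


def truncated_percent(part, whole, digits=2):
    """Return 100 * part / whole, truncated (not rounded) to `digits` decimals."""
    scale = 10 ** digits
    return math.floor(100 * part / whole * scale) / scale


base = truncated_percent(14421, 15099)  # hits / lines on main
head = truncated_percent(14833, 15505)  # hits / lines on this branch

# Difference computed from the unrounded ratios, then truncated.
delta = math.floor((14833 / 15505 - 14421 / 15099) * 100 * 100) / 100

if __name__ == "__main__":
    print(base, head, delta)
```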


@bouweandela bouweandela force-pushed the add-intake-esgf-support branch 2 times, most recently from 3bf06ad to ef2e7cd Compare September 17, 2025 09:15
@bouweandela bouweandela added this to the v2.14.0 milestone Oct 3, 2025
@bouweandela bouweandela force-pushed the add-intake-esgf-support branch 5 times, most recently from 0b12c7b to 1794742 Compare October 17, 2025 14:36
@bouweandela bouweandela force-pushed the add-intake-esgf-support branch from ca867c6 to 94287ab Compare October 22, 2025 16:00
@bouweandela bouweandela changed the title Add support for intake-esgf Add an interface for adding new data sources and add support for intake-esgf as a first example Oct 22, 2025
@valeriupredoi
Contributor

am finally able to start looking at this in great detail, bud, sorry, got hijacked by other things until now 🍺

@valeriupredoi
Contributor

valeriupredoi commented Oct 30, 2025

many thanks @bouweandela - as promised, I have now started to stress-test this baby - please see my very initial query #2765 (comment)

The first type of test is a basic run (what they do with any aircraft prototype - they just taxi it at the very beginning):

  • intake-esgf gets the data, but that takes about 4 to 5 times longer than before with the old esgf-pyclient
  • reruns are fine, data is cached and no downloads happen (as expected)

Am looking through the debug log and am seeing

2025-10-30 15:20:42,767 UTC [3125417] DEBUG   globus_sdk.config.env_vars:59 on lookup, default setting: GLOBUS_SDK_ENVIRONMENT=production
[many many debug lines later]
2025-10-30 15:23:12,505 UTC [3125417] DEBUG   globus_sdk.client:518 request completed with response code: 200

-> that's about 2.5 minutes of the Globus SDK cycling through requests to fetch data that takes 30 s with the old esgf-pyclient - we need to sort this out somehow or we'll be toast!
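
The elapsed time between the two log lines quoted above can be checked directly from their timestamps. This is only a back-of-the-envelope calculation; the 30 s baseline is the figure given in this comment, not a separate measurement.

```python
from datetime import datetime

# Timestamp layout used in the debug log lines (comma before milliseconds).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

start = datetime.strptime("2025-10-30 15:20:42,767", LOG_TIME_FORMAT)
end = datetime.strptime("2025-10-30 15:23:12,505", LOG_TIME_FORMAT)

elapsed_seconds = (end - start).total_seconds()
baseline_seconds = 30  # figure quoted for the old esgf-pyclient
slowdown = elapsed_seconds / baseline_seconds

if __name__ == "__main__":
    print(f"elapsed: {elapsed_seconds:.1f} s, slowdown: {slowdown:.1f}x")
```

The result is roughly 150 s, i.e. about five times the quoted baseline, which matches the "4 to 5 times longer" estimate.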

@valeriupredoi
Contributor

another thing from the first test: we really shouldn't dump the intake config yamls in the debug log file - that poor thing is now 33k lines 😆
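
One possible way to keep third-party chatter out of the debug log is to raise the level of the noisy loggers. The sketch below uses only the standard logging module. The logger name `globus_sdk` is taken from the log lines quoted earlier in this thread; whether this would also silence the dumped intake configuration is an assumption that would need testing, and this is not necessarily what the pull request ended up doing.

```python
import logging


def quiet_noisy_loggers(names=("globus_sdk",), level=logging.WARNING):
    """Raise the level of chatty third-party loggers (illustrative helper)."""
    for name in names:
        # Child loggers such as "globus_sdk.client" inherit this level.
        logging.getLogger(name).setLevel(level)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    quiet_noisy_loggers()
    logging.getLogger("globus_sdk.client").debug("suppressed")
    logging.getLogger("my_app").debug("still shown")
```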

@bouweandela bouweandela force-pushed the add-intake-esgf-support branch from e94e3ec to c0939d4 Compare November 14, 2025 08:40
@bouweandela bouweandela marked this pull request as ready for review November 14, 2025 12:00
@bouweandela bouweandela requested a review from Copilot November 14, 2025 12:01

Copilot AI left a comment

Pull Request Overview

This pull request adds a modular interface for configuring and using multiple data sources in ESMValCore. The main change introduces support for intake-esgf as an alternative to the existing esgf-pyclient for finding ESGF data, while refactoring the existing local and ESGF data source implementations to use the new interface.

Key changes:

  • New esmvalcore.io module with DataSource and DataElement protocols for extensible data source support
  • Configuration via YAML files under projects instead of deprecated rootpath, drs, and search_esgf settings
  • New CLI commands: esmvaltool config show, esmvaltool config list, esmvaltool config copy
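
To give a feel for what a protocol-based data source interface can look like, here is a minimal sketch using typing.Protocol. Only the names DataSource, DataElement, and find_data appear in this pull request; the attributes, the prepare method, and the in-memory classes below are assumptions made for illustration and may differ from the actual definitions in esmvalcore/io/protocol.py.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Iterable, Protocol, runtime_checkable


@runtime_checkable
class DataElement(Protocol):
    """Something that can be turned into input data (illustrative)."""

    name: str

    def prepare(self) -> None:
        """Make the data available, e.g. by downloading it (assumed method)."""


@runtime_checkable
class DataSource(Protocol):
    """Something that can be searched for data elements (illustrative)."""

    name: str

    def find_data(self, **facets: Any) -> Iterable[DataElement]:
        """Return the elements matching the given facets."""


@dataclass
class InMemoryElement:
    """Toy element; satisfies DataElement without inheriting from it."""

    name: str
    facets: dict = field(default_factory=dict)

    def prepare(self) -> None:
        pass  # nothing to download for in-memory data


@dataclass
class InMemorySource:
    """Toy source; satisfies DataSource without inheriting from it."""

    name: str
    elements: list = field(default_factory=list)

    def find_data(self, **facets: Any) -> list:
        return [
            element
            for element in self.elements
            if all(element.facets.get(k) == v for k, v in facets.items())
        ]


if __name__ == "__main__":
    source = InMemorySource(
        "demo",
        [InMemoryElement("tas.nc", {"short_name": "tas"}),
         InMemoryElement("pr.nc", {"short_name": "pr"})],
    )
    print([element.name for element in source.find_data(short_name="tas")])
```

The point of the structural approach is that a new backend only needs to provide the expected members; it does not have to subclass anything from the core package.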

Reviewed Changes

Copilot reviewed 102 out of 104 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| `tests/unit/test_provenance.py` | Removed tests (moved to `tests/unit/provenance/test_trackedfile.py`) |
| `tests/unit/test_dataset.py` | Updated tests to use new data source configuration and mock structure |
| `tests/unit/task/test_print.py` | Updated test expectations for LocalFile and Path representations |
| `tests/unit/task/test_diagnostic_task.py` | Updated TrackedFile initialization to use Path |
| `tests/unit/recipe/test_to_datasets.py` | Updated tests to handle new data source interface and removed `_file_globs` references |
| `tests/unit/recipe/test_recipe.py` | Removed test for `_schedule_for_download` function |
| `tests/unit/provenance/test_trackedfile.py` | Added comprehensive tests for TrackedFile with new protocol support |
| `tests/unit/preprocessor/test_shared.py` | Updated PreprocessorFile instantiation to use Path |
| `tests/unit/main/test_esmvaltool.py` | Updated test to use `search_data` instead of deprecated `search_esgf` |
| `tests/unit/local/test_to_iris.py` | Added test for `_get_attr_from_field_coord` function moved to local module |
| `tests/unit/local/test_time.py` | Removed `_get_start_end_year` tests and updated date extraction tests |
| `tests/unit/local/test_get_data_sources.py` | Added deprecation test for legacy DataSource class |
| `tests/unit/local/test_facets.py` | Updated tests for renamed LocalDataSource and new timerange handling |
| `tests/unit/io/test_load_data_sources.py` | Added tests for new data source loading functionality |
| `tests/unit/io/test_intake_esgf.py` | Added comprehensive tests for intake-esgf data source implementation |
| `tests/unit/esgf/test_search.py` | Added tests for new ESGFDataSource class |
| `tests/unit/esgf/test_download.py` | Updated tests for ESGFFile as DataElement with new methods |
| `tests/unit/config/test_data_sources.py` | Added tests for data source loading with legacy configuration |
| `tests/unit/config/test_config_validator.py` | Added validator tests for `search_data` configuration option |
| `tests/unit/config/test_config_object.py` | Updated tests to use `search_data` instead of `search_esgf` |
| `tests/unit/config/test_config.py` | Added tests for built-in configuration file descriptions |
| `tests/sample_data/experimental/test_run_recipe.py` | Updated to configure data sources explicitly |
| `tests/integration/test_main.py` | Added tests for new config commands and updated existing tests |
| `tests/integration/recipe/test_recipe.py` | Added data source configuration helpers and updated tests |
| `tests/integration/recipe/test_check.py` | Updated data availability check tests |
| `tests/integration/preprocessor/_io/test_zarr.py` | Added `@pytest.mark.online` markers |
| `tests/integration/preprocessor/_io/test_load.py` | Updated load function call to use new interface |
| `tests/integration/esgf/test_search_download.py` | Added `@pytest.mark.online` marker |
| `tests/integration/esgf/search_results/expected.yml` | Added timerange facet to expected results |
| `tests/integration/dataset/test_dataset.py` | Updated to configure data sources explicitly |
| `tests/integration/conftest.py` | Updated test fixtures to work with new data source interface |
| `tests/integration/cmor/_fixes/icon/test_icon_xpp.py` | Added session fixtures and `@pytest.mark.online` markers |
| `tests/integration/cmor/_fixes/icon/test_icon.py` | Added session fixtures and `@pytest.mark.online` markers |
| `tests/integration/cmor/_fixes/icon/conftest.py` | Added fixture for ICON data source configuration |
| `tests/conftest.py` | Removed rootpath configuration from session fixture |
| `pyproject.toml` | Added intake-esgf and globus-sdk dependencies and new pytest marker |
| `esmvalcore/typing.py` | Updated type definitions to be more specific |
| `esmvalcore/preprocessor/_io.py` | Refactored to use DataElement protocol and removed functions moved to local module |
| `esmvalcore/preprocessor/__init__.py` | Updated to use DataElement instead of file paths |
| `esmvalcore/local.py` | Major refactoring with LocalDataSource class, moved functions from `_io.py`, and deprecation warnings |
| `esmvalcore/io/protocol.py` | New file defining DataSource and DataElement protocols |
| `esmvalcore/io/intake_esgf.py` | New module implementing intake-esgf data source |
| `esmvalcore/io/__init__.py` | New module with `load_data_sources` function |
| `esmvalcore/exceptions.py` | Added SuppressedError base to RecipeError |
| `esmvalcore/esgf/_search.py` | Added ESGFDataSource class and updated file selection |
| `esmvalcore/esgf/_download.py` | Updated ESGFFile to implement DataElement protocol |
| `esmvalcore/esgf/__init__.py` | Updated docstring and exports |
| `esmvalcore/dataset.py` | Major refactoring to use data sources instead of direct file finding |
| `esmvalcore/config/configurations/defaults/*.yml` | Added comments to default configuration files |
| `esmvalcore/config/configurations/data-*.yml` | New data source configuration files |
| `esmvalcore/config/config-logging.yml` | Updated logging configuration |
| `esmvalcore/config/_validated_config.py` | Added return type annotation |
| `esmvalcore/config/_data_sources.py` | New module for data source configuration |
| `esmvalcore/config/_config_validators.py` | Added validators and deprecators for new/deprecated options |
| `esmvalcore/config/_config_object.py` | Updated to load projects from config-developer file |
| `esmvalcore/config/_config.py` | Added return type to `load_config_developer` |
| `esmvalcore/cmor/_fixes/icon/_base_fixes.py` | Updated to use new data sources interface |
| `esmvalcore/_task.py` | Updated to use Path for filenames |
| `environment.yml` | Added intake-esgf and globus-sdk dependencies |
| `doc/recipe/overview.rst` | Updated documentation for new configuration option |
| `doc/configurations` | Added symlink to configurations directory |

@bouweandela
Member Author

Thanks a lot for all your advice so far @valeriupredoi! I've finally marked it as 'ready for review'.

@bouweandela
Member Author

Merging this now so it is ready to be presented at the workshop. The design of the new way of configuring the tool has been open for discussion for over a year now in #2371, with added detail in #2584, so I assume that everyone has had time to comment on its usability.

If you encounter any issues with the new way of configuring data sources or using intake-esgf, please report them in an issue.

@bouweandela bouweandela merged commit 9017538 into main Nov 17, 2025
4 checks passed
@bouweandela bouweandela deleted the add-intake-esgf-support branch November 17, 2025 15:28
@valeriupredoi
Contributor

Thanks a lot for all your advice so far @valeriupredoi! I've finally marked it as 'ready for review'.

momentous 😃

Screenshot from 2025-11-17 15-36-27


Labels

enhancement New feature or request

3 participants