Skip to content

Conversation

@bouweandela
Copy link
Member

@bouweandela bouweandela commented Nov 29, 2022

Description

Add the esmvalcore.local module that can be used to find files on the local filesystem (previously know as esmvalcore._data_finder). It supports using '*' as a facet value to match anything and also allows searching for a specific version of a file instead of just the latest. These features are needed to support using wildcards and versioned datasets in the recipe in #1609.

Some minor related changes in this pull request are:

  • Add esmvalcore.typing module to define some types that can be used for type hints.
  • Test the file-finding capabilities without using the extra facets facility, as it was done before Added tests for input/output filenames for ICON and EMAC on-the-fly CMORizer #1718. This offers better separation of concerns and makes it easier to find bugs that are specific to finding files.
  • Update [facet] to {facet} in the finding input data help section, as we've been using curly braces in the rootpaths in config-developer.yml for quite a while now.
  • Some of the tests in tests/integration/test_recipe.py were using ancillary variables that were not meaningful for the preprocessor function they were added to. This is corrected here.

Closes #286
Closes #1825

Backward incompatible change

If a version of a dataset is specified in the recipe, the tool will now search for exactly that version, instead of simply using the latest version. Therefore it is necessary to make sure that the version number in the directory tree matches with the version number in the recipe to find the files.


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number pull requests:

@bouweandela bouweandela added enhancement New feature or request api Notebook API labels Nov 29, 2022
@codecov
Copy link

codecov bot commented Nov 29, 2022

Codecov Report

Merging #1835 (4b56d79) into main (a56e611) will decrease coverage by 0.04%.
The diff coverage is 98.32%.

@@            Coverage Diff             @@
##             main    #1835      +/-   ##
==========================================
- Coverage   91.57%   91.52%   -0.05%     
==========================================
  Files         203      210       +7     
  Lines       10979    11369     +390     
==========================================
+ Hits        10054    10406     +352     
- Misses        925      963      +38     
Impacted Files Coverage Δ
esmvalcore/cmor/_fixes/ipslcm/ipsl_cm6.py 36.95% <0.00%> (ø)
esmvalcore/_provenance.py 97.46% <100.00%> (ø)
esmvalcore/_recipe.py 95.86% <100.00%> (+0.04%) ⬆️
esmvalcore/_recipe_checks.py 90.95% <100.00%> (-0.28%) ⬇️
esmvalcore/config/_config.py 98.76% <100.00%> (+0.08%) ⬆️
esmvalcore/esgf/_download.py 100.00% <100.00%> (ø)
esmvalcore/esgf/_search.py 100.00% <100.00%> (ø)
esmvalcore/local.py 96.90% <100.00%> (ø)
esmvalcore/preprocessor/__init__.py 88.61% <100.00%> (ø)
esmvalcore/preprocessor/_ancillary_vars.py 98.59% <100.00%> (+0.02%) ⬆️
... and 9 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@bouweandela bouweandela changed the title Add esmvalcore.local, a module to search the local filesystem Add esmvalcore.local, a module to search data on the local filesystem Dec 1, 2022
@bouweandela bouweandela marked this pull request as ready for review December 1, 2022 12:27
@bouweandela bouweandela mentioned this pull request Dec 1, 2022
10 tasks
@bouweandela bouweandela added this to the v2.8.0 milestone Dec 1, 2022
@valeriupredoi
Copy link
Contributor

whoa! That's a biggie - I'll have a test or two, but I suggest we split the review effort: Manu checks the code and I test the thing or the other way round? Or Saskia checks the code, Manu tests, and I merge? 😁

@sloosvel
Copy link
Contributor

sloosvel commented Dec 6, 2022

I can take a look tomorrow

@schlunma
Copy link
Contributor

schlunma commented Dec 6, 2022

Sorry, I won't have time to review this until next year. Levante is on maintenance currently (and we need to thoroughly test this with actual data/recipes), and I will be on AGU next week and on holidays afterwards.

I can take a look in early January if that's necessary 👍

@bouweandela
Copy link
Member Author

bouweandela commented Dec 6, 2022

I'm rather hoping that we can get this merged sooner than January, otherwise I'm worried there will not be enough time left to get #1609 merged and have a few weeks for it to be used by ESMValCore development installation users before it hits the release.

Note that the number of changes to the actual file-finding code in this pull request is pretty small: it's mostly renaming a lot of functions from _data_finder.py so they start with an _ so only the part of the interface that we want becomes public. They only real change is that we now use a single combined glob pattern to look for files instead of one to look for directories first and then another one to look for files (this fixes #286 as well as enabling the ability to search for a specific version). I'm pretty sure that works fine because we have many unit tests for data finding and they are passing.

@schlunma
Copy link
Contributor

schlunma commented Dec 6, 2022

All right, sounds reasonable. I'll try to have a quick look at the code today, but I would really appreciate it if someone can run some recipes with this (maybe myself on Friday, let's see). I know we have plenty of unit tests, but (1) you modified a lot of them and (2) I don't fully trust them (it happened in the past that changes didn't break unit tests but broke recipes) 😬

@bouweandela
Copy link
Member Author

I merged the main branch into this to make sure everything is up to date.

Copy link
Contributor

@schlunma schlunma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Bouwe, the code is much cleaner now! Some comments on the code, did not look at the tests yet.

@valeriupredoi
Copy link
Contributor

I'll run recipes once the code settles a bit after review, maybe tomorrow or on Thu? 🍺

@bouweandela
Copy link
Member Author

@remi-kazeroni The problems you reported are fixed now. Could you have another go please?

@remi-kazeroni
Copy link
Contributor

@remi-kazeroni The problems you reported are fixed now. Could you have another go please?

Thanks a lot for addressing these problems @bouweandela! I ran the same set of recipes again and all problems are now fixed, the recipes run fine. 👍 I'm currently running a few more recipes outside of examples. I'll report in a bit how this goes

@remi-kazeroni
Copy link
Contributor

I ran some further tests and almost all recipes seem to run fine. The main problem I noticed is that the data finder does not work with native6 datasets. For example, no data is found when running esmvaltool run examples/recipe_check_obs.yml --diagnostics=ERA5_native6. Here is the content of the log file for one variable if it helps the debugging:

[2093965] INFO    esmvalcore._recipe:1825 Creating preprocessor task ERA5_native6/vas_Amon
[2093965] INFO    esmvalcore._recipe:1214 Creating preprocessor 'default' task for variable 'vas'
[2093965] DEBUG   esmvalcore.local:443 Looking for files matching [PosixPath('/work/bd0854/DATA/ESMValTool2/RAWOBS/Tier3/ERA5/1/mon/vas/*.nc')]
[2093965] DEBUG   esmvalcore._recipe:623 Using input files for variable vas of dataset ERA5:

[2093965] ERROR   esmvalcore._recipe_checks:101 No input files found for variable {'short_name': 'vas', 'mip': 'Amon', 'variable_group': 'vas_Amon', 'diagnostic': 'ERA5_native6', 'preprocessor': 'default', 'dataset': 'ERA5', 'project': 'native6', 'tier': 3, 'type': 'reanaly', 'version': 1, 'recipe_dataset_index': 0, 'timerange': '1990/1990', 'alias': 'ERA5', 'original_short_name': 'vas', 'standard_name': 'northward_wind', 'long_name': 'Northward Near-Surface Wind', 'units': 'm s-1', 'modeling_realm': ['atmos'], 'frequency': 'mon', 'start_year': 1990, 'end_year': 1990}
[2093965] ERROR   esmvalcore._recipe_checks:107 Looked for files matching: /work/bd0854/DATA/ESMValTool2/RAWOBS/Tier3/ERA5/1/mon/vas/*.nc
[2093965] ERROR   esmvalcore._recipe_checks:108 Set 'log_level' to 'debug' to get more information

Regarding the diagnostics: it seems that only one diagnostic is currently using from esmvalcore._data_finder import, namely monitor/monitor_base.py but do you think other imports could be broken after merging this PR?

@bouweandela
Copy link
Member Author

Thanks for testing again @remi-kazeroni!

I ran some further tests and almost all recipes seem to run fine. The main problem I noticed is that the data finder does not work with native6 datasets. For example, no data is found when running esmvaltool run examples/recipe_check_obs.yml --diagnostics=ERA5_native6.

The way the ERA5 data is organized on Levante is not compatible with what is specified in the recipe.

Here is an example of a file path:

/work/bd0854/DATA/ESMValTool2/RAWOBS/Tier3/ERA5/v1/1hrPt/tas/era5_2m_temperature_1990_hourly.nc

note that the version is called v1. In the recipe, on the other hand, the version is just 1, without the v: https://github.com/ESMValGroup/ESMValTool/blob/3ef4a12f9d896b9836488a30e6b949215c2a5515/esmvaltool/recipes/examples/recipe_check_obs.yml#L1409

I expect similar problems for other recipes which do specify version numbers that do not match how data is organized. In particular, for obs4MIPs where the situation is even hairier, see #1859.

@bouweandela
Copy link
Member Author

Regarding the diagnostics: it seems that only one diagnostic is currently using from esmvalcore._data_finder import, namely monitor/monitor_base.py but do you think other imports could be broken after merging this PR?

There the recipe loading tests which extensively patch ESMValCore: https://github.com/ESMValGroup/ESMValTool/blob/main/tests/integration/test_recipes_loading.py

I'll have a look at what I can do.

@schlunma
Copy link
Contributor

schlunma commented Dec 9, 2022

Thanks Bouwe for addressing all my comments so far! I tested this with 5 very complex recipes, and all worked well.

I found a tiny issue regarding native IPSL-CM6 data. Would it be ok for you if I push that change directly here? Most likely you don't have the necessary data to test this. It's just one line.

@bouweandela
Copy link
Member Author

Sure, go ahead!

@schlunma
Copy link
Contributor

schlunma commented Dec 9, 2022

The way the ERA5 data is organized on Levante is not compatible with what is specified in the recipe.

Here is an example of a file path:

/work/bd0854/DATA/ESMValTool2/RAWOBS/Tier3/ERA5/v1/1hrPt/tas/era5_2m_temperature_1990_hourly.nc

note that the version is called v1. In the recipe, on the other hand, the version is just 1, without the v: https://github.com/ESMValGroup/ESMValTool/blob/3ef4a12f9d896b9836488a30e6b949215c2a5515/esmvaltool/recipes/examples/recipe_check_obs.yml#L1409

I expect similar problems for other recipes which do specify version numbers that do not match how data is organized. In particular, for obs4MIPs where the situation is even hairier, see #1859.

Can this be fixed by adapting the recipe? And any idea why this worked before even though we used the wrong version tag (I noticed that already in the past and always wondered why it worked 😄)?

@schlunma
Copy link
Contributor

schlunma commented Dec 9, 2022

Sorry for messing up Codecov😁 Let's just ignore it, this line hasn't been tested before...

@bouweandela
Copy link
Member Author

Can this be fixed by adapting the recipe? And any idea why this worked before even though we used the wrong version tag (I noticed that already in the past and always wondered why it worked 😄)?

Yes, we can update the recipe to use version v1 or take out the version number completely, for now, to automatically select the latest version. It worked because currently the tool always selects the latest version where the config-developer.yml file contains {latestversion}, regardless of what version you specify in the recipe. With the changes in this pull request, you will actually get the version you ask for (or an error if it doesn't exist).

Copy link
Contributor

@schlunma schlunma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Could you please open an issue about that or address it in ESMValGroup/ESMValTool#2958?

From my side, this is ready 👍 (I didn't have a look at the tests at all though).

Thanks Bouwe, cool stuff 🚀

@bouweandela
Copy link
Member Author

Regarding the diagnostics: it seems that only one diagnostic is currently using from esmvalcore._data_finder import, namely monitor/monitor_base.py but do you think other imports could be broken after merging this PR?

@remi-kazeroni I opened ESMValGroup/ESMValTool#2958 to update the imports from esmvalcore._data_finder in ESMValTool.

@bouweandela
Copy link
Member Author

Created an issue to ask what version number to use for the ERA5 data: ESMValGroup/ESMValTool#2960

@remi-kazeroni
Copy link
Contributor

The way the ERA5 data is organized on Levante is not compatible with what is specified in the recipe.

Here is an example of a file path:

/work/bd0854/DATA/ESMValTool2/RAWOBS/Tier3/ERA5/v1/1hrPt/tas/era5_2m_temperature_1990_hourly.nc

note that the version is called v1. In the recipe, on the other hand, the version is just 1, without the v: https://github.com/ESMValGroup/ESMValTool/blob/3ef4a12f9d896b9836488a30e6b949215c2a5515/esmvaltool/recipes/examples/recipe_check_obs.yml#L1409

I expect similar problems for other recipes which do specify version numbers that do not match how data is organized. In particular, for obs4MIPs where the situation is even hairier, see #1859.

Thanks for clarifying @bouweandela, make sense to me. I'll take a the other issues/PRs that you opened regarding version numbers. For native6 datasets used in public recipes, I could only find one other case outside of ERA5 (-Land) data. In the recipe_hydro_forcing (here), we have: dataset: MSWEP, version: 220 while the directory is named v220 so the data is not found so the same reason. Let's fix that once we have agreed on a solution for ERA5 👍

@bouweandela
Copy link
Member Author

Thanks @remi-kazeroni! Did you want to do additional tests with this branch or are you happy with it now?

@remi-kazeroni
Copy link
Contributor

Thanks @remi-kazeroni! Did you want to do additional tests with this branch or are you happy with it now?

Thanks for your great work @bouweandela! I don't think it is necessary to do further recipe testing for this PR. The tests of @schlunma and myself should be enough for now. You can take this as an approval from my side.

From @schlunma's review, I see that one should maybe check the tests before this PR can be merged. @valeriupredoi, as one of our specialist, would you maybe have the time for that? 🍻

@valeriupredoi
Copy link
Contributor

wonderful work @bouweandela @schlunma and @remi-kazeroni 🍺 x 3 - I have taken a look at codecov fails and those are simply because we dont have a test for ipsl_cm6 - do you want to add a test? At any rate, that will be outside the scope of this PR so kinda needs be done in a separate PR. Codacyissues are minor, so am gonna merge this, great work 🍺

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api Notebook API backwards incompatible change enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add data finder to the public API Data finder cannot find data spread over different versions

6 participants