Closed
65 commits
6f977cd
created extract surface preprocessor.
ledm Mar 16, 2021
e47fc4a
added some new uses and documentatiotn.
ledm Mar 16, 2021
66147bc
Added extract_levels documentation
ledm Mar 16, 2021
22e6f4b
added unit test for surface extraction
ledm Mar 16, 2021
a68937b
Update _volume.py
ledm Mar 19, 2021
b9b5800
minor bugfix
ledm Mar 19, 2021
3c7b0c0
fixed test
ledm Mar 19, 2021
c6c8e36
added additional documentation and comments.
ledm Mar 19, 2021
a027d96
removing trailing space
ledm Mar 19, 2021
cb78b9c
added doc fix.
ledm Mar 19, 2021
0b04d4e
Update doc/recipe/preprocessor.rst
ledm Mar 24, 2021
319ba19
Update esmvalcore/preprocessor/_volume.py
ledm Mar 24, 2021
053cfc0
Update esmvalcore/preprocessor/_volume.py
ledm Mar 24, 2021
96fc7e4
Update doc/recipe/preprocessor.rst
ledm Mar 24, 2021
deb6918
Merge remote-tracking branch 'origin/master' into extract_surface
ledm Mar 29, 2021
2f729e9
make surface more generic
ledm Mar 29, 2021
d644526
Search r0i0p0 if fx data not found under original ensemble
thomascrocker Apr 26, 2021
da8cf7e
typo. Moved second search under if statement
thomascrocker Apr 26, 2021
574f9f8
fixed line length complaint
thomascrocker Apr 26, 2021
4378116
fix by using globbing
thomascrocker May 5, 2021
54a5105
Merge branch 'master' into fix_fx_ensembles
valeriupredoi May 10, 2021
0c539c3
fix line too long
valeriupredoi May 10, 2021
5f5bb10
modded tests
valeriupredoi May 10, 2021
a5a933b
Added docs to additional places where FX variables are used
thomascrocker May 11, 2021
d471de2
Merge branch 'master' into fix_fx_ensembles
thomascrocker May 14, 2021
6eb521c
Merge remote-tracking branch 'origin/master' into extract_surface
ledm May 17, 2021
c17f85b
downgrade logging message to debug
thomascrocker May 18, 2021
7df08ee
addressing review comments and maintaining backwards compatibility
thomascrocker May 18, 2021
87b5f48
reverting tests to original
thomascrocker May 18, 2021
b53a991
refactored to resolve wildcards in separate function
thomascrocker May 19, 2021
3611451
CMIP6 wildcard FX test
thomascrocker May 19, 2021
a87d290
added CORDEX wildcard fx test
thomascrocker May 19, 2021
3f1b881
minor tweak to docs
thomascrocker May 19, 2021
8d3bf2d
flake8 fixes
thomascrocker May 19, 2021
443208f
Update _data_finder.py
thomascrocker May 19, 2021
62d091a
codacy fix
thomascrocker May 19, 2021
e2c76b1
minor changes to address review comments
thomascrocker May 20, 2021
ad9b702
additional tests
thomascrocker May 20, 2021
6909061
docs update
thomascrocker May 20, 2021
c88898b
Merge branch 'fix_fx_ensembles' of https://github.com/ESMValGroup/ESM…
thomascrocker May 20, 2021
a7bccf8
pylint fix
thomascrocker May 20, 2021
439ce43
Merge remote-tracking branch 'origin/master' into fix_fx_ensembles
schlunma May 20, 2021
3961cfc
Added further check on variable's ensemble during fx file retrieval i…
schlunma May 20, 2021
893aaaf
refactor to deal with more file path cases
thomascrocker May 21, 2021
c9cdbe4
further dir finder test
thomascrocker May 21, 2021
f0637f2
tweaks for codacy
thomascrocker May 21, 2021
fc61e06
Fixed globbing for fx files when latestversion tag is present
schlunma May 21, 2021
09543a0
Fixed output path for fx files when wildcards are used
schlunma May 21, 2021
92a6207
Fixed output path for fx files when wildcards are used (again)
schlunma May 21, 2021
3676f66
Fixed output path for fx files when wildcards are used (this time for…
schlunma May 21, 2021
3b1c0a2
Removed print() statement
schlunma May 21, 2021
845cc8d
Merge remote-tracking branch 'origin/master' into fix_fx_ensembles
schlunma May 21, 2021
b562249
Fixed tests
schlunma May 21, 2021
993ceaa
added fixes to cmip6 omon.
ledm May 25, 2021
fc60589
working here.
ledm Jun 11, 2021
ce87c3c
pre-merge changes
ledm Jun 15, 2021
15807ae
Merge remote-tracking branch 'origin/main' into extract_surface
ledm Jun 15, 2021
0947529
Merge remote-tracking branch 'origin/main' into extract_surface
ledm Jun 16, 2021
ca850b1
Merge remote-tracking branch 'origin/main' into extract_surface
ledm Aug 17, 2021
d4a2552
added fixes
ledm Aug 17, 2021
2b32a17
Merge remote-tracking branch 'origin/main' into extract_surface
ledm Sep 6, 2021
b50b094
Merge remote-tracking branch 'origin/main' into fix_fx_ensembles
schlunma Sep 9, 2021
fae8f0c
added fatal error if fx missing
ledm Sep 9, 2021
f200c14
Merge remote-tracking branch 'origin/fix_fx_ensembles' into extract_s…
ledm Sep 9, 2021
2d13465
Merge remote-tracking branch 'origin/main' into extract_surface
ledm Sep 23, 2021
30 changes: 30 additions & 0 deletions doc/recipe/preprocessor.rst
@@ -245,6 +245,17 @@ available tables of the specified project.
a given dataset) fx files are found in more than one table, ``mip`` needs to
be specified, otherwise an error is raised.

Additionally, it is possible to search across all ensembles and experiments (or
any other keys) when specifying the fx variable by using the ``*`` character.
This is useful for projects where the location of the fx files is not
consistent, since fx files can then be found under any ensemble member or
experiment. For example: ``ensemble: '*'``. Note that the ``*`` character must
be quoted since ``*`` is a special character in YAML. This functionality is
only supported for time-invariant fx variables (i.e. frequency ``fx``). Note
also that if multiple folders of matching fx files are found, ESMValTool uses
the folder for ensemble ``r0i0p0`` if it exists, and otherwise the first folder
found.
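
The selection rule for multiple matching folders can be sketched as follows.
This is a minimal illustration, assuming the matching folder paths are already
known; ``pick_fx_directory`` is a hypothetical name and not part of ESMValTool.

```python
def pick_fx_directory(dirnames):
    """Pick one folder from several wildcard matches (illustrative only)."""
    # Sort so that "first folder found" is deterministic.
    dirnames = sorted(dirnames)
    if not dirnames:
        return None
    # Prefer a folder whose path contains the r0i0p0 ensemble.
    r0i0p0_matches = [d for d in dirnames if "r0i0p0" in d]
    if r0i0p0_matches:
        return r0i0p0_matches[0]
    # Otherwise fall back to the first folder in sorted order.
    return dirnames[0]
```

With the hypothetical paths ``/data/r1i1p1/fx`` and ``/data/r0i0p0/fx``, the
sketch picks the ``r0i0p0`` folder; without an ``r0i0p0`` folder it picks the
first path in sorted order.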

Internally, the required ``fx_variables`` are automatically loaded by the
preprocessor step ``add_fx_variables`` which also checks them against CMOR
standards and adds them either as ``cell_measure`` (see `CF conventions on cell
@@ -1507,6 +1518,7 @@ The ``_volume.py`` module contains the following preprocessor functions:
* ``extract_transect``: Extract data along a line of constant latitude or
longitude.
* ``extract_trajectory``: Extract data along a specified trajectory.
* ``extract_surface``: Extract the surface layer from a cube.


``extract_volume``
@@ -1590,6 +1602,24 @@ Note that this function uses the expensive ``interpolate`` method from
See also :func:`esmvalcore.preprocessor.extract_trajectory`.


``extract_surface``
-------------------

This function extracts the surface layer from a dataset.

The surface layer is defined as the level at which the absolute value
of the Z-coordinate is smallest.
This definition typically holds for ocean models, but might not hold
for atmospheric models.

The same functionality exists in the ``extract_levels`` preprocessor,
which should be used for more complex datasets
if ``extract_surface`` fails. However, ``extract_levels``
has a higher computational cost and may be slower.

See also :func:`esmvalcore.preprocessor.extract_levels`.
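
The definition of the surface layer can be illustrated with a short sketch.
It assumes plain Python lists with Z as the leading axis; ``surface_index``
and ``extract_surface_layer`` are hypothetical helper names and do not
reproduce the actual implementation in ``_volume.py``.

```python
def surface_index(z_points):
    """Return the index of the level whose Z value is closest to zero."""
    if not z_points:
        raise ValueError("Z-coordinate has no points")
    return min(range(len(z_points)), key=lambda idx: abs(z_points[idx]))


def extract_surface_layer(data, z_points):
    """Select the surface slice of ``data``, whose leading axis is Z."""
    return data[surface_index(z_points)]


# Three levels at 5, 15 and 25 (depth) with two grid points each.
depths = [5.0, 15.0, 25.0]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
surface = extract_surface_layer(values, depths)
```

Because the absolute value is used, the same rule also works when the
Z-coordinate is stored with negative values, e.g. ``[-25.0, -15.0, -5.0]``.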


.. _cycles:

Cycles
170 changes: 131 additions & 39 deletions esmvalcore/_data_finder.py
@@ -93,16 +93,16 @@ def get_start_end_year(filename):
for cube in cubes:
logger.debug(cube)
try:
time = cube.coord('time')
time = cube.coord("time")
except iris.exceptions.CoordinateNotFoundError:
continue
start_year = time.cell(0).point.year
end_year = time.cell(-1).point.year
break

if start_year is None or end_year is None:
raise ValueError(f'File {filename} dates do not match a recognized'
'pattern and time can not be read from the file')
raise ValueError(f"File {filename} dates do not match a recognized "
"pattern and time can not be read from the file")

logger.debug("Found start_year %s and end_year %s", start_year, end_year)
return int(start_year), int(end_year)
@@ -124,7 +124,7 @@ def select_files(filenames, start_year, end_year):
def _replace_tags(paths, variable):
"""Replace tags in the config-developer's file with actual values."""
if isinstance(paths, str):
paths = set((paths.strip('/'),))
paths = set((paths.strip('/'), ))
else:
paths = set(path.strip('/') for path in paths)
tlist = set()
@@ -133,10 +133,9 @@ def _replace_tags(paths, variable):
if 'sub_experiment' in variable:
new_paths = []
for path in paths:
new_paths.extend((
re.sub(r'(\b{ensemble}\b)', r'{sub_experiment}-\1', path),
re.sub(r'({ensemble})', r'{sub_experiment}-\1', path)
))
new_paths.extend(
(re.sub(r'(\b{ensemble}\b)', r'{sub_experiment}-\1', path),
re.sub(r'({ensemble})', r'{sub_experiment}-\1', path)))
tlist.add('sub_experiment')
paths = new_paths
logger.debug(tlist)
@@ -145,7 +144,7 @@ def _replace_tags(paths, variable):
original_tag = tag
tag, _, _ = _get_caps_options(tag)

if tag == 'latestversion': # handled separately later
if tag == "latestversion": # handled separately later
continue
if tag in variable:
replacewith = variable[tag]
@@ -172,10 +171,10 @@ def _replace_tag(paths, tag, replacewith):
def _get_caps_options(tag):
lower = False
upper = False
if tag.endswith('.lower'):
if tag.endswith(".lower"):
lower = True
tag = tag[0:-6]
elif tag.endswith('.upper'):
elif tag.endswith(".upper"):
upper = True
tag = tag[0:-6]
return tag, lower, upper
@@ -195,36 +194,82 @@ def _resolve_latestversion(dirname_template):
This implementation avoids globbing on centralized clusters with very
large data root dirs (i.e. ESGF nodes like Jasmin/DKRZ).
"""
if '{latestversion}' not in dirname_template:
if "{latestversion}" not in dirname_template:
return dirname_template

# Find latest version
part1, part2 = dirname_template.split('{latestversion}')
part1, part2 = dirname_template.split("{latestversion}")
part2 = part2.lstrip(os.sep)
if os.path.exists(part1):
versions = os.listdir(part1)
versions.sort(reverse=True)
for version in ['latest'] + versions:
for version in ["latest"] + versions:
dirname = os.path.join(part1, version, part2)
if os.path.isdir(dirname):
return dirname

return dirname_template


def _resolve_wildcards_and_version(dirname, basepath, project, drs):
"""Resolve wildcards and latestversion tag."""
if "{latestversion}" in dirname:
dirname_version_wildcard = dirname.replace("{latestversion}", "*")

# Find all directories that match the template
all_dirs = sorted(glob.glob(dirname_version_wildcard))

# Sort directories by version
all_dirs_dict = {}
for directory in all_dirs:
version = dir_to_var(
directory, basepath, project, drs)['latestversion']
all_dirs_dict.setdefault(version, [])
all_dirs_dict[version].append(directory)

# Select latest version
if not all_dirs_dict:
dirnames = []
elif 'latest' in all_dirs_dict:
dirnames = all_dirs_dict['latest']
else:
all_versions = sorted(list(all_dirs_dict))
dirnames = all_dirs_dict[all_versions[-1]]

# No {latestversion} tag
else:
dirnames = sorted(glob.glob(dirname))

# No directories found
if not dirnames:
logger.debug("Unable to resolve %s", dirname)
return dirname

# Exactly one directory found
if len(dirnames) == 1:
return dirnames[0]

# Warn if multiple directories have been found and prioritize r0i0p0
logger.warning("Multiple directories for fx variables found: %s", dirnames)
r0i0p0_matches = [d for d in dirnames if "r0i0p0" in d]
if r0i0p0_matches:
return r0i0p0_matches[0]
return dirnames[0]


def _select_drs(input_type, drs, project):
"""Select the directory structure of input path."""
cfg = get_project_config(project)
input_path = cfg[input_type]
if isinstance(input_path, str):
return input_path

structure = drs.get(project, 'default')
structure = drs.get(project, "default")
if structure in input_path:
return input_path[structure]

raise KeyError(
'drs {} for {} project not specified in config-developer file'.format(
"drs {} for {} project not specified in config-developer file".format(
structure, project))


@@ -248,16 +293,24 @@ def get_rootpath(rootpath, project):

def _find_input_dirs(variable, rootpath, drs):
"""Return the full paths to input directories."""
project = variable['project']
project = variable["project"]

root = get_rootpath(rootpath, project)
path_template = _select_drs('input_dir', drs, project)
path_template = _select_drs("input_dir", drs, project)

dirnames = []
for dirname_template in _replace_tags(path_template, variable):
for base_path in root:
dirname = os.path.join(base_path, dirname_template)
dirname = _resolve_latestversion(dirname)
if variable["frequency"] == "fx" and "*" in dirname:
dirname = _resolve_wildcards_and_version(dirname, base_path,
project, drs)
var_from_dir = dir_to_var(dirname, base_path, project, drs)
for (key, val) in variable.items():
if val == '*':
variable[key] = var_from_dir.get(key, '*')
else:
dirname = _resolve_latestversion(dirname)
matches = glob.glob(dirname)
matches = [match for match in matches if os.path.isdir(match)]
if matches:
@@ -272,65 +325,104 @@ def _find_input_dirs(variable, rootpath, drs):

def _get_filenames_glob(variable, drs):
"""Return patterns that can be used to look for input files."""
path_template = _select_drs('input_file', drs, variable['project'])
path_template = _select_drs("input_file", drs, variable["project"])
filenames_glob = _replace_tags(path_template, variable)
return filenames_glob


def _find_input_files(variable, rootpath, drs):
short_name = variable['short_name']
variable['short_name'] = variable['original_short_name']
short_name = variable["short_name"]
variable["short_name"] = variable["original_short_name"]
input_dirs = _find_input_dirs(variable, rootpath, drs)
filenames_glob = _get_filenames_glob(variable, drs)
files = find_files(input_dirs, filenames_glob)
variable['short_name'] = short_name
variable["short_name"] = short_name
return (files, input_dirs, filenames_glob)


def get_input_filelist(variable, rootpath, drs):
"""Return the full path to input files."""
# change ensemble to fixed r0i0p0 for fx variables
# this is needed and is not a duplicate effort
if variable['project'] == 'CMIP5' and variable['frequency'] == 'fx':
if all([
variable['project'] == 'CMIP5', variable['frequency'] == 'fx',
variable.get('ensemble') != '*'
]):
variable['ensemble'] = 'r0i0p0'
(files, dirnames, filenames) = _find_input_files(variable, rootpath, drs)

# do time gating only for non-fx variables
if variable['frequency'] != 'fx':
files = select_files(files, variable['start_year'],
variable['end_year'])
if variable["frequency"] != "fx":
files = select_files(files, variable["start_year"],
variable["end_year"])
return (files, dirnames, filenames)


def get_output_file(variable, preproc_dir):
"""Return the full path to the output (preprocessed) file."""
cfg = get_project_config(variable['project'])
cfg = get_project_config(variable["project"])

# Join different experiment names
if isinstance(variable.get('exp'), (list, tuple)):
if isinstance(variable.get("exp"), (list, tuple)):
variable = dict(variable)
variable['exp'] = '-'.join(variable['exp'])
variable["exp"] = "-".join(variable["exp"])

outfile = os.path.join(
preproc_dir,
variable['diagnostic'],
variable['variable_group'],
_replace_tags(cfg['output_file'], variable)[0],
variable["diagnostic"],
variable["variable_group"],
_replace_tags(cfg["output_file"], variable)[0],
)
if variable['frequency'] != 'fx':
outfile += '_{start_year}-{end_year}'.format(**variable)
outfile += '.nc'
if variable["frequency"] != "fx":
outfile += "_{start_year}-{end_year}".format(**variable)
outfile += ".nc"
return outfile


def get_statistic_output_file(variable, preproc_dir):
"""Get multi model statistic filename depending on settings."""
template = os.path.join(
preproc_dir,
'{diagnostic}',
'{variable_group}',
'{dataset}_{mip}_{short_name}_{start_year}-{end_year}.nc',
"{diagnostic}",
"{variable_group}",
"{dataset}_{mip}_{short_name}_{start_year}-{end_year}.nc",
)

outfile = template.format(**variable)

return outfile


def dir_to_var(dirname, basepath, project, drs):
"""Convert directory path to variable :obj:`dict`."""
if dirname != os.sep:
dirname = dirname.rstrip(os.sep)
if basepath != os.sep:
basepath = basepath.rstrip(os.sep)
path_template = _select_drs("input_dir", drs, project).rstrip(os.sep)
rel_dir = os.path.relpath(dirname, basepath)
keys = path_template.split(os.sep)
vals = rel_dir.split(os.sep)
if len(keys) != len(vals):
raise ValueError(
f"Cannot extract tags '{path_template}' from directory "
f"'{rel_dir}' (root: '{basepath}') with different numbers of "
f"elements")
variable = {}
for (idx, full_key) in enumerate(keys):
matches = re.findall(r'.*\{(.*)\}.*', full_key)
if len(matches) != 1:
continue
key = matches[0]
regex = rf"{full_key.replace(key, '(.*)')}"
regex = regex.replace('{', '').replace('}', '')
matches = re.findall(regex, vals[idx])
while '' in matches:
matches.remove('')
if len(matches) != 1:
raise ValueError(
f"Regex pattern '{regex}' for '{full_key}' cannot be "
f"(uniquely) matched to element '{vals[idx]}' in directory "
f"'{dirname}'")
variable[key] = matches[0]
return variable
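
The idea behind `dir_to_var` (reading tag values back out of a resolved directory) can be sketched in isolation. This is a simplified illustration, not the function from the diff: `tags_from_dir` is a hypothetical name, the template below is made up for the example, and only path elements containing exactly one `{tag}` are handled.

```python
import re


def tags_from_dir(path_template, rel_dir):
    """Map a DRS-like template onto a relative directory (illustrative)."""
    keys = path_template.strip("/").split("/")
    vals = rel_dir.strip("/").split("/")
    if len(keys) != len(vals):
        raise ValueError(
            f"Cannot match template '{path_template}' to '{rel_dir}': "
            "different numbers of path elements")
    variable = {}
    for key_part, val in zip(keys, vals):
        tags = re.findall(r"\{(\w+)\}", key_part)
        if len(tags) != 1:
            continue  # literal element, or more tags than this sketch handles
        tag = tags[0]
        # Escape the literal text and turn the tag into a capturing group.
        pattern = re.escape(key_part).replace(
            re.escape("{" + tag + "}"), "(.+)")
        match = re.fullmatch(pattern, val)
        if match is None:
            raise ValueError(f"'{val}' does not match '{key_part}'")
        variable[tag] = match.group(1)
    return variable


# Hypothetical template and directory, for illustration only.
resolved = tags_from_dir(
    "{dataset}/{exp}/fx/{short_name}/{ensemble}",
    "HadGEM2-ES/historical/fx/areacello/r0i0p0",
)
```

Here `resolved["ensemble"]` holds the concrete ensemble found on disk, which is the kind of value that can replace a `'*'` entry in the variable dictionary once the wildcard has been resolved.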
17 changes: 13 additions & 4 deletions esmvalcore/_recipe.py
@@ -336,7 +336,8 @@ def _add_fxvar_keys(fx_info, variable):
fx_variable['variable_group'] = fx_info['short_name']

# add special ensemble for CMIP5 only
if fx_variable['project'] == 'CMIP5':
if (fx_variable['project'] == 'CMIP5' and
fx_variable.get('ensemble') != '*'):
fx_variable['ensemble'] = 'r0i0p0'

# add missing cmor info
@@ -427,11 +428,14 @@ def _get_fx_files(variable, fx_info, config_user):
f"table '{mip}' for '{var_project}'")
fx_info = _add_fxvar_keys(fx_info, variable)
fx_files = _get_input_files(fx_info, config_user)[0]



# Flag a warning if no files are found
if not fx_files:
logger.warning("Missing data for fx variable %s of dataset %s",
fx_info['short_name'], fx_info['dataset'])
outs = ', '.join([str(i)+': '+str(v) for i, v in fx_info.items()])
logger.error("Missing data for fx variable '%s', '%s'",
fx_info['short_name'], outs)
assert 0

# If frequency = fx, only allow a single file
if fx_files:
@@ -796,6 +800,11 @@ def _get_preprocessor_products(variables, profile, order, ancestor_products,
else:
missing_vars.add(ex.message)
continue

# Update output filename in case wildcards have been resolved
if '*' in variable['filename']:
variable['filename'] = get_output_file(variable,
config_user['preproc_dir'])
product = PreprocessorFile(
attributes=variable,
settings=settings,
Expand Down
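
The fatal-error path in `_get_fx_files` stops the run with `assert 0` when no fx files are found. Since assertions are skipped when Python runs with `-O`, the same behaviour could be expressed with an explicit exception. The following is a minimal sketch under that assumption; `MissingFxDataError` and `require_fx_files` are hypothetical names and are not part of this PR or of ESMValCore.

```python
class MissingFxDataError(Exception):
    """Hypothetical error type for missing fx files (illustrative only)."""


def require_fx_files(fx_files, fx_info):
    """Return ``fx_files`` unchanged, or raise if the list is empty."""
    if not fx_files:
        # Same "key: value" summary that the diff builds for its log message.
        details = ", ".join(
            f"{key}: {value}" for key, value in fx_info.items())
        raise MissingFxDataError(
            f"Missing data for fx variable '{fx_info['short_name']}': "
            f"{details}")
    return fx_files
```

A caller can then catch the exception and report all missing fx variables together, instead of aborting at the first one.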