Description
In #841 we added support for defining p_id and hh_id in the top-level namespace. This currently works for these two columns only because they are never used as aggregation source columns. Supporting more variables in the top-level namespace requires changes to the infrastructure.
Essentially, all changes come down to improving the qualified-name check in _get_tree_path_from_source_col_name: instead of only testing whether "__" is in the name, it has to handle qualified names without double underscores (top-level inputs), partially namespaced arguments (i.e. namespaced relative to the current module), and fully qualified or simple name arguments as before. Below is a draft.
We need _get_tree_path_from_source_col_name for two things:
- To find the source column in the functions or data tree to derive annotations.
- To determine the position of (automatically) derived aggregation functions (e.g. when x_hh is used as a function argument) when putting them into the functions tree.
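For reference, a delimiter-only check along the lines described above can be sketched as follows. This is a simplified stand-in and not the actual GETTSIM code: the function name tree_path_from_delimiter_only and the input name some_top_level_input are hypothetical, and the delimiter is assumed to be "__".

```python
QUAL_NAME_DELIMITER = "__"  # assumed value of dt.QUAL_NAME_DELIMITER


def tree_path_from_delimiter_only(
    name: str, current_namespace: tuple[str, ...]
) -> tuple[str, ...]:
    """Hypothetical stand-in: any name containing the delimiter is read as absolute."""
    if QUAL_NAME_DELIMITER in name:
        return tuple(name.split(QUAL_NAME_DELIMITER))
    return (*current_namespace, name)


# A fully qualified argument resolves as intended.
fully_qualified = tree_path_from_delimiter_only("n2__x", ("n1",))
# A top-level input without a delimiter ends up in the current namespace.
top_level = tree_path_from_delimiter_only("some_top_level_input", ("n1",))
# A name given relative to the current namespace is read as absolute.
relative = tree_path_from_delimiter_only("sub__x", ("n1",))
```

The last two calls show the cases that a pure delimiter check cannot distinguish: top-level inputs and arguments namespaced relative to the current module.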
To determine whether a given source column is already fully (or partly) namespaced, or belongs to the top-level namespace or the current one, we need to look for it in the functions, data, or aggregations tree. This is not trivial because of time conversions and group aggregations. Consider these two examples:
A. _get_tree_path_from_source_col_name fails because of grouping level
Consider the functions tree {"n1": {"f": lambda n2__x_hh: n2__x_hh}}. The data tree is {"n2": {"x": pd.Series([1])}}. The tree path of the argument of ("n1", "f") should be ("n2", "x_hh"). Because n2__x_hh is not a qualified name in the data tree, the name is treated as partially namespaced and the result will be ("n1", "n2", "x_hh"). One potential remedy would be something like this:
source_function_is_derived = (
    name not in dt.qual_names(aggregations_tree)
    and name not in dt.qual_names(data_tree)
    and name not in dt.qual_names(functions_tree)
)
if source_function_is_derived:
    # 'name' is derived from another function that is already in the
    # aggregations, data, functions or basic input variables tree.
    base_function_name, grouping_key = name.rsplit("_", 1)
    path_of_base_function = _get_tree_path_from_source_col_name(
        name=base_function_name,
        current_namespace=tree_path[:-1],
        functions_tree=functions_tree,
        data_tree=data_tree,
        aggregations_tree=aggregations_tree,
    )
    path_of_function_argument = path_of_base_function[:-1] + (
        path_of_base_function[-1] + "_" + grouping_key,
    )
else:
    # 'name' is not derived
    path_of_function_argument = _get_tree_path_from_source_col_name(
        name=name,
        current_namespace=tree_path[:-1],
        functions_tree=functions_tree,
        data_tree=data_tree,
        aggregations_tree=aggregations_tree,
    )
B. Handling of functions that are not part of the functions or data tree
This is something more fundamental. Derived functions might stem from basic input columns (see TYPES_INPUT_VARIABLES in config.py). To put the derived functions into the correct namespace, we need to process the basic input columns, even if they are not part of the data tree. However, this means we would need to create time conversion and derived grouping functions not only for all data inputs and the functions tree, but also for the basic input columns.
To see why this is a bad solution, consider a user who runs GETTSIM with an empty functions tree. This user would still end up with stuff like elterngeld__nettoeinkommen_vorjahr_m, elterngeld__nettoeinkommen_vorjahr_y, ... in the functions tree after adding all derived functions.
This will be much easier once we have implemented #833.
Drafted implementation of _get_tree_path_from_source_col_name
def _get_tree_path_from_source_col_name(
    name: str,
    current_namespace: tuple[str, ...],
    functions_tree: NestedFunctionDict,
    data_tree: NestedDataDict,
    aggregations_tree: NestedAggregationSpecDict,
) -> tuple[str, ...]:
    """Get the tree path of a source column name that may be qualified or simple.

    If the name is a known qualified or top-level name, the path implied by the name
    is returned. Else, the name is interpreted relative to the current namespace.

    Parameters
    ----------
    name
        The qualified or simple name.
    current_namespace
        The namespace where 'name' is located.
    functions_tree
        The functions tree.
    data_tree
        The data tree.
    aggregations_tree
        The aggregation specifications provided by the environment.

    Returns
    -------
    The path of 'name' in the tree.
    """
    name_is_in_qualified_names = (
        name in dt.qual_names(data_tree)
        or name in dt.qual_names(functions_tree)
        or name in dt.qual_names(aggregations_tree)
        or name in TYPES_INPUT_VARIABLES
    )
    if dt.QUAL_NAME_DELIMITER in name:
        # Name is already namespaced (either fully or partially)
        name_parts = name.split(dt.QUAL_NAME_DELIMITER)
        if name_is_in_qualified_names:
            # Fully qualified valid name - use as is
            new_tree_path = name_parts
        else:
            # Partially namespaced, prepend current namespace
            new_tree_path = [*current_namespace, *name_parts]
    else:
        # Simple name without namespace delimiter
        if name_is_in_qualified_names:
            # It's a top-level name
            new_tree_path = [name]
        else:
            # Not namespaced and not at top level, use current namespace
            new_tree_path = [*current_namespace, name]
    return tuple(new_tree_path)
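To see how the draft behaves on the cases discussed above, here is a self-contained sketch. It is a condensed stand-in and not the GETTSIM implementation: qual_names is a hypothetical replacement for dt.qual_names, TOP_LEVEL_INPUTS stands in for TYPES_INPUT_VARIABLES, the delimiter is assumed to be "__", all trees are merged into a single set of known names, and a plain list replaces the pd.Series from example A to avoid the pandas dependency.

```python
QUAL_NAME_DELIMITER = "__"  # assumed value of dt.QUAL_NAME_DELIMITER
TOP_LEVEL_INPUTS = {"p_id", "hh_id"}  # stand-in for TYPES_INPUT_VARIABLES


def qual_names(tree: dict, prefix: tuple[str, ...] = ()) -> set[str]:
    """Hypothetical stand-in for dt.qual_names: qualified names of all leaves."""
    names: set[str] = set()
    for key, value in tree.items():
        if isinstance(value, dict):
            names |= qual_names(value, (*prefix, key))
        else:
            names.add(QUAL_NAME_DELIMITER.join((*prefix, key)))
    return names


def tree_path(
    name: str, current_namespace: tuple[str, ...], known_names: set[str]
) -> tuple[str, ...]:
    """Condensed form of the drafted lookup."""
    parts = tuple(name.split(QUAL_NAME_DELIMITER))
    if name in known_names:
        # Fully qualified or top-level name: use as is.
        return parts
    # Partially namespaced or simple name: relative to the current namespace.
    return (*current_namespace, *parts)


# Trees from example A.
functions_tree = {"n1": {"f": lambda n2__x_hh: n2__x_hh}}
data_tree = {"n2": {"x": [1]}}
known = qual_names(functions_tree) | qual_names(data_tree) | TOP_LEVEL_INPUTS

fully_qualified = tree_path("n2__x", ("n1",), known)
top_level = tree_path("hh_id", ("n1",), known)
simple = tree_path("g", ("n1",), known)
# The derived name is unknown, so it is wrongly placed below "n1".
example_a = tree_path("n2__x_hh", ("n1",), known)
```

The last call reproduces the failure from example A: the lookup returns ("n1", "n2", "x_hh") instead of ("n2", "x_hh"), because the grouping suffix makes the name unknown to every tree.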