ENH: Make top-level inputs and functions besides `p_id` and `hh_id` possible #848

In #841 we added support for defining `p_id` and `hh_id` in the top-level namespace. This currently works for these two columns only because they are never used as aggregation source columns. Supporting more variables in the top-level namespace requires changes to the infrastructure.

Essentially, all changes come down to improving the qualified name check in `_get_tree_path_from_source_col_name` from a pure `"__" in name` test to one that can handle top-level inputs (qualified names without double underscores), partially namespaced arguments (i.e. namespaced relative to the current module), and fully qualified or simple name arguments as before. A draft is included below.

We need `_get_tree_path_from_source_col_name` for two things:
1. To find the source column in the functions or data tree to derive annotations.
2. To determine the position of (automatically) derived aggregation functions (e.g. when `x_hh` is used as a function argument) when putting them into the functions tree.

To determine whether a given source column is already fully (or partly) namespaced, or belongs to the top-level namespace or the current one, we need to look for it in the functions, data, or aggregations tree. This is not trivial because of time conversions and group aggregations. Consider these two examples:

A. `_get_tree_path_from_source_col_name` fails because of grouping level

Consider the functions tree `{"n1": {"f": lambda n2__x_hh: n2__x_hh}}` and the data tree `{"n2": {"x": pd.Series([1])}}`. The tree path of the argument of `("n1", "f")` should be `("n2", "x_hh")`. However, because `n2__x_hh` itself is not part of the data tree (only `n2__x` is), the name is treated as partially namespaced and the result is `("n1", "n2", "x_hh")`. One potential remedy would be something like this:

```python
    source_function_is_derived = (
        name not in dt.qual_names(aggregations_tree)
        and name not in dt.qual_names(data_tree)
        and name not in dt.qual_names(functions_tree)
    )
    if source_function_is_derived:
        # 'name' is derived from another function that is already in the
        # aggregations, data, functions or basic input variables tree.
        base_function_name, grouping_key = name.rsplit("_", 1)
        path_of_base_function = _get_tree_path_from_source_col_name(
            name=base_function_name,
            current_namespace=tree_path[:-1],
            functions_tree=functions_tree,
            data_tree=data_tree,
            aggregations_tree=aggregations_tree,
        )
        path_of_function_argument = path_of_base_function[:-1] + (
            path_of_base_function[-1] + "_" + grouping_key,
        )
    else:
        # 'name' is not derived
        path_of_function_argument = _get_tree_path_from_source_col_name(
            name=name,
            current_namespace=tree_path[:-1],
            functions_tree=functions_tree,
            data_tree=data_tree,
            aggregations_tree=aggregations_tree,
        )
```
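
To make example A concrete, the following self-contained sketch contrasts the naive lookup with the remedy above. `qual_names`, `resolve`, and `resolve_with_grouping` are simplified stand-ins written for this illustration (assumptions, not the actual `dt` or GETTSIM API), and the aggregations tree is left out for brevity.

```python
DELIM = "__"  # assumed to match dt.QUAL_NAME_DELIMITER


def qual_names(tree, prefix=()):
    """Hypothetical stand-in for dt.qual_names: flatten a nested dict."""
    names = []
    for key, value in tree.items():
        path = (*prefix, key)
        if isinstance(value, dict):
            names.extend(qual_names(value, path))
        else:
            names.append(DELIM.join(path))
    return names


def resolve(name, current_namespace, known_names):
    """Naive lookup: known names are absolute, all others are relative."""
    parts = tuple(name.split(DELIM))
    if name in known_names:
        return parts
    return (*current_namespace, *parts)


def resolve_with_grouping(name, current_namespace, known_names):
    """Resolve the base column first, then re-attach the grouping suffix."""
    if name in known_names:
        return resolve(name, current_namespace, known_names)
    base_name, grouping_key = name.rsplit("_", 1)
    base_path = resolve(base_name, current_namespace, known_names)
    return (*base_path[:-1], f"{base_path[-1]}_{grouping_key}")


functions_tree = {"n1": {"f": lambda n2__x_hh: n2__x_hh}}
data_tree = {"n2": {"x": [1]}}  # a plain list stands in for pd.Series([1])
known = set(qual_names(functions_tree)) | set(qual_names(data_tree))

naive = resolve("n2__x_hh", ("n1",), known)
fixed = resolve_with_grouping("n2__x_hh", ("n1",), known)
print(naive)  # ('n1', 'n2', 'x_hh')
print(fixed)  # ('n2', 'x_hh')
```

Note that splitting on the last underscore treats every unknown name as derived. A real implementation would probably need to check the suffix against the known grouping levels and time units first.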

B. Handling of functions that are not part of the functions or data tree

This is something more fundamental. Derived functions might stem from basic input columns (see `TYPES_INPUT_VARIABLES` in `config.py`). To put the derived functions into the correct namespace, we need to process the basic input columns, even if they are not part of the data tree. However, this means we need to create time conversion and derived grouping functions not only for all data inputs and the functions tree, but also for the basic input columns.

To see why this is a bad solution, consider a user who runs GETTSIM with an empty functions tree. This user would still end up with entries like `elterngeld__nettoeinkommen_vorjahr_m`, `elterngeld__nettoeinkommen_vorjahr_y`, ... in the functions tree after all derived functions have been added.

This will be much easier once #833 is implemented.

---
Drafted implementation of `_get_tree_path_from_source_col_name`


```python
def _get_tree_path_from_source_col_name(
    name: str,
    current_namespace: tuple[str, ...],
    functions_tree: NestedFunctionDict,
    data_tree: NestedDataDict,
    aggregations_tree: NestedAggregationSpecDict,
) -> tuple[str, ...]:
    """Get the tree path of a source column name that may be qualified or simple.

    A name found in the data, functions, or aggregations tree (or among the basic
    input variables) is treated as absolute and its own path is returned. Any other
    name is treated as relative and the current namespace is prepended.

    Parameters
    ----------
    name
        The qualified or simple name.
    current_namespace
        The namespace of the function that references 'name'.
    functions_tree
        The functions tree.
    data_tree
        The data tree.
    aggregations_tree
        The aggregation specifications provided by the environment.

    Returns
    -------
    The path of 'name' in the tree.
    """

    name_is_in_qualified_names = (
        name in dt.qual_names(data_tree)
        or name in dt.qual_names(functions_tree)
        or name in dt.qual_names(aggregations_tree)
        or name in TYPES_INPUT_VARIABLES
    )

    if dt.QUAL_NAME_DELIMITER in name:
        # Name is already namespaced (either fully or partially)
        name_parts = name.split(dt.QUAL_NAME_DELIMITER)
        if name_is_in_qualified_names:
            # Fully qualified valid name - use as is
            new_tree_path = name_parts
        else:
            # Partially namespaced, prepend current namespace
            new_tree_path = list(current_namespace) + name_parts
    else:
        # Simple name without namespace delimiter
        if name_is_in_qualified_names:
            # It's a top-level name
            new_tree_path = [name]
        else:
            # Not namespaced and not at top level, use current namespace
            new_tree_path = [*current_namespace, name]

    return tuple(new_tree_path)
```
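
As a rough illustration of the four branches in the draft, here is a minimal runnable sketch. It collapses all trees into one set of known qualified names and uses a hypothetical `qual_names` helper in place of `dt.qual_names`. The tree contents and the helper `tree_path` are made up for the example.

```python
DELIM = "__"  # assumed value of dt.QUAL_NAME_DELIMITER


def qual_names(tree, prefix=()):
    """Hypothetical stand-in for dt.qual_names: flatten a nested dict."""
    names = []
    for key, value in tree.items():
        path = (*prefix, key)
        if isinstance(value, dict):
            names.extend(qual_names(value, path))
        else:
            names.append(DELIM.join(path))
    return names


def tree_path(name, current_namespace, known_names):
    """Same branching as the draft, with all trees merged into known_names."""
    is_known = name in known_names
    if DELIM in name:
        parts = name.split(DELIM)
        # Known: fully qualified. Unknown: relative to the current namespace.
        return tuple(parts) if is_known else (*current_namespace, *parts)
    # Known: top-level name. Unknown: simple name in the current namespace.
    return (name,) if is_known else (*current_namespace, name)


# Made-up data tree; plain lists stand in for pandas Series.
data_tree = {"p_id": [0], "n2": {"x": [1]}, "n1": {"sub": {"y": [2]}}}
known = set(qual_names(data_tree))

print(tree_path("p_id", ("n1",), known))    # ('p_id',)
print(tree_path("n2__x", ("n1",), known))   # ('n2', 'x')
print(tree_path("sub__y", ("n1",), known))  # ('n1', 'sub', 'y')
print(tree_path("z", ("n1",), known))       # ('n1', 'z')
```

One consequence of this branching is that a simple name which exists at the top level always resolves there, even if the current namespace defines a column of the same name.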
