ENH: Make top-level inputs and functions besides p_id and hh_id possible #848

@MImmesberger

Description

In #841 we added support for defining p_id and hh_id in the top-level namespace. This currently works for these two columns only, because they are never used as aggregation source columns. Supporting more variables in the top-level namespace requires changes to the infrastructure.

Essentially, all changes come down to improving the qualified-name check in _get_tree_path_from_source_col_name. Today it only tests whether "__" is in the name. It needs to handle qualified names without double underscores (top-level inputs), partially namespaced arguments (i.e. namespaced relative to the current module), and fully qualified or simple names as before. Below is a draft.
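To make the limitation concrete, here is a minimal sketch of a pure delimiter check. It is a simplified stand-in, not the actual GETTSIM code: `path_by_delimiter_only`, the top-level input `age`, and the namespace `sub` are hypothetical, and the delimiter is assumed to match `dt.QUAL_NAME_DELIMITER`.

```python
QUAL_NAME_DELIMITER = "__"  # assumed to match dt.QUAL_NAME_DELIMITER


def path_by_delimiter_only(
    name: str, current_namespace: tuple[str, ...]
) -> tuple[str, ...]:
    """Resolve a name using only the '"__" in name' test."""
    if QUAL_NAME_DELIMITER in name:
        # Treated as fully qualified, whatever the trees contain.
        return tuple(name.split(QUAL_NAME_DELIMITER))
    # Treated as living in the current namespace.
    return (*current_namespace, name)


# Fully qualified argument: resolved as intended.
fully_qualified = path_by_delimiter_only("n2__x", ("n1",))
# Hypothetical top-level input 'age': ends up inside 'n1'.
top_level = path_by_delimiter_only("age", ("n1",))
# Name meant relative to 'n1' (i.e. n1__sub__y): the current namespace is lost.
relative = path_by_delimiter_only("sub__y", ("n1",))
```

Only the first case resolves correctly. The other two cannot be told apart from the name alone, which is why the trees have to be consulted.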

We need _get_tree_path_from_source_col_name for two things:

  1. To find the source column in the functions or data tree to derive annotations.
  2. To determine the position of (automatically) derived aggregation functions (e.g. when x_hh is used as a function argument) when putting them into the functions tree.

To determine whether a given source column is already fully (or partly) namespaced, or belongs to the top-level namespace or the current one, we need to look for it in the functions, data, or aggregations tree. This is not trivial because of time conversions and group aggregations. Consider these two examples:

A. _get_tree_path_from_source_col_name fails because of grouping level

Consider the functions tree {"n1": {"f": lambda n2__x_hh: n2__x_hh}}. The data tree is {"n2": {"x": pd.Series([1])}}. The tree path of the argument of ("n1", "f") should be ("n2", "x_hh"). Because x_hh is not part of the data tree (only x is), the name is not recognized as fully qualified, the current namespace is prepended, and the result is ("n1", "n2", "x_hh"). One potential remedy would be something like this:

    source_function_is_derived = (
        name not in dt.qual_names(aggregations_tree)
        and name not in dt.qual_names(data_tree)
        and name not in dt.qual_names(functions_tree)
    )
    if source_function_is_derived:
        # 'name' is derived from another function that is already in the
        # aggregations, data, functions or basic input variables tree.
        base_function_name, grouping_key = name.rsplit("_", 1)
        path_of_base_function = _get_tree_path_from_source_col_name(
            name=base_function_name,
            current_namespace=tree_path[:-1],
            functions_tree=functions_tree,
            data_tree=data_tree,
            aggregations_tree=aggregations_tree,
        )
        path_of_function_argument = path_of_base_function[:-1] + (
            path_of_base_function[-1] + "_" + grouping_key,
        )
    else:
        # 'name' is not derived
        path_of_function_argument = _get_tree_path_from_source_col_name(
            name=name,
            current_namespace=tree_path[:-1],
            functions_tree=functions_tree,
            data_tree=data_tree,
            aggregations_tree=aggregations_tree,
        )
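The remedy can be tried out in isolation with a self-contained sketch. Everything here is an assumption for illustration: `qual_names` is a hypothetical stand-in for `dt.qual_names` on plain nested dicts, `resolve` is a simplified resolver, and a list replaces the pandas Series. Like the snippet above, it treats whatever follows the last underscore as the grouping key without checking that it is a supported grouping level.

```python
QUAL_NAME_DELIMITER = "__"  # assumed to match dt.QUAL_NAME_DELIMITER


def qual_names(tree: dict, prefix: tuple[str, ...] = ()) -> set[str]:
    """Hypothetical stand-in for dt.qual_names on plain nested dicts."""
    names: set[str] = set()
    for key, value in tree.items():
        path = (*prefix, key)
        if isinstance(value, dict):
            names |= qual_names(value, path)
        else:
            names.add(QUAL_NAME_DELIMITER.join(path))
    return names


def resolve(name: str, current_namespace: tuple[str, ...], known: set[str]):
    """Simplified resolver: known names are absolute, all others relative."""
    parts = name.split(QUAL_NAME_DELIMITER)
    if name in known:
        return tuple(parts)
    return (*current_namespace, *parts)


def resolve_with_grouping(
    name: str, current_namespace: tuple[str, ...], known: set[str]
):
    """Strip a trailing grouping key, resolve the base name, re-attach the key."""
    leaf = name.rsplit(QUAL_NAME_DELIMITER, 1)[-1]
    if name in known or "_" not in leaf:
        # Nothing to strip: the name is known or has no suffix.
        return resolve(name, current_namespace, known)
    base_name, grouping_key = name.rsplit("_", 1)
    base_path = resolve(base_name, current_namespace, known)
    return (*base_path[:-1], f"{base_path[-1]}_{grouping_key}")


functions_tree = {"n1": {"f": lambda n2__x_hh: n2__x_hh}}
data_tree = {"n2": {"x": [1]}}
known = qual_names(functions_tree) | qual_names(data_tree)

without_remedy = resolve("n2__x_hh", ("n1",), known)
with_remedy = resolve_with_grouping("n2__x_hh", ("n1",), known)
```

Here `without_remedy` is ("n1", "n2", "x_hh") and `with_remedy` is ("n2", "x_hh"), because the base name n2__x is found in the data tree.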

B. Handling of functions that are not part of the functions or data tree

This is something more fundamental. Derived functions might stem from basic input columns (see TYPES_INPUT_VARIABLES in config.py). To put the derived functions into the correct namespace, we need to process the basic input columns, even if they are not part of the data tree. However, this means we would need to create time conversion and derived grouping functions not only for all data inputs and the functions tree, but also for the basic input columns.

To see why this is a bad solution, consider a user who runs GETTSIM with an empty functions tree. This user would still end up with entries like elterngeld__nettoeinkommen_vorjahr_m, elterngeld__nettoeinkommen_vorjahr_y, ... in the functions tree after adding all derived functions.
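A toy sketch of that effect, under loud assumptions: `BASIC_INPUTS` is a hypothetical one-entry excerpt (the real TYPES_INPUT_VARIABLES may be structured differently), `TIME_SUFFIXES` only lists the two units named above, and `derive_time_conversions` is an illustrative helper, not GETTSIM's derivation logic.

```python
# Hypothetical excerpt of the basic input columns.
BASIC_INPUTS = {"elterngeld__nettoeinkommen_vorjahr_m": float}
TIME_SUFFIXES = ("y", "m")  # assumption: only the two units named above


def derive_time_conversions(names: set[str]) -> set[str]:
    """Return the time-converted siblings of every name with a time suffix."""
    derived: set[str] = set()
    for name in names:
        base, _, unit = name.rpartition("_")
        if base and unit in TIME_SUFFIXES:
            derived |= {f"{base}_{other}" for other in TIME_SUFFIXES if other != unit}
    return derived


user_functions: set[str] = set()  # the user passes an empty functions tree

derived_from_tree_only = derive_time_conversions(user_functions)
derived_with_basic_inputs = derive_time_conversions(user_functions | set(BASIC_INPUTS))
```

With only the user's tree as a source nothing is derived. Once the basic inputs are processed too, elterngeld__nettoeinkommen_vorjahr_y appears even though the user never asked for anything in that namespace.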

This will be much easier once we have implemented #833.


Drafted implementation of _get_tree_path_from_source_col_name

def _get_tree_path_from_source_col_name(
    name: str,
    current_namespace: tuple[str, ...],
    functions_tree: NestedFunctionDict,
    data_tree: NestedDataDict,
    aggregations_tree: NestedAggregationSpecDict,
) -> tuple[str, ...]:
    """Get the tree path of a source column name that may be qualified or simple.

    If the name is a known qualified name (found in the data, functions, or
    aggregations tree, or among the basic input variables), the path implied by the
    name is returned. This covers fully qualified names and top-level names.
    Otherwise, the name is taken to be relative to the current namespace, and the
    current namespace plus the path implied by the name is returned.

    Parameters
    ----------
    name
        The qualified or simple name.
    current_namespace
        The namespace where 'name' is located.
    functions_tree
        The functions tree.
    data_tree
        The data tree.
    aggregations_tree
        The aggregation specifications provided by the environment.

    Returns
    -------
    The path of 'name' in the tree.
    """

    name_is_in_qualified_names = (
        name in dt.qual_names(data_tree)
        or name in dt.qual_names(functions_tree)
        or name in dt.qual_names(aggregations_tree)
        or name in TYPES_INPUT_VARIABLES
    )

    if dt.QUAL_NAME_DELIMITER in name:
        # Name is already namespaced (either fully or partially)
        name_parts = name.split(dt.QUAL_NAME_DELIMITER)
        if name_is_in_qualified_names:
            # Fully qualified valid name - use as is
            new_tree_path = name_parts
        else:
            # Partially namespaced, prepend current namespace
            new_tree_path = list(current_namespace) + name_parts
    else:
        # Simple name without namespace delimiter
        if name_is_in_qualified_names:
            # It's a top-level name
            new_tree_path = [name]
        else:
            # Not namespaced and not at top level, use current namespace
            new_tree_path = [*current_namespace, name]

    return tuple(new_tree_path)
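To try the draft without GETTSIM installed, here is a self-contained sketch of the same decision logic. The names are assumptions: `qual_names` is a hypothetical stand-in for `dt.qual_names` on plain nested dicts, `get_tree_path` is a condensed version of the draft, and the `TYPES_INPUT_VARIABLES` excerpt is made up. Note that the draft's four branches reduce to two cases: a known name maps to its own parts, and any other name gets the current namespace prepended.

```python
QUAL_NAME_DELIMITER = "__"  # assumed to match dt.QUAL_NAME_DELIMITER
TYPES_INPUT_VARIABLES = {"p_id": int, "hh_id": int}  # hypothetical excerpt


def qual_names(tree: dict, prefix: tuple[str, ...] = ()) -> set[str]:
    """Hypothetical stand-in for dt.qual_names on plain nested dicts."""
    names: set[str] = set()
    for key, value in tree.items():
        path = (*prefix, key)
        if isinstance(value, dict):
            names |= qual_names(value, path)
        else:
            names.add(QUAL_NAME_DELIMITER.join(path))
    return names


def get_tree_path(
    name: str, current_namespace: tuple[str, ...], *trees: dict
) -> tuple[str, ...]:
    """Condensed version of the draft: known names are absolute, others relative."""
    known = set(TYPES_INPUT_VARIABLES).union(*(qual_names(t) for t in trees))
    parts = name.split(QUAL_NAME_DELIMITER)
    if name in known:
        return tuple(parts)
    return (*current_namespace, *parts)


functions_tree = {"n1": {"f": lambda x: x, "sub": {"g": lambda: 1}}}
data_tree = {"n2": {"x": [1]}}
trees = (functions_tree, data_tree)

fully_qualified = get_tree_path("n2__x", ("n1",), *trees)
partially_namespaced = get_tree_path("sub__g", ("n1",), *trees)
top_level = get_tree_path("p_id", ("n1",), *trees)
simple = get_tree_path("y", ("n1",), *trees)
```

The four calls cover one branch of the draft each: fully qualified, partially namespaced, top-level, and simple name in the current namespace.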
