Interface: Add DataFrame to NestedDataDict conversion #876
Merged
18 commits
0f5b6ee Draft of df_to_data_tree. (MImmesberger)
5835a07 Return NestedDataDict. (MImmesberger)
cbfd9d6 Fix some issues when checking the data tree. Make df optional. (MImmesberger)
08c3298 Merge branch 'collect-components-of-namespaces' into df-to-tree (MImmesberger)
0c73204 Draft quickrun function. (MImmesberger)
e831c1d Make df required again. (MImmesberger)
778303a Fix imports-n (MImmesberger)
dcf3e26 Update test and change imports. (MImmesberger)
193c26c Make types available on ttsim level. (MImmesberger)
b539339 Just remove the three type hints now to continue. (MImmesberger)
20ef4b2 Test fail_ifs. (MImmesberger)
1accf2e Remove some more type hints... (MImmesberger)
d8b859d Add type hints. (MImmesberger)
c3d8656 Remove potential (?) circular import. (MImmesberger)
b9fa5c3 Merge branch 'collect-components-of-namespaces' into df-to-tree (hmgaudecker)
3122c02 Fix imports, get annotations from future. (hmgaudecker)
f8f910b Review comments. (MImmesberger)
5838bba Review comments. (MImmesberger)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,89 @@ | ||
| from __future__ import annotations | ||
|
|
||
| from typing import TYPE_CHECKING | ||
|
|
||
| from _gettsim.config import RESOURCE_DIR, SUPPORTED_GROUPINGS | ||
| from ttsim import ( | ||
| compute_taxes_and_transfers, | ||
| create_data_tree_from_df, | ||
| set_up_policy_environment, | ||
| ) | ||
|
|
||
| if TYPE_CHECKING: | ||
| import pandas as pd | ||
| from dags.tree.typing import NestedTargetDict | ||
|
|
||
| from ttsim.typing import NestedDataDict, NestedInputsPathsToDfColumns | ||
|
|
||
|
|
||
| def oss( | ||
| date: str, | ||
| df: pd.DataFrame, | ||
| inputs_tree_to_df_columns: NestedInputsPathsToDfColumns, | ||
| targets_tree: NestedTargetDict, | ||
| ) -> NestedDataDict: | ||
| """One-stop-shop for computing taxes and transfers. | ||
|
|
||
| Args: | ||
| date: | ||
| The date to compute taxes and transfers for. The date determines the policy | ||
| environment for which the taxes and transfers are computed. | ||
| df: | ||
| The DataFrame containing the data. | ||
| inputs_tree_to_df_columns: | ||
| A nested dictionary that maps GETTSIM's expected input structure to the data | ||
| provided by the user. Keys are strings that provide a path to an input. | ||
|
|
||
| Values can be: | ||
| - Strings that reference column names in the DataFrame. | ||
| - Numeric or boolean values (which will be broadcast to match the length | ||
| of the DataFrame). | ||
| targets_tree: | ||
| The targets tree. | ||
|
|
||
|
|
||
| Examples: | ||
| -------- | ||
| >>> inputs_tree_to_df_columns = { | ||
| ... "einkommensteuer": { | ||
| ... "gemeinsam_veranlagt": "joint_taxation", | ||
| ... "einkünfte": { | ||
| ... "aus_nichtselbstständiger_arbeit": { | ||
| ... "bruttolohn_m": "gross_wage_m", | ||
| ... }, | ||
| ... }, | ||
| ... }, | ||
| ... "alter": 30, | ||
| ... "p_id": "p_id", | ||
| ... } | ||
| >>> df = pd.DataFrame( | ||
| ... { | ||
| ... "gross_wage_m": [1000, 2000, 3000], | ||
| ... "joint_taxation": [True, True, False], | ||
| ... "p_id": [0, 1, 2], | ||
| ... } | ||
| ... ) | ||
| >>> oss( | ||
| ... date="2024-01-01", | ||
| ... inputs_tree_to_df_columns=inputs_tree_to_df_columns, | ||
| ... targets_tree=targets_tree, | ||
| ... df=df, | ||
| ... ) | ||
| """ | ||
| data_tree = create_data_tree_from_df( | ||
| inputs_tree_to_df_columns=inputs_tree_to_df_columns, | ||
| df=df, | ||
| ) | ||
| policy_environment = set_up_policy_environment( | ||
| date=date, | ||
| resource_dir=RESOURCE_DIR, | ||
| ) | ||
| return compute_taxes_and_transfers( | ||
| data_tree=data_tree, | ||
| environment=policy_environment, | ||
| targets_tree=targets_tree, | ||
| supported_groupings=SUPPORTED_GROUPINGS, | ||
| rounding=True, | ||
| debug=False, | ||
| jit=False, | ||
| ) | ||
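
The mapping passed to `oss` mixes column references and constants. Judging from the `create_data_tree_from_df` diff in this PR, a string value that is not found in `df.columns` is broadcast as a constant, so a misspelled column name would not raise. The following is a minimal sketch of a pre-flight check for that case. It is independent of GETTSIM, and the helper name `unmatched_column_references` and the `/`-joined path format are assumptions made for illustration; neither is part of the PR.

```python
from __future__ import annotations

import pandas as pd


def unmatched_column_references(
    mapping: dict, df: pd.DataFrame, prefix: tuple = ()
) -> list[str]:
    """Return paths of string leaves that do not name a column of `df`.

    Hypothetical helper for illustration only; not part of the PR.
    """
    missing = []
    for key, value in mapping.items():
        path = (*prefix, key)
        if isinstance(value, dict):
            # Recurse into nested namespaces.
            missing.extend(unmatched_column_references(value, df, path))
        elif isinstance(value, str) and value not in df.columns:
            # A string leaf that matches no column would be broadcast as-is.
            missing.append("/".join(path))
    return missing


df = pd.DataFrame(
    {
        "gross_wage_m": [1000, 2000, 3000],
        "joint_taxation": [True, True, False],
        "p_id": [0, 1, 2],
    }
)
mapping = {
    "einkommensteuer": {
        "gemeinsam_veranlagt": "joint_taxation",
        "einkünfte": {
            "aus_nichtselbstständiger_arbeit": {
                "bruttolohn_m": "gross_wages_m",  # deliberate typo
            },
        },
    },
    "alter": 30,  # constant, not a column reference
    "p_id": "p_id",
}
missing = unmatched_column_references(mapping, df)
print(missing)
```

Only string leaves are checked, because integers and booleans are meant to be broadcast.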
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,162 @@ | ||
| from __future__ import annotations | ||
|
|
||
| from typing import TYPE_CHECKING | ||
|
|
||
| import dags.tree as dt | ||
| import optree | ||
| import pandas as pd | ||
|
|
||
| from ttsim.shared import format_errors_and_warnings, format_list_linewise | ||
|
|
||
| if TYPE_CHECKING: | ||
| from ttsim.typing import NestedDataDict, NestedInputsPathsToDfColumns | ||
|
|
||
|
|
||
| def create_data_tree_from_df( | ||
| inputs_tree_to_df_columns: NestedInputsPathsToDfColumns, | ||
| df: pd.DataFrame, | ||
| ) -> NestedDataDict: | ||
| """Transform a pandas DataFrame to a nested dictionary expected by TTSIM. | ||
|
|
||
| Args | ||
| ---- | ||
| inputs_tree_to_df_columns: | ||
| A nested dictionary that defines the structure of the output tree. Keys | ||
| are strings that define the nested structure. Values can be: | ||
|
|
||
| - Strings that reference column names in the DataFrame. | ||
| - Numeric or boolean values (which will be broadcast to match the | ||
| DataFrame length). | ||
| df: | ||
| The pandas DataFrame containing the source data. | ||
|
|
||
| Returns | ||
| ------- | ||
| A nested dictionary structure containing the data organized according to the | ||
| mapping definition. | ||
|
|
||
| Examples | ||
| -------- | ||
| >>> df = pd.DataFrame({ | ||
| ... "a": [1, 2, 3], | ||
| ... "b": [4, 5, 6], | ||
| ... "c": [7, 8, 9], | ||
| ... }) | ||
| >>> inputs_tree_to_df_columns = { | ||
| ... "n1": { | ||
| ... "n2": "a", | ||
| ... "n3": "b", | ||
| ... }, | ||
| ... "n4": 3, | ||
| ... } | ||
| >>> result = create_data_tree_from_df( | ||
| ... inputs_tree_to_df_columns=inputs_tree_to_df_columns, | ||
| ... df=df, | ||
| ... ) | ||
| >>> result | ||
| { | ||
| "n1": { | ||
| "n2": pd.Series([1, 2, 3]), | ||
| "n3": pd.Series([4, 5, 6]), | ||
| }, | ||
| "n4": pd.Series([3, 3, 3]), | ||
| } | ||
|
|
||
|
|
||
| """ | ||
| _fail_if_df_has_bool_or_numeric_column_names(df) | ||
| _fail_if_mapper_has_incorrect_format(inputs_tree_to_df_columns) | ||
|
|
||
| qualified_inputs_tree_to_df_columns = dt.flatten_to_qual_names( | ||
| inputs_tree_to_df_columns | ||
| ) | ||
|
|
||
| name_to_input_series = {} | ||
| for ( | ||
| qualified_input_name, | ||
| input_value, | ||
| ) in qualified_inputs_tree_to_df_columns.items(): | ||
| if input_value in df.columns: | ||
| name_to_input_series[qualified_input_name] = df[input_value] | ||
| else: | ||
| name_to_input_series[qualified_input_name] = pd.Series( | ||
| [input_value] * len(df), | ||
| index=df.index, | ||
| ) | ||
|
|
||
| return dt.unflatten_from_qual_names(name_to_input_series) | ||
|
|
||
|
|
||
| def _fail_if_mapper_has_incorrect_format( | ||
| inputs_tree_to_df_columns: NestedInputsPathsToDfColumns, | ||
| ) -> None: | ||
| """Fail if the input tree to column name mapping has an incorrect format.""" | ||
| if not isinstance(inputs_tree_to_df_columns, dict): | ||
| msg = format_errors_and_warnings( | ||
| """The input tree to column mapping must be a (nested) dictionary. Call | ||
| `create_input_structure` to create a template.""" | ||
| ) | ||
| raise TypeError(msg) | ||
|
|
||
| non_string_paths = [ | ||
| str(path) | ||
| for path in optree.tree_paths(inputs_tree_to_df_columns, none_is_leaf=True) | ||
| if not all(isinstance(part, str) for part in path) | ||
| ] | ||
| if non_string_paths: | ||
| msg = format_errors_and_warnings( | ||
| f"""All path elements of `inputs_tree_to_df_columns` must be strings. | ||
| Found the following paths that contain non-string elements: | ||
|
|
||
| {format_list_linewise(non_string_paths)} | ||
|
|
||
| Call `create_input_structure` to create a template. | ||
| """ | ||
| ) | ||
| raise TypeError(msg) | ||
|
|
||
| incorrect_types = { | ||
| k: type(v) | ||
| for k, v in dt.flatten_to_qual_names(inputs_tree_to_df_columns).items() | ||
| if not isinstance(v, str | int | bool) | ||
| } | ||
| if incorrect_types: | ||
| formatted_incorrect_types = "\n".join( | ||
| f" - {k}: {v.__name__}" for k, v in incorrect_types.items() | ||
| ) | ||
| msg = format_errors_and_warnings( | ||
| f"""Values of the input tree to column mapping must be strings, integers, | ||
| or booleans. | ||
| Found the following incorrect types: | ||
|
|
||
| {formatted_incorrect_types} | ||
| """ | ||
| ) | ||
| raise TypeError(msg) | ||
|
|
||
|
|
||
| def _fail_if_df_has_bool_or_numeric_column_names(df: pd.DataFrame) -> None: | ||
| """Fail if the DataFrame has bool or numeric column names.""" | ||
| common_msg = format_errors_and_warnings( | ||
| """DataFrame column names cannot be booleans or numbers. This restriction | ||
| prevents ambiguity between actual column references and values intended for | ||
| broadcasting. | ||
| """ | ||
| ) | ||
| bool_column_names = [col for col in df.columns if isinstance(col, bool)] | ||
| numeric_column_names = [ | ||
| col | ||
| for col in df.columns | ||
| if isinstance(col, (int, float)) or (isinstance(col, str) and col.isnumeric()) | ||
| ] | ||
|
|
||
| if bool_column_names or numeric_column_names: | ||
| msg = format_errors_and_warnings( | ||
| f""" | ||
| {common_msg} | ||
|
|
||
| Boolean column names: {bool_column_names}. | ||
| Numeric column names: {numeric_column_names}. | ||
| """ | ||
| ) | ||
| raise ValueError(msg) | ||
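
The conversion in `create_data_tree_from_df` has three steps: flatten the mapping to qualified names, look up or broadcast each value, then unflatten. The sketch below reproduces them with plain pandas and no `dags` dependency. The `flatten` and `unflatten` helpers and the `"__"` separator are stand-ins assumed for illustration; the PR delegates both steps to `dags.tree`.

```python
import pandas as pd

SEP = "__"  # assumed separator for qualified names; the PR relies on dags.tree


def flatten(tree: dict, prefix: str = "") -> dict:
    """Flatten a nested dict into {qualified_name: leaf}."""
    flat = {}
    for key, value in tree.items():
        name = f"{prefix}{SEP}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def unflatten(flat: dict) -> dict:
    """Rebuild the nested dict from qualified names."""
    tree: dict = {}
    for name, value in flat.items():
        *parents, leaf = name.split(SEP)
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree


def data_tree_from_df(mapping: dict, df: pd.DataFrame) -> dict:
    """Simplified stand-in for the conversion shown in the diff."""
    series = {}
    for name, value in flatten(mapping).items():
        if value in df.columns:
            # Column reference: take the column as-is.
            series[name] = df[value]
        else:
            # Constant: broadcast to the length and index of the DataFrame.
            series[name] = pd.Series([value] * len(df), index=df.index)
    return unflatten(series)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
result = data_tree_from_df({"n1": {"n2": "a", "n3": "b"}, "n4": 3}, df)
print(result["n4"].tolist())
```

Broadcasting onto `df.index` keeps constant inputs aligned with the column-backed inputs when the DataFrame has a non-default index.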
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,4 @@ | ||
| -from typing import TYPE_CHECKING, NewType | ||
| +from typing import TYPE_CHECKING, Any, NewType | ||
|
|
||
| if TYPE_CHECKING: | ||
| from collections.abc import Mapping | ||
|
|
||
| @@ -25,6 +25,7 @@ | ||
| QualNamePolicyInputDict = Mapping[str, PolicyInput] | ||
|
|
||
| # Specialise from dags' NestedInputDict to GETTSIM's types. | ||
| NestedInputsPathsToDfColumns = Mapping[str, Any | "NestedInputsPathsToDfColumns"] | ||
| NestedDataDict = Mapping[str, pd.Series | "NestedDataDict"] | ||
|
Collaborator
Just a heads-up that the current type
||
| QualNameDataDict = Mapping[str, pd.Series] | ||
| NestedArrayDict = Mapping[str, np.ndarray | "NestedArrayDict"] | ||
|
|
||
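
The alias `NestedInputsPathsToDfColumns` uses `Any` for its leaves, while `_fail_if_mapper_has_incorrect_format` accepts only strings, integers, and booleans at runtime. The sketch below shows one possible stricter recursive alias with a matching runtime check. The names `MapperLeaf`, `StrictNestedInputsPathsToDfColumns`, and `leaves_match_alias` are hypothetical and do not appear in the PR.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Union

# Hypothetical stricter leaf type, mirroring the runtime validation in the PR.
MapperLeaf = Union[str, int, bool]

# Recursive alias; the forward reference is given as a string.
StrictNestedInputsPathsToDfColumns = Mapping[
    str, Union[MapperLeaf, "StrictNestedInputsPathsToDfColumns"]
]


def leaves_match_alias(tree: Mapping) -> bool:
    """Check at runtime that keys are strings and leaves are str, int, or bool."""
    for key, value in tree.items():
        if not isinstance(key, str):
            return False
        if isinstance(value, Mapping):
            if not leaves_match_alias(value):
                return False
        elif not isinstance(value, (str, int, bool)):
            return False
    return True


print(leaves_match_alias({"n1": {"n2": "a", "n3": "b"}, "n4": 3}))
print(leaves_match_alias({"n1": 1.5}))
```

A static type checker would flag a float leaf against the stricter alias, whereas the `Any`-based alias accepts it and leaves detection to the runtime check.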