Skip to content

Conversation

@hmgaudecker
Copy link
Collaborator

@hmgaudecker hmgaudecker commented May 1, 2025

What problem do you want to solve?

A brief shot at trying to do the same for GETTSIM what #879 did for METTSIM.

  • Good news: test_full_taxes_and_transfers runs nicely!
  • Bad news: Many tests fail because the logic in dividing up the taxes and transfers function is not elaborate enough. In the first pass, we are trying to build all ids. However, in many cases we are missing the required input data (made-up example: Einkommensteuer tests may require calculation of sn_id, but won't have all inputs required for bg_id).

Solving this should be doable (first set up the entire graph, then check which ids are needed), but the required functions are buried inside of compute_taxes_and_transfers so we should not waste time on that before implementing the new interface.

hmgaudecker and others added 30 commits April 22, 2025 15:13
…n.to_source. In limited set of experiments, it produced exactly the same result.
…g tests. Fix typing and some isinstance checks.
… will do in the office where I had merged things but not pushed.
@hmgaudecker hmgaudecker requested a review from MImmesberger May 1, 2025 11:59
@hmgaudecker
Copy link
Collaborator Author

Update: I changed the default of vectorization_strategy to be vectorize now and added loop only where required. Should not be too hard to adjust most of these cases, eventually, I probably added a few too much.

What I did not realise until recently is that jax.jit and vectorising the functions are not nested. I had always thought of the latter as a prerequisite of the former... @timmens @mj023, I guess it will be mainly compilation speed / efficiency that we mostly want to have vectorised functions to be jitted?

@codecov
Copy link

codecov bot commented May 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.69%. Comparing base (7980a57) to head (04f4847).
Report is 1 commits behind head on collect-components-of-namespaces.

Additional details and impacted files
@@                          Coverage Diff                          @@
##           collect-components-of-namespaces     #891       +/-   ##
=====================================================================
- Coverage                             83.57%   69.69%   -13.88%     
=====================================================================
  Files                                   147      147               
  Lines                                  5704     5702        -2     
=====================================================================
- Hits                                   4767     3974      -793     
- Misses                                  937     1728      +791     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@MImmesberger MImmesberger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the logic of calculating IDs beforehand and calculate the rest of the DAG afterwards fails for stuff like bg_id and wthh_id. Their dependencies consist basically of almost the entire DAG. (That's probably what you meant, I just want to make sure we're one the same page). So users would not get speed benefit if their data is large, but only if they want to call GETTSIM repeatedly on the same (stable - with respect to BD and WTHH) household constellation.

@@ -51,7 +56,7 @@ def __init__(
date: datetime.date,
) -> None:
self.info = info
self.input_tree = input_tree
self.input_tree = optree.tree_map(np.array, input_tree)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we want np.array([1, 2, 3]) instead of [np.array(1), np.array(2), np.array(3)] here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what happens in my book 😉

(Pdb+) input_tree
{
    'alter':
        0    67
        1    34
        2    37
        3    48
        dtype: int64, 
        ...
}
(Pdb+) self.input_tree
{'alter': array([67, 34, 37, 48]), ... }

@hmgaudecker
Copy link
Collaborator Author

I think the logic of calculating IDs beforehand and calculate the rest of the DAG afterwards fails for stuff like bg_id and wthh_id. Their dependencies consist basically of almost the entire DAG. (That's probably what you meant, I just want to make sure we're one the same page). So users would not get speed benefit if their data is large, but only if they want to call GETTSIM repeatedly on the same (stable - with respect to BD and WTHH) household constellation.

And yes, I'd think that will be a fairly common use case. And of course, one could do 3+ calls of GETTSIM; just that we can't / probably don't want to bake that into gettsim.oss.

@hmgaudecker hmgaudecker merged commit d9248f9 into collect-components-of-namespaces May 1, 2025
8 of 9 checks passed
@hmgaudecker hmgaudecker deleted the vectorize-gettsim branch May 1, 2025 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants