
feat: add StandardState #32

Merged
merged 41 commits into main from feat/default-state-from-main
Jul 21, 2023

Conversation

@hollandjg (Member) commented Jul 12, 2023

Description

Add a new StandardState object which can be used as the default for future experimentalists, experiment runners and theorists.

Type of change

  • fix: A bug fix
  • feat: A new feature

Features

Add a new StandardState object with

  • variables
  • experiment_data
  • conditions
  • models
  • a `model` alias which returns the last model in the list.

Make some fixes and refactors in the autora.state.delta module.
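
A minimal usage sketch (the import path, constructor arguments and defaults shown here are assumptions based on this description, not verbatim code from the PR):

import pandas as pd
from autora.variable import Variable, VariableCollection
from autora.state.bundled import StandardState  # module path assumed from this PR

# Assumed fields: variables, conditions, experiment_data, models
s = StandardState(
    variables=VariableCollection(
        independent_variables=[Variable(name="x")],
        dependent_variables=[Variable(name="y")],
    ),
    conditions=pd.DataFrame({"x": [1, 2, 3]}),
    experiment_data=pd.DataFrame({"x": [], "y": []}),
    models=[],
)
print(s.models)  # [] until a theorist adds a fitted model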

Questions (Optional)

  • Are the field names understandable?
  • Do they match the agreed names?
  • Are there better names we could use?

Details

Aliases

Aliases work like this:

s = SomeState(models=[FittedModel0(), FittedModel1()]) # `models` is the list of models
t = s + Delta(model=FittedModel2())  # !! `model` is singular and not in a list!

...and get the following back as t:

SomeState(models=[FittedModel0(), FittedModel1(), FittedModel2()])

This is required for our Theorist interface which may by default return a single model:

def a_theorist(experiment_data):
    # x_names, y_names: the independent / dependent variable column names
    X, y = experiment_data[x_names], experiment_data[y_names]
    model = ModelFitter().fit(X, y)
    return Result(model=model)  # singular `model`, mapped onto `models` by the alias

Without this feature, every theorist would have to return a list of models, which is a pain and feels wrong: return Result(models=[model])
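
For illustration only, here is a self-contained sketch of how such an alias could work. This is not the actual autora.state.delta implementation; the class, the lambda-based alias metadata and the __add__ logic are invented for this example:

from dataclasses import dataclass, field, replace
from typing import Any, Dict, List

Delta = dict  # stand-in: a Delta is just a mapping from field (or alias) names to values

@dataclass(frozen=True)
class SomeState:
    # `model` in a Delta is treated as an alias for extending `models`
    models: List[Any] = field(
        default_factory=list,
        metadata={"aliases": {"model": lambda new: [new]}},
    )

    def __add__(self, delta: Dict[str, Any]) -> "SomeState":
        updated = dict(self.__dict__)
        for key, value in delta.items():
            if key in updated:  # plural field addressed directly: extend it
                updated[key] = updated[key] + value
            else:  # otherwise look for a field which declares this key as an alias
                for name, f in self.__dataclass_fields__.items():
                    wrap = f.metadata.get("aliases", {}).get(key)
                    if wrap is not None:
                        updated[name] = updated[name] + wrap(value)
        return replace(self, **updated)

s = SomeState(models=["FittedModel0", "FittedModel1"])
t = s + Delta(model="FittedModel2")
print(t.models)  # ['FittedModel0', 'FittedModel1', 'FittedModel2']

In the real autora.state.delta module the per-field behaviour is likewise configured through field metadata (see the "delta"/"converter" metadata discussed further down), but the details differ from this sketch.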

hollandjg and others added 25 commits July 10, 2023 11:32
@hollandjg marked this pull request as ready for review July 12, 2023 21:27
@younesStrittmatter (Contributor) left a comment

Looks great!

@benwandrew (Contributor) left a comment

this is good, cool stuff! my only hesitation so far regarding names is models vs model (and, more generally, things (the full list) vs thing (the last element)); can we somehow more clearly differentiate the last model and the list?

src/autora/state/bundled.py:
>>> (s + dm1 + dm2).models
[DummyClassifier(constant=1), DummyClassifier(constant=2), DummyClassifier(constant=3)]

The last model is available under the `model` property:
(Contributor)

i wonder if we want a clearer distinction between the models and model properties? could get a little confusing for users since they're so close in name... although, admittedly, it's hard to think of an alternative that keeps single words

@musslick (Contributor)

Hmm... how about model vs. model_list or model_set? Are those better than models?

(Contributor)

@musslick i'm fine with either of those! i'm also fine with models and model if others think it's not actually going to be an issue; just trying to be maximally clear while still being Pythonic :)

@hollandjg (Member, Author)

On reflection, I feel like I still prefer models rather than anything longer – it feels a bit more natural to me. Combined with the type annotations (like BaseEstimator vs List[BaseEstimator]) and type checking, I feel like the risk of accidental confusion is sufficiently small to be unproblematic. I'm open to persuasion there though.
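
For concreteness, a sketch of the kind of annotations being referred to; this is an illustrative fragment, not the class definition from this PR:

from dataclasses import dataclass, field
from typing import List, Optional
from sklearn.base import BaseEstimator

@dataclass(frozen=True)
class StandardState:
    models: List[BaseEstimator] = field(default_factory=list)  # full history of fitted models

    @property
    def model(self) -> Optional[BaseEstimator]:
        """The last fitted model, or None if nothing has been fitted yet."""
        return self.models[-1] if self.models else None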

(Contributor)

sounds good, i'm happy with models! it was only a slight hesitation on my part, and i think it is indeed quite natural.

@benwandrew (Contributor) commented Jul 13, 2023

a few more comments:

  1. initially i was confused by some of the output displayed in the "Creating Generators With State Based Functions" section of the notebook. specifically,

     [screenshot of notebook output omitted]

     it seemed like the outputs were reversed with respect to what was being described. i think the issue is that the output is correct on the first pass through running the cells (as i verified), but if you go back up and run a cell above again it will keep cycling while the descriptive text remains the same. no substantive error, it just threw me off given what was displayed when you first open and read through the notebook.

  2. in the subsequent section — "Adding The Experimentalist" — i encountered the following when running the first cell:

     [screenshot of error output omitted]

     note, everything up to that point executed successfully.

@musslick (Contributor) left a comment

Looks great, just had a question regarding numpy <> pandas conversion.

... metadata={"delta": "replace",
... "converter": np.asarray})

Here we pass a dataframe, but expect a numpy array:
@musslick (Contributor)

Quick question: do we also allow casting from pandas to numpy? I think we should allow that, since most people will be feeding the theorists with numpy arrays (it's just simpler).

@hollandjg (Member, Author) commented Jul 17, 2023
Yes! Casting from pandas to numpy is totally possible – using the np.asarray converter like on line 170, if the Delta includes a DataFrame, it will be converted to a numpy array.

You can always put a pd.DataFrame through np.asarray to get a numpy array.

My recommendation would be to have the functions and classes which want to represent the data internally as numpy arrays accept np.typing.ArrayLike as the input type, which allows for lots of input types – lists of values, or DataFrames or np.ndarrays – and then do a np.asarray() call at the top of the function to turn the input into the array you want.

It's really hard to do the casting properly from outside these functions, as we'd have to inspect the function signatures and work out whether the type we have is compatible. It's much simpler and more efficient to do this within the function itself, especially if we have a general interchange format like pd.DataFrame as our standard.
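
For illustration, a minimal sketch of that pattern (the function name, the least-squares fit and the column names are invented for this example; np.asarray and npt.ArrayLike are the only pieces being recommended):

import numpy as np
import numpy.typing as npt
import pandas as pd

def fit_linear(X: npt.ArrayLike, y: npt.ArrayLike) -> np.ndarray:
    """Accept anything array-like and convert it up front."""
    X = np.asarray(X)  # handles lists, np.ndarray, pd.DataFrame, ...
    y = np.asarray(y)
    # work with plain numpy arrays from here on; a least-squares fit
    # is used only to make the example runnable
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# A pd.DataFrame (or pd.Series) passes straight through np.asarray:
df = pd.DataFrame({"x0": [0.0, 1.0, 2.0], "x1": [1.0, 1.0, 1.0]})
y = pd.Series([1.0, 3.0, 5.0])
print(fit_linear(df, y))  # [2. 1.]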


@hollandjg (Member, Author)

> seemed like the outputs were reversed with respect to what was being described

They were. Thanks for catching this! It should be in order now. Using generators is tricky; perhaps it's not something we should really recommend.

@hollandjg (Member, Author)

> 2. in the subsequent section — "Adding The Experimentalist" — i encountered the following when running the first cell:

This is because the new experimentalist isn't yet defined – that comes in #33. I've taken out that block from this PR but reintroduce it (in a fixed form) in #32.

@benwandrew (Contributor)

> This is because the new experimentalist isn't yet defined – that comes in #33. I've taken out that block from this PR but reintroduce it (in a fixed form) in #32.

ok, that makes sense.

@benwandrew (Contributor) left a comment

per the end of our conversation in group today, and with all previous requested changes/questions addressed, i'm happy to get this merged!


@hollandjg added this pull request to the merge queue Jul 21, 2023
Merged via the queue into main with commit 689e192 Jul 21, 2023
13 checks passed
@hollandjg deleted the feat/default-state-from-main branch July 21, 2023 15:06
hollandjg added a commit that referenced this pull request Nov 29, 2023
ci: update nb-clean hook to remove empty cells