[0.2.dev3] Choice simulation without capacity constraints #43
This all seems reasonably straightforward to me. I wonder though, if the pandas conversion is what's causing the slowdown, whether it's really worth it instead of just making it pure numpy? I agree the clean API is nice to have, but do we expect many users to actually leverage it down at this level? Certainly most UrbanSim users would not. Users who want to leverage choicemodels outside of an UrbanSim context definitely might want to access these lower-level functions, but in that scenario they are probably less likely to be working with pandas objects to begin with, maybe? Ultimately I agree it's not a huge performance hit, and it shouldn't have a big impact on overall UrbanSim runtimes. Just food for thought.
@mxndrwgrdnr Thanks for taking a look at this. Yeah, these are good points. I guess we could also leave pandas formats as the default but provide an option for passing numpy arrays directly, which we could use in the UrbanSim templates to maximize performance. I'll look into this more as I'm building out the capacity-constrained choice simulation, which I think will have much worse performance.
This PR adds functionality for efficient Monte Carlo simulation of choices for a set of K scenarios, each having different probability distributions (and potentially different alternatives). Choices are independent and unconstrained, meaning that the same alternative can be chosen in multiple scenarios.
This is a component of issue #26. With this PR, we have full support in ChoiceModels for unconstrained choice simulation. The next PR will handle capacity constraints. A separate PR in UrbanSim Templates will provide access to this logic.
Discussion
This PR adds a tool called choicemodels.tools.monte_carlo_choices().
Using this is equivalent to applying np.random.choice() to each of K scenarios, but it's implemented as a single-pass matrix calculation. This is about 50x faster than using df.apply() or a loop. The algorithm is adapted from urbansim.urbanchoice.
For cases where all the choice scenarios have the same probability distribution among alternatives, you don't need this function. You can use np.random.choice() with size=K, which will be more efficient. (For example, that would work for a choice model whose expression includes only attributes of the alternatives.)
PR includes a unit test that confirms the simulated choices align with the provided probabilities.
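For readers who want to see the single-pass idea concretely, here is a minimal numpy sketch. It is not the code in this PR: the function name monte_carlo_choices_sketch and its (K, J) matrix input are assumptions made for illustration. It draws one uniform number per scenario and compares it against the row-wise cumulative probabilities, then checks the simulated shares against the inputs in the spirit of the unit test mentioned above.

```python
import numpy as np


def monte_carlo_choices_sketch(probs, rng=None):
    """Draw one choice per row of a (K, J) probability matrix.

    Hypothetical helper for illustration only; not the ChoiceModels API.
    Returns an array of K column positions (0 .. J-1).
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=float)
    # Row-wise cumulative probabilities; the last column is (about) 1.0.
    cum = np.cumsum(probs, axis=1)
    # One uniform draw per scenario, scaled by the row total so that small
    # rounding errors in the probabilities do not matter.
    draws = rng.random(probs.shape[0]) * cum[:, -1]
    # The chosen column is the number of cumulative thresholds the draw has
    # reached; zero-probability alternatives are skipped automatically.
    chosen = (draws[:, None] >= cum).sum(axis=1)
    # Guard against a floating-point edge case where draw == row total.
    return np.minimum(chosen, probs.shape[1] - 1)


rng = np.random.default_rng(0)
target = np.array([0.1, 0.2, 0.7])
probs = np.tile(target, (100_000, 1))

choices = monte_carlo_choices_sketch(probs, rng)
shares = np.bincount(choices, minlength=3) / len(choices)

# When every scenario shares one distribution, a single call is enough.
shortcut = rng.choice(3, size=100_000, p=target)
shortcut_shares = np.bincount(shortcut, minlength=3) / len(shortcut)
```

Both shares and shortcut_shares should land close to the target probabilities; the matrix version only earns its keep when the rows differ from one another.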
Usage
This is implemented as a general-purpose function that can accept any list of indexed probabilities -- so it will work with output from our own MNL estimator, or PyLogit, or future model types. It can be called directly or used as the back end for a model template.
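As a rough illustration of what "indexed probabilities" could look like, the sketch below takes a pandas Series with a two-level MultiIndex (scenario id, alternative id) and returns one chosen alternative id per scenario. The helper name choices_from_indexed_probs, the index level names, and the requirement that every scenario has the same number of alternatives stored contiguously are all assumptions of this sketch, not a description of the function added in this PR.

```python
import numpy as np
import pandas as pd


def choices_from_indexed_probs(probs, rng=None):
    """Simulate one choice per scenario from long-format probabilities.

    Hypothetical helper for illustration. `probs` is a pd.Series whose
    MultiIndex has scenario ids in level 0 and alternative ids in level 1.
    Assumes rows are grouped by scenario and each scenario has the same
    number of alternatives (the ids themselves may differ by scenario).
    """
    rng = np.random.default_rng() if rng is None else rng
    scenario_ids = probs.index.get_level_values(0).unique()  # order of appearance
    alt_ids = probs.index.get_level_values(1).to_numpy()
    k = len(scenario_ids)
    j = len(probs) // k

    # Reshape the long-format data into (K, J) matrices.
    p = probs.to_numpy(dtype=float).reshape(k, j)
    alts = alt_ids.reshape(k, j)

    # Single-pass draw: compare one uniform number per row to the row's
    # cumulative probabilities.
    cum = p.cumsum(axis=1)
    draws = rng.random(k) * cum[:, -1]
    pos = np.minimum((draws[:, None] >= cum).sum(axis=1), j - 1)

    chosen = alts[np.arange(k), pos]
    return pd.Series(chosen, index=scenario_ids, name="chosen_alt")


# Two scenarios with different alternatives and degenerate probabilities,
# so the outcome is known in advance.
index = pd.MultiIndex.from_tuples(
    [(1, "a"), (1, "b"), (2, "c"), (2, "d")], names=["obs_id", "alt_id"]
)
probs = pd.Series([1.0, 0.0, 0.0, 1.0], index=index)
result = choices_from_indexed_probs(probs, np.random.default_rng(0))
```

Because the input is just a Series of probabilities, the same shape of data could come from any estimator that can report a probability per scenario and alternative.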
Performance
Overall the performance is excellent, especially compared to df.apply() as noted above.
Simulating choices is faster than calculating choice probabilities from the MNL utility equations. For 1 million choice scenarios with 10 alternatives each, calculating the probabilities takes 1.0 seconds and then simulating choices takes 0.5 seconds, on an old i5 MacBook.
Although this seems fine in absolute terms, it's worth noting that it's a little bit slower than the 100%-numpy implementation in the original urbansim.urbanchoice codebase. It looks like this is caused by overhead from requiring the probabilities to be formatted as an indexed pandas object.
Profiling indicates that 65% of the execution time, and the vast majority of memory usage, comes from a couple of initial pandas operations. The numpy matrix math is very efficient in comparison.
I think for now, the clean data format is worth the performance hit. But I'd like to go through and do more careful profiling of other parts of the codebase in light of this.
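Anyone who wants to reproduce this kind of comparison on their own machine could start from a small timing harness like the one below. It is only a sketch: loop_choices, vectorized_choices, and time_call are hypothetical names, the problem size is kept small so it runs quickly, and the timings it prints will vary by hardware and will not match the figures quoted above.

```python
import time

import numpy as np


def loop_choices(probs, rng):
    """Baseline: one rng.choice() call per scenario (row)."""
    j = probs.shape[1]
    return np.array([rng.choice(j, p=row) for row in probs])


def vectorized_choices(probs, rng):
    """Single-pass version: one uniform draw per row against the cumsum."""
    cum = probs.cumsum(axis=1)
    draws = rng.random(len(probs)) * cum[:, -1]
    pos = (draws[:, None] >= cum).sum(axis=1)
    return np.minimum(pos, probs.shape[1] - 1)


def time_call(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


rng = np.random.default_rng(0)
# 5,000 scenarios with 10 alternatives each and a different distribution per row.
raw = rng.random((5_000, 10))
probs = raw / raw.sum(axis=1, keepdims=True)

loop_out, loop_secs = time_call(loop_choices, probs, rng)
vec_out, vec_secs = time_call(vectorized_choices, probs, rng)

print(f"loop:       {loop_secs:.4f} s")
print(f"vectorized: {vec_secs:.4f} s")
```

To isolate the pandas overhead discussed above, the same harness could time a version that first builds the matrix from an indexed Series and compare it with the pure-numpy path.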
Other changes
MultinomialLogitResults() constructor and makes the estimation_engine parameter optional
MultinomialLogitResults.probabilities()
Versioning