Facility for model comparison #64

Open
fgregg opened this issue May 15, 2015 · 5 comments

fgregg (Contributor) commented May 15, 2015

  • Have the current model output its predictions in CSV
  • Write a naive model script that also outputs predictions in CSV
  • Write a script that takes the prediction CSVs and reports model performance on the test sample

When developing alternate models, this final script will make it easy to evaluate them against the current one.
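
A rough sketch of what that final script could look like, assuming each prediction CSV carries an inspection ID and a score, and that the test-sample outcomes live in their own CSV (all file and column names below are made up):

```r
# Sketch: read prediction CSVs and report performance on the test sample.
# `inspection_id`, `score`, and `violation` are placeholder column names.

outcomes <- read.csv("test_outcomes.csv")      # inspection_id, violation (0/1)

report <- function(pred_file) {
  preds <- read.csv(pred_file)                 # inspection_id, score
  dat <- merge(preds, outcomes, by = "inspection_id")
  predicted <- as.integer(dat$score > 0.5)     # naive cutoff, for illustration only
  c(accuracy = mean(predicted == dat$violation),
    hit_rate = mean(dat$violation[predicted == 1]))
}

sapply(c("model_predictions.csv", "naive_model_predictions.csv"), report)
```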

orborde commented May 22, 2015

Looking at the code and white paper, it seems like the way you evaluated the model was to use the glm output to create an inspection "schedule" (a list of the order in which to conduct the inspections) and then analyze how quickly that schedule located the violations, as opposed to looking at the model's confusion matrix or other traditional measures of model performance.

So an evaluation script should probably take the "schedule" as input and rerun the analyses in the white paper to compute some metrics. I'm planning on hacking one together today in the course of trying some other ML techniques on this dataset.
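
Something along these lines, assuming the schedule is just an ordered vector of inspection IDs and we know which of those inspections actually turned up critical violations (names are hypothetical):

```r
# Sketch: evaluate a proposed schedule (inspection IDs in the order they would
# be performed) by how early the known violations appear in it.
evaluate_schedule <- function(schedule_ids, violation_ids) {
  hits  <- schedule_ids %in% violation_ids
  curve <- cumsum(hits) / length(violation_ids)   # fraction of violations found after k inspections
  list(mean_position_of_violation = mean(which(hits)),              # lower is better
       found_after_25pct = curve[ceiling(length(schedule_ids) * 0.25)])
}

# e.g. evaluate_schedule(schedule_ids = my_schedule, violation_ids = observed_violations)
```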

geneorama (Member) commented

@orborde you are exactly correct. The individual glm scores are used to sort the inspections into a schedule, and that schedule matters more than the individual scores. I don't know of a way to directly optimize schedule performance; hopefully optimizing the scores results in a better schedule.
(btw, thanks for introducing the word schedule. That's a useful addition to the vocabulary of this project.)
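
For instance, going from scores to a schedule is just a sort (hypothetical column names):

```r
# Sketch: scores -> schedule. Inspect establishments in decreasing score order.
scores   <- read.csv("glm_scores.csv")     # establishment_id, score (placeholders)
schedule <- scores$establishment_id[order(scores$score, decreasing = TRUE)]
head(schedule)                             # the first establishments to visit
```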

geneorama (Member) commented

Sorry that this has been taking so long; I've been busy with a few other things.

Here's an update on what I'm thinking for the plan:

Refactor the 30 script to only "run the model"; specifically (a rough sketch follows this list):

  • Import pre-calculated features and raw data
  • Put the data into a form that works for the model
    • Convert to proper class (e.g. matrix / numeric)
    • Manage factors (currently with model.matrix)
  • Create test / train index
  • Run model
  • Calculate prediction (test and train)
  • Save prediction
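
A rough sketch of those steps, using a plain logistic glm as a stand-in for the actual model, with made-up file names, column names, and cutoff date:

```r
# Sketch of the slimmed-down "run the model" script. File names, column names,
# the cutoff date, and the use of a plain logistic glm are all placeholders.

# Import pre-calculated features and raw data
dat <- na.omit(read.csv("precalculated_features.csv"))
dat$Inspection_Date <- as.Date(dat$Inspection_Date)

# Put the data into a form that works for the model:
# convert to proper class and manage factors with model.matrix
dat$Inspector <- factor(dat$Inspector)
X <- model.matrix(~ Inspector + pastCritical + ageAtInspection, data = dat)
y <- as.numeric(dat$criticalFound)

# Create test / train index (time-based)
iiTrain <- which(dat$Inspection_Date <  as.Date("2014-07-01"))
iiTest  <- which(dat$Inspection_Date >= as.Date("2014-07-01"))

# Run model
fit <- glm.fit(x = X[iiTrain, ], y = y[iiTrain], family = binomial())

# Calculate prediction (test and train)
dat$score <- plogis(as.numeric(X %*% fit$coefficients))

# Save prediction
write.csv(dat[, c("Inspection_ID", "score")], "model_30_predictions.csv", row.names = FALSE)
```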

The plots and benchmarks should go into another report / file, which will provide a clearer comparison.

I was thinking it would be nice to make a demonstration "31" file that has an alternative model, and an accompanying report that compares the results between 30 and 31. That way someone could just pick up from there and have a facility for comparison.

For the 31 demonstration file, I was thinking it would be nice to simply use the past "average" value, similar to how the baselines look in Kaggle competitions. Rather than having a "submission", the user could compile results in the report. To guard against overfitting, we would check that the results still make sense on even more recent data.
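
As a sketch, the 31 baseline could be as simple as scoring each test-period inspection by the establishment's past average violation rate (again, made-up names):

```r
# Sketch of a naive "31" baseline: score each test-period inspection by the
# establishment's past average violation rate. Names are placeholders.
dat <- read.csv("precalculated_features.csv")
dat$Inspection_Date <- as.Date(dat$Inspection_Date)
train <- dat[dat$Inspection_Date <  as.Date("2014-07-01"), ]
test  <- dat[dat$Inspection_Date >= as.Date("2014-07-01"), ]

past_avg <- aggregate(criticalFound ~ License, data = train, FUN = mean)
names(past_avg)[2] <- "score"

baseline <- merge(test[, c("Inspection_ID", "License")], past_avg,
                  by = "License", all.x = TRUE)
baseline$score[is.na(baseline$score)] <- mean(train$criticalFound)  # unseen licenses: overall rate
write.csv(baseline[, c("Inspection_ID", "score")], "model_31_predictions.csv", row.names = FALSE)
```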

This might be a separate issue, but perhaps it would be nice to publish the 40 prediction scripts. The model uses the food inspection history, but the prediction uses current business licenses as the basis. Ultimately the logic in the prediction script would be important for testing on new samples, especially if this is going to be an ongoing evaluation. @tomschenkjr - you may have some thoughts on this?

geneorama (Member) commented

@orborde or @fgregg Do you have any recommendations for best practices for model comparison?

As I mentioned in @orborde's pull request, the format of the food inspection data has changed dramatically as of last year, and there is a need to reconsider the model.


orborde commented Apr 1, 2019

I don't have any "best practices" in mind offhand. I do think that generating an inspection schedule and simulating to see how quickly that schedule finds violations, or how efficiently (in terms of number of violations per inspection performed), is a solid approach.

Note that you'll need to be careful not to directly evaluate your inspection schedule on the data used to train the model generating that schedule. See https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
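
Concretely, a time-based split is probably the simplest safeguard here: fit on earlier inspections, build the schedule for later ones, and only score it there. A sketch with hypothetical names:

```r
# Sketch: fit only on earlier inspections, schedule and score only the later ones.
dat <- na.omit(read.csv("precalculated_features.csv"))   # placeholder file/columns
dat$Inspection_Date <- as.Date(dat$Inspection_Date)

cutoff <- as.Date("2014-07-01")
train  <- dat[dat$Inspection_Date <  cutoff, ]
test   <- dat[dat$Inspection_Date >= cutoff, ]

fit <- glm(criticalFound ~ pastCritical + ageAtInspection, data = train, family = binomial)
test$score <- predict(fit, newdata = test, type = "response")

# Schedule the held-out period by score and see how fast violations turn up
ord   <- order(test$score, decreasing = TRUE)
found <- cumsum(test$criticalFound[ord]) / sum(test$criticalFound)
plot(found, type = "l", xlab = "inspections performed", ylab = "fraction of violations found")
```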

Beyond that, I don't know enough about your problem space to give you more specific advice. Let me know what you wind up trying, though!
