Conversation

@JoelNiklaus (Contributor)

Some models are very expensive to run inference on (e.g., Llama-3.3-70B). When we need to rerun inference, for example to add a new metric, it is very time-consuming and expensive, especially since at least four 80 GB GPUs are required for inference.

We might want to add a flag to enable/disable caching. We might also want it for the other methods, such as loglikelihood, too (see the sketch below).
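
A minimal sketch of what such a flag could look like, assuming hypothetical names (`cache_predictions`, `DummyModel`, and the method shown here are illustrative, not lighteval's actual API). It keys cached results on a hash of the serialized requests and bypasses the cache entirely when the flag is off:

```python
import hashlib
import json
from functools import wraps
from pathlib import Path


def cache_predictions(cache_dir: str, enabled: bool = True):
    """Cache JSON-serializable model outputs on disk, keyed by a hash of the requests."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(self, requests, *args, **kwargs):
            if not enabled:
                # Flag off: always run real inference.
                return fn(self, requests, *args, **kwargs)
            key = hashlib.sha256(
                json.dumps([str(r) for r in requests], sort_keys=True).encode()
            ).hexdigest()
            path = Path(cache_dir) / f"{fn.__name__}_{key}.json"
            if path.exists():
                # Cache hit: skip the expensive GPU run.
                return json.loads(path.read_text())
            results = fn(self, requests, *args, **kwargs)
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(results))
            return results
        return wrapper
    return decorator


class DummyModel:
    # The same decorator could wrap loglikelihood-style methods as well.
    @cache_predictions(cache_dir="./inference_cache", enabled=True)
    def greedy_until(self, requests):
        return [f"generated text for {r!r}" for r in requests]


if __name__ == "__main__":
    model = DummyModel()
    print(model.greedy_until(["What is 2+2?"]))  # runs inference, writes the cache
    print(model.greedy_until(["What is 2+2?"]))  # served from the disk cache
```

Results must be JSON-serializable for this sketch; lighteval's actual response objects would need explicit (de)serialization.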

@NathanHB (Member)

Great addition! Ping us when ready :)

@JoelNiklaus (Contributor, Author)

Thanks, I don't know when I will have the capacity to add it to the other methods.

@JoelNiklaus (Contributor, Author)

This might not be necessary anymore with PR #488.

@clefourrier (Member)

Want us to close this one?

@JoelNiklaus (Contributor, Author)

I personally think it would still be nice to have caching here too, but it is no longer strictly necessary for me.

@JoelNiklaus (Contributor, Author)

It would still be useful for making local inference of large models more robust.

@NathanHB closed this on Aug 26, 2025