Conversation
Breaking changes:
- `EOD_TOKEN_ID` -> `EOT_TOKEN_ID`
- `temp` -> `temperature`
- `current_part` is now zero-indexed
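To illustrate what these renames mean for calling code, here is a minimal hedged sketch; the struct and constant below are local stand-ins so the snippet compiles on its own, not the crate's real definitions, and the token value is an assumption.

```rust
// Hypothetical stand-ins for illustration only; the real items live in
// llama-rs and have more fields.
struct InferenceParameters {
    temperature: f32, // was `temp` before this PR
}

const EOT_TOKEN_ID: i32 = 2; // was `EOD_TOKEN_ID`; the value 2 is assumed

fn is_end_of_text(token: i32) -> bool {
    token == EOT_TOKEN_ID
}

fn main() {
    let params = InferenceParameters { temperature: 0.8 };
    println!("sampling with temperature {}", params.temperature);
    println!("token 2 ends the text: {}", is_end_of_text(2));
}
```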
This is good to go aside from the AS fix (need to test on work laptop). Something I noticed is this signature:

```rust
pub fn inference_with_prompt<E: std::error::Error + 'static>(
    &mut self,
    model: &Model,
    vocab: &Vocabulary,
    inference_params: &InferenceParameters,
    inference_with_prompt_params: &InferenceWithPromptParameters,
    prompt: &str,
    rng: &mut impl rand::Rng,
    callback: impl Fn(&str) -> Result<(), E>,
) -> Result<InferenceStats, InferenceError> {
}
```

We could also potentially move the […] Is there a context in which you would use the […]
You'd have to add a lifetime to the struct and maybe it would need to be mutable also, so probably not worth it. (Also, that's a lot of tasks checked off. Nice work!)
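To make that trade-off concrete, here is a minimal sketch of bundling the borrowed arguments into a request struct; `InferenceRequest` and the stand-in `Model`/`Vocabulary` types are hypothetical, not from this PR, but they show why such a struct would need a lifetime parameter and a mutable borrow for the RNG.

```rust
// Hypothetical sketch: grouping inference_with_prompt's borrowed arguments.
// Holding borrows forces a lifetime parameter on the struct, and the RNG
// needs a mutable borrow because sampling advances its state.
use rand::Rng;

struct Model;
struct Vocabulary;

struct InferenceRequest<'a, R: Rng> {
    model: &'a Model,
    vocab: &'a Vocabulary,
    prompt: &'a str,
    rng: &'a mut R,
}

fn run<R: Rng>(req: InferenceRequest<'_, R>) {
    // A real implementation would tokenize `req.prompt` with `req.vocab`,
    // evaluate `req.model`, and sample tokens with `req.rng`.
    let _ = (req.model, req.vocab, req.prompt);
    let _sample: u32 = req.rng.gen();
}

fn main() {
    let (model, vocab) = (Model, Vocabulary);
    let mut rng = rand::thread_rng();
    run(InferenceRequest {
        model: &model,
        vocab: &vocab,
        prompt: "Hello",
        rng: &mut rng,
    });
}
```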
I'm on Apple Silicon and I haven't felt any performance issues versus the original implementation of llama.cpp in C. What are the performance issues?
We don't compile with any flags on ARM, so the compiler won't use NEON, etc. Easy enough to fix, just haven't done it yet.
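As a rough illustration of the fix being described (not the build script from this PR), a `cc`-based build.rs could pass architecture flags on aarch64 and link Accelerate on Apple targets; the file paths and exact flags below are assumptions.

```rust
// build.rs — hypothetical sketch of enabling NEON-friendly flags and
// Accelerate when compiling ggml with the `cc` crate. Paths and flag
// choices are assumptions, not taken from this PR's diff.
fn main() {
    let target = std::env::var("TARGET").unwrap_or_default();
    let mut build = cc::Build::new();
    build.file("ggml/ggml.c").include("ggml");

    if target.contains("aarch64") {
        // Let the C compiler target the host CPU so it can emit NEON code;
        // `flag_if_supported` skips flags the compiler doesn't understand.
        build.flag_if_supported("-mcpu=native");
    }
    if target.contains("apple") {
        // Route matrix/vector routines through Apple's Accelerate framework.
        build.define("GGML_USE_ACCELERATE", None);
        println!("cargo:rustc-link-lib=framework=Accelerate");
    }

    build.compile("ggml");
}
```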
Thanks a lot for the changes! :) Everything looks good, there's just a very minor nitpick comment from my end.
Also, why does `ggml.rs` appear as a removed file? The code seems to still be using it, and I don't see where that code has moved to 🤔
This builds with Accelerate and uses performance cores for CPU count.
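For reference, one way to derive a thread count from the performance cores on Apple Silicon (a sketch, not necessarily what this change does) is the `hw.perflevel0.physicalcpu` sysctl, with a fallback to the total available parallelism elsewhere.

```rust
use std::process::Command;

// Hypothetical sketch: default the thread count to the number of performance
// cores on Apple Silicon. `hw.perflevel0.physicalcpu` reports the
// performance-core count on macOS; other platforms use all available cores.
fn default_thread_count() -> usize {
    if cfg!(target_os = "macos") {
        let perf_cores = Command::new("sysctl")
            .args(["-n", "hw.perflevel0.physicalcpu"])
            .output()
            .ok()
            .and_then(|out| String::from_utf8(out.stdout).ok())
            .and_then(|s| s.trim().parse::<usize>().ok());
        if let Some(n) = perf_cores {
            return n;
        }
    }
    // Fallback: everything the OS reports as available.
    std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

fn main() {
    println!("defaulting to {} threads", default_thread_count());
}
```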
It's moved to […]
LGTM! 😄
Mostly housekeeping.
Closes #57.
Still to do:
- llama-rs: `usize` over `i32` where possible #18
- `RMSNorm` for normalization #80
- `InferenceSession` `Clone`-able #48
- `infer_next_token` still works after end of text #44