I recently read the paper Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (https://huggingface.co/papers/2502.05171), and it got me thinking about how this could work with llama.cpp. The idea of dynamically scaling compute at inference time by iterating in the latent space instead of generating extra tokens seems pretty interesting, especially for local LLM setups.
Would something like this be feasible with GGUF models in the future? I’m curious...
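For context, the core idea as I understand it can be sketched as a fixed-point-style loop: instead of emitting extra reasoning tokens, the model re-applies a recurrent core block to its latent state, and the iteration count becomes a test-time compute knob. The sketch below is purely illustrative — the function names, weight layout, and `tanh` update are my own stand-ins, not the paper's architecture or anything in llama.cpp:

```python
import numpy as np

def recurrent_latent_step(state, embedded_input, weights):
    # One iteration of a hypothetical recurrent core: combine the current
    # latent state with the fixed embedded input, then apply a nonlinearity.
    return np.tanh(state @ weights["recur"] + embedded_input @ weights["inp"])

def latent_reasoning_forward(embedded_input, weights, num_iterations):
    # Spend more compute at inference time by iterating the same block in
    # latent space before decoding, rather than generating more tokens.
    state = np.zeros_like(embedded_input)  # initial latent state
    for _ in range(num_iterations):  # test-time knob: more iters = more compute
        state = recurrent_latent_step(state, embedded_input, weights)
    return state  # this would then feed the output head / decoder

# Toy usage: the same weights run at two different compute budgets.
rng = np.random.default_rng(0)
d = 8
weights = {"recur": rng.normal(scale=0.1, size=(d, d)),
           "inp": rng.normal(scale=0.1, size=(d, d))}
x = rng.normal(size=(1, d))
shallow = latent_reasoning_forward(x, weights, num_iterations=2)
deep = latent_reasoning_forward(x, weights, num_iterations=32)
```

The interesting implementation question for llama.cpp would be exactly this loop: the compute graph would need to re-run one block a variable number of times per token, which is different from the usual fixed-depth forward pass.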