Now that a cache mechanism is in place as of #15, there are times when the server is idle. This idle time could be used to speculatively send requests to the server and cache the responses ahead of time. If a prediction turns out to be wrong, the client can simply discard those results from the cache. Essentially, this would let us make predictions in advance and backtrack when they are incorrect, maximizing the utilization of the server.
Here is a scenario when this could be useful:
The user has the following code with a suggestion.
While waiting for the user to accept or reject the suggestion, we could make another request that assumes the user has already accepted the current suggestion, and cache that response. If the user then accepts, the next suggestion is already cached and shows up much more quickly.
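The flow above could be sketched roughly as follows. This is a minimal, synchronous illustration with made-up names (`SpeculativeCache`, `fetch`, `on_reject` are all hypothetical, and a real client would issue the speculative request asynchronously while the user is deciding):

```python
# Hypothetical sketch of speculative FIM caching; all names are illustrative.
class SpeculativeCache:
    def __init__(self, fetch):
        self.fetch = fetch   # (prefix, suffix) -> completion, i.e. the server call
        self.cache = {}      # (prefix, suffix) -> cached completion

    def suggest(self, prefix, suffix):
        key = (prefix, suffix)
        if key not in self.cache:
            self.cache[key] = self.fetch(prefix, suffix)
        suggestion = self.cache[key]
        # Speculate: assume the user accepts this suggestion and prefetch
        # the follow-up completion while the server would otherwise be idle.
        spec_key = (prefix + suggestion, suffix)
        if spec_key not in self.cache:
            self.cache[spec_key] = self.fetch(*spec_key)
        return suggestion

    def on_reject(self, prefix, suffix, suggestion):
        # Backtrack: the prediction was wrong, so drop the speculative entry.
        self.cache.pop((prefix + suggestion, suffix), None)
```

If the user accepts, the next call to `suggest` hits the prefetched cache entry; if they reject, `on_reject` evicts only the speculative result, so nothing else in the cache is disturbed.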
This is a great idea. Btw, it's better to call this feature with some different name from "speculative decoding" because this will conflict with the established meaning of this term. Maybe something like "speculative FIM" or "speculative suggestion/completion"?
VJHack changed the title from "cache: Speculative Decoding" to "cache: Speculative FIM" on Jan 2, 2025.