
Adds OpenAI compatible endpoint option #16

Open · wants to merge 7 commits into master

Conversation

@ohmeow commented Jan 27, 2025

This PR enables llama-vscode to talk to an OpenAI-compatible endpoint instead of a locally running llama.cpp server.

Tested against OpenAI's hosted endpoint as well as a local vLLM server endpoint.

I'm not sure I translated all of the llama.cpp arguments correctly for the OpenAI API, so I imagine we'll need a few iterations to get the hyperparameters and prompt right. I'm relatively new to this extension and to llama.cpp, so I'm happy to spend as much time as needed to get this right.

Thanks much - wg
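
For readers following along, here is a minimal TypeScript sketch of the kind of argument translation being discussed, assuming the FIM-style prompt template proposed later in this diff. The interface, field names, and sampling defaults are illustrative assumptions, not the PR's actual code.

```ts
// Illustrative only: how llama.cpp-style infill inputs might be folded into
// an OpenAI-style /v1/completions request body. The FIM tokens below match
// the template proposed later in this PR, but the right tokens depend on
// the model being served.

interface InfillInputs {
    inputPrefix: string; // text before the cursor (up to n_prefix lines)
    inputSuffix: string; // text after the cursor (up to n_suffix lines)
    prompt: string;      // the current line up to the cursor
    nPredict: number;    // llama.cpp n_predict -> OpenAI max_tokens
}

function toOpenAiCompletionBody(inputs: InfillInputs, model: string) {
    // Most OpenAI-compatible servers only take a flat prompt string, so the
    // prefix/suffix context is encoded with fill-in-the-middle tokens.
    const prompt =
        `<|fim_prefix|>${inputs.inputPrefix}${inputs.prompt}` +
        `<|fim_suffix|>${inputs.inputSuffix}<|fim_middle|>`;

    return {
        model,
        prompt,
        max_tokens: inputs.nPredict,
        temperature: 0.1, // low temperature for code completion (assumed default)
        stream: false,
    };
}
```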

@shishkin

Would love to use it with Ollama server as well.

@ggerganov (Member) left a comment

Let's merge after resolving the conflicts.

Comment on lines 4 to 34
export class Configuration {
-    // extension configs
-    enabled = true
-    endpoint = "http://127.0.0.1:8012"
-    auto = true
-    api_key = ""
-    n_prefix = 256
-    n_suffix = 64
-    n_predict = 128
-    t_max_prompt_ms = 500
-    t_max_predict_ms = 2500
-    show_info = true
-    max_line_suffix = 8
-    max_cache_keys = 250
-    ring_n_chunks = 16
-    ring_chunk_size = 64
-    ring_scope = 1024
-    ring_update_ms = 1000
-    language = "en"
-    // additional configs
-    axiosRequestConfig = {}
-    disabledLanguages: string[] = []
-    RING_UPDATE_MIN_TIME_LAST_COMPL = 3000
-    MIN_TIME_BETWEEN_COMPL = 600
-    MAX_LAST_PICK_LINE_DISTANCE = 32
-    MAX_QUEUED_CHUNKS = 16
-    DELAY_BEFORE_COMPL_REQUEST = 150
+    // extension configs
+    enabled = true;
+    endpoint = "http://127.0.0.1:8012";
+    is_openai_compatible = false;
+    openAiClient: OpenAI | null = null;
+    openAiClientModel: string | null = null;
+    opeanAiPromptTemplate: string = "<|fim_prefix|>{inputPrefix}{prompt}<|fim_suffix|>{inputSuffix}<|fim_middle|>";
+    auto = true;
+    api_key = "";
+    n_prefix = 256;
+    n_suffix = 64;
+    n_predict = 128;
+    t_max_prompt_ms = 500;
+    t_max_predict_ms = 2500;
+    show_info = true;
+    max_line_suffix = 8;
+    max_cache_keys = 250;
+    ring_n_chunks = 16;
+    ring_chunk_size = 64;
+    ring_scope = 1024;
+    ring_update_ms = 1000;
+    language = "en";
+    // additional configs
+    axiosRequestConfig = {};
+    disabledLanguages: string[] = [];
+    RING_UPDATE_MIN_TIME_LAST_COMPL = 3000;
+    MIN_TIME_BETWEEN_COMPL = 600;
+    MAX_LAST_PICK_LINE_DISTANCE = 32;
+    MAX_QUEUED_CHUNKS = 16;
+    DELAY_BEFORE_COMPL_REQUEST = 150;
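
As a rough illustration of how the new fields might be wired up, here is a sketch assuming the official openai npm package; the helper name, the config slice, and the key fallback are hypothetical and may differ from the PR's actual code.

```ts
import OpenAI from "openai";

// Minimal slice of the Configuration class above, limited to the fields this
// sketch needs (hypothetical wiring, not taken from the PR).
interface OpenAiConfig {
    is_openai_compatible: boolean;
    api_key: string;
    endpoint: string;
    openAiClient: OpenAI | null;
}

function initOpenAiClient(config: OpenAiConfig): void {
    if (!config.is_openai_compatible) return;

    config.openAiClient = new OpenAI({
        // Local OpenAI-compatible servers (e.g. vLLM) typically accept any key.
        apiKey: config.api_key || "not-needed",
        // For OpenAI itself the default baseURL works; for a local server it
        // would point at something like http://localhost:8000/v1.
        baseURL: config.endpoint,
    });
}
```
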
@ggerganov (Member)

Note for future PRs: we should normalize the naming here. We have 3 different styles now. We should choose one and follow it.

@ohmeow (Author)

Fixed formatting by adding a .prettierrc to the project for folks (like me) who use Prettier to format JS/TS.

Updated configuration names to be consistent.

Also, can you take a look at my OpenAI-compatible implementation in llama-server.ts? I know there isn't a one-to-one correspondence between what llama.cpp accepts and what OpenAI-compatible endpoints accept, but I'd love to make sure it is as consistent as possible.

Thanks - wg
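
For reference while reviewing that mapping, here is a minimal sketch of what the OpenAI-compatible completion call in llama-server.ts could look like, using the prompt template from the configuration above. The function name, signature, and template placeholders are assumptions for discussion, not the PR's implementation.

```ts
import OpenAI from "openai";

// Assumed shape, for discussion only: fill the FIM template from the config
// and request a (legacy) text completion from the OpenAI-compatible server.
async function getOpenAiFimCompletion(
    client: OpenAI,
    model: string,
    template: string, // e.g. "<|fim_prefix|>{inputPrefix}{prompt}<|fim_suffix|>{inputSuffix}<|fim_middle|>"
    inputPrefix: string,
    inputSuffix: string,
    prompt: string,
    nPredict: number,
): Promise<string> {
    const filled = template
        .replace("{inputPrefix}", inputPrefix)
        .replace("{prompt}", prompt)
        .replace("{inputSuffix}", inputSuffix);

    const response = await client.completions.create({
        model,
        prompt: filled,
        max_tokens: nPredict,
    });

    return response.choices[0]?.text ?? "";
}
```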

@ggerganov (Member)

> Would love to use it with Ollama server as well.

Note that using anything other than the llama.cpp server will be significantly less efficient, because only the llama.cpp server has the optimizations this extension relies on.

@shishkin

> Note that using anything other than the llama.cpp server will be significantly less efficient, because only the llama.cpp server has the optimizations this extension relies on.

Apologies for my ignorance, but what are those optimizations, and what would be the way forward to reuse them across llama.cpp-based model servers? If they live at the API layer, maybe Ollama should expose the llama.cpp server's API directly?
