server: allow to get default generation settings for completion #5307

Merged

Conversation

z80maniac
Contributor

What does it do?

This PR adds a new field to the /props response: default_generation_settings. This object contains the default server params that will be used to generate a response. Its contents are exactly the same as the generation_settings object returned by the /completion endpoint.

What does it solve?

This PR mainly addresses one of my points in #4216 (comment):

  1. There should be a way to get some common server params, like the context size, without doing a /completion request. Since an API client can't set the context size, it needs to get it from the server. Yes, the context size is returned from the /completion endpoint, but there should be a way to get it before sending the first inference request, because the API client might need to use the context size to process the input or something else. I think some common "static" server params like n_ctx and model can be returned from the /props.

Now API clients can get the context size without any trickery. Before this change I used a {"n_predict": 0} request, but it recently stopped working (#5246), and it was a hack anyway.

Also, this PR allows API clients to get the default inference params before doing any inference, for example to populate a GUI with the default values.
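As a usage sketch (assuming the server is running on the default 127.0.0.1:8080), a client can now read the context size before sending any completion request:

curl -Ss http://127.0.0.1:8080/props | jq '.default_generation_settings.n_ctx'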

Implementation

  1. For the new default_generation_settings object, I take the first available slot. The code assumes that all slots have identical default params; I'm not sure whether that is always true.

  2. The JSON is captured right after the slots are created. We can't read it from a slot later, because by then the slot may already be polluted by request params that a user has sent to the server (a stand-alone sketch of this flow follows the list).

  3. This line is kind of awkward:

    default_generation_settings_for_props["seed"] = -1;

    But I'm not sure how to do it "properly".
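To make the three points above concrete, here is a stand-alone sketch of the idea (not the code from the PR; the slot_params struct and slot_to_json helper are made-up stand-ins, and only a few of the params are shown):

#include <cstdint>
#include <cstdio>
#include <vector>

#include "json.hpp" // nlohmann::json, which the server example already uses

using json = nlohmann::json;

// Made-up stand-in for a slot's default sampling params; the field names
// mirror the /props example below.
struct slot_params {
    int      n_ctx       = 2048;
    int      n_predict   = -1;
    float    temperature = 0.8f;
    uint32_t seed        = 4294967295u; // randomized per request internally
};

// Serialize one slot's params, similar in spirit to how the server builds
// the generation_settings object for /completion.
static json slot_to_json(const slot_params & p) {
    return json{
        {"n_ctx",       p.n_ctx},
        {"n_predict",   p.n_predict},
        {"temperature", p.temperature},
        {"seed",        p.seed},
    };
}

int main() {
    // Point 2: the snapshot is taken right after the slots are created,
    // before any request params can overwrite the slot contents.
    std::vector<slot_params> slots(4);

    // Point 1: assume all slots share the same defaults and take the first one.
    json default_generation_settings_for_props = slot_to_json(slots.front());

    // Point 3: the internal seed is an unsigned "random" sentinel, so report -1
    // ("pick a random seed") to API clients instead.
    default_generation_settings_for_props["seed"] = -1;

    printf("%s\n", default_generation_settings_for_props.dump(2).c_str());
    return 0;
}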

Example

An example of the resulting /props response:

{
  "assistant_name": "",
  "default_generation_settings": {
    "frequency_penalty": 0,
    "grammar": "",
    "ignore_eos": false,
    "logit_bias": [],
    "min_p": 0.05000000074505806,
    "mirostat": 0,
    "mirostat_eta": 0.10000000149011612,
    "mirostat_tau": 5,
    "model": "/opt/models/text/llama-2-13b-chat.Q4_K_M.gguf",
    "n_ctx": 2048,
    "n_keep": 0,
    "n_predict": -1,
    "n_probs": 0,
    "penalize_nl": true,
    "penalty_prompt_tokens": [],
    "presence_penalty": 0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.100000023841858,
    "seed": -1,
    "stop": [],
    "stream": true,
    "temperature": 0.800000011920929,
    "tfs_z": 1,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "typical_p": 1,
    "use_penalty_prompt_tokens": false
  },
  "user_name": ""
}

ggerganov merged commit a2d60c9 into ggerganov:master on Feb 5, 2024
53 checks passed
@z80maniac
Contributor Author

Damn, I was a little bit too late... Several minutes ago I got a message from someone who told me that there's already an endpoint, /model.json, that returns this info. I checked it, and it indeed returns all the model settings, including the context size. But it's not documented anywhere, so I had no idea it even existed. What is that endpoint anyway? Some sort of intentionally undocumented hack for developers?

curl -Ss http://127.0.0.1:8080/model.json | jq

{
  "frequency_penalty": 0,
  "grammar": "",
  "ignore_eos": false,
  "logit_bias": [],
  "min_p": 0.05000000074505806,
  "mirostat": 0,
  "mirostat_eta": 0.10000000149011612,
  "mirostat_tau": 5,
  "model": "/opt/models/text/llama-2-13b-chat.Q4_K_M.gguf",
  "n_ctx": 512,
  "n_keep": 0,
  "n_predict": -1,
  "n_probs": 0,
  "penalize_nl": true,
  "penalty_prompt_tokens": [],
  "presence_penalty": 0,
  "repeat_last_n": 64,
  "repeat_penalty": 1.100000023841858,
  "seed": 4294967295,
  "stop": [],
  "stream": true,
  "temperature": 0.800000011920929,
  "tfs_z": 1,
  "top_k": 40,
  "top_p": 0.949999988079071,
  "typical_p": 1,
  "use_penalty_prompt_tokens": false
}

I still think that /props is the more appropriate endpoint for this; why have two separate endpoints for server info?

@ggerganov
Owner

ggerganov commented Feb 5, 2024

FWIW, I also didn't know (or forgot) there is a model.json endpoint. We should leave just one of those - maybe keep the original one and document it in the readme?

@z80maniac
Contributor Author

Either way works for me, but I personally think there's no need for a separate endpoint when everything can be in /props. But to not break compatibility, the original endpoint needs to stay and this PR needs to be reverted. Though, I'm not sure whether such undocumented stuff should be protected by compatibility guarantees.

@ggerganov
Owner

Ok, let's remove model.json

@z80maniac
Contributor Author

z80maniac commented Feb 5, 2024

It seems like /model.json is used in examples/server/public/completion.js, so that would need to be reworked too, I guess. It's embedded into the server itself, so there should be no compatibility issues in this regard. I can try to replace it with the new endpoint.

EDIT: it seems like the /model.json call in completion.js is dead code and is not actually called from anywhere, unless I'm missing something.
