
feat(ollama): periodically refresh models #2805

Merged
ashwinb merged 3 commits into llamastack:main from ashwinb:envsimplify_0 on Jul 18, 2025
Conversation

ashwinb (Contributor) commented on Jul 17, 2025:

For self-hosted providers like Ollama (or vLLM), the backing server is running a set of models. That server should be treated as the source of truth and the Stack registry should just be a cache for those models. Of course, in production environments, you may not want this (because you know what model you are running statically) hence there's a config boolean to control this behavior.
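To make the cache-refresh behavior concrete, here is a minimal sketch of such a loop, assuming hypothetical `list_server_models` and `update_registered_models` helpers (the actual provider code differs):

```
import asyncio

# Minimal sketch of the refresh behavior described above. The backing server
# (Ollama, vLLM, ...) is the source of truth; the Stack registry is a cache
# that is overwritten on every tick. Helper names here are hypothetical.
async def refresh_models_loop(provider, registry, interval: int = 10) -> None:
    while True:
        try:
            models = await provider.list_server_models()  # ask the backing server
            await registry.update_registered_models(provider.provider_id, models)
        except Exception as exc:
            print(f"model refresh failed, will retry: {exc}")
        await asyncio.sleep(interval)
```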

_This is part of a series of PRs aimed at removing the requirement of needing to set `INFERENCE_MODEL` env variables for running the Llama Stack server._

Test Plan

Copy and modify the starter.yaml template / config and enable `refresh_models: true, refresh_models_interval: 10` for the ollama provider. Then, run:

```
LLAMA_STACK_LOGGING=all=debug \
  ENABLE_OLLAMA=ollama uv run llama stack run --image-type venv /tmp/starter.yaml
```

See a gargantuan amount of logs, but verify that the provider is periodically refreshing models. Stop and prune a model from the ollama server, then restart the server. Verify that the model goes away when I call `uv run llama-stack-client models list`.
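The same check can be scripted from Python rather than the CLI; a small sketch, assuming the server listens on the default port 8321:

```
from llama_stack_client import LlamaStackClient

# List the models the Stack currently knows about (assumes the default port)
client = LlamaStackClient(base_url="http://localhost:8321")
for model in client.models.list():
    print(model.identifier)
```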

The meta-cla bot added the CLA Signed label on Jul 17, 2025
ashwinb changed the title from "ollama: periodically refresh models" to "feat(ollama): periodically refresh models" on Jul 17, 2025
ashwinb added a commit that referenced this pull request Jul 18, 2025
When we call `construct_stack()`, providers are instantiated and
`initialize()` is called. This call can end up doing _anything_ at all
-- specifically, providers are free to create long running background
tasks as part of this. If we wrapped this within an `asyncio.run()` as in
the current code, these tasks get canceled when the stack construction
finishes. This is not correct. The PR addresses the issue by creating a
persistent event loop which is used for both the stack as well as for
running the uvicorn server. In other words, the lifetime of the
providers (and downstream async code) is now the same as the lifetime of
the uvicorn server.
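A sketch of the shape of this fix, with `run_config` and `app` as placeholders for the real objects:

```
import asyncio
import uvicorn

async def run_server() -> None:
    # Stack construction and uvicorn share one event loop, so background
    # tasks started in providers' initialize() outlive construct_stack().
    stack = await construct_stack(run_config)  # placeholders for the real call
    server = uvicorn.Server(uvicorn.Config(app, port=8321))
    await server.serve()

loop = asyncio.new_event_loop()
try:
    loop.run_until_complete(run_server())
finally:
    loop.close()
```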

## Test Plan

This should not affect any current code since we don't have background
tasks created right now. However,
#2805 will start using
this functionality.

```
async def _refresh_models(self) -> None:
    # Wait for model store to be available (with timeout)
    wait_interval = 1  # check every 3 seconds
```
ehhuang (Contributor) commented on the diff:

3 or 1?

ashwinb (Contributor, Author) replied:

@ehhuang lol will fix
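The eventual fix is not shown in this thread; a sketch of one plausible reconciliation, with the comment matching the value (`get_model_store` and the timeout are illustrative, not from the actual diff):

```
import asyncio

async def wait_for_model_store(get_model_store, timeout: int = 60):
    wait_interval = 1  # check every 1 second
    waited = 0
    while get_model_store() is None and waited < timeout:
        await asyncio.sleep(wait_interval)
        waited += wait_interval
    return get_model_store()
```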

@ashwinb ashwinb merged commit 68a2dfb into llamastack:main Jul 18, 2025
77 checks passed
@ashwinb ashwinb deleted the envsimplify_0 branch July 18, 2025 19:20
ashwinb added a commit that referenced this pull request Jul 18, 2025
Just like #2805 but for vLLM.

We also make the `VLLM_URL` env variable optional (not required) -- if not
specified, the provider silently sits idle and yells eventually if
someone tries to call a completion on it. This is done so as to allow
this provider to be present in the `starter` distribution.
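A rough sketch of this "sit idle, fail on use" behavior; the class and method names are illustrative, not the actual adapter:

```
class VLLMAdapterSketch:
    def __init__(self, config) -> None:
        self.config = config  # config.url is None when VLLM_URL is unset

    async def initialize(self) -> None:
        # No eager connection: a missing URL is tolerated at startup.
        pass

    async def chat_completion(self, *args, **kwargs):
        if self.config.url is None:
            raise ValueError("VLLM_URL is not set; configure it to use the vLLM provider")
        ...  # forward the request to the vLLM server
```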

## Test Plan

Set up vLLM, copy the starter template and set `{ refresh_models: true,
refresh_models_interval: 10 }` for the vllm provider and then run:

```
ENABLE_VLLM=vllm VLLM_URL=http://localhost:8000/v1 \
  uv run llama stack run --image-type venv /tmp/starter.yaml
```

Verify that `llama-stack-client models list` brings up the model
correctly from vLLM.
Nehanth pushed a commit to Nehanth/llama-stack that referenced this pull request Jul 23, 2025
Nehanth pushed a commit to Nehanth/llama-stack that referenced this pull request Jul 23, 2025
ashwinb added a commit that referenced this pull request Jul 24, 2025
This flips #2823 and #2805 by making the Stack periodically query the
providers for models rather than the providers going behind its back and
calling `register` on the registry themselves. This also adds support
for model listing for all other providers via `ModelRegistryHelper`.
Once this is done, we do not need to manually list or register models
via `run.yaml` and it will remove both noise and annoyance (setting
`INFERENCE_MODEL` environment variables, for example) from the new user
experience.

In addition, it adds a configuration variable `allowed_models` which can
be used to optionally restrict the set of models exposed from a
provider.
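A minimal sketch of how such an allow-list can be applied during the periodic listing (helper names are assumptions):

```
async def models_for_registry(provider) -> list:
    models = await provider.list_models()     # query the backing server
    allowed = provider.config.allowed_models  # None means "expose everything"
    if allowed is None:
        return models
    return [m for m in models if m.identifier in allowed]
```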
ChristianZaccaria pushed a commit to ChristianZaccaria/llama-stack that referenced this pull request Jul 28, 2025