feat(ollama): periodically refresh models#2805
Merged
ashwinb merged 3 commits into llamastack:main on Jul 18, 2025
Conversation
ashwinb
commented
Jul 17, 2025
cdoern
reviewed
Jul 18, 2025
ehhuang
reviewed
Jul 18, 2025
leseb
reviewed
Jul 18, 2025
ashwinb
added a commit
that referenced
this pull request
Jul 18, 2025
When we call `construct_stack()`, providers are instantiated and `initialize()` is called. This call can end up doing _anything_ at all -- specifically, providers are free to create long-running background tasks as part of it. If we wrapped this within an `asyncio.run()` as in the current code, these tasks get canceled when the stack construction finishes. This is not correct. The PR addresses the issue by creating a persistent event loop which is used both for the stack and for running the uvicorn server. In other words, the lifetime of the providers (and downstream async code) is now the same as the lifetime of the uvicorn server.

## Test Plan

This should not affect any current code since we don't have background tasks created right now. However, #2805 will start using this functionality.
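The gist of that change, as a minimal sketch: stack construction and uvicorn share one long-lived loop. The ASGI app, port, and `construct_stack()` body below are placeholders, not the real server code.

```python
import asyncio

import uvicorn


async def app(scope, receive, send):
    # Trivial ASGI app standing in for the real Llama Stack server application.
    if scope["type"] == "http":
        await send({"type": "http.response.start", "status": 200, "headers": []})
        await send({"type": "http.response.body", "body": b"ok"})


async def construct_stack() -> None:
    # Hypothetical stand-in: a provider's initialize() may spawn a long-running
    # background task here (e.g. a periodic model refresh loop).
    asyncio.create_task(asyncio.sleep(3600))


def main() -> None:
    # One persistent loop owns both the stack and the HTTP server, so tasks
    # created during initialize() are not cancelled when construction returns.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(construct_stack())
    server = uvicorn.Server(uvicorn.Config(app, host="127.0.0.1", port=8321))
    loop.run_until_complete(server.serve())


if __name__ == "__main__":
    main()
```

Contrast this with `asyncio.run(construct_stack())`, which tears down its loop on exit and cancels any tasks the providers started.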
ehhuang
approved these changes
Jul 18, 2025
async def _refresh_models(self) -> None:
    # Wait for model store to be available (with timeout)
    wait_interval = 1  # check every second
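For context, a hedged sketch of the kind of loop this fragment belongs to. Everything below except the wait-then-poll shape is an assumption (class name, helpers, intervals), not the provider's actual code:

```python
import asyncio


class OllamaRefreshSketch:
    """Illustrative only; the helper names and registry calls are assumptions."""

    def __init__(self, refresh_models_interval: int = 10):
        self.refresh_models_interval = refresh_models_interval
        self.model_store = None  # attached by the stack once the registry exists

    async def list_backend_models(self) -> list[str]:
        # Assumption: the real provider would query the Ollama server for its models.
        return ["llama3.2:3b"]

    async def _refresh_models(self) -> None:
        # Wait for the model store to be attached, with a timeout.
        wait_interval = 1  # check every second
        waited = 0
        while self.model_store is None and waited < 60:
            await asyncio.sleep(wait_interval)
            waited += wait_interval
        # Then poll the backing server forever at the configured interval.
        while True:
            models = await self.list_backend_models()
            print(f"refreshing registry with {models}")  # stand-in for registry updates
            await asyncio.sleep(self.refresh_models_interval)
```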
ashwinb
added a commit
that referenced
this pull request
Jul 18, 2025
Just like #2805 but for vLLM. We also make the VLLM_URL env variable optional (not required) -- if not specified, the provider silently sits idle and eventually yells if someone tries to call a completion on it. This is done so that this provider can be present in the `starter` distribution.

## Test Plan

Set up vLLM, copy the starter template, set `{ refresh_models: true, refresh_models_interval: 10 }` for the vllm provider, and then run:

```
ENABLE_VLLM=vllm VLLM_URL=http://localhost:8000/v1 \
  uv run llama stack run --image-type venv /tmp/starter.yaml
```

Verify that `llama-stack-client models list` brings up the model correctly from vLLM.
Nehanth
pushed a commit
to Nehanth/llama-stack
that referenced
this pull request
Jul 23, 2025
For self-hosted providers like Ollama (or vLLM), the backing server is running a set of models. That server should be treated as the source of truth, and the Stack registry should just be a cache for those models. Of course, in production environments you may not want this (because you know statically which model you are running), hence there is a config boolean to control this behavior.

_This is part of a series of PRs aimed at removing the requirement of needing to set `INFERENCE_MODEL` env variables for running Llama Stack server._

## Test Plan

Copy and modify the starter.yaml template / config and enable `refresh_models: true, refresh_models_interval: 10` for the ollama provider. Then, run:

```
LLAMA_STACK_LOGGING=all=debug \
  ENABLE_OLLAMA=ollama uv run llama stack run --image-type venv /tmp/starter.yaml
```

See a gargantuan amount of logs, but verify that the provider is periodically refreshing models. Stop and prune a model from the ollama server, then restart the server. Verify that the model goes away when I call `uv run llama-stack-client models list`.
Nehanth
pushed a commit
to Nehanth/llama-stack
that referenced
this pull request
Jul 23, 2025
Just like llamastack#2805 but for vLLM. We also make the VLLM_URL env variable optional (not required) -- if not specified, the provider silently sits idle and eventually yells if someone tries to call a completion on it. This is done so that this provider can be present in the `starter` distribution.

## Test Plan

Set up vLLM, copy the starter template, set `{ refresh_models: true, refresh_models_interval: 10 }` for the vllm provider, and then run:

```
ENABLE_VLLM=vllm VLLM_URL=http://localhost:8000/v1 \
  uv run llama stack run --image-type venv /tmp/starter.yaml
```

Verify that `llama-stack-client models list` brings up the model correctly from vLLM.
ashwinb
added a commit
that referenced
this pull request
Jul 24, 2025
This flips #2823 and #2805 by making the Stack periodically query the providers for models, rather than the providers going behind its back and calling "register" on the registry themselves. This also adds support for model listing for all other providers via `ModelRegistryHelper`. Once this is done, we no longer need to manually list or register models via `run.yaml`, which removes both noise and annoyance (setting `INFERENCE_MODEL` environment variables, for example) from the new-user experience. In addition, it adds a configuration variable `allowed_models` which can be used to optionally restrict the set of models exposed by a provider.
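Roughly, the flipped direction could look like the sketch below, with `allowed_models` applied as an optional filter. The class, method, and field names are assumptions for illustration, not the actual `ModelRegistryHelper` API:

```python
from dataclasses import dataclass


@dataclass
class ProviderSketch:
    # Assumed shape: a provider can enumerate its backend models, optionally
    # restricted by an allowed_models entry in its config.
    backend_models: list[str]
    allowed_models: list[str] | None = None

    async def list_models(self) -> list[str]:
        models = self.backend_models
        if self.allowed_models is not None:
            models = [m for m in models if m in self.allowed_models]
        return models


async def refresh_registry(providers: dict[str, ProviderSketch]) -> dict[str, list[str]]:
    # The Stack side of the flip: periodically ask every provider what it serves
    # and rebuild the registry view from those answers, instead of providers
    # registering models behind the Stack's back.
    return {name: await p.list_models() for name, p in providers.items()}
```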
ChristianZaccaria
pushed a commit
to ChristianZaccaria/llama-stack
that referenced
this pull request
Jul 28, 2025
…mastack#2862)

This flips llamastack#2823 and llamastack#2805 by making the Stack periodically query the providers for models, rather than the providers going behind its back and calling "register" on the registry themselves. This also adds support for model listing for all other providers via `ModelRegistryHelper`. Once this is done, we no longer need to manually list or register models via `run.yaml`, which removes both noise and annoyance (setting `INFERENCE_MODEL` environment variables, for example) from the new-user experience. In addition, it adds a configuration variable `allowed_models` which can be used to optionally restrict the set of models exposed by a provider.
For self-hosted providers like Ollama (or vLLM), the backing server is running a set of models. That server should be treated as the source of truth, and the Stack registry should just be a cache for those models. Of course, in production environments you may not want this (because you know statically which model you are running), hence there is a config boolean to control this behavior.
This is part of a series of PRs aimed at removing the requirement of needing to set `INFERENCE_MODEL` env variables for running Llama Stack server.

Test Plan

Copy and modify the starter.yaml template / config and enable `refresh_models: true, refresh_models_interval: 10` for the ollama provider. Then, run:

LLAMA_STACK_LOGGING=all=debug \
  ENABLE_OLLAMA=ollama uv run llama stack run --image-type venv /tmp/starter.yaml

See a gargantuan amount of logs, but verify that the provider is periodically refreshing models. Stop and prune a model from the ollama server, then restart the server. Verify that the model goes away when I call
uv run llama-stack-client models list
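The "registry is just a cache" behavior the test plan checks can be pictured with a small sketch (names and shapes assumed, not the real registry code): each refresh reconciles the registry against whatever the backing server currently reports.

```python
def sync_registry_once(backend_models: set[str], registry: set[str]) -> set[str]:
    # Register anything newly served by the backend, unregister anything it no
    # longer serves, and return the refreshed cache.
    for model in sorted(backend_models - registry):
        print(f"registering {model}")
    for model in sorted(registry - backend_models):
        print(f"unregistering {model}")
    return set(backend_models)


# Example: a model pruned from the ollama server disappears on the next refresh.
registry = sync_registry_once({"llama3.2:3b"}, {"llama3.2:3b", "llama3.1:8b"})
assert registry == {"llama3.2:3b"}
```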