Update links (openvinotoolkit#3043)
dkalinowski authored Feb 7, 2025
1 parent c29380d commit 7cefcee
Showing 25 changed files with 54 additions and 54 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -13,17 +13,17 @@ Model Server hosts models and makes them accessible to software components over

OpenVINO™ Model Server (OVMS) is a high-performance system for serving models. Implemented in C++ for scalability and optimized for deployment on Intel architectures. It uses the same API as [TensorFlow Serving](https://github.com/tensorflow/serving) and [KServe](https://github.com/kserve/kserve) while applying OpenVINO for inference execution. Inference service is provided via gRPC or REST API, making deploying new algorithms and AI experiments easy.

-In addition, there are included endpoints for generative use cases compatible with [OpenAI API and Cohere API](./clients_genai.md).
+In addition, there are included endpoints for generative use cases compatible with [OpenAI API and Cohere API](./docs/clients_genai.md).

![OVMS picture](docs/ovms_high_level.png)

The models used by the server need to be stored locally or hosted remotely by object storage services. For more details, refer to [Preparing Model Repository](docs/models_repository.md) documentation. Model server works inside [Docker containers](docs/deploying_server.md#deploying-model-server-in-docker-container), on [Bare Metal](docs/deploying_server.md#deploying-model-server-on-baremetal-without-container), and in [Kubernetes environment](docs/deploying_server.md#deploying-model-server-in-kubernetes).
-Start using OpenVINO Model Server with a fast-forward serving example from the [QuickStart guide](docs/ovms_quickstart.md) or [LLM QuickStart guide](./llm/quickstart.md).
+Start using OpenVINO Model Server with a fast-forward serving example from the [QuickStart guide](docs/ovms_quickstart.md) or [LLM QuickStart guide](./docs/llm/quickstart.md).

Read [release notes](https://github.com/openvinotoolkit/model_server/releases) to find out what’s new.

### Key features:
-- **[NEW]** Native Windows support. Check updated [deployment guide](./deploying_server.md)
+- **[NEW]** Native Windows support. Check updated [deployment guide](./docs/deploying_server.md)
- **[NEW]** [Text Embeddings compatible with OpenAI API](demos/embeddings/README.md)
- **[NEW]** [Reranking compatible with Cohere API](demos/rerank/README.md)
- **[NEW]** [Efficient Text Generation via OpenAI API](demos/continuous_batching/README.md)
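
The gRPC/REST serving flow referenced in this README can be exercised with a plain REST call once a model is loaded. The sketch below is illustrative only: the port, model name, input name, and shape are assumptions, not values defined by this commit.

```bash
# Sketch of a KServe-style REST inference request (assumed: REST API on port
# 8000 and a model named "dummy" exposing a single [1,10] FP32 input "b").
curl -s -X POST http://localhost:8000/v2/models/dummy/infer \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "b", "shape": [1, 10], "datatype": "FP32",
        "data": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}]}'
```

The same model is also reachable over gRPC through the TensorFlow Serving and KServe APIs mentioned above.
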
8 changes: 4 additions & 4 deletions demos/continuous_batching/rag/rag_demo.ipynb
@@ -130,10 +130,10 @@
}
],
"source": [
"!curl https://docs.openvino.ai/2024/openvino-workflow/model-server/ovms_what_is_openvino_model_server.html --create-dirs -o ./docs/ovms_what_is_openvino_model_server.html\n",
"!curl https://docs.openvino.ai/2024/openvino-workflow/model-server/ovms_docs_metrics.html -o ./docs/ovms_docs_metrics.html\n",
"!curl https://docs.openvino.ai/2024/openvino-workflow/model-server/ovms_docs_streaming_endpoints.html -o ./docs/ovms_docs_streaming_endpoints.html\n",
"!curl https://docs.openvino.ai/2024/openvino-workflow/model-server/ovms_docs_target_devices.html -o ./docs/ovms_docs_target_devices.html\n"
"!curl https://docs.openvino.ai/2025/openvino-workflow/model-server/ovms_what_is_openvino_model_server.html --create-dirs -o ./docs/ovms_what_is_openvino_model_server.html\n",
"!curl https://docs.openvino.ai/2025/openvino-workflow/model-server/ovms_docs_metrics.html -o ./docs/ovms_docs_metrics.html\n",
"!curl https://docs.openvino.ai/2025/openvino-workflow/model-server/ovms_docs_streaming_endpoints.html -o ./docs/ovms_docs_streaming_endpoints.html\n",
"!curl https://docs.openvino.ai/2025/openvino-workflow/model-server/ovms_docs_target_devices.html -o ./docs/ovms_docs_target_devices.html\n"
]
},
{
6 changes: 3 additions & 3 deletions demos/continuous_batching/speculative_decoding/README.md
@@ -1,6 +1,6 @@
# How to serve LLM Models in Speculative Decoding Pipeline{#ovms_demos_continuous_batching_speculative_decoding}

-Following [OpenVINO GenAI docs](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide.html#efficient-text-generation-via-speculative-decoding):
+Following [OpenVINO GenAI docs](https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai.html#efficient-text-generation-via-speculative-decoding):
> Speculative decoding (or assisted-generation) enables faster token generation when an additional smaller draft model is used alongside the main model. This reduces the number of infer requests to the main model, increasing performance.
>
> The draft model predicts the next K tokens one by one in an autoregressive manner. The main model validates these predictions and corrects them if necessary - in case of a discrepancy, the main model prediction is used. Then, the draft model acquires this token and runs prediction of the next K tokens, thus repeating the cycle.
@@ -13,7 +13,7 @@ This demo shows how to use speculative decoding in the model serving scenario, b

**Model preparation**: Python 3.9 or higher with pip and HuggingFace account

-**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)
+**Model Server deployment**: Installed Docker Engine or OVMS binary package according to the [baremetal deployment guide](../../../docs/deploying_server_baremetal.md)

## Model considerations

@@ -103,7 +103,7 @@ Assuming you have unpacked model server package, make sure to:
- **On Windows**: run `setupvars` script
- **On Linux**: set `LD_LIBRARY_PATH` and `PATH` environment variables

-as mentioned in [deployment guide](../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server.
+as mentioned in [deployment guide](../../../docs/deploying_server_baremetal.md), in every new shell that will start OpenVINO Model Server.

Depending on how you prepared models in the first step of this demo, they are deployed to either CPU or GPU (it's defined in `config.json`). If you run on GPU make sure to have appropriate drivers installed, so the device is accessible for the model server.

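
Once the server is running, the pipeline described in this README can be smoke-tested with a single OpenAI-style request. The port and served model name below are placeholders and must match the values used in your `config.json`.

```bash
# Minimal sketch of a completions request against the OpenAI-compatible
# endpoint; speculative decoding happens server-side, so the request body
# is the same as for a regular deployment.
curl -s http://localhost:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-llm", "prompt": "What is speculative decoding?", "max_tokens": 64}'
```
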
18 changes: 9 additions & 9 deletions docs/accelerators.md
@@ -4,9 +4,9 @@

Docker engine installed (on Linux and WSL), or ovms binary package installed as described in the [guide](./deploying_server_baremetal.md) (on Linux or Windows).

-Supported HW is documented in [OpenVINO system requirements](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html)
+Supported HW is documented in [OpenVINO system requirements](https://docs.openvino.ai/2025/about-openvino/release-notes-openvino/system-requirements.html)

-Before starting the model server as a binary package, make sure the required GPU and/or NPU drivers are installed, as described in [https://docs.openvino.ai/2024/get-started/configurations.html](https://docs.openvino.ai/2024/get-started/configurations.html)
+Before starting the model server as a binary package, make sure the required GPU and/or NPU drivers are installed, as described in [https://docs.openvino.ai/2025/get-started/install-openvino/configurations.html](https://docs.openvino.ai/2025/get-started/install-openvino/configurations.html)

Additional considerations when deploying with docker container:
- make sure to use the image version including runtime drivers. The public image has a suffix -gpu like `openvino/model_server:latest-gpu`.
@@ -27,7 +27,7 @@ rm model/1/model.tar.gz

## Starting Model Server with Intel GPU

-The [GPU plugin](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) uses the [oneDNN](https://github.com/oneapi-src/oneDNN) and [OpenCL](https://github.com/KhronosGroup/OpenCL-SDK) to infer deep neural networks. For inference execution, it employs Intel® Processor Graphics including
+The [GPU plugin](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) uses the [oneDNN](https://github.com/oneapi-src/oneDNN) and [OpenCL](https://github.com/KhronosGroup/OpenCL-SDK) to infer deep neural networks. For inference execution, it employs Intel® Processor Graphics including
Intel® Arc™ GPU Series, Intel® UHD Graphics, Intel® HD Graphics, Intel® Iris® Graphics, Intel® Iris® Xe Graphics, and Intel® Iris® Xe MAX graphics and Intel® Data Center GPU.

### Container
@@ -57,7 +57,7 @@ docker run --rm -it --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -u $(i

### Binary

-Starting the server with GPU acceleration requires installation of runtime drivers and the ocl-icd-libopencl1 package, as described in the [configuration guide](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html)
+Starting the server with GPU acceleration requires installation of runtime drivers and the ocl-icd-libopencl1 package, as described in the [configuration guide](https://docs.openvino.ai/2025/get-started/install-openvino/configurations/configurations-intel-gpu.html)

Start the model server with GPU acceleration using a command:
```console
@@ -67,7 +67,7 @@ ovms --model_path model --model_name resnet --port 9000 --target_device GPU

## Using NPU device Plugin

-OpenVINO Model Server supports using [NPU device](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.html)
+OpenVINO Model Server supports using [NPU device](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.html)

### Container
Example command to run container with NPU:
@@ -82,13 +82,13 @@ Start the model server with NPU accelerations using a command:
ovms --model_path model --model_name resnet --port 9000 --target_device NPU --batch_size 1
```

-Check more info about the [NPU driver configuration](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-npu.html).
+Check more info about the [NPU driver configuration](https://docs.openvino.ai/2025/get-started/install-openvino/configurations/configurations-intel-npu.html).

> **NOTE**: The NPU device executes models with static input and output shapes only. If your model has a dynamic shape, it can be reset to static with the `--batch_size` or `--shape` parameters.
## Using Heterogeneous Plugin

-The [HETERO plugin](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.html) makes it possible to distribute inference load of one model
+The [HETERO plugin](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.html) makes it possible to distribute inference load of one model
among several computing devices. That way different parts of the deep learning network can be executed by devices best suited to their type of calculations.
OpenVINO automatically divides the network to optimize the process.

@@ -115,7 +115,7 @@ ovms --model_path model --model_name resnet --port 9000 --target_device "HETERO:

## Using AUTO Plugin

-[Auto Device](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/auto-device-selection.html) (or AUTO in short) is a new special “virtual” or “proxy” device in the OpenVINO toolkit; it doesn’t bind to a specific type of HW device.
+[Auto Device](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/auto-device-selection.html) (or AUTO in short) is a new special “virtual” or “proxy” device in the OpenVINO toolkit; it doesn’t bind to a specific type of HW device.
AUTO removes the complexity of writing application logic that selects the HW device and then deduces the best optimization settings for that device.
AUTO always chooses the best device; if compiling the model fails on this device, AUTO will try to compile it on the next best device until one of them succeeds.

@@ -197,7 +197,7 @@ ovms --model_path model --model_name resnet --port 9000 --plugin_config "{\"PERF

## Using Automatic Batching Plugin

-[Auto Batching](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/automatic-batching.html) (or BATCH in short) is a new special “virtual” device
+[Auto Batching](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/automatic-batching.html) (or BATCH in short) is a new special “virtual” device
which explicitly defines the auto batching.

It performs automatic batching on-the-fly to improve device utilization by grouping inference requests together, without programming effort from the user.
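
The truncated `--plugin_config` example above follows the same pattern as the other commands in this file. A complete invocation could look like the sketch below; the model path, name, and hint value are placeholders rather than settings introduced by this commit.

```bash
# Sketch: AUTO device selection combined with a throughput-oriented
# performance hint passed through --plugin_config as a JSON string.
ovms --model_path model --model_name resnet --port 9000 \
     --target_device AUTO \
     --plugin_config '{"PERFORMANCE_HINT": "THROUGHPUT"}'
```
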
2 changes: 1 addition & 1 deletion docs/advanced_topics.md
@@ -18,7 +18,7 @@ Implement any CPU layer, that is not support by OpenVINO yet, as a shared librar
[Learn more](../src/example/SampleCpuExtension/README.md)

## Model Cache
-Leverage the OpenVINO [model caching](https://docs.openvino.ai/2024/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.html) feature to speed up subsequent model loading on a target device.
+Leverage the OpenVINO [model caching](https://docs.openvino.ai/2025/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.html) feature to speed up subsequent model loading on a target device.

[Learn more](model_cache.md)

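
In the model server, the caching feature linked above is enabled by pointing the server at a writable cache directory; the sketch below assumes the `--cache_dir` option described in the linked model cache guide, with placeholder paths and names.

```bash
# Sketch: cache compiled model blobs so subsequent loads of the same model
# on the same device start noticeably faster.
ovms --model_path model --model_name resnet --port 9000 \
     --target_device GPU --cache_dir /opt/cache
```
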
2 changes: 1 addition & 1 deletion docs/build_from_source.md
@@ -143,7 +143,7 @@ make release_image MEDIAPIPE_DISABLE=1 PYTHON_DISABLE=1
### `GPU`

-When set to `1`, OpenVINO&trade; Model Server will be built with the drivers required by [GPU plugin](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) support. Default value: `0`.
+When set to `1`, OpenVINO&trade; Model Server will be built with the drivers required by [GPU plugin](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) support. Default value: `0`.

Example:
```bash
2 changes: 1 addition & 1 deletion docs/deploying_server_baremetal.md
@@ -164,7 +164,7 @@ Learn more about model server [starting parameters](parameters.md).

> **NOTE**:
> When serving models on [AI accelerators](accelerators.md), some additional steps may be required to install device drivers and dependencies.
-> Learn more in the [Additional Configurations for Hardware](https://docs.openvino.ai/2024/get-started/configurations.html) documentation.
+> Learn more in the [Additional Configurations for Hardware](https://docs.openvino.ai/2025/get-started/install-openvino/configurations.html) documentation.

## Next Steps
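
For context, a minimal bare-metal start after installing the package and setting the environment could look like the sketch below; the model name, path, and ports are placeholders, not values defined by this commit.

```bash
# Sketch: start the binary with both gRPC and REST endpoints; add
# --target_device GPU or NPU only after the corresponding drivers are installed.
ovms --model_name resnet --model_path /opt/models/resnet50 \
     --port 9000 --rest_port 8000
```
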
4 changes: 2 additions & 2 deletions docs/deploying_server_docker.md
@@ -7,7 +7,7 @@ This is a step-by-step guide on how to deploy OpenVINO™ Model Server on Li
- [Docker Engine](https://docs.docker.com/engine/) installed
- Intel® Core™ processor (6-13th gen.) or Intel® Xeon® processor (1st to 4th gen.)
- Linux, macOS or Windows via [WSL](https://docs.microsoft.com/en-us/windows/wsl/)
-- (optional) AI accelerators [supported by OpenVINO](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Accelerators are tested only on bare-metal Linux hosts.
+- (optional) AI accelerators [supported by OpenVINO](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes.html). Accelerators are tested only on bare-metal Linux hosts.

### Launch Model Server Container

@@ -85,4 +85,4 @@ make release_image GPU=1
It will create an image called `openvino/model_server:latest`.
> **Note:** This operation might take 40min or more depending on your build host.
> **Note:** `GPU` parameter in image build command is needed to include dependencies for GPU device.
-> **Note:** The public image from the last release might not be compatible with models exported using the latest export script. Check the [demo version from the last release](https://github.com/openvinotoolkit/model_server/tree/releases/2024/4/demos/continuous_batching) to use the public docker image.
+> **Note:** The public image from the last release might not be compatible with models exported using the latest export script. We recommend using the export script and docker image from the same release to avoid compatibility issues.
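
For the container launch step referenced above, a typical invocation has the following shape; the image tag, mounted path, model name, and ports are placeholders for illustration.

```bash
# Sketch: run the public image with a locally stored model repository
# mounted read-only into the container.
docker run -d --rm -u $(id -u) \
  -v $(pwd)/models:/models:ro -p 9000:9000 -p 8000:8000 \
  openvino/model_server:latest \
  --model_name resnet --model_path /models/resnet50 \
  --port 9000 --rest_port 8000
```
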
2 changes: 1 addition & 1 deletion docs/dynamic_shape_dynamic_model.md
@@ -8,7 +8,7 @@ Enable dynamic shape by setting the `shape` parameter to range or undefined:
- `--shape "(1,3,200:500,200:500)"` when model is supposed to support height and width values in a range of 200-500. Note that any dimension can support range of values, height and width are only examples here.

> Note that some models do not support dynamic dimensions. Learn more about supported model graph layers including all limitations
-on [Shape Inference Document](https://docs.openvino.ai/2024/openvino-workflow/running-inference/changing-input-shape.html).
+on [Shape Inference Document](https://docs.openvino.ai/2025/openvino-workflow/running-inference/changing-input-shape.html).

Another option to use the dynamic shape feature is to export the model with a dynamic dimension using Model Optimizer. OpenVINO Model Server will inherit the dynamic shape, and no additional settings are needed.

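
Putting the `--shape` parameter from this page into a full command, a dynamic-range deployment could be started as in the sketch below; the model path and name are placeholders.

```bash
# Sketch: batch size fixed at 1, height and width accepted anywhere in the
# 200-500 range, matching the example value documented above.
ovms --model_path model --model_name resnet --port 9000 \
     --shape "(1,3,200:500,200:500)"
```
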
2 changes: 1 addition & 1 deletion docs/home.md
@@ -58,5 +58,5 @@ Start using OpenVINO Model Server with a fast-forward serving example from the [
* [RAG building blocks made easy and affordable with OpenVINO Model Server](https://medium.com/openvino-toolkit/rag-building-blocks-made-easy-and-affordable-with-openvino-model-server-e7b03da5012b)
* [Simplified Deployments with OpenVINO™ Model Server and TensorFlow Serving](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Simplified-Deployments-with-OpenVINO-Model-Server-and-TensorFlow/post/1353218)
* [Inference Scaling with OpenVINO™ Model Server in Kubernetes and OpenShift Clusters](https://www.intel.com/content/www/us/en/developer/articles/technical/deploy-openvino-in-openshift-and-kubernetes.html)
-* [Benchmarking results](https://docs.openvino.ai/2024/about-openvino/performance-benchmarks.html)
+* [Benchmarking results](https://docs.openvino.ai/2025/about-openvino/performance-benchmarks.html)
* [Release Notes](https://github.com/openvinotoolkit/model_server/releases)
2 changes: 1 addition & 1 deletion docs/llm/reference.md
@@ -81,7 +81,7 @@ The calculator supports the following `node_options` for tuning the pipeline con
- `optional uint64 max_num_seqs` - max number of sequences actively processed by the engine [default = 256];
- `optional bool dynamic_split_fuse` - use Dynamic Split Fuse token scheduling [default = true];
- `optional string device` - device to load models to. Supported values: "CPU", "GPU" [default = "CPU"]
-- `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format for regular [models configuration](../parameters.md#model-configuration-options) [default = "{}"]
+- `optional string plugin_config` - [OpenVINO device plugin configuration](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes.html). Should be provided in the same format for regular [models configuration](../parameters.md#model-configuration-options) [default = "{}"]
- `optional uint32 best_of_limit` - max value of best_of parameter accepted by endpoint [default = 20];
- `optional uint32 max_tokens_limit` - max value of max_tokens parameter accepted by endpoint [default = 4096];
- `optional bool enable_prefix_caching` - enable caching of KV-blocks [default = false];
2 changes: 1 addition & 1 deletion docs/mediapipe.md
@@ -54,7 +54,7 @@ Check their [documentation](https://github.com/openvinotoolkit/mediapipe/blob/ma

## PyTensorOvTensorConverterCalculator

-`PyTensorOvTensorConverterCalculator` enables conversion between nodes that are run by `PythonExecutorCalculator` and nodes that receive and/or produce [OV Tensors](https://docs.openvino.ai/2024/api/c_cpp_api/classov_1_1_tensor.html)
+`PyTensorOvTensorConverterCalculator` enables conversion between nodes that are run by `PythonExecutorCalculator` and nodes that receive and/or produce [OV Tensors](https://docs.openvino.ai/2025/api/c_cpp_api/classov_1_1_tensor.html)

## How to create the graph for deployment in OpenVINO Model Server
