1 change: 0 additions & 1 deletion .github/workflows/lib-build.yaml
@@ -26,7 +26,6 @@ jobs:
- intel-dsa-plugin
- intel-iaa-plugin
- intel-idxd-config-initcontainer
- intel-xpumanager-sidecar

# # Demo images
- crypto-perf
1 change: 0 additions & 1 deletion .github/workflows/lib-publish.yaml
@@ -56,7 +56,6 @@ jobs:
- intel-dsa-plugin
- intel-iaa-plugin
- intel-idxd-config-initcontainer
- intel-xpumanager-sidecar
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4
- uses: actions/setup-go@d35c59abb061a4a6fb18e82ac0862c26744d6ab5 # v5
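For context, both workflows drive builds from a plain YAML image matrix. The sketch below is hypothetical: only the image names and the pinned checkout step come from the diff above, while the surrounding `jobs`/`strategy` keys are assumed for illustration.

```yaml
# Hypothetical layout of the image build matrix these workflows iterate over.
# Only the image entries and the pinned checkout action come from the diff;
# the jobs/strategy structure is assumed. Dropping intel-xpumanager-sidecar
# from the list means no build or publish job is scheduled for that image.
jobs:
  build:
    strategy:
      matrix:
        image:
          - intel-dsa-plugin
          - intel-iaa-plugin
          - intel-idxd-config-initcontainer
          # intel-xpumanager-sidecar entry removed by this change
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4
```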
4 changes: 0 additions & 4 deletions .trivyignore.yaml
@@ -19,17 +19,13 @@ misconfigurations:
- id: AVD-KSV-0047
statement: gpu plugin in kubelet mode requires "nodes/proxy" resource access
paths:
- gpu_plugin/overlays/fractional_resources/gpu-manager-role.yaml
- operator/rbac/gpu_manager_role.yaml
- operator/rbac/role.yaml

- id: AVD-KSV-0014
statement: These are false detections for not setting "readOnlyFilesystem"
paths:
- fpga_plugin/overlays/region/mode-region.yaml
- gpu_plugin/overlays/fractional_resources/add-mounts.yaml
- gpu_plugin/overlays/fractional_resources/add-args.yaml
- gpu_plugin/overlays/fractional_resources/gpu-manager-role.yaml
- gpu_plugin/overlays/monitoring_shared-dev_nfd/add-args.yaml
- gpu_plugin/overlays/nfd_labeled_nodes/add-args.yaml
- iaa_plugin/overlays/iaa_initcontainer/iaa_initcontainer.yaml
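Each entry in this ignore file ties a Trivy check ID to the manifests it is waived for, so deleting a path (as above) re-enables that check for the file. A minimal sketch of the resulting shape — the ID and statement are taken from the diff, and the path list is shortened to one surviving entry:

```yaml
# Shape of a .trivyignore.yaml entry after the fractional-resources paths
# are dropped; id and statement are from the diff, the path list is shortened.
misconfigurations:
  - id: AVD-KSV-0047
    statement: gpu plugin in kubelet mode requires "nodes/proxy" resource access
    paths:
      - operator/rbac/role.yaml   # check stays waived only where still needed
```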
2 changes: 1 addition & 1 deletion Makefile
@@ -177,7 +177,7 @@ endif

dockerlib = build/docker/lib
dockertemplates = build/docker/templates
images = $(shell basename -s .Dockerfile.in -a $(dockertemplates)/*.Dockerfile.in | grep -v -e dlb -e fpga -e kerneldrv)
images = $(shell basename -s .Dockerfile.in -a $(dockertemplates)/*.Dockerfile.in | grep -v -e dlb -e fpga -e xpumanager-sidecar)
dockerfiles = $(shell basename -s .in -a $(dockertemplates)/*.Dockerfile.in | xargs -I"{}" echo build/docker/{})

test-image-base-layer:
6 changes: 0 additions & 6 deletions README.md
@@ -196,12 +196,6 @@ The [Device plugins operator README](cmd/operator/README.md) gives the installat

The [Device plugins Operator for OpenShift](https://github.com/intel/intel-technology-enabling-for-openshift) gives the installation and usage details for the operator available on [Red Hat OpenShift Container Platform](https://catalog.redhat.com/software/operators/detail/61e9f2d7b9cdd99018fc5736).

## XeLink XPU Manager Sidecar

To support interconnected GPUs in Kubernetes, XeLink sidecar is needed.

The [XeLink XPU Manager sidecar README](cmd/xpumanager_sidecar/README.md) gives information how the sidecar functions and how to use it.

## Intel GPU Level-Zero sidecar

Sidecar uses Level-Zero API to provide additional GPU information for the GPU plugin that it cannot get through sysfs interfaces.
72 changes: 0 additions & 72 deletions build/docker/intel-qat-plugin-kerneldrv.Dockerfile

This file was deleted.

43 changes: 0 additions & 43 deletions build/docker/templates/intel-qat-plugin-kerneldrv.Dockerfile.in

This file was deleted.

16 changes: 3 additions & 13 deletions cmd/gpu_plugin/README.md
@@ -47,20 +47,17 @@ Intel GPU plugin may register four node resources to the Kubernetes cluster:
| gpu.intel.com/xe | GPU instance running new `xe` KMD |
| gpu.intel.com/xe_monitoring | Monitoring resource for the new `xe` KMD devices |

While GPU plugin basic operations support nodes having both (`i915` and `xe`) KMDs on the same node, its resource management (=GAS) does not, for that node needs to have only one of the KMDs present.

For workloads on different KMDs, see [KMD and UMD](#kmd-and-umd).

## Modes and Configuration Options

| Flag | Argument | Default | Meaning |
|:---- |:-------- |:------- |:------- |
| -enable-monitoring | - | disabled | Enable '*_monitoring' resource that provides access to all Intel GPU devices on the node, [see use](./monitoring.md) |
| -resource-manager | - | disabled | Deprecated. Enable fractional resource management, [see use](./fractional.md) |
| -health-management | - | disabled | Enable health management by requesting data from oneAPI/Level-Zero interface. Requires [GPU Level-Zero](../gpu_levelzero/) sidecar. See [health management](#health-management) |
| -wsl | - | disabled | Adapt plugin to run in the WSL environment. Requires [GPU Level-Zero](../gpu_levelzero/) sidecar. |
| -shared-dev-num | int | 1 | Number of containers that can share the same GPU device |
| -allocation-policy | string | none | 3 possible values: balanced, packed, none. For shared-dev-num > 1: _balanced_ mode spreads workloads among GPU devices, _packed_ mode fills one GPU fully before moving to next, and _none_ selects first available device from kubelet. Default is _none_. Allocation policy does not have an effect when resource manager is enabled. |
| -allocation-policy | string | none | 3 possible values: balanced, packed, none. For shared-dev-num > 1: _balanced_ mode spreads workloads among GPU devices, _packed_ mode fills one GPU fully before moving to next, and _none_ selects first available device from kubelet. Default is _none_. |

The plugin also accepts a number of other arguments (common to all plugins) related to logging.
Please use the -h option to see the complete list of logging related options.
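To make the flag usage concrete, below is a hedged sketch of how these options are typically passed to the plugin container in a DaemonSet; the image and surrounding fields are assumptions, only the flag names come from the table above.

```yaml
# Hypothetical GPU plugin container excerpt; only the flag names are taken
# from the table above, the image and structure are illustrative.
containers:
  - name: intel-gpu-plugin
    image: intel/intel-gpu-plugin:devel      # assumed image/tag
    args:
      - "-shared-dev-num=4"                  # let 4 containers share one GPU
      - "-allocation-policy=balanced"        # spread workloads across GPUs
      - "-enable-monitoring"                 # expose the *_monitoring resource
```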
@@ -75,9 +72,6 @@ Intel GPU-plugin supports a few different operation modes. Depending on the work
|:---- |:-------- |:------- |:------- |
| shared-dev-num == 1 | No, 1 container per GPU | Workloads using all GPU capacity, e.g. AI training | Yes |
| shared-dev-num > 1 | Yes, >1 containers per GPU | (Batch) workloads using only part of GPU resources, e.g. inference, media transcode/analytics, or CPU bound GPU workloads | No |
| shared-dev-num > 1 && resource-management | Depends on resource requests | Any. For requirements and usage, see [fractional resource management](./fractional.md) | Yes. 1000 millicores = exclusive GPU usage. See note below. |

> **Note**: Exclusive GPU usage with >=1000 millicores requires that also *all other GPU containers* specify (non-zero) millicores resource usage.
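As a concrete illustration of the sharing modes in the table above, a workload simply requests the plugin's resource and the kubelet picks a device according to the policy; a minimal hypothetical Pod follows (only the resource name comes from the resource table, the rest is assumed):

```yaml
# Hypothetical Pod requesting one (possibly shared) GPU. With
# -shared-dev-num=4, up to four such containers may be placed on one device.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo                     # assumed name
spec:
  containers:
    - name: app
      image: intel/opencl-icd:latest # assumed image
      resources:
        limits:
          gpu.intel.com/i915: 1      # resource name from the table above
```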

## Installing driver and firmware for Intel GPUs

@@ -122,10 +116,6 @@ $ kubectl apply -k 'https://github.com/intel/intel-device-plugins-for-kubernetes

GPU plugin can be installed with the Intel Device Plugin Operator. It allows configuring GPU plugin's parameters without kustomizing the deployment files. The general installation is described in the [install documentation](../operator/README.md#installation). For configuring the GPU Custom Resource (CR), see the [configuration options](#modes-and-configuration-options) and [operation modes](#operation-modes-for-different-workload-types).
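For reference, with the operator the same options are set declaratively through a GpuDevicePlugin custom resource rather than raw DaemonSet args. A hedged sketch follows; the group/version and field names here are recalled assumptions, so treat the operator README as the authoritative schema.

```yaml
# Hedged sketch of a GpuDevicePlugin CR; apiVersion and field names are
# assumptions from memory -- verify against the operator's CRD.
apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
metadata:
  name: gpudeviceplugin-sample
spec:
  sharedDevNum: 4                                  # maps to -shared-dev-num
  logLevel: 4                                      # plugin log verbosity
  nodeSelector:
    intel.feature.node.kubernetes.io/gpu: "true"   # assumed NFD label
```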

### Install alongside with GPU Aware Scheduling (deprecated)

GPU plugin can be installed alongside with GPU Aware Scheduling (GAS). It allows scheduling Pods which e.g. request only partial use of a GPU. The installation is described in [fractional resources](./fractional.md) page.

### Verify Plugin Installation

You can verify that the plugin has been installed on the expected nodes by searching for the relevant
@@ -212,9 +202,9 @@ Furthermore, the deployments `securityContext` must be configured with appropriate

More info: https://kubernetes.io/blog/2021/11/09/non-root-containers-and-devices/
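As a rough illustration of what the linked post describes, the sketch below shows a pod-level securityContext granting an unprivileged user access to a device node via a supplemental group; both IDs are placeholders, since the render/video group GID differs between distributions.

```yaml
# Hypothetical pod-level securityContext for non-root GPU device access.
# UID and GID are placeholders; check /etc/group on the node for the real
# render/video group ID.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000          # assumed unprivileged UID
    supplementalGroups:
      - 109                  # example render group GID; varies by distro
```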

### Labels created by GPU plugin
### Labels created for Intel GPUs via NFD

If installed with NFD and started with resource-management, plugin will export a set of labels for the node. For detailed info, see [labeling documentation](./labels.md).
When NFD's NodeFeatureRules for Intel GPUs are installed, nodes are labeled with a variety of GPU-specific labels. For detailed info, see [labeling documentation](./labels.md).

### SR-IOV use with the plugin

37 changes: 1 addition & 36 deletions cmd/gpu_plugin/device_props.go
@@ -15,35 +15,22 @@
package main

import (
"slices"

"github.com/intel/intel-device-plugins-for-kubernetes/cmd/internal/labeler"
"github.com/intel/intel-device-plugins-for-kubernetes/cmd/internal/pluginutils"
"k8s.io/klog/v2"
)

type DeviceProperties struct {
currentDriver string
drmDrivers map[string]bool
tileCounts []uint64
isPfWithVfs bool
}

type invalidTileCountErr struct {
error
}

func newDeviceProperties() *DeviceProperties {
return &DeviceProperties{
drmDrivers: make(map[string]bool),
}
return &DeviceProperties{}
}

func (d *DeviceProperties) fetch(cardPath string) {
d.isPfWithVfs = pluginutils.IsSriovPFwithVFs(cardPath)

d.tileCounts = append(d.tileCounts, labeler.GetTileCount(cardPath))

driverName, err := pluginutils.ReadDeviceDriver(cardPath)
if err != nil {
klog.Warningf("card (%s) doesn't have driver, using default: %s", cardPath, deviceTypeDefault)
@@ -52,11 +39,6 @@ func (d *DeviceProperties) fetch(cardPath string) {
}

d.currentDriver = driverName
d.drmDrivers[d.currentDriver] = true
}

func (d *DeviceProperties) drmDriverCount() int {
return len(d.drmDrivers)
}

func (d *DeviceProperties) driver() string {
@@ -66,20 +48,3 @@ func (d *DeviceProperties) driver() string {
func (d *DeviceProperties) monitorResource() string {
return d.currentDriver + monitorSuffix
}

func (d *DeviceProperties) maxTileCount() (uint64, error) {
if len(d.tileCounts) == 0 {
return 0, invalidTileCountErr{}
}

minCount := slices.Min(d.tileCounts)
maxCount := slices.Max(d.tileCounts)

if minCount != maxCount {
klog.Warningf("Node's GPUs are heterogenous (min: %d, max: %d tiles)", minCount, maxCount)

return 0, invalidTileCountErr{}
}

return maxCount, nil
}